NLP #5 | Text Segmentation


Summary

  • Text Segmentation : the process of dividing written text into meaningful units, such as words, sentences, or topics.

    • To improve the results of Information Retrieval systems and help users find relevant passages faster
    • Keywords : Text Segmentation, Document Segmentation, Discourse Segmentation
  • Methods : surveyed in the paper summaries below



1. Conventional Ideas : Application of Topic Segmentation in Audiovisual Information Retrieval (2012)

  • Introduction : Overview of Conventional approaches in Topic Segmentation
  • Methods : presents several methods for topic segmentation, based on textual, audio, and visual information.
    1. Lexical Cohesion based : uses only textual information, based on the assumption that the segment we are looking for is lexically coherent (it uses a consistent vocabulary); see the sketch at the end of this section
    2. Feature based (hand-crafted)
      • Textual Features : Lexical Features, Contextual Features, Vocabulary, Lexical Chains
      • Audio Features : Prosodic Features and Conversational Features
      • Video Features : Color Similarity, Motion Similarity, and Bag of Visual Words
  • Conclusion : the proposed system should use all available modalities; therefore, a machine learning method based on these features appears to be the best solution.
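
As a concrete illustration of the lexical-cohesion family referenced above, here is a minimal TextTiling-style sketch in Python: adjacent windows of sentences are compared by bag-of-words cosine similarity, and a boundary is hypothesized at each sufficiently deep similarity valley. The function names and the window/depth parameters are illustrative assumptions, not details from the paper.

```python
# Minimal lexical-cohesion segmentation sketch (TextTiling-style); illustrative only.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def segment(sentences: list[str], window: int = 3, depth: float = 0.1) -> list[int]:
    """Return sentence indices before which a topic boundary is hypothesized."""
    bows = [Counter(s.lower().split()) for s in sentences]
    gaps = []
    for i in range(window, len(bows) - window + 1):
        left = sum((bows[j] for j in range(i - window, i)), Counter())
        right = sum((bows[j] for j in range(i, i + window)), Counter())
        gaps.append((i, cosine(left, right)))
    # place a boundary at every local similarity minimum that is deep enough
    return [i for k in range(1, len(gaps) - 1)
            for i, s in [gaps[k]]
            if s < gaps[k - 1][1] - depth and s < gaps[k + 1][1] - depth]
```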


2. Topic Segmentation in ASR Transcripts Using Bidirectional RNNs for Change Detection (2017)

  • Introduction : proposes a novel approach for topic segmentation in speech recognition transcripts by measuring lexical cohesion using bidirectional Recurrent Neural Networks (RNNs).

  • Methods : Bidirectional Recurrent Neural Networks

    • Two or more news articles are randomly chosen and concatenated
    • The training objective is then to mark the boundary between the concatenated articles as a topic change point (see the sketch at the end of this section)
  • Conclusion : These models were trained discriminatively by concatenating news articles from the internet. Evaluation on ASR transcripts of French TV news programs showed that the RNN models can perform better than the C99-LSA and TopicTiling baseline methods.
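
A minimal sketch of the synthetic training-data construction described above, under the assumption that the topic-change label is attached to the first sentence of each newly started article; the paper's exact labeling convention and pipeline may differ.

```python
# Build a synthetic training example by concatenating random articles; illustrative only.
import random

def make_example(articles: list[list[str]], k: int = 2):
    """Concatenate k random articles (each a list of sentences); label the
    first sentence of every article after the first as a topic change."""
    chosen = random.sample(articles, k)
    sentences, labels = [], []
    for a_idx, article in enumerate(chosen):
        for s_idx, sent in enumerate(article):
            sentences.append(sent)
            labels.append(1 if a_idx > 0 and s_idx == 0 else 0)
    return sentences, labels
```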



3. Text Segmentation as a Supervised Learning Task (2018, Wiki-727)

  • Introduction :

    • Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity of labeled data.

    • In this work, we formulate text segmentation as a supervised learning problem, and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia.

  • Methods

    • Wiki-727k Dataset : 727k English Wikipedia documents

      • Removed all photos, tables, Wikipedia template elements, and other non-text elements.

      • Removed single-sentence segments, documents with fewer than three segments, and documents where most segments were filtered.

      • Divided each segment into sentences using the Punkt tokenizer of the NLTK library (Bird et al., 2009). This is necessary for using the dataset as a benchmark: without a well-defined sentence segmentation, it is impossible to evaluate different models consistently.

    • Neural Model for Text Segmentation : two sub-networks based on the LSTM architecture. One generates sentence representations, and the other predicts segment boundaries over the sentence sequence (see the sketch at the end of this section)

  • Conclusion : Our text segmentation model outperforms prior methods on Wikipedia documents, and performs competitively on prior benchmarks.
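
A minimal PyTorch sketch of the two-level design described above: a sentence-level BiLSTM with max-pooling produces sentence vectors, a document-level BiLSTM contextualizes them, and a linear layer scores each sentence as a boundary. Dimensions and class names are illustrative, not the paper's exact configuration.

```python
# Two-level LSTM segmenter sketch; dimensions and names are illustrative.
import torch
import torch.nn as nn

class TwoLevelSegmenter(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.sent_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.doc_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.clf = nn.Linear(2 * hidden, 1)

    def forward(self, doc: list[torch.Tensor]) -> torch.Tensor:
        # doc: one word-id tensor per sentence (variable lengths allowed)
        sent_vecs = []
        for sent in doc:
            out, _ = self.sent_lstm(self.emb(sent).unsqueeze(0))
            sent_vecs.append(out.max(dim=1).values)       # max-pool over words
        seq = torch.cat(sent_vecs, dim=0).unsqueeze(0)    # (1, n_sents, 2*hidden)
        out, _ = self.doc_lstm(seq)
        return self.clf(out).squeeze(-1)                  # boundary logit per sentence
```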



4. Attention based Neural Text Segmentation (2018)

  • Introduction : presents a novel supervised neural approach for text segmentation, which the authors describe as the first of its kind.

  • Methods : frames segmentation as a binary classification problem

    • Problem Definition : classify whether a given sentence marks the beginning of a new text segment
    • Model : an attention-based bidirectional LSTM (see the sketch at the end of this section)
  • Code : https://github.com/pinkeshbadjatiya/neuralTextSegmentation
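
A minimal sketch of the kind of attention mechanism such a model builds on: additive attention pooling over BiLSTM hidden states to form a fixed-size sentence vector. This is a generic illustration, not the implementation in the linked repository.

```python
# Additive attention pooling over BiLSTM states; generic illustration only.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, dim) BiLSTM outputs
        weights = torch.softmax(self.score(states), dim=1)  # (batch, seq_len, 1)
        return (weights * states).sum(dim=1)                # (batch, dim) pooled vector
```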



5. Text Segmentation by cross segment attention (2020, Google Research)

  • Introduction : Text segmentation is a traditional NLP task that breaks up text into constituents.

    • Document Segmentation : has been shown to improve information retrieval by indexing sub-document units instead of full documents (Llopis et al., 2002; Shtekh et al., 2018)

    • Discourse Segmentation : breaks up pieces of text into sub-sentence elements (Elementary Discourse Units, EDUs)

  • Methods : propose three transformer-based architectures

    • Preprocessing : simply feed the raw input into a word-piece (sub-word) tokenizer (Wu et al., 2016)

    • Architecture

      • Cross-segment BERT : uses only the local context around each candidate break; the contexts on either side are packed into a single input separated by a [SEP] token and classified (see the sketch at the end of this section)

      • BERT+BiLSTM : encode each sentence using a BERT model, and then feed the sentence representations into a Bi-LSTM

      • Hierarchical BERT : encode each sentence using BERT and then feed the output sentence representations into another transformer-based model.

    • Dataset : Wiki-727k contains 727 thousand articles from a snapshot of the English Wikipedia

      • re-use the original splits provided by the authors
  • Conclusion : In particular, we found that a cross-segment BERT model is extremely competitive. This is surprising as it suggests that local context is sufficient in many cases.

  • References

    • word-piece tokenizer implementation (Devlin et al., 2018), which has a vocabulary size of 30,522 word-pieces.
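
A minimal sketch of the cross-segment setup using the Hugging Face transformers API: the local context on each side of a candidate break is packed into a single "[CLS] left [SEP] right" input, and a classifier over the [CLS] vector decides whether the break is a boundary. The model name, context handling, and untrained head are illustrative assumptions; the paper's exact configuration differs.

```python
# Cross-segment boundary scoring sketch; model choice and head are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(encoder.config.hidden_size, 2)  # boundary / no boundary

def score_break(left_context: str, right_context: str) -> torch.Tensor:
    # tokenizer packs the pair as [CLS] left [SEP] right [SEP]
    inputs = tokenizer(left_context, right_context, return_tensors="pt", truncation=True)
    cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
    return head(cls)  # untrained logits; train with cross-entropy on labeled breaks
```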


6. Chapter Captor : Text Segmentation in Novels (2020)

  • Introduction : investigates the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts

  • Contributions:

    • (1) Project Gutenberg Chapter Segmentation Resource : creates a ground-truth dataset for chapter segmentation

    • (2) Local Methods for Chapter Segmentation :

      • unsupervised weighted-cut approach minimizing cross-boundary cross-references

      • supervised neural network building on the BERT language model (Devlin et al., 2019)

    • (3) Global Break Prediction using Optimization

      • augmenting the BERT-based local classifier with dynamic programming to select globally consistent chapter breaks (see the sketch below)
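
A minimal sketch of the dynamic-programming step, assuming a local boundary score per candidate position (e.g. from the BERT classifier) and a known number of breaks k; the paper's objective also accounts for segment lengths, which is omitted here.

```python
# DP that picks exactly k break positions maximizing the summed local scores.
def best_breaks(scores: list[float], k: int) -> list[int]:
    n = len(scores)
    NEG = float("-inf")
    # dp[j][i] : best total using j breaks among the first i candidate positions
    dp = [[0.0] * (n + 1)] + [[NEG] * (n + 1) for _ in range(k)]
    # last[j][i] : position of the j-th break in the optimum for dp[j][i]
    last = [[-1] * (n + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            skip = dp[j][i - 1]                      # position i-1 is not a break
            take = dp[j - 1][i - 1] + scores[i - 1]  # position i-1 is the j-th break
            if take >= skip:
                dp[j][i], last[j][i] = take, i - 1
            else:
                dp[j][i], last[j][i] = skip, last[j][i - 1]
    breaks, i = [], n
    for j in range(k, 0, -1):        # walk back through the chosen breaks
        i = last[j][i]
        breaks.append(i)
    return sorted(breaks)

# e.g. best_breaks([0.1, 0.9, 0.2, 0.8, 0.3], 2) -> [1, 3]
```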



7. Improving Context Modeling in Neural Topic Segmentation (2020)

  • Introduction :

    • Topic segmentation can be framed as a sequence labeling task where each sentence is either the end of a segment or not.
    • For topic segmentation, it is critical to encourage the model to focus more on the local context.
    • with a proper way of modeling the coherence between adjacent sentences, a topic segmenter can be further enhanced.
  • Methods : given a document represented as a sequence of sentences, the model predicts a binary label for each sentence to indicate whether that sentence is the end of a topically coherent segment.

    • add a coherence-related auxiliary task : makes the model learn more informative hidden states for all the sentences in a document
    • restricted self-attention : enables the model to attend to the local context and make better use of information from the closer neighbors of each sentence (see the sketch at the end of this section)
  • Discussion

    • Domain Transfer : trained on the WIKI-SECTION dataset, evaluated on WIKI-50, which consists of 50 samples randomly generated from the latest English Wikipedia dump
    • Multilingual Evaluation : trained and tested on two other Wikipedia datasets in German and Chinese : SECTION-DE, SECTION-ZH
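
A minimal sketch of restricted self-attention, implemented here as standard scaled dot-product attention with an additive mask that blocks attention beyond a fixed window; the window size and dimensions are illustrative.

```python
# Windowed (restricted) self-attention over sentence vectors; illustrative only.
import torch

def restricted_self_attention(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    # x: (n_sents, dim) sentence representations
    n, dim = x.shape
    scores = x @ x.T / dim ** 0.5                      # (n, n) attention logits
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))   # block distant sentences
    return torch.softmax(scores, dim=-1) @ x           # locally re-contextualized vectors
```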


8. Unsupervised Topic Segmentation of Meetings with BERT Embeddings (2021)

  • Introduction :
    • In the context of meeting recordings and their transcripts, topic segmentation can quickly provide users with a valuable high level understanding of past meetings.
    • Topic segmentation of spoken language is significantly more challenging than written text due to the added complexity that the underlying ASR (Automatic Speech Recognition) system introduces
  • Methods : detect topic changes based on a new similarity score using BERT embeddings.
    • Sentence representation model : used to measure semantic similarity between sentences (BERT)
    • A segmentation scheme : employs semantic similarity variations over time to detect topic changes (see the sketch at the end of this section)
  • Conclusion : leveraging the strong semantic representation power of BERT, the proposed model shows improved segmentation performance compared to non-neural baseline approaches
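
A minimal sketch of the overall scheme, assuming bert-base-uncased with mean pooling as the sentence representation model and a fixed similarity threshold as the change detector; the paper's actual scoring is more elaborate.

```python
# BERT-embedding similarity segmentation sketch; pooling and threshold are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        states = model(**inputs).last_hidden_state  # (1, tokens, hidden)
    return states.mean(dim=1).squeeze(0)            # mean-pool to one vector

def topic_changes(utterances: list[str], threshold: float = 0.5) -> list[int]:
    vecs = [embed(u) for u in utterances]
    sims = [torch.cosine_similarity(vecs[i], vecs[i + 1], dim=0).item()
            for i in range(len(vecs) - 1)]
    # hypothesize a topic change wherever neighbor similarity drops below threshold
    return [i + 1 for i, s in enumerate(sims) if s < threshold]
```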