NLP #2 | Text Representation


Summary

  • Text Representation (Embedding) : When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.

  • Methods

    • Sparse Representation : One-hot encoding, Document Term Matrix, etc.

    • Dense Representation : Word2Vec, GloVe, FastText, etc.

    • Pretrained Word Embeddings : ELMo, GPT, BERT


1. Sparse Representations

  • Introduction : A sparse representation embeds a word as a vector that has a relatively small number of nonzero elements (most elements of the vector are zero).

  • Models

    • Model 1) One-hot Encoding : a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary.

    • Model 2) Bag of Words (BoW) : In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

      1. Give each word a unique integer index first.

      2. Create a vector that records, at each index, the number of occurrences of the corresponding word.

    • Model 3) Document-Term Matrix : BoW applied to a set of multiple documents; each row is one document's BoW vector over a shared vocabulary.

    • Model 4) TF-IDF : gives a higher score to terms (tokens) that occur frequently in a given document but rarely in the other documents (a by-hand sketch follows the sklearn example below)

      • $TF(d,t)$ : The number of occurrences of a specific term $t$ in a particular document $d$.

      • $DF(t)$ : The number of documents in which a specific term $t$ appears.

      • $IDF(t)$ : Inverse document frequency, $IDF(t) = \log\left(\frac{n}{1+DF(t)}\right)$, where $n$ is the number of documents in the corpus.

  • Limitations

    • There is no notion of similarity between words => Word2Vec

    • The dimension of the embedded vectors grows with the vocabulary size, so it becomes extremely high.

corpus = [
    "John likes to watch movies. Mary likes movies too.", # doc1
    "Mary also likes to watch football games." # doc2
    ]

one_hot_John = [1,0,0,0,0,0,0,0,0,0] # vocabulary: {"John","likes","to","watch","movies","Mary","too","also","football","games"}
BoW_sentence1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
DTM = [[1,2,1,1,2,1,1,0,0,0],[0,1,1,1,0,1,0,1,1,1]]

# Create DTM with sklearn.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(corpus)
DTM = count_vectorizer.transform([corpus[0]])
print(DTM.toarray()) # [[0 0 0 1 2 1 2 1 1 1]] -- CountVectorizer sorts features alphabetically: also, football, games, john, likes, mary, movies, to, too, watch

# Create TF-IDF with sklearn TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer().fit(corpus)
tfidf = tfidf_vectorizer.transform([corpus[0]])
print(tfidf.toarray()) # approx. [[0. 0. 0. 0.32 0.46 0.23 0.65 0.23 0.32 0.23]] (same alphabetical feature order; sklearn uses a smoothed IDF plus L2 normalization)
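
To connect the sklearn output with the $TF$/$DF$/$IDF$ definitions above, here is a minimal by-hand sketch (an illustration added here, not part of the original notes). It reuses the corpus list from the code above and applies the raw formula $IDF(t) = \log(n/(1+DF(t)))$; sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ.

# By-hand TF-IDF following the definitions above (illustrative sketch only)
import math

docs = [doc.lower().replace(".", "").split() for doc in corpus]  # simple whitespace tokenization
vocab = sorted({w for doc in docs for w in doc})
n = len(docs)

def tf(t, doc):      # TF(d,t): occurrences of term t in document d
    return doc.count(t)

def df(t):           # DF(t): number of documents containing term t
    return sum(1 for doc in docs if t in doc)

def tf_idf(t, doc):  # TF-IDF = TF(d,t) * log(n / (1 + DF(t)))
    return tf(t, doc) * math.log(n / (1 + df(t)))

# Note: with only 2 documents, log(n / (1 + DF(t))) is <= 0 for every term,
# which is one reason practical implementations such as sklearn smooth the IDF.
print({t: round(tf_idf(t, docs[0]), 3) for t in vocab})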


2. Word2Vec : Efficient Estimation of Word Representations in Vector Space (2013)

  • Introduction :

    • Current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary

    • The most successful concept is to use distributed representations of words

    • With the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity

      $$ vector(\text{King}) - vector(\text{Man}) + vector(\text{Woman}) \approx vector(\text{Queen}) $$

  • Methods (a short gensim sketch follows the figure below) :

    • Continuous Bag of Words Model : predicting the current word based on the context

      • (1) Put context words into a simple neural net in the form of a one-hot vector.

      • (2) Train the model to predict what the central word is.

    • Continuous Skip-Gram Model : maximize classification of a word based on another word in the same sentence

      • (1) Use each current word as an input to a classifier

      • (2) Predict words within a certain range before and after the current word.

[Figure: CBOW (top) and Skip-Gram (bottom) model architectures]
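
As a concrete illustration (an addition to these notes, not something from the paper), here is a minimal sketch of training a skip-gram model with gensim. The gensim 4.x API is assumed, and the tiny toy corpus only demonstrates the API; a real analogy such as King - Man + Woman ≈ Queen needs a large training corpus.

# Minimal Word2Vec sketch with gensim (assumes gensim >= 4.x; illustrative only)
from gensim.models import Word2Vec

sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "also", "likes", "to", "watch", "football", "games"],
]

# sg=1 selects the continuous skip-gram objective; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["movies"][:5])                   # first dimensions of a dense word vector
print(model.wv.most_similar("movies", topn=3))  # nearest neighbours in the embedding space

# On a large corpus, the analogy query from the paper would look like:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])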



3. ELMo : Deep contextualized word representations (2018)

  • Introduction :

    • Existing embedding models consider the surrounding context only during the training step.

    • That is, once training is finished, each word is mapped 1:1 to a fixed embedding vector that never changes.

    • Therefore, although "bank" in "river bank" and "bank account" has completely different meanings, both occurrences are embedded to the same vector (the polysemy problem).

  • Method: ELMo (Embeddings from Language Models)

    1. Contextualized Word Embedding : ELMo reflects context not only during the training process but also when producing embeddings at prediction time.

    2. Bidirectional-LSTM : the final context-aware embedding vector is built by combining a forward LSTM and a backward LSTM (a toy sketch follows this section's conclusion)

      • Forward LSTM : predicts the (n+1)-th word from the preceding n words.

      • Backward LSTM : predicts the (n-1)-th word from the n words that follow it.

  • Conclusion : the biLM layers efficiently encode different types of syntactic and semantic information about words in context
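
To make the bidirectional-LSTM idea concrete, here is a toy PyTorch sketch (an assumption added for illustration; it is not ELMo's actual architecture, which feeds character-CNN token representations into a multi-layer biLM and takes a learned weighted sum of all layer outputs).

# Toy contextual embeddings from a bidirectional LSTM (illustrative only, not real ELMo)
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128
token_embedding = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 6))  # one sentence of 6 random token ids
static_vectors = token_embedding(token_ids)       # same vector for a word wherever it appears
contextual_vectors, _ = bilstm(static_vectors)    # forward and backward states concatenated

# Each position now gets a vector that depends on the whole sentence, so the same
# word id receives different vectors in different contexts.
print(contextual_vectors.shape)  # torch.Size([1, 6, 256]) = (batch, seq_len, 2 * hidden_dim)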



4. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)

  • Introduction : introduces a new language representation model called BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

  • Method:

    1. Input Representation = Token Embeddings + Segment Embeddings + Position Embeddings (a short sketch at the end of this section illustrates these pieces)

      • Token Embeddings : the [CLS] token provides an aggregate sequence representation for classification tasks, and the [SEP] token simply separates the two sentences

      • Segment Embeddings : indicate which sentence (A or B) each token belongs to (see Figure 2 of the paper)

      • Position Embeddings : encode each token's position in the sequence; see the Transformer paper (Attention Is All You Need) for details

    2. Architecture : BERT's model architecture is a multi-layer bidirectional Transformer encoder (only the encoder part of the Transformer is used)

    3. Pretrain : pre-train BERT using two unsupervised tasks

      • Task #1 Masked LM : mask some percentage (15%) of the input tokens at random and then predict those masked tokens

      • Task #2 Next Sentence Prediction : train the model to understand sentence relationships via a binarized next-sentence prediction task (is sentence B the actual sentence that follows A?)

    4. Finetune : For each sub-task, simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end

  • Conclusion : BERT generalizes these findings (that unsupervised pre-training can improve downstream NLP tasks) to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
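
As an illustration of the input representation ([CLS]/[SEP] tokens, segment ids) and the encoder-only architecture, here is a short sketch using the Hugging Face transformers library (an assumption added to these notes; the checkpoint name bert-base-uncased is simply a common choice).

# Sketch of BERT inputs and encoder outputs with Hugging Face transformers (illustrative)
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Two sentences: the tokenizer inserts [CLS] ... [SEP] ... [SEP] and segment ids.
inputs = tokenizer("John likes movies.", "Mary likes football.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))  # shows [CLS] and [SEP] placement
print(inputs["token_type_ids"][0])  # segment embedding ids: 0 for sentence A, 1 for sentence B

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)        # (1, sequence_length, 768): one contextual vector per token
cls_vector = outputs.last_hidden_state[:, 0]  # [CLS] position, used as the aggregate representation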



5. Multilingual Sentence-BERT : Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (2020)

  • Introduction :

    • We present an easy and efficient method to extend existing sentence embedding models to new languages

    • based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence

  • Method :

    • Multilingual knowledge distillation : use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model.

    • Training : With $s_j$ a sentence in one of the source languages and $t_j$ its translation in one of the target languages, for a mini-batch $\mathcal{B}$ of such pairs we minimize the MSE (a minimal code sketch of this loss closes this section)

      $$ \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \left[ (M(s_j)-M'(s_j))^2 + (M(s_j)-M'(t_j))^2 \right] $$

    • Architecture : mainly use an English SBERT model as teacher model $M$ and use XLM-RoBERTa (XLM-R) as student model $M'$

  • Conclusion :

    • presented a method to make monolingual sentence embeddings multilingual, with aligned vector spaces between the languages

    • We demonstrate the effectiveness of our approach for 50+ languages from various language families.

    • Code to extend sentence embedding models to more than 400 languages is publicly available.
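
A minimal sketch of the distillation loss above (an illustration added to these notes; the teacher and student tensors stand in for SBERT ($M$) and XLM-R ($M'$) sentence embeddings, and a real setup would backpropagate the gradients into the student encoder's parameters).

# Minimal multilingual knowledge-distillation loss (illustrative only)
import torch
import torch.nn.functional as F

def distillation_loss(teacher_src, student_src, student_tgt):
    # (1/|B|) * sum over the batch of (M(s_j)-M'(s_j))^2 + (M(s_j)-M'(t_j))^2
    batch_size = teacher_src.size(0)
    return (F.mse_loss(student_src, teacher_src, reduction="sum")
            + F.mse_loss(student_tgt, teacher_src, reduction="sum")) / batch_size

# Stand-in embeddings for a mini-batch of 8 translation pairs with embedding dim 768.
teacher_src = torch.randn(8, 768)                      # M(s_j), teacher is kept frozen
student_src = torch.randn(8, 768, requires_grad=True)  # M'(s_j)
student_tgt = torch.randn(8, 768, requires_grad=True)  # M'(t_j)

loss = distillation_loss(teacher_src, student_src, student_tgt)
loss.backward()  # gradients flow only into the student-side tensors
print(loss.item())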