Summary
-
Text Representation (Embedding) : When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.
-
Methods
-
Sparse Representation : One-hot encoding, Document-Term Matrix, etc.
-
Dense Representation : Word2Vec, GloVe, FastText, etc.
-
Pretrained Word Embeddings : ELMo, GPT, BERT
-
1. Sparse Representations
-
Introduction : A sparse representation embeds a word as a vector with relatively few nonzero elements (most elements of the vector are zero).
-
Models
-
Model 1) One-hot Encoding : a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary.
-
Model 2) Bag of Words (BoW) : In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
-
Give each word a unique integer index first.
-
Create a vector recording the number of occurrences of each word at its index (see the sketch below).
-
-
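These two steps can be sketched in a few lines of plain Python (the function and variable names below are illustrative, not from any library):
# Minimal Bag-of-Words sketch: (1) index the vocabulary, (2) count occurrences per index.
def bag_of_words(tokens):
    word_to_index = {}                        # step 1: unique integer index per word
    for token in tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    vector = [0] * len(word_to_index)         # step 2: occurrence counts, one slot per index
    for token in tokens:
        vector[word_to_index[token]] += 1
    return word_to_index, vector

vocab, bow = bag_of_words("John likes to watch movies Mary likes movies too".split())
print(vocab)  # {'John': 0, 'likes': 1, 'to': 2, 'watch': 3, 'movies': 4, 'Mary': 5, 'too': 6}
print(bow)    # [1, 2, 1, 1, 2, 1, 1]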
Model 3) Document-Term Matrix : BoW for a set of multiple sentences (documents)
-
Model 4) TF-IDF : gives a higher score to terms (tokens) that occur frequently in a given document but rarely in the other documents
-
$TF(d,t)$ : The number of occurrences of a specific term $t$ in a particular document $d$.
-
$DF(t)$ : The number of documents in which a specific term $t$ appears.
-
$IDF(t)$ : Inverse document frequency, $IDF(t) = \log(\frac{n}{1+DF(t)})$, where $n$ is the number of documents in the corpus. The final score is $TF\text{-}IDF(d,t) = TF(d,t) \times IDF(t)$.
-
-
-
Limitations
-
There is no notion of similarity between words => Word2Vec
-
The dimension of the embedded vectors equals the vocabulary size, so it becomes very high for large corpora.
-
corpus = [
"John likes to watch movies. Mary likes movies too.", # doc1
"Mary also likes to watch football games." # doc2
]
one_hot_John = [1,0,0,0,0,0,0,0,0,0] # 10-dim: {"John","likes","to","watch","movies","Mary","too","also","football","games"}
BoW_sentence1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
DTM = [[1,2,1,1,2,1,1,0,0,0],[0,1,1,1,0,1,0,1,1,1]]
# Create DTM with sklearn.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(corpus)
DTM = count_vectorizer.transform(corpus)
# note: sklearn sorts the vocabulary alphabetically: ['also','football','games','john','likes','mary','movies','to','too','watch']
print(DTM.toarray()) # [[0,0,0,1,2,1,2,1,1,1],[1,1,1,0,1,1,0,1,0,1]]
# Create TF-IDF with sklearn TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer().fit(corpus)
tfidf = tfidf_vectorizer.transform([corpus[0]])
print(tfidf.toarray()) # [[0.00 0.00 0.00 0.32 0.46 0.23 0.65 0.23 0.32 0.23]] (same alphabetical vocabulary order as above)
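For comparison with the sklearn output above, here is a minimal hand-rolled TF-IDF sketch that follows the $IDF(t) = \log(\frac{n}{1+DF(t)})$ definition given earlier. It reuses the `corpus` variable from the block above; with only two documents the scores come out non-positive, which is one reason libraries like sklearn add smoothing and L2 normalization.
# Manual TF-IDF following the definitions above (illustrative; sklearn's TfidfVectorizer
# additionally smooths the IDF and L2-normalizes each row, so its numbers differ).
import math

docs = [doc.replace(".", "").split() for doc in corpus]   # reuse `corpus` defined above
vocab = sorted({word for doc in docs for word in doc})
n = len(docs)                                             # number of documents

def tf(d, t):                                             # TF(d, t): raw count of t in d
    return d.count(t)

def idf(t):                                               # IDF(t) = log(n / (1 + DF(t)))
    df = sum(t in d for d in docs)
    return math.log(n / (1 + df))

print(idf("movies"))   # log(2 / (1 + 1)) = 0.0   (appears in one document)
print(idf("likes"))    # log(2 / (1 + 2)) ≈ -0.41 (appears in both documents)
tfidf_doc1 = [tf(docs[0], t) * idf(t) for t in vocab]     # TF-IDF(d, t) = TF(d, t) * IDF(t)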
2. Word2Vec : Efficient Estimation of Word Representations in Vector Space (2013)
-
Introduction :
-
Current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary
-
The most successful concept is to use distributed representations of words
-
With the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity
$$ vector(\text{King}) - vector(\text{Man}) + vector(\text{Woman}) \approx vector(\text{Queen}) $$
-
-
Methods :
-
Continuous Bag of Words Model : predicting the current word based on the context
-
(1) Put context words into a simple neural net in the form of a one-hot vector.
-
(2) Train the model to predict what the central word is.
-
-
Continuous Skip-Gram Model : maximize classification of a word based on another word in the same sentence
-
(1) Use each current word as an input to a classifier
-
(2) Predict words within a certain range before and after the current word.
-
-
Figure: CBOW (top), Skip-Gram (bottom)
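A minimal training sketch with gensim; the parameter names assume gensim >= 4, and `sg=0` / `sg=1` select CBOW / Skip-Gram respectively (the toy sentences are illustrative):
# Word2Vec sketch with gensim: sg=0 -> CBOW, sg=1 -> Skip-Gram.
from gensim.models import Word2Vec

sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["mary", "also", "likes", "to", "watch", "football", "games"],
]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["movies"].shape)          # (50,) dense vector
print(skipgram.wv.most_similar("movies"))   # nearest words by cosine similarity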
3. ELMo : Deep contextualized word representations (2018)
-
Introduction :
-
Existing embedding models consider the surrounding context only during training.
-
That is, after training is finished, each word is matched 1:1 with a fixed embedding vector that does not change.
-
Therefore, although “bank” in “river bank” and “bank account” has completely different meanings, both are embedded to the same vector (polysemy).
-
-
Method: ELMo (Embeddings from Language Model)
-
Contextualized Word Embedding : ELMo reflects context not only during training but also when producing embeddings at prediction time.
-
Bidirectional LSTM : the final context-aware embedding vector is created by combining a forward LSTM and a backward LSTM
-
Forward LSTM : predicts the (n+1)-th word by looking at the first n words
-
Backward LSTM : predicts the (n-1)-th word by looking at the words that come after it.
-
-
-
Conclusion : biLM layers efficiently encode different types of syntactic and semantic information about words in-context
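Below is a tiny PyTorch sketch of the idea (not the actual ELMo architecture or weights): token embeddings go through a bidirectional LSTM, and the layers are mixed with learned softmax weights $s_j$ and a scale $\gamma$, echoing the paper's $ELMo_k = \gamma \sum_j s_j h_{k,j}$. Class and variable names are illustrative.
# Illustrative contextual-embedding sketch in PyTorch (not the real ELMo implementation).
import torch
import torch.nn as nn

class TinyELMo(nn.Module):
    def __init__(self, vocab_size, dim=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # bidirectional=True gives the forward + backward LSTM pair described above
        self.bilstm = nn.LSTM(dim, dim // 2, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.scalar_weights = nn.Parameter(torch.zeros(2))  # s_j over {embedding, LSTM output}
        self.gamma = nn.Parameter(torch.ones(1))             # task-specific scale

    def forward(self, token_ids):
        e = self.embed(token_ids)                  # (batch, seq, dim) - context-free layer
        h, _ = self.bilstm(e)                      # (batch, seq, dim) - contextual layer
        s = torch.softmax(self.scalar_weights, dim=0)
        return self.gamma * (s[0] * e + s[1] * h)  # weighted mix of layers

model = TinyELMo(vocab_size=100)
ids = torch.randint(0, 100, (1, 7))                # a batch with one 7-token "sentence"
print(model(ids).shape)                            # torch.Size([1, 7, 64])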
4. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)
-
Introduction : introduces a new language representation model called BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
-
Method:
-
Input Representation = Token Embeddings + Segment Embeddings + Position Embeddings
-
Token Embeddings : the [CLS] token aggregates the sequence representation for classification tasks, and the [SEP] token simply differentiates the sentences
-
Segment Embeddings : indicate whether each token belongs to sentence A or sentence B (details in Figure 2; see the tokenizer sketch after this list)
-
Position Embeddings : encode the position of each token in the sequence; see the Transformer paper (Attention Is All You Need) for details
-
-
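A quick way to inspect this input representation is the Hugging Face transformers tokenizer: [CLS] and [SEP] are inserted automatically, and token_type_ids carry the sentence A/B information that the segment embeddings consume. The expected outputs in the comments may vary slightly across tokenizer versions.
# Inspect BERT's input representation with the Hugging Face tokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("John likes movies.", "Mary likes football.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'john', 'likes', 'movies', '.', '[SEP]', 'mary', 'likes', 'football', '.', '[SEP]']
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]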
Architecture : BERT’s model architecture is a multi-layer bidirectional Transformer encoder (Only encoder part)
-
Pretrain : pre-train BERT using two unsupervised tasks
-
Task #1 Masked LM : simply mask some percentage (15%) of the input tokens at random and then predict those masked tokens (a minimal example is sketched after this list)
-
Task #2 Next Sentence Prediction : train a model that understands sentence relationships, by a binarized next sentence prediction task.
-
-
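The masked-LM objective can be tried directly with the transformers fill-mask pipeline; the completions in the comment are indicative, not exact.
# Masked LM demo: BERT predicts the token hidden behind [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Mary likes to watch football [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# likely completions include "games", "matches", ...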
Finetune : For each downstream task, simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end (see the sketch below)
-
-
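A minimal fine-tuning sketch with a sequence-classification head on top of BERT; the toy data, label meanings, and hyperparameters are illustrative.
# Fine-tuning sketch: task-specific inputs/labels are plugged into BERT end-to-end.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                      # toy binary sentiment labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)            # loss comes from the head on [CLS]
outputs.loss.backward()                            # gradients flow through all BERT parameters
optimizer.step()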
Conclusion : the main contribution is generalizing these findings (unsupervised pre-training improves downstream NLP tasks) to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
5. Multilingual SentenceBert : Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
-
Introduction :
-
We present an easy and efficient method to extend existing sentence embedding models to new languages
-
based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence
-
-
Method :
-
Multilingual knowledge distillation : use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model.
-
Training : With $s_j$ a sentence in one of the source languages and $t_j$ its translation in one of the target languages, we minimize the MSE over a mini-batch $\mathcal{B}$ (a minimal sketch of this loss follows below)
$$ \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \left[ \left(M(s_j)-M'(s_j)\right)^2 + \left(M(s_j)-M'(t_j)\right)^2 \right] $$
-
Architecture : mainly use an English SBERT model as teacher model $M$ and use XLM-RoBERTa (XLM-R) as student model $M'$
-
-
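The distillation objective is easy to sketch in PyTorch; the tensors below stand in for the teacher $M$ (SBERT) and student $M'$ (XLM-R) embeddings rather than calling any actual sentence-transformers API.
# Multilingual knowledge-distillation loss sketch (PyTorch).
# teacher_src / student_src / student_tgt stand in for M(s_j), M'(s_j), M'(t_j).
import torch

def distillation_loss(teacher_src, student_src, student_tgt):
    # MSE between teacher source embeddings and student embeddings of source and translation
    return ((teacher_src - student_src) ** 2).mean() + \
           ((teacher_src - student_tgt) ** 2).mean()

# toy mini-batch of 4 sentence pairs with 768-dim embeddings
teacher_src = torch.randn(4, 768)                     # frozen teacher embeddings
student_src = torch.randn(4, 768, requires_grad=True)
student_tgt = torch.randn(4, 768, requires_grad=True)

loss = distillation_loss(teacher_src, student_src, student_tgt)
loss.backward()                                       # only the student receives gradients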
Conclusion :
-
presented a method to make monolingual sentence embeddings multilingual, with aligned vector spaces between the languages
-
We demonstrate the effectiveness of our approach for 50+ languages from various language families.
-
Code to extend sentence embeddings models to more than 400 languages is publicly available.
-