Summary
-
Text Representation (Embedding) : When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.
-
Methods
-
Sparse Representation : One-hot encoding, Document-Term Matrix, etc.
-
Dense Representation : Word2Vec, GloVe, FastText, etc.
-
Pretrained Word Embeddings : ELMo, GPT, BERT
-
1. Sparse Representations
-
Introduction : A sparse representation embeds a word as a vector with relatively few nonzero elements (most elements of the vector are zero).
-
Models
-
Model 1) One-hot Encoding : a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary.
-
Model 2) Bag of Words (BoW) : In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
-
Give each word a unique integer index first.
-
Create a vector recording the number of occurrences of each word at its index (see the sketch below).
-
-
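These two steps can be sketched in a few lines of plain Python (the function and variable names below are illustrative, not from any library):
# Minimal Bag-of-Words sketch: (1) index the vocabulary, (2) count occurrences per index.
def bag_of_words(tokens):
    word_to_index = {}                        # step 1: unique integer index per word
    for token in tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    vector = [0] * len(word_to_index)         # step 2: occurrence counts, one slot per index
    for token in tokens:
        vector[word_to_index[token]] += 1
    return word_to_index, vector

vocab, bow = bag_of_words("John likes to watch movies Mary likes movies too".split())
print(vocab)  # {'John': 0, 'likes': 1, 'to': 2, 'watch': 3, 'movies': 4, 'Mary': 5, 'too': 6}
print(bow)    # [1, 2, 1, 1, 2, 1, 1]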
Model 3) Document-Term Matrix : BoW for a set of multiple sentences (documents)
-
Model 4) TF-IDF : gives a higher score to terms (tokens) that occur frequently in a given document but rarely in the other documents
-
$TF(d,t)$ : The number of occurrences of a specific term $t$ in a particular document $d$.
-
$DF(t)$ : The number of documents in which a specific term $t$ appears.
-
$IDF(t)$ : Inverse document frequency, $IDF(t) = \log(\frac{n}{1+DF(t)})$, where $n$ is the number of documents in the corpus. The final score is $TF\text{-}IDF(d,t) = TF(d,t) \times IDF(t)$.
-
-
-
Limitations
-
There is no notion of similarity between words => Word2Vec
-
The dimension of the embedded vectors equals the vocabulary size, so it becomes very high for large corpora.
-
corpus = [
"John likes to watch movies. Mary likes movies too.", # doc1
"Mary also likes to watch football games." # doc2
]
one_hot_John = [1,0,0,0,0,0,0,0,0,0] # 10-dim: {"John","likes","to","watch","movies","Mary","too","also","football","games"}
BoW_sentence1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
DTM = [[1,2,1,1,2,1,1,0,0,0],[0,1,1,1,0,1,0,1,1,1]]
# Create DTM with sklearn.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(corpus)
DTM = count_vectorizer.transform(corpus)
# note: sklearn sorts the vocabulary alphabetically: ['also','football','games','john','likes','mary','movies','to','too','watch']
print(DTM.toarray()) # [[0,0,0,1,2,1,2,1,1,1],[1,1,1,0,1,1,0,1,0,1]]
# Create TF-IDF with sklearn TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer().fit(corpus)
tfidf = tfidf_vectorizer.transform([corpus[0]])
print(tfidf.toarray()) # [[0.00 0.00 0.00 0.32 0.46 0.23 0.65 0.23 0.32 0.23]] (same alphabetical vocabulary order as above)
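For comparison with the sklearn output above, here is a minimal hand-rolled TF-IDF sketch that follows the $IDF(t) = \log(\frac{n}{1+DF(t)})$ definition given earlier. It reuses the `corpus` variable from the block above; with only two documents the scores come out non-positive, which is one reason libraries like sklearn add smoothing and L2 normalization.
# Manual TF-IDF following the definitions above (illustrative; sklearn's TfidfVectorizer
# additionally smooths the IDF and L2-normalizes each row, so its numbers differ).
import math

docs = [doc.replace(".", "").split() for doc in corpus]   # reuse `corpus` defined above
vocab = sorted({word for doc in docs for word in doc})
n = len(docs)                                             # number of documents

def tf(d, t):                                             # TF(d, t): raw count of t in d
    return d.count(t)

def idf(t):                                               # IDF(t) = log(n / (1 + DF(t)))
    df = sum(t in d for d in docs)
    return math.log(n / (1 + df))

print(idf("movies"))   # log(2 / (1 + 1)) = 0.0   (appears in one document)
print(idf("likes"))    # log(2 / (1 + 2)) ≈ -0.41 (appears in both documents)
tfidf_doc1 = [tf(docs[0], t) * idf(t) for t in vocab]     # TF-IDF(d, t) = TF(d, t) * IDF(t)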
2. Word2Vec : Efficient Estimation of Word Representations in Vector Space (2013)
-
Introduction :
-
Current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary
-
The most successful concept is to use distributed representations of words
-
With the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity
$$ vector(\text{King}) - vector(\text{Man}) + vector(\text{Woman}) \approx vector(\text{Queen}) $$
-
-
Methods :
-
Continuous Bag of Words Model : predicting the current word based on the context
-
(1) Put context words into a simple neural net in the form of a one-hot vector.
-
(2) Train the model to predict what the central word is.
-
-
Continuous Skip-Gram Model : maximize classification of a word based on another word in the same sentence
-
(1) Use each current word as an input to a classifier
-
(2) Predict words within a certain range before and after the current word.
-
-
Figure: CBOW (top), Skip-Gram (bottom)
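A minimal training sketch with gensim; the parameter names assume gensim >= 4, and `sg=0` / `sg=1` select CBOW / Skip-Gram respectively (the toy sentences are illustrative):
# Word2Vec sketch with gensim: sg=0 -> CBOW, sg=1 -> Skip-Gram.
from gensim.models import Word2Vec

sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["mary", "also", "likes", "to", "watch", "football", "games"],
]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["movies"].shape)          # (50,) dense vector
print(skipgram.wv.most_similar("movies"))   # nearest words by cosine similarity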
3. ELMo : Deep contextualized word representations (2018)
-
Introduction :
-
Existing embedding models consider the surrounding context only during training.
-
That is, after training is finished, each word is matched 1:1 with a fixed embedding vector that does not change.
-
Therefore, although “bank” in “river bank” and “bank account” has completely different meanings, both are embedded to the same vector (polysemy).
-
-
Method: ELMo (Embeddings from Language Model)
-
Contextualized Word Embedding : ELMo reflects context not only during training but also when producing embeddings at prediction time.
-
Bidirectional LSTM : the final context-aware embedding vector is created by combining a forward LSTM and a backward LSTM
-
Forward LSTM : predicts the (n+1)-th word by looking at the first n words
-
Backward LSTM : predicts the (n-1)-th word by looking at the words that come after it.
-
-
-
Conclusion : biLM layers efficiently encode different types of syntactic and semantic information about words in-context
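Below is a tiny PyTorch sketch of the idea (not the actual ELMo architecture or weights): token embeddings go through a bidirectional LSTM, and the layers are mixed with learned softmax weights $s_j$ and a scale $\gamma$, echoing the paper's $ELMo_k = \gamma \sum_j s_j h_{k,j}$. Class and variable names are illustrative.
# Illustrative contextual-embedding sketch in PyTorch (not the real ELMo implementation).
import torch
import torch.nn as nn

class TinyELMo(nn.Module):
    def __init__(self, vocab_size, dim=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # bidirectional=True gives the forward + backward LSTM pair described above
        self.bilstm = nn.LSTM(dim, dim // 2, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.scalar_weights = nn.Parameter(torch.zeros(2))  # s_j over {embedding, LSTM output}
        self.gamma = nn.Parameter(torch.ones(1))             # task-specific scale

    def forward(self, token_ids):
        e = self.embed(token_ids)                  # (batch, seq, dim) - context-free layer
        h, _ = self.bilstm(e)                      # (batch, seq, dim) - contextual layer
        s = torch.softmax(self.scalar_weights, dim=0)
        return self.gamma * (s[0] * e + s[1] * h)  # weighted mix of layers

model = TinyELMo(vocab_size=100)
ids = torch.randint(0, 100, (1, 7))                # a batch with one 7-token "sentence"
print(model(ids).shape)                            # torch.Size([1, 7, 64])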
4. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)
-
Introduction : introduces a new language representation model called BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
-
Method:
-
Input Representation = Token Embeddings + Segment Embeddings + Position Embeddings
-
Token Embeddings : the [CLS] token aggregates the sequence representation for classification tasks, and the [SEP] token simply differentiates the sentences
-
Segment Embeddings : indicate whether each token belongs to sentence A or sentence B (details in Figure 2; see the tokenizer sketch after this list)
-
Position Embeddings : encode the position of each token in the sequence; see the Transformer paper (Attention Is All You Need) for details
-
-
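A quick way to inspect this input representation is the Hugging Face transformers tokenizer: [CLS] and [SEP] are inserted automatically, and token_type_ids carry the sentence A/B information that the segment embeddings consume. The expected outputs in the comments may vary slightly across tokenizer versions.
# Inspect BERT's input representation with the Hugging Face tokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("John likes movies.", "Mary likes football.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'john', 'likes', 'movies', '.', '[SEP]', 'mary', 'likes', 'football', '.', '[SEP]']
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]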
Architecture : BERT’s model architecture is a multi-layer bidirectional Transformer encoder (Only encoder part)
-
Pretrain : pre-train BERT using two unsupervised tasks
-
Task #1 Masked LM : simply mask some percentage (15%) of the input tokens at random and then predict those masked tokens (a minimal example is sketched after this list)
-
Task #2 Next Sentence Prediction : train a model that understands sentence relationships, by a binarized next sentence prediction task.
-
-
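The masked-LM objective can be tried directly with the transformers fill-mask pipeline; the completions in the comment are indicative, not exact.
# Masked LM demo: BERT predicts the token hidden behind [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Mary likes to watch football [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# likely completions include "games", "matches", ...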
Finetune : For each downstream task, simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end (see the sketch below)
-
-
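A minimal fine-tuning sketch with a sequence-classification head on top of BERT; the toy data, label meanings, and hyperparameters are illustrative.
# Fine-tuning sketch: task-specific inputs/labels are plugged into BERT end-to-end.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                      # toy binary sentiment labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)            # loss comes from the head on [CLS]
outputs.loss.backward()                            # gradients flow through all BERT parameters
optimizer.step()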
Conclusion : the main contribution is generalizing these findings (unsupervised pre-training improves downstream NLP tasks) to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
5. Multilingual SentenceBert : Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
-
Introduction :
-
We present an easy and efficient method to extend existing sentence embedding models to new languages
-
based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence
-
-
Method :
-
Multilingual knowledge distillation : use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model.
-
Training : With $s_j$ a sentence in one of the source languages and $t_j$ its translation in one of the target languages, we minimize the MSE over a mini-batch $\mathcal{B}$ (a minimal sketch of this loss follows below)
$$ \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \left[ \left(M(s_j)-M'(s_j)\right)^2 + \left(M(s_j)-M'(t_j)\right)^2 \right] $$
-
Architecture : mainly use an English SBERT model as teacher model $M$ and use XLM-RoBERTa (XLM-R) as student model $M'$
-
-
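The distillation objective is easy to sketch in PyTorch; the tensors below stand in for the teacher $M$ (SBERT) and student $M'$ (XLM-R) embeddings rather than calling any actual sentence-transformers API.
# Multilingual knowledge-distillation loss sketch (PyTorch).
# teacher_src / student_src / student_tgt stand in for M(s_j), M'(s_j), M'(t_j).
import torch

def distillation_loss(teacher_src, student_src, student_tgt):
    # MSE between teacher source embeddings and student embeddings of source and translation
    return ((teacher_src - student_src) ** 2).mean() + \
           ((teacher_src - student_tgt) ** 2).mean()

# toy mini-batch of 4 sentence pairs with 768-dim embeddings
teacher_src = torch.randn(4, 768)                     # frozen teacher embeddings
student_src = torch.randn(4, 768, requires_grad=True)
student_tgt = torch.randn(4, 768, requires_grad=True)

loss = distillation_loss(teacher_src, student_src, student_tgt)
loss.backward()                                       # only the student receives gradients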
Conclusion :
-
presented a method to make monolingual sentence embeddings multilingual, with aligned vector spaces between the languages
-
We demonstrate the effectiveness of our approach for 50+ languages from various language families.
-
Code to extend sentence embeddings models to more than 400 languages is publicly available.
-