Text Representation (Embedding) : When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to “vectorize” the text) before feeding it to the model.
Sparse Representation : One-hot encoding, Document Term Matrix, etc.
Dense Representation : Word2Vec, Glove, FastText, etc.
Pretrained Word Embedding : ELMo, GPT, BERT
1. Sparse Representations
Introduction : A sparse representation embeds a word as a vector with relatively few nonzero elements (most elements of the vector are zero).
Model 1) One-hot Encoding : a 1 × N vector, where N is the vocabulary size, with a 1 at the word's index and 0 everywhere else, used to distinguish each word in a vocabulary from every other word in the vocabulary.
Model 2) Bag of Words (BoW) : In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Give each word a unique integer index first.
Create a vector recording the number of occurrences of word tokens in each index.
Model 3) Document-Term Matrix : the BoW vectors of multiple documents (or sentences) stacked into one matrix, with one row per document and one column per term.
Model 4) TF-IDF : gives a higher score to terms (tokens) that occur frequently in a given document but rarely in the other documents.
$TF(d,t)$ : The number of occurrences of a specific term $t$ in a particular document $d$.
$DF(t)$ : The number of documents in which a specific term $t$ appears.
$IDF(t)$ : The inverse of $DF(t)$, defined as $IDF(t) = \log\left(\frac{n}{1+DF(t)}\right)$, where $n$ is the number of documents in the corpus. The final score is $TF(d,t) \times IDF(t)$.
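Sparse vectors carry no notion of similarity between words (any two distinct one-hot vectors are orthogonal), which motivates dense embeddings such as Word2Vec.
The dimension of the embedded vector grows with the vocabulary size, so it becomes too high for large vocabularies.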
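The lack of similarity can be checked directly. The sketch below is a minimal illustration, assuming the ten-word vocabulary of the example corpus; the helper names `one_hot` and `cosine` are introduced here for illustration only.

```python
import math

# Vocabulary of the example corpus, in order of first appearance.
vocab = ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]

def one_hot(word):
    """Return the one-hot vector of `word` over `vocab`."""
    return [1 if w == word else 0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Two related words are exactly as dissimilar as any other pair of distinct words.
print(cosine(one_hot("movies"), one_hot("football")))  # 0.0
print(cosine(one_hot("movies"), one_hot("movies")))    # 1.0
```

Every pair of distinct words gets similarity 0, and the vector length equals the vocabulary size, which are the two limitations listed above.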
corpus = [
    "John likes to watch movies. Mary likes movies too.",  # doc1
    "Mary also likes to watch football games.",  # doc2
]
# Vocabulary order used below: {"John","likes","to","watch","movies","Mary","too","also","football","games"}
one_hot_John = [1,0,0,0,0,0,0,0,0,0]  # 10 entries, one per vocabulary word
BoW_sentence1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
DTM = [[1,2,1,1,2,1,1,0,0,0],[0,1,1,1,0,1,0,1,1,1]]
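# Create DTM with sklearn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(corpus)
# CountVectorizer lowercases tokens and sorts the vocabulary alphabetically:
# also, football, games, john, likes, mary, movies, to, too, watch
DTM = count_vectorizer.transform([corpus[0]])
print(DTM.toarray()) # [[0 0 0 1 2 1 2 1 1 1]]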
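# Create TF-IDF with sklearn TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer().fit(corpus)
tfidf = tfidf_vectorizer.transform([corpus[0]])
# Columns follow the alphabetically sorted vocabulary:
# also, football, games, john, likes, mary, movies, to, too, watch
# By default sklearn uses a smoothed IDF, ln((1+n)/(1+DF(t))) + 1, and L2-normalizes each row,
# so the values differ from the IDF formula given above.
print(tfidf.toarray().round(2)) # approximately [[0. 0. 0. 0.32 0.46 0.23 0.65 0.23 0.32 0.23]]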
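For comparison, the definitions of $TF$, $DF$, and $IDF$ given earlier can be computed by hand. This is a minimal sketch, not a replacement for the library: the `tokenize` helper is a naive assumption (strip periods, split on whitespace), and the natural logarithm is assumed for $\log$.

```python
import math
from collections import Counter

corpus = [
    "John likes to watch movies. Mary likes movies too.",  # doc1
    "Mary also likes to watch football games.",            # doc2
]

def tokenize(text):
    # Naive tokenizer (an assumption): drop periods and split on whitespace.
    return text.replace(".", " ").split()

docs = [tokenize(d) for d in corpus]

# Vocabulary in order of first appearance, matching the hand-written examples.
vocab = []
for doc in docs:
    for tok in doc:
        if tok not in vocab:
            vocab.append(tok)

# Document-term matrix: one BoW count vector per document.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

n = len(docs)
# DF(t): number of documents that contain term t.
df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
# IDF(t) = log(n / (1 + DF(t))), using the natural logarithm.
idf = {t: math.log(n / (1 + df[t])) for t in vocab}
# TF-IDF score: TF(d, t) * IDF(t).
tfidf = [[tf * idf[t] for tf, t in zip(row, vocab)] for row in dtm]

print(dtm[0])  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
```

With only two documents, this formula gives an IDF of 0 to terms that appear in one document and a negative IDF to terms that appear in both. That is one reason libraries such as scikit-learn use a smoothed variant instead.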
2. Word2Vec : Efficient Estimation of Word Representations in Vector Space (2013)
Introduction :
Many NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, as these are represented as indices in a vocabulary.
The most successful concept is to use distributed representations of words.
The expectation is that not only will similar words tend to be close to each other, but also that words can have multiple degrees of similarity, for example:
$$ vector(\text{King}) - vector(\text{Man}) + vector(\text{Woman}) \approx vector(\text{Queen}) $$
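The analogy can be read as a nearest-neighbour search around the vector on the left-hand side. The sketch below assumes hand-made 2-d toy vectors (not trained embeddings) so that the arithmetic is easy to follow; `vectors`, `cosine`, and `analogy` are illustrative names.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
# Dimension 0 loosely encodes "royalty", dimension 1 loosely encodes "gender".
vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

In trained embeddings the result is only approximately equal to the target word's vector, so the answer is taken to be the closest remaining word.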
Methods :
Continuous Bag of Words Model : predicting the current word based on the context
(1) Put context words into a simple neural net in the form of a one-hot vector.
(2) Train the model to predict what the central word is.
Continuous Skip-Gram Model : maximize classification of a word based on another word in the same sentence
(1) Use each current word as an input to a classifier
(2) Predict words within a certain range before and after the current word.
Figure : CBOW (top), Skip-Gram (bottom)
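The two models differ mainly in how training examples are built from a sentence. The sketch below only generates the (input, target) pairs and does not train a network; the function names and the `window` parameter are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs: CBOW predicts the center from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        # Up to `window` words before and after position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: skip-gram predicts each context word from the center."""
    return [(center, ctx)
            for context, center in cbow_pairs(tokens, window)
            for ctx in context]

tokens = "john likes to watch movies".split()
print(cbow_pairs(tokens, window=1)[1])       # (['john', 'to'], 'likes')
print(skipgram_pairs(tokens, window=1)[:3])  # [('john', 'likes'), ('likes', 'john'), ('likes', 'to')]
```

CBOW produces one example per position with several inputs, while skip-gram produces one example per (center, context) pair.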
3. ELMo : Deep contextualized word representations (2018)
Introduction :
Existing embedding models consider the surrounding context only during the training step.
That is, once training is finished, each word maps 1:1 to a fixed embedding vector that does not change.
Therefore, although “bank” in “River Bank” and in “Bank Account” has completely different meanings (polysemy), both occurrences are embedded to the same vector.
Method: ELMo (Embeddings from Language Model)
Contextualized Word Embedding : ELMo reflects context not only during training but also at inference time, when the embedding is computed.
Bidirectional-LSTM : The final context-aware embedding vector is built from a forward LSTM and a backward LSTM.
Forward LSTM : predicts the (n+1)th word by looking at the first n words.
Backward LSTM : predicts the (n-1)th word by looking at the words from the nth onward (the sequence read in reverse).
Conclusion : biLM layers efficiently encode different types of syntactic and semantic information about words in-context
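In the ELMo paper, the per-layer biLM states of a token are collapsed into one vector by a softmax-weighted sum scaled by a task-specific factor $\gamma$. The sketch below illustrates only that combination step on toy numbers; the function names are hypothetical, and each layer state is assumed to already be the concatenation of the forward and backward hidden states.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_vector(layer_states, layer_logits, gamma=1.0):
    """Combine the layer states of one token into a single embedding.

    layer_states: one vector per layer (token layer plus each biLSTM layer).
    layer_logits: one learnable scalar per layer, normalized with softmax.
    gamma: task-specific scaling factor.
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]

# Toy states for one token across three layers (made-up numbers).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_vector(states, layer_logits=[0.0, 0.0, 0.0])
print(vec)  # equal logits give a plain average of the layers
```

Because the layer weights are learned per downstream task, a task can emphasize whichever layers carry the most useful syntactic or semantic information.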
4. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)
Introduction : The paper introduces a new language representation model called BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Input Representation = Token Embeddings + Segment Embeddings + Position Embeddings
Token Embeddings : the [CLS] token aggregates the sequence representation for classification tasks, and the [SEP] token simply differentiates the sentences
Segment Embeddings : details in Figure 2
Position Embeddings : details in the Transformer paper (Attention Is All You Need)
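The sum of the three embeddings can be sketched with small random lookup tables. Everything below is illustrative: the tables are random rather than learned, the hidden size is tiny, and names such as `make_table` are hypothetical.

```python
import random

random.seed(0)
HIDDEN = 4  # tiny hidden size for illustration only

def make_table(rows, dim=HIDDEN):
    """Random stand-in for a learned embedding table of shape (rows, dim)."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

tokens = ["[CLS]", "my", "dog", "[SEP]", "he", "barks", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]  # sentence A = 0, sentence B = 1

# Toy vocabulary: unique tokens in order of first appearance.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

token_table = make_table(len(vocab))
segment_table = make_table(2)
position_table = make_table(len(tokens))

# Input representation = token + segment + position embedding, element-wise.
inputs = [
    [t + s + p for t, s, p in zip(token_table[vocab[tok]],
                                  segment_table[seg],
                                  position_table[pos])]
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids))
]

print(len(inputs), len(inputs[0]))  # 7 4
```

The two [SEP] tokens share a token embedding but receive different input vectors, because their segment and position embeddings differ.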
Architecture : BERT’s model architecture is a multi-layer bidirectional Transformer encoder (Only encoder part)
Pretrain : pre-train BERT using two unsupervised tasks
Task #1 Masked LM : simply mask some percentage (15%) of the input tokens at random and then predict those masked tokens
Task #2 Next Sentence Prediction : train a model that understands sentence relationships, by a binarized next sentence prediction task.
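The masking step of Task #1 can be sketched as follows. In the BERT paper a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. This sketch selects each token independently with probability `mask_prob`, which only approximates choosing 15% of positions; the function name and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked tokens, labels); labels are None where nothing is predicted."""
    if rng is None:
        rng = random.Random(0)
    special = {"[CLS]", "[SEP]"}  # never masked
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in special or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab=["cat", "runs", "blue"])
print(masked)
print(labels)
```

The loss is computed only at positions whose label is not None, so the model cannot rely on the [MASK] token alone to know where a prediction is required.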
Finetune : For each sub-task, simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end
Conclusion : The paper generalizes earlier findings (unsupervised pre-training can improve downstream NLP tasks) to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
5. Multilingual SentenceBert : Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Introduction :
We present an easy and efficient method to extend existing sentence embedding models to new languages
based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence
Method :
Multilingual knowledge distillation : use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model.
Training : With $s_j$ a sentence in one of the source languages and $t_j$ its translation in one of the target languages, we minimize the mean squared error (MSE) over a mini-batch $\mathcal{B}$
$$ \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \left[ (M(s_j)-M'(s_j))^2 + (M(s_j)-M'(t_j))^2 \right] $$
Architecture : mainly use an English SBERT model as teacher model $M$ and use XLM-RoBERTa (XLM-R) as student model $M'$
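The training objective can be sketched on precomputed embeddings. The function below is a minimal illustration, not the authors' implementation: it takes the teacher embeddings $M(s_j)$ and the student embeddings $M'(s_j)$ and $M'(t_j)$ as plain lists, and reads each squared term as a squared Euclidean distance (an assumption).

```python
def distillation_loss(teacher_src, student_src, student_tgt):
    """Batch mean of ||M(s_j) - M'(s_j)||^2 + ||M(s_j) - M'(t_j)||^2."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    total = 0.0
    for m_s, ms_s, ms_t in zip(teacher_src, student_src, student_tgt):
        total += sq_dist(m_s, ms_s) + sq_dist(m_s, ms_t)
    return total / len(teacher_src)

# Toy batch with one sentence pair (made-up 2-d embeddings).
teacher_src = [[1.0, 0.0]]   # M(s_j): teacher on the source sentence
student_src = [[1.0, 0.0]]   # M'(s_j): student on the source sentence
student_tgt = [[0.0, 0.0]]   # M'(t_j): student on the translated sentence
print(distillation_loss(teacher_src, student_src, student_tgt))  # 1.0
```

The loss reaches zero only when the student maps both the source sentence and its translation onto the teacher's embedding of the source sentence, which is what aligns the vector spaces across languages.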
Conclusion :
The paper presents a method to make monolingual sentence embeddings multilingual, with aligned vector spaces between the languages
We demonstrate the effectiveness of our approach for 50+ languages from various language families.
Code to extend sentence embeddings models to more than 400 languages is publicly available.