NLP #3 | Language Modeling


Summary

  • Language Modeling : the task of predicting the next word or character in a document. It can be used in downstream tasks such as:

    1. Machine Translation : compare candidate sentences with a language model and return the more natural one (see the sketch after this list)

    2. Spell Correction : correct spelling by choosing the more natural word

    3. Speech Recognition : refine the recognition result by preferring more natural word sequences
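
A minimal sketch of the "pick the more natural sentence" idea, assuming a toy bigram language model with add-one smoothing; the corpus, scores, and helper names are illustrative, not from any specific library.

```python
# Toy bigram language model used to rank two candidate sentences.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def score(sentence):
    """Bigram probability of a sentence with add-one smoothing (toy example)."""
    words = sentence.split()
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))
    return prob

# The more "natural" sentence (under this toy model) gets the higher probability.
candidates = ["the cat sat on the mat", "mat the on sat cat the"]
print(max(candidates, key=score))
```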



1. Seq2Seq : Sequence to Sequence Learning with Neural Networks (2014)

  • Introduction : DNNs work well on many tasks, but they cannot be used to map sequences to sequences. The authors present a general end-to-end approach to sequence learning.
  • Method : use a multilayered LSTM to map the input sequence to a fixed-dimensional vector (the context vector), and then another deep LSTM to decode the target sequence from that vector (sketched below)
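
A minimal sketch of this encoder-decoder setup, assuming PyTorch; the vocabulary sizes, dimensions, and class name are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb_dim=256, hid_dim=512, n_layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder LSTM compresses the source sequence into its final (h, c) states.
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
        # Decoder LSTM starts from the encoder's final states (the fixed-size "context").
        self.decoder = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.src_emb(src_ids))   # fixed-size context vector
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)                          # logits per target time step

model = Seq2Seq()
src = torch.randint(0, 1000, (4, 12))   # batch of 4 source sentences, length 12
tgt = torch.randint(0, 1000, (4, 10))   # shifted target tokens fed to the decoder
logits = model(src, tgt)                # shape: (4, 10, tgt_vocab)
```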


2. Attention : Neural Machine Translation by Jointly Learning to Align and Translate (2014)

  • Introduction : a potential issue with the encoder-decoder approach is that the neural network needs to compress all the necessary information of a source sentence into a fixed-length vector (the context vector).

  • Method : rather than using a fixed-length context vector (the last hidden state of the encoder), we can combine each encoder state with the current decoder state to generate a dynamic context vector

    1. Attention Weight : by adding a small scoring layer (FC + Softmax) over the encoder outputs, we get a weight for each input word with respect to the output at time step $t$.
    2. Dynamic Context Vector : the context vector for each time step, $c_t = \sum_i s_{ti} h_i$, is the weighted sum of the encoder hidden states $h_i$ with the attention weights $s_{ti}$ (the softmax outputs); see the sketch after this list.
    3. Teacher Forcing : if the model makes a wrong prediction $y_t$ at time step $t$, the model is then trained on a wrong input at time step $t+1$. So during training we feed the ground-truth token of step $t$, rather than the wrong prediction, as the decoder input for step $t+1$.
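
A minimal sketch of computing attention weights and the dynamic context vector $c_t$, assuming NumPy; the paper's additive (FC + Softmax) scoring is simplified to a dot product here, and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T_src, hid = 5, 8
h = np.random.randn(T_src, hid)      # encoder hidden states h_1 .. h_T
s_t = np.random.randn(hid)           # current decoder state at step t

scores = h @ s_t                     # one alignment score per source position
weights = softmax(scores)            # attention weights (sum to 1)
c_t = weights @ h                    # dynamic context vector: weighted sum of h_i

print(weights.round(3), c_t.shape)   # weights per source word, context of shape (hid,)
```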


3. Transformer : Attention Is All You Need (2017)

  • Introduction : the authors propose a new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely

  • Method : Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

    • Encoder : Self-Attention Layer (Scaled Dot-Product Attention) + Feed-Forward NN

      1. Word Embedding : word to 512-dim vector
      2. Generate query, key, value vectors of the $i$-th word, $q_i, k_i, v_i$, by multiplying its embedding with the trainable weights $W_Q, W_K, W_V$
      3. Attention Score, $\mathrm{softmax}(s_{ij})$ : by calculating $s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$ we get a score between the $i$-th word and every $j$-th word, with the dot products scaled by $\frac{1}{\sqrt{d_k}}$
        • We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
      4. Weighted Sum : $\mathrm{output}_1 = \tilde{s}_{11} v_1 + \tilde{s}_{12} v_2 + \tilde{s}_{13} v_3 + \dots$, where $\tilde{s}_{1j} = \mathrm{softmax}(s_{1j})$ (the steps above are sketched in code below)
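
A minimal sketch of scaled dot-product self-attention for a single head, assuming NumPy; the dimensions and the random weight initialization are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 4, 512, 64             # sequence length, embedding dim, head dim
X = np.random.randn(T, d_model)           # step 1: word embeddings (one row per token)
W_Q = np.random.randn(d_model, d_k) * 0.02
W_K = np.random.randn(d_model, d_k) * 0.02
W_V = np.random.randn(d_model, d_k) * 0.02

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # step 2: q_i, k_i, v_i for every word
scores = Q @ K.T / np.sqrt(d_k)           # step 3: s_ij = q_i . k_j / sqrt(d_k)
weights = softmax(scores, axis=-1)        # attention weights per row (per query word)
output = weights @ V                      # step 4: weighted sum of value vectors
print(output.shape)                       # (T, d_k)
```
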
    • Decoder : almost the same architecture as the Encoder

      1. In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to $-\infty$) before the softmax step in the self-attention calculation.
      2. The above difference changes the attention score to $s_{ij} = q_i \cdot k_j$ only for $j \le i$, with $s_{ij} = -\infty$ for $j > i$ (see the mask sketch after this list)
      3. The decoder stack outputs a vector of floats, which is turned into a word by a final Linear layer followed by a Softmax layer.
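
A minimal sketch of the look-ahead mask applied before the softmax, assuming NumPy; the scores matrix is random here and only the masking step is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T = 4
scores = np.random.randn(T, T)                    # s_ij = q_i . k_j / sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal (j > i)
scores = np.where(mask, -np.inf, scores)          # future positions -> -inf
weights = softmax(scores, axis=-1)                # each row attends only to j <= i
print(weights.round(2))                           # lower-triangular attention matrix
```
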
    • Where do query, key, value come from :

      • The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best-matched videos (values).
      • The attention operation can be thought of as a retrieval process as well.
    • Positional Encoding (fixed, different from a learned positional embedding) : there is no recurrence and no convolution in the model, so in order for the model to make use of the order of the sequence => we must inject some information about the relative or absolute position of the tokens in the sequence.

      • Simple indexing : just create a new vector where every entry is its index number => the unbounded values cause exploding gradients and unstable training

      • Normalized indexing : just divide everything by the largest index so all of the values are in [0, 1] => the same value means different positions for sequences of different lengths

      • Binarized indexing : instead of writing, say, 35 for the 35th element, we could represent it via its binary form 100011 => but these binary vectors come from a discrete function, not a discretization of a continuous function

      • Sinusoidal Positional Encoding : find a way to make the binary-like vector the discretization of something continuous (vanilla Transformer; see the sketch after this list)

      • Learnable Positional Embedding : instead of a fixed function, treat the position vectors themselves as trainable parameters (Vision Transformer)
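
A minimal sketch of the sinusoidal positional encoding, assuming NumPy; the sequence length is illustrative, and the resulting matrix is simply added to the word embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len=50, d_model=512):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding()
print(pe.shape)   # (50, 512); added to the 512-dim word embeddings before the encoder
```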
