Summary
-
Language Modeling : the task of predicting the next word or character in a document that can be used in downstream tasks like:
-
Machine Translation : Compare candidate translations with a language model and return the more natural sentence
-
Spell Correction : Correct spelling by choosing the more natural word
-
Speech Recognition : Correct the recognition result with more natural words
-
Natural Language Generation : Produce natural language output
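As a rough illustration of "choosing the more natural sentence", here is a minimal sketch of an add-one smoothed bigram language model. The toy corpus and the helper names `log_prob` and `more_natural` are invented for this example and do not come from any of the papers below.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "i am going to the store",
    "i am going to school",
    "she is going to the store",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)


def log_prob(sentence: str) -> float:
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def more_natural(a: str, b: str) -> str:
    """Return whichever sentence the language model scores higher."""
    return a if log_prob(a) >= log_prob(b) else b


best = more_natural("i am going to the store", "i am store the to going")
```

The same scoring idea carries over to translation candidates, spelling candidates, or speech recognition hypotheses; a neural language model would replace the bigram counts.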
-
-
Model Overview
-
LSTM based models
-
Transformer - Standard in 2025
-
Architecture : Decoder only
-
Layer Normalization : Pre-layer normalization
-
Positional Embedding: Rotary Positional Embedding (RoPE)
-
Mixture of Experts (MoE)
-
-
1. Seq2Seq Learning with Neural Networks (2014)
- Introduction : DNNs work well, but they cannot be used to map sequences to sequences. The authors present a general end-to-end approach to sequence learning.
- Method : Use a multilayered LSTM to map the input sequence to a fixed-dimensional vector (context vector), and then another deep LSTM to decode the target sequence from that vector
2. Attention : Neural Machine Translation by Jointly Learning to Align and Translate (2014)
-
Introduction : A potential issue with the encoder-decoder approach is that a NN needs to be able to compress all the necessary information of a source sentence into a fixed-length vector (context vector).
-
Method : Rather than using a single fixed-length context vector (the last hidden state of the encoder), use every encoder state together with the current decoder state to generate a dynamic context vector
- Attention Weight : By applying a small feed-forward layer followed by a softmax (FC + Softmax) to each encoder output together with the decoder state, we get a weight for each input word for the output at time step $t$.
- Dynamic Context Vector : The context vector $c_t$ for each time step is the weighted sum of the encoder hidden states $h_i$, weighted by the attention weights $s_i$ (softmax outputs): $c_t = \sum_i s_i h_i$.
- Teacher Forcing : If the model makes a wrong prediction $y_t$ at time step $t$, feeding it back would train the model on a wrong input at time step $t+1$. So the ground truth (GT), rather than the wrong prediction, is passed to the decoder as the input for time step $t+1$.
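The attention weight and dynamic context vector steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the sizes, the random parameters `W_h`, `W_s`, `v`, and the name `dec_state` are hypothetical, and the scoring network is the additive (tanh) form commonly associated with this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4  # hypothetical sizes

h = rng.normal(size=(T_src, enc_dim))   # encoder hidden states h_i, one row per source word
dec_state = rng.normal(size=(dec_dim,))  # decoder state used to score the source words

# Parameters of the small feed-forward scoring network (random here, learned in practice).
W_h = rng.normal(size=(enc_dim, attn_dim))
W_s = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=(attn_dim,))


def softmax(x):
    x = x - x.max()          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()


# One unnormalized score per source word, shape (T_src,).
scores = np.tanh(h @ W_h + dec_state @ W_s) @ v
# Attention weights s_i: non-negative and summing to 1.
attn_weights = softmax(scores)
# Dynamic context vector c_t = sum_i s_i * h_i, shape (enc_dim,).
context = attn_weights @ h
```

A new `context` is computed at every decoder step, because `dec_state` changes from step to step.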
3. Transformer : Attention Is All You Need (2017)
-
Introduction :
-
The authors propose a new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
-
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
-
-
Method : Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
-
Encoder : Self Attention Layer (Scaled Dot-product Attention) + Feed Forward NN
- Word Embedding : word to 512-dim vector
- Generate the query, key, and value vectors of the $i$-th word, $q_i, k_i, v_i$, by multiplying its embedding by the trainable weights $W_Q, W_K, W_V$
- Attention Score, $softmax(s_{ij})$ : computing $s_{ij} = q_i \cdot k_j$ gives the score between the $i$-th word and every $j$-th word; the dot products are then scaled by $\frac{1}{\sqrt{d_k}}$
- The authors suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
- Weighted Sum : with $S_{ij}$ the softmax-normalized scaled score, $output_1 = S_{11} v_1 + S_{12} v_2 + S_{13} v_3 + \dots$
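The encoder steps above (projection to query/key/value, scaled scores, softmax, weighted sum) can be sketched for a single attention head as below. The function name `self_attention`, the random weights, and the sizes (4 words, 512-dim embeddings, $d_k = 64$) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # s_ij = q_i . k_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output_i = sum_j S_ij * v_j


rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 512, 64       # hypothetical sizes
X = rng.normal(size=(n_words, d_model))  # word embeddings, one row per word
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
```

Row $i$ of `weights` holds the attention of word $i$ over all words, and row $i$ of `output` is the corresponding weighted sum of value vectors.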
-
Decoder : Almost the same architecture as the Encoder
- In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
- This difference changes the attention score to: $s_{ij} = q_i \cdot k_j$ for $j \le i$, and $s_{ij} = -\infty$ for $j > i$.
- The decoder stack outputs a vector of floats, which the final Linear layer followed by a Softmax layer turns into a word.
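The decoder's masking of future positions can be sketched as follows. This is a minimal NumPy illustration; the helper name `causal_softmax` is hypothetical, and the all-zero score matrix is used only to make the effect of the mask easy to see.

```python
import numpy as np


def causal_softmax(scores):
    """Softmax over each row after setting future positions (j > i) to -inf."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # the diagonal is never masked, so max is finite
    e = np.exp(masked)                                    # exp(-inf) = 0 for future positions
    return e / e.sum(axis=-1, keepdims=True)


scores = np.zeros((3, 3))          # equal raw scores for every pair of positions
weights = causal_softmax(scores)
```

With equal raw scores, position 0 attends only to itself, position 1 splits its attention over positions 0 and 1, and position 2 spreads it evenly over all three.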
-
Positional Encoding : The model has no recurrence and no convolution, so for it to make use of the order of the sequence we must inject some information about the relative or absolute position of the tokens in the sequence. (See 5. RoPE for details.)
-
Simple indexing : use a vector in which every entry is the position index. => the values grow without bound, causing exploding gradients and unstable training
-
Normalized indexing : Just divide everything by the largest integer so all of the values are in [0,1]
-
Binarized indexing : Instead of writing, say, 35 for the 35th element, represent it by its binary form 100011. => The binary vectors come from a discrete function, not from a discretization of a continuous function
-
Sinusoidal Positional Encoding : find a way to make the binary vector a discretization of something continuous. (Vanilla transformer)
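The sinusoidal encoding of the vanilla Transformer can be sketched as below, using $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The function name and the sizes are chosen for illustration, and the sketch assumes an even `d_model`.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings (d_model even)."""
    positions = np.arange(max_len)[:, None]            # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
```

Unlike simple indexing, every entry stays in $[-1, 1]$ regardless of position, and unlike binary indexing the values vary smoothly with position.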
-
-
-
References:
-
Appendix1. Where do query, key, and value come from :
- The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine maps your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you the best-matched videos (values).
- The attention operation can be thought of as a retrieval process as well.
Self-Attention Layer

Handcrafted Positional Encoding
4. On Layer Normalization in the Transformer Architecture : Pre-LN (2020)
- Abstract
- This paper investigates the impact of Layer Normalization (LayerNorm) placement in Transformer architectures
- LayerNorm is traditionally applied after the residual connection (Post-LN); the authors study the alternative of applying it before the sub-layer (Pre-LN)
- They show that Pre-LN Transformers are more stable during training, especially for deep networks, and converge faster without requiring the learning-rate warm-up stage
- Method : LayerNorm is applied before the multi-head attention and feed-forward sub-layers, rather than after the residual connection.
- Mathematical Intuition : consider backpropagation. The gradient in Pre-LN is more directly connected to $x^{(l)}$, because the residual is untouched by non-linearities or normalization:
- Post-LN: $x^{(l+1)} = \mathcal{LN}(x^{(l)} +\mathcal{F}(x^{(l)}))$
- Pre-LN: $x^{(l+1)} = x^{(l)} + \mathcal{F}(\mathcal{LN}(x^{(l)}))$
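The two placements can be compared in a minimal NumPy sketch. Assumptions: `layer_norm` omits the learnable gain and bias, and `sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer $\mathcal{F}$.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def post_ln_block(x, sublayer):
    # Post-LN: x^{(l+1)} = LN(x^{(l)} + F(x^{(l)}))
    return layer_norm(x + sublayer(x))


def pre_ln_block(x, sublayer):
    # Pre-LN: x^{(l+1)} = x^{(l)} + F(LN(x^{(l)}))
    return x + sublayer(layer_norm(x))


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1


def sublayer(z):
    """Hypothetical stand-in for multi-head attention or the feed-forward network."""
    return np.tanh(z @ W)


x = rng.normal(size=(4, 16))
y_post = post_ln_block(x, sublayer)
y_pre = pre_ln_block(x, sublayer)
```

In `pre_ln_block` the input `x` reaches the output through a plain addition, which is the identity path the gradient argument above relies on; in `post_ln_block` that path passes through `layer_norm`.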
- Conclusion
- Pre-LN Transformers offer a simple but effective modification to the standard architecture, addressing gradient vanishing issues and facilitating the training of deeper models.
5. Enhanced Transformer with Rotary Position Embedding (2021)
-
Abstract : introduces Rotary Positional Embedding (RoPE), a novel method for encoding positional information in Transformer models.
-
Background1. Position Embedding
- The self-attention first incorporates position information into the word embeddings $x_m$ and $x_n$ (at positions $m$ and $n$) and transforms them into query, key, and value representations.
- $q_m = f_q(x_m,m)$,
- $k_n = f_k(x_n,n)$,
- $v_n = f_v(x_n,n)$
-
Background2. Absolute Position Embedding (2017)
- A typical choice of $f$ for $q, k, v$ is to add positional information $p_i$ to the embedding before the projection:
- $f_{t:t\in\{q,k,v\}}(x_i,i) := W_{t:t\in\{q,k,v\}}(x_i + p_i)$
- Vaswani et al. (2017) proposed generating $p_i$ with sinusoidal functions
- Lack of relative positional information : absolute position does not help the model understand relative distances between tokens
- Poor generalization to longer sequences
-
Background3. Relative Position Embedding (2018)
- Replace the absolute position embedding $p_n$ with its sinusoid-encoded relative counterpart $p_{m−n}$ (the distance between position m and n)
- Heavy and complex : requires modifying the self-attention mechanism by adding relative position bias terms
- Lack of absolute position information
-
Method. Rotary Position Embedding (proposed)
- Instead of adding positional encodings, RoPE applies rotations (a matrix transform) to the projected token embeddings by multiplying them by $e^{im\theta}$ (from Euler's formula) :
- $f_q(x_m,m) = (W_q x_m)e^{im\theta}$
- $f_k(x_n,n) = (W_k x_n)e^{in\theta}$
- Applying RoPE to self-attention (inner product of q and k) :
- $q_m^\top k_n = (R^d_{\Theta,m} W_q x_m)^\top (R^d_{\Theta,n} W_k x_n) = x_m^\top W_q^\top R^d_{\Theta,n-m} W_k x_n$
- Now the relative position $n-m$ appears naturally in the inner product used by self-attention, while absolute position information is encoded through the rotation matrix.
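The rotation and its relative-position property can be checked numerically with a minimal sketch. Assumptions: the function name `rope` is hypothetical, `q` and `k` stand for the already projected vectors $W_q x_m$ and $W_k x_n$, adjacent dimension pairs are rotated together, and the frequencies are $\theta_i = 10000^{-2(i-1)/d}$.

```python
import numpy as np


def rope(x, position, base=10000.0):
    """Rotate each adjacent pair of dimensions of x by the angle position * theta_i."""
    d = x.shape[-1]                                  # must be even
    theta = base ** (-np.arange(0, d, 2) / d)        # one frequency per 2D pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin      # 2D rotation, first coordinate
    out[..., 1::2] = x_even * sin + x_odd * cos      # 2D rotation, second coordinate
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=8)   # stands for W_q x_m
k = rng.normal(size=8)   # stands for W_k x_n

# Both pairs of positions have the same offset n - m = 4.
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 10) @ rope(k, 14)
```

`score_a` and `score_b` agree up to floating-point error, since the inner product depends only on the offset $n - m$; the rotation also leaves the norm of each vector unchanged.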
-
Conclusion
- RoPE provides an efficient and effective means of encoding relative positional information in Transformer models.
- It enhances performance on various tasks and maintains robustness across sequence lengths.