LM #2 | Language Model Pretraining


Summary

  • Pretraining Tasks Overview

    • Masked LM : simply mask some of the input tokens at random and then predict those masked tokens

    • Next Sentence Prediction: train a model that understands sentence relationships, by a binarized next sentence prediction task.



1. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)

  • Introduction : introduce a new language representation model called BERT, that is designed to pretrain deep bi-directional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

  • Method:

    1. Input Representation = Token Embeddings + Segment Embeddings + Position Embeddings

      • Token Embeddings : [CLS] token aggregate sequence representation for classification task and [SEP] token simply diffenciate the sentences

      • Segment Embeddings : details in figure2

      • Positional Embeddings : details in transformer (Attention is all you need)

    2. Architecture : BERT’s model architecture is a multi-layer bidirectional Transformer encoder (Only encoder part)

    3. Pretrain : pre-train BERT using two unsupervised tasks

      • Task #1 Masked LM : simply mask some percentage(15%) of the input tokens at random and then predict those masked tokens

      • Task #2 Next Sentence Prediction : train a model that understands sentence relationships, by a binarized next sentence prediction task.

    4. Finetune : For each sub-task, simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end

  • Conclusion : generalizing these findings (unsupervised pre-training can improve down-stream NLP tasks) to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.



2. GPT-1: Improving Language Understanding by Generative Pre-training (OpenAI, 2018)

  • Introduction

    • Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification.
    • Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce
    • We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task
    • 117M params trained with BookCorpus (7K book text)
  • Method

    • Unsupervised pretraining : multi-layer transformer decoder
      • use a standard language modeling objective to maximize the following likelihood:
        • $L_1(U) =log P(u_i|u_{i?k} ,…,u_{i?1}; ?)$
    • Supervised fine-tuning :
      • We assume a labeled dataset $C$, where each instance consists of a sequence of input tokens, $x_1,…,x_m$, along with a label $y$.
        • $L_2(C) = \Sigma_{(x,y)}log P(y|x^1,…,x^m).$
  • Conclusion

    • By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification

Transformer architecture and training objectives



3. GPT-2: Language Models are unsupervised multitask learners (OpenAI, 2019)

  • Introduction :

    • The authors explores the idea that large-scale language models can perform a variety of NLP tasks without explicit supervision.
    • The authors build on their previous model, GPT, scaling it up significantly in size and data. (1.5 billion)
    • The goal is to test whether a sufficiently large transformer trained on a broad dataset of internet text can generalize to new tasks using only task descriptions as input essentially leveraging zero-shot learning.
  • Method

    • based on a Transformer decoder architecture with 1.5 billion parameters.

    • It is trained on a diverse dataset called WebText, which consists of over 8 million documents totaling 40GB of text from the internet.

    • The training is done using a simple unsupervised objective: predicting the next word in a sequence.

      • GPT-2 is evaluated in a zero-shot setting across various tasks such as translation, question answering, and summarization using prompts to guide its behavior without any fine-tuning.
  • Conclusion

    • Previously, pretrained LLMs required supervised fine-tuning for specific downstream tasks (GPT-1).

    • GPT-2 can perform those tasks (translation, question answering, and summarization) via context alone w/o supervised finetuning.



4. GPT-3: Language models are few shot learners (OpenAI, 2020)

  • Introduction
    • The authors investigates whether scaling up language models leads to better performance on a wide range of NLP tasks with minimal supervision.
    • Building on the GPT-2 approach, GPT-3 is significantly larger 175 billion parameters and trained on an even broader dataset.
    • The central idea is that sufficiently large models can perform tasks via few-shot, one-shot, or even zero-shot learning, without gradient updates or task-specific fine-tuning.
  • Method
    • GPT-3 uses the same autoregressive Transformer architecture as GPT-2 but massively scales up the model size and training data.
    • It is trained with a next-word prediction objective on a diverse corpus totaling 570GB of filtered text.
    • Evaluation is conducted by prompting the model with examples of a task (few-shot), one example (one-shot), or just a task description (zero-shot), to assess its ability to generalize without fine-tuning.
  • Conclusion
    • GPT-3 shows strong performance across a broad array of tasks, including translation, question answering, and arithmetic, particularly in the few-shot setting.
    • The paper highlights the power of scale in language models but also notes limitations such as model bias, performance inconsistency, and high computational cost.