VL #01 | Vision Language Pretraining


Introduction

  • Tasks
    • Vision Language Pretraining (VLP) : aims to improve performance of downstream vision and language tasks by pretraining the model on large-scale image-text pairs


CLIP : Learning Transferable Visual Models From Natural Language Supervision (2021)

  • introduction :

    • Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years

    • Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?

  • Methods :

    • Dataset :

      • Natural Language Supervision : learn visual representations from text paired with images => easier to scale since it does not require annotations (2.1)

      • Creating a Sufficiently Large Dataset : a major motivation for natural language supervision is the large quantity of image-text data publicly available on the internet. The authors constructed a new dataset of 400 million (image, text) pairs. (2.2)

    • Training : Selecting an Efficient Pre-training Method (2.3)

      • Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible pairings actually occurred

      • Learns multi-modal embedding space by jointly training an image encoder and text encoder :

        • maximize cos-sim of the image and text embeddings of the N real pairs and minimize cos-sim of incorrect pairings
      • full training hyperparameters are given in the paper; the largest ViT model took 12 days to train on 256 V100 GPUs (2.5)

    • Model : ResNet or ViT as image encoder and Transformer as text encoder (2.4)

  • Experiments:

    • Zero-Shot Transfer : For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable pair according to CLIP
      • Using the prompt template "A photo of a {label}." instead of the bare class name often improves performance
    • Representation Learning : fit a linear classifier on representations extracted from the frozen model and measure its performance on various datasets
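The zero-shot procedure above can be sketched as follows. This is a minimal illustration, not CLIP's actual code: `build_prompts` and `zero_shot_predict` are hypothetical helper names, and the embeddings are random stand-ins for the outputs of the image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def build_prompts(class_names, template="a photo of a {}."):
    """Turn every class name into a caption-like prompt."""
    return [template.format(name) for name in class_names]

def zero_shot_predict(image_emb, text_emb):
    """image_emb: [n, d], text_emb: [k, d] -> index of the best-matching prompt, [n]."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # [n, k]
    return sims.argmax(axis=1)

class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)

# Stand-in embeddings (assumption): a real setup would encode `prompts` with the
# text encoder and the images with the image encoder.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))
image_emb = text_emb[[2, 0]] + 0.01 * rng.normal(size=(2, 8))  # near "car", "cat"

pred = zero_shot_predict(image_emb, text_emb)
print([class_names[i] for i in pred])
```

The text embeddings of the prompts act as the weights of a classifier that is built without any labeled training images, which is why changing the prompt template can change accuracy.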
# CLIP training step, NumPy-like pseudocode from the paper
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t)/2
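A runnable NumPy version of the pseudocode above may help when checking shapes. It is a sketch under assumptions: the encoders are replaced by random feature matrices, `clip_loss` and `log_softmax` are names chosen here, and the temperature parameter `t` is set so that `exp(t) = 1 / 0.07`.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n matched (image, text) features."""
    I_e = l2_normalize(I_f @ W_i)        # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)        # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)   # [n, n], entry (i, j) = image i vs text j
    idx = np.arange(logits.shape[0])     # the true pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return (loss_i2t + loss_t2i) / 2

# Random stand-ins for encoder outputs and projection matrices (assumption).
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 16, 12, 8
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)

loss = clip_loss(I_f, T_f, W_i, W_t, t)
print(float(loss))
```

Because the two directions are averaged, the result does not depend on which of the two cross-entropy terms is called the image loss and which the text loss.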


ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (2021, Google Research)

  • Introduction :

    • While representation learning in NLP has transitioned to training on raw text without human annotations,

    • visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge.

  • Methods : leverage a noisy dataset of over one billion image alt-text pairs, obtained by skipping the expensive filtering and post-processing steps that were used to build the Conceptual Captions dataset

    • A Large Scale Noisy Image-Text Dataset : TODO

    • Pretraining and Task Transfer : TODO

  • Result : achieves strong performance when transferred to classification tasks

    • TODO
  • Differences between ALIGN and CLIP



BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022)

  • Introduction : Vision-language pre-training has recently achieved tremendous success on various multimodal downstream tasks. However, existing methods have two major limitations :

    • (1) Model perspective : most methods adopt either an encoder-based model or an encoder-decoder model.

      • encoder-based models are less straightforward to transfer directly to text generation tasks (e.g. image captioning)

      • encoder-decoder models have not been successfully adopted for understanding tasks (e.g. image-text retrieval)

    • (2) Data perspective : most SOTA methods pre-train on image-text pairs collected from the web (noisy web text is suboptimal for VL learning).

  • Methods : BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks

    • Multimodal mixture of Encoder-Decoder (MED) : a multi-task model which can operate in one of the three functionalities:

      • Unimodal encoder : separately encodes image and text.

      • Image-grounded text encoder : filter

      • Image-grounded text decoder : captioner

    • Captioning and Filtering (CapFilt) : both the captioner and the filter are initialized from the same pre-trained MED model and finetuned individually on the COCO dataset (high-quality human-annotated image-text pairs) => the filtered image-text pairs are combined with the human-annotated pairs to form a new dataset

      • captioner : an image-grounded text decoder, generates synthetic captions given web images.

      • filter : an image-grounded text encoder which decides whether a text matches an image (removes noisy image-text pairs)
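The CapFilt data flow described above can be sketched in a few lines. This is only an illustration of the bootstrapping loop: `capfilt`, `captioner` and `is_match` are hypothetical names, and the toy callables stand in for the finetuned image-grounded text decoder and encoder. It assumes the filter is applied to both the original web text and the synthetic caption.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, text)

def capfilt(
    web_pairs: Iterable[Pair],
    human_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Build a bootstrapped dataset from noisy web pairs and clean human pairs."""
    bootstrapped: List[Pair] = []
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)  # captioner proposes a new caption
        for text in (web_text, synthetic_text):
            if is_match(image, text):      # filter drops texts that do not match
                bootstrapped.append((image, text))
    return bootstrapped + list(human_pairs)

# Toy stand-ins (assumption): images are plain strings, and a text "matches"
# when it mentions the image name.
toy_captioner = lambda image: f"a picture of {image}"
toy_filter = lambda image, text: image in text

web_pairs = [("dog", "buy cheap shoes"), ("cat", "my cat on the sofa")]
human_pairs = [("bird", "a bird on a branch")]

dataset = capfilt(web_pairs, human_pairs, toy_captioner, toy_filter)
for pair in dataset:
    print(pair)
```

In the toy run the unrelated web text for "dog" is discarded while its synthetic caption is kept, which is the effect CapFilt aims for on noisy web data.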

  • Results :



BLIP-2 : Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)

  • Introduction :

    • Most VLP methods perform end-to-end pretraining using large-scale image-text pair datasets.

    • The cost of VLP has become increasingly prohibitive due to end-to-end training of large-scale models.

    • So, we propose a generic and compute-efficient VLP method by bootstrapping from off-the-shelf pre-trained vision models and language models.

  • Methods : bridges the modality gap with a lightweight Querying Transformer, pre-trained with a new two-stage pre-training strategy

    • Architecture : the Q-Former consists of two transformer submodules that share the same self-attention layers (initialized from $BERT_{base}$, 188M parameters in total)

      • an image transformer : interacts with the frozen image encoder for visual feature extraction

      • a text transformer : can function as both text encoder and a text decoder

    • Stage1. Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder :

      • connect Q-Former to a frozen image encoder and perform pre-training using image-text pairs

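      • jointly optimize three pre-training objectives : image-text contrastive learning (ITC), image-grounded text generation (ITG) and image-text matching (ITM), which follow the objectives of BLIP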

    • Stage2. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM :

      • connect the output of the Q-Former to a frozen LLM and train the Q-Former so that its output visual representation can be interpreted by the LLM
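The way the Q-Former bridges the two frozen models can be sketched as one cross-attention step followed by a linear projection. This is a heavily simplified illustration, not the BLIP-2 implementation: the real Q-Former stacks many transformer layers, while here a single single-head attention is used. The shapes (32 queries of size 768, 257 x 1024 image features) follow the ViT-L/14 example in the BLIP-2 paper; `d_llm`, all weights and all inputs are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, image_feats, W_q, W_k, W_v):
    """Learned queries attend over frozen image features (single head)."""
    q = queries @ W_q        # [n_q, d]
    k = image_feats @ W_k    # [n_p, d]
    v = image_feats @ W_v    # [n_p, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [n_q, n_p]
    return attn @ v          # [n_q, d]

rng = np.random.default_rng(0)
n_q, d_q = 32, 768       # number and size of learned queries
n_p, d_img = 257, 1024   # patch tokens from the frozen image encoder
d_llm = 2048             # hypothetical LLM embedding size

queries = 0.02 * rng.normal(size=(n_q, d_q))    # trainable
image_feats = rng.normal(size=(n_p, d_img))     # frozen encoder output (stand-in)
W_q = 0.02 * rng.normal(size=(d_q, d_q))
W_k = 0.02 * rng.normal(size=(d_img, d_q))
W_v = 0.02 * rng.normal(size=(d_img, d_q))
W_proj = 0.02 * rng.normal(size=(d_q, d_llm))   # trainable projection to the LLM

z = cross_attention(queries, image_feats, W_q, W_k, W_v)  # stage 1 output, [32, 768]
soft_prompt = z @ W_proj                                  # stage 2 input, [32, d_llm]

text_emb = rng.normal(size=(5, d_llm))                    # embedded text tokens (stand-in)
llm_input = np.concatenate([soft_prompt, text_emb], axis=0)
print(z.shape, llm_input.shape)
```

The fixed number of queries is the bottleneck: however many patch tokens the image encoder emits, the LLM only sees `n_q` visual tokens prepended to the text.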
  • Results :

    • achieves SOTA performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods.

    • can be prompted to perform zero-shot image-to-text generation that follows natural language instructions