VL #01 | Vision Language Pretraining

2022-08-01 3. Natural Language Comments

Introduction

Tasks
- Vision Language Pretraining (VLP) : aims to improve performance of downstream vision and language tasks by pretraining the model on large-scale image-text pairs

CLIP : Learning Transferable Visual Models From Natural Language Supervision (2021)

Introduction :
- Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years
- Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?
Methods :
- Dataset :
  - Natural Language Supervision : learn visual representations from text paired with images => easier to scale since it does not require annotations (2.1)
  - Creating a Sufficiently Large Dataset : A major motivation for natural language supervision is the large quantity of data publicly available on the internet, so the authors constructed a new dataset of 400 million (image, text) pairs. (2.2)
- Training : Selecting an Efficient Pre-training Method (2.3)
  - Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible pairings actually occurred
  - Learns a multi-modal embedding space by jointly training an image encoder and a text encoder :
    - maximize the cosine similarity of the image and text embeddings of the $N$ real pairs and minimize the cosine similarity of the $N^2 - N$ incorrect pairings
  - Training hyperparameters are detailed in the paper; the largest ViT model took 12 days on 256 V100 GPUs (2.5)
- Model : ResNet or ViT as image encoder and Transformer as text encoder (2.4)
Experiments :
- Zero-Shot Transfer : For each dataset, the names of all the classes in the dataset are used as the set of potential text pairings, and the most probable (image, text) pair according to CLIP is predicted
  - Using the prompt template "a photo of a {LABEL}" often improves performance
- Representation Learning : Fitting a linear classifier on representations extracted from the model and measuring its performance on various datasets
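
The zero-shot procedure above can be sketched roughly as follows. This is a minimal illustration, not CLIP's actual code: the embeddings are hand-written toy vectors standing in for encoder outputs, and the helper names `build_prompts` and `zero_shot_predict` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into text prompts using a template."""
    return [template.format(name) for name in class_names]

def zero_shot_predict(image_emb, class_text_emb):
    """Return, for each image, the index of the most similar class prompt.

    image_emb:      [n_images, d]  embeddings from an image encoder
    class_text_emb: [n_classes, d] embeddings of the class prompts
    """
    # cosine similarity between every image and every class prompt
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

# Toy usage: in practice the prompts would go through the text encoder;
# here the embeddings are made up (assumption) to keep the sketch runnable.
class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)
class_text_emb = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 2.0]])
pred = zero_shot_predict(image_emb, class_text_emb)
print([class_names[i] for i in pred])  # ['cat', 'car']
```

No classifier is trained here: changing the label set only requires embedding a new list of prompts.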

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t)/2
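
The pseudocode above leaves `l2_normalize`, `cross_entropy_loss`, the encoders and the projection matrices undefined. A self-contained NumPy sketch of the same symmetric loss is given below, assuming random features in place of real encoder outputs; the sizes, the random seed and the temperature value are illustrative assumptions, and the helper implementations are one plausible reading of the pseudocode rather than the released code.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(z, axis):
    """Numerically stable log-softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def cross_entropy_loss(logits, labels, axis):
    """Mean cross-entropy, with the softmax taken along `axis`."""
    log_probs = log_softmax(logits, axis=axis)
    idx = np.arange(logits.shape[0])
    # labels index the axis that was normalised over
    picked = log_probs[idx, labels] if axis == 1 else log_probs[labels, idx]
    return -picked.mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n matched pairs."""
    I_e = l2_normalize(I_f @ W_i)          # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)          # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)     # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])    # pair i matches pair i (diagonal)
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    return (loss_i + loss_t) / 2

# Toy usage with random stand-ins for encoder outputs (assumption).
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 8, 6, 5
I_f, T_f = rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t))
W_i, W_t = rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)                       # log-temperature, illustrative value
loss = clip_loss(I_f, T_f, W_i, W_t, t)

# Sanity check: perfectly aligned pairs give a lower loss than shuffled ones.
eye = np.eye(n)
aligned = clip_loss(eye, eye, eye, eye, np.log(100.0))
shuffled = clip_loss(eye, np.roll(eye, 1, axis=0), eye, eye, np.log(100.0))
print(loss, aligned, shuffled)
```

Because the two cross-entropy terms are averaged, the result does not depend on which axis is labelled the image direction and which the text direction.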

ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (2021, Google Research)

Introduction :
- While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge.
Methods : leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used to build the Conceptual Captions dataset
- A Large Scale Noisy Image-Text Dataset : TODO
- Pretraining and Task Transfer : TODO
Result : achieves strong performance when transferred to classification tasks
- TODO
Differences between ALIGN and CLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022)

Introduction : Vision-language pre-training has recently achieved tremendous success on various multimodal downstream tasks. However, existing methods have two major limitations :
- (1) Model perspective : most methods adopt either an encoder-based model or an encoder-decoder model.
  - encoder-based models are less straightforward to transfer directly to text generation tasks (e.g. image captioning)
  - encoder-decoder models have not been successfully adopted for understanding tasks (e.g. image-text retrieval)
- (2) Data perspective : most SOTA methods pre-train on image-text pairs collected from the web (noisy web text is suboptimal for VL learning).
Methods : BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks
- Multimodal mixture of Encoder-Decoder (MED) : a multi-task model which can operate in one of three functionalities :
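  - Unimodal encoder : separately encodes image and text, trained with an image-text contrastive (ITC) loss.
  - Image-grounded text encoder : injects visual information through cross-attention, trained with an image-text matching (ITM) loss; used as the filter in CapFilt.
  - Image-grounded text decoder : generates text given an image, trained with a language modeling (LM) loss; used as the captioner in CapFilt.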
- Captioning and Filtering (CapFilt) : both the captioner and the filter are initialized from the same pre-trained MED model and finetuned individually on the COCO dataset (high-quality human-annotated image-text pairs). The filtered image-text pairs are then combined with the human-annotated pairs to form a new dataset.
  - captioner : an image-grounded text decoder that generates synthetic captions for web images.
  - filter : an image-grounded text encoder that decides whether a text matches an image, removing noisy image-text pairs.
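
The CapFilt data flow described above can be sketched as a small loop. This is only an illustration of the bootstrapping logic: `capfilt`, `toy_captioner` and `toy_filter` are hypothetical names, and the two toy functions are trivial string rules standing in for the finetuned captioner and filter models.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, text)

def capfilt(web_pairs: Iterable[Pair],
            human_pairs: Iterable[Pair],
            captioner: Callable[[str], str],
            filter_fn: Callable[[str, str], bool]) -> List[Pair]:
    """Build a bootstrapped dataset from noisy web pairs.

    For every web image, the original web text and a synthetic caption
    are both passed through the filter; texts judged to match the image
    are kept, and the human-annotated pairs are appended unchanged.
    """
    kept: List[Pair] = []
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)
        for text in (web_text, synthetic_text):
            if filter_fn(image, text):
                kept.append((image, text))
    return kept + list(human_pairs)

# Toy stand-ins (assumptions): real models would look at pixels, not file names.
def toy_captioner(image: str) -> str:
    return f"a photo of a {image.split('.')[0]}"

def toy_filter(image: str, text: str) -> bool:
    return image.split(".")[0] in text

web_pairs = [("dog.jpg", "buy cheap tickets now"),
             ("cat.jpg", "my cat on the sofa")]
human_pairs = [("bird.jpg", "a bird on a branch")]
dataset = capfilt(web_pairs, human_pairs, toy_captioner, toy_filter)
for pair in dataset:
    print(pair)
```

In this toy run the unrelated web text for `dog.jpg` is dropped while its synthetic caption is kept, so the image still contributes a training pair.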
Results : state-of-the-art results on a range of vision-language tasks
- image-text retrieval (+2.7% in average recall@1)
- image captioning (+2.8% in CIDEr)
- VQA (+1.6% in VQA score)
Code : https://github.com/salesforce/BLIP

BLIP-2 : Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)

Introduction :
- Most VLP methods perform end-to-end pretraining using large-scale image-text pair datasets.
- The cost of VLP has become increasingly prohibitive due to end-to-end training of large-scale models.
- To address this, BLIP-2 is proposed as a generic and compute-efficient VLP method that bootstraps from off-the-shelf pre-trained vision models and language models.
Methods : bridges the modality gap with a lightweight Querying Transformer (Q-Former), pre-trained with a new two-stage strategy
- Architecture : Q-Former consists of two transformer submodules that share the same self-attention layers (initialized from $BERT_{base}$, 188M parameters in total)
  - an image transformer : interacts with the frozen image encoder for visual feature extraction
  - a text transformer : can function as both a text encoder and a text decoder
- Stage1. Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder :
  - connect Q-Former to a frozen image encoder and perform pre-training using image-text pairs
  - jointly optimize three pre-training objectives : image-text contrastive learning (ITC), image-grounded text generation (ITG), and image-text matching (ITM) (see the BLIP paper for details)
- Stage2. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM :
  - connect the output of the Q-Former to a frozen LLM and train the Q-Former so that its output visual representation can be interpreted by the LLM
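
The bridging idea above can be sketched in NumPy as follows. This is a heavily simplified illustration under stated assumptions: a single cross-attention step stands in for the multi-layer Q-Former, a small set of query vectors is treated as the only trainable input, and all sizes and weight matrices are made-up random values. The function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, image_feats, W_q, W_k, W_v):
    """Single-head cross-attention: queries read from frozen image features."""
    q = queries @ W_q                      # [n_q, d]
    k = image_feats @ W_k                  # [n_patches, d]
    v = image_feats @ W_v                  # [n_patches, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [n_q, n_patches]
    return attn @ v                        # [n_q, d]

def build_llm_input(query_out, W_proj, text_emb):
    """Project query outputs to the LLM width and prepend them to the text."""
    visual_prefix = query_out @ W_proj     # [n_q, d_llm]
    return np.concatenate([visual_prefix, text_emb], axis=0)

# Illustrative sizes (assumptions, far smaller than any real model).
rng = np.random.default_rng(0)
n_q, n_patches, d_img, d_q, d_llm, n_tok = 4, 10, 16, 8, 12, 5
queries = rng.normal(size=(n_q, d_q))              # trainable query vectors
image_feats = rng.normal(size=(n_patches, d_img))  # frozen image encoder output
W_q = rng.normal(size=(d_q, d_q))
W_k = rng.normal(size=(d_img, d_q))
W_v = rng.normal(size=(d_img, d_q))
W_proj = rng.normal(size=(d_q, d_llm))             # trainable projection
text_emb = rng.normal(size=(n_tok, d_llm))         # frozen LLM token embeddings

query_out = cross_attend(queries, image_feats, W_q, W_k, W_v)
llm_input = build_llm_input(query_out, W_proj, text_emb)
print(query_out.shape, llm_input.shape)  # (4, 8) (9, 12)
```

The number of visual tokens passed to the LLM equals the number of queries, independent of how many patch features the image encoder produces, which is one reason the bridge stays lightweight.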
Results :
- achieves SOTA performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods.
- can be prompted to perform zero-shot image-to-text generation that follows natural language instructions