Introduction
- Tasks
  - Vision-Language Pretraining (VLP) : aims to improve the performance of downstream vision-and-language tasks by pretraining a model on large-scale image-text pairs
CLIP : Learning Transferable Visual Models From Natural Language Supervision (2021)
- Introduction :
  - Pre-training methods that learn directly from raw text have revolutionized NLP over the last few years.
  - Could scalable pre-training methods that learn directly from web text result in a similar breakthrough in computer vision?
- Methods :
  - Dataset :
    - Natural Language Supervision : learn visual representations from the text paired with images. This is easier to scale because it does not require manually annotated labels (2.1)
    - Creating a Sufficiently Large Dataset : a major motivation for natural language supervision is the large quantity of such data publicly available on the internet. The authors constructed a new dataset of 400 million (image, text) pairs (2.2)
  - Training : Selecting an Efficient Pre-training Method (2.3)
    - Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible pairings actually occurred.
    - It learns a multi-modal embedding space by jointly training an image encoder and a text encoder :
      - maximize the cosine similarity of the image and text embeddings of the $N$ real pairs and minimize the cosine similarity of the $N^2 - N$ incorrect pairings
    - Training hyperparameters are detailed in the paper; the largest ViT model trained for 12 days on 256 V100 GPUs (2.5)
  - Model : ResNet or ViT as the image encoder and a Transformer as the text encoder (2.4)
- Experiments :
  - Zero-Shot Transfer : for each dataset, the names of all its classes are used as the set of candidate text pairings, and CLIP predicts the most probable (image, text) pair.
    - Using the prompt template "a photo of a {LABEL}" often improves performance.
  - Representation Learning : fit a linear classifier on representations extracted from the model and measure its performance on various datasets.
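The zero-shot procedure described in the Experiments notes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are hand-picked toy vectors standing in for real encoder outputs, and `build_prompts`, `zero_shot_classify` and the `temperature` value are hypothetical names and settings chosen for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def build_prompts(class_names, template="a photo of a {}"):
    """Turn each class name into a text prompt (hypothetical helper)."""
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, text_emb, temperature=100.0):
    """Return class probabilities and predicted class index per image.

    image_emb: [n_images, d], text_emb: [n_classes, d].
    temperature is an assumed logit scale, not a value from these notes.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = temperature * image_emb @ text_emb.T          # [n_images, n_classes]
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy stand-ins for encoder outputs (a real model would produce these).
class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)          # would be fed to the text encoder
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]])

probs, pred = zero_shot_classify(image_emb, text_emb)
predicted_labels = [class_names[i] for i in pred]
```

No classifier is trained here: the "classifier weights" are just the text embeddings of the prompts, so new classes can be added by writing new prompts.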
Numpy-like pseudocode for the core of CLIP training (from the paper) :
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
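A runnable NumPy version of the pseudocode above may make the shapes and the symmetric loss easier to check. This is a sketch under assumptions: the encoders are replaced by random feature matrices, `clip_loss` and `log_softmax` are helper names introduced here, and the temperature is set to the equivalent of 0.07 as an assumed starting value.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n matching (image, text) pairs."""
    n = I_f.shape[0]
    I_e = l2_normalize(I_f @ W_i)             # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)             # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)        # [n, n], entry (i, j) = image i vs text j
    idx = np.arange(n)                        # the correct pair sits on the diagonal
    # image -> text: softmax over texts (along each row)
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # text -> image: softmax over images (along each column)
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 8, 6, 5                 # toy sizes, chosen for illustration
t = np.log(1 / 0.07)                          # assumed temperature value

# Random features stand in for image_encoder(I) and text_encoder(T).
loss_random = clip_loss(rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t)),
                        rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e)), t)

# Perfectly aligned, mutually orthogonal embeddings give a loss close to zero.
E = np.eye(n)
loss_aligned = clip_loss(E, E, np.eye(n), np.eye(n), t)
```

Because the two terms are averaged, it does not matter which axis is called the "image" loss and which the "text" loss; the total is the same either way.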
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (2021, Google Research)
- Introduction :
  - While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge.
- Methods : leverage a noisy dataset of over one billion image alt-text pairs, obtained by skipping the expensive filtering and post-processing steps used to build the Conceptual Captions dataset
  - A Large Scale Noisy Image-Text Dataset : TODO
  - Pretraining and Task Transfer : TODO
- Result : achieves strong performance when transferred to classification tasks
  - TODO
- Differences between ALIGN and CLIP
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022)
- Introduction : Vision-language pre-training has recently achieved great success on various multimodal downstream tasks. However, existing methods have two major limitations :
  - (1) Model perspective : most methods adopt either an encoder-based model or an encoder-decoder model.
    - Encoder-based models are less straightforward to transfer directly to text generation tasks (e.g. image captioning).
    - Encoder-decoder models have not been successfully adopted for understanding tasks (e.g. image-text retrieval).
  - (2) Data perspective : most SOTA methods pre-train on image-text pairs collected from the web, but noisy web text is suboptimal for vision-language learning.
- Methods : BLIP, a new VLP framework that transfers flexibly to both vision-language understanding and generation tasks
  - Multimodal mixture of Encoder-Decoder (MED) : a multi-task model that can operate in one of three functionalities :
    - Unimodal encoder : separately encodes image and text.
    - Image-grounded text encoder : used as the filter in CapFilt.
    - Image-grounded text decoder : used as the captioner in CapFilt.
  - Captioning and Filtering (CapFilt) : both the captioner and the filter are initialized from the same pre-trained MED model and finetuned individually on the COCO dataset (high-quality human-annotated image-text pairs). The filtered image-text pairs are then combined with the human-annotated pairs to form a new dataset.
    - Captioner : an image-grounded text decoder that generates synthetic captions given web images.
    - Filter : an image-grounded text encoder that decides whether a text matches an image, removing noisy pairs from both the original web texts and the synthetic captions.
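The CapFilt data flow can be sketched in a few lines. This is only an illustration of the bootstrapping loop: `capfilt`, `toy_captioner` and `toy_filter` are hypothetical stand-ins (the real captioner and filter are finetuned MED models), and an "image" is represented here by a set of object tags so the example runs without any model.

```python
def capfilt(web_pairs, human_pairs, captioner, text_matches_image):
    """Build a bootstrapped dataset from noisy web pairs.

    web_pairs, human_pairs: lists of (image, text).
    captioner(image) -> str generates a synthetic caption.
    text_matches_image(image, text) -> bool plays the role of the filter.
    """
    bootstrapped = []
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)
        # Assumption: the filter is applied to both the web text and the synthetic caption.
        for text in (web_text, synthetic_text):
            if text_matches_image(image, text):
                bootstrapped.append((image, text))
    # Human-annotated pairs are trusted and added without filtering.
    return bootstrapped + list(human_pairs)

# Toy stand-ins: an "image" is a set of tags, not pixels.
def toy_captioner(image):
    return "a photo of " + " and ".join(sorted(image))

def toy_filter(image, text):
    return any(tag in text for tag in image)

web_pairs = [
    (frozenset({"dog"}), "buy cheap shoes online"),        # noisy web text
    (frozenset({"cat", "sofa"}), "my cat on the sofa"),    # useful web text
]
human_pairs = [(frozenset({"bird"}), "a bird on a branch")]

dataset = capfilt(web_pairs, human_pairs, toy_captioner, toy_filter)
texts = [text for _, text in dataset]
```

In this toy run the unrelated web text is dropped while its image is kept with a synthetic caption, which is the intended effect of combining the captioner with the filter.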
- Results : state-of-the-art results on several vision-language tasks
  - image-text retrieval (+2.7% in average recall@1)
  - image captioning (+2.8% in CIDEr)
  - VQA (+1.6% in VQA score)
- Code : https://github.com/salesforce/BLIP
BLIP-2 : Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
- Introduction :
  - Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets.
  - The cost of VLP has become increasingly prohibitive due to end-to-end training of large-scale models.
  - BLIP-2 is therefore proposed as a generic and compute-efficient VLP method that bootstraps from off-the-shelf pre-trained vision models and language models.
- Methods : bridges the modality gap with a lightweight Querying Transformer (Q-Former), pre-trained with a new two-stage strategy
  - Architecture : Q-Former consists of two transformer submodules that share the same self-attention layers (initialized from $BERT_{base}$, 188M parameters in total)
    - an image transformer : interacts with the frozen image encoder for visual feature extraction
    - a text transformer : can function as both a text encoder and a text decoder
  - Stage 1. Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder :
    - connect the Q-Former to a frozen image encoder and pre-train using image-text pairs
    - jointly optimize three pre-training objectives inspired by BLIP (see the BLIP paper for details) : Image-Text Contrastive learning (ITC), Image-grounded Text Generation (ITG) and Image-Text Matching (ITM)
  - Stage 2. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM :
    - connect the output of the Q-Former to a frozen LLM and train the Q-Former so that its output visual representation can be interpreted by the LLM
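The role of the Q-Former as a bridge between a frozen image encoder and a frozen LLM can be illustrated with a heavily simplified sketch. It assumes a single cross-attention step with one head, omits the self-attention and feed-forward layers of the real model, and uses random matrices in place of trained weights. All sizes and names (`query_cross_attention`, `W_proj`, `n_q`, `d_llm`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, image_feats, W_q, W_k, W_v):
    """Learned queries attend to frozen image features (single head, one layer).

    queries: [n_q, d_q], image_feats: [n_patches, d_img]; returns [n_q, d_q].
    """
    Q = queries @ W_q                                          # [n_q, d_q]
    K = image_feats @ W_k                                      # [n_patches, d_q]
    V = image_feats @ W_v                                      # [n_patches, d_q]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)    # [n_q, n_patches]
    return queries + attn @ V                                  # residual connection

rng = np.random.default_rng(0)
n_q, d_q = 4, 16            # number and width of learned queries (toy sizes)
n_patches, d_img = 10, 24   # output of the frozen image encoder (toy sizes)
n_text, d_llm = 3, 32       # prompt length and LLM embedding width (toy sizes)

image_feats = rng.normal(size=(n_patches, d_img))   # frozen, never updated
queries = rng.normal(size=(n_q, d_q)) * 0.02        # trainable in the real model
W_q = rng.normal(size=(d_q, d_q)) * 0.1
W_k = rng.normal(size=(d_img, d_q)) * 0.1
W_v = rng.normal(size=(d_img, d_q)) * 0.1
W_proj = rng.normal(size=(d_q, d_llm)) * 0.1        # linear projection into LLM space

# Stage 1 idea: queries extract a fixed number of visual tokens from the image.
visual_tokens = query_cross_attention(queries, image_feats, W_q, W_k, W_v)

# Stage 2 idea: project the tokens and prepend them to the text embeddings
# that the frozen LLM consumes.
soft_prompt = visual_tokens @ W_proj                 # [n_q, d_llm]
text_embeds = rng.normal(size=(n_text, d_llm))
llm_input = np.concatenate([soft_prompt, text_embeds], axis=0)
```

The number of visual tokens passed to the LLM is fixed by the number of queries, not by the number of image patches, which is one reason the bridge stays lightweight.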
- Results :
  - achieves SOTA performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods
  - can be prompted to perform zero-shot image-to-text generation that follows natural language instructions