MLCV #6 | Image Retrieval

Introduction

  • Tasks:

    • Image Retrieval : aims to find images similar to a query image within an image dataset.
  • Tech Trend :

    1. Conventional Methods : rely on local descriptor matching (scale-invariant features → local image descriptors → re-ranking with spatial verification)

    2. using FC layers : an FC layer after several conv layers serves as the global descriptor [A. Babenko et al., A. Gordo et al.]

    3. using global pooling methods : global descriptors pooled from the activations of conv layers.

    4. boost the performance : by combining different global descriptors that are trained individually.



1.1 BoF, BoW (Bag of Features, Bag of Visual Words)

  • Introduction : BoW is a simplifying representation used in NLP and information retrieval.

  • Methods : BoF groups local descriptors (a minimal sketch of the full pipeline follows this list).

    1. Local Feature Extraction : Extract local features from the image (SIFT, SURF, small image patches)

    2. Clustering : Cluster the extracted features (e.g., k-means) and take the center feature (codeword) of each cluster

    3. Image representation : Represent each image as a histogram of codewords

    4. Learning and Recognition :

      • Generative ways : based on Bayesian models => classify by comparing an image's histogram against the histogram of each class

      • Discriminative ways : using a classifier such as SVM => feed the histogram into the classifier as a feature vector

      Figure 1. Overview of BoW
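
A minimal sketch of the BoF pipeline above, assuming OpenCV (with SIFT) and scikit-learn are available, that `images` is a list of grayscale NumPy arrays, and that the codebook size `K` is a hypothetical choice:

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

K = 100  # codebook size (hypothetical choice)

def extract_sift(images):
    """1. Local feature extraction: SIFT descriptors per image."""
    sift = cv2.SIFT_create()
    all_desc = []
    for img in images:
        _, desc = sift.detectAndCompute(img, None)
        all_desc.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return all_desc

def build_codebook(all_desc, k=K):
    """2. Clustering: k-means over all local descriptors -> codewords."""
    stacked = np.vstack(all_desc)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(stacked)

def bow_histogram(desc, kmeans):
    """3. Image representation: L1-normalized histogram of codeword assignments."""
    hist = np.zeros(kmeans.n_clusters, dtype=np.float32)
    if len(desc):
        words = kmeans.predict(desc)
        hist += np.bincount(words, minlength=kmeans.n_clusters)
    return hist / max(hist.sum(), 1.0)

# 4. Learning: the resulting histograms can be fed to any classifier,
#    e.g. sklearn.svm.SVC for the discriminative route.
```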


1.2 VLAD (Aggregating Local Descriptors) (2010)

  • Introduction : propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation.

  • Fisher Vector : transforms a variable-size set of independent samples into a fixed-size vector representation

    1. A Gaussian Mixture Model (GMM) is used to model the distribution of features (e.g., SIFT) extracted over the image.

    2. The Fisher Vector encodes the gradients of the log-likelihood of the features under the GMM, with respect to the GMM parameters.

  • VLAD (Vector of Locally Aggregated Descriptors) : a feature pooling method, which can be seen as a simplification of the Fisher kernel. VLAD encodes a set of local feature descriptors extracted from an image using a clustering method such as GMM or k-means.

    1. accumulate the differences $x - c_i$ over all descriptors $x$ assigned to each visual word $c_i$.

    2. subsequently $L_2$-normalize the resulting vector: $v := v / \|v\|_2$.

    3. This can be written using a hard assignment $a_k(x_i)$ of descriptor $x_i$ to its nearest cluster centre $c_k$ (a NumPy sketch follows the formula below):

      $$ v_{k,j} = \sum_{x:\, NN(x) = c_k} (x_j - c_{k,j}) = \sum_{i=1}^{N} a_k(x_i)\,(x_i(j) - c_k(j)) $$
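
A NumPy sketch of the VLAD encoding above, assuming `centers` is a (K, D) array of k-means visual words and `descriptors` is an (N, D) array of local descriptors (e.g., SIFT) from one image:

```python
import numpy as np

def vlad_encode(descriptors, centers):
    K, D = centers.shape
    # hard assignment a_k(x_i): index of the nearest centre for each descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                      # (N,)
    v = np.zeros((K, D), dtype=np.float64)
    for k in range(K):
        assigned = descriptors[nearest == k]
        if len(assigned):
            # accumulate residuals x - c_k for descriptors assigned to c_k
            v[k] = (assigned - centers[k]).sum(axis=0)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)              # final L2 normalization
```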



2.1 NetVLAD (2016)

  • Introduction : develop a CNN architecture that aggregates mid-level conv features into a compact single-vector representation using a generalized VLAD layer, NetVLAD.

  • Methods : (i) extract the top conv features using a pretrained CNN and (ii) pool these features with NetVLAD

  • NetVLAD : The source of discontinuity in VLAD is the hard assignment $a_k(x_i)$ of descriptor $x_i$ to its nearest cluster centre $c_k$ (if $c_k$ is the closest cluster, $a_k = 1$; otherwise $a_k = 0$). The authors replace it with a soft assignment, a softmax over the negative squared distances to the cluster centres (a PyTorch sketch follows the formulas below); the term $e^{-\alpha \| x_i \|^2}$ cancels between numerator and denominator:

    $$ \bar{a}_k(x_i) = \frac{e^{-\alpha \| x_i - c_k \|^2}}{\sum_{k'} e^{-\alpha \| x_i - c_{k'} \|^2}} = \frac{e^{2\alpha c_k^T x_i - \alpha \| c_k \|^2}}{\sum_{k'} e^{2\alpha c_{k'}^T x_i - \alpha \| c_{k'} \|^2}} = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}, \quad w_k = 2\alpha c_k, \; b_k = -\alpha \| c_k \|^2 $$

    $$ V(j,k) = \sum_{i=1}^{N} \bar{a}_k(x_i)\,(x_i(j) - c_k(j)) $$
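
A PyTorch sketch of a NetVLAD layer implementing the soft assignment and aggregation above; the cluster count, feature dimension, and random initialisation of the centres (the paper initialises them from k-means) are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1)  # realises w_k^T x + b_k
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                        # x: (B, D, H, W) conv feature map
        B, D, H, W = x.shape
        a = F.softmax(self.conv(x), dim=1)       # soft assignment, (B, K, H, W)
        x = x.view(B, D, -1)                     # (B, D, N), N = H*W
        a = a.view(B, self.centers.size(0), -1)  # (B, K, N)
        # V(j, k) = sum_i a_k(x_i) * (x_i(j) - c_k(j))
        V = torch.einsum('bkn,bdn->bkd', a, x) - a.sum(-1, keepdim=True) * self.centers
        V = F.normalize(V, dim=2)                # intra-normalization per cluster
        return F.normalize(V.flatten(1), dim=1)  # final L2-normalized (B, K*D) descriptor
```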



3.1 Global Descriptors (~2018)

  • SPoC : sum pooling over the feature map, which performs well mainly due to the subsequent descriptor whitening.

  • MAC, regional MAC (R-MAC) : MAC performs spatial max pooling over the whole feature map; R-MAC applies max pooling over multiple regions and then sums the regional MAC descriptors.

  • GeM: generalizes max and average pooling with a pooling parameter

  • weighted sum pooling, weighted GeM, multiscale RMAC, etc.

  • The performance of each global descriptor varies by dataset, as each descriptor has different properties. For example, SPoC activates larger regions of the image representation, while MAC activates more focused regions.



3.2 SPoC, Sum Pooling of Convolution (2015)

  • Introduction : investigate possible ways to aggregate local deep features to produce compact global descriptors for image retrieval.

  • Methods :

    1. Sum pooling : The construction of the SPoC descriptor starts with the sum pooling of the deep features.

      $$ \psi_1(I) = \sum_{y=1}^{H} \sum_{x=1}^{W} f(x,y)$$

    2. Centering prior : objects of interest tend to be located close to the geometric center of an image, so such a centering prior is incorporated using Gaussian coefficients $\alpha(x,y)$.

      $$ \psi_2(I) = \sum_{y=1}^{H} \sum_{x=1}^{W} \alpha(x,y)f(x,y)$$

    3. Post-processing : the obtained representation $\psi(I)$ is subsequently $L_2$-normalized, then PCA compression and whitening are performed (a sketch follows below).
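
A NumPy sketch of SPoC with the centering prior, assuming `fmap` is an (H, W, C) conv feature map; the Gaussian width `sigma` is a hypothetical choice, and the PCA/whitening step is only indicated in a comment:

```python
import numpy as np

def spoc(fmap, sigma=None):
    H, W, C = fmap.shape
    if sigma is None:
        sigma = H / 3.0  # hypothetical choice of the Gaussian width
    ys, xs = np.mgrid[0:H, 0:W]
    # centering prior: Gaussian weights alpha(x, y) peaked at the geometric center
    alpha = np.exp(-((ys - (H - 1) / 2) ** 2 + (xs - (W - 1) / 2) ** 2) / (2 * sigma ** 2))
    psi = (alpha[..., None] * fmap).sum(axis=(0, 1))   # weighted sum pooling -> (C,)
    psi = psi / (np.linalg.norm(psi) + 1e-12)          # L2 normalization
    # PCA compression + whitening would follow, fitted on a held-out image set
    return psi
```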



3.3 MAC, R-MAC, Maximum Activations of Convolutions (2015)

  • Introduction : revisit both retrieval stages, namely initial search and reranking

  • Method :

    1. Maximum Activation of Convolutions (MAC) : the feature vector constructed by spatial max pooling over each feature map $\chi_i$ of the last conv layer.

      $$ \mathbf{f}_{\Omega} = [f_{\Omega,1}, f_{\Omega,2}, \dots, f_{\Omega,K}]^T, \quad \text{with } f_{\Omega,i} = \max_{p \in \Omega} \chi_i(p) $$

    2. regional MAC : divide the conv feature map into multiple regions (along the $W \times H$ dimensions, not $C$), apply MAC to each region, post-process the regional descriptors ($L_2$ normalization and PCA-whitening), and sum them.

    3. Two images are compared with the cosine similarity of the K-dim vectors produced as described above (see the sketch below).
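
A NumPy sketch of MAC and a simplified regional MAC, assuming `fmap` is an (H, W, K) conv feature map; the uniform grid of regions and the omission of per-region PCA-whitening are simplifications, not the paper's exact multi-scale region sampling:

```python
import numpy as np

def mac(fmap):
    # spatial max pooling over the whole feature map -> K-dim vector
    f = fmap.max(axis=(0, 1))
    return f / (np.linalg.norm(f) + 1e-12)

def rmac(fmap, grid=2):
    H, W, K = fmap.shape
    hs, ws = H // grid, W // grid
    agg = np.zeros(K)
    for i in range(grid):
        for j in range(grid):
            region = fmap[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            r = region.max(axis=(0, 1))
            agg += r / (np.linalg.norm(r) + 1e-12)  # L2 per region, then sum
    return agg / (np.linalg.norm(agg) + 1e-12)

# Two images are then compared with the cosine similarity of their descriptors.
```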



3.4 GeM, Generalized Mean Pooling (2017)

  • Introduction : propose a novel trainable Generalized Mean Pooling layer that generalizes max and average pooling and show that it boosts retrieval performance

  • Method :

    1. ConvNet Backbone : given an input image, the output is a 3D tensor $\chi$ of $W \times H \times K$ dimensions

    2. GeM : add a pooling layer that takes $\chi$ as an input and produces a vector $\mathbf{f}$ as the output of the pooling process (a PyTorch sketch follows the formula).

      $$ \mathbf{f}_{\Omega} = [f_{\Omega,1}, f_{\Omega,2}, \dots, f_{\Omega,K}]^T, \quad f_{\Omega,k} = \left( \frac{1}{|\chi_k|} \sum_{x \in \chi_k} x^{p_k} \right)^{\frac{1}{p_k}} $$
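
A PyTorch sketch of a GeM pooling layer following the formula above, assuming the input is a (B, K, H, W) feature map with non-negative activations (e.g., after ReLU); a single shared parameter `p` is used here instead of per-channel $p_k$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # trainable; p -> inf ~ max pooling, p = 1 is average pooling
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)               # elementwise x^p
        return F.avg_pool2d(x, x.shape[-2:]).pow(1.0 / self.p).flatten(1)  # (B, K) descriptor
```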



4.1 Combination of Multiple Global Descriptors (2019)

  • Introduction : Ensembling different models and combining multiple global descriptors lead to performance improvement. However, these processes are not only difficult but also inefficient with respect to time and memory. Here, authors propose a novel framework that exploits multiple global descriptors to get an ensemble effect while it can be trained in an end-to-end manner.

  • Method : The proposed framework consists of a CNN backbone and two modules. The first, main module learns an image representation that is a combination of multiple global descriptors. The second, auxiliary module fine-tunes the CNN with a classification loss (a structural sketch follows the list below).

    1. Backbone Network : any CNN such as Inception, ShuffleNet, or ResNet can be used; the authors use ResNet-50 as the baseline backbone.

    2. Main Module - Multiple Global Descriptors : the main module has multiple branches, each producing an image representation with a different global descriptor (SPoC, MAC, GeM) on the last conv layer. These descriptors are concatenated after whitening (PCA or an FC layer) and $L_2$ normalization.

    3. Auxiliary Module : fine-tunes the CNN backbone based on the first global descriptor of the main module using a classification loss (following the common practice of training a CNN backbone with a classification loss and then fine-tuning the network with a triplet loss). Temperature scaling and label smoothing are added for further performance improvement.
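
A structural sketch of the "pool → FC → L2 → concat → L2" idea, assuming a recent torchvision ResNet-50 backbone and two hypothetical branches (average/sum pooling and max pooling); the exact branch dimensions, ranking loss, temperature scaling, and label smoothing of the paper are not reproduced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class CombinedGlobalDescriptor(nn.Module):
    def __init__(self, out_dim=512, num_classes=1000):
        super().__init__()
        backbone = resnet50(weights=None)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # conv feature map (B, 2048, H, W)
        self.fc_a = nn.Linear(2048, out_dim)                # whitening-like FC per branch
        self.fc_b = nn.Linear(2048, out_dim)
        self.classifier = nn.Linear(out_dim, num_classes)   # auxiliary classification head

    def forward(self, x):
        fmap = self.body(x)                                 # (B, 2048, H, W)
        feat_a = self.fc_a(fmap.mean(dim=(2, 3)))           # average/sum pooling branch
        feat_b = self.fc_b(fmap.amax(dim=(2, 3)))           # max pooling branch
        d1 = F.normalize(feat_a, dim=1)
        d2 = F.normalize(feat_b, dim=1)
        combined = F.normalize(torch.cat([d1, d2], dim=1), dim=1)  # ranking loss would use this
        logits = self.classifier(feat_a)                    # classification loss on the first branch
        return combined, logits
```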