MLCV #1 | Image Classification


Introduction

  • Tasks:

    • Image Classification : The task of classifying an image according to its visual content.

    • Image Representation : focuses on how to encode visual content into vectors (embedding, encoding)



1. AlexNet (2012)

  • Introduction : CNNs had been prohibitively expensive to apply at large scale to high-resolution images.
  • Method : Training on Multiple GPUs
def AlexNet(x):
  out = MP(relu(conv11x11(x)))       # conv1: large 11x11 kernel, then max-pool
  out = MP(relu(conv5x5(out)))       # conv2
  out = relu(conv3x3(out))           # conv3
  out = relu(conv3x3(out))           # conv4
  out = MP(relu(conv3x3(out)))       # conv5
  out = flatten(out)                 # flatten to a 1D vector for the classifier
  out = FC(relu(FC(relu(FC(out)))))  # fc6-fc8
  return out


2. VGG Net (2014)

  • Introduction : build a significantly more accurate ConvNet than prior designs
  • Method : a deeper ConvNet built from stacks of small 3x3 convolutions
def VGG16(x):
  # two blocks of 2 conv layers, then three blocks of 3 conv layers,
  # each block followed by max-pooling
  out = MP(relu(conv3x3(relu(conv3x3(x)))))
  out = MP(relu(conv3x3(relu(conv3x3(out)))))
  for i in range(3):
    out = MP(relu(conv3x3(relu(conv3x3(relu(conv3x3(out)))))))
  out = flatten(out)
  out = softmax(FC(relu(FC(relu(FC(out))))))
  return out


3. GoogLeNet (2015)

  • Introduction : efficient, deeper networks (with far fewer params than AlexNet)
  • Method : Inception module (Network-in-Network-style 1x1 bottlenecks)
def inception_block(x):
  # parallel branches at several receptive fields; the 1x1 convs act as
  # channel-reducing bottlenecks before the expensive 3x3 and 5x5 convs
  branch_1x1 = conv1x1(x)
  branch_3x3 = conv3x3(conv1x1(x))
  branch_5x5 = conv5x5(conv1x1(x))
  branch_pool = conv1x1(MP3x3(x, padding='same'))
  out = concat([branch_1x1, branch_3x3, branch_5x5, branch_pool])  # along channels
  return out


4. ResNet (Microsoft, 2015)

  • Introduction : solve the degradation problem that appears as networks get deeper.
  • Method : residual block with a shortcut (skip) connection, defined as:

$$ \mathbf{x}_{l+1} = \mathbf{x}_l + F(\mathbf{x}_l, \{W_i\}) $$

def residual_block(x):
  out = relu(bn1(conv3x3(x)))
  out = relu(bn2(conv3x3(out)) + x)  # identity shortcut: add the input back
  return out



5. DenseNet (2016)

  • Introduction : information about the input or gradients can vanish and wash out as CNNs become deeper
  • Method : dense connectivity (concatenation, not summation)
  • Result : 77.85% top-1 accuracy on ImageNet

$$ x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) $$

$[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps produced in all previous layers.

def dense_block(x):
  out = conv1x1(relu(bn1(x)))   # bottleneck for computational efficiency
  out = conv3x3(relu(bn2(out)))
  out = concat([x, out])        # concatenate with the input, not sum
  return out


6. ResNeXt (2016)

  • Introduction : present an improved architecture that adopts ResNet's strategy of repeating layers.
  • Method : split-transform-merge strategy (cardinality = number of parallel paths)
  • Result : 80.9% top-1 accuracy on ImageNet with 83.6M params

$$ \mathbf{x}_{l+1} = \mathbf{x}_l + \sum_{i=1}^{C} F_i(\mathbf{x}_l) $$

where $C$ is the cardinality.

def resnext_block(x):
  # grouped convolutions implement the C parallel paths in a single op
  out = relu(bn1(conv3x3(x, groups=cardinality)))
  out = relu(bn2(conv3x3(out, groups=cardinality)) + x)
  return out


7. ShuffleNet (2017)

  • Introduction : an extremely computation-efficient CNN architecture, designed specifically for mobile devices with very limited computing power.

  • Methods : utilizes two new operations, pointwise group convolution and channel shuffle (a minimal sketch of the shuffle follows the list below)

    1. divide the channels in each group into several subgroups

    2. feed each group in the next layer with a different subgroup
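
A minimal PyTorch sketch of the channel shuffle operation (the (N, C, H, W) layout and function name are my assumptions, not from the paper):

import torch

def channel_shuffle(x, groups):
  # x: (N, C, H, W) output of a pointwise group convolution
  n, c, h, w = x.shape
  # split channels into (groups, C // groups), swap the two axes, and
  # flatten back: each output group now draws from every input group
  x = x.view(n, groups, c // groups, h, w)
  x = x.transpose(1, 2).contiguous()
  return x.view(n, c, h, w)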



8. FixResNeXt (2019)

  • Introduction : existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time.
  • Method : a simple strategy that uses different train and test resolutions to optimize classifier performance: in fact, a lower train resolution improves classification at test time (see the sketch below).
  • Result : 86.4% top-1 accuracy on ImageNet with 83.6M params

(Figure: left, the conventional augmentation method; right, the proposed augmentation method)
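
A minimal sketch of the mismatch the paper addresses, using torchvision transforms (the specific resolutions here are illustrative, not the paper's values):

import torchvision.transforms as T

# RandomResizedCrop at train time makes objects appear larger on average
# than the Resize + CenterCrop pipeline used at test time; lowering the
# train resolution (or raising the test one) re-aligns apparent object sizes.
train_tf = T.Compose([
  T.RandomResizedCrop(160),   # lower train resolution (illustrative)
  T.RandomHorizontalFlip(),
  T.ToTensor(),
])
test_tf = T.Compose([
  T.Resize(256),
  T.CenterCrop(224),          # higher test resolution
  T.ToTensor(),
])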


9. ViT : An Image is Worth 16x16 Words (2020, Google Research)

  • Introduction :

    • Transformer architecture has become the de-facto standard for natural language processing tasks
    • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place
    • The paper shows that this reliance on CNNs is not necessary: a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • Methods : applying a standard Transformer directly to images

    • Patch embedding $x_i$ : extracts N non-overlapping image patches, applies a linear projection $E$ (equivalent to a strided 2D conv), and flattens them into 1D tokens.

    • learnable embedding $z_{cls}$ : an optional learned classification token (similar to BERT's [CLS]) is prepended to the sequence of embedded patches

    • learnable position embedding $p$ : added to the tokens to retain positional information

      => useful when you have no idea how to hand-craft a positional encoding for your data

      => let the transformer figure out for itself what it needs as positional embeddings

      => simply train the vectors in the table of the figure at "NLP3 > Transformer > Binarized Indexing"

    $$ \mathbf{z} = [z_{cls}, E x_1, E x_2, \ldots, E x_N] + \mathbf{p} $$
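
A minimal PyTorch sketch of this embedding step (ViT-Base sizes assumed; the class and attribute names are mine):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
  def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
    super().__init__()
    n_patches = (img_size // patch_size) ** 2  # N = 196
    # the linear projection E, implemented as a strided 2D conv
    self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
    self.cls = nn.Parameter(torch.zeros(1, 1, dim))               # z_cls
    self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))   # p

  def forward(self, x):                              # x: (B, 3, H, W)
    z = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim) patch tokens
    cls = self.cls.expand(x.shape[0], -1, -1)        # (B, 1, dim)
    z = torch.cat([cls, z], dim=1)                   # prepend [cls]
    return z + self.pos                              # add position embedding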

  • Result:

    • When trained on mid-sized datasets such as ImageNet (1.3M images) without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size
    • However, the picture changes when the models are trained on larger datasets (JFT-300M, Figure 3 of the paper)


10. VirTex : Learning Visual Representations from Textual Annotations (2020)

  • Introduction : revisit supervised pretraining and seek data-efficient alternatives to classification-based pretraining.
    • (1) Semantic density : captions provide a semantically denser learning signal than unsupervised contrastive methods and supervised classification.
    • (2) Simplified data collection : natural language descriptions do not require an explicit ontology and can easily be written by non-expert workers.
  • VirTex : a pretraining approach using semantically dense captions to learn visual representations (a minimal sketch follows at the end of this section)
    • (1) jointly train a ConvNet and a Transformer from scratch to generate natural language captions for images
      • Visual Backbone : a convolutional network that computes visual features of the image
      • Textual Head : receives features from the visual backbone and predicts captions for the image
    • (2) transfer the learned features to downstream visual recognition tasks
  • Result
    • shows that natural language can provide supervision for learning transferable visual representations with better data-efficiency than other approaches
    • VirTex matches or exceeds the performance of existing methods for supervised or unsupervised pretraining on ImageNet, despite using up to 10× fewer images
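
A minimal PyTorch sketch of the two-part architecture (class and parameter names are mine; the real model uses bidirectional captioning with a causal attention mask, omitted here for brevity; torchvision ≥ 0.13 API assumed):

import torch.nn as nn
import torchvision

class VirTexSketch(nn.Module):
  def __init__(self, vocab_size, dim=512):
    super().__init__()
    # Visual Backbone: ResNet-50 trunk without the pooling/classifier head
    resnet = torchvision.models.resnet50(weights=None)
    self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, 7, 7)
    self.proj = nn.Linear(2048, dim)
    # Textual Head: a small Transformer decoder over caption tokens
    layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
    self.decoder = nn.TransformerDecoder(layer, num_layers=1)
    self.embed = nn.Embedding(vocab_size, dim)
    self.head = nn.Linear(dim, vocab_size)

  def forward(self, images, tokens):
    feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, 49, 2048)
    memory = self.proj(feats)                       # image features as decoder memory
    out = self.decoder(self.embed(tokens), memory)  # tokens attend to image features
    return self.head(out)                           # per-token vocabulary logits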