MLCV #3 | Semantic Segmentation

Introduction

  • Tasks:

    • Image Segmentation : The process of assigning a label to every pixel in the image.

    • Semantic Segmentation : treats multiple objects of the same class as a single entity.

    • Instance Segmentation : treats multiple objects of the same class as distinct individual objects.
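
The difference between the two tasks can be seen on a toy label map. The sketch below is only an illustration: the class id `1` and the helper `instances_from_semantic` are made up for this note, and splitting a semantic mask into connected blobs is not how instance segmentation models work (they predict instances directly and can separate touching objects).

```python
from collections import deque

# Toy 4x6 semantic map: 0 = background, 1 = one object class (hypothetical id).
# Both objects share the same label, i.e. they form a single entity.
semantic = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

def instances_from_semantic(mask, class_id):
    """Give each 4-connected blob of `class_id` its own instance id (1, 2, ...)."""
    h, w = len(mask), len(mask[0])
    inst = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] != class_id or inst[r][c]:
                continue
            # New, unvisited pixel of the class: start a new instance.
            next_id += 1
            inst[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and mask[ny][nx] == class_id and not inst[ny][nx]:
                        inst[ny][nx] = next_id
                        queue.append((ny, nx))
    return inst, next_id

instance, n_instances = instances_from_semantic(semantic, class_id=1)
```

In `semantic` the two objects are indistinguishable, while in `instance` every pixel also carries the id of the object it belongs to.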



1. FCN (2015)

  • Introduction : The first end-to-end pixel-wise prediction model based only on convolutional layers.

  • Method:

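    1. Feature Extraction : convolution and pooling layers, as in conventional image classification networks (layers 1, 2, 3, 4, 5)
    2. Convolutionalizing : replace the FC layers with convolution layers such as 1x1 conv (layers 6, 7, 8), so the network keeps spatial information and outputs a coarse score map instead of a single vector
    3. Pixel-wise Classification : the last 1x1 conv layer classifies every location of the score map into 21 classes (20 PASCAL VOC object classes + background)
    4. Upsampling : restore the coarse score map to the input resolution using a deconvolution layer, also called transposed convolution
    5. Fusing Output : FCN-32s upsamples the coarsest (stride-32) prediction x32 directly. FCN-16s adds the pool4 prediction to the x2-upsampled coarse prediction, then upsamples x16. FCN-8s additionally adds the pool3 prediction, then upsamples x8.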
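
The fusion step can be sketched on tiny per-class score maps. This is a simplified sketch, not the original implementation: nearest-neighbour upsampling stands in for the learned transposed convolution, the maps are 1x1, 2x2 and 4x4 with 2 classes instead of 21, the final x8 upsampling is omitted, and all names and numbers (`score_final`, `score_pool4`, `score_pool3`, ...) are made up for illustration.

```python
def upsample2x(score):
    """Nearest-neighbour x2 upsampling of one 2-D score map
    (stand-in for a learned transposed convolution)."""
    out = []
    for row in score:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def add(a, b):
    """Element-wise sum of two equally sized 2-D maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(coarse, fine):
    """Upsample the per-class coarse scores x2 and add the finer per-class scores."""
    return [add(upsample2x(c), f) for c, f in zip(coarse, fine)]

def argmax_labels(scores):
    """Per-pixel class = index of the highest score across the class maps."""
    h, w = len(scores[0]), len(scores[0][0])
    return [[max(range(len(scores)), key=lambda k: scores[k][y][x])
             for x in range(w)] for y in range(h)]

# Per-class score maps, indexed as [class][row][col].
score_final = [[[1.0]], [[0.0]]]                      # coarsest prediction, 1x1
score_pool4 = [[[0.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 2.0]]]              # finer prediction, 2x2
score_pool3 = [[[0.0] * 4 for _ in range(4)],
               [[1.5, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]]  # 4x4

fused16 = fuse(score_final, score_pool4)   # FCN-16s style fusion
fused8 = fuse(fused16, score_pool3)        # FCN-8s style fusion
labels = argmax_labels(fused8)
```

The coarse map alone would label every pixel as class 0. The pool4 and pool3 scores add detail that only exists at the finer resolutions, which is the motivation for the skip connections.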

Figure1. Overview of FCN Architecture

Figure2. Overview of upsampling process
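
A minimal sketch of a transposed convolution, assuming a single channel, a square kernel and no padding. The function name `conv_transpose2d` and the fixed all-ones kernel are choices made here for illustration; in FCN the kernel weights are learned.

```python
def conv_transpose2d(x, kernel, stride):
    """Each input value 'stamps' a scaled copy of the kernel onto the output;
    where stamps overlap, their contributions are summed."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    out_h = (h - 1) * stride + k
    out_w = (w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out

x = [[1, 2],
     [3, 4]]
ones = [[1, 1],
        [1, 1]]

up_stride2 = conv_transpose2d(x, ones, stride=2)  # 4x4, stamps do not overlap
up_stride1 = conv_transpose2d(x, ones, stride=1)  # 3x3, stamps overlap and add up
```

With stride 2 and a 2x2 kernel the 2x2 input becomes a 4x4 output, following the output size (h - 1) * stride + k used in the code.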


2. Mask R-CNN (2017)

  • Introduction : detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.

  • Method : extends Faster R-CNN by adding a third branch that outputs the object mask, in parallel with the existing class and box branches.

    1. Feature Extraction is the same as in Faster R-CNN

    2. RPN is the same as in Faster R-CNN

    3. The heads output (class + box offset + a binary mask) for each ROI in parallel.

      • 3.1 : Class labels are collapsed into short output vectors by FC layers, same as Faster R-CNN
      • 3.2 : Box offsets are collapsed into short output vectors by FC layers, same as Faster R-CNN
      • 3.3 : $m \times m$ masks are predicted for each ROI using an FCN
    4. ROI Align : ROI Pool rounds ROI coordinates to the feature-map grid, so the extracted features are slightly misaligned with the real ROI. This barely matters for classification, but it hurts pixel-accurate mask prediction. To address this problem, the authors proposed ROI Align, which skips the rounding and computes feature values at the exact sampling locations using bilinear interpolation.
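
The effect of the rounding can be sketched on a small feature map. This is a simplified sketch under stated assumptions: one sample per bin (the paper samples several points per bin and aggregates them), ROI coordinates already given in feature-map units as `(y1, x1, y2, x2)`, and `roi_quantized` only mimics the rounding of ROI Pool, not its max pooling. All function names are made up for this note.

```python
import math

def bilinear(feat, y, x):
    """Sample a 2-D feature map at real-valued (y, x) by bilinear interpolation."""
    h, w = len(feat), len(feat[0])
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def bin_centres(roi, out_size):
    """Real-valued centre coordinates of an out_size x out_size grid over the ROI."""
    y1, x1, y2, x2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    ys = [y1 + (i + 0.5) * bin_h for i in range(out_size)]
    xs = [x1 + (j + 0.5) * bin_w for j in range(out_size)]
    return ys, xs

def roi_align(feat, roi, out_size):
    """One bilinear sample at each bin centre; coordinates are never rounded."""
    ys, xs = bin_centres(roi, out_size)
    return [[bilinear(feat, y, x) for x in xs] for y in ys]

def roi_quantized(feat, roi, out_size):
    """Same grid, but every coordinate is rounded to the nearest cell first
    (mimics the quantisation in ROI Pool)."""
    ys, xs = bin_centres(roi, out_size)
    return [[feat[int(math.floor(y + 0.5))][int(math.floor(x + 0.5))]
             for x in xs] for y in ys]

# Feature map that is linear in position, feat[y][x] = 10*y + x, so the exact
# value at any real-valued location is known and errors are easy to read off.
feat = [[10 * y + x for x in range(6)] for y in range(6)]
roi = (1.2, 0.6, 3.2, 2.6)

aligned = roi_align(feat, roi, out_size=2)
quantized = roi_quantized(feat, roi, out_size=2)
```

Here the first bin centre is at about (1.7, 1.1), so the aligned value is about 18.1, while the quantized version reads the cell at (2, 1) and returns 21. On a real feature map this shift becomes the misalignment between the ROI and the extracted features described above.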