MLCV #8 | Pose Estimation

Introduction

  • Tasks :
    • Pose Estimation : The task is to detect the locations of human anatomical keypoints (e.g., elbow, wrist).


1. Deep Pose (2014)

  • Introduction : The first major paper to apply deep learning to human pose estimation.

  • Method :

    1. DNN-based regression : AlexNet backbone (7 layers) with an extra final layer that outputs $2k$ joint coordinates (where $k$ is the number of joints).

    2. Cascade of pose regressors : the initial prediction is refined by a cascade of further regressors.

    3. Normalization : because the ground-truth pose vector is defined in absolute image coordinates and poses vary in size from image to image, the authors normalize the joint coordinates of the training set relative to a bounding box around the person.

    4. Loss : linear regression on top of the last network layer predicts the pose vector by minimizing the $L_2$ distance between the prediction and the ground-truth pose vector.
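
Steps 3 and 4 can be illustrated with a small sketch. It assumes the box is given as a hypothetical `(cx, cy, w, h)` tuple and that each joint is translated by the box center and scaled by the box size; the function names are illustrative and not taken from the paper's code.

```python
def normalize_pose(joints, box):
    """Map absolute (x, y) joints to box-relative coordinates."""
    cx, cy, w, h = box
    return [((x - cx) / w, (y - cy) / h) for x, y in joints]


def denormalize_pose(norm_joints, box):
    """Inverse of normalize_pose: back to absolute image coordinates."""
    cx, cy, w, h = box
    return [(u * w + cx, v * h + cy) for u, v in norm_joints]


def l2_loss(pred, target):
    """Squared L2 distance summed over all 2k coordinates."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target))


box = (100.0, 200.0, 50.0, 100.0)          # hypothetical (cx, cy, w, h)
joints = [(100.0, 200.0), (125.0, 250.0)]  # k = 2 joints in image coordinates
norm = normalize_pose(joints, box)
restored = denormalize_pose(norm, box)
loss = l2_loss(norm, [(0.0, 0.0), (0.0, 0.0)])
```

With this normalization, the same pose at two different scales maps to the same target vector, which is what makes direct regression feasible.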



2. Efficient Object Localization Using ConvNets (2015)

  • Introduction : A ConvNet architecture that outputs a heat-map describing the likelihood of a joint occurring at each spatial location.

  • Method : Uses an additional ConvNet to refine the localization result of the coarse heat-map.

    1. Coarse Heat-Map Regression Model : Multi-resolution ConvNet that receives the same image at several different resolutions.

    2. Fine Heat-Map Regression Model : Siamese network with $k$ heads ($k$ is the number of joints).
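
The coarse-to-fine idea can be sketched as follows: take the peak of the coarse heat-map, map it back to input resolution, and crop a window around it for the refinement stage. This is a simplified illustration on plain 2D lists. The helper names, the `stride` value and the window size are assumptions, and the paper's fine model works on cropped feature maps rather than raw pixels.

```python
def argmax_2d(heatmap):
    """Return (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def crop_window(image, center, half):
    """Crop a square of side 2*half+1 around center, clamped to the image."""
    r0, c0 = center
    rows = range(max(0, r0 - half), min(len(image), r0 + half + 1))
    cols = range(max(0, c0 - half), min(len(image[0]), c0 + half + 1))
    return [[image[r][c] for c in cols] for r in rows]


coarse = [[0.0, 0.1, 0.0],
          [0.2, 0.9, 0.1],
          [0.0, 0.3, 0.0]]
stride = 4  # hypothetical ratio between input and heat-map resolution

r, c = argmax_2d(coarse)
center = (r * stride, c * stride)  # coarse peak in input coordinates

# Toy 12x12 "image" whose pixel value encodes its own position.
image = [[row * 12 + col for col in range(12)] for row in range(12)]
patch = crop_window(image, center, half=2)
```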

*Figure 4. Overview of our Cascaded Architecture


3. Simple Baselines for Human Pose Estimation and Tracking (2018)

  • Introduction : This work provides a simple and effective baseline method for human pose estimation (Task 1) and pose tracking (Task 2).

  • Method :

    1. Model Architecture : Simply adds a few deconvolution layers on top of the last convolution stage of a ResNet.

    2. Training Strategy : The label is a heatmap. The target $H^k$ for joint $k$ is generated by applying a 2D Gaussian centered on the $k^{th}$ joint's ground-truth location (std-dev = 1 pixel).

    3. Flow-Based Pose Tracking : Uses two kinds of human boxes: boxes from a human detector, and boxes generated from previous frames using optical flow.
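
As a rough sketch of how a flow box in step 3 might be produced, the joints of the previous frame can be shifted by the optical flow at their locations, and a bounding box drawn around the shifted joints and slightly enlarged. The function names, the dense `flow[y][x] = (dx, dy)` layout and the `expand` ratio are assumptions made for illustration.

```python
def propagate_joints(joints, flow):
    """Shift each (x, y) joint by the flow vector at its rounded pixel.

    Assumes every joint lies inside the flow field.
    """
    moved = []
    for x, y in joints:
        dx, dy = flow[int(round(y))][int(round(x))]
        moved.append((x + dx, y + dy))
    return moved


def box_from_joints(joints, expand=0.15):
    """Tight (x0, y0, x1, y1) box around the joints, enlarged by `expand`."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    pad_x = (x1 - x0) * expand / 2
    pad_y = (y1 - y0) * expand / 2
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)


# Hypothetical 8x8 flow field: every pixel moves 2 px right and 1 px down.
flow = [[(2.0, 1.0) for _ in range(8)] for _ in range(8)]
prev_joints = [(1.0, 1.0), (3.0, 5.0)]

moved = propagate_joints(prev_joints, flow)
tight_box = box_from_joints(moved, expand=0.0)
loose_box = box_from_joints(moved, expand=0.5)
```

Such flow boxes can recover people that the detector misses in the current frame, for example under motion blur or occlusion.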

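    # Pseudocode: Human_Detector, Pose_Estimator, FlowBox_Generator, NMS, calc_sim
    # and match_ids are placeholders (their arguments are assumed), not library calls.

    # Task 1: pose estimation
    label = keypoints_to_hmap(keypoints)  # training target heatmaps
    bbox_det = Human_Detector(frames)
    keypoints = hmap_to_coord(Pose_Estimator(frames, bbox_det))

    # Task 2: flow-based pose tracking
    output = []  # one (joints, ids) entry per frame
    for i in range(len(frames)):
        bbox = Human_Detector(frames[i])
        if i > 0:
            # boxes propagated from the previous frame via optical flow
            bbox_flow = FlowBox_Generator(frames[i - 1], frames[i], output[i - 1])
            # Non-maximum suppression: unify detection and flow boxes
            bbox = NMS(bbox, bbox_flow)

        joints = hmap_to_coord(Pose_Estimator(frames[i], bbox))

        # match against the previous frame to carry ids over (new ids on frame 0)
        sim_matrix = calc_sim(output[i - 1], joints) if i > 0 else None
        ids = match_ids(sim_matrix)
        output.append((joints, ids))  # update the output list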
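
The pseudocode above relies on `keypoints_to_hmap` and `hmap_to_coord`. A minimal dependency-free sketch of what they could look like is given below, assuming one heatmap per joint, integer pixel keypoints and simple argmax decoding. Real implementations would normally use array libraries and often add sub-pixel refinement.

```python
import math


def keypoints_to_hmap(keypoints, height, width, sigma=1.0):
    """One heatmap per joint: a 2D Gaussian centered on the joint's (x, y)."""
    hmaps = []
    for x0, y0 in keypoints:
        hmap = [[math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
                 for x in range(width)]
                for y in range(height)]
        hmaps.append(hmap)
    return hmaps


def hmap_to_coord(hmaps):
    """Decode each heatmap to the (x, y) position of its maximum value."""
    coords = []
    for hmap in hmaps:
        best_val, best_xy = -1.0, (0, 0)
        for y, row in enumerate(hmap):
            for x, v in enumerate(row):
                if v > best_val:
                    best_val, best_xy = v, (x, y)
        coords.append(best_xy)
    return coords


keypoints = [(2, 3), (5, 1)]  # hypothetical (x, y) joint locations
hmaps = keypoints_to_hmap(keypoints, height=6, width=8)
decoded = hmap_to_coord(hmaps)
```

Encoding and then decoding returns the original keypoints, which is a quick sanity check for any heatmap pipeline.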
    


4. HR-Net, Microsoft (2019)

  • Introduction : Most existing approaches pass the input through high-to-low resolution subnetworks connected in series, and then have to recover the high resolution from the low-resolution representation.

  • Method : A novel network architecture that connects high-to-low resolution subnetworks in parallel, so that a high-resolution representation is maintained through the whole process for spatially precise heatmap estimation.

*Figure 1. Illustrating the architecture of the proposed HRNet.
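
The exchange of information between parallel branches can be sketched with two toy feature maps, one at high and one at half resolution. In this simplified illustration, nearest-neighbor upsampling and 2x2 average pooling stand in for the learned upsampling and strided-convolution layers of the real network, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor upsampling of a 2D list by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fmap):
    """2x2 average pooling (stand-in for a strided convolution)."""
    return [[(fmap[r][c] + fmap[r][c + 1]
              + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


high = [[1.0, 1.0, 3.0, 3.0],
        [1.0, 1.0, 3.0, 3.0],
        [5.0, 5.0, 7.0, 7.0],
        [5.0, 5.0, 7.0, 7.0]]
low = [[10.0, 20.0],
       [30.0, 40.0]]

# Each branch receives the other branch's features, resized to its own resolution.
fused_high = add(high, upsample2x(low))
fused_low = add(low, downsample2x(high))
```

The high-resolution branch is never discarded, so the final heatmap does not have to be reconstructed from a low-resolution bottleneck.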


5. Higher HR-Net

  • Top-Down methods : rely on a person detector to find person instances, which reduces the problem to single-person pose estimation.

  • However, they are usually computationally intensive and are not truly end-to-end systems.