Introduction
- Tasks :
- Pose Estimation : The task is to detect the locations of human anatomical keypoints (e.g., elbow, wrist).
1. DeepPose (2014)
- Introduction : The first major paper to apply deep learning to human pose estimation.
- Method :
  - DNN-based regression : AlexNet backbone (7 layers) with an extra final layer that outputs $2k$ joint coordinates (where $k$ is the number of joints).
  - Cascade of pose regressors : The initial predictions are refined by a cascade of regressors.
  - Since the ground-truth pose vector is defined in absolute image coordinates and poses vary in size from image to image, the authors normalize the training-set coordinates.
  - A linear regression on top of the last network layer predicts the pose vector by minimizing the $L_2$ distance between the prediction and the true pose vector.
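The normalization and the $L_2$ objective above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the normalization is taken relative to a person bounding box given as (center, width, height), and the names `normalize_pose`, `denormalize_pose` and `l2_loss` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height), assumed layout


def normalize_pose(joints: List[Point], box: Box) -> List[Point]:
    """Translate joints by the box center and scale by the box size."""
    cx, cy, w, h = box
    return [((x - cx) / w, (y - cy) / h) for x, y in joints]


def denormalize_pose(joints: List[Point], box: Box) -> List[Point]:
    """Inverse of normalize_pose: map predictions back to image coordinates."""
    cx, cy, w, h = box
    return [(x * w + cx, y * h + cy) for x, y in joints]


def l2_loss(pred: List[Point], target: List[Point]) -> float:
    """Sum of squared differences over all 2k coordinates."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target))


box = (100.0, 200.0, 50.0, 100.0)
pose = [(100.0, 200.0), (125.0, 150.0)]  # k = 2 joints in image coordinates
norm = normalize_pose(pose, box)         # size-independent regression target
restored = denormalize_pose(norm, box)   # back to image coordinates
```

The network would regress the normalized vector; predictions are mapped back with the inverse transform before evaluation.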
2. Efficient Object Localization Using Convolutional Networks (2015)
- Introduction : ConvNet architecture that outputs a heatmap describing the likelihood of a joint occurring at each spatial location.
- Method : An additional ConvNet refines the localization result of the coarse heat-map.
  - Coarse Heat-Map Regression Model : Multi-resolution ConvNet that receives the same image at several different sizes.
  - Fine Heat-Map Regression Model : Siamese network with $k$ heads (where $k$ is the number of joints).
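The coarse-to-fine idea can be illustrated with a small sketch: take the peak of the coarse heat-map, map it to full resolution, and crop a window there for the refinement model. Everything here is an assumption for illustration: the helper names, the stride of 4 between heat-map and image, and cropping from the image itself rather than from intermediate feature maps.

```python
from typing import List, Tuple

Grid = List[List[float]]


def coarse_location(heatmap: Grid) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring heat-map cell."""
    cells = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])


def to_full_res(loc: Tuple[int, int], stride: int) -> Tuple[int, int]:
    """Map a heat-map cell to the center of its footprint at full resolution."""
    r, c = loc
    return (r * stride + stride // 2, c * stride + stride // 2)


def crop_window(image: Grid, center: Tuple[int, int], half: int) -> Grid:
    """Crop a square of side 2*half+1 around center, clamped to the image bounds."""
    r, c = center
    rows = range(max(0, r - half), min(len(image), r + half + 1))
    cols = range(max(0, c - half), min(len(image[0]), c + half + 1))
    return [[image[i][j] for j in cols] for i in rows]


heatmap = [[0.0, 0.1, 0.0],
           [0.2, 0.9, 0.1],
           [0.0, 0.3, 0.0]]
image = [[float(i * 12 + j) for j in range(12)] for i in range(12)]

coarse = coarse_location(heatmap)            # peak cell of the coarse heat-map
center = to_full_res(coarse, stride=4)       # same location at image resolution
patch = crop_window(image, center, half=1)   # input for the fine model
```

A fine model would then predict a small offset inside `patch`, which is added to `center` to obtain the refined joint location.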
3. Simple Baselines for Human Pose Estimation and Tracking (2018)
- Introduction : This work provides a simple and effective baseline method for human pose estimation (Task 1) and pose tracking (Task 2).
- Method :
  - Model Architecture : Simply adds a few deconvolution layers on top of the last convolutional layer of ResNet.
  - Training Strategy : The label is a heatmap; the target $H^k$ for joint $k$ is generated by applying a 2D Gaussian with a standard deviation of 1 pixel, centered on the $k^{th}$ joint's ground-truth location.
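  - Flow-Based Pose Tracking : Uses two kinds of human boxes: boxes from a human detector, and boxes propagated from previous frames using optical flow.

```python
# Pseudocode: Human_Detector, FlowBox_Generator, NMS, Pose_Estimator, calc_sim,
# keypoints_to_hmap and hmap_to_coord stand for components not defined here.

# Task 1: single-frame pose estimation
input, label = frames, keypoints_to_hmap(keypoints)
bbox_det = Human_Detector(input)
keypoints = hmap_to_coord(Pose_Estimator(input, bbox_det))

# Task 2: flow-based pose tracking, frame by frame
output = []
for i in range(len(input)):
    bbox_det = Human_Detector(input[i])
    bbox_flow = FlowBox_Generator()  # boxes propagated from previous frames
    # Non-maximum suppression: unify detection and flow boxes
    bbox_unified = NMS(bbox_det, bbox_flow)
    joints = Pose_Estimator(input[i], bbox_unified)
    sim_matrix = calc_sim(output[i - 1], joints)  # compare with the previous frame
    pose = (sim_matrix, id)
    output.append(pose)  # update the output list
```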
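The heatmap target used for training can be sketched as below. This is a minimal version under stated assumptions: the Gaussian is left unnormalized (peak value 1), the joint is given in heatmap pixel coordinates, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def gaussian_heatmap(height: int, width: int, cx: float, cy: float,
                     sigma: float = 1.0) -> List[List[float]]:
    """Target heatmap for one joint: a 2D Gaussian centered on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def heatmap_to_coord(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Decode a heatmap back to (x, y) by taking its maximum."""
    cells = ((x, y) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    return max(cells, key=lambda xy: heatmap[xy[1]][xy[0]])


# One target per joint: here a joint at x=2, y=5 on an 8x6 (rows x cols) map.
target = gaussian_heatmap(height=8, width=6, cx=2, cy=5)
decoded = heatmap_to_coord(target)
```

With $k$ joints, the full label is a stack of $k$ such maps, and the decoding step is applied per map at inference time.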
4. HRNet, Microsoft (2019)
- Introduction : Existing approaches typically use a stem subnetwork followed by a high-to-low design pattern that progressively decreases the resolution.
- Method : Novel network architecture that connects high-to-low subnetworks in parallel, so that high-resolution representations are maintained through the whole process for spatially precise heatmap estimation.
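The information exchange between parallel branches can be pictured with a toy sketch. It is only an illustration under simplifying assumptions: one channel per branch, two branches, nearest-neighbour upsampling and 2x2 average pooling standing in for the learned convolutions, and hypothetical function names.

```python
from typing import List, Tuple

Grid = List[List[float]]


def upsample2x(fmap: Grid) -> Grid:
    """Nearest-neighbour upsampling by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fmap: Grid) -> Grid:
    """2x2 average pooling (stand-in for a learned strided convolution)."""
    return [
        [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def exchange(high: Grid, low: Grid) -> Tuple[Grid, Grid]:
    """Each branch keeps its resolution and receives the other branch's features."""
    return add(high, upsample2x(low)), add(low, downsample2x(high))


high = [[1.0] * 4 for _ in range(4)]   # high-resolution branch (4x4)
low = [[1.0, 2.0], [3.0, 4.0]]         # low-resolution branch (2x2)
new_high, new_low = exchange(high, low)
```

The point of the sketch is that the high-resolution branch is never discarded: it is updated in place with context from the lower resolution instead of being recovered by upsampling at the end.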
5. HigherHRNet
- Top-down methods : Rely on a person detector to find person instances, which reduces the problem to single-person pose estimation.
- However, they are normally computationally intensive and are not truly end-to-end systems.