/AI/ Human Pose Estimation with Occlusion (PoseResnet, HRNet, ViTPose)

目录

This is a project of NTU CE7454: Deep Learning for Data Science. The report is here.

Human Pose Estimation (HPE) has been popular in the computer vision community. Various deep learning models have been proposed to achieve superior performance on HPE. However, if parts of the objects are occluded, their performances would degrade due to the loss of context and semantics. Towards this problem, this work proposes an artificial occlusion transformation to imitate in-the-wild occlusions. Its use is tested on three well-known HPE models reproduced by ourselves, i.e., SimpleBaseline, HRNet, and ViTPose. We first show how their performances have been affected when presented with occluded images. Experiments were then conducted to investigate the optimal occlusion settings. Finally, we concluded that fine-tuning images with occlusions could boost the robustness of the model.

We evaluate the performance of three models on the MPII in various occlusion settings. First, it is important to note the significant drop in performance when testing the performance of models trained on non-occluded samples (i.e., the original train and test set) on our occluded test set. The drop in performance ranges from 4% to 30% in different models. It is interesting to note that while HRNet boasts superior performance under non-occluded settings as compared to PoseResnet, the drop in performance of the former is much sharper when occlusion is introduced. Amongst the three models, ViTPose is most robust towards occlusion as suggested by the least drop in performance. This is probably due to MAE’s ability to reduce the noise of input data. By masking random patches of the image, which is similar to our occlusion settings, the model is able to reconstruct the missing patches and learn a better representation of the input images.

Table reports the pose estimation performance of each model under various occlusion settings. The NO column represents the ideal scenario where the entire human body is visible in an image, whereas the O column resembles real-life settings where parts of the human body might be blocked by other human or obstacles. Under the occluded test set setting (O column), it is observed that model trained with non-zero p_occ has a higher score than model trained with zero p_occ, which supports our initial hypothesis - a model trained for a difficult task is performing better as compared to a model trained for a easy task. While all three models perform better using occluded training set, the performance gap for ViTPose is relatively smaller than those in PoseResnet and HRNet, which suggests that transformer based methods are inherently more robust than CNN based methods, therefore less occlusion is needed during training. Under the non-occluded test set setting (NO column), it appears that there is minimal to no benefit when using higher p_occ values, \ie, more occluded samples in train set. The overall results imply that the choice of p_occ value for optimal performance depends on the actual data distribution (\% of occluded data in the test set) and the type of model used (CNN based or transformer based).