Abstract: Human-centric perception tasks are essential to many human-agent interaction applications. Unsupervised multi-modal human-centric pre-training models have recently been proposed to lay the foundation for various downstream tasks. However, existing works usually treat the other modalities (e.g., depth and 2D keypoints) as auxiliary inputs to the RGB modality in contrastive learning, which may lead to learning inadequate shared features. To mitigate this problem, we propose a novel Pseudo Body-structured Prior (PBoP) framework that generates a pseudo image encoding salient human-part and body-structure information, which enhances feature learning for human-centric perception tasks. By contrasting the RGB and depth modalities against the pseudo image, the pre-trained model significantly enhances foreground details and suppresses background clutter. In the experiments, we evaluate our method on various human-centric benchmark datasets, and our model achieves state-of-the-art results.
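The cross-modal contrast described above can be sketched as a symmetric InfoNCE-style objective in which the RGB and depth branches are each pulled toward the shared pseudo-image representation. This is a minimal illustrative sketch, not the paper's actual implementation: the encoder outputs, batch size, embedding dimension, and temperature below are all assumptions.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    anchor, positive: (N, D) arrays; row i of each is a positive pair,
    all other rows serve as in-batch negatives.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # diagonal = positives

rng = np.random.default_rng(0)
z_rgb   = rng.normal(size=(8, 32))   # hypothetical RGB encoder output
z_depth = rng.normal(size=(8, 32))   # hypothetical depth encoder output
z_pbp   = rng.normal(size=(8, 32))   # hypothetical pseudo-image embeddings

# contrast RGB and depth each against the shared pseudo body-prior image,
# rather than treating depth as an auxiliary input to RGB
loss = info_nce(z_rgb, z_pbp) + info_nce(z_depth, z_pbp)
```

The key design choice this illustrates is that the pseudo image acts as a common anchor for both modalities, so the shared, body-relevant features are reinforced in each branch instead of being dominated by one modality.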