Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

ICLR 2026 Conference Submission 12718 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: human-centric visual perception, large-scale pretraining, knowledge distillation
Abstract: Human-centric vision models (HVMs) have achieved remarkable generalization thanks to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of their pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a distillation-based pretraining framework that effectively transfers the generalization capability of large HVMs to lightweight HVMs by mimicking three typical visual patterns: global identity patterns, local shape patterns, and multi-person interaction patterns. Specifically, we design a dynamic pattern decoder (D-PaDe) that incorporates three specialized experts to generate these patterns independently, avoiding inter-pattern conflict during training. We then design three alignment objectives that bridge the gap between lightweight and large HVMs at the global image level, the local pixel level, and the instance relation level. Together, these two designs guide the lightweight model to learn all three typical visual patterns from large HVMs, thereby improving its generalization across various human-centric vision tasks. Extensive experiments on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalization comparable to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods, including Proteus-ViT/Ti (5M) and TinyMIM-ViT/Ti (5M), by a large margin. More importantly, DPAL is performed on a limited dataset of around 1M unlabeled images that the large HVMs never saw during pretraining, bypassing the need for those inaccessible or license-constrained pretraining datasets and offering an affordable path to generalizable HVMs.
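To make the three alignment objectives concrete, below is a minimal PyTorch sketch of what a global-level, pixel-level, and instance-relation-level distillation loss could look like. All function names, tensor shapes, loss choices, and weights here are illustrative assumptions, not the paper's exact formulation, and the D-PaDe expert routing that produces each pattern is abstracted away.

```python
# Hypothetical sketch of DPAL's three-level alignment objectives.
# Assumed inputs: dicts with "global" (B, D) pooled features,
# "pixel" (B, N, D) patch features, and "instance" (B, K, D)
# per-person features, already projected to a shared dimension D.
import torch
import torch.nn.functional as F

def global_alignment(student_cls: torch.Tensor, teacher_cls: torch.Tensor) -> torch.Tensor:
    """Image-level alignment: match pooled features carrying the
    global identity pattern (cosine distance, an assumed choice)."""
    return 1.0 - F.cosine_similarity(student_cls, teacher_cls, dim=-1).mean()

def pixel_alignment(student_patches: torch.Tensor, teacher_patches: torch.Tensor) -> torch.Tensor:
    """Pixel-level alignment: match per-patch features carrying the
    local shape pattern (plain MSE, an assumed choice)."""
    return F.mse_loss(student_patches, teacher_patches)

def relation_alignment(student_inst: torch.Tensor, teacher_inst: torch.Tensor) -> torch.Tensor:
    """Instance-relation alignment: match the pairwise similarity
    structure among per-person features, i.e. the multi-person
    interaction pattern, rather than the features themselves."""
    s = F.normalize(student_inst, dim=-1)
    t = F.normalize(teacher_inst, dim=-1)
    rel_s = s @ s.transpose(-2, -1)  # (B, K, K) student relation matrix
    rel_t = t @ t.transpose(-2, -1)  # (B, K, K) teacher relation matrix
    return F.mse_loss(rel_s, rel_t)

def dpal_loss(student_out: dict, teacher_out: dict, w=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Total distillation loss summed over the three pattern levels
    (equal weights are an assumption)."""
    return (w[0] * global_alignment(student_out["global"], teacher_out["global"])
            + w[1] * pixel_alignment(student_out["pixel"], teacher_out["pixel"])
            + w[2] * relation_alignment(student_out["instance"], teacher_out["instance"]))
```

The split mirrors the abstract's motivation: matching features directly (global and pixel terms) transfers identity and shape cues, while matching relation matrices instead of raw instance features lets the student copy how the teacher relates co-occurring people without being tied to the teacher's feature scale.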
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12718