CFMAE: A Coarse-to-Fine Vision Pre-training Framework for Hierarchical Representation Learning

15 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Hierarchical Representation Learning, Coarse-to-Fine, Masked Autoencoder, Multi-granular, Vision Pre-training
Abstract: Prevailing self-supervised learning paradigms, such as contrastive learning (CL) and masked image modeling (MIM), exhibit opposing limitations. CL excels at learning global semantic representations but sacrifices fine-grained detail, while MIM preserves local details but struggles with high-level semantics due to its semantically agnostic masking, leading to "attention drift". To unify the strengths of both, we propose CFMAE, a coarse-to-fine vision pre-training framework that explicitly learns a Masked AutoEncoder over a hierarchy of visual granularities. CFMAE synergistically integrates three data granularities: semantic masks (coarse), instance masks (intermediate), and RGB images (fine). We enforce the coarse-to-fine principle through two key innovations: (1) a cascaded decoder that sequentially predicts scene-level semantics, then object-level instances, and finally pixel-level details, ensuring a structured feature refinement process; and (2) a progressive masking strategy that creates a dynamic training curriculum, shifting the model's focus from coarse scene context to fine local details. To support this, we construct a large-scale, multi-granular dataset by generating high-quality pseudo-labels for ImageNet-1K. Extensive experiments show that CFMAE achieves state-of-the-art performance on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
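The progressive masking curriculum described above could take the form of a per-granularity mask-ratio schedule. The sketch below is a minimal illustration, not the paper's actual implementation: the schedule shape (linear interpolation), the endpoint ratios, and the function name are all assumptions made for clarity.

```python
# Hypothetical sketch of a progressive masking curriculum: the mask ratio on the
# coarse (semantic-mask) stream decays over training while the ratio on the fine
# (RGB-patch) stream grows, shifting the model's focus from scene context to
# local detail. Schedule shape and endpoints are illustrative assumptions.
def progressive_mask_ratios(step, total_steps,
                            coarse=(0.75, 0.25),  # (start, end) ratio for semantic masks
                            fine=(0.25, 0.75)):   # (start, end) ratio for RGB patches
    t = min(max(step / total_steps, 0.0), 1.0)    # normalized training progress in [0, 1]
    lerp = lambda a, b: a + t * (b - a)           # linear interpolation
    return lerp(*coarse), lerp(*fine)
```

Early in training the coarse stream is heavily masked (forcing reliance on global scene context); by the end, masking concentrates on the fine stream, emphasizing pixel-level reconstruction.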
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5963