AEMP: Autoregressive-Enhanced Masked Pre-training for Robust Indoor Localization

18 Sept 2025 (modified: 15 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Indoor localization, Self-supervised, Pre-training, Channel State Information
Abstract: The major obstacle to learning-based Channel State Information (CSI) localization is obtaining a high-quality, large-scale annotated dataset. Unlike visual data, which human workers can annotate easily, CSI is a non-intuitive and non-interpretable RF signal, making annotation both time-consuming and labor-intensive. Given the potential of self-supervised learning to reduce reliance on labeled data, masked reconstruction has emerged as a promising alternative. However, directly applying existing designs to large-scale CSI scenarios faces unique challenges: unstable representations in unmasked regions, an inability to preserve long-range channel correlations, and high sensitivity to variations in access-point layouts and propagation environments. To address these issues, we propose an autoregressive-enhanced masked pre-training (AEMP) framework. AEMP employs a hierarchical Transformer architecture in which spatial subnetworks perform masked reconstruction to capture local channel features, while a temporal network enforces consistency through autoregressive prediction. In addition, multi-view fusion and span masking improve robustness under dynamic deployment conditions. Extensive experiments demonstrate that AEMP yields stable and transferable representations, achieving superior performance and strong generalization on downstream indoor localization tasks. To the best of our knowledge, this is the first pre-training framework for wireless sensing that integrates temporal prediction to complement masked reconstruction.
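The span masking mentioned in the abstract (hiding contiguous blocks of the CSI sequence rather than isolated positions, so the model must recover longer-range channel structure) can be sketched as below. The paper does not give implementation details, so the helper name, mask ratio, and geometric span-length distribution here are illustrative assumptions, not the authors' method:

```python
import numpy as np

def span_mask(seq_len, mask_ratio=0.3, mean_span=4, seed=None):
    """Hypothetical span-masking helper: returns a boolean mask over a
    length-`seq_len` CSI sequence in which roughly `mask_ratio` of the
    positions are hidden as contiguous spans (geometric span lengths),
    rather than as independently sampled single positions."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(seq_len, dtype=bool)
    target = int(seq_len * mask_ratio)
    while mask.sum() < target:
        # Draw a span length (mean ~ mean_span) and a random start position.
        span = min(int(rng.geometric(1.0 / mean_span)), seq_len)
        start = int(rng.integers(0, seq_len))
        mask[start:start + span] = True
    return mask

mask = span_mask(64, mask_ratio=0.3, seed=0)
```

A masked-reconstruction loss would then be computed only on the positions where `mask` is `True`, while the unmasked positions are fed to the encoder unchanged.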
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12760