Abstract: In multi-modal perception tasks, test-phase data often suffer from environmental noise and sensor degradation, which causes distribution shifts from the training phase. Test-time adaptation (TTA) is an emerging unsupervised learning strategy that allows pre-trained models to adapt to new data distributions during testing without requiring source domain data. However, existing TTA methods, primarily designed for single-modal data, often struggle with multi-modal distribution shifts. Because they rely on high-confidence pseudo-labels to update model parameters, they can perform worse than the unadapted model when all modalities are corrupted, and they can suffer from catastrophic forgetting. To address these issues, we propose an Energy-guided Two-stage Test-time Adaptation (Eng2TTA) framework specifically designed for multi-modal perception. In the first stage, an energy-guided loss function is employed to optimize local model parameters by smoothing class distributions within each batch, thereby reducing overconfidence from noisy pseudo-labels. Concurrently, a memory bank is constructed to store the most representative high-confidence sample features for each class. In the second stage, predictions for low-confidence samples are refined by querying the memory bank using feature similarity, leveraging reliable high-confidence information without requiring additional parameter updates, which effectively mitigates catastrophic forgetting. Our method demonstrates superior robustness in multi-modal tasks, significantly outperforming state-of-the-art methods in scenarios with varying levels of modality corruption, particularly under severe distribution shifts.
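The two mechanisms named in the abstract can be sketched in a few lines: an energy score over model logits (stage one) and a per-class memory bank queried by cosine similarity to refine low-confidence predictions (stage two). This is only an illustrative NumPy sketch of the general ideas; the exact energy-guided loss, confidence thresholds, and bank-update rule used by Eng2TTA are not given in the abstract, so the specific formulas and names below (`free_energy`, `MemoryBank`) are assumptions for exposition.

```python
import numpy as np

def free_energy(logits, T=1.0):
    """Per-sample free energy E(x) = -T * logsumexp(logits / T).

    Lower energy indicates a more confident (in-distribution) prediction;
    an energy-guided loss can penalize batches whose energies are too
    sharply peaked, smoothing the per-batch class distribution.
    This specific form is a common choice, assumed here for illustration.
    """
    z = logits / T
    m = z.max(axis=-1, keepdims=True)  # stabilize logsumexp
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

class MemoryBank:
    """Keeps one representative high-confidence feature per class (stage one)
    and refines low-confidence predictions by nearest prototype (stage two)."""

    def __init__(self, num_classes, dim):
        self.feats = np.zeros((num_classes, dim))  # L2-normalized prototypes
        self.conf = np.zeros(num_classes)          # confidence of stored feature

    def update(self, feature, label, confidence):
        # Retain the most confident feature seen so far for this class.
        if confidence > self.conf[label]:
            self.conf[label] = confidence
            self.feats[label] = feature / (np.linalg.norm(feature) + 1e-8)

    def refine(self, feature):
        # Cosine similarity against stored prototypes; no parameter update,
        # so this step cannot cause catastrophic forgetting.
        f = feature / (np.linalg.norm(feature) + 1e-8)
        return int(np.argmax(self.feats @ f))
```

For example, after the bank has stored confident prototypes for two classes, a noisy low-confidence feature is relabeled by its nearest prototype rather than by the (unreliable) classifier head.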