Keywords: Continual Test-Time Adaptation, Masked Image Modeling, Teacher-Student Learning, Catastrophic Forgetting, Vision Transformers
TL;DR: MiDEA achieves state-of-the-art continual test-time adaptation by combining global–local self-distillation with a dual-rate EMA, running 3× faster than existing methods.
Abstract: Continual test-time adaptation (CTTA) aims to maintain model accuracy under non-stationary distribution shifts when source training data are unavailable. Existing methods that update models using their own predictions ("pseudo-labels") face a fundamental trade-off between rapid adaptation to new conditions and retention of previously learned knowledge. Many recent approaches address this with multiple forward passes per sample, which makes them impractical for deployment under strict latency or memory constraints.
To overcome these limitations, we introduce **MiDEA** (**M**asked-**i**mage modeling with **D**ual-**E**MA **A**daptation), a simple yet powerful method specifically designed for efficient continual test-time adaptation. MiDEA maintains a single, decoder-free teacher–student architecture that measures distribution gaps at both global image and local patch levels, guiding the student to minimize these gaps at each adaptation step.
At the global level, MiDEA aligns full-image predictions from a clean teacher view with those from a strongly augmented student view. Locally, it matches the teacher's patch-wise representations to the student's masked patch embeddings, ensuring that fine-grained spatial details adapt effectively. The teacher model is refreshed using a novel dual-rate exponential moving average (EMA): attention layers update rapidly to absorb new visual conditions, while MLP weights drift more slowly to preserve prior knowledge, empirically reducing catastrophic forgetting.
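To make the mechanism concrete, the sketch below shows one possible adaptation step inferred from this description alone. It is a minimal illustration, not the authors' implementation: the helper functions `augment` and `mask_patches`, the momentum values, the loss forms (KL divergence for the global term, smooth-L1 for the local term), and the `"attn"` name-matching heuristic are all assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a MiDEA-style adaptation step, based only on the
# abstract. `augment` and `mask_patches` are assumed helpers; the model is
# assumed to return (global_prediction, patch_tokens).

@torch.no_grad()
def dual_rate_ema_update(teacher, student, m_attn=0.99, m_mlp=0.999):
    """Dual-rate EMA: attention weights track the student quickly, while
    MLP weights drift slowly to retain prior knowledge. Momentum values
    here are illustrative, not taken from the paper."""
    for (name, t_p), (_, s_p) in zip(
        teacher.named_parameters(), student.named_parameters()
    ):
        m = m_attn if "attn" in name else m_mlp
        t_p.mul_(m).add_(s_p, alpha=1.0 - m)

def adaptation_step(student, teacher, optimizer, images, lam=1.0):
    """One single-pass-per-model, decoder-free adaptation step."""
    with torch.no_grad():
        # Clean view through the teacher: full-image prediction + patch tokens.
        t_global, t_patches = teacher(images)

    # Strongly augmented, partially masked view through the student.
    strong = augment(images)             # assumed augmentation helper
    masked, mask = mask_patches(strong)  # assumed patch-masking helper
    s_global, s_patches = student(masked)

    # Global alignment: match full-image predictions across views.
    loss_global = F.kl_div(
        F.log_softmax(s_global, dim=-1),
        F.softmax(t_global, dim=-1),
        reduction="batchmean",
    )

    # Local alignment: match teacher patch tokens at masked positions.
    loss_local = F.smooth_l1_loss(s_patches[mask], t_patches[mask])

    loss = loss_global + lam * loss_local
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refresh the teacher with the dual-rate EMA after each step.
    dual_rate_ema_update(teacher, student)
    return loss.item()
```

Note that because the teacher is updated only through the EMA, a single forward pass per model suffices at each step, which is where the claimed latency advantage over multi-pass methods would come from.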
On ViT-Base, MiDEA reduces ImageNet-C error to **38.1%**, 18 percentage points below a frozen model and 5 points below the previous CTTA state of the art, while running **3× faster** than recent multi-pass methods. Notably, accuracy remains stable even at a batch size of 1, making MiDEA practical for real-time, memory-constrained deployments.
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 15325