Keywords: Test-Time-Training, Robustness, Layer-Locking
TL;DR: LQN is a lightweight MLP for efficient test-time training and on-the-fly adaptation to OOD dense tasks such as segmentation; it converges quickly and reduces GFLOPs.
Abstract: Vision–Language Models (VLMs) struggle to generalize to out-of-distribution (OOD) samples, for which conventional fine-tuning is infeasible.
Test-Time Training (TTT) adapts models to each incoming test sample, yet current methods rely on heavy data augmentation and repeated forward/backward passes through the full VLM, incurring high computational cost.
We introduce Layer Query Network (LQN), a lightweight five-layer MLP that adapts a frozen VLM in one forward pass.
LQN employs Binding to distill randomly sampled intermediate-layer tokens from the VLM via 3D positional embeddings, and
Recirculation to self-supervise spatial invariance, yielding robust, spatially consistent feature predictions. This design removes the need to fine-tune the entire VLM, achieving faster convergence and strong dense-prediction performance that surpasses the teacher VLM.
Evaluated across 16 benchmarks spanning natural distribution shifts and cross-dataset generalization, LQN achieves 15\% faster test-time training on ImageNet-Val compared to the state-of-the-art TPS. In segmentation tasks, LQN surpasses Mask2Former on COCO, Cityscapes, and ADE20K while reducing GFLOPs by up to 11\%. Our code will be released upon acceptance.
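The abstract outlines the architecture only at a high level. The snippet below is a minimal, hypothetical sketch of how such an adapter could look: a five-layer MLP over frozen intermediate-layer tokens with 3D (layer, row, column) positional embeddings standing in for Binding, and a flip-consistency loss standing in for Recirculation. All names, shapes, and the pooling/consistency choices are assumptions, not the authors' released implementation.

```python
# Hypothetical LQN-style adapter sketch (assumed shapes and module names).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerQueryNetworkSketch(nn.Module):
    """Five-layer MLP mapping intermediate-layer VLM tokens to dense features.

    Assumed input: tokens of shape (B, L, H, W, D) gathered from L randomly
    sampled intermediate layers of a frozen VLM, on an H x W token grid.
    """

    def __init__(self, dim=768, hidden=1024, out_dim=512, num_layers=12, grid=14):
        super().__init__()
        # "Binding": learnable 3D positional embeddings over (layer, row, col).
        self.layer_pe = nn.Parameter(torch.zeros(num_layers, 1, 1, dim))
        self.row_pe = nn.Parameter(torch.zeros(1, grid, 1, dim))
        self.col_pe = nn.Parameter(torch.zeros(1, 1, grid, dim))
        # Lightweight five-layer MLP head (the frozen VLM is never updated).
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, tokens, layer_ids):
        # tokens: (B, L, H, W, D); layer_ids: (L,) indices of the sampled layers.
        pe = self.layer_pe[layer_ids] + self.row_pe + self.col_pe  # (L, H, W, D)
        x = tokens + pe.unsqueeze(0)
        x = x.mean(dim=1)   # pool over the sampled layers -> (B, H, W, D)
        return self.mlp(x)  # dense features, shape (B, H, W, out_dim)


def recirculation_loss(lqn, tokens, layer_ids):
    """Assumed self-supervised objective: predictions should be consistent
    under a spatial transform (here, a horizontal flip of the token grid)."""
    pred = lqn(tokens, layer_ids)
    pred_flipped = lqn(tokens.flip(dims=(3,)), layer_ids).flip(dims=(2,))
    return F.mse_loss(pred, pred_flipped)
```

Under these assumptions, test-time adaptation would update only the adapter's parameters with `recirculation_loss` on each incoming sample, which is what keeps the per-sample cost to a single pass through the small MLP rather than repeated passes through the full VLM.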
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6304