Layer Query Network for Test-Time Training in Vision-Language Models

ICLR 2026 Conference Submission 6304 Authors

15 Sept 2025 (modified: 23 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Test-Time-Training, Robustness, Layer-Locking
TL;DR: LQN is a lightweight MLP for efficient test-time training and on-the-fly adaptation to OOD dense tasks such as segmentation; it converges quickly while reducing GFLOPs.
Abstract: Vision–Language Models (VLMs) struggle to generalize to out-of-distribution (OOD) samples, where conventional fine-tuning is infeasible. Test-Time Training (TTT) adapts models to each incoming test sample, yet current methods rely on heavy data augmentation and repeated forward/backward passes through the full VLM, incurring high computational cost. We introduce the Layer Query Network (LQN), a lightweight five-layer MLP that adapts a frozen VLM in a single forward pass. LQN employs Binding to distill randomly sampled intermediate-layer tokens from the VLM via 3D positional embeddings, and Recirculation to self-supervise spatial invariance and predict robust, spatially consistent features. This design removes the need to fine-tune the entire VLM, achieving faster convergence and strong dense-prediction performance that surpasses the teacher VLM. Evaluated across 16 benchmarks spanning natural distribution shifts and cross-dataset generalization, LQN achieves 15\% faster test-time training on ImageNet-Val compared to the state-of-the-art TPS. In segmentation tasks, LQN surpasses Mask2Former on COCO, Cityscapes, and ADE20K while reducing GFLOPs by up to 11\%. Our code will be released upon acceptance.
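
The sketch below illustrates the idea described in the abstract: a small five-layer MLP queries randomly sampled intermediate-layer tokens of a frozen VLM, tags them with 3D positional embeddings (layer index, row, column), and is trained at test time with a spatial-consistency ("Recirculation") objective. This is a minimal illustration under assumed shapes and names (`LayerQueryNetwork`, `bind_tokens`, `recirculation_loss` and all hyperparameters are hypothetical), not the authors' implementation.

```python
# Hypothetical PyTorch sketch of the LQN idea from the abstract; names,
# shapes, and losses are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerQueryNetwork(nn.Module):
    """Lightweight five-layer MLP that adapts tokens from a frozen VLM."""

    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        dims = [dim, hidden, hidden, hidden, hidden, dim]  # five linear layers
        layers = []
        for i in range(5):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < 4:
                layers.append(nn.GELU())
        self.mlp = nn.Sequential(*layers)
        # 3D positional embedding: (layer index, row, col) -> feature dim
        self.pos = nn.Linear(3, dim)

    def forward(self, tokens, positions):
        # tokens:    (B, N, dim) intermediate-layer tokens from the frozen VLM
        # positions: (B, N, 3)   normalized (layer, row, col) coordinates
        return self.mlp(tokens + self.pos(positions))


def bind_tokens(vlm_features, num_samples=64):
    """Binding (sketch): randomly sample intermediate-layer tokens and attach
    their 3D positions. vlm_features: (L, H, W, dim) stacked layer outputs."""
    L, H, W, D = vlm_features.shape
    flat = vlm_features.reshape(L * H * W, D)
    idx = torch.randperm(L * H * W)[:num_samples]
    l, r, c = idx // (H * W), (idx % (H * W)) // W, idx % W
    pos = torch.stack([l / L, r / H, c / W], dim=-1).float()
    return flat[idx].unsqueeze(0), pos.unsqueeze(0)


def recirculation_loss(lqn, tokens, positions):
    """Recirculation (sketch): feed the LQN its own prediction with spatially
    jittered positions and enforce consistency (spatial invariance)."""
    pred = lqn(tokens, positions)
    jitter = positions + 0.01 * torch.randn_like(positions)
    pred_jit = lqn(pred.detach(), jitter)  # recirculate predicted features
    return F.mse_loss(pred_jit, pred.detach())


if __name__ == "__main__":
    lqn = LayerQueryNetwork(dim=768)
    fake_vlm = torch.randn(12, 14, 14, 768)      # 12 layers of 14x14 tokens
    tokens, pos = bind_tokens(fake_vlm)
    loss = recirculation_loss(lqn, tokens, pos)  # one test-time training step
    loss.backward()                              # only the MLP receives gradients
```

Because only the MLP's parameters receive gradients, each test-time update avoids backpropagating through the frozen VLM, which is consistent with the abstract's claim of faster convergence and lower GFLOPs.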
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6304