Echoes of the Visual Past: Test-Time Prompt Tuning with Multi-Scale Visual Memory

11 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: test-time prompt tuning, vision-language models, foundation models
Abstract: Test-time prompt tuning (TPT) aims to adapt pre-trained vision-language models (VLMs) to various downstream tasks by learning textual prompts from unlabeled data at test time. However, existing TPT methods exhibit a performance gap compared to prompt-engineering-based methods that leverage hand-crafted or LLM-generated prompts for VLM adaptation. We attribute this gap to a core limitation of previous TPT approaches: they learn prompts from only the limited class-specific visual knowledge available in a single test image. As a result, the learned prompts underperform hand-crafted and LLM-generated prompts enriched with diverse, class-specific knowledge. To address this limitation, we propose $\textbf{T}$est-time $\textbf{P}$rompt $\textbf{T}$uning with $\textbf{M}$ulti-scale visual $\textbf{M}$emory $(\text{M}^2\text{TPT})$. Specifically, the memory is constructed to store previously seen class-relevant image patches as multi-scale visual descriptions for each class. At test time, each test image is used to query the memory, and the textual prompt is learned from both the test image itself and the retrieved class-relevant visual memory. Additionally, we introduce a holistic visual memory to better handle holistic visual recognition tasks that require global image-level context, and an irrelevance suppression strategy to mitigate the impact of noisy memory entries at test time. We evaluate our method on 15 commonly used benchmark datasets and show that it outperforms existing TPT methods. Furthermore, our framework can incorporate human-designed prompts and achieves state-of-the-art performance compared to recent VLM adaptation methods that use hand-crafted or LLM-generated prompts.
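To make the described retrieval-then-tune procedure concrete, below is a minimal sketch of a memory query followed by a single prompt-update step. It assumes a CLIP-like setup with L2-normalized features; all names (`retrieve_class_memory`, `tpt_step`, `visual_memory`, `text_encoder`) are hypothetical placeholders, not the authors' implementation, and the memory-alignment loss here is only one plausible way to use the retrieved entries.

```python
import torch
import torch.nn.functional as F

def retrieve_class_memory(test_feat, visual_memory, top_k=4):
    """Query the visual memory with a test-image feature.

    visual_memory: list (one entry per class) of tensors of shape
    (num_stored_patches, dim) holding past class-relevant patch features.
    Returns, for each class, the top_k stored features most similar to the
    test image.
    """
    retrieved = []
    for class_feats in visual_memory:
        sims = F.cosine_similarity(test_feat.unsqueeze(0), class_feats, dim=-1)
        idx = sims.topk(min(top_k, class_feats.size(0))).indices
        retrieved.append(class_feats[idx])
    return retrieved

def tpt_step(prompt_embeds, test_feat, retrieved, text_encoder, lr=5e-3, tau=0.01):
    """One test-time prompt-tuning step.

    Minimizes the prediction entropy on the test image (standard TPT objective)
    plus an alignment term that pulls each class text feature toward its
    retrieved visual-memory entries.
    """
    prompt_embeds = prompt_embeds.clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([prompt_embeds], lr=lr)

    text_feats = F.normalize(text_encoder(prompt_embeds), dim=-1)  # (C, dim)
    logits = test_feat @ text_feats.t() / tau                      # (C,)
    entropy = -(logits.softmax(-1) * logits.log_softmax(-1)).sum()

    mem_loss = sum(
        (1 - F.cosine_similarity(text_feats[c].unsqueeze(0), mem, dim=-1)).mean()
        for c, mem in enumerate(retrieved)
    ) / len(retrieved)

    (entropy + mem_loss).backward()
    optimizer.step()
    return prompt_embeds.detach()
```

In this sketch the holistic visual memory and the irrelevance suppression strategy mentioned in the abstract are omitted; the latter would correspond to down-weighting or discarding retrieved entries whose similarity to the test image falls below a threshold before computing the alignment term.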
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 18659