Plug-and-Play Global Memory via Test-Time Registers

07 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Attention Sink, Image Generation
TL;DR: Plug-and-Play Global Memory
Abstract: A well-known challenge in large attention-based architectures, e.g., Vision Transformers (ViTs) and large language models (LLMs), is the emergence of attention sinks, where a small subset of tokens disproportionately attracts attention and ultimately degrades downstream performance. Prior work typically casts these high-norm tokens in ViTs as "computational scratchpads" and analogously characterizes sinks in LLMs as pressure valves that absorb surplus attention while carrying little semantic content. This work revisits that view from the perspective of vision models and shows that, in large-scale vision architectures, register features in fact encode global, task-relevant information and can act as a plug-and-play global memory. Building on this insight, a training-free framework is proposed that leverages test-time register tokens as global priors to enhance both dense prediction and generative tasks, while mitigating adverse sink effects. In contrast to heuristic placement rules, a theoretically grounded token-neuron interpolation rule is introduced for robust, model-agnostic insertion. Experiments demonstrate improved generative quality and stronger text-image alignment for one-dimensional token generation with ViTs, reflected by gains in FID, IS, CLIPScore, and SigLIP metrics.
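To make the "plug-and-play" idea concrete, the sketch below shows one way test-time register tokens could be prepended to a frozen ViT-style encoder so that all patch tokens can attend to them as shared global memory slots, without any retraining. This is an illustrative assumption, not the authors' implementation: the class name `TestTimeRegisters`, the parameter `num_registers`, and the random initialization of the registers are hypothetical (the paper instead derives register states via its token-neuron interpolation rule).

```python
# Minimal sketch (assumption, not the submission's code): appending register
# tokens at inference time to a frozen token-mixing encoder so they act as a
# global memory that patch tokens can attend to.
import torch
import torch.nn as nn


class TestTimeRegisters(nn.Module):
    """Wraps a frozen encoder and prepends register tokens at inference only."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_registers: int = 4):
        super().__init__()
        self.encoder = encoder.eval()  # frozen backbone, no fine-tuning
        # Register embeddings; random init here for illustration. In the paper's
        # setting they would instead be derived from existing high-norm ("sink")
        # activations rather than learned or sampled.
        self.registers = nn.Parameter(
            torch.randn(1, num_registers, embed_dim) * 0.02, requires_grad=False
        )
        self.num_registers = num_registers

    @torch.no_grad()
    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, embed_dim)
        b = patch_tokens.size(0)
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([regs, patch_tokens], dim=1)  # registers join the sequence
        x = self.encoder(x)                         # self-attention mixes registers with patches
        return x[:, self.num_registers:, :]         # drop registers before downstream heads


# Toy usage with a stand-in encoder (two generic Transformer layers).
if __name__ == "__main__":
    dim = 64
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
        num_layers=2,
    )
    model = TestTimeRegisters(backbone, embed_dim=dim, num_registers=4)
    tokens = torch.randn(2, 196, dim)  # e.g., 14x14 patch tokens
    out = model(tokens)
    print(out.shape)                   # torch.Size([2, 196, 64])
```

Because the registers are only concatenated to the input sequence and stripped from the output, the wrapper leaves the backbone's weights and downstream heads untouched, which is what makes the mechanism usable as a drop-in, training-free addition.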
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2784