Change of Thought: Adaptive Test-Time Computation

ICLR 2026 Conference Submission 21708 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: deep learning architecture, fixed point iteration, adaptive test-time
TL;DR: Fixed-Point Self-Attention replaces token autoregression with latent alignment refinement, enabling iterative reasoning within self-attention; by updating alignment matrices through fixed-point iterations, it achieves a 25% improvement with parameter reuse.
Abstract: Standard Transformers apply a fixed amount of computation to every token, limiting their expressive power, while more powerful iterative approaches often introduce significant architectural complexity and cost. We introduce Fixed-Point Self-Attention (FPSA), a parameter-free, drop-in replacement for self-attention that enables a model to adaptively ``think longer'' by iteratively refining each layer's representations to a fixed point. We train this recurrent process end-to-end using implicit differentiation, ensuring that memory usage during training and inference remains constant and identical to a standard Transformer layer, regardless of the number of refinement steps. Without adding any parameters, FPSA significantly improves strong baselines like BERT-Base and ELECTRA-Base on the GLUE and SQuAD v2.0 benchmarks. We demonstrate similar consistent gains for vision (ViT-B/16) and vision-language models, achieving accuracy improvements of up to 20\%. This performance boost comes at a modest computational cost: a median of 3--6 refinement steps results in a $\approx1.6\times$ GFLOPs and $\approx1.3-1.4\times$ latency overhead compared to an equivalent BERT-Base model. Analysis shows FPSA dynamically allocates compute to challenging inputs and converges to stable fixed points. Furthermore, integrating FPSA into language models improves performance on complex reasoning tasks like GSM8K, BBH, and LogiQA. Ultimately, FPSA bridges the gap between fixed-computation and iterative reasoning, offering a powerful building block that adaptively allocates compute while preserving architectural simplicity.
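Illustrative sketch (not from the submission): the abstract describes refining each layer's representations to a fixed point with constant training memory, but provides no code. The PyTorch snippet below is a minimal sketch of that idea under stated assumptions; the class name, hyperparameters (max_steps, tol), and the one-step gradient re-attachment (a common constant-memory stand-in for exact implicit differentiation) are illustrative choices, not the authors' actual FPSA implementation.

```python
# Hedged sketch of a fixed-point self-attention block.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn

class FixedPointSelfAttention(nn.Module):
    """Iterates z <- f(z, x) until approximate convergence, where f is a
    standard self-attention sub-layer. Gradients flow only through a final
    re-attached step (an approximation to exact implicit differentiation),
    so training memory stays constant in the number of refinement steps."""

    def __init__(self, d_model=768, n_heads=12, max_steps=12, tol=1e-3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.max_steps = max_steps
        self.tol = tol

    def step(self, z, x):
        # One refinement step: attend over the current estimate z while
        # keeping a residual connection to the layer input x.
        attn_out, _ = self.attn(z, z, z, need_weights=False)
        return self.norm(x + attn_out)

    def forward(self, x):
        z = x
        # Fixed-point iteration without building an autograd graph,
        # so memory does not grow with the number of steps.
        with torch.no_grad():
            for _ in range(self.max_steps):
                z_next = self.step(z, x)
                converged = (z_next - z).norm() / (z.norm() + 1e-8) < self.tol
                z = z_next
                if converged:
                    break
        # Re-attach one step so gradients reach the attention parameters.
        return self.step(z.detach(), x)

# Usage: drop-in replacement for a self-attention sub-layer.
layer = FixedPointSelfAttention()
tokens = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
out = layer(tokens)
print(out.shape)                   # torch.Size([2, 16, 768])
```

The early-exit check on the relative update norm mirrors the adaptive behavior described in the abstract: easy inputs converge in few iterations, while harder inputs consume more refinement steps.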
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21708