Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: efficient reasoning, fixed test-time cost, small transformers, training-time priors, length-aware attention, fuzzy regime-position alignment (RPA), attention bias, gain-aware controller, late-phase optimization, compute parity, cross-entropy reduction, latency/memory unchanged, structured priors, KL-regularized MAP, validation-aware sharpening, long-span linkage, length generalization, long-context reasoning, retrieval and routing, language modeling (WikiText-2)
TL;DR: We add a zero-parameter, length-aware attention prior (RPA) and a tiny, training-only gain-aware controller that preserve late-phase improvements and cut WikiText-2 cross-entropy without changing inference latency or memory.
Abstract: We study efficient reasoning under tight compute: how to make structured, correct decisions without increasing test-time cost. We add two components to small/medium Transformers that also transfer to broader differentiable optimizers while leaving test-time cost unchanged. First, a length-aware attention prior built via fuzzy regime-position alignment (RPA) yields a normalized pre-softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal, training-only gain-aware controller (Guardian) nudges attention sharpness only when validation improvements warrant it, following a two-timescale policy-gradient view of nonconvex optimization; it is disabled at inference. A KL perspective interprets $\mathrm{softmax}(z+\log \pi)$ as a MAP estimate under KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText-2, we reduce validation cross-entropy while matching baseline latency and memory. At inference, the precomputed, cached prior $B(T)$ is applied as a single additive bias per head, and the controller does not run; in practice this incurs negligible overhead, with no measurable p50 latency shift. Our results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time cost effectively unchanged.
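To make the mechanism concrete, the following is a minimal sketch (not the authors' code) of how a precomputed, per-head log-prior can be added as a single pre-softmax bias, i.e. $\mathrm{softmax}(z+\log \pi)$. The function name `attention_with_prior` and the argument `prior_log_pi` are hypothetical placeholders for the RPA-built bias $B(T)$ described in the abstract; the RPA construction itself and the Guardian controller are not shown.

```python
# Minimal sketch, assuming a PyTorch setting; names below are illustrative,
# not the paper's implementation.
import torch
import torch.nn.functional as F

def attention_with_prior(q, k, v, prior_log_pi):
    """
    q, k, v:       (batch, heads, T, d) query/key/value tensors
    prior_log_pi:  (heads, T, T) cached log-prior (a zero tensor recovers
                   standard unbiased attention)
    """
    d = q.size(-1)
    # Standard scaled dot-product logits z
    z = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (batch, heads, T, T)
    # Single additive bias per head; the prior is precomputed and cached,
    # so the inference-time cost is one broadcast add.
    z = z + prior_log_pi.unsqueeze(0)
    # softmax(z + log pi): the KL/MAP view of the biased attention weights
    attn = F.softmax(z, dim=-1)
    return torch.matmul(attn, v)

# Usage: a zero log-prior reproduces the baseline attention output shape.
B, H, T, D = 2, 4, 16, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
out = attention_with_prior(q, k, v, torch.zeros(H, T, T))
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

Because the bias is additive in the logits, it adds no parameters to the attention layers and can be cached once per sequence length, which is consistent with the compute-parity claim above.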
Submission Number: 204