Keywords: Chain of thought, LLMs, Steering
Abstract: Large language models (LLMs) demonstrate strong performance on multi-step reasoning tasks by producing intermediate explanations, commonly referred to as chains of thought (CoTs). However, the generated rationales are typically verbose, consuming many extra tokens and thus degrading throughput and increasing inference energy. Interestingly, we find that verbose and concise CoTs occupy distinct regions in the model's intermediate activation space, suggesting that verbosity is a steerable latent attribute. Building on this observation, we develop an inference-time method that automatically steers model responses toward concise reasoning traces without updating model parameters. Our method, dubbed \textit{ASC} (Activation-Steered Compression), generates concise CoTs by directly adjusting internal representations via activation steering. A key component of ASC is \textbf{Contrastive Energy-Based Steering (CES)}, a principled procedure for learning a \emph{single} steering vector from a small set of verbose-vs-concise CoT pairs by optimizing a length-normalized contrastive energy objective. To further ensure reliable steering and preserve general utility, CES enforces a differentiable \textbf{KL trust region} during steering-vector optimization, explicitly constraining the induced distribution shift to a specified budget. With only 100 verbose-vs-concise example pairs, ASC reduces generated token length by as much as 67.4\% on MATH500, GSM8K, and LiveCodeBench while maintaining accuracy across models with 1.5B, 7B, 8B, and 32B parameters. On MATH500, ASC achieves an end-to-end inference speed-up of 2.7$\times$ on an 8B model.
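The steering idea described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration: it substitutes a simple difference-of-means steering vector for the paper's full CES procedure (the length-normalized contrastive energy objective and KL trust region are only noted in comments), and every function name below is an assumption, not the paper's API.

```python
import numpy as np

# Hypothetical sketch of ASC-style activation steering. The actual CES
# method learns the vector by optimizing a length-normalized contrastive
# energy objective under a differentiable KL trust region; here a
# simplified difference-of-means vector stands in for that optimization.

def mean_activation(cots):
    # Each CoT is a (num_tokens, hidden_dim) activation matrix.
    # Averaging over tokens first acts as a crude length normalization,
    # so long verbose traces do not dominate the estimate.
    return np.mean([a.mean(axis=0) for a in cots], axis=0)

def fit_steering_vector(verbose_acts, concise_acts):
    # Point from the "verbose" region of activation space toward the
    # "concise" region, then normalize to unit length.
    v = mean_activation(concise_acts) - mean_activation(verbose_acts)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=1.0):
    # Inference-time intervention: shift a hidden state along v.
    # Model weights are never updated; the scale alpha plays the role
    # that the KL budget bounds in the full method.
    return hidden + alpha * v
```

A usage sketch: collect intermediate activations for a small set of verbose-vs-concise CoT pairs, fit `v` once, then add `alpha * v` to the chosen layer's hidden states at every decoding step.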
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 6452