Keywords: Chain of thought, LLMs, steering vectors, compression
Abstract: Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as _chains of thought_ (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose CoTs and concise CoTs occupy distinct regions in the model's residual-stream activation space. By extracting a _steering vector_ that captures the transition between these regions and injecting it at inference time, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining.
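As a rough illustration of this extraction-and-injection idea, the sketch below computes a steering vector as the difference of mean residual-stream activations between concise and verbose CoT examples and adds it back via a forward hook. This is a minimal sketch under our own assumptions: the function names, the single-layer choice, and the Llama-style Hugging Face layer layout (`model.model.layers`) are ours, not the paper's reference implementation.

```python
import torch

def extract_steering_vector(model, tokenizer, verbose_cots, concise_cots, layer_idx):
    """Steering vector = mean residual-stream activation of concise CoTs
    minus that of verbose CoTs, taken at one decoder layer."""
    def mean_activation(texts):
        acts = []
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            # hidden_states[layer_idx]: (1, seq_len, d_model); average over tokens
            acts.append(out.hidden_states[layer_idx].mean(dim=1))
        return torch.cat(acts).mean(dim=0)
    return mean_activation(concise_cots) - mean_activation(verbose_cots)

def add_steering_hook(model, vector, layer_idx, alpha):
    """Add alpha * vector to the residual stream at one layer during
    generation. Llama-style decoder layers return a tuple whose first
    element is the hidden states."""
    def hook(module, inputs, output):
        return (output[0] + alpha * vector.to(output[0].dtype),) + output[1:]
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned hook handle can be removed (`handle.remove()`) once generation finishes, so the base model is left untouched.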
We formalize this approach as **Activation-Steered Compression (ASC)**, an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of ASC's impact on the output distribution, deriving a closed-form KL-divergence bound that we use to regulate the steering strength. Using only 50 paired verbose and concise examples, ASC achieves up to **67.43\%** reduction in CoT length on the MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average **2.73×** speedup in end-to-end reasoning wall-clock time on an 8B model. The code is available at https://github.com/ArminAzizi98/ASC.
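The closed-form bound itself is not reproduced in the abstract; as a sketch under our own assumptions, one empirical stand-in is to shrink the steering strength until the KL divergence between the steered and unsteered next-token distributions falls within a budget ε. The helper reuses the hypothetical `add_steering_hook` from the sketch above; `epsilon` and the starting `alpha` are illustrative parameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def calibrate_alpha(model, input_ids, vector, layer_idx, epsilon, alpha=8.0):
    """Halve the steering strength until KL(steered || unsteered) over the
    next-token distribution is at most epsilon. This is an empirical proxy
    for the paper's closed-form KL-divergence bound, not the bound itself."""
    with torch.no_grad():
        base_logp = F.log_softmax(model(input_ids).logits[:, -1], dim=-1)
    while alpha > 1e-3:
        handle = add_steering_hook(model, vector, layer_idx, alpha)
        with torch.no_grad():
            steered_logp = F.log_softmax(model(input_ids).logits[:, -1], dim=-1)
        handle.remove()
        # F.kl_div(input, target) with log_target=True computes KL(target || input)
        kl = F.kl_div(base_logp, steered_logp, log_target=True, reduction="batchmean")
        if kl.item() <= epsilon:
            return alpha
        alpha /= 2.0
    return 0.0
```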
Submission Number: 78