RSCE: Training-Free Residual Stream Encoding for Persistent Context Amortization

Published: 01 Apr 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: residual stream, context compression, retrieval-augmented generation, activation injection, prompt compression, training-free, context amortization, long-context inference
Abstract: We propose Residual Stream Context Encoding (RSCE), a training-free method that eliminates redundant long-context prefill costs in retrieval-augmented generation. Given a context document ctx, RSCE extracts a vector C ∈ R^{d_M} by mean-pooling residual stream activations at a calibrated intermediate layer f(M), then injects it as an additive shift at query time, replacing O(|T(ctx)|) attention prefill with an O(1) operation that requires zero per-query context forward passes. For tasks requiring factual precision, we pair C with a compact explicit fact block F, forming a dual-channel representation amortized across N ≥ 2 queries. We evaluate five decoder-only architectures (7B–70B) on multi-document QA (LongBench, n = 108) and six architectures on cross-file code completion (RepoBench-C), comparing against LongLLMLingua and EHPC. A key mechanistic finding: vector injection alone suppresses parametric recall below the question-only baseline, a dual-pathway interference effect absent in behavioral steering that motivates the dual-channel design. At extreme compression (~99% token reduction), RSCE Vec+F is competitive with EHPC on smaller architectures (LLaMA-8B F1 0.333 vs. EHPC 0.334; DeepSeek-14B both 0.214), while both substantially outperform LongLLMLingua (0.209, 0.172). On larger models, EHPC's capacity-scaling token selection widens the gap, reaching F1 0.539 vs. RSCE 0.365 on LLaMA-70B, a finding we explain through model-capacity scaling of in-context reasoning. On RepoBench-C, LongLLMLingua substantially improves over baseline via compression-as-retrieval; RSCE is the only method achieving 81% compression at 100% operational reliability.
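The encode-then-inject mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the layer index, the scaling factor `alpha`, and the shapes are hypothetical placeholders, and real use would operate on a transformer's cached hidden states rather than random arrays.

```python
import numpy as np

def encode_context(hidden_states: np.ndarray, layer: int) -> np.ndarray:
    """Mean-pool residual-stream activations at one intermediate layer.

    hidden_states: (num_layers, ctx_len, d_model) activations from a single
    offline forward pass over the context document. Returns C in R^{d_model}.
    """
    return hidden_states[layer].mean(axis=0)

def inject(query_states: np.ndarray, C: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Additive shift at query time: O(1) in context length, since the
    context is never re-run through the model per query."""
    return query_states + alpha * C

# Toy demonstration with random activations (32 layers, 500 context tokens, d=64).
rng = np.random.default_rng(0)
ctx_acts = rng.normal(size=(32, 500, 64))
C = encode_context(ctx_acts, layer=16)      # one vector, reused across queries
q = rng.normal(size=(12, 64))               # hidden states for 12 query tokens
shifted = inject(q, C)
assert C.shape == (64,) and shifted.shape == q.shape
```

The point of the sketch is the cost asymmetry: `encode_context` touches the full context once, while each subsequent query pays only a vector addition, which is what amortization across N ≥ 2 queries buys.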
Presenter: ~Eric_Xu2
Format: No, the presenting author is unable to, or unlikely to be able to, attend in person.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 198