Keywords: long-context inference, context compression, attention-based compression, cross-attention fine-tuning, cross-family generalization, long-context code reasoning, resource-adaptive inference, foundation model adaptation
TL;DR: LongAttnComp is a fine-tuning-based long-context compressor that matches full-context performance and transfers across unrelated target model families on long-context code reasoning.
Abstract: As real-world applications increasingly require processing inputs of 100k+ tokens (retrieved documents, long conversations, or extended codebases) that approach or exceed standard LLM context windows, the gap between context length and inference efficiency has become a critical bottleneck. Long-context inference imposes significant memory and compute costs, motivating efficient context compression. We observe that long-context task performance decomposes into retrieval and reasoning, and that existing training-free attention-based compression methods leave a substantial gap on the retrieval step in demanding long-context settings such as code understanding. We present LongAttnComp, a long-context adaptation of AttnComp (Luo et al., 2025) that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, and a format-agnostic query parser. Trained solely on NIAH-style data, LongAttnComp matches or exceeds full-context performance and substantially outperforms training-free baselines across multiple target models on InfiniteBench Code-Debug. Per-task analysis on RULER and LongBench v2 confirms LongAttnComp's strength on long-context code reasoning, while indicating that broader long-context applicability is achievable through more diverse training data.
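The abstract names token-level chunking and a token-budget top-p algorithm but does not define them. The sketch below shows one hypothetical reading and is not the authors' algorithm. It assumes per-chunk relevance scores (for example, aggregated cross-attention weights) are already computed. The function names `chunk_tokens` and `token_budget_top_p` and the stopping rule are illustrative assumptions: chunks are visited in descending score order until the kept chunks cover a fraction `p` of the total score mass, and any chunk that would push the total past a fixed token budget is skipped.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens.

    Hypothetical helper: chunking by token count rather than by lines or sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(token_ids[i:i + chunk_size]) for i in range(0, len(token_ids), chunk_size)]


def token_budget_top_p(
    scores: Sequence[float],
    chunk_lengths: Sequence[int],
    p: float,
    token_budget: int,
) -> List[int]:
    """Return the indices of kept chunks, in original document order.

    Assumed rule (illustrative only): visit chunks by descending score, stop once
    the kept chunks cover at least a fraction p of the total score mass, and skip
    any chunk whose tokens would exceed token_budget.
    """
    total = sum(scores)
    if total <= 0:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    mass, used = 0.0, 0
    for i in order:
        if mass >= p * total:
            break  # top-p mass reached
        if used + chunk_lengths[i] > token_budget:
            continue  # this chunk does not fit in the remaining budget
        kept.append(i)
        mass += scores[i]
        used += chunk_lengths[i]
    # Restore document order so the compressed context keeps the original layout.
    return sorted(kept)


if __name__ == "__main__":
    chunks = chunk_tokens(list(range(10)), 4)
    lengths = [len(c) for c in chunks]
    print(token_budget_top_p([0.1, 0.6, 0.3], lengths, p=0.8, token_budget=6))
```

With chunk scores `[0.1, 0.6, 0.3]` over chunks of 4, 4, and 2 tokens, `p=0.8` and a 6-token budget keep chunks 1 and 2. Returning indices in document order, rather than in score order, is a design choice meant to preserve the original structure of the context (for example, code) for the target model.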
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 61