Keywords: Large language models, context compression, semantic proximity
TL;DR: Compress LLM context by summarizing semantically redundant turns, selected via embedding similarity (optionally blended with recency) and extended to cluster-level summaries, avoiding extra LLM calls and outperforming FIFO on an augmented LongMemEval benchmark.
Abstract: LLMs are increasingly bottlenecked by fixed context windows, motivating principled compression of conversational histories. We study semantic-redundancy–aware compression, in which we pair human–assistant turns, embed them, and summarize those that are most semantically overlapping. We introduce STAE (Semantic-Temporal Aware Eviction), a centroid–temporal hybrid policy that scores each pair by a convex combination of semantic distance to a conversation centroid and recency (weighted by $\beta$), alongside an inverted variant and a cluster-aware compressor that summarizes whole embedding-space clusters. Crucially, redundancy is detected from embeddings using lightweight centroid/cluster arithmetic without extra LLM calls, reducing token usage and inference cost. To evaluate retrieval under compression, we augment LongMemEval with a 20-needle-per-dialogue benchmark, addressing the brittleness of single-needle tests and enabling finer-grained measurement of information retention. On this benchmark, summarizing pairs closest to the centroid outperforms FIFO across compression regimes, whereas compressing pairs furthest from the centroid degrades performance at stricter budgets; moreover, local STAE applied within temporal or semantic groups closely matches a strong temporal upper bound and consistently surpasses global eviction at the same ECR, with the inverted (evict-lowest) variant preserving more needles. We also show that clustered summarization of semantically or temporally similar message pairs provides a strong chunking strategy for compression. The takeaway is simple and actionable: compress where redundancy is highest, measured explicitly via semantic similarity in embedding space, freeing tokens with minimal information loss.
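For concreteness, the sketch below illustrates the kind of centroid–temporal score the abstract describes: each (human, assistant) pair is scored by a convex combination of its semantic distance to the conversation centroid and its recency, weighted by $\beta$. The function name `stae_scores`, the cosine-distance choice, the recency normalization, and the sign conventions are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stae_scores(pair_embeddings: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Score each (human, assistant) pair embedding for compression.

    Returns a convex combination of (a) semantic distance to the
    conversation centroid and (b) an age-based recency term, weighted
    by ``beta``. Normalization and sign conventions here are assumptions
    for illustration only.
    """
    n = len(pair_embeddings)
    centroid = pair_embeddings.mean(axis=0)

    # Cosine distance to the centroid: small distance means the pair sits
    # near the conversation's semantic center (a redundancy proxy).
    norms = np.linalg.norm(pair_embeddings, axis=1) * np.linalg.norm(centroid)
    cos_sim = pair_embeddings @ centroid / np.clip(norms, 1e-12, None)
    sem_dist = 1.0 - cos_sim

    # Recency term: older pairs (earlier indices) get larger values.
    age = (n - 1 - np.arange(n)) / max(n - 1, 1)

    # Convex combination; beta weights the temporal component.
    return (1.0 - beta) * sem_dist + beta * age
```

Under this reading, the standard policy would summarize the pairs with the smallest semantic-distance component (closest to the centroid), while the inverted (evict-lowest) variant flips which end of the ranking is compressed; which direction applies at a given budget is determined by the variant being evaluated.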
Primary Area: generative models
Submission Number: 25375