Block-Aware Semantic Efficient Retention for Long-Context LLM Inference

ACL ARR 2026 January Submission 7883 Authors

06 Jan 2026 (modified: 20 Mar 2026), License: CC BY 4.0
Keywords: Long-Context Inference, Sparse Attention, Semantic Efficiency, Large Language Models
Abstract: In long-context understanding and reasoning, large language models often struggle to focus on semantically relevant information because attention is dispersed over lengthy inputs, leading to degraded semantic modeling. Existing sparse attention methods typically rely on fixed patterns or posterior filtering and lack explicit prior modeling of contextual importance. We propose BSER (Block-Aware Semantic Efficient Retention), an attention-guided semantic sparsification approach that performs hierarchical relevance modeling before attention computation. BSER dynamically retains the context blocks most relevant to the query and applies local context expansion to preserve semantic coherence. The approach is training-free and can be seamlessly integrated into off-the-shelf language models. Experiments on multiple long-context benchmarks demonstrate that BSER consistently improves performance while significantly reducing inference cost. Code is available at https://anonymous.4open.science/r/BSER-F488/.
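To make the abstract's description concrete, the following is a minimal sketch of query-guided block retention with local context expansion, written only from the high-level description above; it is not the authors' implementation. The function name `bser_style_block_retention` and the parameters `block_size`, `top_k`, and `expand` are illustrative assumptions, and the block relevance score (mean scaled query-key similarity per block) is one plausible choice of prior.

```python
import torch

def bser_style_block_retention(query, keys, block_size=64, top_k=8, expand=1):
    """Illustrative sketch (not the paper's code): score key blocks against the
    query, retain the top-k most relevant blocks before attention, and expand
    each retained block with its neighbors to preserve local coherence.

    query: (d,) tensor for the current query token
    keys:  (n, d) tensor of context key vectors
    Returns a sorted list of retained token indices.
    """
    n, d = keys.shape
    num_blocks = (n + block_size - 1) // block_size

    # Block-level relevance prior: mean scaled query-key similarity per block.
    scores = torch.empty(num_blocks)
    for b in range(num_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        scores[b] = (blk @ query).mean() / d ** 0.5

    # Prior-driven retention: keep the most relevant blocks.
    top_blocks = torch.topk(scores, min(top_k, num_blocks)).indices.tolist()

    # Local context expansion: also keep neighboring blocks of retained ones.
    retained = set()
    for b in top_blocks:
        for nb in range(max(0, b - expand), min(num_blocks, b + expand + 1)):
            retained.add(nb)

    # Map retained blocks back to token indices for sparse attention.
    indices = []
    for b in sorted(retained):
        indices.extend(range(b * block_size, min((b + 1) * block_size, n)))
    return indices
```

In this reading, the retained indices would then be used to restrict the attention computation to the selected tokens, which is where the inference savings would come from; the exact scoring, hierarchy, and expansion rules in BSER may differ.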
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, pruning, NLP in resource-constrained settings
Contribution Types: Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7883