Reasoning-Focused Evaluation of Efficient Long-Context Inference Techniques

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: efficiency, KV cache compression, quantization, long-context, reasoning
TL;DR: We evaluate how weight quantization and token eviction methods can advance the Pareto frontier of memory and performance for reasoning models.
Abstract: Large reasoning models have demonstrated impressive capabilities in tackling increasingly complex tasks that often require generating long chains of thought. However, this capability comes with a substantial increase in inference cost. Despite this trend, existing evaluations of efficient long-context inference techniques have primarily focused on preserving model performance for long inputs. We argue that this focus overlooks two crucial aspects of reasoning models: (1) their ability to generate long outputs, and (2) their ability to synthesize dispersed information, that is, to reason over information distributed across both the input context and previously generated outputs. In this work, we systematically evaluate an array of efficient inference techniques across 12 tasks that differ in their requirements for long-output generation and information synthesis, on both instruction-tuned and reasoning models. We find that NF4 and Int8 weight quantization strongly outperform baselines and key-value (KV) cache token eviction methods in Pareto-optimality of memory and performance across the evaluated models, tasks, and context lengths. Existing long-context efficient inference techniques perform especially poorly on tasks with long outputs and high token dispersion, which are critical for reasoning. Specifically, our experiments suggest that token eviction methods struggle in these settings because they cannot reliably perform exact string retrieval, likely because they evict critical tokens during decoding. Finally, we find that using reasoning models can partially mitigate this degradation, recovering performance close to the full-cache baseline even at smaller cache sizes, thereby advancing the Pareto frontier of memory efficiency and long-context performance.
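For readers unfamiliar with the weight-quantization baselines named in the abstract, the snippet below is a minimal, illustrative sketch of loading a causal LM with NF4 (or Int8) weight quantization via Hugging Face transformers and bitsandbytes. The model name is a placeholder, and this is not necessarily the configuration used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model name; any Hugging Face causal LM is loaded the same way.
model_name = "Qwen/Qwen2.5-7B-Instruct"

# NF4 4-bit weight quantization; swap in BitsAndBytesConfig(load_in_8bit=True) for Int8.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Summarize the key idea of KV cache eviction in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that weight quantization reduces the memory footprint of the model parameters while leaving the KV cache intact, which is one way to read the abstract's finding that it preserves performance where cache eviction does not.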
Submission Number: 299
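Similarly, the sketch below illustrates the general idea behind attention-score-based KV-cache token eviction (in the spirit of heavy-hitter methods such as H2O), the class of methods the abstract argues can evict tokens that are later needed for exact string retrieval. The function and tensor names are illustrative assumptions, not the paper's implementation.

```python
import torch


def evict_kv_cache(keys, values, attn_weights, budget, recent_window=32):
    """Minimal sketch of heavy-hitter style KV-cache token eviction.

    keys, values: [num_heads, seq_len, head_dim] cached keys/values
    attn_weights: [num_heads, num_queries, seq_len] attention probabilities
                  accumulated over recent decoding steps
    budget: number of cached token positions to keep (assumed >= recent_window)
    """
    seq_len = keys.shape[1]
    if seq_len <= budget:
        return keys, values

    # Score each cached position by the attention mass it received, summed over heads/queries.
    scores = attn_weights.sum(dim=(0, 1))  # [seq_len]

    # Always keep the most recent positions; fill the rest of the budget with heavy hitters.
    scores[seq_len - recent_window:] = float("inf")
    keep = torch.topk(scores, k=budget).indices.sort().values

    return keys[:, keep, :], values[:, keep, :]
```

The risk the abstract points to is visible here: a token that received little attention so far is dropped permanently, so if a later decoding step needs to copy it verbatim (exact string retrieval), the information is gone.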