Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: Large Reasoning Models (LRMs), Efficient inference, Test-time scaling (TTS), Memory orchestration, Typed retrieval, Fact Cards, Controlled citation, Long-horizon reasoning, Context compression, Token efficiency, Latency reduction, LLM-as-Judge, LoCoMo benchmark, LongMemEval benchmark, HealthBench, Reuse don't recompute, efficient reasoning, LRM
TL;DR: We introduce ENGRAM-R, a compact memory layer that uses typed evidence and a citation policy to make LRMs reason efficiently during inference.
Abstract: Large reasoning models (LRMs) achieve strong accuracy through test-time scaling (TTS), generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should “think less” by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% versus full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness, but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
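To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of a fact-card memory with typed retrieval and a citation policy. It is not the paper's implementation; the names `FactCard`, `MemoryStore`, `retrieve`, and `build_prompt`, the card types, and the keyword-overlap scoring are all illustrative assumptions standing in for whatever retrieval and prompting machinery ENGRAM-R actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class FactCard:
    """One compact, typed evidence unit stored in the memory layer (illustrative)."""
    card_id: str
    card_type: str          # e.g. "event", "preference", "derived" (assumed type names)
    text: str               # short, self-contained statement of the fact
    source: str             # provenance pointer, e.g. a session or turn id


@dataclass
class MemoryStore:
    """Toy memory layer: typed retrieval over Fact Cards by keyword overlap."""
    cards: list[FactCard] = field(default_factory=list)

    def add(self, card: FactCard) -> None:
        self.cards.append(card)

    def retrieve(self, query: str, card_type: str | None = None, k: int = 5) -> list[FactCard]:
        # Restrict to the requested type, then rank by naive keyword overlap.
        query_terms = set(query.lower().split())
        pool = [c for c in self.cards if card_type is None or c.card_type == card_type]
        scored = sorted(
            pool,
            key=lambda c: len(query_terms & set(c.text.lower().split())),
            reverse=True,
        )
        return scored[:k]


def build_prompt(question: str, cards: list[FactCard]) -> str:
    """Compact prompt: a handful of retrieved cards plus a citation policy, instead of full context."""
    evidence = "\n".join(f"[{c.card_id}] ({c.card_type}) {c.text}" for c in cards)
    policy = (
        "Answer using ONLY the Fact Cards above. "
        "Cite the supporting card IDs in brackets; if no card supports the answer, say so."
    )
    return f"Fact Cards:\n{evidence}\n\n{policy}\n\nQuestion: {question}\nAnswer:"


# Usage: retrieve a few typed cards and assemble the citation-controlled prompt
# that replaces the full conversation history in the model's context.
store = MemoryStore()
store.add(FactCard("c1", "event", "Alice adopted a dog named Miso in March.", "session-3"))
store.add(FactCard("c2", "preference", "Alice prefers morning runs along the river.", "session-7"))
print(build_prompt("What pet does Alice have?", store.retrieve("Alice pet dog", k=2)))
```

Under these assumptions, the efficiency gains come from the prompt containing only a few short cards rather than the full dialogue, while the citation policy keeps the model's answer grounded in the retrieved evidence.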
Submission Number: 247