FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse

ACL ARR 2026 January Submission 2704 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Autonomous Agents, Latent Memory, Efficient Inference, Long-Context Reasoning, Computation Reuse, Uncertainty Estimation
Abstract: The stateless architecture of Large Language Models inherently lacks a mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose \textbf{FlashMem}, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a \textbf{Shared-KV Consolidator} to synthesize memory by attending directly to the backbone's frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free \textbf{Cognitive Monitor} leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by a factor of \textbf{5}, effectively bridging the gap between efficiency and persistent cognition. Our code is available at \url{https://anonymous.4open.science/r/FlashMem-2124}.
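To make the entropy-gated triggering idea concrete, here is a minimal sketch of how a parameter-free monitor could use the Shannon entropy of an attention distribution as an uncertainty proxy. The function names and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import math

def attention_entropy(attn_weights):
    """Shannon entropy (in nats) of one attention distribution,
    given as a list of probabilities summing to 1.
    Diffuse attention -> high entropy -> higher epistemic uncertainty."""
    return -sum(p * math.log(p) for p in attn_weights if p > 0)

def should_consolidate(attn_weights, threshold=1.0):
    """Hypothetical parameter-free trigger: consolidate memory only
    when attention entropy exceeds `threshold`.
    The threshold of 1.0 is an arbitrary choice for illustration."""
    return attention_entropy(attn_weights) > threshold
```

For example, a sharply peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` has entropy well below 1 nat and would not fire the trigger, while a uniform distribution over four positions has entropy ln 4 ≈ 1.386 and would.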
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: agent memory, LLM agents, autonomous agents, LLM efficiency, inference methods, long-form summarization, code generation, mathematical reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2704