TL;DR: This paper introduces a method to improve LLM reasoning by using a separate "coprocessor" to augment a frozen LLM's KV-cache with learned latent embeddings, leaving the original LLM unchanged and improving performance on reasoning-intensive tasks.
Abstract: Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently improves performance across a range of reasoning-intensive tasks.
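To make the training setup concrete, below is a minimal sketch of the idea (our own illustration, not the authors' code): a frozen toy decoder exposes its key-value cache, a trainable coprocessor reads that cache and emits a few latent key/value pairs that are prepended to it, and only the coprocessor is updated with the decoder's ordinary next-token language-modeling loss. The tiny decoder, the `Coprocessor` module, and all sizes such as `N_LATENT` are assumptions chosen for exposition.

```python
# Minimal sketch (not the authors' implementation) of differentiable KV-cache
# augmentation: a frozen toy decoder exposes its KV cache, a trainable coprocessor
# maps that cache to a few extra latent key/value pairs, and only the coprocessor
# is trained with the decoder's standard language-modeling loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V, N_LATENT = 64, 100, 4  # hidden size, vocab size, number of latent embeddings (illustrative)

class TinyDecoderLayer(nn.Module):
    """One causal self-attention + MLP block that can attend to extra (key, value) pairs."""
    def __init__(self):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(D, D) for _ in range(4))
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, x, extra_kv=None):
        q, k, v = self.q(x), self.k(x), self.v(x)
        kv_cache = (k, v)                      # cache produced by the frozen decoder
        if extra_kv is not None:               # latent embeddings from the coprocessor
            k = torch.cat([extra_kv[0], k], dim=1)
            v = torch.cat([extra_kv[1], v], dim=1)
        n_extra = 0 if extra_kv is None else extra_kv[0].shape[1]
        T = x.shape[1]
        # Causal mask over real tokens; every token may also attend to the latents.
        mask = torch.ones(T, T + n_extra, dtype=torch.bool)
        mask[:, n_extra:] = torch.tril(torch.ones(T, T, dtype=torch.bool))
        att = (q @ k.transpose(-1, -2)) / D ** 0.5
        att = att.masked_fill(~mask, float("-inf")).softmax(-1)
        x = x + self.o(att @ v)
        return x + self.mlp(x), kv_cache

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.layer = TinyDecoderLayer()
        self.head = nn.Linear(D, V)

    def forward(self, tokens, extra_kv=None):
        h, kv_cache = self.layer(self.emb(tokens), extra_kv)
        return self.head(h), kv_cache

class Coprocessor(nn.Module):
    """Maps the decoder's KV cache to N_LATENT additional (key, value) pairs."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_LATENT, D) * 0.02)
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.to_k, self.to_v = nn.Linear(D, D), nn.Linear(D, D)

    def forward(self, kv_cache):
        k, v = kv_cache
        q = self.queries.unsqueeze(0).expand(k.shape[0], -1, -1)
        latents, _ = self.attn(q, k, v)        # read the cache with learned queries
        return self.to_k(latents), self.to_v(latents)

decoder, coproc = TinyDecoder(), Coprocessor()
for p in decoder.parameters():                 # the decoder stays frozen end to end
    p.requires_grad_(False)
opt = torch.optim.AdamW(coproc.parameters(), lr=1e-3)

tokens = torch.randint(0, V, (2, 16))          # stand-in for standard pretraining text
with torch.no_grad():                          # pass 1: build the cache to be augmented
    _, kv_cache = decoder(tokens)
extra_kv = coproc(kv_cache)                    # pass 2: decode with the augmented cache
logits, _ = decoder(tokens, extra_kv)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()                                # gradients flow only into the coprocessor
opt.step()
```

Passing `extra_kv=None` recovers the unaugmented decoder, mirroring the point above that the frozen model still functions normally when the coprocessor is unavailable or a cache is not augmented.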
Lay Summary: Large Language Models (LLMs), the AI behind many chatbots, often need to "think" through multiple steps to solve complex problems, but current methods for this can be slow or require changing the LLM itself. This makes it hard to boost their reasoning abilities efficiently.
We've developed a way to help an LLM "think" more deeply without altering its original design. We introduce a separate "helper" system (a coprocessor) that examines the LLM's current working memory. This helper then subtly adds refined, concentrated information—what we call latent embeddings—back into the LLM's memory to improve its understanding. This process is designed to be efficient and can even run in the background or offline.
Our approach allows the main LLM—which remains unchanged—to achieve significantly better results on tasks requiring complex reasoning and to make more accurate predictions. Because the helper system is external, the LLM can still function normally if the helper isn't active. This method opens the door for AI to perform more thoughtful deliberation on information, potentially leading to more capable and efficient AI assistants.
Primary Area: Deep Learning->Large Language Models
Keywords: Latent reasoning, Cache augmentation, LLM
Submission Number: 403