Keywords: Latent Cache Flow, LCF, Cache-to-Cache, C2C, KV Cache, Cache Communication, Cross-Model Communication, Cross-Context Communication, Multi-Model Systems, LLM Communication, LLM Inference, Parameter Efficiency, Adapter Efficiency, Bottleneck Compression, Key-Value Compression, Text-to-Text Communication
TL;DR: Language is for humans. Latents are for LLMs.
Abstract: LLM agents today communicate via text, which incurs considerable latency and information loss because the sharer model's state must be autoregressively decoded into tokens and then re-encoded by the receiver model. Recent work such as Cache-to-Cache (C2C) instead exchanges KV caches by learning adapters that translate the sharer's KV matrices into the receiver model's representation. However, these adapters are large and expensive to train, and they translate token by token, which requires the sharer and receiver to hold identical contexts. This makes them unsuitable for agent communication, where the LLMs have differing contexts.
We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing contexts, we design the adapter to transmit a summary of the new information that the target model lacks. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings; with differing contexts, LCF improves F1 by 7.5% and Exact Match by 23% while running 8.5 times faster than text-based communication.
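To give a feel for why jointly translating and compressing keys and values can shrink an adapter, the sketch below compares parameter counts for two toy designs: separate dense projections for K and V, versus a single low-rank bottleneck applied to the concatenated [K; V] vector. This is an illustrative assumption only, not the LCF or C2C architecture; the function names, the dimensions `d_src` and `d_tgt`, and the bottleneck width `r` are all hypothetical, and the resulting ratio is not meant to reproduce the 4% figure above.

```python
def separate_adapter_params(d_src: int, d_tgt: int) -> int:
    """Toy baseline: one dense d_src x d_tgt map for keys and another for values.

    Biases are omitted to keep the arithmetic simple.
    """
    return 2 * d_src * d_tgt


def joint_bottleneck_params(d_src: int, d_tgt: int, r: int) -> int:
    """Toy joint design: concatenate [K; V], project down to r, then up to [K'; V'].

    Down-projection: (2 * d_src) x r. Up-projection: r x (2 * d_tgt).
    """
    return (2 * d_src) * r + r * (2 * d_tgt)


# Hypothetical sizes chosen only for illustration.
d_src, d_tgt, r = 1024, 1024, 32

sep = separate_adapter_params(d_src, d_tgt)
joint = joint_bottleneck_params(d_src, d_tgt, r)
ratio = joint / sep

print(f"separate K/V projections: {sep:,} parameters")
print(f"joint bottleneck (r={r}): {joint:,} parameters")
print(f"size ratio: {ratio:.2%}")  # 6.25% for these toy sizes
```

For these toy dimensions the ratio reduces to `2r / d` when `d_src == d_tgt == d`, so the saving is governed entirely by how narrow the bottleneck can be made without hurting accuracy.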
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 116