Episode-Level Multimodal KV Caching for Embodied Question Answering

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
Venue: NeurIPS 2025 ER Workshop
License: CC BY 4.0
Keywords: Embodied Question Answering, Multimodal KV Cache
Abstract: Embodied Question Answering (EQA) requires agents to sustain a representation of the world while answering multi-turn queries in real time. A key challenge is maintaining and updating this world model efficiently under resource constraints. Existing approaches repeatedly re-encode visual inputs or apply retrieval-augmented generation, both of which introduce latency that limits interactive use. We propose an episode-level multimodal KV cache that is constructed once from uniformly sampled frames and reused across all queries in the same episode. This cache serves as a lightweight multimodal memory that reduces redundant computation while preserving relevant context. On the OpenEQA benchmark, our method achieves up to an 82% reduction in total question-answering time compared to naïve multi-image inference, with only a modest drop in accuracy. These findings demonstrate that reusing an episode-level cache provides an effective mechanism for maintaining and updating world models for efficient reasoning in EQA.
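To make the mechanism concrete, below is a minimal sketch of the episode-level reuse pattern built on the HuggingFace transformers prefix-caching API: the visual prefill runs once per episode, and each question reuses the resulting KV cache. The checkpoint, prompt template, frame count, and helper names are illustrative assumptions; the paper does not specify an implementation.

```python
# Minimal sketch of episode-level KV-cache reuse for EQA, assuming a
# HuggingFace-style vision-language model. All names below (checkpoint,
# prompt template, helpers) are illustrative assumptions, not the paper's code.
import copy

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # hypothetical choice of VLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)


def uniform_sample(frames, k):
    """Keep k frames evenly spaced across the episode."""
    idx = torch.linspace(0, len(frames) - 1, steps=k).long().tolist()
    return [frames[i] for i in idx]


@torch.no_grad()
def build_episode_cache(frames, k=16):
    """Prefill once: encode the sampled frames and keep the KV cache."""
    images = uniform_sample(frames, k)
    prompt = "USER: " + "<image>\n" * len(images)  # illustrative template
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
    out = model(**inputs, use_cache=True)
    return out.past_key_values, inputs["input_ids"]


@torch.no_grad()
def answer(question, episode_cache, prefix_ids, max_new_tokens=32):
    """Reuse the cached episode prefix; only the question tokens are new."""
    q_ids = processor.tokenizer(
        question + " ASSISTANT:", add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(model.device)
    full_ids = torch.cat([prefix_ids, q_ids], dim=-1)
    # generate() skips recomputation for positions already covered by the cache;
    # deepcopy because generation appends to (mutates) the cache in place.
    out_ids = model.generate(
        input_ids=full_ids,
        attention_mask=torch.ones_like(full_ids),
        past_key_values=copy.deepcopy(episode_cache),
        max_new_tokens=max_new_tokens,
    )
    return processor.decode(out_ids[0, full_ids.shape[1]:], skip_special_tokens=True)
```

Built once per episode, the cache amortizes the expensive visual prefill across every question in that episode; each subsequent query pays only for its own question tokens plus decoding, which is the source of the latency savings the abstract describes.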
Submission Number: 99