Abstract—Multimodal vision-language models (VLMs) have
achieved remarkable capabilities but suffer from high inference
latency, particularly due to repeated visual encoding operations.
While caching techniques have proven effective for text-only
large language models, existing approaches fail to address the
unique characteristics of multimodal inference: heterogeneous
token types with different computational costs, cross-modal
dependencies, and the distinction between expensive visual encoding
and lightweight text generation. We present a novel multi-level
caching architecture that employs attention-based importance
scoring and cross-modal cache awareness to optimize multimodal
inference. Our Phase 1 prototype on Apple Silicon with MLX
validates output-level caching (L2), achieving a 52.8% hit rate
and a 2.12× real-world speedup (1291 ms effective latency
versus a 2733 ms baseline). Evaluated on SmolVLM2-2.2B across
250 visual question answering queries with three cache policies
(LRU, importance-based, cross-modal), all policies demonstrate
statistically indistinguishable performance (p < 0.001) because
cache capacity exceeds the working set. We validate the theoretical
multi-level caching framework and identify embedding-level
caching (L1) as requiring standard vision-language models with
accessible intermediate representations—blocked in Phase 1 by
SmolVLM2’s video architecture but feasible in Phase 2 distributed
simulation with PyTorch models.
Index Terms—Multimodal inference, vision-language models,
caching systems, attention mechanisms, performance optimization