Abstract—Multimodal vision-language models (VLMs) have
achieved remarkable capabilities but suffer from high inference
latency, particularly due to repeated visual encoding operations.
While caching techniques have proven effective for text-only
large language models, existing approaches fail to address the
unique characteristics of multimodal inference: heterogeneous
token types with different computational costs, cross-modal
dependencies, and the distinction between expensive visual encoding
and lightweight text generation. We present a novel multi-level
caching architecture that employs attention-based importance
scoring and cross-modal cache awareness to optimize multimodal
inference. Our Phase 1 prototype on Apple Silicon with MLX
validates output-level caching (L2), achieving a 52.8% hit rate
and a 2.12× real-world speedup (1291 ms effective latency
versus a 2733 ms baseline). Evaluated on SmolVLM2-2.2B across
250 visual question answering queries with three cache policies
(LRU, importance-based, cross-modal), all policies demonstrate
statistically indistinguishable performance (p < 0.001) because
cache capacity exceeds the working set. We validate the theoretical
multi-level caching framework and identify embedding-level
caching (L1) as requiring standard vision-language models with
accessible intermediate representations—blocked in Phase 1 by
SmolVLM2’s video architecture but feasible in Phase 2 distributed
simulation with PyTorch models.
Index Terms—Multimodal inference, vision-language models,
caching systems, attention mechanisms, performance optimization