Track: Extended abstracts (2 pages)
Keywords: Vision-Language Models, Adaptive Inference, Token Pruning, Model Efficiency, Robustness, Cross-Modal Attention, Edge Deployment, Resource-Constrained AI
TL;DR: We present an adaptive token pruning method that speeds up vision-language models by focusing computation on the most informative visual and textual tokens, achieving around 40% efficiency gains with minimal accuracy loss.
Abstract: As vision-language models (VLMs) advance toward real-world deployment in domains such as robotics, autonomous systems, and assistive technologies, their computational and memory demands pose a persistent bottleneck. Existing architectures typically process all visual and textual tokens uniformly, regardless of their contribution to the final prediction, leading to wasted computation and added latency that hinder scalability. In this work, we introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that identifies and retains only the most informative subset of multimodal tokens based on their contextual relevance. ATP analyzes cross-modal attention distributions at each transformer layer and estimates token importance scores derived from both inter- and intra-modal saliency. Tokens deemed redundant are pruned progressively, allowing the model to focus computation on semantically rich regions and phrases while maintaining alignment across modalities.
Unlike static compression or distillation approaches, ATP adapts to each input instance without modifying the backbone architecture. We implement ATP as a lightweight gating module compatible with popular VLM backbones such as BLIP-2, LLaVA, and Flamingo. Preliminary evaluations on VQAv2, GQA, and COCO Captioning indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5× speedups in end-to-end latency, with negligible loss (<1%) in task accuracy. Moreover, qualitative analyses suggest that ATP preserves visual grounding and contextual reasoning fidelity, indicating that token pruning can also serve as a lens into model interpretability.
Beyond efficiency, we investigate the robustness of ATP-enhanced models under visual corruption and linguistic perturbation. We observe that adaptive pruning tends to suppress spurious correlations and hallucinated features, yielding improved stability across noise conditions. These findings suggest that resource-constrained inference and model reliability are not necessarily competing objectives: adaptive mechanisms can improve both simultaneously. Finally, we discuss how ATP can be integrated into deployment pipelines for multimodal edge computing, emphasizing its role as a general design principle for efficient, robust, and real-time VLM reasoning.
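To make the layer-wise importance scoring in the abstract concrete, the following is a minimal sketch of attention-derived token pruning. It uses mean received attention as the saliency proxy and a fixed keep ratio; the function name, keep-ratio parameter, and this simplified scoring rule are illustrative assumptions, not the paper's exact inter-/intra-modal formulation.

```python
import numpy as np

def prune_by_attention(attn, keep_ratio=0.6):
    """Return sorted indices of tokens to keep at one layer.

    attn: (num_heads, num_tokens, num_tokens) attention weights, where
    entry [h, i, j] is how much query token i attends to key token j.
    Importance of token j is the mean attention it receives across heads
    and queries (a simple saliency proxy, assumed for illustration).
    """
    importance = attn.mean(axis=(0, 1))                # (num_tokens,)
    k = max(1, int(round(keep_ratio * attn.shape[-1])))
    keep = np.argsort(importance)[-k:]                 # top-k most attended
    return np.sort(keep)                               # preserve token order

# Toy example: 2 heads, 5 tokens; token 0 is made highly attended.
rng = np.random.default_rng(0)
attn = rng.random((2, 5, 5))
attn[:, :, 0] += 1.0                                   # boost saliency of token 0
attn /= attn.sum(axis=-1, keepdims=True)               # rows sum to 1
kept = prune_by_attention(attn, keep_ratio=0.6)
print(kept)  # 3 of 5 tokens survive; token 0 is among them
```

Applied progressively, each layer would pass only the kept tokens to the next, which is where the FLOP reduction comes from.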
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 11