OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0
TL;DR: We propose OmniSIFT, a modality-asymmetric token compression framework for omni-modal large language models.
Abstract: Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks verify the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT adds 4.85M parameters while still achieving lower latency than training-free baselines such as OmniZip. With only 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the full-token model on several tasks.
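Note on the straight-through estimator: the abstract does not spell out the estimator, so the following is only the generic textbook form, written under assumed notation. Here $s_i$ is a hypothetical relevance score for token $i$, $\tau_k$ is the score of the $k$-th highest-scoring token, $\sigma$ is the sigmoid, and $\operatorname{sg}(\cdot)$ is the stop-gradient operator. None of these symbols come from the paper.

```latex
% Hard top-k mask used in the forward pass (assumed notation)
m_i = \mathbb{1}\left[ s_i \ge \tau_k \right]

% Straight-through surrogate: equals m_i in the forward pass,
% but takes its gradient from the soft relaxation \sigma(s_i)
\tilde{m}_i = m_i + \sigma(s_i) - \operatorname{sg}\!\left(\sigma(s_i)\right),
\qquad
\frac{\partial \tilde{m}_i}{\partial s_i} = \sigma(s_i)\bigl(1 - \sigma(s_i)\bigr)
```

Under this form the discrete keep/drop decision is preserved at inference time, while the scoring module still receives a gradient during training.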
Lay Summary: Modern AI assistants increasingly need to understand videos together with their sounds, not just text. However, long audio-video inputs are converted into many tokens, which makes omni-modal language models slow and expensive to use. Token compression is a natural way to reduce this cost, but omni-modal videos bring a special challenge: different modalities have different kinds of redundancy. Visual information is often repetitive across nearby frames or image regions. Audio information, however, is harder to judge alone, because whether a sound matters often depends on what is happening on screen. We introduce OmniSIFT, a lightweight method designed around this modality asymmetry. It first compresses visual tokens to obtain compact visual anchors, and then uses these anchors to select the most relevant audio tokens. In this way, the model focuses on the audio-video evidence that matters most. Experiments show that OmniSIFT greatly reduces audio-video tokens while maintaining, and sometimes improving, model accuracy.
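A minimal sketch of the two-stage idea described above, for illustration only. It is not the OmniSIFT implementation: the scoring rules (cosine-based novelty for video tokens, maximum similarity to visual anchors for audio tokens) and the names `prune_video`, `select_audio`, and `budget` are assumptions, and the learned modules and straight-through training are omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def top_k(scores, k):
    """Indices of the k highest scores, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def budget(n, keep_ratio):
    """Number of tokens to keep out of n (at least one)."""
    return max(1, int(round(keep_ratio * n)))

def prune_video(frames, keep_ratio):
    """Stage 1 (hypothetical heuristic): keep the patches that differ most
    from their own frame's mean (intra-frame) and from the same patch in
    the previous frame (inter-frame). Returns (frame_idx, patch_idx) pairs."""
    kept, prev = [], None
    for t, frame in enumerate(frames):
        dim = len(frame[0])
        mean = [sum(tok[d] for tok in frame) / len(frame) for d in range(dim)]
        scores = []
        for p, tok in enumerate(frame):
            intra = 1.0 - cosine(tok, mean)
            inter = 1.0 - cosine(tok, prev[p]) if prev is not None else 0.0
            scores.append(intra + inter)
        kept.extend((t, p) for p in top_k(scores, budget(len(frame), keep_ratio)))
        prev = frame
    return kept

def select_audio(audio, anchors, keep_ratio):
    """Stage 2 (hypothetical heuristic): keep the audio tokens most similar
    to any of the retained visual anchor tokens."""
    scores = [max(cosine(a, v) for v in anchors) for a in audio]
    return top_k(scores, budget(len(audio), keep_ratio))

# Toy input: 2 frames x 4 patches, 4 audio tokens, 2-dimensional features.
frames = [
    [[1, 0], [1, 0], [0, 1], [1, 0]],
    [[1, 0], [0, 1], [0, 1], [1, 0]],
]
audio = [[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]]

kept_video = prune_video(frames, keep_ratio=0.25)
anchors = [frames[t][p] for t, p in kept_video]
kept_audio = select_audio(audio, anchors, keep_ratio=0.5)
print(kept_video, kept_audio)  # [(0, 2), (1, 1)] [1, 3]
```

The point of the sketch is the ordering: audio tokens are scored only after the visual anchors exist, which mirrors the asymmetry the summary describes. In the toy input, the one patch per frame that stands out or changes is kept, and the two audio tokens aligned with those patches are selected.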
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/dingyue772/OmniSIFT
Primary Area: Deep Learning->Large Language Models
Keywords: efficiency, multimodal, omni
Originally Submitted PDF: pdf
Submission Number: 49