MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
Keywords: Multimodal Understanding, Information-Theoretic Optimization
TL;DR: MANTA unifies visual and auditory inputs into structured textual representations through information-theoretic optimization, achieving up to 22.6% improvement over state-of-the-art models on long-form multimodal understanding tasks.
Abstract: The fundamental challenge in multimodal understanding lies not merely in processing individual modalities, but in discovering optimal strategies for their semantic unification across vastly different representational spaces. Current approaches maintain separate encoders for each modality, leading to semantic fragmentation and computational inefficiency. We present MANTA (Multimodal Abstraction and Normalization via Textual Alignment), a theoretically grounded framework that reconceptualizes multimodal integration as an information-theoretic optimization problem with provable guarantees, showing that natural language can serve as a universal semantic bridge. We make three theoretical contributions: (1) hierarchical linguistic projection achieves (1−ϵ)-optimal information preservation; (2) cross-modal contrastive alignment converges to maximal mutual information at rate O(1/√T); and (3) our retrieval mechanism achieves an optimal trade-off between relevance and diversity. These results guide practical algorithms for multi-scale representation learning, information-theoretic content selection, cross-modal semantic alignment, and retrieval-augmented generation. Extensive experiments on long-form video understanding validate the framework, with gains of 22.6% on Video-MME, 27.3% on videos longer than 30 minutes, and 25.1% on cross-modal reasoning. Together, these results establish new theoretical foundations for linguistic abstraction as a unifying principle for multimodal AI, with implications for robotics, embodied AI, and human-computer interaction.
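As one concrete illustration of the cross-modal contrastive alignment objective named in the abstract, the sketch below implements a standard symmetric InfoNCE loss between textual and audio-visual embeddings; maximizing its negative tightens a lower bound on the mutual information between the two modalities. The function name, temperature, and embedding shapes are illustrative assumptions, not MANTA's actual implementation.

```python
# Minimal sketch of a cross-modal contrastive (InfoNCE-style) alignment loss.
# Hypothetical example: projection details and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def info_nce_alignment(text_emb: torch.Tensor,
                       av_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between text and audio-visual embeddings.

    Minimizing this loss maximizes a lower bound on the mutual information
    between the paired modalities (van den Oord et al., 2018).
    """
    # L2-normalize so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    av_emb = F.normalize(av_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the positive pairs.
    logits = text_emb @ av_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the text-to-AV and AV-to-text cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Example usage with random embeddings (batch of 8, dimension 256).
    text = torch.randn(8, 256)
    audio_visual = torch.randn(8, 256)
    print(info_nce_alignment(text, audio_visual).item())
```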
Primary Area: generative models
Submission Number: 9830