MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
Keywords: Multimodal Understanding, Information-Theoretic Optimization
TL;DR: MANTA unifies visual and auditory inputs into structured textual representations through information-theoretic optimization, achieving up to 22.6% improvement over state-of-the-art models on long-form multimodal understanding tasks.
Abstract: The fundamental challenge in multimodal understanding lies not merely in processing individual modalities, but in discovering optimal strategies for their semantic unification across vastly different representational spaces. Current approaches maintain separate encoders for each modality, leading to semantic fragmentation and computational inefficiency. We present MANTA (Multimodal Abstraction and Normalization via Textual Alignment), a theoretically grounded framework that reconceptualizes multimodal integration as an information-theoretic optimization problem with provable guarantees, showing that natural language can serve as a universal semantic bridge. We make three theoretical contributions: (1) hierarchical linguistic projection achieves (1−ϵ)-optimal information preservation; (2) cross-modal contrastive alignment converges to maximal mutual information at rate O(1/√T); and (3) our retrieval mechanism achieves an optimal trade-off between relevance and diversity. These results guide practical algorithms for multi-scale representation learning, information-theoretic content selection, cross-modal semantic alignment, and retrieval-augmented generation. Extensive experiments on long-form video understanding validate the framework, with gains of 22.6% on Video-MME, 27.3% on videos longer than 30 minutes, and 25.1% on cross-modal reasoning. Together, these results establish new theoretical foundations for linguistic abstraction as a unifying principle for multimodal AI, with implications for robotics, embodied AI, and human-computer interaction.
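As one concrete illustration of the cross-modal contrastive alignment objective named in the abstract, the sketch below implements a standard symmetric InfoNCE loss between textual and audio-visual embeddings; maximizing its negative tightens a lower bound on the mutual information between the two modalities. The function name, temperature, and embedding shapes are illustrative assumptions, not MANTA's actual implementation.

```python
# Minimal sketch of a cross-modal contrastive (InfoNCE-style) alignment loss.
# Hypothetical example: projection details and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def info_nce_alignment(text_emb: torch.Tensor,
                       av_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between text and audio-visual embeddings.

    Minimizing this loss maximizes a lower bound on the mutual information
    between the paired modalities (van den Oord et al., 2018).
    """
    # L2-normalize so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    av_emb = F.normalize(av_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the positive pairs.
    logits = text_emb @ av_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the text-to-AV and AV-to-text cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Example usage with random embeddings (batch of 8, dimension 256).
    text = torch.randn(8, 256)
    audio_visual = torch.randn(8, 256)
    print(info_nce_alignment(text, audio_visual).item())
```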
Primary Area: generative models
Submission Number: 9830