Prompt-Guided Low-Level Recovery and High-Level Fusion for Incomplete Multimodal Sentiment Analysis

06 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Learning, Sentiment Analysis, Incomplete Modalities, Cross-modal Fusion, Modality Reconstruction
Abstract: Multimodal Sentiment Analysis seeks to understand emotions by combining language, audio, and visual signals, but its real challenge lies in building models that stay robust when one or more modalities are missing or corrupted. Recent studies attempt to leverage the available embeddings to complement missing regions through single-level feature reconstruction or cross-modal fusion. However, both reconstruction-only and fusion-only pipelines are limited: the former amplifies noise from imperfect recovery, while the latter overlooks semantic restoration, leaving cross-modal gaps and complex intermodal relationships inadequately captured for robust generalization. To overcome these limitations, we propose Prompt-Guided Low-level recovery and High-level fusion (PGLH) for incomplete multimodal sentiment analysis, which achieves deep cross-modal interactions from low-level semantic recovery to high-level semantic fusion through adaptive prompts. Specifically, PGLH consists of two main components: Prompted Cross-Modal Masking (PCM2) and Unimodal-to-Bimodal Prompt Fusion (UBPF). First, PCM2 extends masked autoencoding to multimodal inputs by leveraging language-guided prompts to restore corrupted audio and visual tokens, providing both structural fidelity and semantic grounding for low-level recovery. Second, UBPF introduces self-guided prompts into each modality to extract fine-grained unimodal structure by selectively attending to informative regions; these prompts are then progressively aligned with the language-guided prompts for robust high-level fusion. Together, PCM2 and UBPF realize a dual-level adaptation from low-level token reconstruction to high-level semantic integration, bridging modality gaps and yielding more robust representations. Extensive experiments on MOSI, MOSEI, and SIMS demonstrate that PGLH consistently achieves strong performance under missing-modality conditions.
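Since the abstract describes the two modules only at a high level, the following is a minimal illustrative sketch (PyTorch) of how prompted cross-modal masking and unimodal-to-bimodal prompt fusion could be wired up. All class names, dimensions, prompt counts, and attention mechanics below are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: module names, dimensions, and prompt mechanics
# are assumptions; the paper's actual PCM2/UBPF design may differ.
import torch
import torch.nn as nn

class PromptedCrossModalMasking(nn.Module):
    """PCM2-style sketch: restore masked audio/visual tokens via language-guided prompts."""
    def __init__(self, dim=128, n_prompts=8, n_heads=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)  # learnable prompt tokens
        self.prompt_from_lang = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # placeholder for missing tokens

    def forward(self, lang, target, missing):
        # lang: (B, Tl, D) language features; target: (B, Tt, D) audio/visual tokens
        # missing: (B, Tt) bool, True where a token is missing or corrupted
        B = lang.size(0)
        # Language-guided prompts: learnable prompts attend to language features.
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)
        p, _ = self.prompt_from_lang(p, lang, lang)
        # Replace corrupted tokens with a shared mask token, then decode with prompts prepended
        # (a reconstruction loss would typically be applied on the masked positions only).
        x = torch.where(missing.unsqueeze(-1), self.mask_token.expand_as(target), target)
        x = self.decoder(torch.cat([p, x], dim=1))[:, p.size(1):]
        return x  # reconstructed token sequence, (B, Tt, D)

class UnimodalToBimodalPromptFusion(nn.Module):
    """UBPF-style sketch: self-guided unimodal prompts aligned with language-guided prompts."""
    def __init__(self, dim=128, n_prompts=8, n_heads=4):
        super().__init__()
        self.self_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.unimodal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fusion_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, modality, lang_prompts):
        # modality: (B, Tm, D) unimodal features; lang_prompts: (B, P, D)
        B = modality.size(0)
        # Self-guided prompts selectively attend to informative regions of their own modality...
        q = self.self_prompts.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.unimodal_attn(q, modality, modality)
        # ...and are then aligned with the language-guided prompts for bimodal fusion.
        fused, _ = self.fusion_attn(q, lang_prompts, lang_prompts)
        return fused.mean(dim=1)  # pooled bimodal representation, (B, D)

# Quick shape check with random features (hypothetical dimensions).
B, D = 2, 128
lang = torch.randn(B, 20, D)                # language token features
audio = torch.randn(B, 30, D)               # audio token features
missing = torch.rand(B, 30) < 0.3           # 30% of audio tokens corrupted/missing
recovered = PromptedCrossModalMasking(dim=D)(lang, audio, missing)    # (B, 30, D)
fused = UnimodalToBimodalPromptFusion(dim=D)(recovered, lang[:, :8])  # (B, D)
print(recovered.shape, fused.shape)
```

The sketch mirrors the abstract's two-stage pipeline: low-level token recovery conditioned on language, followed by prompt-mediated high-level fusion of the recovered modality with language.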
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2530