AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning

Jian Lang; Rongpei Hong; Ting Zhong; Fan Zhou

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 · ICML 2026 regular · CC BY 4.0

TL;DR: We propose AOEPT, a lightweight modal-contextualized prompting paradigm that overcomes the Implicit Modality-Reduction bottleneck of existing modality-missing prompt-tuning methods.

Abstract: Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, these methods inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which introduces a modal-contextualized prompting paradigm. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones show that AOEPT performs strongly while adding minimal computational overhead.
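
To make the idea of instantiating MCPs into instance-aware prompts concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify the conditioning mechanism, so the attention-style readout, the function name `instantiate_prompts`, the chunk-averaged queries, and all shapes are assumptions made purely for illustration. In a real system the MCP bank would be learned; here it is random.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instantiate_prompts(mcp_bank, observed_tokens, num_prompts=4):
    """Hypothetical readout of a modality-wise prompt bank.

    mcp_bank:        (K, d) prompts standing in for the missing modality's priors.
    observed_tokens: (T, d) token features of the modality that is present.
    Returns:         (num_prompts, d) instance-aware prompts.
    """
    assert observed_tokens.shape[0] >= num_prompts, "need at least one token per query"
    d = mcp_bank.shape[1]
    # Assumption: build queries by averaging contiguous chunks of observed tokens.
    chunks = np.array_split(observed_tokens, num_prompts, axis=0)
    queries = np.stack([c.mean(axis=0) for c in chunks])   # (num_prompts, d)
    # Scaled dot-product attention over the bank: each query selects
    # a convex combination of bank entries relevant to this sample.
    weights = softmax(queries @ mcp_bank.T / np.sqrt(d))   # (num_prompts, K)
    return weights @ mcp_bank                              # (num_prompts, d)

rng = np.random.default_rng(0)
mcp_image = rng.normal(size=(16, 32))    # stand-in bank for a missing image modality
text_tokens = rng.normal(size=(10, 32))  # observed text tokens for one sample

prompts = instantiate_prompts(mcp_image, text_tokens)
# The prompts are prepended to the observed tokens before the (frozen) MT.
mt_input = np.concatenate([prompts, text_tokens], axis=0)
print(prompts.shape, mt_input.shape)
```

The point of the sketch is only the data flow: the prompt content comes from a bank tied to the missing modality, while the observed modality decides which parts of that bank are read out for a given sample.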

Lay Summary: If we think of a Multimodal Transformer as Officer Judy from Zootopia, it is supposed to solve a case by combining different kinds of evidence, such as images, records, and audio. In real-world settings, however, some evidence may be missing due to sensor failures or transmission errors. Existing methods provide extra “prompts” to help Judy, but these prompts are still derived only from the remaining evidence. This is like asking Judy to reason only from an incomplete case file, which restricts her view of the case. We aim to give Judy (the Transformer) a lightweight “manual” for each type of evidence. When one type of evidence is missing, she can recall relevant experience from the corresponding manual of that evidence type, helping her recover a broader reasoning view while keeping prompt learning efficient and lightweight.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/Jian-Lang/AOEPT

Primary Area: Deep Learning->Robustness

Keywords: modality missing learning, incomplete multimodal learning, prompt tuning, multimodal transformer, implicit modality-reduction bottleneck

Originally Submitted PDF: pdf

Submission Number: 3014
