When is More Better? Efficient and Adaptive Modality Acquisition in Multimodal Learning

TMLR Paper7884 Authors

11 Mar 2026 (modified: 16 Mar 2026) · Under review for TMLR · CC BY 4.0

Abstract: Multimodal machine learning can improve flexibility, performance, and robustness, but incorporating many modalities increases acquisition costs and system complexity. This motivates adaptive modality acquisition (AMA), where a subset of the most informative modalities is selected before observation to balance predictive performance against cost. Prior work has largely focused on population-level acquisition, selecting a fixed subset of modalities that performs well on average. In this work, we instead adapt modality acquisition per sample, which is critical in settings such as healthcare where the value and cost of additional tests depend on the specific patient, improving efficiency while naturally supporting heterogeneous acquisition costs. This setting leads to the problem of multi-stage subset selection with unobserved items and heterogeneous costs, which poses challenges in both uncertainty and scalability. Our key idea is to learn a compositional energy-based value function that scores candidate modality subsets for their expected contribution to downstream prediction. We implement this through recursive value functions that estimate the value of acquiring any subset of modalities conditioned on the currently observed modalities, allowing the same model to be applied iteratively as new modalities are acquired. Our key contributions are: (1) learning this recursive value function as an energy-based model; (2) designing and characterizing suitable value functions for this setting, with a selection rule based on the model confusion rate (MCR; probability that added modalities flip a correct prediction); and (3) showing that, under a natural submodularity assumption on modality value, the acquisition objective can be optimized efficiently via submodular optimization. 
This framework yields scalable training and inference algorithms, collectively referred to as Efficient Adaptive Modality Acquisition (EAMA), that scale linearly in the number of modalities. Across multiple real-world multimodal datasets with up to $M=15$ modalities, EAMA achieves up to an $8\times$ improvement in balancing accuracy gains against acquisition costs relative to baseline methods. In some cases, EAMA is able to do more with less, improving accuracy while using only $27.4\%$ of the available modalities on average.
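The abstract's third contribution rests on a standard fact: when a set function has diminishing returns (submodularity), a greedy rule that repeatedly adds the item with the best marginal gain per unit cost is a cheap and often effective way to pick a subset. The sketch below illustrates that generic cost-sensitive greedy rule only; it is not the EAMA algorithm. The function `greedy_acquire`, the `trade_off` parameter, the modality names, the costs, and the toy coverage-style `value` function are all hypothetical stand-ins for the learned energy-based value function described above.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_acquire(
    candidates: List[str],
    costs: Dict[str, float],
    value: Callable[[FrozenSet[str]], float],
    observed: FrozenSet[str] = frozenset(),
    trade_off: float = 1.0,
) -> List[str]:
    """Generic cost-sensitive greedy selection (illustrative, not EAMA).

    Repeatedly adds the modality with the highest marginal gain per unit
    cost, and stops once no remaining modality has a marginal gain that
    exceeds trade_off * cost. Assumes every cost is strictly positive.
    """
    selected = set(observed)
    order: List[str] = []
    remaining = [m for m in candidates if m not in selected]
    while remaining:
        base = value(frozenset(selected))
        best, best_ratio = None, 0.0
        for m in remaining:
            gain = value(frozenset(selected | {m})) - base
            net = gain - trade_off * costs[m]
            ratio = gain / costs[m]
            if net > 0 and ratio > best_ratio:
                best, best_ratio = m, ratio
        if best is None:
            break  # nothing left is worth its cost
        selected.add(best)
        order.append(best)
        remaining.remove(best)
    return order


# Toy submodular value (assumption): each modality "covers" some units of
# information, and the value of a subset is proportional to the number of
# distinct units covered.
COVERAGE = {
    "image": {1, 2, 3},
    "text": {3, 4},
    "audio": {1, 2},
    "lab": {5},
}


def toy_value(subset: FrozenSet[str]) -> float:
    covered = set()
    for m in subset:
        covered |= COVERAGE[m]
    return 0.1 * len(covered)


# Hypothetical heterogeneous acquisition costs.
COSTS = {"image": 0.15, "text": 0.05, "audio": 0.10, "lab": 0.20}

order = greedy_acquire(list(COVERAGE), COSTS, toy_value)
print(order)
```

In this toy setup the greedy rule stops before acquiring every modality, because the remaining ones add too little value for their cost. In the per-sample setting the abstract describes, the `observed` argument would hold the modalities already acquired for a given sample, and `value` would be replaced by a model conditioned on them.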

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Qi_Yu1

Submission Number: 7884
