Position: Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning

Sanghyuk Chun; Olga Russakovsky

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 Position Paper Track (regular)

License: CC BY 4.0

TL;DR: Multiplicity is an inevitable and inherent challenge in multimodal learning and should be treated as a first-class consideration in multimodal tasks.

Abstract: Multimodal learning has seen remarkable progress, particularly with large-scale pre-training across various modalities. Most current approaches assume a deterministic one-to-one alignment between modalities. However, this assumption oversimplifies real-world multimodal relationships, which are inherently many-to-many. The many-to-many property, or multiplicity, is not a side effect of noise or annotation error, but an inevitable outcome of intra-modal variability, representational asymmetry, and task-dependent ambiguity in multimodal tasks. We argue that multiplicity is a fundamental bottleneck that affects all stages of the multimodal learning pipeline, from data construction to model training and evaluation benchmarks. By formalizing its causes and consequences, we demonstrate how ignoring multiplicity leads to training uncertainty, unreliable evaluation, and degraded dataset quality. This position paper calls for new research directions in multimodal learning, including multiplicity-aware learning frameworks as well as dataset construction and evaluation protocols.

Lay Summary: AI systems increasingly learn by connecting different kinds of information, such as images and captions, videos and sounds, or robot actions and instructions. Many datasets and tests make a simple assumption: each item has one correct match. In real life, this is often false. The same image can have many good captions, and the same caption can describe many possible images. This paper argues that these multiple valid matches are not just annotation mistakes, but an unavoidable part of multimodal learning. When this issue is ignored, AI systems may learn from incomplete labels, treat valid matches as wrong, or be judged unfairly by benchmarks that count only one answer as correct. We explain where this problem comes from and how it affects data collection, training, and evaluation. We call for future datasets, methods, and benchmarks that explicitly account for multiple valid matches.
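To make the evaluation concern concrete, the following is a minimal sketch, not taken from the paper, of how a retrieval benchmark that accepts only one answer per query can score the same model lower than a protocol that accepts every valid match. The function name `recall_at_1`, the image ids, and the toy rankings are all hypothetical and chosen only for illustration.

```python
def recall_at_1(ranked_ids, positives):
    """Fraction of queries whose top-ranked item is in that query's accepted set."""
    hits = sum(1 for ranking, pos in zip(ranked_ids, positives) if ranking[0] in pos)
    return hits / len(ranked_ids)


# Hypothetical toy data: three caption queries, each with a ranked list of image ids.
rankings = [
    ["img_b", "img_a"],
    ["img_c", "img_d"],
    ["img_e", "img_f"],
]

# One-to-one protocol: only the originally paired image counts as correct.
single_positive = [{"img_a"}, {"img_c"}, {"img_f"}]

# Multiplicity-aware protocol (assumed labels): img_b also validly matches the first caption.
multi_positive = [{"img_a", "img_b"}, {"img_c"}, {"img_f"}]

strict_score = recall_at_1(rankings, single_positive)
relaxed_score = recall_at_1(rankings, multi_positive)

print(f"one-to-one Recall@1:         {strict_score:.3f}")
print(f"multiplicity-aware Recall@1: {relaxed_score:.3f}")
```

In this toy setup the rankings are identical under both protocols; only the set of accepted answers changes, so the first query is counted as a miss under the one-to-one labels and as a hit once the second valid image is included.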

Primary Area: Research Priorities, Methodology, and Evaluation

Keywords: multiplicity, multimodal learning, many-to-many alignment

Submission Number: 11
