Towards understanding multimodal in-context learning

08 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: in-context learning, multimodal learning
Abstract: Multimodal large language models (MLLMs) often exhibit in-context learning (ICL) abilities, yet the conditions under which multimodal ICL emerges, and the mechanisms underlying it, remain poorly understood. In particular, how training data statistics and architectural choices jointly shape this capability is still an open question. To address this, we reverse-engineer multimodal ICL by training small transformer models on controlled synthetic classification tasks with varying data statistics and architectural choices. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, our experiments yield two notable observations. First, Rotary Position Embeddings (RoPE), a standard component in contemporary LLMs, can delay the onset of ICL circuits. Second, larger models require stronger statistical cues in the training data for strong ICL to appear. Extending our analysis to the multimodal setting reveals a fundamental learning asymmetry. Once a primary modality has learned a core ICL circuit from statistically diverse data, a secondary modality can reach comparable ICL performance with far less data complexity. In contrast to the unimodal regime, we further find that model scaling consistently improves multimodal ICL. To understand why these patterns emerge, we turn to mechanistic analysis. Using progress measures that track circuit formation during training, we show that ICL accuracy is tightly correlated with the strength of an induction-style circuit that copies labels from in-context exemplars that match the query. Both unimodal and multimodal ICL rely on this induction mechanism, while multimodal training primarily refines and extends it across modalities. Together, these results provide a mechanism-level account of ICL in modern multimodal transformers, offer explanations for several empirical phenomena observed in MLLMs, and introduce a controlled testbed for future work on multimodal ICL.
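The abstract describes an induction-style circuit that classifies a query by copying the label of the in-context exemplar it matches. As a minimal sketch of that idea, the toy code below generates a hypothetical synthetic few-shot classification episode and applies a match-and-copy rule; the function names and episode format are illustrative assumptions, not the authors' actual data-generation or model code.

```python
import random

def make_episode(num_classes=4, num_exemplars=4, dim=8, seed=0):
    """Hypothetical synthetic episode: (input, label) exemplar pairs in
    context, followed by a query whose input repeats one exemplar."""
    rng = random.Random(seed)
    # Random vectors stand in for inputs (e.g. image or text embeddings).
    inputs = [tuple(rng.gauss(0, 1) for _ in range(dim))
              for _ in range(num_exemplars)]
    labels = [rng.randrange(num_classes) for _ in range(num_exemplars)]
    q = rng.randrange(num_exemplars)  # query repeats one in-context exemplar
    return list(zip(inputs, labels)), inputs[q], labels[q]

def induction_predict(context, query):
    """Induction-style mechanism: find the in-context exemplar matching
    the query and copy its label."""
    for x, y in context:
        if x == query:
            return y
    return None

context, query, target = make_episode(seed=42)
assert induction_predict(context, query) == target
```

Because labels are assigned per episode, the rule can only succeed by attending back to the matching exemplar and copying its label, which is the behavior the paper's progress measures are said to track during training.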
Primary Area: interpretability and explainable AI
Submission Number: 3055