Abstract: Multimodal language models now integrate text,
audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input
signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance,
slows convergence, and degrades robustness to
real-world distribution shifts where signals may
be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1)
MAPLE-bench, the first benchmark explicitly annotating minimal signal combinations required
per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by
modality requirement to reduce gradient variance
from heterogeneous group advantages; and (3) adaptive weighting and curriculum scheduling that balance and prioritize harder signal combinations.
Systematic analysis across loss aggregation, clipping, sampling, and curriculum design establishes
MAPO’s optimal training strategy. Adaptive weighting and curriculum-focused learning further boost performance across signal combinations.
MAPLE narrows uni/multi-modal accuracy gaps
by 30.24%, converges 3.18× faster, and maintains
stability across all modality combinations under realistic reduced-signal access. MAPLE constitutes a complete recipe for deployment-ready multimodal RL post-training.
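The batch-stratification idea behind MAPO can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: rollouts are grouped by the modality set each task requires, and advantages are normalized group-relatively within each stratum so that heterogeneous reward scales across modality groups do not inflate policy-gradient variance. All names (`stratified_advantages`, the rollout dict fields) are assumptions for this sketch.

```python
# Hypothetical sketch of modality-aware batch stratification.
# Rollouts are bucketed by required modality set; advantages are
# normalized within each bucket (group-relative, GRPO-style).
from collections import defaultdict
from statistics import mean, pstdev

def stratified_advantages(rollouts):
    """rollouts: list of dicts with 'modalities' (iterable) and 'reward'."""
    strata = defaultdict(list)
    for r in rollouts:
        strata[frozenset(r["modalities"])].append(r)
    for group in strata.values():
        rewards = [r["reward"] for r in group]
        mu, sigma = mean(rewards), pstdev(rewards)
        for r in group:
            # Advantage relative to the modality stratum, not the full batch.
            r["advantage"] = (r["reward"] - mu) / (sigma + 1e-8)
    return rollouts

out = stratified_advantages([
    {"modalities": {"text"}, "reward": 1.0},
    {"modalities": {"text"}, "reward": 0.0},
    {"modalities": {"text", "audio"}, "reward": 5.0},
    {"modalities": {"text", "audio"}, "reward": 3.0},
])
```

Normalizing within each stratum keeps a high-reward modality combination (e.g. text+audio here) from dominating the gradient signal of a lower-reward one, which is one plausible way to realize the variance reduction the abstract describes.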