Dual Language Models: Balancing sample-efficiency and overfitting resilience

ICLR 2026 Conference Submission 20953 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: language model, pretraining, training objective, mixed training objective, masked diffusion
TL;DR: We train language models simultaneously on autoregressive and masked-diffusion objectives, resulting in flexible models that outperform single-objective models in both settings.
Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible models that outperform single-objective baselines in both settings. Autoregressive language modeling has been a popular training approach, partly because of its sample efficiency; however, this efficiency comes at the cost of susceptibility to overfitting. Masked-diffusion language models, on the other hand, are less sample-efficient but more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To determine the optimal ratio between the masked-diffusion and autoregressive objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that combining both objectives is optimal under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
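To make the dual-objective idea concrete, the sketch below shows one plausible way to mix the two losses on a single decoder. It is an illustrative assumption, not the authors' implementation: the `model(tokens, causal=...)` interface, the reserved `MASK_ID` token, the 1/t reweighting of the masked-diffusion loss, and the per-batch mixing probability `p_ar` are all placeholders; the actual mixing ratio is the quantity the paper sweeps over.

```python
# Hypothetical sketch (not the authors' code): training one decoder on a mixture
# of an autoregressive and a masked-diffusion objective.
import torch
import torch.nn.functional as F

MASK_ID = 0      # assumed id of a reserved [MASK] token
PAD_ID = -100    # ignore index for cross-entropy

def autoregressive_loss(model, tokens):
    """Standard next-token prediction with causal attention."""
    logits = model(tokens[:, :-1], causal=True)            # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
        ignore_index=PAD_ID,
    )

def masked_diffusion_loss(model, tokens):
    """Masked-diffusion objective: mask a random fraction t ~ U(0, 1) of the
    tokens, predict them with bidirectional attention, and reweight by 1/t."""
    b, seq_len = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)   # masking rate
    mask = torch.rand(b, seq_len, device=tokens.device) < t
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted, causal=False)                      # (B, T, V)
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens.reshape(-1),
        reduction="none",
    ).reshape(b, seq_len)
    # Only masked positions contribute; 1/t weighting gives the usual bound.
    return ((token_loss * mask) / t).sum() / mask.sum().clamp(min=1)

def dual_objective_loss(model, tokens, p_ar=0.5):
    """Per-batch mixture: with probability p_ar use the autoregressive loss,
    otherwise the masked-diffusion loss. p_ar = 0.5 is an arbitrary placeholder
    for the mixing ratio studied in the paper."""
    if torch.rand(()) < p_ar:
        return autoregressive_loss(model, tokens)
    return masked_diffusion_loss(model, tokens)
```

Mixing at the batch level, as sketched here, requires no architectural change; only the attention mask and the target construction differ between the two branches.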
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20953