Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Authors: ICLR 2026 Conference Submission 20192 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-Task Learning, Large Language Model, Reinforcement Learning, Curriculum Learning
TL;DR: We present Omni-Thinker, a unified multi-task RL framework that combines verifiable and generative rewards to train LLMs across diverse tasks, achieving strong overall performance through curriculum-guided optimization.
Abstract: The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present OMNI-THINKER, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer–guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of $6.2\%$ over joint training and $12.4\%$ over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics explaining the deviations introduced by generative tasks. These findings underscore the importance of BWT-aware scheduling and hybrid supervision for scaling RL-based post-training toward general-purpose LLMs.
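The abstract's two ingredients can be made concrete with a small, hedged sketch: a hybrid reward that mixes a rule-based verifier with an LLM-as-a-Judge score, and task ordering based on the standard continual-learning definition of backward transfer, $\mathrm{BWT} = \frac{1}{T-1}\sum_{j=1}^{T-1}\bigl(a_{T,j} - a_{j,j}\bigr)$, where $a_{i,j}$ is accuracy on task $j$ after training stage $i$. This is not the authors' implementation; the function names, mixing weight, and judge interface below are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): hybrid reward mixing and
# BWT-based task ordering. All names and weights are assumptions.
from typing import Callable, Dict, List


def hybrid_reward(
    response: str,
    verifier: Callable[[str], float],  # rule-based verifiable signal, e.g. 0.0 / 1.0
    judge: Callable[[str], float],     # LLM-as-a-Judge preference score in [0, 1]
    alpha: float = 0.5,                # assumed mixing weight
) -> float:
    """Blend a verifiable reward with a generative judge score."""
    return alpha * verifier(response) + (1.0 - alpha) * judge(response)


def backward_transfer(acc: List[List[float]]) -> float:
    """Standard continual-learning BWT: average change on earlier tasks
    after all stages; acc[i][j] = accuracy on task j after stage i."""
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)


def order_tasks_by_bwt(bwt_when_first: Dict[str, float]) -> List[str]:
    """One plausible reading of BWT-aware scheduling: train earliest the
    tasks that cause the least forgetting on others when learned first."""
    return sorted(bwt_when_first, key=bwt_when_first.get, reverse=True)


if __name__ == "__main__":
    # Toy accuracy matrix over three training stages / tasks.
    acc = [
        [0.60, 0.20, 0.10],
        [0.55, 0.70, 0.15],
        [0.50, 0.65, 0.75],
    ]
    print(f"BWT = {backward_transfer(acc):+.3f}")  # negative values indicate forgetting
    print(order_tasks_by_bwt({"math": 0.02, "coding": -0.05, "writing": -0.01}))
```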
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20192