DecompRL: Solving More Problems with Less Tokens

TMLR Paper8119 Authors

26 Mar 2026 (modified: 26 Mar 2026), Under review for TMLR, CC BY 4.0
Abstract: While repeated sampling from Large Language Models (LLMs) is a robust baseline for competitive programming and other automatically verifiable problems, it comes at a steep GPU cost. Reinforcement learning (RL)-based post-training can reduce the number of samples needed, but it often reduces generation diversity, which limits performance in the large-scale sampling regime. Online RL is itself bottlenecked by the performance of the starting policy and by the heavy compute required for inference. We introduce DecompRL, an algorithm inspired by modular inference that trains policies to decompose complex problems into separate, parallelizable functions. By recombining these modules into polynomially many solutions, DecompRL shifts the RL bottleneck from GPU-based inference to CPU-based evaluation. This enables massive scaling at a fraction of the cost, improving sparse reward discovery and solving complex problems that remain out of reach for standard RL.
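To make the recombination idea in the abstract concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a problem has already been decomposed into named functions and that several candidate implementations were sampled for each one. The toy task, the helper names `recombine` and `passes`, and the sampled snippets are all hypothetical. With k samples for each of m modules, the cross product gives up to k^m candidate programs from only k·m generations, and checking them is plain CPU work.

```python
from itertools import product

def recombine(module_samples):
    """Yield every program formed by picking one sampled implementation per module.

    module_samples maps a function name to a list of candidate source strings.
    """
    names = list(module_samples)
    for choice in product(*(module_samples[n] for n in names)):
        yield "\n".join(choice)

def passes(program_src, tests):
    """Run a candidate program and check its solve() against (input, expected) pairs."""
    namespace = {}
    try:
        exec(program_src, namespace)  # toy evaluation only; no sandboxing here
        return all(namespace["solve"](x) == y for x, y in tests)
    except Exception:
        return False

# Hypothetical toy task: solve(x) = (x + 1) * 2, decomposed into two helpers.
samples = {
    "inc": ["def inc(x): return x + 1", "def inc(x): return x - 1", "def inc(x): return x"],
    "dbl": ["def dbl(x): return x * 2", "def dbl(x): return x + 2", "def dbl(x): return x ** 2"],
    "solve": ["def solve(x): return dbl(inc(x))"],
}
tests = [(0, 2), (1, 4), (3, 8)]

candidates = list(recombine(samples))                     # 3 * 3 * 1 combinations
solutions = [p for p in candidates if passes(p, tests)]   # CPU-side evaluation
n_generated = sum(len(v) for v in samples.values())       # model generations used
print(f"{n_generated} generations -> {len(candidates)} candidates, {len(solutions)} passing")
```

In this toy setting, seven sampled snippets yield nine candidate programs, and the gap widens quickly as the number of modules or samples per module grows. A real evaluator would run candidates in an isolated sandbox rather than with a bare `exec`.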
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~lin_Yang1
Submission Number: 8119