Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Published: 23 Sept 2025 · Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: chain of thought, online reinforcement learning, in-context learning, multi-hop reasoning, prompting techniques, bettertogether
TL;DR: Extends GRPO to modular LM programs; shows that composing GRPO with prompt optimization works best.
Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization via the BetterTogether method of Soylu et al. (2024), improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks compared with the post-trained LM, and by 5% compared with prompt optimization on its own. We release mmGRPO as an open-source optimizer for compound AI systems: the dspy.GRPO optimizer in the DSPy library at dspy.ai.
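Since the abstract announces the dspy.GRPO optimizer, the following is a minimal sketch of how such an optimizer might be applied to a multi-module DSPy program, assuming DSPy's usual compile-style optimizer interface. The program, metric, dataset, and constructor arguments below are illustrative rather than the library's exact API; see dspy.ai for the actual usage and requirements (e.g., a trainable local LM backend).

```python
import dspy

# An LM must be configured first, e.g.:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# (GRPO post-training additionally requires a trainable model; details at dspy.ai.)

# Hypothetical two-module program: one LM call drafts a summary,
# a second LM call produces the final label from that summary.
class TwoStepClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.draft = dspy.ChainOfThought("text -> summary")
        self.label = dspy.Predict("text, summary -> label")

    def forward(self, text):
        summary = self.draft(text=text).summary
        return self.label(text=text, summary=summary)

# Illustrative metric: exact-match accuracy on the gold label.
def label_match(example, prediction, trace=None):
    return prediction.label.strip().lower() == example.label.strip().lower()

trainset = [
    dspy.Example(text="The battery died after one day.", label="negative").with_inputs("text"),
    dspy.Example(text="Setup took two minutes and it just works.", label="positive").with_inputs("text"),
]

# Assumed invocation: the optimizer groups each module's completions across
# rollouts and updates the shared LM weights; exact arguments may differ.
optimizer = dspy.GRPO(metric=label_match)
trained_program = optimizer.compile(TwoStepClassifier(), trainset=trainset)
```

Per the abstract, this weight-level optimization can be composed with automatic prompt optimization (BetterTogether) for the largest gains.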
Submission Number: 37