Keywords: Large Language Models, Reinforcement Learning with Verifiable Rewards, Generalization, Causal Reasoning
Abstract: Reinforcement learning with verifiable rewards (RLVR) is increasingly used for post-training large language models (LLMs) to reason, but it remains unclear when RLVR yields reliable generalization. This paper investigates the generalization of RLVR using causal reasoning problems as a testbed, namely probabilistic inference in a causal graphical model. We choose this setting because causality is an important area that LLMs still struggle with, and because it provides two natural axes of difficulty along which to systematically probe generalization: the level of the probabilistic query—associational, interventional, and counterfactual—and the complexity of the query, as measured by the size of its relevant subgraph. We generate datasets of causal graphs and queries spanning these axes of difficulty and use them to fine-tune Qwen-2.5-Instruct models using RLVR and SFT, varying the query level seen during training and the model scale (3B–32B). Our experiments show that RLVR achieves stronger within- and across-level generalization than SFT, but only on a subset of (model scale, query level) configurations. We trace the source of RLVR's effectiveness (or lack thereof) partly to the reasoning capability of the LLM on a particular level prior to fine-tuning. RLVR then improves the marginalization strategy and reduces probability derivation errors in the reasoning steps, significantly boosting accuracy overall and especially on more complex queries. Overall, we find that RLVR significantly improves generalization on causal reasoning queries at the associational and interventional levels, but counterfactual-level queries remain challenging for all models investigated in our experiments.
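For readers less familiar with the three query levels mentioned in the abstract, a minimal sketch follows, assuming a hypothetical two-variable causal graph $X \to Y$ (this toy graph is an illustrative assumption, not the paper's dataset): the three levels correspond to Pearl's ladder of causation, with complexity in the paper instead measured by the size of the query's relevant subgraph in larger generated graphs.

$$
\begin{aligned}
\text{associational:} \quad & P(Y = y \mid X = x) \\
\text{interventional:} \quad & P\big(Y = y \mid \mathrm{do}(X = x)\big) \\
\text{counterfactual:} \quad & P\big(Y_{x'} = y' \mid X = x,\, Y = y\big)
\end{aligned}
$$

Here $\mathrm{do}(X = x)$ denotes setting $X$ by intervention (severing its incoming edges), and $Y_{x'}$ denotes the value $Y$ would have taken had $X$ been set to $x'$, conditioned on the factually observed $X = x$ and $Y = y$.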
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21667