Keywords: length control, efficient reasoning, reinforcement learning, large language models
TL;DR: RL-trained reasoning models achieve their best sample accuracy at an intermediate reasoning length: longer outputs can make the modal answer more correct while increasing sample dispersion.
Abstract: Reinforcement learning substantially improves reasoning in large language models,
but it also tends to lengthen chain-of-thought outputs and increase computational cost.
Although length-control methods have been proposed,
the length–accuracy relationship they induce remains unclear.
We train policies with several length-control methods on multiple base models in a controlled setup and find that,
across both mathematical reasoning and code generation,
sample accuracy is non-monotonic in output length, peaking at an intermediate length.
Mode accuracy (the accuracy of the modal, i.e. most frequent, sampled answer), however, continues to improve with length even in settings where sample accuracy plateaus or declines,
indicating that the non-monotonic length–accuracy relationship is driven by dispersion around an increasingly correct center.
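
The distinction between the two metrics can be illustrated with a minimal sketch. It assumes that sample accuracy is the mean fraction of sampled answers that are correct, and that mode accuracy is the fraction of problems whose most frequent sampled answer is correct; the abstract does not give the exact definitions, so both are assumptions. The function names and the toy answers below are hypothetical and are not results from the paper.

```python
from collections import Counter


def sample_accuracy(samples, truth):
    """Mean, over problems, of the fraction of sampled answers that are correct."""
    per_problem = [
        sum(answer == target for answer in answers) / len(answers)
        for answers, target in zip(samples, truth)
    ]
    return sum(per_problem) / len(per_problem)


def mode_accuracy(samples, truth):
    """Fraction of problems whose most frequent sampled answer is correct."""
    hits = 0
    for answers, target in zip(samples, truth):
        modal_answer, _count = Counter(answers).most_common(1)[0]
        hits += modal_answer == target
    return hits / len(truth)


# Hypothetical toy data: two problems, four sampled answers each.
truth = ["4", "9"]

# A "short" policy: concentrated answers, but the center is wrong on problem 2.
short_policy = [["4", "4", "4", "4"], ["7", "7", "7", "9"]]

# A "long" policy: the modal answer is right on both problems,
# but samples scatter more around it.
long_policy = [["4", "4", "5", "6"], ["9", "9", "7", "8"]]

for name, samples in [("short", short_policy), ("long", long_policy)]:
    print(name, sample_accuracy(samples, truth), mode_accuracy(samples, truth))
```

In this toy case the longer policy has higher mode accuracy (1.0 versus 0.5) yet lower sample accuracy (0.5 versus 0.625), which is the pattern the abstract attributes to dispersion around an increasingly correct center.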
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 126