Keywords: RLHF, Rule-based Rewards, Reasoning Chain-of-Thought
TL;DR: We introduce AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards.
Abstract: Existing rule-based rewards in preference-based reinforcement learning rely on manual engineering, which limits scalability. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. Rule extraction in AutoRule operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chains of these interpretations, and synthesizes them into a unified rule set. Given the finalized rule set, we employ language-model verifiers to judge rule satisfaction and use the resulting satisfaction score as an auxiliary reward alongside the learned reward model during policy optimization. Empirically, AutoRule yields gains for both Llama-3-8B and Olmo-2-7B on both in-distribution and out-of-distribution benchmarks. On Llama-3-8B, it achieves a 25.6\% relative improvement in length-controlled win rate against GPT-4 on AlpacaEval 2.0 and a 6.1\% relative gain in second-turn performance on a held-out MT-Bench subset, compared to baseline models. Further analysis shows that the extracted rules agree strongly with dataset preferences and remain behaviorally consistent across multiple runs, extraction scales, and aggregated scores. Notably, the rules also help mitigate reward hacking of the reward model, likely because they act as constraints that prevent the policy from exploiting spurious features. Extracted rules are provided; code and model checkpoints will be open-sourced.
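The reward formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `judge_rule`, `reward_model`, and the weighting coefficient `alpha` are hypothetical placeholders assumed here, standing in for a language-model verifier of rule satisfaction, a learned reward model, and an auxiliary-reward weight.

```python
from typing import Callable, List

def rule_reward(response: str,
                rules: List[str],
                judge_rule: Callable[[str, str], bool]) -> float:
    """Fraction of extracted rules that the response satisfies.

    `judge_rule(rule, response)` stands in for a language-model verifier
    returning True if the response satisfies the given rule.
    """
    if not rules:
        return 0.0
    satisfied = sum(judge_rule(rule, response) for rule in rules)
    return satisfied / len(rules)

def combined_reward(response: str,
                    rules: List[str],
                    reward_model: Callable[[str], float],
                    judge_rule: Callable[[str, str], bool],
                    alpha: float = 0.5) -> float:
    """Learned reward model score plus a weighted rule-based auxiliary reward,
    as used during policy optimization."""
    return reward_model(response) + alpha * rule_reward(response, rules, judge_rule)
```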
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22143