Track: tiny / short paper (up to 4 pages)
Keywords: reinforcement learning from verifiable rewards, RLVR, rubric-based rewards, reward decomposition, logical reasoning, first-order logic, GRPO, formal verification, Z3 theorem prover, reward shaping, LLM reasoning evaluation
Abstract: Reinforcement learning from verifiable rewards (RLVR) has improved LLM reasoning, yet reward functions remain monolithic: a model producing a correct answer via flawed reasoning receives the same signal as one reasoning validly but extracting the wrong answer. We propose rubric-grounded rewards, a framework that decomposes reward into independently weighted criteria spanning a verifiable-to-soft spectrum. Applied to logical reasoning, our five-criterion rubric separates answer correctness, Z3-checked step validity, and format compliance (all machine-verifiable) from premise utilization and reasoning completeness (requiring judgment). We train Qwen2.5-3B-Instruct via GRPO under five reward conditions and evaluate on 166 hard FOLIO and ProntoQA examples. Three findings emerge: (1) rubric-structured verifiable rewards achieve the highest accuracy (51.8%, +6.6pp over baseline) with the most balanced True/False/Unknown performance; (2) rubric profiling reveals that conditions with near-identical accuracy exhibit substantially different quality profiles, exposing an “optimization tax” where RL training improves verifiable criteria while degrading soft ones; and (3) reward structure matters independently of reward content, as decomposing the same verification signals into explicit criteria outperforms their monolithic composite.
Presenter: ~Ishaan_Gangwani1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 58