Logic-Verified GRPO: Graded Z3 Process Rewards for Logical Reasoning in Small LLMs

Published: 05 Mar 2026 · Last Modified: 17 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: logical reasoning, reinforcement learning, GRPO, process rewards, solver-in-the-loop, Z3, SMT solving, formal verification, first-order logic, calibration, abstention/unknown prediction, FOLIO, ProntoQA, small language models
TL;DR: We train a 3B LLM with GRPO using graded, step-level Z3 verification rewards (instead of binary outcome checks) to improve logical reasoning, and in particular "Unknown"/abstention calibration on FOLIO and ProntoQA.
Abstract: Recent work integrates symbolic solvers into reinforcement learning for LLM reasoning, but existing approaches typically use binary chain-level verification: a full reasoning trace is either correct or not. We introduce Logic-Verified GRPO, which uses the Z3 SMT solver to provide graded, step-level process rewards within GRPO training—each step is independently verified and receives proportional credit based on its formal validity. We evaluate on FOLIO and ProntoQA using Qwen2.5-3B-Instruct. While both GRPO variants improve accuracy over the baseline (+8–10pp), our key finding is that graded step verification produces markedly better epistemic calibration: the Z3-verified model achieves +14pp improvement on Unknown (unprovable) conclusions (55.6% vs. 41.7% baseline), while outcome-only GRPO actually degrades Unknown recognition (38.9%). This suggests that graded symbolic process rewards teach models to distinguish “valid proof found” from “no valid derivation exists”—a distinction invisible to outcome-only or binary verification rewards.
Presenter: ~Ishaan_Gangwani1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 50