Logic-Verified GRPO: Graded Z3 Process Rewards for Logical Reasoning in Small LLMs

Published: 05 Mar 2026 · Last Modified: 17 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: logical reasoning, reinforcement learning, GRPO, process rewards, solver-in-the-loop, Z3, SMT solving, formal verification, first-order logic, calibration, abstention/unknown prediction, FOLIO, ProntoQA, small language models
TL;DR: We train a 3B LLM with GRPO using graded, step-level Z3 verification rewards (instead of binary outcome checks) to improve logical reasoning, and in particular "Unknown"/abstention calibration on FOLIO and ProntoQA.
Abstract: Recent work integrates symbolic solvers into reinforcement learning for LLM reasoning, but existing approaches typically use binary chain-level verification: a full reasoning trace is either correct or not. We introduce Logic-Verified GRPO, which uses the Z3 SMT solver to provide graded, step-level process rewards within GRPO training—each step is independently verified and receives proportional credit based on its formal validity. We evaluate on FOLIO and ProntoQA using Qwen2.5-3B-Instruct. While both GRPO variants improve accuracy over the baseline (+8–10pp), our key finding is that graded step verification produces markedly better epistemic calibration: the Z3-verified model achieves +14pp improvement on Unknown (unprovable) conclusions (55.6% vs. 41.7% baseline), while outcome-only GRPO actually degrades Unknown recognition (38.9%). This suggests that graded symbolic process rewards teach models to distinguish “valid proof found” from “no valid derivation exists”—a distinction invisible to outcome-only or binary verification rewards.
Presenter: ~Ishaan_Gangwani1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 50