Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

Published: 05 Mar 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop on LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: recursive self-improvement, neuro-symbolic verification, self-training, chain-of-thought reasoning, symbolic computation, direct preference optimization, model collapse, mathematical reasoning
TL;DR: We stabilize recursive self-improvement by embedding symbolic verification (via SymPy) into the self-training loop, filtering training data at the reasoning-step level to eliminate "lucky guesses" and enable deeper iterative self-training.
Abstract: Recursive self-improvement—where a model iteratively trains on its own outputs—promises unbounded capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training-data quality at the reasoning-step level. Unlike outcome-only filtering (which admits "lucky guesses" with flawed reasoning), NSRSA verifies each arithmetic operation via SymPy, checks logical-flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B across 5 self-training iterations under three conditions: no verification, outcome verification, and full NSRSA verification. Our results show that NSRSA enables deeper recursive self-improvement—sustaining accuracy gains over more iterations—while outcome-only filtering plateaus and unfiltered training collapses. We further demonstrate that constructing DPO preference pairs from NSRSA verification (choosing symbolically verified solutions over lucky guesses) teaches the model to internalize sound reasoning. NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.
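The abstract's core filtering idea—checking each arithmetic claim in a chain-of-thought with SymPy so that "lucky guesses" with flawed intermediate steps are rejected—can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the regex, the `verify_steps` helper, and the treatment of unparsable steps are all assumptions for the sake of the example.

```python
# Hypothetical sketch of reasoning-step-level symbolic verification: every
# "expr = value" claim found in a solution string is re-evaluated with SymPy.
# A solution passes only if all intermediate steps hold, so a sample whose
# final answer is right by accident ("lucky guess") is still filtered out.
import re
from sympy import sympify

# Matches simple arithmetic claims like "3 * 4 = 12" (illustrative only).
STEP_RE = re.compile(r"([-\d+*/(). ]+)=\s*(-?\d+(?:\.\d+)?)")

def verify_steps(solution: str) -> bool:
    """Return True only if every arithmetic claim in the text is correct."""
    for expr, value in STEP_RE.findall(solution):
        if not expr.strip():
            continue
        try:
            if sympify(expr) != sympify(value):
                return False  # flawed intermediate step: reject the sample
        except Exception:
            return False  # unparsable arithmetic is treated as unverified
    return True

# A sound solution and a "lucky guess" reaching the same final answer:
sound = "Tom has 3 * 4 = 12 apples, gives away 5, so 12 - 5 = 7 remain."
lucky = "Tom has 3 * 4 = 13 apples, gives away 6, so 13 - 6 = 7 remain."
```

In a self-training loop, such a check would gate which generated solutions enter the next iteration's training set; the paper additionally layers logical-flow and domain-constraint checks on top of this arithmetic pass, and uses verified-vs-lucky pairs to build DPO preference data.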
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 47