Keywords: LLMs, coding, reasoning, self-play, formal verification, Haskell
Abstract: We introduce a self-play framework for semantic equivalence in Haskell, using formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs to validate equivalence and execution-based counterexamples to establish inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of $\approx$28k validated Haskell programs. Experiments show that our evaluator transfers effectively to downstream tasks, achieving up to a 13.3 percentage-point accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The full training pipeline and dataset are publicly released on GitHub and Hugging Face, respectively.
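To make the execution-based inequivalence signal concrete, the sketch below shows one plausible shape for that check: run two candidate programs over a finite input set and report the first input on which their outputs diverge. All names here are illustrative assumptions, not code from the released pipeline; the actual framework pairs this kind of counterexample search with Liquid Haskell proofs for the equivalence side.

```haskell
-- Illustrative sketch (not the released pipeline): execution-based
-- inequivalence checking by searching for a distinguishing input.

-- Reference implementation: sum of squares via explicit recursion.
sumSquares :: [Int] -> Int
sumSquares []     = 0
sumSquares (x:xs) = x * x + sumSquares xs

-- A generator-proposed candidate that is NOT equivalent:
-- it silently drops negative elements before squaring.
sumSquares' :: [Int] -> Int
sumSquares' = sum . map (^ 2) . filter (>= 0)

-- Return the first input (in list order) on which the two
-- programs disagree, or Nothing if none is found.
counterexample :: Eq b => (a -> b) -> (a -> b) -> [a] -> Maybe a
counterexample f g = foldr step Nothing
  where step x acc = if f x /= g x then Just x else acc

main :: IO ()
main = print (counterexample sumSquares sumSquares' inputs)
  where inputs = [[], [1, 2], [-1, 2], [3, -4, 5]]
```

Here `[-1, 2]` is the first diverging input (`5` vs. `4`), so it would serve as the counterexample witnessing inequivalence; a pair with no diverging input over the search space would instead be handed to the proof-based equivalence check.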
Paper Type: Long
Research Area: Code Models
Research Area Keywords: code understanding, code generation, program verification, vulnerability detection, code reasoning; formal methods with LLMs, neurosymbolic approaches, red teaming, self-supervised learning, adversarial training
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Haskell, Python, C
Submission Number: 4912