SOKRATES: Distilling Symbolic Knowledge into Option-Level Reasoning via Solver-Guided Preference Optimization
Keywords: Chain-of-thought reasoning, Logical reasoning, Neuro-symbolic AI, Direct Preference Optimization, Options and Knowledge framework, First-order logic, Solver-guided supervision, Process supervision, Proof verification
TL;DR: LLMs get right answers via wrong reasoning (94% acc, 2% valid proofs). SOKRATES uses solver feedback on inference-rule options: 44× improvement (2%→92% valid). First system with option vocabulary + success predictor + solver-guided DPO.
Abstract: A language model that achieves 94% accuracy on logical reasoning sounds impressive—until you discover that only 2% of its proofs are actually valid. This is the state of chain-of-thought prompting: models produce plausible rationales that frequently contain invalid inference steps, hidden contradictions, or skipped derivations. The right answer emerges despite, not because of, the reasoning process. We introduce SOKRATES (Symbolic Option-Knowledge Reasoning Alignment via Trace Evaluation with Solver), a method that instantiates Sutton's Options and Knowledge (OaK) framework in a first-order logic micro-world. SOKRATES represents proofs as sequences of discrete inference-rule options (e.g., MODUS_PONENS, UNIV_INSTANTIATION), verified step-by-step by a FOL solver. From solver feedback we (i) train an option-success predictor that estimates validity before execution, and (ii) construct preference pairs for Direct Preference Optimization (DPO), aligning the model's option policy with solver-induced correctness. On PrOntoQA, SOKRATES raises accuracy from 94.2% to 97.6%, step validity from 27.3% to 98.5%, and full-trace validity from 2.1% to 92.0%, a 44× improvement in logically sound proofs. The learned predictor is well calibrated (ECE = 0.08), and the option policy transfers zero-shot to FOLIO, improving accuracy from 45.3% to 53.2%. To our knowledge, SOKRATES is the first LLM reasoning system that (i) represents proofs as a fixed option vocabulary, (ii) learns an explicit option-success predictor, and (iii) aligns the option policy using solver-derived DPO preferences in an iterative loop.
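The abstract's core mechanism (solver verdicts on option-level proof steps turned into DPO preference pairs) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `Step` record, the stand-in per-step `valid` flag (which a real system would obtain from a FOL solver), and the pairing helper are all assumptions made for clarity.

```python
# Illustrative sketch (not the paper's code): turning solver verdicts on
# option-level proof steps into DPO-style (chosen, rejected) preference pairs.
# Option names follow the abstract's fixed vocabulary; the per-step `valid`
# flag stands in for a real FOL solver's check of that inference step.

from dataclasses import dataclass

OPTIONS = {"MODUS_PONENS", "UNIV_INSTANTIATION"}  # fixed option vocabulary


@dataclass
class Step:
    option: str       # inference-rule option applied at this step
    conclusion: str   # formula derived by the step (shown as text here)
    valid: bool       # solver verdict for this step (stand-in for a FOL checker)


def trace_valid(trace: list[Step]) -> bool:
    """A full trace is logically sound iff every step is solver-valid."""
    return all(step.valid for step in trace)


def build_preference_pairs(traces: list[list[Step]]) -> list[tuple[list[Step], list[Step]]]:
    """Pair every fully valid trace (chosen) with every invalid trace
    (rejected) sampled for the same problem, yielding DPO preference pairs."""
    chosen = [t for t in traces if trace_valid(t)]
    rejected = [t for t in traces if not trace_valid(t)]
    return [(c, r) for c in chosen for r in rejected]
```

Under this framing, a valid trace beats any trace whose solver check fails at some step, which is how solver feedback induces the preference ordering the DPO objective is trained on.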
Submission Number: 102