Keywords: LLM, Generation Auditing, Verifiability
Abstract: Although Large Language Models (LLMs) have become capable reasoners, the problem of faithfulness persists: their reasoning can contain errors and omissions that are difficult to detect and that may obscure biases in model outputs.
To address this issue, we introduce Semi-Structured Reasoning Models (SSRMs), which are trained to produce semi-structured representations of reasoning.
SSRMs generate reasoning traces in a *non-executable* Pythonic syntax that names each reasoning step and marks its inputs and outputs.
This structure allows SSRM traces to be automatically *audited* to identify reasoning flaws.
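To make this concrete, a semi-structured trace might resemble the minimal sketch below; the step names and call syntax are illustrative assumptions, since the abstract does not show the paper's actual trace format.

```python
# Hypothetical semi-structured reasoning trace, stored as a plain string
# because SSRM traces are Pythonic but *non-executable*. The step names
# (recall_facts, lookup, multiply, conclude) are illustrative assumptions,
# not the paper's actual vocabulary.
trace = '''
facts = recall_facts(question="How many legs do three spiders have?")
legs_per_spider = lookup(facts, key="spider legs")
total_legs = multiply(legs_per_spider, 3)
answer = conclude(total_legs)
'''
print(trace)
```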
We evaluate three types of audits: hand-crafted *structured reasoning audits*, written in a domain-specific language (DSL) implemented in Python; LLM-generated *structured reasoning audits*; and learned *typicality audits*, which apply probabilistic models over reasoning traces.
We show that all of these methods can be used to effectively flag probable reasoning errors.
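For the third audit type, a learned typicality audit could score the sequence of step names in a trace under a probabilistic model fit to reference traces and flag unusually improbable sequences. The smoothed bigram model below is a simplified stand-in for illustration, not the paper's actual model.

```python
import re
from collections import Counter
from math import log

# Simplified stand-in for a learned typicality audit (not the paper's model):
# fit an additively smoothed bigram model over step names seen in reference
# traces, then score new traces by average log-probability; low scores suggest
# atypical (possibly flawed) reasoning structure.
def step_names(trace: str) -> list[str]:
    return ["<s>"] + re.findall(r"=\s*(\w+)\(", trace) + ["</s>"]

def fit_bigram(traces: list[str]):
    bigram, context, vocab = Counter(), Counter(), set()
    for t in traces:
        steps = step_names(t)
        vocab.update(steps)
        bigram.update(zip(steps, steps[1:]))
        context.update(steps[:-1])
    return bigram, context, len(vocab)

def typicality(trace: str, bigram, context, vocab_size, alpha=1.0) -> float:
    steps = step_names(trace)
    logps = [
        log((bigram[(a, b)] + alpha) / (context[a] + alpha * vocab_size))
        for a, b in zip(steps, steps[1:])
    ]
    # Flag traces whose average log-probability falls below a chosen threshold.
    return sum(logps) / len(logps)
```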
Importantly, the auditability of SSRMs does not appear to compromise overall accuracy: in evaluation on twelve benchmarks and two model families, SSRMs demonstrate strong performance and generalizability relative to other models of comparable size.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14575