Keywords: AI for Mathematics, Formal Verification, Semantic Reduction, Intermediate Representation, Large Language Models
TL;DR: A typed semantic middle layer catches unit-mismatch errors (apples + dollars) in LLM math reasoning that Python's dynamic typing silently accepts, achieving 96.8% accepted-contract precision.
Abstract: A language model solves a math problem step by step. The reasoning reads fluently, but should we trust it? Today's
dominant strategy -- generating Python code for execution -- handles arithmetic reliably yet is structurally blind to
semantic unit errors: apples + dollars compiles and runs without complaint. We argue that the missing ingredient is
not a better generator or a stronger verifier, but an explicit semantic middle layer that makes the structure of a
solution machine-checkable before any numbers are computed.
We propose SC-IR (Semantic Contract Intermediate Representation), a typed contract language whose type system tracks
ontology-aware quantity kinds -- Count[apple], Rate[km,litre], Frac -- and enforces kind consistency via six
division-aware typing rules. SC-IR maps a reasoning trace to a typed contract and either accepts it (with discharged
proof obligations) or rejects it with one of three failure labels: reduction, typing, or verification failure. Each
label implies a distinct repair strategy, making the pipeline's failures operationally actionable.
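
To make the contrast with plain Python concrete, here is a minimal sketch of kind-tagged quantities. It is not SC-IR itself: the class names, the `TypingFailure` exception, and the three rules shown (same-kind addition, Count/Count division, Rate times Count) are illustrative assumptions, since the abstract does not spell out the six typing rules.

```python
from dataclasses import dataclass


class TypingFailure(Exception):
    """Hypothetical stand-in for the 'typing failure' label."""


@dataclass(frozen=True)
class Frac:
    value: float  # dimensionless ratio


@dataclass(frozen=True)
class Rate:
    num: str      # numerator kind, e.g. "km"
    den: str      # denominator kind, e.g. "litre"
    value: float

    def __mul__(self, other):
        # Assumed rule: Rate[a,b] * Count[b] -> Count[a]
        if not isinstance(other, Count) or other.kind != self.den:
            raise TypingFailure(f"cannot multiply Rate[{self.num},{self.den}] by {other!r}")
        return Count(self.num, self.value * other.value)


@dataclass(frozen=True)
class Count:
    kind: str     # ontology kind, e.g. "apple"
    value: float

    def __add__(self, other):
        # Assumed rule: addition requires two counts of the same kind.
        if not isinstance(other, Count) or other.kind != self.kind:
            raise TypingFailure(f"cannot add Count[{self.kind}] and {other!r}")
        return Count(self.kind, self.value + other.value)

    def __truediv__(self, other):
        # Assumed division-aware rules:
        #   Count[a] / Count[a] -> Frac
        #   Count[a] / Count[b] -> Rate[a,b]
        if not isinstance(other, Count):
            raise TypingFailure(f"cannot divide Count[{self.kind}] by {other!r}")
        if other.kind == self.kind:
            return Frac(self.value / other.value)
        return Rate(self.kind, other.kind, self.value / other.value)


# Plain Python: the unit mismatch executes without complaint.
untyped_total = 12 + 3.5  # "apples + dollars"

# Kind-tagged version: the same sum is rejected before any number is produced.
try:
    Count("apple", 12) + Count("dollar", 3.5)
except TypingFailure as err:
    print("rejected:", err)

# A well-typed chain: km / litre gives a rate, and rate * litre gives km back.
economy = Count("km", 300) / Count("litre", 20)
print(economy * Count("litre", 4))
```

The point of the sketch is only that the kind check happens on the structure of the expression, so the mismatch surfaces as a labeled failure rather than as a plausible-looking number.
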
We evaluate SC-IR on the full GSM8K test set (1,319 problems) under a true blind protocol -- the model sees only the
problem text, no answer hints -- and compare against Program-of-Thought (PoT). When SC-IR accepts a contract, it is
correct 87.0% of the time, compared with PoT's 77.0% overall accuracy, showing the precision of selective typed
acceptance. With DeepSeek-V4-Pro, SC-IR reaches 61.8% coverage, 59.8% overall accuracy, and 96.8% accepted-contract
precision; PoT with the same generator remains the stronger accuracy baseline at 95.0%. Critically, SC-IR catches 21
semantic unit errors that PoT silently executes. An agentic repair loop adds +8.5% cumulative accuracy, with
verification failures showing the highest repair rate (30.4%) -- confirming that structured failure attribution
enables targeted correction. Removing ontology typing raises the false-accept rate from 2.1% to 3.6%.
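
The three DeepSeek-V4-Pro figures are mutually consistent if overall accuracy counts every rejected contract as unanswered, so that overall accuracy equals coverage times accepted-contract precision. The short check below assumes that definition, which the abstract does not state explicitly; the function name is hypothetical.

```python
def overall_accuracy(coverage: float, precision: float) -> float:
    # Assumption: rejected contracts contribute no correct answers, so
    # correct/total = (accepted/total) * (correct/accepted).
    return coverage * precision


acc = overall_accuracy(0.618, 0.968)
print(f"{acc:.1%}")  # 59.8%
```

Under this reading, the gap to PoT's 95.0% comes from coverage rather than from errors among accepted contracts.
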
Track: Track 3: ML Competition Proposals for Social Impact in Muslim Communities
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 51