Semantic Contracts as the Missing Middle Layer for Reliable AI Mathematics

ZhangHao

Published: 14 Jun 2026, Last Modified: 14 Jun 2026 | Venue: ICML 2026 Workshop MusIML Poster | Readers: Everyone | License: CC BY 4.0

Keywords: AI for Mathematics, Formal Verification, Semantic Reduction, Intermediate Representation, Large Language Models

TL;DR: A typed semantic middle layer catches unit-mismatch errors (apples + dollars) in LLM math reasoning that Python's dynamic typing silently accepts, achieving 96.8% accepted-contract precision.

Abstract: A language model solves a math problem step by step. The reasoning reads fluently, but should we trust it? Today's dominant strategy -- generating Python code for execution -- handles arithmetic reliably yet is structurally blind to semantic unit errors: apples + dollars compiles and runs without complaint. We argue that the missing ingredient is not a better generator or a stronger verifier, but an explicit semantic middle layer that makes the structure of a solution machine-checkable before any numbers are computed. We propose SC-IR (Semantic Contract Intermediate Representation), a typed contract language whose type system tracks ontology-aware quantity kinds -- Count[apple], Rate[km,litre], Frac -- and enforces kind consistency via six division-aware typing rules. SC-IR maps a reasoning trace to a typed contract and either accepts it (with discharged proof obligations) or rejects it with one of three failure labels: reduction, typing, or verification failure. Each label implies a distinct repair strategy, making the pipeline's failures operationally actionable. We evaluate SC-IR on the full GSM8K test set (1,319 problems) under a true blind protocol -- the model sees only the problem text, no answer hints -- and compare against Program-of-Thought (PoT). When SC-IR accepts a contract, it is correct 87.0% of the time, compared with PoT's 77.0% overall accuracy, showing the precision of selective typed acceptance. With DeepSeek-V4-Pro, SC-IR reaches 61.8% coverage, 59.8% overall accuracy, and 96.8% accepted-contract precision; PoT with the same generator remains the stronger accuracy baseline at 95.0%. Critically, SC-IR catches 21 semantic unit errors that PoT silently executes. An agentic repair loop adds +8.5% cumulative accuracy, with verification failures showing the highest repair rate (30.4%) -- confirming that structured failure attribution enables targeted correction. Removing ontology typing raises the false-accept rate from 2.1% to 3.6%.
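The contrast the abstract draws between untyped execution and kind-checked quantities can be illustrated with a minimal Python sketch. This is not SC-IR itself: the abstract does not spell out the six division-aware typing rules, so the two division and multiplication rules below are assumptions chosen for illustration, and the names `Quantity` and `KindError` are hypothetical. Only the kind names `Count`, `Rate`, and `Frac` come from the abstract.

```python
from dataclasses import dataclass
from typing import Union


class KindError(TypeError):
    """Two quantities with incompatible kinds were combined."""


@dataclass(frozen=True)
class Count:
    unit: str  # e.g. "apple"


@dataclass(frozen=True)
class Rate:
    num: str  # numerator unit, e.g. "km"
    den: str  # denominator unit, e.g. "litre"


@dataclass(frozen=True)
class Frac:
    pass  # dimensionless ratio


Kind = Union[Count, Rate, Frac]


@dataclass(frozen=True)
class Quantity:
    """Hypothetical value-plus-kind pair; not the paper's representation."""
    value: float
    kind: Kind

    def __add__(self, other: "Quantity") -> "Quantity":
        # Addition is only allowed between identical kinds.
        if self.kind != other.kind:
            raise KindError(f"cannot add {self.kind} and {other.kind}")
        return Quantity(self.value + other.value, self.kind)

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Assumed rules: Count[a] / Count[a] -> Frac,
        #                Count[a] / Count[b] -> Rate[a, b].
        a, b = self.kind, other.kind
        if isinstance(a, Count) and isinstance(b, Count):
            kind: Kind = Frac() if a == b else Rate(a.unit, b.unit)
            return Quantity(self.value / other.value, kind)
        raise KindError(f"cannot divide {a} by {b}")

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Assumed rule: Rate[a, b] * Count[b] -> Count[a].
        a, b = self.kind, other.kind
        if isinstance(a, Rate) and isinstance(b, Count) and a.den == b.unit:
            return Quantity(self.value * other.value, Count(a.num))
        raise KindError(f"cannot multiply {a} by {b}")


apples = Quantity(3, Count("apple"))
dollars = Quantity(5, Count("dollar"))

# Plain Python adds the bare numbers; the unit mismatch is invisible.
print(3 + 5)

# The kind-checked version rejects the same step before computing anything.
try:
    apples + dollars
except KindError as err:
    print("rejected:", err)

# Division of unlike counts yields a rate, which can then scale a count.
economy = Quantity(300, Count("km")) / Quantity(20, Count("litre"))
print(economy.kind)
print((economy * Quantity(4, Count("litre"))).kind)
```

In this sketch the rejection happens when kinds are combined, which loosely mirrors the abstract's idea of checking the structure of a solution before any numbers are trusted. A real contract language would presumably check kinds statically over the whole trace rather than at runtime, one operation at a time.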

Track: Track 3: ML Competition Proposals for Social Impact in Muslim Communities

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 51
