Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation

ICLR 2026 Conference Submission 25412 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Test-Case Generation, Contract-Violating Test Cases, Contract-Aware Evaluation, SMT solver, Code Generation
TL;DR: A contract-aware benchmark and generation framework that pairs LLMs with an SMT solver to create violation-focused tests and quantitatively assess whether generated code satisfies explicit contracts.
Abstract: Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate large language models (LLMs) with $\textit{pass@k}$ on functional correctness using well-formed inputs. However, they ignore a crucial aspect of real-world software: adherence to $\textit{contracts}$, the preconditions and validity constraints that dictate how ill-formed inputs must be rejected. This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets. We introduce $\textbf{PACT}$, a program assessment and contract-adherence evaluation framework, to bridge this gap. PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets alongside functional correctness. PACT's contributions are threefold: First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+. Second, it enables a systematic analysis of code generation under varied prompting conditions. This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhances a model's ability to respect contracts compared to using contract descriptions alone. Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation. By revealing critical errors that conventional benchmarks overlook, PACT provides rigorous and interpretable metrics for evaluating the robustness of LLM-generated code snippets in both functionality and contract-adherence.
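To illustrate the general idea behind solver-driven, violation-focused test generation, here is a minimal sketch using the Z3 Python bindings: negate a stated precondition and ask the solver for a satisfying assignment, which becomes a contract-violating input. The precondition `n >= 0`, the function name `int_sqrt`, and the expectation of a `ValueError` are hypothetical illustrations, not details of PACT's actual pipeline.

```python
# Minimal sketch (not PACT's actual pipeline): use the Z3 SMT solver to
# find an input that violates a stated precondition, then wrap it in a
# contract-violating test case.
from z3 import Int, Solver, Not, sat

n = Int("n")
precondition = n >= 0  # hypothetical contract the code under test must enforce

solver = Solver()
solver.add(Not(precondition))   # ask the solver for an input outside the contract
assert solver.check() == sat
violating_n = solver.model()[n].as_long()  # e.g., some negative integer


def test_rejects_contract_violation(int_sqrt):
    """Contract-violating test: robust code should reject the ill-formed
    input (here, assumed to raise ValueError) rather than compute on it."""
    try:
        int_sqrt(violating_n)
    except ValueError:
        return True   # contract respected: ill-formed input rejected
    return False      # contract violated: ill-formed input accepted
```

A test like this probes the behavior that $\textit{pass@k}$ on well-formed inputs never exercises: whether generated code actively rejects inputs outside its stated contract.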
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25412