Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
Keywords: Test-Case Generation, Contract-Violating Test Cases, Contract-Aware Evaluation, SMT solver, Code Generation
TL;DR: A contract-aware benchmark and generation framework that pairs LLMs with an SMT solver to create violation-focused tests and quantitatively assess whether generated code satisfies explicit contracts.
Abstract: Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate
large language models (LLMs) with $\textit{pass@k}$ on functional correctness using well-formed inputs.
However, they ignore a crucial aspect of real-world software: adherence to $\textit{contracts}$, the
preconditions and validity constraints that dictate how ill-formed inputs must be rejected.
This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets.
We introduce $\textbf{PACT}$, a program assessment and contract-adherence evaluation framework, to bridge this gap.
PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets
alongside functional correctness.
PACT makes three contributions.
First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+.
Second, it enables a systematic analysis of code generation under varied prompting conditions.
This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhances a
model's ability to respect contracts compared to using contract descriptions alone.
Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation.
By revealing critical errors that conventional benchmarks overlook, PACT provides rigorous and interpretable metrics
for evaluating the robustness of LLM-generated code snippets in both functionality and contract-adherence.
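
To make the distinction between functional correctness and contract-adherence concrete, the following is a minimal illustrative sketch; it is not taken from PACT itself. The function `mean_positive`, its contract, the list `VIOLATING_INPUTS`, and the helper `rejection_rate` are all hypothetical names introduced here. The sketch also assumes that "rejecting" an ill-formed input means raising `ValueError`, and the simple rejection rate shown is not one of the metrics the paper defines.

```python
from typing import Any, Callable, Iterable, List


def mean_positive(values: List[float]) -> float:
    """Return the arithmetic mean of `values`.

    Hypothetical contract: `values` is a non-empty list of positive numbers.
    """
    # Contract checks: reject ill-formed inputs explicitly.
    if not isinstance(values, list) or len(values) == 0:
        raise ValueError("values must be a non-empty list")
    if any(
        isinstance(v, bool) or not isinstance(v, (int, float)) or v <= 0
        for v in values
    ):
        raise ValueError("values must contain only positive numbers")
    return sum(values) / len(values)


def mean_unchecked(values):
    """Same functionality, but with no contract checks at all."""
    return sum(values) / len(values)


def rejects(func: Callable[[Any], Any], bad_input: Any) -> bool:
    """True only if `func` rejects `bad_input` by raising ValueError (assumed convention)."""
    try:
        func(bad_input)
    except ValueError:
        return True
    except Exception:
        # An incidental crash is not counted as an explicit rejection.
        return False
    return False


def rejection_rate(func: Callable[[Any], Any], bad_inputs: Iterable[Any]) -> float:
    """Fraction of contract-violating inputs that `func` explicitly rejects."""
    bad_inputs = list(bad_inputs)
    return sum(rejects(func, x) for x in bad_inputs) / len(bad_inputs)


# Hypothetical contract-violating test inputs: empty list, negative value,
# zero value, and a wrong container type.
VIOLATING_INPUTS = [[], [1.0, -2.0], [0.0], "12"]
```

Both implementations pass a well-formed functional test such as `mean_positive([1.0, 2.0, 3.0]) == 2.0`, so a benchmark restricted to well-formed inputs cannot tell them apart. Only the contract-violating inputs separate them: under the assumptions above, `mean_positive` rejects all four, while `mean_unchecked` rejects none.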
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25412