Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Xu Xu; Xin Li; Xingwei Qu; Jie Fu; Binhang Yuan

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Xu Xu, Xin Li, Xingwei Qu, Jie Fu, Binhang Yuan

Published: 26 Jan 2026, Last Modified: 01 May 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: LLM, Formal Language, benchmark

TL;DR: We present DafnyCOMP, a benchmark showing that while LLMs excel at single-function verification, they catastrophically fail at compositional reasoning across multi-function programs, exposing a critical gap in verifiable code generation.

Abstract: Despite rapid advances in code generation, current Large Language Models (LLMs) still struggle with reliable and verifiable code synthesis in the presence of *compositional reasoning* requirements across multi-function programs. To study this systematically, we introduce DafnyCOMP, a benchmark for generating compositional Dafny specifications for programs consisting of multiple interacting functions with non-trivial data dependencies. Unlike prior benchmarks that focus primarily on single-function annotation, DafnyCOMP targets programs composed of 2-5 functions arranged in acyclic call graphs, requiring specifications that establish correctness across component boundaries. DafnyCOMP contains 400 automatically synthesized programs: 300 chain-structured instances and 100 non-chain DAG instances generated from 10 topology templates. We evaluate frontier LLMs from major providers under a unified prompting and verification protocol. While these models achieve high syntactic well-formedness (>99%) and moderate end-to-end verification (>58%) on prior single-function Dafny verification benchmarks, they obtain near-zero end-to-end verification on DafnyCOMP. On the chain split, even the strongest evaluated model reaches only 2% verification at Pass@8, with most models below 1%; the difficulty persists under broader topologies and stronger test-time scaling. Our analysis identifies three recurring failure modes that hinder cross-functional reasoning: *specification fragility*, *implementation--proof* misalignment, and *reasoning instability*. DafnyCOMP provides a diagnostic benchmark for tracking progress in verifiable code generation, highlighting that bridging local correctness to compositional verification remains a key open challenge.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 19018

Loading