Keywords: model checking, large language model, formal verification
Abstract: In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification, encompassing theorem proving and model checking, offers a rigorous and reliable path. While theorem proving has seen notable advances, model checking has received less attention, largely due to the difficulty of automatically modeling programs. To fill this gap, we introduce \name, a benchmark and accompanying pipeline for evaluating and improving LLMs' program modeling capability: translating Python programs into verification-ready model checking specifications that its accompanying model checker can verify. \name comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling capability and suggest promising directions for improvement.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24772