ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis

Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Rémi Delmas, Soonho Kong

Published: 11 Nov 2025, Last Modified: 05 May 2026 | Dafny Workshop at the Symposium on Principles of Programming Languages | CC BY 4.0

Abstract: Large language models have become proficient at generating functional code, but ensuring the output truly matches the programmer's intent remains difficult. Testing improves trust, yet for safety-critical applications, formal verification provides the only true guarantees through machine-checked proofs. However, verified code remains scarce compared with code in mainstream languages or data for mathematical theorem proving, limiting LLM capabilities in this domain. We present ATLAS, an automated pipeline that synthesizes verified programs to address this data bottleneck. Applied to the TACO dataset of Python solutions to LeetCode-style problems, ATLAS generates 2.7K verified Dafny programs, each with high-quality specifications and machine-checked proofs. Through task decomposition, we extract 19K training examples. Fine-tuning Qwen 2.5 7B Coder on this data improves performance from 32.4% to 56.9% on DafnyBench and from 15.8% to 65.8% on DafnySynthesis, demonstrating that synthetic data generation is a viable path to scaling LLM capabilities for formal verification.