Abstract: Large Language Models (LLMs) have emerged as a transformative AI paradigm, profoundly influencing broad aspects of daily life.
Despite their remarkable performance, LLMs exhibit a fundamental limitation: hallucination—the tendency to produce misleading outputs that appear plausible.
This inherent unreliability poses significant risks, particularly in high-stakes domains where trustworthiness is essential.
In contrast, Formal Methods (FMs), which share foundations with symbolic AI, provide mathematically rigorous techniques for modeling, specifying, reasoning about, and verifying the correctness of systems.
These methods have been widely employed in mission-critical domains such as aerospace, defense, and cybersecurity. However, the broader adoption of FMs remains constrained by significant challenges, including steep learning curves, limited scalability, and difficulties in adapting to the dynamic requirements of everyday applications.
To build trustworthy AI agents, we argue that the integration of LLMs and FMs is necessary to overcome the limitations of both paradigms. LLMs offer adaptability and human-like reasoning but lack formal guarantees of correctness and reliability.
FMs provide rigor but require greater accessibility and automation, which LLMs can help supply, to achieve broader adoption.
Lay Summary: Large language models (LLMs), like those behind today’s AI assistants, are powerful tools capable of writing, reasoning, and coding. Yet, they often produce confident but incorrect responses—a phenomenon known as "hallucination"—making them unreliable in high-stakes domains such as healthcare and law.
Formal methods (FMs), which provide mathematical guarantees about system behavior, have long been used in safety-critical fields like aviation. However, they are typically difficult to use, lack scalability, and struggle with dynamic real-world tasks.
To build trustworthy AI agents, we argue that LLMs and FMs must be deeply integrated, not just in one direction but in both. First, we explore how LLMs can enhance FMs by improving their automation and scalability. Then, we investigate how FMs can make LLMs more accurate, consistent, and reliable. These two directions go hand in hand: each strengthens the other, and together they pave the way toward building verifiably trustworthy AI systems (a minimal sketch of the second direction follows below).
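As a minimal sketch of the FM-to-LLM direction (our own illustration, not code from the paper), an SMT solver such as Z3 can check whether a constraint asserted in an LLM's answer is even satisfiable before that answer is trusted; the claim used here is hypothetical:

# Minimal sketch (illustrative only, assumes the z3-solver package):
# use the Z3 SMT solver to vet a hypothetical constraint extracted
# from an LLM's answer before accepting it.
from z3 import Int, Solver, sat

x, y = Int("x"), Int("y")

# Hypothetical LLM claim: "there exist integers x, y with
# x + y == 10, x > 7, and y > 3"
solver = Solver()
solver.add(x + y == 10, x > 7, y > 3)

if solver.check() == sat:
    print("Claim is satisfiable; witness:", solver.model())
else:
    print("Claim is unsatisfiable; flag the LLM output for revision")

In this hypothetical case the solver reports the claim unsatisfiable (x > 7 and y > 3 force x + y >= 12), so the LLM output would be flagged rather than passed along.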
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: OThkM
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Large Language Models, Formal Methods, SMT Solver, Automated Reasoning, Autoformalization, Theorem Proving, Testing and Verification, Model Checking
Submission Number: 153