Generating Safety-Critical Automotive C-programs using LLMs with Formal Verification

Published: 29 Aug 2025, Last Modified: 29 Aug 2025, NeSy 2025 - Phase 2 Poster, CC BY 4.0
Keywords: Large Language Models, Formal Verification, Safety-Critical Automotive Software, Prompt Engineering, Non-Functional Requirements
Abstract: We evaluate the feasibility of using Large Language Models (LLMs) to generate formally verified C code that adheres to both functional and non-functional requirements, for three real industrial automotive safety-critical software modules. We explore the capabilities of ten LLMs and four prompting techniques (Zero-Shot, Zero-Shot Chain-of-Thought, One-Shot, and One-Shot Chain-of-Thought) to generate C programs for the three modules. Functional correctness of the generated programs is assessed through formal verification, and adherence to non-functional requirements is evaluated using an industrial static analyzer together with human evaluation. The results demonstrate that LLMs can generate functionally correct code, with success rates of 540/800, 59/800, and 46/800 for the three modules, respectively. Additionally, the generated programs frequently adhere to the defined non-functional requirements. Where the LLM-generated programs do not adhere to the non-functional requirements, the deviations typically involve violations of single-read and single-write access patterns or of minimal variable scope constraints. These findings highlight both the promise and the limitations of using LLMs to generate industrial safety-critical C programs, and provide insight into improving automated LLM-based program generation in the automotive safety-critical domain.
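The single-read/single-write and minimal-variable-scope requirements mentioned in the abstract can be illustrated with a short C sketch. The signal names and logic below are hypothetical, invented purely for illustration; they are not taken from the paper's modules.

    #include <stdint.h>

    extern volatile uint16_t sensor_raw;   /* shared input signal (hypothetical)  */
    extern volatile uint16_t actuator_cmd; /* shared output signal (hypothetical) */

    /* Non-compliant: 'sensor_raw' is read twice (the two reads may observe
     * different values if the signal changes in between), and 'actuator_cmd'
     * is written on two separate paths. */
    void update_noncompliant(void)
    {
        if (sensor_raw > 100u) {
            actuator_cmd = (uint16_t)(sensor_raw / 2u);
        } else {
            actuator_cmd = 0u;
        }
    }

    /* Compliant: one read into a minimally scoped local, one write of the
     * result at a single point. */
    void update_compliant(void)
    {
        const uint16_t raw = sensor_raw; /* single read  */
        uint16_t cmd = 0u;

        if (raw > 100u) {
            cmd = (uint16_t)(raw / 2u);
        }
        actuator_cmd = cmd;              /* single write */
    }

Restricting each shared signal to a single read and a single write makes the function's behavior depend on one consistent snapshot of its inputs, which is why such access-pattern rules are common in automotive coding guidelines.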
Track: Main Track
Paper Type: Long Paper
Resubmission: Yes
Changes List: We made the following changes compared to the previous submission:
1. Improved the check for adherence to non-functional requirements. We incorporated a tool alongside manual analysis to check the non-functional requirements, and increased the number of inspected generated implementations from 10 to around 280, while ensuring that every combination of LLM and module is analyzed. Moreover, we conducted and included a thorough analysis of the manually analyzed programs.
2. Interpretation of results. We revised the statements made in the paper to avoid overstating the results.
3. We revised the conclusion to draw on the results of the paper itself, without exaggerated claims, and added three directions for future work. This addresses reviewer gjH1's comment that a substantial revision of the conclusion was needed.
4. We added experiments with more recent LLMs and redid the analysis for the new experiments. We selected five open-source and five closed-source models of varying nature (reasoning and non-reasoning) and sizes (although the exact sizes of the closed-source models are unknown).
5. We added a discussion section comparing human software development with LLM-based software development. This adds another direction for future work and addresses an essential limitation of the work, increasing transparency. For this, we calculated the cost of all conducted experiments and present it in the paper.
6. We made many small changes to improve readability (e.g., explaining why we picked this temperature, adding highlighting in tables, and fixing capitalization in references).
Publication Agreement: pdf
Submission Number: 27