RAG-Driven multiple assertions generation with large language models

Zhuang Liu, Hailong Wang, Tongtong Xu, Bei Wang

Published: 2025, Last Modified: 16 Oct 2025 | Empir. Softw. Eng. 2025 | CC BY-SA 4.0

Abstract: Software testing is one of the most crucial parts of the software development life cycle, and developers spend a substantial amount of time and effort on it. Recently, there has been growing scholarly interest in automating software testing. However, recent studies have revealed significant limitations in the quality and efficacy of generated assert statements. These limitations arise primarily from (i) the inherent complexity of generating assert statements that are both meaningful and effective, and (ii) the challenge of capturing the relationships among multiple assertions in a single test case. Recent research has employed deep learning techniques to generate meaningful assertions. However, these techniques typically generate a single assertion per test case, which is at odds with the fact that over 40% of test cases contain multiple assertions. Compared with deep learning techniques, large language models (LLMs) have demonstrated advantages in test generation tasks. This paper proposes a new approach named ALLMAssert (Augmented Large Language Model Assertion Generation) to automatically generate multiple assertions for test methods. ALLMAssert uses two LLMs that collaboratively generate test assertions for developers. We first fine-tune the codellama-34B-instruct model to obtain a specialized model for multi-assert generation. We then mine additional contextual information from the Java project. Through a series of information augmentation steps, we prompt the base LLM to correct the assert statements generated by the fine-tuned LLM. To evaluate the effectiveness of our approach, we conduct extensive experiments on a dataset built on top of the Methods2Test dataset. Experimental results show that ALLMAssert achieves 56.61%, 20.43%, and 15.07% in terms of CodeBLEU, accuracy, and perfect prediction, respectively, and substantially outperforms the baselines.
Furthermore, we evaluate ALLMAssert on the task of bug detection; the results indicate that the assert sequences it generates help expose 76 real-world bugs extracted from Defects4J, also outperforming the SOTA approaches by a large margin.
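The two-stage flow described in the abstract (a fine-tuned LLM drafts the assertions, then a base LLM corrects them using mined project context) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the function names, the prompt wording, and the inputs `focal_method` and `test_prefix` are all hypothetical, and the abstract does not specify how the context is mined or how prompts are built.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: any function from prompt text to output text.
LLM = Callable[[str], str]


def build_correction_prompt(focal_method: str, test_prefix: str,
                            context: List[str], draft: str) -> str:
    """Assemble a correction prompt (wording is an assumption, not from the paper)."""
    context_block = "\n".join(context)
    return (
        "Project context:\n" + context_block + "\n\n"
        "Method under test:\n" + focal_method + "\n\n"
        "Test method so far:\n" + test_prefix + "\n\n"
        "Draft assertions:\n" + draft + "\n\n"
        "Correct the draft assertions. Output one assert statement per line."
    )


def generate_assertions(focal_method: str, test_prefix: str, context: List[str],
                        fine_tuned_llm: LLM, base_llm: LLM) -> List[str]:
    """Draft with the fine-tuned model, then let the base model correct the draft."""
    # Stage 1: the specialized model proposes a sequence of assertions.
    draft = fine_tuned_llm(focal_method + "\n" + test_prefix)
    # Stage 2: the base model revises the draft with the extra context.
    corrected = base_llm(
        build_correction_prompt(focal_method, test_prefix, context, draft)
    )
    # Keep only lines that look like assert statements.
    lines = (line.strip() for line in corrected.splitlines())
    return [line for line in lines if line.startswith("assert")]
```

Passing the models in as plain callables keeps the sketch independent of any particular inference library; in practice each callable would wrap a real model call.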

External IDs: dblp:journals/ese/LiuWXW25