LLMs and Prompting for Unit Test Generation: A Large-Scale Evaluation

Published: 01 Jan 2024 · Last Modified: 17 Apr 2025 · ASE 2024 · CC BY-SA 4.0
Abstract: Unit testing is essential for identifying bugs, yet it is often neglected due to time constraints. Automated test generation tools exist, but their output typically lacks readability and requires developer intervention. Large Language Models (LLMs) such as GPT and Mistral show potential for test generation, but their effectiveness remains unclear. This study evaluates four LLMs and five prompt engineering techniques, analyzing 216,300 generated tests for 690 Java classes drawn from diverse datasets. We assess correctness, readability, coverage, and bug detection, comparing LLM-generated tests to those produced by EvoSuite. While LLMs show promise, improvements in correctness are needed. The study highlights both the strengths and limitations of LLMs, offering insights for future research.