Abstract: Software defect prediction is a critical precursor task to software defect detection. In recent years, most research efforts have focused on leveraging static code metrics for this task, yet such approaches struggle to generalize across projects because they lack code semantic features. While emerging studies recognize the importance of code semantics, high-quality open-source datasets remain scarce because large-scale manual annotation is prohibitively expensive. Given the remarkable capabilities demonstrated by Large Language Models (LLMs) such as GPT in data synthesis tasks, we propose leveraging LLMs for automated software defect data synthesis and partially open-sourcing the generated datasets. Our methodology adopts the Common Weakness Enumeration (CWE) as the defect taxonomy standard, designs structured prompts grounded in software engineering and defect detection principles for data sampling and labeling, and systematically analyzes both model-specific synthesis limitations and dataset quality. The experimental results reveal intriguing insights that offer new perspectives for automated software defect annotation research. (For dataset access inquiries, please contact us via email.)
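To make the described workflow concrete, the sketch below illustrates what a structured, CWE-grounded labeling prompt and a single LLM labeling call might look like. It is a minimal illustration under our own assumptions: the CWE subset, the prompt fields, and the `call_llm` helper are hypothetical stand-ins, not the paper's actual prompts or pipeline.

```python
import json

# Illustrative subset of CWE categories used as the defect taxonomy (hypothetical choice).
CWE_SUBSET = {
    "CWE-89": "SQL Injection",
    "CWE-476": "NULL Pointer Dereference",
    "CWE-787": "Out-of-bounds Write",
}

# Structured prompt template: task description, candidate CWE list, and a strict
# JSON output contract so the label can be parsed automatically.
PROMPT_TEMPLATE = """You are a software defect annotator.
Decide whether the code snippet below contains a defect.
If it does, assign the single most fitting CWE ID from this list:
{cwe_list}

Return strict JSON: {{"defective": true or false, "cwe_id": "<ID or null>", "rationale": "<one sentence>"}}

Code ({language}):
{code}
"""

def build_labeling_prompt(code: str, language: str) -> str:
    """Fill the structured template with the snippet and the candidate CWE list."""
    cwe_list = "\n".join(f"- {cid}: {name}" for cid, name in CWE_SUBSET.items())
    return PROMPT_TEMPLATE.format(cwe_list=cwe_list, language=language, code=code)

def label_snippet(code: str, language: str, call_llm) -> dict:
    """call_llm is any user-supplied function mapping a prompt string to the model's
    text reply (e.g., a wrapper around a GPT API client)."""
    reply = call_llm(build_labeling_prompt(code, language))
    # In practice the caller should validate and retry on malformed JSON.
    return json.loads(reply)
```

In this sketch, keeping the output format as strict JSON is what allows the synthesized labels to be collected into a dataset without manual post-processing; how closely this mirrors the actual study's prompts is not specified here.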
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: Chinese, Java, JS, Python, C++
Submission Number: 4087