A Comprehensive Fine-Grained Evaluation of LLMs in Data Race Detection

18 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Data race detection, LLM benchmarking, Evaluation
TL;DR: This paper introduces DRDBench, a benchmark, and FineEval-Race, a framework for fine-grained LLM evaluation in data race detection.
Abstract: Data races are a major cause of concurrency-related bugs and have long posed a critical challenge in software engineering. Recent advancements in large language models (LLMs) have inspired researchers to investigate the potential of LLMs for detecting data races. However, the effectiveness of LLMs in this domain remains largely unexplored, primarily because existing benchmarks use a coarse-grained, program-level evaluation methodology. This article introduces DRDBench, a novel benchmark, together with FineEval-Race, an evaluation framework that assesses the race detection capabilities of LLMs at the fine-grained level of individual data races. DRDBench consists of 1,003 real-world and handcrafted pthreads-based programs, encompassing 549 data races in 226 programs, each annotated with precise line-level race locations. Leveraging this detailed location information, FineEval-Race establishes fine-grained correspondences between model outputs and ground truth at the level of individual data races, enabling a nuanced evaluation. Building on these correspondences, FineEval-Race further evaluates model performance under three different response aggregation strategies to probe the boundaries of model capability. This methodology not only quantifies LLMs' direct utility in race detection but also provides insights into their genuine understanding of concurrency. We evaluated 25 popular open-source LLMs on DRDBench with FineEval-Race. The results revealed considerable variation in model performance, with DRDBench presenting a significant challenge to many models. The top-performing reasoning and non-reasoning models, DeepSeek-R1 and DeepSeek-V3, achieved recall of 75.23% and 55.19%, and precision of 75.36% and 54.69%, respectively. These evaluations yield actionable insights.
Furthermore, we identify two failure modes shared across models that cause up to 92% and 98% performance degradation on DeepSeek-R1 and DeepSeek-V3, respectively. We believe that DRDBench and FineEval-Race, coupled with our identified actionable insights and failure modes, will serve as crucial guidance for applying LLMs to race detection and inspire future model training efforts to enhance their comprehension of concurrency.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12771