A Comprehensive Fine-Grained Evaluation of LLMs in Data Race Detection

18 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Data race detection, LLM benchmarking, Evaluation
TL;DR: This paper introduces DRDBench, a benchmark, and FineEval-Race, a framework for fine-grained LLM evaluation in data race detection.
Abstract: Data races are a major cause of concurrency-related bugs and have long posed a critical challenge in software engineering. Recent advancements in large language models (LLMs) have inspired researchers to investigate the potential of LLMs for detecting data races. However, the effectiveness of LLMs in this domain remains largely unexplored, primarily because existing benchmarks use a coarse-grained, program-level evaluation methodology. This article introduces DRDBench, a novel benchmark, together with FineEval-Race, an evaluation framework that assesses the race detection capabilities of LLMs at the fine-grained level of individual data races. DRDBench consists of 1,003 real-world and handcrafted pthreads-based programs, encompassing 549 data races in 226 programs, each annotated with precise line-level race locations. Leveraging this detailed location information, FineEval-Race establishes fine-grained correspondences between model outputs and ground truth at the level of individual data races, enabling a nuanced evaluation. Building on these correspondences, FineEval-Race further evaluates model performance under three different response aggregation strategies to probe the boundaries of model capability. This methodology not only quantifies LLMs' direct utility in race detection but also provides insights into their genuine understanding of concurrency. We evaluated 25 popular open-source LLMs on DRDBench with FineEval-Race. The results revealed considerable variation in model performance, with DRDBench presenting a significant challenge to many models. The top-performing reasoning and non-reasoning models, DeepSeek-R1 and DeepSeek-V3, achieved recall of 75.23% and 55.19%, and precision of 75.36% and 54.69%, respectively. These evaluations yield actionable insights.
Furthermore, we identify two failure modes shared across models that cause up to 92% and 98% performance degradation on DeepSeek-R1 and DeepSeek-V3, respectively. We believe that DRDBench and FineEval-Race, coupled with our identified actionable insights and failure modes, will serve as crucial guidance for applying LLMs to race detection and inspire future model training efforts to enhance their comprehension of concurrency.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12771