LLM4FL: Multi-Agent Repository-Level Software Fault Localization via Graph-Based Retrieval and Iterative Refinement

TMLR Paper 5637 Authors

14 Aug 2025 (modified: 15 Aug 2025). Under review for TMLR. License: CC BY 4.0
Abstract: Locating and fixing software faults is a time-consuming and resource-intensive task in software development. Traditional fault localization methods, such as Spectrum-Based Fault Localization (SBFL), rely on statistical analysis of test coverage data but often lack accuracy. Learning-based techniques are more effective but require large training datasets and can be computationally intensive. Recent advancements in Large Language Models (LLMs) have shown potential for improving fault localization by enhancing code comprehension and reasoning. LLMs are typically pretrained and can be leveraged for fault localization without additional training. However, these LLM-based techniques face challenges, including token limitations, performance degradation with long inputs, and difficulties managing large-scale projects with complex, interacting components. We introduce LLM4FL, a multi-LLM-agent-based fault localization approach that addresses these challenges. LLM4FL utilizes three agents. First, the Context Extraction Agent uses an order-aware division strategy to split extensive coverage data into small groups that fit within the LLM's token limit, analyzes each group to identify the failure reason, and prioritizes failure-related methods. The prioritized methods are sent to the Debugger Agent, which uses graph-based retrieval to identify failure reasons and rank suspicious methods in the codebase. Finally, the Reviewer Agent re-evaluates and re-ranks the buggy methods using verbal reinforcement learning and self-criticism. Evaluated on the Defects4J (V2.0.0) benchmark of 675 faults from 14 Java projects, LLM4FL outperforms AutoFL by 18.55% in Top-1 accuracy and surpasses supervised methods such as DeepFL and Grace, all without task-specific training. Coverage splitting and prompt chaining further improve performance, boosting Top-1 accuracy by up to 22%.
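The "order-aware division strategy" mentioned in the abstract can be illustrated with a small sketch: covered methods are kept in their original coverage order and packed greedily into consecutive groups, each staying within a token budget. All names below, and the character-based token heuristic, are illustrative assumptions, not the paper's actual implementation.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~1 token per 4 characters (a common heuristic;
    a real system would use the model's tokenizer)."""
    return max(1, len(text) // 4)


def order_aware_division(methods: list[str], token_budget: int) -> list[list[str]]:
    """Split an ordered list of method snippets into consecutive groups,
    each fitting within token_budget, preserving the original coverage order."""
    groups: list[list[str]] = []
    current: list[str] = []
    used = 0
    for snippet in methods:
        cost = estimate_tokens(snippet)
        # Start a new group when adding this snippet would exceed the budget.
        if current and used + cost > token_budget:
            groups.append(current)
            current, used = [], 0
        current.append(snippet)
        used += cost
    if current:
        groups.append(current)
    return groups
```

Each resulting group can then be analyzed in a separate LLM call, with the agent's per-group findings merged downstream; because the division preserves coverage order, methods executed close together stay in the same prompt where possible.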
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Greg_Durrett1
Submission Number: 5637