Keywords: Automated Code Review, Multi-Agent Systems, Large Language Models for Code, Repository-Level Code Understanding, NLP for Software Engineering, Benchmarking, Static Analysis, Code Intelligence, Agentic Frameworks, Program Analysis, Pull Request Analysis, Tool-Augmented LLMs, Software Maintenance, Code Quality, Dataset Creation, Multilingual Dataset
Abstract: As Large Language Models make code writing faster than ever, the bottleneck has shifted to the critical task of reviewing code before merge: ensuring correctness, security, and compliance with project guidelines. We introduce $\textbf{Magistrate}$, an automated code review framework designed to augment human reviewers by detecting complex, cross-file logic errors that single-pass analysis misses. Unlike existing tools that analyze isolated diffs, Magistrate employs a hierarchical multi-agent architecture: a $\textit{Delegator}$ partitions changed files into semantically coherent batches, while parallel $\textit{IssueDetector}$ agents combine static analysis (ast-grep) with LLM-based semantic reasoning over the full repository context. We also present $\textbf{Magistrate-Bench}$, a benchmark of 2,042 Pull Requests across 12 programming languages, with complete repository workspaces cached at evaluation time to enable realistic dependency tracing. Evaluated on a 108-PR subset across four frontier models (Gemini 3 Flash, Devstral, Minimax M2, and Grok 4.1 Fast), Magistrate consistently improves F1 scores by 2.2--5.5$\times$ over single-shot baselines, with the best model (Gemini 3 Flash) achieving an F1 of 0.214 and a hallucination rate of just 0.6\%. Across all models, Magistrate identified $\textbf{997 valid issues}$ that human reviewers had overlooked, demonstrating its utility as a complementary filter for objective correctness issues.
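The abstract's hierarchical pipeline (a Delegator that batches changed files, then parallel IssueDetector agents combining static and semantic checks) can be sketched minimally. All names and the directory-based batching heuristic below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of Magistrate's hierarchical review flow as described in
# the abstract. Delegator/IssueDetector behavior here is a stand-in: real
# batching would be semantic, and the semantic pass would call an LLM.
from collections import defaultdict

def delegate(changed_files, batch_key=lambda path: path.rsplit("/", 1)[0]):
    """Partition changed files into coherent batches.
    A simple directory-based grouping serves as the heuristic here."""
    batches = defaultdict(list)
    for path in changed_files:
        batches[batch_key(path)].append(path)
    return list(batches.values())

def issue_detector(batch, repo_context):
    """Stand-in for one IssueDetector agent: merge static-analysis findings
    (e.g. ast-grep matches) with repository-aware semantic findings."""
    static_findings = [(f, "static-check") for f in batch]
    semantic_findings = [(f, "semantic-check") for f in batch
                         if f in repo_context]  # mocked LLM reasoning
    return static_findings + semantic_findings

def review(changed_files, repo_context):
    issues = []
    for batch in delegate(changed_files):  # batches could run in parallel
        issues.extend(issue_detector(batch, repo_context))
    return issues

files = ["src/auth/login.py", "src/auth/token.py", "tests/test_login.py"]
issues = review(files, repo_context=set(files))
print(len(issues))
```

In this toy run each file yields one static and one semantic finding, so six issues are reported; the point is only the control flow, batch then detect, not the detection logic itself.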
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Natural Language Processing Applications, Code Generation and Understanding, Evaluation Methodologies, Resources and Corpora
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 9638