Abstract: Code serves as the primary evidence behind computational publications, yet a detailed review of an unfamiliar codebase often imposes a prohibitive time burden on volunteer reviewers. Consequently, self-reported reproducibility checklists at major machine learning venues face little empirical verification, leaving code as a significant blind spot in peer review. To bridge this gap, we introduce AuditOwl, an autonomous, verification-centric LLM pipeline designed to make code auditing feasible for authors before submission and for reviewers after it. With this framework, we audit 100 randomly sampled empirical papers from the NeurIPS 2025 main track. For each paper, an LLM agent inspects the repository and evaluates the paper's scientific claims against the evidence in the underlying code. Following an independent adversarial verification pass to maximize precision, the system raises 605 discrepancies (averaging 6.1 per paper). These findings form a graded evidence trail: each is quote-anchored and cites specific code locations, and about half are backed by executable verification checks that the agent implements. On a manually validated subset of findings, we measure an error rate of 6.2%. Our approach reveals a steep reproducibility funnel: only 87% of papers in our sample release code at all, and for just 9% do all analyzed published findings trace cleanly to the repository. The discrepancies we find are dominated by incomplete code and by mismatches between what the paper describes and what the code does. Technical bugs and serious methodological issues are also prevalent. Operating at reasonable cost, agentic code auditing can augment human peer review and help make computational science more reproducible. Code available at: https://github.com/anonym0wl/AuditOwl.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 9513