Track: Track 1: Original Research/Position/Education/Attention Track
Abstract: Automated AI research shows great promise for accelerating scientific discovery, but ensuring the integrity of AI-generated research remains a critical challenge. In this work, we introduce FabScore, a new framework for fine-grained evaluation of fabrications in automated AI research. Given a research paper and its associated code, FabScore extracts numerical results and figure labels as individual claims, employs a coding agent to evaluate each claim through static analysis and code execution, and ultimately assigns one of six verdict categories covering fabricated, reproducible, and unverifiable outcomes. Human evaluation confirms that FabScore achieves a high precision of 98.6% in detecting fabrications. Applying FabScore to 144 papers from various sources, we find the overall claim-level fabrication rate to be 21.2%. Notably, over 70% of AI-authored real conference submissions contain fabrications, with accepted submissions still reaching a paper-level fabrication rate of 59.3%. Experiment fabrication is the most prevalent type, indicating that AI research systems often struggle to correctly implement the experiments described in the paper. Finally, over 85% of FabScore-detected fabrications are missed by AI reviewers, suggesting that our framework can serve as a valuable complementary tool to existing AI review processes.
Keywords: Fabrication Evaluation, AI Scientist, Automated Research
Submission Number: 340