The NeurIPS 2024 Checklist Assistant: Pioneering the Use of LLMs in Scientific Peer Review

10 Dec 2024 | OpenReview News Article | CC BY 4.0

In 2024, NeurIPS conducted an experiment exploring how large language models (LLMs) could improve the quality and compliance of scientific submissions. The NeurIPS 2024 Checklist Assistant, developed by Alexander Goldberg, Ihsan Ullah, and collaborators, represented one of the first large-scale, structured evaluations of an AI system designed to aid the peer-review process in a real-world academic setting.

The assistant was an LLM-based tool that helped authors verify their papers against the NeurIPS author checklist—a set of standards ensuring reproducibility, transparency, and ethical rigor. By analyzing authors’ responses and offering targeted feedback, the tool aimed to help researchers refine their submissions before review.
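The paper describes the assistant's behavior rather than publishing its implementation. As a rough illustration of the kind of workflow described above, a minimal sketch might look like the following. This is an assumption for illustration only: the OpenAI Python client, the function review_checklist_item, the prompt wording, and the model choice are hypothetical and are not the actual NeurIPS system.

    # Illustrative sketch only -- NOT the authors' implementation.
    # Shows how an LLM might be asked to give targeted feedback on one
    # checklist item, given the author's answer and a paper excerpt.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def review_checklist_item(question: str, author_answer: str, paper_excerpt: str) -> str:
        """Ask the LLM for actionable feedback on a single checklist question."""
        prompt = (
            "You are assisting authors with the NeurIPS paper checklist.\n"
            f"Checklist question: {question}\n"
            f"Author's answer and justification: {author_answer}\n"
            f"Relevant paper excerpt: {paper_excerpt}\n"
            "List concrete, actionable suggestions to improve compliance, "
            "or state that the item appears adequately addressed."
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # hypothetical model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Example usage with a paraphrased checklist question:
    feedback = review_checklist_item(
        question="Did you describe the compute resources needed to reproduce the experiments?",
        author_answer="Yes. See Appendix B.",
        paper_excerpt="All models were trained on a single GPU...",
    )
    print(feedback)

In practice, a system like this would run once per checklist question and aggregate the per-item feedback into a report for the authors, which matches the per-question suggestions the surveyed authors describe below.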

Crucially, this tool was accessible only to authors, not reviewers. This design ensured fairness and focused the study on understanding how AI could support authors without influencing formal peer evaluation.

The experiment involved 234 voluntarily submitted papers and a series of pre- and post-use surveys. The responses indicated a positive impact and strong author engagement: over 70% of authors found the assistant useful and said they would revise their papers based on its feedback. Authors reported that the assistant offered granular, actionable feedback, typically providing 4–6 distinct suggestions per checklist question. Many participants expanded or clarified their checklist justifications, suggesting that LLM feedback prompted deeper reflection on research practices and documentation quality. These results indicate that an LLM assistant can encourage self-assessment and help authors reach higher standards of authorship, transparency, and reproducibility.

However, authors also reported some challenges. Roughly 20 of 52 respondents mentioned inaccuracies, and 14 found the LLM to be too strict in its requirements. The study also demonstrated that the system could be manipulated by adversarial rewording, emphasizing the importance of safeguards in future AI-driven review tools. The researchers noted that while causal attribution is complex, qualitative evidence showed that the assistant meaningfully helped improve submissions. This finding reinforced the idea that AI can enhance—but not replace—human expertise in scholarly evaluation.

The NeurIPS 2024 Checklist Assistant was a precursor experiment showing that, while automation cannot replace critical judgment, it can meaningfully elevate research quality, reduce oversight errors, and promote reflective authorship. As AI continues to integrate into scholarly workflows, the Checklist Assistant stands as an early example of how human-AI collaboration can strengthen scientific integrity and efficiency.

More details and the full paper are available here: https://blog.neurips.cc/2024/12/10/results-of-the-neurips-2024-experiment-on-the-usefulness-of-llms-as-an-author-checklist-assistant-for-scientific-papers/
