Benchmarking LLMs for Automatic Responsible Checklist Generation

ACL ARR 2025 May Submission7632 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In recent years, several major conferences have adopted checklists to support author submissions. These checklists serve a twofold purpose. First, they act as a self-assessment tool for authors, offering guidance on how to improve the quality of their submissions. Second, reviewers can use them as an aid during the reviewing task. Although useful, completing the checklist is usually time-consuming, as it is done manually. LLMs can provide powerful assistance for this task owing to their capacity to emulate human-like reasoning. This paper presents a study of three different LLMs on the author checklist completion task: GPT-3.5-turbo, DeepSeek-R1, and Llama-3. The results show that, while LLMs can accurately answer some checklist points and simulate human responses, a significant gap remains between the responses provided by authors and those generated by LLMs. Moreover, the experiments reveal discrepancies between the results of the different models, which are especially noticeable in smaller LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Generation, Human-Centered NLP, NLP Applications, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Reproduction study, Surveys
Languages Studied: English
Submission Number: 7632