ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

TMLR Paper 3999 Authors

17 Jan 2025 (modified: 12 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Language Models (LMs) are playing an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating apparently plausible but false information, including invented citations and nonexistent research papers. Such inaccuracies are dangerous in domains that demand high factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use arXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
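The paper itself defines the two tasks in detail; as a rough, hypothetical illustration of what a Jumbled Titles probe could look like (the shuffling scheme, prompt wording, and scoring below are assumptions for illustration, not the paper's exact method):

```python
import random

def jumble_title(title: str, seed: int = 0) -> str:
    """Shuffle the words of a real arXiv title to create a jumbled variant."""
    words = title.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

def build_prompt(jumbled: str) -> str:
    # Hypothetical prompt; the paper's actual wording may differ.
    return (
        "The following is a scrambled title of a real arXiv paper: "
        f'"{jumbled}". What is the original title, and does this paper exist?'
    )

title = "Attention Is All You Need"
prompt = build_prompt(jumble_title(title))
# The model's answer would then be checked against the true arXiv record;
# a confident answer naming a nonexistent paper counts as a hallucination.
```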
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- **Added FactCC as an Additional Quantitative Measure**: Incorporated FactCC scores to evaluate the factual consistency of generated outputs for the Jumbled Titles task (see the sketch after this list).
- **Improved Paper Structure**: Reorganized images to enhance the flow and readability of the paper.
- **Clarified Motivations and Definitions**: Provided clearer explanations and definitions for key terms and concepts discussed in the paper.
- **Added Limitations**: Added limitations of our work with respect to human evaluation and data contamination.
- **Future Work**:
  - **Exploration of RAG**: Introduced a future work section focusing on the exploration of Retrieval-Augmented Generation (RAG) within the evaluation pipeline.
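FactCC (Kryściński et al., 2020) is a BERT-based binary classifier over (source, claim) pairs. A minimal sketch of how such scoring could be wired up with Hugging Face transformers is below; the checkpoint name and label order are assumptions (a community port), and the paper does not specify which implementation was used:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Assumed community checkpoint of FactCC; verify against the checkpoint's
# model card before relying on it.
model_name = "manueldeprada/FactCC"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

source = "Text of the original arXiv abstract or paper record."
claim = "The model-generated output to be checked against the source."

# Encode the pair; truncate the (longer) source if the pair exceeds 512 tokens.
inputs = tokenizer(source, claim, truncation="only_first",
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Label order assumed to follow the original FactCC setup
# (0 = CORRECT, 1 = INCORRECT); check the checkpoint's config.
p_correct = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"Factual consistency score: {p_correct:.3f}")
```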
Assigned Action Editor: ~Greg_Durrett1
Submission Number: 3999