YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

ACL ARR 2025 February Submission 7988 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Large Language Models (LLMs) drive scientific question answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters the robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
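For illustration only, below is a minimal sketch of what a rubric-based LLM-as-a-judge scorer of the kind described in the abstract could look like. The rubric names, the prompt wording, and the `call_llm` hook are hypothetical placeholders, not the YESciEval implementation.

```python
# Illustrative sketch (not the YESciEval implementation): prompt a judge model
# with fine-grained rubric criteria and parse its JSON verdict.
import json
from typing import Callable, Dict

# Hypothetical rubric dimensions; the paper's actual rubrics may differ.
RUBRIC = ["correctness", "completeness", "clarity", "source_grounding"]

PROMPT_TEMPLATE = """You are a strict scientific reviewer.
Score the answer to the question on each criterion from 1 (poor) to 5 (excellent).
Return only a JSON object mapping each criterion name to an integer score.

Criteria: {criteria}

Question:
{question}

Answer:
{answer}
"""

def judge_answer(
    question: str,
    answer: str,
    call_llm: Callable[[str], str],  # wrapper around any chat-completion API
) -> Dict[str, int]:
    """Ask the judge model for per-criterion scores and validate the result."""
    prompt = PROMPT_TEMPLATE.format(
        criteria=", ".join(RUBRIC), question=question, answer=answer
    )
    raw = call_llm(prompt)
    scores = json.loads(raw)
    # Keep only known criteria and clamp to the 1-5 range to guard against
    # malformed or over-optimistic judge outputs.
    return {k: min(5, max(1, int(scores[k]))) for k in RUBRIC if k in scores}

if __name__ == "__main__":
    # Stub judge returning middling scores, for a dry run without any API access.
    fake_llm = lambda prompt: json.dumps({c: 3 for c in RUBRIC})
    print(judge_answer("What causes tides?", "Mostly the Moon's gravity.", fake_llm))
```

In a setup like this, the per-criterion scores from such a judge could serve as the reward signal for the reinforcement-learning step the abstract mentions; that wiring is not shown here.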
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: science question answering, llm-as-a-judge, robust LLMs, LLM plausibility, adversarial tests
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 7988