HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

ACL ARR 2025 July Submission 1297 Authors

29 Jul 2025 (modified: 22 Aug 2025) · ACL ARR 2025 July Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous LLM-as-a-judge frameworks fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind their automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework that first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments, and then employs a checklist-like approach that combines the LLM's assigned scores on each decomposed dimension into an overall score. With only 30 human evaluations, HypoEval achieves state-of-the-art alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and a fine-tuned Llama-3.1-8B-Instruct trained with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
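The checklist-style aggregation and the alignment metrics mentioned in the abstract can be illustrated with a minimal sketch. The dimension names, the uniform weighting, and the `overall_score` helper below are illustrative assumptions rather than the paper's exact procedure; the Spearman and Pearson calls mirror the alignment metrics named above.

```python
# Minimal sketch: combine per-dimension LLM scores into an overall score
# (checklist-style aggregation), then measure alignment with human scores.
# Dimension names, weights, and the averaging rule are assumptions for
# illustration, not HypoEval's exact procedure.
from scipy.stats import spearmanr, pearsonr

def overall_score(dimension_scores, weights=None):
    """Combine per-dimension scores into a single overall score."""
    if weights is None:
        # Assumption: uniform weighting over the decomposed dimensions.
        weights = {dim: 1.0 for dim in dimension_scores}
    total = sum(weights.values())
    return sum(weights[d] * s for d, s in dimension_scores.items()) / total

# Hypothetical per-dimension LLM scores for three generated texts.
llm_dimension_scores = [
    {"coherence": 4, "faithfulness": 5, "fluency": 4},
    {"coherence": 2, "faithfulness": 3, "fluency": 4},
    {"coherence": 5, "faithfulness": 4, "fluency": 5},
]
human_scores = [4.3, 2.7, 4.7]  # hypothetical reference human ratings

predicted = [overall_score(s) for s in llm_dimension_scores]
print("Spearman:", spearmanr(predicted, human_scores)[0])  # ranking alignment
print("Pearson:", pearsonr(predicted, human_scores)[0])    # score alignment
```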
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM application, automatic evaluation, human-centered evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: There are no foreseeable risks associated with this work.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix B.3
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix B.3
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: For our experiments, we only used publicly released human evaluation scores, which contain no personal information.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3 and Appendix B
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix B.4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix B
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4 and Appendix B
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 and Appendix B
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We only used AI assistants to polish wording and to help with debugging.
Author Submission Checklist: yes
Submission Number: 1297