HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

ACL ARR 2025 July Submission 1297 Authors

29 Jul 2025 (modified: 22 Aug 2025) · ACL ARR 2025 July Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous LLM-as-a-judge frameworks fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind their automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework that first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments, and then employs a checklist-like approach that combines the LLM's assigned scores on each decomposed dimension into an overall score. With only 30 human evaluations, HypoEval achieves state-of-the-art alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and a fine-tuned Llama-3.1-8B-Instruct trained with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
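The checklist-style aggregation and the alignment metrics mentioned in the abstract can be illustrated with a minimal sketch. The dimension names, the uniform weighting, and the `overall_score` helper below are illustrative assumptions rather than the paper's exact procedure; the Spearman and Pearson calls mirror the alignment metrics named above.

```python
# Minimal sketch: combine per-dimension LLM scores into an overall score
# (checklist-style aggregation), then measure alignment with human scores.
# Dimension names, weights, and the averaging rule are assumptions for
# illustration, not HypoEval's exact procedure.
from scipy.stats import spearmanr, pearsonr

def overall_score(dimension_scores, weights=None):
    """Combine per-dimension scores into a single overall score."""
    if weights is None:
        # Assumption: uniform weighting over the decomposed dimensions.
        weights = {dim: 1.0 for dim in dimension_scores}
    total = sum(weights.values())
    return sum(weights[d] * s for d, s in dimension_scores.items()) / total

# Hypothetical per-dimension LLM scores for three generated texts.
llm_dimension_scores = [
    {"coherence": 4, "faithfulness": 5, "fluency": 4},
    {"coherence": 2, "faithfulness": 3, "fluency": 4},
    {"coherence": 5, "faithfulness": 4, "fluency": 5},
]
human_scores = [4.3, 2.7, 4.7]  # hypothetical reference human ratings

predicted = [overall_score(s) for s in llm_dimension_scores]
print("Spearman:", spearmanr(predicted, human_scores)[0])  # ranking alignment
print("Pearson:", pearsonr(predicted, human_scores)[0])    # score alignment
```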
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM application, automatic evaluation, human-centered evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: There are no foreseeable risks associated with this work.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix B.3
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix B.3
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: For our experiments, we only used publicly released human evaluation scores, which contain no personal information.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3 and Appendix B
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix B.4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix B
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4 and Appendix B
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 and Appendix B
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We only used AI assistants to polish wording and to help with debugging.
Author Submission Checklist: yes
Submission Number: 1297