Abstract: As Large Language Models (LLMs) are increasingly used for question answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Search-AuGmented Evaluation (SAGE), a framework for assessing LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare outputs to static references or rely solely on an LLM-as-a-judge's internal knowledge, SAGE acts as an agent that actively retrieves and synthesizes external evidence: it iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By reducing dependence on static, reference-driven evaluation protocols, SAGE offers a scalable and adaptive alternative for evaluating the factuality of LLM outputs. Experimental results on multiple free-form QA benchmarks show that SAGE achieves substantial to perfect agreement with human evaluations.
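The query-retrieve-summarize-reflect loop described in the abstract can be sketched roughly as follows; this is a minimal illustration only, and every function name (generate_query, web_search, summarize, reflect, judge) is a hypothetical placeholder rather than the paper's actual implementation.

```python
# Illustrative sketch of an iterative, search-augmented evaluation loop.
# All callables below are placeholders supplied by the caller, not SAGE's real components.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvidenceState:
    question: str
    candidate_answer: str
    notes: List[str] = field(default_factory=list)  # accumulated evidence summaries


def search_augmented_judge(
    question: str,
    candidate_answer: str,
    generate_query: Callable[[EvidenceState], str],
    web_search: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    reflect: Callable[[EvidenceState], bool],
    judge: Callable[[EvidenceState], bool],
    max_iterations: int = 3,
) -> bool:
    """Assumed loop: generate query -> retrieve -> summarize -> reflect, then judge."""
    state = EvidenceState(question, candidate_answer)
    for _ in range(max_iterations):
        query = generate_query(state)                     # propose a web query from current evidence
        documents = web_search(query)                     # collect raw snippets for that query
        state.notes.append(summarize(query, documents))   # condense the findings
        if reflect(state):                                # stop early if evidence seems sufficient
            break
    return judge(state)                                   # final factuality verdict (True = supported)
```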
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation, LLM/AI agents, evaluation methodologies, factuality, human evaluation, a
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 12
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sections 2 and 3. We cite the creators of the artifacts we use, such as the organizations that released the LLMs.
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 2. We used all artifacts in accordance with their intended research use, and our derived resources are likewise intended solely for research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B4 Elaboration: We used only publicly available QA datasets that do not contain personally identifying information or offensive content, and no additional data was collected.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Sections 2 and 3. Relevant dataset statistics, including the number of examples and evaluation splits, are reported.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: The experimental setup is described in Section 3.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4. We performed ablation experiments, such as the effect of the number of iterations, and report the corresponding descriptive statistics.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3. We use LLMs together with the EM and F1 metrics and clearly describe their settings and use.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Sections 2 and 8.3.3
D2 Recruitment And Payment: Yes
D2 Elaboration: Section 3 and Section 8.3.3
D3 Data Consent: No
D3 Elaboration: We did not collect any personal data, and all human annotations were performed by volunteer lab members on publicly available datasets without involving personal or sensitive information.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
D5 Elaboration: Annotations were performed by volunteer lab members; we did not collect demographic or other personal characteristics, and the annotated data consists of publicly available datasets without personal or sensitive information.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 894