Abstract: As Large Language Models (LLMs) are increasingly used for question answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Search-AuGmented Evaluation (SAGE), a framework for assessing LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare outputs to static references or rely solely on an LLM-as-a-judge's internal knowledge, SAGE acts as an agent that actively retrieves and synthesizes external evidence: it iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By reducing dependence on static, reference-driven evaluation protocols, SAGE offers a scalable and adaptive alternative for evaluating the factuality of LLM outputs. Experimental results on multiple free-form QA benchmarks show that SAGE achieves substantial to perfect agreement with human evaluations.
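The query-retrieve-summarize-reflect loop described in the abstract can be sketched roughly as follows; this is a minimal illustration only, and every function name (generate_query, web_search, summarize, reflect, judge) is a hypothetical placeholder rather than the paper's actual implementation.

```python
# Illustrative sketch of an iterative, search-augmented evaluation loop.
# All callables below are placeholders supplied by the caller, not SAGE's real components.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvidenceState:
    question: str
    candidate_answer: str
    notes: List[str] = field(default_factory=list)  # accumulated evidence summaries


def search_augmented_judge(
    question: str,
    candidate_answer: str,
    generate_query: Callable[[EvidenceState], str],
    web_search: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    reflect: Callable[[EvidenceState], bool],
    judge: Callable[[EvidenceState], bool],
    max_iterations: int = 3,
) -> bool:
    """Assumed loop: generate query -> retrieve -> summarize -> reflect, then judge."""
    state = EvidenceState(question, candidate_answer)
    for _ in range(max_iterations):
        query = generate_query(state)                     # propose a web query from current evidence
        documents = web_search(query)                     # collect raw snippets for that query
        state.notes.append(summarize(query, documents))   # condense the findings
        if reflect(state):                                # stop early if evidence seems sufficient
            break
    return judge(state)                                   # final factuality verdict (True = supported)
```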
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation, LLM/AI agents, evaluation methodologies, factuality, human evaluation, a
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 12
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sections 2 and 3. We cite the creators of the artifacts we use, such as the organizations that released the LLMs.
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 2. We used all artifacts in accordance with their intended research use, and our derived resources are likewise intended solely for research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B4 Elaboration: We used only publicly available QA datasets that do not contain personally identifying information or offensive content, and no additional data was collected.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Sections 2 and 3. Relevant dataset statistics, including the number of examples and evaluation splits, are reported.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: The experimental setup is described in Section 3.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4. We performed ablation experiments, such as the effect of the number of iterations, and report the corresponding descriptive statistics.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3. We use LLMs together with the EM and F1 metrics and clearly describe their settings and use.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Sections 2 and 8.3.3
D2 Recruitment And Payment: Yes
D2 Elaboration: Section 3 and Section 8.3.3
D3 Data Consent: No
D3 Elaboration: We did not collect any personal data, and all human annotations were performed by volunteer lab members on publicly available datasets without involving personal or sensitive information.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
D5 Elaboration: Annotations were performed by volunteer lab members; we did not collect demographic or other personal characteristics, and the annotated data consists of publicly available datasets without personal or sensitive information.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 894