AURA-QG: Automated Unsupervised Replicable Assessment for Question Generation

ACL ARR 2025 July Submission 260 Authors

26 Jul 2025 (modified: 04 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Question Generation (QG) is central to information retrieval, education, and knowledge assessment, yet its progress is bottlenecked by unreliable and non-scalable evaluation practices. Traditional metrics fall short in structured settings like document-grounded QG, and human evaluation, while insightful, remains expensive, inconsistent, and difficult to replicate at scale. We introduce AURA-QG: an Automated, Unsupervised, Replicable Assessment pipeline that scores question sets using only the source document. It captures four orthogonal dimensions, namely answerability, non-redundancy, coverage, and structural entropy, without needing reference questions or relative baselines. Our method is modular, efficient, and agnostic to the question generation strategy. Through extensive experiments across four domains (car manuals, economic surveys, health brochures, and fiction), we demonstrate its robustness across input granularities and prompting paradigms. Chain-of-Thought prompting, which first extracts answer spans and then generates targeted questions, consistently yields higher answerability and coverage, validating the pipeline's fidelity. The metrics also exhibit strong agreement with human judgments, reinforcing their reliability for practical adoption.
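The abstract does not spell out how the four dimensions are computed; as one illustrative reading, the sketch below shows how a structural-entropy score could be derived from the distribution of question types in a generated set. The `QUESTION_TYPES` taxonomy, the `structural_entropy` function, and the max-entropy normalisation are assumptions made for illustration only and are not taken from the paper's actual formulation.

```python
import math
from collections import Counter

# Hypothetical coarse taxonomy of question forms; the paper's definition of
# structural entropy may use a different bucketing.
QUESTION_TYPES = ("what", "why", "how", "when", "where", "who", "which", "other")


def question_type(question: str) -> str:
    """Assign a question to a coarse type based on its leading wh-word."""
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else "other"
    return first if first in QUESTION_TYPES else "other"


def structural_entropy(questions: list[str]) -> float:
    """Shannon entropy of the question-type distribution, normalised to [0, 1].

    Higher values indicate a more varied mix of question forms in the set.
    """
    if not questions:
        return 0.0
    counts = Counter(question_type(q) for q in questions)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(QUESTION_TYPES))  # divide by maximum entropy


if __name__ == "__main__":
    qs = [
        "What is the recommended tyre pressure?",
        "How do you reset the service indicator?",
        "Why does the warning light stay on?",
        "What fluid does the braking system use?",
    ]
    print(f"Structural entropy: {structural_entropy(qs):.3f}")
```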
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Automatic evaluation of datasets, evaluation methodologies, evaluation, metrics, reproducibility
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We only use public brochures.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: It is provided in the README file included with the submitted code.
B6 Statistics For Data: Yes
B6 Elaboration: Section 5.2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.3 and 5.3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4 and 5
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4.3
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix B
D2 Recruitment And Payment: Yes
D2 Elaboration: Appendix B
D3 Data Consent: Yes
D3 Elaboration: Appendix B
D4 Ethics Review Board Approval: No
D4 Elaboration: It is just an evaluation of a question set based on a publicly available passage. There are no ethical concerns.
D5 Characteristics Of Annotators: No
D5 Elaboration: Ethnicity and demographic information are not relevant to this task.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistance was limited to minor language edits and grammar refinement, and did not impact the scientific content, so it was not mentioned in the paper.
Author Submission Checklist: yes
Submission Number: 260