AURA-QG: Automated Unsupervised Replicable Assessment for Question Generation

ACL ARR 2025 July Submission 260 Authors

26 Jul 2025 (modified: 04 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Question Generation (QG) is central to information retrieval, education, and knowledge assessment, yet its progress is bottlenecked by unreliable and non-scalable evaluation practices. Traditional metrics fall short in structured settings like document-grounded QG, and human evaluation, while insightful, remains expensive, inconsistent, and difficult to replicate at scale. We introduce AURA-QG: an Automated, Unsupervised, Replicable Assessment pipeline that scores question sets using only the source document. It captures four orthogonal dimensions, namely answerability, non-redundancy, coverage, and structural entropy, without needing reference questions or relative baselines. Our method is modular, efficient, and agnostic to the question generation strategy. Through extensive experiments across four domains (car manuals, economic surveys, health brochures, and fiction), we demonstrate its robustness across input granularities and prompting paradigms. Chain-of-Thought prompting, which first extracts answer spans and then generates targeted questions, consistently yields higher answerability and coverage, validating the pipeline's fidelity. The metrics also exhibit strong agreement with human judgments, reinforcing their reliability for practical adoption.
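The abstract does not spell out how the four dimensions are computed; as one illustrative reading, the sketch below shows how a structural-entropy score could be derived from the distribution of question types in a generated set. The `QUESTION_TYPES` taxonomy, the `structural_entropy` function, and the max-entropy normalisation are assumptions made for illustration only and are not taken from the paper's actual formulation.

```python
import math
from collections import Counter

# Hypothetical coarse taxonomy of question forms; the paper's definition of
# structural entropy may use a different bucketing.
QUESTION_TYPES = ("what", "why", "how", "when", "where", "who", "which", "other")


def question_type(question: str) -> str:
    """Assign a question to a coarse type based on its leading wh-word."""
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else "other"
    return first if first in QUESTION_TYPES else "other"


def structural_entropy(questions: list[str]) -> float:
    """Shannon entropy of the question-type distribution, normalised to [0, 1].

    Higher values indicate a more varied mix of question forms in the set.
    """
    if not questions:
        return 0.0
    counts = Counter(question_type(q) for q in questions)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(QUESTION_TYPES))  # divide by maximum entropy


if __name__ == "__main__":
    qs = [
        "What is the recommended tyre pressure?",
        "How do you reset the service indicator?",
        "Why does the warning light stay on?",
        "What fluid does the braking system use?",
    ]
    print(f"Structural entropy: {structural_entropy(qs):.3f}")
```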
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Automatic evaluation of datasets, evaluation methodologies, evaluation, metrics, reproducibility
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We only use public brochures.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: It is provided in the README file included with the submitted code.
B6 Statistics For Data: Yes
B6 Elaboration: Section 5.2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.3 and 5.3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4 and 5
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4.3
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix B
D2 Recruitment And Payment: Yes
D2 Elaboration: Appendix B
D3 Data Consent: Yes
D3 Elaboration: Appendix B
D4 Ethics Review Board Approval: No
D4 Elaboration: It is just an evaluation of a question set based on a publicly available passage. There are no ethical concerns.
D5 Characteristics Of Annotators: No
D5 Elaboration: Ethnicity and demographic information are not relevant to this task.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistance was limited to minor language edits and grammar refinement, and did not impact the scientific content, so it was not mentioned in the paper.
Author Submission Checklist: yes
Submission Number: 260