What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Question Generation

ACL ARR 2025 July Submission809 Authors

28 Jul 2025 (modified: 05 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large language models (LLMs) are increasingly used as critical components of knowledge retrieval and agentic systems. These systems can benefit from the knowledge-seeking capability of LLMs, in other words, their curiosity. However, this capability has not been evaluated quantitatively. To bridge this gap, we propose an evaluation framework, CDQG (Curiosity-Driven Question Generation). The CDQG task prompts an LLM to generate questions about a statement introducing scientific knowledge, simulating a curious person encountering the statement for the first time. The CDQG dataset contains 1,988 statements spanning physics, chemistry, and mathematics at distinct difficulty levels, along with general knowledge statements and intentionally erroneous statements. We score the quality of the questions generated by LLMs along multiple dimensions, and validate these scores with rigorous controlled ablation studies and human evaluations. While large models like GPT-4 and Mixtral 8x7B generate highly coherent and relevant questions, the smaller Phi-2 model is equally or more effective, indicating that size alone does not determine a model's knowledge-acquisition potential. CDQG quantifies a critical model capability and opens up research opportunities for developing future knowledge retrieval systems driven by LLMs.
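The abstract describes the CDQG setup only at a high level. The following is a minimal sketch of what such a pipeline might look like, assuming an OpenAI-compatible chat API; the prompt wording, the "relevance" scoring dimension, and the use of GPT-4 as both generator and judge are illustrative assumptions, not the authors' exact protocol.

# Sketch of a CDQG-style pipeline: prompt an LLM to ask curiosity-driven
# questions about a statement, then score each question on one dimension.
# Assumes an OpenAI-compatible chat API (pip install openai); all prompts
# and dimension names are hypothetical stand-ins for the paper's protocol.
from openai import OpenAI

client = OpenAI()

def generate_questions(statement: str, n: int = 5, model: str = "gpt-4") -> list[str]:
    """Ask the model to react to `statement` as a curious first-time reader."""
    prompt = (
        "You are seeing the following statement for the first time:\n\n"
        f"{statement}\n\n"
        f"As a curious person, ask {n} questions you would want answered, one per line."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # Keep non-empty lines, stripping list markers the model may prepend.
    return [line.lstrip("-*0123456789. ").strip() for line in text.splitlines() if line.strip()]

def score_question(question: str, statement: str, dimension: str,
                   judge: str = "gpt-4") -> int:
    """Score one question on one dimension (e.g. relevance, coherence), 1-5."""
    prompt = (
        f"Statement: {statement}\nQuestion: {question}\n"
        f"Rate the question's {dimension} on a 1-5 scale. Reply with the number only."
    )
    resp = client.chat.completions.create(
        model=judge,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())

if __name__ == "__main__":
    stmt = "In a right triangle, a^2 + b^2 = c^2."
    for q in generate_questions(stmt):
        print(score_question(q, stmt, "relevance"), q)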
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Questioning, Curiosity, Evaluation, Science
Contribution Types: Model analysis & interpretability
Languages Studied: English
Previous URL: https://openreview.net/forum?id=i5BN85npOy
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
Justification For Not Keeping Action Editor Or Reviewers: The meta-reviewer did not consider the rebuttal discussion, and the reviewers did not reply to our rebuttals. We addressed the points they raised, but neither the reviewers nor the meta-reviewer took any of this into account. We also reported the meta-reviewer, but no action was taken.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We created the CDQG dataset and the CDQG evaluation framework; the methodology section of the paper details them.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We created the dataset ourselves rather than reusing one from another source.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 3
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The data consists of research papers.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 5 and Appendix
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: We gave the instructions to participants in person, as our annotator pool was small.
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Section 3
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistants were used for coding and for grammar corrections in writing.
Author Submission Checklist: yes
Submission Number: 809