Keywords: Bloom’s Taxonomy, Food AI, Pedagogy AI
Abstract: Open-ended question answering (QA) refers to evaluation tasks in which models must generate responses that integrate domain knowledge, contextual understanding, and pedagogical clarity, rather than simply retrieving fixed facts. Such tasks are especially difficult for large language models (LLMs), and existing benchmarks often rely on exam banks or narrow factual datasets that fail to capture performance across cognitive levels in practice-based domains. We introduce BloomQA, a novel framework for automated benchmark generation from domain guidelines using Bloom’s Taxonomy. BloomQA extracts expert-curated practices, converts them into violation scenarios, and expands them into multiple-choice questions (MCQs) and dialogues scaffolded by the Remember, Understand, Apply, and Analyze levels. Applied to teaching and dietetics, our method produces 20k MCQs and 5k dialogues per domain. Psychometric-informed evaluation shows that BloomQA captures difficulty and discrimination, separates strong from weak models, and identifies potentially problematic question items. Fine-tuning with dialogue data further improves performance, especially at higher Bloom levels. BloomQA provides a principled and extensible framework for benchmarking LLMs on open-ended QA in applied domains.
Primary Area: datasets and benchmarks
Submission Number: 8162