Keywords: Bloom’s Taxonomy, Food AI, Pedagogy AI
Abstract: Open-ended question answering (QA) refers to evaluation tasks in which models must generate responses that integrate domain knowledge, contextual understanding, and pedagogical clarity, rather than simply retrieving fixed facts. Such tasks are especially difficult for large language models (LLMs), and existing benchmarks often rely on exam banks or narrow factual datasets that fail to capture performance across cognitive levels in practice-based domains. We introduce BloomQA, a novel framework for automated benchmark generation from domain guidelines using Bloom’s Taxonomy. BloomQA extracts expert-curated practices, converts them into violation scenarios, and expands them into multiple-choice questions (MCQs) and dialogues scaffolded by the Remember, Understand, Apply, and Analyze levels. Applied to teaching and dietetics, our method produces 20k MCQs and 5k dialogues per domain. Psychometric-informed evaluation shows that BloomQA captures difficulty and discrimination, separates strong from weak models, and identifies potentially problematic question items. Fine-tuning with dialogue data further improves performance, especially at higher Bloom levels. BloomQA provides a principled and extensible framework for benchmarking LLMs on open-ended QA in applied domains.
Primary Area: datasets and benchmarks
Submission Number: 8162