BloomXplain: A Framework and Benchmark Dataset for Pedagogically Sound LLM-Generated Explanations Based on Bloom’s Taxonomy

Maria-Eleni Zoumpoulidi; Eleni Batsi; Georgios Paraskevopoulos; Vassilis Katsouros; Alexandros Potamianos

Published: 24 Sept 2025, Last Modified: 05 Nov 2025. Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster). License: CC BY 4.0

Keywords: Bloom's taxonomy, Benchmark dataset, LLMs, LLM-generated explanations, explainability

TL;DR: BloomXplain is a benchmark dataset and framework that generates and evaluates LLM-based instructional explanations across Bloom’s Taxonomy levels, improving pedagogical soundness and accuracy in educational AI.

Abstract: Large Language Models (LLMs) must generate accurate and pedagogically sound instructional explanations to be deployed effectively in educational applications such as AI tutors and teaching assistants. However, little research has systematically evaluated their performance across varying levels of cognitive complexity. We believe that such an evaluation serves a dual goal: producing more educationally sound and human-aligned outputs, and fostering more robust reasoning that in turn leads to more accurate results. We therefore introduce BloomXplain, a framework designed to generate and assess LLM-generated instructional explanations across Bloom’s Taxonomy levels. We first construct a STEM-focused benchmark dataset of question–answer pairs categorized by Bloom’s cognitive levels, filling a key gap in NLP resources. Using this dataset and widely used benchmarks, we evaluate multiple LLMs with diverse prompting techniques, assessing correctness, alignment with Bloom’s Taxonomy, and pedagogical soundness. Our findings show that BloomXplain not only produces more pedagogically grounded outputs but also achieves accuracy on par with, and sometimes exceeding, existing approaches. This work sheds light on the strengths and limitations of current models and paves the way for more accurate and explainable results.

Submission Number: 51
