Abstract: Knowledge tracers are typically evaluated on the basis of the goodness-of-fit of their underlying student performance models. However, for the purposes of supporting mastery learning, the true measure of a good knowledge tracer is not its goodness-of-fit but the degree to which it optimally selects next problem items. In this context, a knowledge tracer should minimize under-practice to ensure students master learning materials and minimize over-practice to reduce wasted time. Prior work has suggested that fit-statistic-based measures of knowledge tracer quality may misrank the relative quality of knowledge tracers’ item selection. In this work, we evaluate this claim by measuring over- and under-practice directly in synthetic data drawn from ground-truth learning curves. We conduct an experiment with three well-known student performance models: Performance Factor Analysis (PFA), BestLR, and Deep Knowledge Tracing (DKT), and find that in 43% of the synthetic datasets, the models with higher measures of overall predictive performance (e.g., AUC and MSE) were worse than a comparison model with lower predictive performance at minimizing over-practice and under-practice. These results support the hypothesis that overall fit statistics are not a reliable measure of a knowledge tracer’s ability to optimally select next items for students, and bring into question the validity of traditional methods of knowledge tracer comparison.
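The abstract's central measurement can be illustrated with a minimal sketch. Assuming a per-opportunity ground-truth learning curve and a tracer's predicted curve for one student/skill, over-practice is the number of extra items assigned after true mastery and under-practice is the number of items still needed when the tracer stops. The function name, mastery threshold, and curves below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: measuring over- and under-practice against a ground-truth
# learning curve. All names, thresholds, and example values are illustrative.

def practice_gap(true_curve, predicted_curve, mastery=0.95):
    """Compare when a tracer would stop practice vs. the ground truth.

    true_curve, predicted_curve: per-opportunity P(correct) for one
    student/skill, from a ground-truth simulator and a fitted knowledge
    tracer respectively.
    Returns (over_practice, under_practice) in number of opportunities.
    """
    def first_mastered(curve):
        # Index of the first opportunity at which P(correct) crosses
        # the mastery threshold; horizon length if mastery is never reached.
        for i, p in enumerate(curve):
            if p >= mastery:
                return i
        return len(curve)

    true_stop = first_mastered(true_curve)
    pred_stop = first_mastered(predicted_curve)
    over = max(0, pred_stop - true_stop)   # extra items after true mastery
    under = max(0, true_stop - pred_stop)  # items still needed at tracer's stop
    return over, under

# Example: a tracer that overestimates learning stops two items too early.
truth = [0.50, 0.70, 0.85, 0.93, 0.96, 0.98]   # mastered at opportunity 4
tracer = [0.60, 0.80, 0.95, 0.97, 0.98, 0.99]  # tracer stops at opportunity 2
print(practice_gap(truth, tracer))  # → (0, 2): no over-practice, 2 under-practiced
```

A tracer with better AUC on held-out predictions can still misplace this stopping point, which is the misranking the paper's experiment probes.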
External IDs: dblp:conf/aied/RachatasumritWK24