An AI-Powered Evaluation: Understanding which Knowledge Tracing Models Work Best in which Contexts

05 Sept 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0
Keywords: knowledge tracing, student modeling, digital learning, AI-powered research
Abstract: Knowledge tracing (KT) models a learner’s evolving mastery from interaction logs and underpins personalization in tutors, practice systems, and learning analytics. Over three decades, many KT models have been proposed; however, performance varies with the characteristics of the data on which models are trained, so a model that excels in one setting may underperform in another. In this work, conducted by an LLM and conceptualized through human-LLM partnership, we explore this phenomenon by conducting a structured synthesis of 124 KT papers spanning classic probabilistic, generalized logistic/factorization, deep sequence, attention/transformer, graph-based, and LLM-augmented approaches (with each paper proposing one or more new models or variants). For each study, we extract key information, including modeling idea, data setting, and outcomes, then code them along eight key contextual dimensions (data scale; sequence length; structure availability: concept-item relations; temporal irregularity/forgetting cues; modality: binary vs. text/code/dialogue; cohort heterogeneity; cold-start/unseen items; interpretability/operational constraints). We apply a two-stage aggregation: (1) within-paper ranking of models on the authors’ primary metrics, and (2) context-level win rates/median ranks with quality weights favoring student-wise, chronological, and out-of-distribution protocols, with sensitivity checks for robustness. We find attention/transformers lead on large, long-history logs; graph/dynamic-graph KT dominates when reliable (static or evolving) structure is available; Hawkes/spacing-aware methods win when timing and forgetting matter; LLM/semantic KT excels on text/code/dialogue and improves unseen-item generalization; mixture-of-experts helps in heterogeneous cohorts; and generalized logistic/factorization families remain competitive, interpretable choices in data-constrained settings.
We highlight common evaluation pitfalls and synthesize context-dependent patterns across models and datasets, providing practical guidance for context-aware KT model selection.
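The two-stage aggregation described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the authors' actual pipeline: the data layout (`papers` as a list of `(context, results, quality_weight)` tuples) and the convention that higher metric values are better are assumptions, and a "win" is taken to mean rank 1 within a paper.

```python
# Minimal sketch (assumed data shapes) of the paper's two-stage aggregation:
# stage 1 ranks models within each paper; stage 2 computes quality-weighted
# win rates per contextual dimension.
from collections import defaultdict

def within_paper_ranks(results):
    """Stage 1: rank models within one paper by its primary metric
    (higher is better). results: dict model -> metric value.
    Returns dict model -> rank (1 = best)."""
    ordered = sorted(results, key=results.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def context_win_rates(papers):
    """Stage 2: papers is a list of (context, results, quality_weight)
    tuples. Returns dict context -> dict model -> quality-weighted win
    rate, where a 'win' means the model ranked first within its paper."""
    wins = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(lambda: defaultdict(float))
    for context, results, weight in papers:
        ranks = within_paper_ranks(results)
        for model, rank in ranks.items():
            totals[context][model] += weight
            if rank == 1:
                wins[context][model] += weight
    return {c: {m: wins[c][m] / totals[c][m] for m in totals[c]}
            for c in totals}
```

For example, with two hypothetical papers in the "large-scale" context, one weighted 1.0 (where model AKT ranks first) and one weighted 0.5 (where DKT ranks first), AKT's weighted win rate is 1.0 / 1.5 and DKT's is 0.5 / 1.5, reflecting the quality weighting of evaluation protocols.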
Supplementary Material: zip
Submission Number: 87