A Diagnostic Benchmark for Transformer Training Failures: Establishing Baseline Methods and Quantifying the Accuracy-Interpretability Tradeoff
Abstract: Training failures in transformer models waste substantial computational resources and delay
research progress, but diagnostic approaches have never been systematically evaluated. We
establish the first quantitative foundation for automated training diagnostics by providing
a benchmark of 76 reproducible failure scenarios across five categories: memory hardware,
optimization, data pipeline, model software, and ambiguous “unknown” cases. We evaluate
representative diagnostic methods, including simple rule-based heuristics, learned rule-based
classifiers, and local LLM-based diagnostic agents (Mistral, Llama 3) via Ollama. Our
evaluation reveals a fundamental tradeoff: simple rules achieve 57.1% accuracy with full
transparency, while the local LLM agents Mistral and Llama 3 8B reach 77.6% and 73.7%
accuracy, respectively, with detailed natural-language explanations. In contrast, machine
learning ensembles reach 95.7% accuracy on the benchmark, suggesting that the engineered
feature set contains strong diagnostic signal. We mitigate this tradeoff through glassbox
models (Explainable Boosting Machines, EBMs) and a proposed hybrid triage-escalation system. Detailed analysis of “unknown”
cases further clarifies the limits of automated diagnosis, identifying ambiguous signals and
conservative labeling as primary causes for abstention. This work provides the infrastructure
and baselines necessary to transition machine learning debugging from an ad hoc craft to a
systematic, evidence-based science.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
We thank the reviewers and Action Editor for their detailed feedback. We have fully addressed every critical request:
To Reviewer xJ1Q:
1. Added learned rule-based classifier (new subsection 4.3, 89.4% accuracy, exportable as ≤ 25 human-readable rules).
2. Labelled LLM row in Table 1 as “LLM Diagnostic Agents (Coding Agents)” with explanation faithfulness scores.
3. Renamed “Unknown” → “Complex Issues” and added subsection 6.12 with the three root-cause breakdowns you requested (Truly Ambiguous 37.5%, Insufficient Signal 25%, Conservative Labelling 37.5%) plus LLM recovery rates.
To Reviewer bcT4:
1. Added bootstrap 95% CIs everywhere (including the 38.6 pp gap).
2. Included EBM glassbox baseline (95.7% accuracy) and revised the tradeoff narrative (subsection 6.3).
3. Added Temporal Lead-Time Analysis (subsection 6.4 + Figure 4, 55.3% at T-100 steps).
4. Added per-class F1 for Complex Issues, explicit simulated-human limitations box, distributed/large-scale discussion, and feature-leakage subsection.
To Reviewer M175:
1. Added workflow diagram (Figure 1), all formal equations, and granular permutation importance figure.
2. Expanded Limitations with full emerging-architectures subsection (SSMs/Mamba, CNNs, ViT) including transferability tables.
3. Expanded hybrid triage-escalation system (subsection 6.2 + new Figure 8) with calibration details and end-to-end metrics (87.4%).
Assigned Action Editor: ~Zhouxing_Shi1
Submission Number: 6980