A Diagnostic Benchmark for Transformer Training Failures: Establishing Baseline Methods and Quantifying the Accuracy–Interpretability Tradeoff
Abstract: Transformer training failures incur significant costs through wasted computational resources and delayed research progress, yet approaches for diagnosing them have not been systematically evaluated. We establish quantitative baselines and reveal a fundamental trade-off in automated diagnostics: simple rule-based heuristics achieve 57.1% accuracy with full transparency, while machine learning classifiers reach 95.7% accuracy but sacrifice interpretability. This 38.6 percentage point gap quantifies a core tension: methods practitioners can trust and understand perform poorly, while methods that work well offer no insight into their reasoning. Feature importance analysis shows that training dynamics, such as gradient norms and loss trajectories, contribute 48% of the diagnostic signal, suggesting that practitioners should log these metrics more frequently than static configuration parameters. Validated against simulated expert behaviour, our framework handles uncertainty by abstaining on 30.3% of ambiguous cases. This work establishes the first quantitative foundation for automated training diagnostics and identifies hybrid approaches that combine rule-based transparency with machine learning accuracy as a promising direction for closing the interpretability gap. All code, data, and evaluation procedures are released publicly to support reproducible progress in this critical but understudied area.
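As a rough illustration of what a transparent rule-based baseline with abstention might look like, the sketch below applies hand-written thresholds to two training-dynamics signals (gradient norm and loss trend) and abstains when no rule fires. The failure labels, features, rules, and threshold values are hypothetical assumptions for illustration, not the heuristics evaluated in the paper.

```python
# Hypothetical sketch of a transparent rule-based diagnostic with abstention.
# The failure labels, features, and thresholds are illustrative only; they are
# not the rules evaluated in the paper.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    max_grad_norm: float   # largest gradient norm observed during training
    final_loss: float      # loss at the end of the run
    loss_slope: float      # sign of the recent loss trend (+ = rising)
    learning_rate: float   # static configuration parameter


def diagnose(run: RunSummary) -> Optional[str]:
    """Return a human-readable diagnosis, or None to abstain on ambiguous runs."""
    # Rule 1: exploding gradients while the loss rises -> divergence.
    if run.max_grad_norm > 1e3 and run.loss_slope > 0:
        return "divergence: gradient norms exploded while loss was rising"
    # Rule 2: flat, high loss with a tiny learning rate -> stalled training.
    if run.loss_slope == 0 and run.final_loss > 5.0 and run.learning_rate < 1e-5:
        return "stalled training: loss is flat and the learning rate is very small"
    # No rule fires: abstain rather than guess.
    return None


if __name__ == "__main__":
    runs = [
        RunSummary(max_grad_norm=4.2e3, final_loss=9.1, loss_slope=1.0, learning_rate=3e-4),
        RunSummary(max_grad_norm=1.7, final_loss=2.3, loss_slope=-1.0, learning_rate=3e-4),
    ]
    for run in runs:
        print(diagnose(run) or "abstain: no rule matched")
```

Each fired rule doubles as its own explanation, which is the transparency the abstract contrasts with the higher-accuracy but opaque machine learning classifiers.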
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhouxing_Shi1
Submission Number: 6980