Abstract: Adversarial Training (AT) suffers from a critical failure mode known as Catastrophic Overfitting (CO), in which robustness against weak single-step adversaries does not translate into robustness against strong multi-step adversaries. Despite progress in mitigating CO, its underlying mechanisms remain poorly understood. In this work, we address two central questions: (1) Why does CO appear? and (2) What role do the number of Projected Gradient Descent (PGD) steps and the PGD initialization play in CO? Using mathematically tractable models, we reveal a phase transition in the adversarial budget $\epsilon$: above a critical value, non-robust solutions become optimal. Furthermore, we show that CO exists for any well-separated dataset, any number of PGD steps $S$, arbitrarily small $\epsilon$, and randomized initialization. Our insights align with empirical observations in the community and help explain the difficulty of avoiding CO at larger scales. We believe our results deepen the understanding of CO and provide a foundation for developing future-proof solutions.
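For readers unfamiliar with the attack at the center of these questions, the following is a minimal sketch of multi-step $\ell_\infty$ PGD with optional randomized initialization, written in PyTorch. The function name `pgd_attack` and its parameter choices are illustrative assumptions, not taken from the submission; setting `steps=1` with `random_init=False` recovers FGSM, the weak single-step adversary under which CO is typically observed.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, steps, step_size, random_init=True):
    """L-infinity PGD: `steps` gradient-ascent steps on the loss,
    each followed by projection back onto the eps-ball around x.
    With steps=1 and random_init=False this reduces to FGSM."""
    # Optional randomized initialization inside the eps-ball.
    delta = (torch.empty_like(x).uniform_(-eps, eps) if random_init
             else torch.zeros_like(x))
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)           # project onto the eps-ball
    # For image data one would additionally clamp x + delta to the
    # valid input range (e.g. [0, 1]); omitted here for brevity.
    return (x + delta).detach()
```

Under this reading of the abstract, the result is that CO can arise for any fixed `steps`, arbitrarily small `eps`, and with `random_init=True`, rather than being an artifact of the single-step setting alone.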
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Venkatesh_Babu_Radhakrishnan2
Submission Number: 7094