Keywords: Adversarial Robustness, Adversarial Training, Fast Adversarial Training, Catastrophic Overfitting
Abstract: Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from *Catastrophic Overfitting* (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance.
We address this failure mode with two contributions.
First, we identify *Epsilon Overfitting* (EO), a previously overlooked phenomenon in which fixed perturbation magnitudes exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets.
Second, we propose **PertAlign** (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages.
Leveraging these insights, we introduce **SORA**, an adaptive step-size adversarial training method that dynamically adjusts perturbations based on loss-surface geometry.
SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters.
Extensive experiments across diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency.
Code is available at [https://anonymous.4open.science/r/2026_ICLR_SORA](https://anonymous.4open.science/r/2026_ICLR_SORA).
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7319