Keywords: Multi-Grade Deep Learning, theory, practice
TL;DR: This paper provides theoretical analysis and experiments explaining why multi-grade deep learning (MGDL) outperforms single-grade deep learning (SGDL).
Abstract: Multi-grade deep learning (MGDL) has recently emerged as an alternative to standard end-to-end training, referred to here as single-grade deep learning (SGDL), showing strong empirical promise. This work provides both theoretical and experimental evidence of MGDL’s computational advantages. We establish convergence guarantees for gradient descent (GD) applied to MGDL, demonstrating greater robustness to learning-rate choices compared to SGDL. In the case of ReLU activations with single-layer grades, we further show that MGDL reduces to a sequence of convex optimization subproblems. For more general settings, we analyze the eigenvalue distributions of Jacobian matrices from GD iterations, revealing structural properties underlying MGDL’s enhanced stability. Practically, we benchmark MGDL against SGDL on image regression, denoising, and deblurring tasks, as well as on CIFAR-10 and CIFAR-100, covering fully connected networks, CNNs, and transformers. These results establish MGDL as a scalable framework that unites rigorous theoretical guarantees with broad empirical improvements.
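For readers unfamiliar with the training scheme the abstract compares against standard end-to-end (single-grade) training, the following is a minimal, hypothetical sketch of multi-grade training on a toy 1-D regression task. It assumes, as in the MGDL literature, that each grade is a shallow ReLU network trained by gradient descent on the residual left by the frozen earlier grades, with the frozen grade's hidden features feeding the next grade; the network sizes, data, and hyperparameters are illustrative and not the authors' configuration.

```python
# Hypothetical MGDL sketch (not the paper's exact setup): grades are trained
# sequentially; grade g fits the residual left by the frozen grades 0..g-1.
import torch
import torch.nn as nn

def train_grade(model, x, target, lr=1e-2, steps=500):
    # Plain gradient descent (SGD on the full batch) for one grade.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), target)
        loss.backward()
        opt.step()
    return model

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y = torch.sin(torch.pi * x)              # toy regression target

grades, features, residual = [], x, y
for g in range(3):                        # three grades, each a shallow ReLU block
    grade = nn.Sequential(
        nn.Linear(features.shape[1], 32), nn.ReLU(), nn.Linear(32, 1)
    )
    train_grade(grade, features, residual)
    with torch.no_grad():
        pred = grade(features)
        residual = residual - pred        # the next grade fits what remains
        # Hidden features of the frozen grade feed the next grade
        # (one common way the grades are composed in MGDL).
        features = torch.relu(grade[0](features))
    grades.append(grade)

with torch.no_grad():
    approx = y - residual                 # sum of all grade predictions
    print("final MSE:", nn.functional.mse_loss(approx, y).item())
```

In this sketch each grade solves a small, easier optimization problem against a frozen backbone, which is the structural property the abstract's convergence and eigenvalue analyses exploit; in the special case of single-layer ReLU grades, the paper shows each such subproblem is convex.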
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 14653