Keywords: Generalisation, Robustness, Optimisation
Abstract: Are we there yet? It’s hard to say when we don't know where we’re going. The only thing that seems to stay the same in the rapidly evolving field of deep learning is long training runs until test performance saturates. Moreover, it’s not clear when to stop training; practitioners observe extrinsic metrics like the training error, test error, and regularization terms but still might be tempted to stop early lest the deep network (DN) overfit. In this paper, we develop the first intrinsic, analytical, and interpretable characterization of where the deep learning process is headed. The key is to analyze the geometry of the tessellation of the DN input space that is induced by a continuous piecewise-affine approximation to its input-output mapping. Analogous to the Voronoi tiling that underlies K-means clustering, each tile in a DN’s power diagram tiling is parameterized by a centroid vector that equals the sum of the rows of the Jacobian of the DN input-output mapping. Our key result on learning is that a DN first reaches the point of generalization when the training data become aligned (in the sense of maximum cosine similarity) with the centroids of the tiles containing them. The DN then later reaches the point of maximum robustness when the training data become aligned with each row of the (rank-one) Jacobian. Hence, centroid and Jacobian alignment are the destination that learning algorithms aspire to reach. We leverage this new understanding in GrokAlign, a regularisation strategy for DN learning that provably and efficiently induces centroid and Jacobian alignment. Our experiments with convnets and transformers demonstrate that GrokAlign significantly accelerates delayed generalization (so-called "grokking") and improves robustness.
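The centroid-alignment quantity described in the abstract can be made concrete with a toy sketch. The following is not the authors' implementation; it is a minimal illustration, assuming a two-layer ReLU network (whose input-output map is piecewise affine), where the tile Jacobian is computed from the activation pattern, the tile centroid is the sum of the Jacobian's rows, and "alignment" is the cosine similarity between an input and its tile's centroid. All weights and names here are made up for illustration.

```python
# Hypothetical sketch (not the paper's code): for a two-layer ReLU net
# f(x) = W2 @ relu(W1 @ x), the map is piecewise affine. On the tile
# containing x, the Jacobian is J = W2 @ diag(mask) @ W1, where mask is
# the ReLU activation pattern. The tile's centroid is the sum of J's
# rows, and centroid alignment is cos(x, centroid).
import math

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def jacobian(W1, W2, x):
    # Activation pattern of the tile containing x.
    pre = matvec(W1, x)
    mask = [1.0 if p > 0 else 0.0 for p in pre]
    # J[i][j] = sum_k W2[i][k] * mask[k] * W1[k][j]
    return [[sum(W2[i][k] * mask[k] * W1[k][j] for k in range(len(mask)))
             for j in range(len(x))] for i in range(len(W2))]

def centroid(J):
    # Sum of the rows of the Jacobian.
    return [sum(J[i][j] for i in range(len(J))) for j in range(len(J[0]))]

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

# Toy weights and input (illustrative only).
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W2 = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]
x = [1.0, 2.0]

J = jacobian(W1, W2, x)
c = centroid(J)                # centroid of the tile containing x
alignment = cosine(x, c)       # in [-1, 1]; a GrokAlign-style
                               # regulariser would push this toward 1
```

Under this reading, a regularisation term like `1 - alignment` (averaged over the training set) would reward tiles whose centroids point in the same direction as the data they contain, matching the abstract's characterization of the point of generalization.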
Primary Area: optimization
Submission Number: 22932