Escaping Random Teacher Initialization Enhances Signal Propagation and Representation

Published: 07 Nov 2023, Last Modified: 13 Dec 2023 · M3L 2023 Poster
Keywords: teacher-student, self-distillation, loss landscape, phenomenology, training trajectories, representation learning
Abstract: Recent work shows that by mimicking a random teacher network, student networks learn to produce better feature representations, even when they are initialized at the teacher itself. In this paper, we characterize how students escape this global optimum and investigate how this process translates into concrete properties of the representations. To that end, we first describe a simplified setup and identify very large step sizes as the main driver of this phenomenon. Then, we investigate key signal propagation and representation separability properties during the escape. Our analysis reveals a two-stage process: the network first undergoes a form of representational collapse, then steers toward a parameter region that not only allows for better propagation of input signals but also gives rise to well-conditioned representations. These findings might relate to the edge-of-stability phenomenon and label-independent training dynamics.
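To make the setup concrete, the following is a minimal sketch (not the authors' code) of the random-teacher self-distillation configuration the abstract describes: the architecture, sizes, and names are illustrative assumptions. It shows why a student initialized at the teacher starts at a global optimum of the distillation loss, which is the starting point the paper's escape analysis examines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minimal network: a one-hidden-layer ReLU MLP.
def forward(params, x):
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2

# Random teacher: weights drawn at initialization and then frozen.
teacher = [rng.standard_normal((8, 16)) / np.sqrt(8),
           rng.standard_normal((16, 4)) / np.sqrt(16)]

# Student initialized AT the teacher (an exact copy of its weights).
student = [w.copy() for w in teacher]

# Distillation objective: match the teacher's outputs on unlabeled inputs.
x = rng.standard_normal((32, 8))
loss = float(np.mean((forward(student, x) - forward(teacher, x)) ** 2))
print(loss)  # 0.0 — the student starts at a global optimum of this loss
```

Because the loss (and hence the gradient) is exactly zero at this initialization, escaping requires something beyond plain small-step gradient descent; per the abstract, very large step sizes are identified as the main driver.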
Submission Number: 93