Keywords: Understanding high-level properties of models, Diffusion models, Developmental interpretability
TL;DR: We introduce Prior-Guided Drift Diffusion (PGDD), a training-free method that treats neural networks as generative models, revealing that they acquire semantic knowledge well before achieving high classification performance.
Abstract: Interpretability at the neuron level has provided valuable insights into how individual units respond to specific features and patterns. To advance interpretability at the network level, we propose treating networks as generative models and probing their learned statistical priors. We introduce Prior-Guided Drift Diffusion (PGDD), a method that accesses the implicit statistical structure networks acquire during training. PGDD iteratively refines inputs according to the network's learned priors, revealing which patterns emerge from the network's internal statistical knowledge. For adversarially robust networks, it leverages the implicit denoising operators shaped by robust training; for standard networks, our extension uses gradient smoothing to stabilize the generative process. Applying this method during early training reveals that networks appear to acquire rich semantic representations well before achieving reliable classification performance, suggesting a dissociation between internal representation learning and classification ability: networks develop structured knowledge before they can reliably use it. Our training-free approach provides direct access to this latent representational structure in the models we tested.
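As a rough illustration of the loop the abstract describes, the sketch below implements a drift-diffusion procedure in PyTorch: an input is iteratively pushed up the model's class score (drift) with small noise injected at each step (diffusion), and gradients are smoothed by averaging over Gaussian-perturbed copies of the input, as the abstract's extension for standard networks suggests. This is a minimal sketch, not the authors' implementation; the function name, all hyperparameters, the clamping range, and the specific smoothing scheme are assumptions.

```python
# Hypothetical sketch of a PGDD-style drift-diffusion loop.
# All hyperparameters and the gradient-smoothing scheme are illustrative
# assumptions, not the paper's released method.
import torch
import torch.nn.functional as F

def pgdd_sample(model, target_class, shape=(1, 3, 32, 32),
                steps=200, step_size=0.5, noise_scale=0.05,
                smooth_samples=8, smooth_sigma=0.1, device="cpu"):
    """Iteratively refine a noise input so it climbs the model's score
    for `target_class`, adding small noise each step (drift + diffusion)."""
    model.eval()
    x = torch.randn(shape, device=device)  # start from Gaussian noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        # Gradient smoothing (assumed scheme for non-robust networks):
        # average gradients over Gaussian-perturbed copies of x.
        grad = torch.zeros_like(x)
        for _ in range(smooth_samples):
            x_noisy = x + smooth_sigma * torch.randn_like(x)
            logits = model(x_noisy)
            score = F.log_softmax(logits, dim=1)[:, target_class].sum()
            grad += torch.autograd.grad(score, x)[0]
        grad /= smooth_samples
        with torch.no_grad():
            x = x + step_size * grad                   # drift toward the prior
            x = x + noise_scale * torch.randn_like(x)  # diffusion term
            x = x.clamp(-3, 3)  # keep inputs in a plausible range (assumption)
    return x.detach()

# Usage (assuming `my_model` is a trained classifier returning logits):
# img = pgdd_sample(my_model, target_class=3)
```

For an adversarially robust model, the smoothing loop could in principle be dropped (`smooth_samples=1`, `smooth_sigma=0.0`), matching the abstract's claim that robust training already supplies an implicit denoising operator.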
Submission Number: 229