Evolution of CLIP Representations: Texture Bias, Human Alignment, and Noise Robustness
Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: CLIP models demonstrate strong generalization, yet the evolution of their visual representations during training remains poorly understood. We perform an epoch-by-epoch analysis of texture bias, noise robustness, and alignment with human vision across multiple CLIP model sizes. We observe a consistent shift from texture-dominated to more shape-sensitive representations as training progresses, although the models remain more texture-biased than humans. Despite this improvement in shape-texture bias with further training, alignment with human benchmarks for saliency and mid- and low-level perception follows the inverse trajectory, peaking early in training. As training continues, classification accuracy improves while perceptual and attentional alignment gradually decline. We also find a trade-off between perceptual alignment and noise robustness that evolves over training. These results indicate that different training stages favor different properties, and that selecting models solely on final accuracy may overlook stages where representations more closely resemble human visual processing.
Presenter: ~Pablo_Hernández-Cámara1
Submission Number: 21