Evolution of CLIP Representations: Texture Bias, Human Alignment, and Noise Robustness
Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: CLIP models demonstrate strong generalization, yet the evolution of their visual representations during training remains poorly understood. We perform an epoch-by-epoch analysis of texture bias, noise robustness, and alignment with human vision across multiple CLIP model sizes. We observe a consistent shift from texture-dominated to more shape-sensitive representations as training progresses, although the models remain more texture-biased than humans. Despite this improvement in shape-texture bias with further training, alignment with human benchmarks for saliency and mid- and low-level perception follows the inverse trajectory, peaking early in training. As training continues, classification accuracy improves while perceptual and attentional alignment gradually decline. We also find a trade-off between perceptual alignment and noise robustness that evolves over training. These results indicate that different training stages favor different properties, and that selecting models solely on final accuracy may overlook stages where representations more closely resemble human visual processing.
Presenter: ~Pablo_Hernández-Cámara1
Submission Number: 21