Interpreting style–content parsing in vision–language models

Published: 23 Sept 2025, Last Modified: 17 Feb 2026, CogInterp @ NeurIPS 2025 Poster, CC BY 4.0
Keywords: Style–content disentanglement, Vision–language models (VLMs), Vision Transformers (ViTs), Representational similarity analysis (RSA), Linear probes, Shape and texture bias, Robustness to style shifts
Abstract: Style refers to the distinctive manner of expressing content, and humans can both recognize content across stylistic transformations and detect stylistic consistencies across different contents. Prior work has shown that vision–language models (VLMs) exhibit steerable texture–shape biases, with language supervision shifting this tradeoff at the behavioral level. However, the internal representational dynamics of style and content—how they emerge across layers and how language pathways modulate them—remain poorly understood. Here, we adapt neuroscience-inspired tools to dissect style and content representations in a large VLM. We show that vision encoders strongly preserve stylistic signals while progressively enhancing content selectivity, and that language pathways further amplify content representations at the expense of style. Prompting can modestly steer this balance, but content remains dominant in deeper layers. These findings provide systematic evidence of style–content dissociation in multimodal models, guiding the design of architectures that more effectively balance style and content.
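
The neuroscience-inspired tools named in the keywords include representational similarity analysis (RSA) and linear probes. As a rough illustration of the RSA idea only (not the authors' code), the sketch below correlates each layer's representational dissimilarity matrix with idealized "same content" and "same style" model RDMs; `layer_acts`, the label vectors, and the toy data are hypothetical stand-ins for activations you would extract from a VLM yourself.

```python
# Minimal RSA sketch: per-layer content vs. style alignment (assumed inputs).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def model_rdm(labels):
    """Binary model RDM: 0 if two stimuli share the label, 1 otherwise."""
    labels = np.asarray(labels)
    return pdist(labels[:, None], metric=lambda a, b: float(a[0] != b[0]))


def rsa_scores(layer_acts, content_labels, style_labels):
    """Spearman correlation of each layer's RDM with content/style model RDMs.

    layer_acts: dict mapping layer name -> (n_stimuli, n_features) activations.
    Returns dict mapping layer name -> (content_score, style_score).
    """
    content_model = model_rdm(content_labels)
    style_model = model_rdm(style_labels)
    scores = {}
    for name, acts in layer_acts.items():
        # Pairwise dissimilarity between stimuli: 1 - Pearson r of activation patterns.
        rdm = pdist(acts, metric="correlation")
        content_rho, _ = spearmanr(rdm, content_model)
        style_rho, _ = spearmanr(rdm, style_model)
        scores[name] = (content_rho, style_rho)
    return scores


if __name__ == "__main__":
    # Toy example: 8 stimuli = 4 contents x 2 styles, random 16-d "activations".
    rng = np.random.default_rng(0)
    acts = {
        "layer_early": rng.normal(size=(8, 16)),
        "layer_late": rng.normal(size=(8, 16)),
    }
    contents = [0, 0, 1, 1, 2, 2, 3, 3]
    styles = [0, 1, 0, 1, 0, 1, 0, 1]
    print(rsa_scores(acts, contents, styles))
```

Tracking the two scores across layers is one way to visualize the abstract's claim that content alignment grows with depth while style alignment is preserved or suppressed.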
Submission Number: 29