ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Published: 07 May 2025, Last Modified: 29 May 2025 · VisCon 2025 Poster · CC BY 4.0
Keywords: diffusion, interpretability, saliency maps
TL;DR: Create rich saliency maps of text concepts from the representations of diffusion transformers.
Abstract: Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized *concept embeddings*. Our key discovery is that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps than the commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 15 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
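The core idea in the abstract — comparing image patch features against concept embeddings in the attention output space to form per-concept saliency maps — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `concept_saliency`, the array shapes, and the use of random features in place of real DiT attention outputs are all assumptions for demonstration.

```python
import numpy as np

def concept_saliency(patch_outputs, concept_outputs):
    """Hypothetical sketch of ConceptAttention-style saliency.

    patch_outputs:   (N, d) image patch features from a DiT attention
                     layer's output space (assumed precomputed).
    concept_outputs: (K, d) concept embeddings in the same output space.
    Returns (N, K): per-patch softmax distribution over concepts.
    """
    # Linear projection: similarity of each patch to each concept.
    sims = patch_outputs @ concept_outputs.T
    # Softmax across concepts so each patch assigns probability mass.
    e = np.exp(sims - sims.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy example: a 16x16 patch grid with 64-dim features and 3 concepts.
rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 64))   # stand-in for DiT outputs
concepts = rng.standard_normal((3, 64))    # stand-in concept embeddings
saliency = concept_saliency(patches, concepts)   # shape (256, 3)
maps = saliency.T.reshape(3, 16, 16)             # one 16x16 map per concept
```

Reshaping each concept's column back to the spatial patch grid yields the saliency map; in the paper these maps are sharp enough to serve as zero-shot segmentation masks.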
Submission Number: 38