Analysis-by-Proxy: Localization Signals in VLMs Operating as Condition Encoders

Yoav Baron; Sara Dorfman; Roni Paiss; Daniel Cohen-Or; Or Patashnik

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Mech Interp Workshop ICML 2026 Spotlight. CC BY 4.0.

Keywords: Methods (probing, steering, causal interventions), Applications of interpretability

Other Keywords: image editing, diffusion transformers

TL;DR: Analysis-by-Proxy enables efficient interpretability for VLMs used as condition encoders. We find that the localization gap in image editing is a result of how the VLM is used, with spatial signals harder to decode in the final layer.

Abstract: Vision-Language Models (VLMs) are increasingly utilized as the conditioning backbone for diffusion-based image editing due to their remarkable multimodal reasoning capabilities. While standalone VLMs demonstrate strong localization capabilities, editing pipelines frequently struggle to maintain this accuracy, particularly in complex, multi-entity scenes. In this work, we investigate this performance gap, hypothesizing that it stems from treating the VLM as a condition encoder. In this role, the model is restricted to a single forward pass, preventing the autoregressive generation process for which it was optimized, thereby failing to fully expose its capabilities. To investigate whether this spatial understanding persists when the VLM is used as a condition encoder, we introduce Analysis-by-Proxy. In this framework, we train a lightweight, interpretable proxy model on the VLM's intermediate representations using an auxiliary localization task. By analyzing the VLM through this proxy, we uncover the specific VLM representations that encode localization information. Our findings expose a fundamental mismatch between how spatial knowledge is represented within a VLM condition encoder and how it is extracted by current editing pipelines. We reveal that under single-pass constraints, the localization signal does not reliably propagate to the predefined layer configurations commonly used for conditioning. Instead, this crucial signal remains hidden within intermediate representations, at locations that vary depending on the input prompt. Using our introduced Analysis-by-Proxy framework, we reveal the fundamental failures of existing condition extraction strategies in editing pipelines, opening the door to more principled design of conditioning architectures.
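The layer-wise probing idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the features are synthetic rather than real VLM activations, the proxy is a plain ridge-regularised linear probe rather than the authors' proxy model, the localization target is a single scalar coordinate, and the names `make_synthetic`, `fit_probe`, and `probe_all_layers` are hypothetical. The signal is planted in one intermediate layer by construction, so the sketch only shows how such a comparison across layers could be run, not what a real VLM would produce. It also does not model the prompt-dependent variation the abstract describes.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_probe(X, y, ridge=1e-3):
    """Fit a linear probe (with bias) by ridge-regularised least squares."""
    Xb = [row + [1.0] for row in X]
    d = len(Xb[0])
    A = [[sum(r[i] * r[j] for r in Xb) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(d)]
    return solve(A, b)


def r2(w, X, y):
    """Coefficient of determination of the probe on held-out data."""
    pred = [sum(wi * xi for wi, xi in zip(w, row + [1.0])) for row in X]
    mean = sum(y) / len(y)
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, y))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def make_synthetic(n=300, n_layers=4, signal_layer=2, seed=0):
    """Synthetic stand-in for per-layer features (ASSUMPTION, not VLM data).

    Only `signal_layer` encodes the target coordinate; every other layer,
    including the final one, is pure noise.
    """
    rng = random.Random(seed)
    mix = [1.0, -0.5, 0.3, 0.8]  # arbitrary mixing weights
    targets = [rng.random() for _ in range(n)]
    layers = []
    for layer in range(n_layers):
        if layer == signal_layer:
            feats = [[t * m + rng.gauss(0, 0.05) for m in mix] for t in targets]
        else:
            feats = [[rng.gauss(0, 1) for _ in mix] for _ in targets]
        layers.append(feats)
    return layers, targets


def probe_all_layers(layers, targets, n_train=200):
    """Train one probe per layer and score it on the held-out samples."""
    scores = []
    for feats in layers:
        w = fit_probe(feats[:n_train], targets[:n_train])
        scores.append(r2(w, feats[n_train:], targets[n_train:]))
    return scores


layers, targets = make_synthetic()
scores = probe_all_layers(layers, targets)
best_layer = max(range(len(scores)), key=scores.__getitem__)
print("held-out R^2 per layer:", [round(s, 3) for s in scores])
print("most decodable layer:", best_layer)
```

In this toy setup, reading only the final layer would miss the planted signal, while a per-layer sweep recovers it. With real activations, each entry of `layers` would be replaced by hidden states extracted from a single forward pass, and the scalar target by whatever localization label the auxiliary task uses.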

Submission Number: 294
