Keywords: binding, compositionality, linear representation
TL;DR: This paper introduces geometric, functional, and behavioral tools to quantify the extent to which vision models bind features into coherent objects.
Abstract: Existing studies of neural networks have focused largely on $\textit{compositionality}$—whether individual features can be linearly decoded and reused—while overlooking the equally important issue of $\textit{binding}$, i.e., how features are linked together to form coherent objects. This leaves a gap in understanding whether models truly represent feature conjunctions rather than mere unstructured feature bags. We propose a geometric and functional framework for quantifying binding, introducing a binding score based on principal angles between concept subspaces and validating it with linear and non-linear probes. To complement this, we design a behavioral diagnostic dataset in which pairs of images share identical feature bags but differ in how those features are bound into objects. Together, these tools highlight binding as a distinct and measurable dimension of representation, providing a way to diagnose where current vision models succeed—and where they fail—in capturing object structure.
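The abstract's geometric ingredient—principal angles between concept subspaces—can be sketched in a few lines. The sketch below is illustrative only: the subspaces are random stand-ins for learned concept subspaces, and the aggregate score (mean cosine of the principal angles) is a hypothetical choice, not the paper's actual binding score.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)

# Hypothetical setup: two concept subspaces in a d-dimensional feature space,
# each spanned by the columns of a (d, k) matrix. In practice these would be
# estimated from model activations (e.g., via PCA over concept exemplars).
d, k = 64, 4
A = rng.standard_normal((d, k))  # stand-in for one concept subspace
B = rng.standard_normal((d, k))  # stand-in for another concept subspace

# Principal angles between span(A) and span(B), in radians, largest first.
angles = subspace_angles(A, B)

# Illustrative aggregate (an assumption, not the paper's definition):
# mean cosine of the principal angles, so 0 = orthogonal subspaces,
# 1 = identical subspaces.
score = float(np.cos(angles).mean())
```

`scipy.linalg.subspace_angles` orthonormalizes its inputs internally, so the column matrices need not be orthonormal bases.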
Submission Number: 90