Keywords: Image Classification, Patch Reordering, Deep Vision Models
Abstract: Modern vision models, such as Vision Transformers (ViTs), operate by decomposing images into local patches and aggregating their information for recognition.
This process implicitly requires the model to not only identify the correct local features but also to correctly understand how they are spatially composed.
However, this capacity for compositional reasoning is often fragile and biased.
We find that in numerous misclassification cases, the model correctly attends to the right object parts, yet still yields an incorrect prediction.
This paper uncovers a surprising phenomenon: by simply permuting the arrangement of these local patches—thereby preserving local features but destroying their spatial composition—we can consistently correct these misclassifications.
We propose that this reveals the existence of "faulty compositional information" within the model.
The original patch arrangement may trigger this flawed information, leading to failure.
Our search for a corrective permutation, guided by a genetic algorithm, effectively finds an arrangement that bypasses this faulty information, forcing the model to rely on a more robust, non-compositional evidence-aggregation mechanism, akin to a sophisticated bag-of-words model.
Our work provides the first direct, operational tool to diagnose and understand compositional failures in vision models, highlighting a key challenge on the path toward more robust visual reasoning.
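The abstract's permutation search can be illustrated with a minimal, mutation-only genetic algorithm over patch orderings. This is a hedged sketch, not the paper's implementation: the function `ga_permutation_search`, its hyperparameters, and the toy fitness function below are all illustrative assumptions; in the paper's setting, the fitness would presumably be the model's confidence in the correct class under a given patch arrangement.

```python
import random

def ga_permutation_search(score_fn, n_patches, pop_size=20, generations=50, seed=0):
    """Search over patch permutations with a simple mutation-only genetic algorithm.

    score_fn maps a permutation (a list of patch indices) to a fitness value;
    in the paper's setting this would hypothetically be the model's confidence
    in the ground-truth class when patches are rearranged by that permutation.
    """
    rng = random.Random(seed)
    # Initialize a population of random permutations of the patch indices.
    pop = [rng.sample(range(n_patches), n_patches) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population by fitness and keep the best quarter as elites.
        scored = sorted(pop, key=score_fn, reverse=True)
        elite = scored[: pop_size // 4]
        children = []
        # Refill the population with swap-mutated copies of elite individuals.
        while len(elite) + len(children) < pop_size:
            child = rng.choice(elite)[:]
            i, j = rng.sample(range(n_patches), 2)
            child[i], child[j] = child[j], child[i]  # swap two patch positions
            children.append(child)
        pop = elite + children
    return max(pop, key=score_fn)

# Toy fitness (purely illustrative): reward agreement with a fixed target ordering.
target = list(range(8))[::-1]
best = ga_permutation_search(lambda p: sum(a == b for a, b in zip(p, target)), 8)
```

The sketch uses only mutation (no crossover), since swapping two positions keeps each candidate a valid permutation; a real implementation might instead use permutation-preserving crossover operators.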
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25076