Abstract: Recent state-of-the-art vision models have introduced new architectures, learning
paradigms, and larger pretraining data, leading to impressive performance on tasks
such as classification. While previous generations of vision models were shown to
lack robustness to factors such as pose, the extent to which this next generation
of models is more robust remains unclear. To study this question, we develop a
dataset of more than 7 million images with controlled changes in pose, position,
background, lighting color, and size. We study not only how robust recent state-of-
the-art models are, but also the extent to which models can generalize to variation in
each of these factors. We consider a catalog of recent vision models, including vision
transformers (ViT), self-supervised models such as masked autoencoders (MAE),
and models trained on larger datasets such as CLIP. We find that even today’s best
models are not robust to common changes in pose, size, and background. When
some samples are varied during training, we find that models require a significant
fraction of instances to be seen varying before they generalize, though robustness
does eventually improve. When variability is witnessed for only some classes,
however, we find that models do not generalize to other classes unless those classes
are very similar to the ones seen varying during training. We hope our work will shed further light on the blind
spots of SoTA models and spur the development of more robust vision models.
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Dumitru_Erhan1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 707