TL;DR: Most models fail at human-like contour integration; those that integrate contours in a more human-like way also achieve better accuracy and robustness.
Abstract: Despite the tremendous success of deep learning in computer vision, models still fall behind humans in generalizing to new input distributions. Existing benchmarks do not investigate the specific failure points of models by analyzing performance under many controlled conditions. Our study systematically dissects where and why models struggle with contour integration -- a hallmark of human vision -- by designing an experiment that tests object recognition under various levels of object fragmentation. Humans (n=50) perform at high accuracy, even with few object contours present. In contrast, models exhibit substantially lower sensitivity to increasing object contours, with most of the over 1,000 models we tested barely performing above chance. Only at very large scales ($\sim 5$B training dataset size) do models begin to approach human performance. Importantly, humans exhibit an integration bias -- a preference for recognizing objects made up of directional fragments over directionless fragments. We find not only that models sharing this property perform better at our task, but also that this bias increases with model training dataset size, and that training models to exhibit contour integration leads to high shape bias. Taken together, our results suggest that contour integration is a hallmark of object vision that underlies object recognition performance and may be a mechanism learned from data at scale.
Lay Summary: Computer vision is not as robust as human vision. Why? One reason is that humans adapt better to unseen circumstances: for example, we can recognize familiar objects even when only scattered pieces of their outlines are visible, whereas cutting-edge AI vision systems often fail under these conditions.
To understand this phenomenon in humans and AI models, we showed 50 people and over a thousand AI models images in which objects were broken into disconnected fragments at many levels of difficulty. This allowed us to study why and where humans and models face the greatest difficulties. While humans stayed accurate even when the object's edges were severely fragmented, most AI models performed near chance unless trained on extremely large image datasets.
We also discovered that both people and the biggest models excel when fragments align along the object's true contour -- a "gap-filling" ability known as contour integration. This work reveals that piecing together broken outlines is fundamental to robust vision and can emerge in AI purely through massive data exposure.
Primary Area: Applications->Neuroscience, Cognitive Science
Keywords: psychophysics, machine vision, human vision, contour integration, robustness, visual perception
Submission Number: 15810