Do computer vision foundation models learn the low-level characteristics of the human visual system?
Abstract: Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is shaped by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some of the characteristics of human vision, while other models show little resemblance. Foundation models tend to show lower sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks start to align with low-level human vision, with DINOv2 showing the closest resemblance.
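To make the kind of test described above concrete, the following is a minimal sketch of a contrast-detection probe: generate a Gabor-like grating at a given Michelson contrast, encode both the stimulus and a blank mid-gray background, and use the embedding distance between them as a proxy for the model's response. The `toy_encoder` here is a hypothetical stand-in (a fixed random linear projection), not any model from the paper; a real experiment would substitute an actual image encoder such as DINOv2, and the stimulus parameters (size, frequency, envelope width) are illustrative assumptions.

```python
import numpy as np

def gabor_patch(size=64, frequency=4.0, contrast=0.5):
    """Sine grating in a Gaussian envelope, on a mid-gray background.

    `contrast` is the Michelson contrast of the grating; pixel values
    stay in [0, 1].
    """
    x = np.linspace(-0.5, 0.5, size)
    xx, yy = np.meshgrid(x, x)
    grating = np.sin(2 * np.pi * frequency * xx)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * 0.15**2))
    return 0.5 + 0.5 * contrast * grating * envelope

def toy_encoder(img):
    # Hypothetical stand-in for a real image encoder (e.g., DINOv2):
    # a fixed random linear projection to a 128-d embedding.
    rng = np.random.default_rng(0)  # fixed seed -> deterministic "weights"
    w = rng.normal(size=(128, img.size))
    return w @ img.ravel()

def detection_response(contrast, size=64, frequency=4.0):
    """Embedding distance between the stimulus and a blank background."""
    blank = np.full((size, size), 0.5)
    stim = gabor_patch(size=size, frequency=frequency, contrast=contrast)
    return np.linalg.norm(toy_encoder(stim) - toy_encoder(blank))
```

Sweeping `contrast` (and `frequency`) with `detection_response` yields a crude model-side analogue of a contrast sensitivity measurement: a model that resembles the HVS should show responses that rise smoothly from near zero at low contrast and vary systematically with spatial frequency.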