Abstract: Human visual experience is markedly different from large-scale computer vision datasets consisting of internet images. Babies densely sample a few 3D scenes with diverse variations in object viewpoint and illumination, whereas datasets like ImageNet contain a single snapshot from each of millions of 3D scenes. We investigated how these differences in input data composition (\ie the visual diet) affect the Out-Of-Distribution (OOD) generalization capabilities of a visual system. Training models on a dataset mimicking attributes of a human-like visual diet improved generalization to OOD lighting, material, and viewpoint changes by up to $18\%$. This observation held even though the models were trained on $1,000$-fold less data. Furthermore, when models were trained on purely synthetic data and tested on natural images, incorporating these visual diet attributes into the training dataset improved OOD generalization by $17\%$. These experiments are enabled by our newly proposed benchmark, the Human Visual Diet (HVD) dataset, and a new model, the Human Diet Network, designed to leverage the attributes of a human-like visual diet. These findings highlight a critical problem in modern-day Artificial Intelligence: building better datasets requires thinking beyond dataset size and focusing instead on improving data composition. All data and source code are available at \url{https://bit.ly/3yX3PAM}.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Evan_G_Shelhamer1
Submission Number: 3992