Keywords: out-of-distribution generalization, OOD, visual diet, data diversity, scene context, viewpoints, lighting, materials
Abstract: Human visual experience is markedly different from the large-scale computer vision datasets constructed by scraping the internet. Babies densely sample a few 3D scenes with diverse variations, whereas datasets like ImageNet contain a single snapshot from each of millions of 3D scenes. We investigated how these differences in input data composition (i.e., visual diet) impact the Out-Of-Distribution (OOD) generalization capabilities of a visual system. We found that training models on a dataset mimicking attributes of a human-like visual diet improved generalization to OOD lighting, material, and viewpoint changes by up to $18$%, despite using $1,000$-fold less training data. Furthermore, when models were trained on purely synthetic data and tested on natural images, incorporating these attributes into the training dataset improved OOD generalization by $17$%. These experiments are enabled by our newly proposed benchmark, the Human Visual Diet (HVD) dataset, and a new model (the Human Diet Network) designed to leverage the attributes of a human-like visual diet. These findings highlight a critical problem in modern-day Artificial Intelligence: building better datasets requires thinking beyond dataset size and improving data composition. All data and source code are available at https://bit.ly/3yX3PAM.
Submission Number: 28