TL;DR: Fine-tuning can make models better at visual cognition tasks, but it does not lead to robust human-like generalization to other tasks.
Abstract: Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains in a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not lead to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
Lay Summary: Modern large machine learning models can perform remarkable feats. However, they still struggle on visual tasks that are relatively easy for human observers. To make them better at these tasks, and ideally to make them behave more like humans, we train models on selected tasks from the psychology literature. We find that this makes models better on the tasks they are trained on, but that they cannot transfer what they have learned to other related tasks.
Primary Area: Applications->Neuroscience, Cognitive Science
Keywords: Cognitive Science, Machine Learning, Fine-tuning, Vision language models, Intuitive physics, Causal reasoning, Counterfactual reasoning
Submission Number: 12153