Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications

Published: 09 Oct 2025, Last Modified: 09 Oct 2025, NeurIPS 2025 Workshop Imageomics, CC BY 4.0
Submission Track: Short papers presenting ongoing research or work submitted to other venues (up to 5 pages, excluding references)
Keywords: MLLM, data efficiency, small data
TL;DR: We demonstrate that vision models outperform multimodal large language models in the small-data regime (100-10K samples), which is important for practical ML applications.
Abstract: The practical application of AI tools for specific computer vision tasks relies on the "small-data regime" of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control. We find, however, that computer vision research has ignored the small-data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multimodal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.
Submission Number: 53
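To make the evaluation setup concrete, below is a minimal sketch of a small-data sweep for the vision-only side of such a comparison: a linear probe fit on frozen image features at several training-set sizes, with test accuracy recorded at each size. The function name, the size grid, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the authors' exact NeWT protocol.

```python
# Hypothetical small-data sweep (illustrative, not the paper's exact protocol):
# fit a linear probe on precomputed, frozen vision features at several
# training-set sizes and record test accuracy at each size.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def small_data_sweep(train_feats, train_labels, test_feats, test_labels,
                     sizes=(10, 30, 100, 300, 1000, 3000, 10000), seed=0):
    """Return {training-set size: test accuracy} for a linear probe.

    Inputs are NumPy arrays: features of shape (N, D) and integer labels (N,).
    The size grid is an assumed example spanning the small-data regime.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        n = min(n, len(train_labels))
        # Subsample n training examples without replacement.
        idx = rng.choice(len(train_labels), size=n, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(train_feats[idx], train_labels[idx])
        results[n] = accuracy_score(test_labels, clf.predict(test_feats))
    return results
```

Plotting the resulting accuracies against training-set size gives the kind of curve the abstract describes: a method whose accuracy keeps rising across the sweep benefits from the small-data regime, whereas one that plateaus early does not.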