Keywords: Multimodal Large Language Models, Visual In-context Learning, Visual Reasoning, Cross-modal Alignment
Abstract: Visual Fast Mapping (VFM) refers to the human ability to rapidly form new visual concepts from minimal examples based on experience and knowledge, a keystone of inductive capacity extensively studied in cognitive science. In computer vision, early endeavors attempted to achieve this capability through one-shot learning methods, yet achieved only limited generalization. Despite recent advances in Visual Language Models (VLMs), which require large-scale image-text corpora, this human-like capability has still not been acquired. In this paper, we introduce VFM Bench, designed to evaluate VFM ability in realistic industrial scenarios, and reveal a performance gap of over $19.0\%$ between humans and VLMs. Most VLMs tend to rely on pure-vision discriminative features rather than making use of prior language knowledge for test-time alignment. Notably, emerging visual reasoning models demonstrate early-stage performance improvements yet still lag behind average human performance, suggesting a promising direction for leveraging cross-modal information in context. The code and dataset for VFM Bench are anonymously available at: https://anonymous.4open.science/r/VisualFastMappingBenchmark.
Primary Area: datasets and benchmarks
Submission Number: 19438