Benchmarking Visual Fast Mapping: Probing VLMs' Test-time Image-text Alignment

ICLR 2026 Conference Submission 19438 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Visual In-context Learning, Visual Reasoning, Cross-modal Alignment
Abstract: Visual Fast Mapping (VFM) refers to the human ability to rapidly form new visual concepts from minimal examples based on experience and knowledge, a keystone of inductive capacity extensively studied in cognitive science. In computer vision, early efforts attempted to achieve this capability through one-shot learning methods but achieved limited generalization. Despite recent advances in Visual Language Models (VLMs), which require large-scale image-text corpora, this human-like capability has still not been acquired. In this paper, we introduce VFM Bench, designed to evaluate VFM ability in realistic industrial scenarios, and reveal a performance gap of over $19.0\%$ between humans and VLMs. Most VLMs tend to rely on purely visual discriminative features rather than exploit prior language knowledge for test-time alignment. Notably, emerging visual reasoning models demonstrate early-stage performance improvements yet still fall short of the average human, suggesting a promising direction for leveraging cross-modal information in context. The code and dataset for VFM Bench are anonymously available at: https://anonymous.4open.science/r/VisualFastMappingBenchmark.
Primary Area: datasets and benchmarks
Submission Number: 19438