Multiscale Diagnostics of Visual Language Models

Published: 05 Sept 2025, Last Modified: 11 Feb 2026 · ICCV Workshop CV4DC · CC BY 4.0
Abstract: Vision-Language Models (VLMs) have emerged as a transformative approach at the intersection of computer vision and natural language processing, demonstrating remarkable zero-shot performance that rivals supervised methods on classification tasks. In this work, we investigate how object size affects the performance of both contrastive and generative vision-language architectures across diverse visual domains. We benchmark six popular models, including CLIP variants, SmolVLM, BLIP-VQA, and ViLT, on datasets spanning three domains: general object detection (PASCAL VOC), wildlife photography (African Wildlife), and vehicle detection (Vehicles-OpenImages). Our results reveal significant size-dependent performance variations across models and datasets, with certain models showing strong recognition ability for specific object classes even when those classes consist of relatively small objects. Traditional CLIP variants show substantial accuracy improvements from tiny to large objects (18.9% to 81.5% for CLIP on PASCAL VOC), while SmolVLM demonstrates exceptional performance with minimal scale sensitivity, achieving 98.5-99.7% accuracy across all object sizes on the PASCAL VOC and African Wildlife datasets. However, performance patterns vary considerably by dataset, with some models, such as ViLT, showing unexpected performance drops on larger objects in certain domains. These findings provide concrete guidance for model selection in real-world deployments: SmolVLM's scale invariance makes it well suited to applications requiring consistent performance across viewing distances, while CLIP variants should be avoided for small-object recognition tasks.
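The sketch below illustrates the kind of size-stratified zero-shot evaluation the abstract describes, using CLIP as an example. It is a minimal illustration, not the authors' exact protocol: the class list, prompt template, size-bin thresholds, and the assumption that objects are cropped to ground-truth boxes are all placeholders chosen for clarity.

```python
# Minimal sketch of size-stratified zero-shot classification with CLIP.
# Assumptions (not from the paper): objects are cropped to ground-truth boxes,
# the prompt template is "a photo of a {class}", and bin thresholds are illustrative.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

CLASSES = ["aeroplane", "bicycle", "bird", "boat", "car", "cat", "dog", "person"]
PROMPTS = [f"a photo of a {c}" for c in CLASSES]


def size_bin(box_area: float, image_area: float) -> str:
    """Assign an object to a size bin by its area relative to the full image."""
    ratio = box_area / image_area
    if ratio < 0.05:
        return "tiny"
    if ratio < 0.15:
        return "small"
    if ratio < 0.40:
        return "medium"
    return "large"


@torch.no_grad()
def classify(image: Image.Image) -> str:
    """Zero-shot prediction: pick the class prompt with the highest image-text similarity."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, num_classes)
    return CLASSES[logits.argmax(dim=-1).item()]
```

Per-bin accuracy then follows by grouping each cropped object's prediction under `size_bin(...)` and averaging correctness within each bin, which is how scale sensitivity such as the 18.9% (tiny) to 81.5% (large) spread for CLIP on PASCAL VOC would surface.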