FADE: Probing the Limits of VLMs on Fine-Grained OCR

Published: 05 May 2026 · Last Modified: 10 May 2026 · 4th ALVR Poster · CC BY 4.0
Keywords: Vision language models, fine-grained perception, watermark benchmark, OCR
TL;DR: We introduce FADE, a benchmark showing that frontier MLLMs struggle with low-signal visual perception. As watermark transparency increases, their accuracy collapses, revealing that these models fail at fine-grained visual grounding.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in semantic visual reasoning, yet their capacity for fine-grained, low-level perception remains critically under-evaluated. This perceptual fragility limits their reliability in noisy, real-world environments where visual signals are degraded, and existing benchmarks often entangle visual perception with language priors, masking the underlying deficits. To address this, we introduce the **FAint numeric Detection Evaluation (FADE)** dataset, an evaluation suite designed to probe the limits of zero-shot Optical Character Recognition (OCR) in frontier MLLMs. By embedding synthetic, strictly numerical sequences over cluttered natural backgrounds at varying levels of transparency ($\alpha$), FADE disentangles pure visual perception from semantic predictability. We evaluate state-of-the-art models, including Gemini 3.0, Claude 4.5 Sonnet, and Gemma 3, against a specialized UNet segmentation baseline. The results reveal a striking limitation of frontier architectures: while they achieve near-perfect transcription at high visibility, their performance collapses under high transparency. Conversely, the UNet pipeline maintains robust spatial grounding, significantly outperforming the generalist models at the lowest visibility thresholds. FADE provides a reproducible dataset for exposing and diagnosing the perceptual breaking points of modern multimodal systems.
Submission Number: 41