The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting

Published: 02 Mar 2026, Last Modified: 09 Mar 2026, Venue: ICLR 2026 Workshop ICBINB, License: CC BY 4.0
Keywords: Video-Language Models (VLMs), Temporal Reasoning, Diagnostic Evaluation
TL;DR: We uncover the "Low-Frequency Trap": a scale-invariant failure in which VLMs cannot count simple events, suggesting they rely on continuous interpolation rather than discrete state tracking.
Abstract: Large multimodal models demonstrate strong performance on complex video understanding benchmarks, leading to the expectation that they should trivially handle simple temporal reasoning tasks. In this work, we show that this assumption is fundamentally flawed. Using parametric profiling -- systematically varying event frequency, event count, and temporal span -- we uncover a striking failure mode: state-of-the-art video–language models fail catastrophically on conceptually simple event-counting tasks. While performance generally degrades as event frequency increases (as expected), we observe a counter-intuitive collapse in the easy regime: even at low frequencies (0.5--1 Hz) with visually distinct events, performance plummets once the event count exceeds a trivial threshold (e.g., $N > 4$). Moreover, scaling model size from 8B to 235B does not resolve this limitation; large and small models exhibit nearly identical capability boundaries. Our analysis suggests that errors arise not from high-level reasoning or counting per se, but from systematic temporal misinterpretation, including event merging, hallucinated intermediate states, and color-based temporal interpolation. These results reveal a blind spot in current models’ temporal abstraction that is masked by aggregate benchmark scores and largely invariant to scale. Our findings highlight the need for diagnostic evaluation beyond average accuracy and suggest that scaling alone is insufficient to resolve fundamental limitations in temporal event reasoning.
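Illustration: the parametric profiling described in the abstract can be pictured as a sweep over event frequency and event count. The sketch below is a hypothetical example, not the authors' code. It assumes evenly spaced events, so that temporal span follows as count divided by frequency; the function names and grid values are illustrative (only the 0.5 to 1 Hz range and the $N > 4$ threshold come from the abstract).

```python
from itertools import product


def event_onsets(frequency_hz, count):
    """Onset times in seconds for `count` evenly spaced events (assumed spacing)."""
    period = 1.0 / frequency_hz
    return [i * period for i in range(count)]


def profiling_grid(frequencies_hz, counts):
    """Yield one stimulus config per (frequency, count) pair.

    Assumption: span is derived as count / frequency rather than varied
    independently; a real setup could also pad or stretch the clip.
    """
    for f, n in product(frequencies_hz, counts):
        yield {
            "frequency_hz": f,
            "count": n,
            "span_s": n / f,
            "onsets_s": event_onsets(f, n),
        }


# Illustrative grid: low frequencies, counts on both sides of N = 4.
grid = list(profiling_grid([0.5, 1.0, 2.0], [2, 4, 8]))
```

Each config would then be rendered as a clip and scored against the known count, giving accuracy per cell of the grid rather than a single average.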
Submission Number: 108