Semantic Gravity: When Parametric Memory Overpowers Visual Thermodynamics in Video-LLMs

Published: 04 Jun 2026, Last Modified: 04 Jun 2026
Venue: ICML MemFM 2026 Workshop Poster
License: CC BY 4.0
Keywords: Large Language Models, Physical Reasoning, Video-LLMs, Memorization, Information Theory, Diagnostic Probing.
TL;DR: We expose "Semantic Gravity" in Video-LLMs: a fundamental failure where models ignore contradictory physical evidence to follow memorized linguistic scripts, revealing they function more as script-retrievers than true world simulators.
Abstract: Video Language Models (Video-LLMs) have demonstrated impressive spatiotemporal capabilities, yet it remains unclear whether they reason about physical laws or rely on learned priors. We investigate this tension by using the Thermodynamic Arrow of Time as a diagnostic probe for parametric memory. We introduce the Observational Entropy Benchmark (OEB), a dataset of chiral video pairs in which high-entropy physical events are presented in both forward and time-reversed order. This setup creates causal friction: the visual evidence in reversed sequences directly contradicts a model’s learned thermodynamic priors. To quantify this effect, we propose Semantic Gravity G_js, an information-theoretic metric that measures the dominance of internal linguistic scripts (priors) over visual grounding. Our evaluation of state-of-the-art models reveals significant "semantic gravity": models frequently override visual evidence of entropy decrease to maintain standard narrative scripts. These findings suggest that current Video-LLMs function primarily as script-retrievers rather than adaptive world-models, posing a fundamental limitation for their deployment in safety-critical and scientific domains.
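Illustrative sketch (not from the paper): the abstract does not define G_js, so the following is only a minimal guess at how a chiral pair and a prior-dominance score could be set up. It assumes a clip is a sequence of frames, that a model yields a probability distribution over a fixed set of candidate descriptions for each clip, and that the score is one minus the Jensen-Shannon divergence between the forward and reversed answer distributions. The names `reverse_clip`, `js_divergence`, and `prior_dominance` are hypothetical, and the authors' actual metric may differ.

```python
import math


def reverse_clip(frames):
    """Return the time-reversed copy of a clip (any sequence of frames)."""
    return list(reversed(frames))


def js_divergence(p, q):
    """Jensen-Shannon divergence, in bits, between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0
        # because b is the mixture of the two inputs.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def prior_dominance(p_forward, p_reversed):
    """Hypothetical stand-in for G_js (an assumption, not the paper's formula).

    Returns a value near 1 when the model gives the same answer distribution
    for the forward and the reversed clip (the visual order is ignored), and
    a value near 0 when the two distributions are disjoint.
    """
    return 1.0 - js_divergence(p_forward, p_reversed)


# Hypothetical answer distributions over ("glass shatters", "glass reassembles").
script_bound = prior_dominance([0.9, 0.1], [0.9, 0.1])   # same answer either way
grounded = prior_dominance([1.0, 0.0], [0.0, 1.0])       # answer flips with the video
```

Under these assumptions, a model that narrates "glass shatters" for both members of a pair scores high, which is the behavior the abstract describes as following a memorized script despite contradictory visual evidence.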
Submission Number: 74