Asleep at the Wheel: Benchmarking the Inattention of Vision Language Models to Clinical Sleep Signals

Xihe Qiu; Yue Zhang; Chao Qu; Ruikang Zhao; Xiaoyu Tan

16 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Sleep Staging, Vision-Language Models, Polysomnography

Abstract: Clinical sleep staging, a cornerstone of diagnosing and managing sleep disorders, poses a complex multimodal challenge for artificial intelligence. While Vision Language Models (VLMs) have achieved strong results in general-purpose multimodal understanding, their capacity to interpret long-duration, noisy physiological signals such as polysomnography (PSG) remains largely unexplored. To probe this systematically, we introduce SleepVLM-Bench, the first comprehensive benchmark for evaluating VLMs in clinical sleep staging. We reformulate single-night multichannel PSG into VLM-consumable modalities: time and frequency images, clinician-derived textual features, and raw/discrete feature sequences. We evaluate leading VLMs, including GPT-4o, against strong CNN, RNN, and Transformer baselines on three real-world cohorts (DCSM, SHHS, ISRUC) using standardized subject-wise and cross-cohort evaluations. Our central finding is that state-of-the-art VLMs consistently fail to outperform established end-to-end deep learning baselines on this task, and their performance remains insufficient for independent, clinical-grade deployment. Although expert feature prompting improves interpretability and offers modest robustness in low-data regimes, it does not close this performance gap. These results provide a reproducible reference point that clarifies the current limitations of VLMs on complex, long-duration physiological data and highlights the need for future research into specialized multimodal fusion architectures, the integration of physiological priors, and efficient model designs, laying the groundwork for robust, clinically deployable VLM-based solutions. Our dataset will be released upon acceptance.

Primary Area: foundation or frontier models, including LLMs

Submission Number: 7097
