Asleep at the Wheel: Benchmarking the Inattention of Vision Language Models to Clinical Sleep Signals
Keywords: Sleep Staging, Vision-Language Models, Polysomnography
Abstract: Clinical sleep staging, a cornerstone of diagnosing and managing sleep disorders, poses a complex multimodal challenge for artificial intelligence. While Vision Language Models (VLMs) have achieved remarkable success in general-purpose multimodal understanding, their capacity to interpret long-duration, noisy physiological signals such as polysomnography (PSG) remains largely unexplored. To probe this systematically, we introduce SleepVLM-Bench, the first comprehensive benchmark for evaluating VLMs on clinical sleep staging. We reformulate single-night multichannel PSG into VLM-consumable modalities: time-domain and frequency-domain images, clinician-derived textual features, and raw/discrete feature sequences. We evaluate leading VLMs, including GPT-4o, against strong CNN, RNN, and Transformer baselines on three diverse real-world cohorts (DCSM, SHHS, ISRUC) under standardized subject-wise and cross-cohort protocols. Our central and surprising finding is that state-of-the-art VLMs consistently fail to outperform established end-to-end deep learning baselines on this task, and their performance remains insufficient for independent, clinical-grade deployment. Although expert feature prompting improves interpretability and offers modest robustness in low-data regimes, it does not close this performance gap. These results provide a reproducible reference point that clarifies the current limitations of VLMs on complex, long-duration physiological data. They also highlight the need for future research into specialized multimodal fusion architectures, the integration of physiological priors, and efficient model designs as groundwork for robust, clinically deployable VLM-based solutions. Our dataset will be released upon acceptance.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7097