Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction

ACL ARR 2025 May Submission1496 Authors

17 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Single-channel electroencephalography (EEG) plays a vital role in evaluating sleep quality and diagnosing sleep disorders, which makes EEG-based sleep stage classification an essential task in clinical practice. Traditional machine learning methods rely heavily on prior knowledge and handcrafted feature extraction, while deep learning approaches still face limitations in modeling frequency-domain features. Vision-Language Models (VLMs) have recently made significant progress in the medical domain, but they still perform poorly on physiological waveform data, especially EEG signals. This weakness stems mainly from their limited visual understanding and insufficient reasoning capability. To address it, we propose a hierarchical vision-language model that integrates multi-level feature alignment with visually enhanced language-guided reasoning to improve EEG-based sleep stage classification. Our approach introduces a specialized visual enhancement module that uses intermediate-layer outputs to construct high-level visual tokens, enabling the extraction of deep semantic information from EEG images. A multi-level feature alignment mechanism then fuses these high-level tokens with low-level visual tokens extracted by CLIP, enhancing the VLM’s image-processing capabilities in this context. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes the complex medical inference process into interpretable logical steps, simulating expert decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification.
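The abstract does not specify how the high-level tokens are built or how the two token levels are fused, so the following is only an illustrative sketch under stated assumptions: high-level tokens are taken as the per-position average of several intermediate-layer outputs, and fusion is a single unparameterized cross-attention step in which low-level tokens attend to high-level tokens, followed by a residual addition. The function names `build_high_level_tokens` and `fuse_tokens` are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_high_level_tokens(layer_outputs):
    """Average token vectors across intermediate layers (assumed pooling rule).

    layer_outputs: list of layers; each layer is a list of tokens;
    each token is a list of floats of the same dimension.
    """
    num_layers = len(layer_outputs)
    return [
        [sum(values) / num_layers for values in zip(*token_versions)]
        for token_versions in zip(*layer_outputs)
    ]


def fuse_tokens(low_tokens, high_tokens):
    """Assumed fusion: low-level tokens attend to high-level tokens, plus a residual."""
    dim = len(low_tokens[0])
    fused = []
    for query in low_tokens:
        # Scaled dot-product score of this low-level token against each high-level token.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in high_tokens
        ]
        weights = softmax(scores)
        # Attention-weighted mix of the high-level tokens.
        context = [
            sum(w * key[j] for w, key in zip(weights, high_tokens))
            for j in range(dim)
        ]
        # Residual connection keeps the original low-level information.
        fused.append([a + c for a, c in zip(query, context)])
    return fused


# Toy data: two intermediate layers, two tokens, dimension 2.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
low = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for low-level (e.g. CLIP) tokens

high = build_high_level_tokens(layers)
fused = fuse_tokens(low, high)
print(len(fused), len(fused[0]))
```

In a trained model the attention step would normally use learned query, key, and value projections; they are left out here to keep the sketch short, and the fused tokens keep the same count and dimension as the low-level tokens.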

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: Sleep Stage Prediction, Hierarchical Vision-Language Model, Multi-Level Feature Alignment, Language-Guided Reasoning, Electroencephalogram

Contribution Types: Model analysis & interpretability, Data analysis

Languages Studied: English

Submission Number: 1496
