TL;DR: We propose an ECG-language pretraining model leveraging multi-scale cross-modal supervision.
Abstract: Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision—at the token, beat, and rhythm levels—to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is available at https://github.com/HKU-MedAI/MELP.
Lay Summary: Training deep learning models to analyze heart signals (ECGs) typically requires massive amounts of manually labeled data – a slow, expensive process. Existing self-supervised approaches also struggle because they treat ECGs as one uniform signal, ignoring the natural hierarchy of heartbeats and rhythms crucial for accurate diagnosis. We introduce MELP, a novel AI model that learns from unlabeled ECG-text pairs. MELP first builds a deep understanding of cardiology reports. It then aligns ECG signals with text descriptions at three clinically meaningful levels: fine-grained waveform features (token), individual heartbeats (beat), and overall rhythm patterns (rhythm) – mimicking how cardiologists interpret ECGs. MELP learns highly transferable ECG representations without expensive manual labels. It significantly outperforms prior methods across three public datasets, excelling at tasks like identifying new heart conditions with minimal training ("zero-shot") and adapting efficiently to various ECG analysis jobs. This paves the way for faster, cheaper, and more adaptable AI tools for heart health monitoring.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/HKU-MedAI/MELP
Primary Area: Applications->Health / Medicine
Keywords: ECG, Representation Learning, Multimodal Pretraining
Submission Number: 4729