TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Keywords: Multi-modal;Self-Supervised;Electrocardiogram
TL;DR: We propose TAMER, a tri-modal contrastive framework that integrates ECGs, spectrograms, and diagnostic reports to achieve state-of-the-art zero-shot and cross-domain ECG classification.
Abstract: Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER comprises three key components. First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Second, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between the temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of the ECG representations.
Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2\%) and strong cross-domain generalization (AUC: 83.1\%), outperforming existing uni-modal and multi-modal baseline methods. The source code is available at \url{https://anonymous.4open.science/r/TAMER-FB58}.
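To illustrate the tri-modal contrastive alignment idea described in the abstract, the sketch below averages pairwise InfoNCE losses over ECG, spectrogram, and report embeddings. This is a minimal illustration, not the authors' implementation: the function names, the temperature value, and the equal weighting of the three modality pairs are all assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Rows of `a` and `b` at the same index are treated as positive pairs;
    all other rows in the batch serve as negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalise
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                    # (N, N) similarities
    idx = np.arange(len(a))                           # positives on diagonal

    def ce(l):
        # Cross-entropy with diagonal targets (numerically stable softmax).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def tri_modal_loss(z_ecg, z_spec, z_text):
    """Average the three pairwise alignment losses (assumed equal weights)."""
    return (info_nce(z_ecg, z_spec)
            + info_nce(z_ecg, z_text)
            + info_nce(z_spec, z_text)) / 3.0

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))
# Perfectly aligned modalities drive the loss toward its minimum.
loss = tri_modal_loss(z, z, z)
```

In practice each `z_*` would come from the modality-specific encoders of the TFEP module; the sketch only shows the alignment objective itself.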
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7651