TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Keywords: Multi-modal;Self-Supervised;Electrocardiogram
TL;DR: We propose TAMER, a tri-modal contrastive framework that integrates ECGs, spectrograms, and diagnostic reports to achieve state-of-the-art zero-shot and cross-domain ECG classification.
Abstract: Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER comprises three key components. First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Second, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between the temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of the ECG representations.
Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2\%) and strong cross-domain generalization (AUC: 83.1\%), outperforming existing uni-modal and multi-modal baseline methods. The source code is available at \url{https://anonymous.4open.science/r/TAMER-FB58}.
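To illustrate the tri-modal contrastive alignment idea described in the abstract, the sketch below averages pairwise InfoNCE losses over ECG, spectrogram, and report embeddings. This is a minimal illustration, not the authors' implementation: the function names, the temperature value, and the equal weighting of the three modality pairs are all assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Rows of `a` and `b` at the same index are treated as positive pairs;
    all other rows in the batch serve as negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalise
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                    # (N, N) similarities
    idx = np.arange(len(a))                           # positives on diagonal

    def ce(l):
        # Cross-entropy with diagonal targets (numerically stable softmax).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def tri_modal_loss(z_ecg, z_spec, z_text):
    """Average the three pairwise alignment losses (assumed equal weights)."""
    return (info_nce(z_ecg, z_spec)
            + info_nce(z_ecg, z_text)
            + info_nce(z_spec, z_text)) / 3.0

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))
# Perfectly aligned modalities drive the loss toward its minimum.
loss = tri_modal_loss(z, z, z)
```

In practice each `z_*` would come from the modality-specific encoders of the TFEP module; the sketch only shows the alignment objective itself.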
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7651