Simulator‑Based Synthetic ECGs for Self-Supervised Pretraining

15 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: simulator synthesized ECG, self‑supervised pretraining, knowledge‑driven simulators
TL;DR: Pretraining a Transformer with MAE on simulator-synthesized ECG performs comparably to real-data pretraining, beats supervised baselines (+5.49% mean relative F1 over the strongest baseline across 26 tasks), and stays robust with few labels and under demographic shifts, all without using patient data during pretraining.
Abstract: Medical data remain scarce and sensitive, even though domain knowledge is rich. We ask whether knowledge-driven parametric ECG simulators can supply scalable self-supervised pretraining signals without using any patient records during pretraining. We use two established simulator-based ECG generators to synthesize 10-s, 500-Hz lead II signals, pretrain a Transformer encoder on them with masked autoencoding, and compare the result to pretraining on real PTB-XL ECGs and on VAE/GAN-generated ECGs. We then fine-tune on 26 abnormal-ECG classification tasks across PTB-XL, G12EC, and CPSC2018 and benchmark against five strong supervised baselines. The Transformer pretrained on simulator-based synthetic ECG (SimECG) performs comparably to real-data pretraining and outperforms supervised baselines on 24 of 26 tasks, yielding a mean 5.49% relative F1 improvement over the strongest baseline. Under reduced labeled-data budgets and across patient demographics, it largely preserves the advantages of real-data pretraining, and it maintains competitive performance across all 12 single-lead configurations. Crucially, the simulator-based approach exposes the model to no patient data during pretraining. These results indicate that knowledge-driven synthetic ECG corpora can provide a practical, privacy-enhancing initialization for downstream ECG models in data-limited regimes.
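To make the pretraining objective concrete, the following is a minimal sketch of masked-autoencoding data preparation on a 10-s, 500-Hz single-lead signal. It is illustrative only and is not the authors' implementation: the toy Gaussian-bump waveform stands in for a real ECG simulator, and the patch length (`PATCH_LEN`), mask ratio (`MASK_RATIO`), and all function names are assumptions not stated in the abstract.

```python
import math
import random

FS_HZ = 500          # sampling rate stated in the abstract
DURATION_S = 10      # segment length stated in the abstract
PATCH_LEN = 50       # assumption: 0.1-s patches (not from the paper)
MASK_RATIO = 0.75    # assumption: a common MAE choice (not from the paper)


def toy_ecg(n_samples, fs_hz, heart_rate_bpm=60.0):
    """Toy stand-in for a simulator: one narrow Gaussian bump per beat."""
    period = 60.0 / heart_rate_bpm
    signal = []
    for i in range(n_samples):
        t = i / fs_hz
        phase = (t % period) - period / 2.0  # centre the bump in each beat
        signal.append(math.exp(-(phase ** 2) / (2 * 0.01 ** 2)))
    return signal


def patchify(signal, patch_len):
    """Split a 1-D signal into non-overlapping, equal-length patches."""
    assert len(signal) % patch_len == 0
    return [signal[i:i + patch_len] for i in range(0, len(signal), patch_len)]


def random_mask(n_patches, mask_ratio, rng):
    """Return sorted indices of visible and masked patches."""
    n_masked = int(round(n_patches * mask_ratio))
    order = list(range(n_patches))
    rng.shuffle(order)
    masked = sorted(order[:n_masked])
    visible = sorted(order[n_masked:])
    return visible, masked


def masked_mse(target_patches, predicted_patches, masked_idx):
    """Mean squared reconstruction error over the masked patches only."""
    total, count = 0.0, 0
    for idx in masked_idx:
        for y, y_hat in zip(target_patches[idx], predicted_patches[idx]):
            total += (y - y_hat) ** 2
            count += 1
    return total / count


rng = random.Random(0)
signal = toy_ecg(FS_HZ * DURATION_S, FS_HZ)
patches = patchify(signal, PATCH_LEN)
visible_idx, masked_idx = random_mask(len(patches), MASK_RATIO, rng)

# A real encoder would see only the visible patches; an all-zeros
# "prediction" stands in for the decoder output here.
zeros = [[0.0] * PATCH_LEN for _ in patches]
loss = masked_mse(patches, zeros, masked_idx)
```

In an actual pipeline, the visible patches would be embedded and passed to the Transformer encoder, and a decoder would reconstruct the masked patches; only the loss on masked positions would drive the gradient.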
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5752