MyHealthDL: A Prior-Seeded Synthetic Malaysian Clinical Dataset for Multi-Task Deep Learning in Data-Scarce Healthcare Settings
Keywords: Synthetic Tabular Data; Multi-Task Learning (MTL); Deep Learning; CTGAN
TL;DR: Because Malaysia's privacy law restricts access to clinical records, we built MyHealthDL, a 10,000-record synthetic clinical dataset. Seeded from national statistics and refined via CTGAN, it tests guideline-fidelity learning.
Abstract: Malaysia’s Personal Data Protection Act 2010 restricts access to clinical records, creating a structural barrier to health AI research. We introduce MyHealthDL, a 10,000-record synthetic Malaysian clinical tabular dataset built through a two-stage pipeline: a parametric seed sampled from published NHMS 2019, MOH 2022/2023, and MDTR 2022 statistics, followed by CTGAN refinement to capture inter-feature correlations. Labels derive from Malaysian clinical practice guidelines (CPGs), making experiments a test of guideline-fidelity learning rather than clinical generalisation. We introduce a rule-based ceiling baseline, characterise the bias-variance trade-off of conditional oversampling for rare comorbidity structures, evaluate downstream utility via TSTRNP with dual probe learners (RF and XGBoost), and report ethnicity-stratified fidelity and fairness metrics. Code and data will be released upon acceptance (withheld for anonymous review).
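The seed stage and the rule-based ceiling described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the group names, marginal proportions, glucose distribution, and the single-threshold labelling rule are hypothetical placeholders rather than the published NHMS, MOH, or MDTR figures or the actual CPG rules, and the CTGAN refinement stage is omitted. The sketch draws records from assumed marginals, labels them with a deterministic rule, and shows why a rule-based baseline attains perfect agreement when labels are themselves rule-derived.

```python
import random

# Hypothetical placeholder marginals; NOT the published national statistics.
GROUP_P = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}
FPG_MEAN, FPG_SD = 5.8, 1.4   # fasting plasma glucose in mmol/L (placeholder values)
FPG_THRESHOLD = 7.0           # illustrative guideline-style cut-off in mmol/L


def guideline_label(record):
    """Deterministic rule standing in for a CPG-derived label."""
    return int(record["fpg"] >= FPG_THRESHOLD)


def sample_seed(n, seed=0):
    """Stage 1 sketch: draw each feature independently from assumed marginals."""
    rng = random.Random(seed)
    groups, weights = zip(*GROUP_P.items())
    records = []
    for _ in range(n):
        rec = {
            "group": rng.choices(groups, weights=weights)[0],
            # Clip at 2.0 mmol/L so no implausibly low value is generated.
            "fpg": max(2.0, rng.gauss(FPG_MEAN, FPG_SD)),
        }
        rec["label"] = guideline_label(rec)
        records.append(rec)
    return records


seed_records = sample_seed(10_000)

# Rule-based ceiling: re-applying the labelling rule reproduces every label,
# so any learned model is bounded above by this agreement rate.
ceiling_acc = sum(
    guideline_label(r) == r["label"] for r in seed_records
) / len(seed_records)
```

In this sketch the features are sampled independently, so the seed carries no inter-feature correlation; that is the gap the abstract assigns to the CTGAN refinement stage. Because the ceiling is reached by construction, a model's score measures how well it recovers the rule, not how well it would generalise to real patients.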
Track: Track 2: ML Research by Muslim Authors
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 83