Sample Reweighting to Effectively Use Synthetic Data during Model Training

17 Sept 2025 (modified: 03 Dec 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: synthetic, generated, augmentation, reweight, healthcare, pain, gait, ecg
TL;DR: extend sample-based weighting to reweight synthetic samples by difficulty
Abstract: Training robust machine learning models, especially in healthcare applications, is difficult because of limited labeled data, noisy labels, and class imbalance. Synthetic data generation has emerged as a promising way to overcome these limitations. However, naively incorporating synthetic samples often introduces new problems, such as variable sample quality and distribution mismatch. To address these issues, we propose an integrated framework that leverages Lightweight Learnable Adaptive Weighting (LiLAW) to dynamically reweight synthetic samples based on their evolving difficulty during training. We extend LiLAW, which was developed for multi-class classification, to the multi-label classification and regression settings. We then apply LiLAW to two recently introduced synthetic datasets: SynPAIN, a large-scale dataset of synthetic facial expressions designed for automated pain classification, and GAITGen, a dataset of clinically relevant synthetic gait sequences for Parkinson's disease severity estimation. We further validate our framework on ECG5000, a healthcare time-series dataset for heartbeat classification, using simple augmentations. We obtain state-of-the-art results on all of these datasets and demonstrate that LiLAW significantly improves model performance by adaptively prioritizing synthetic samples according to their difficulty. Our approach provides a computationally efficient and practical way to improve the quality and inclusion of synthetic data in model training.
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 9523
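
Illustrative sketch (not from the submission): the abstract describes reweighting synthetic samples by their evolving difficulty, but this page does not give LiLAW's actual weighting functions or update rule. The snippet below is therefore only a generic difficulty-based reweighting example under stated assumptions: per-sample cross-entropy loss stands in for difficulty, real samples keep weight 1, and synthetic samples get a sigmoid weight controlled by two hypothetical parameters, `alpha` and `beta`. In a learnable scheme such parameters would be updated during training; here they are fixed constants.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_sample_losses(batch_logits, labels):
    """Cross-entropy of each sample, used here as a proxy for difficulty."""
    return [-math.log(softmax(z)[y]) for z, y in zip(batch_logits, labels)]


def difficulty_weights(losses, is_synthetic, alpha=-1.0, beta=0.0):
    """Hypothetical rule (an assumption, not LiLAW's formulation).

    Real samples keep weight 1.0. Synthetic samples get
    sigmoid(alpha * loss + beta): alpha < 0 down-weights hard synthetic
    samples, alpha > 0 up-weights them.
    """
    weights = []
    for loss, synthetic in zip(losses, is_synthetic):
        if synthetic:
            weights.append(1.0 / (1.0 + math.exp(-(alpha * loss + beta))))
        else:
            weights.append(1.0)
    return weights


def weighted_mean_loss(losses, weights):
    """Weighted average of the per-sample losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy batch: one real sample followed by two synthetic ones.
batch_logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
labels = [0, 0, 0]
is_synthetic = [False, True, True]

losses = per_sample_losses(batch_logits, labels)
weights = difficulty_weights(losses, is_synthetic)
batch_loss = weighted_mean_loss(losses, weights)
```

Recomputing the weights at every training step from the current losses is what would make the weighting track difficulty as it evolves. The sign of `alpha` decides whether hard or easy synthetic samples are emphasized, and the abstract does not say which direction the authors use.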