Close to Reality: Interpretable and Feasible Data Augmentation for Imbalanced Learning

TMLR Paper5826 Authors

05 Sept 2025 (modified: 18 Feb 2026) | Rejected by TMLR | CC BY 4.0
Abstract: Many machine learning classification tasks involve imbalanced datasets, which are often subject to over-sampling techniques aimed at improving model performance. However, these techniques are prone to generating unrealistic or infeasible samples. Furthermore, they often function as black boxes, lacking interpretability in their procedures. This opacity makes it difficult to track their effectiveness and make necessary adjustments, and the techniques may ultimately fail to yield significant performance improvements. To bridge this gap, we introduce Decision Predicate Graphs for Data Augmentation (DPG-da), a framework that extracts interpretable decision predicates from trained models to capture domain rules and enforce them during sample generation. This design ensures that over-sampled data remain diverse, constraint-satisfying, and interpretable. In experiments on synthetic and real-world benchmark datasets, DPG-da consistently improves classification performance over traditional over-sampling methods, while guaranteeing logical validity and offering clear, interpretable explanations of the over-sampled data.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. Surrogate Model Performance and Evaluation: Added macro F1-scores of the surrogate Random Forest (RF) on holdout training sets for all datasets, and added tables and figures in Appendix F comparing the performance of the surrogate model on baseline data with that of classifiers trained on synthetic data. Clarified that downstream classifiers are evaluated on untouched test sets to avoid bias from the surrogate, and added more details on holdout splits, precision, and recall metrics.
2. Methodology Clarifications: Expanded descriptions of DPG-da, including how feasible regions are defined, GA initialization, fitness function components (feasibility, sparsity, locality, diversity), and handling of discrete features.
3. Interpretability and Traceability: Updated Section 4.3 to distinguish between traceability of the augmentation process (tracking evolutionary trajectories of synthetic samples) and interpretability of the data structure (understanding feasible regions and feature constraints), addressing reviewer suggestions for conceptual precision.
4. Scalability and Computational Analysis: Added discussions of runtime, GA complexity, and empirical correlations with dataset size, feature dimensionality, and constraint set size to Appendix G. Moved the limitations section from Appendix A to the main text to set realistic expectations for practitioners.
5. Evaluation and Figures: Clarified aggregation of F1-scores across 10 repetitions per dataset and augmentation level; added scatterplots and boxplots in the Appendix to show baseline vs. DPG-da performance and highlight any sensitivity in metrics. Normalized violation rates are shown in updated figures for ease of comparison.
Minor Edits: Corrected spacing issues in equations and citations, clarified terminology, and improved descriptions of GA objectives, predicate evaluation, and sampling strategy in the main text.
Assigned Action Editor: ~Rahaf_Aljundi1
Submission Number: 5826