Attributing Model Behavior: The Predominant Influence of Dataset Complexity Over Hyperparameters in Classification
Keywords: Model Behavior Attribution, Complexity Meta-features, Hyperparameters, Bias-Variance Decomposition
Abstract: Understanding the drivers of machine learning performance is essential for optimizing model accuracy and robustness. While significant attention has been given to hyperparameter tuning and data preprocessing, the impact of intrinsic data complexity (e.g., class overlap, feature overlap, dimensionality) remains less explored. This study investigates the comparative influence of data complexity and hyperparameter configurations on the performance of classification algorithms, specifically Random Forests (RF), Support Vector Machines (SVM), Decision Trees (DT), Adaptive Boosting (AB), and Multi-layer Perceptrons (MLP). Using 270 diverse OpenML datasets and 304 hyperparameter configurations, we employ functional analysis of variance (fANOVA) and Ordinary Least Squares (OLS) regression to quantify the relative importance and effect sizes of hyperparameters and complexity meta-features. Our results reveal that data complexity exerts a more substantial influence on both the bias and variance components of error than hyperparameter tuning, underscoring the importance of addressing intrinsic dataset challenges. These findings suggest that efforts to mitigate data complexity factors, such as class overlap or imbalance, may yield greater performance improvements than extensive hyperparameter optimization. This study provides actionable insights for machine learning practitioners and highlights the need for further research into the interplay between dataset properties and algorithmic performance.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11407