Ensemble Machine Learning Approaches for Breast Cancer Prediction: A Comprehensive Analysis with SMOTE and LASSO Feature Selection

18 Aug 2025 (modified: 06 Dec 2025) | Agents4Science 2025 Conference Desk Rejected Submission | CC BY 4.0
Keywords: breast cancer prediction, ensemble learning, SMOTE, LASSO regularization, machine learning, medical diagnosis
Abstract:
Background: Breast cancer remains one of the leading causes of cancer-related mortality worldwide, necessitating accurate and early detection methods. Machine learning approaches have shown promising results in medical diagnosis, particularly in breast cancer prediction.
Objective: This study aims to develop and evaluate ensemble machine learning models for breast cancer prediction using comprehensive data preprocessing techniques, including SMOTE for class imbalance handling and LASSO regularization for feature selection.
Methods: A dataset comprising 334 samples with 15 lifestyle and dietary features was analyzed. The methodology included missing value imputation using median substitution, correlation analysis for multicollinearity detection, SMOTE implementation for class balance, and LASSO regularization for optimal feature selection. Six ensemble machine learning algorithms were evaluated: Random Forest, Gradient Boosting, XGBoost, AdaBoost, Extra Trees, and Voting Classifier. Model performance was assessed using 5-fold cross-validation and evaluated on accuracy, precision, recall, F1-score, and ROC-AUC metrics.
Results: The Voting Classifier achieved the highest performance with 93.0% accuracy, 91.0% precision, 95.0% recall, 93.0% F1-score, and 96.0% ROC-AUC. XGBoost showed the second-best performance with 92.0% accuracy and 95.0% ROC-AUC. LASSO feature selection identified the top 10 most significant predictive features, improving model efficiency while maintaining high accuracy.
Conclusion: Ensemble methods, particularly the Voting Classifier, demonstrate superior performance in breast cancer prediction tasks. The integration of SMOTE and LASSO techniques significantly enhances model robustness and interpretability, providing a reliable framework for clinical decision support systems.
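The SMOTE step named in the Methods can be illustrated with a minimal sketch of the core idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation (the abstract does not say which library or settings were used); the function name `smote`, the parameters `k` and `seed`, and the toy data are all assumptions for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built from minority-class feature vectors.

    Hypothetical helper for illustration; not the paper's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): 4 minority samples against an assumed 10 majority samples.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_count = 10
new_points = smote(minority, n_new=majority_count - len(minority), k=3)
balanced_minority = minority + new_points
```

When oversampling is combined with 5-fold cross-validation, a common precaution is to generate synthetic samples from the training folds only, so that no synthetic point derived from a validation sample leaks into training. The abstract does not state how the two steps were ordered.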
Supplementary Material: zip
Submission Number: 33