Exploring the Impacts of Features in Diabetes Prediction Models using Machine Learning Algorithms: Explainable Artificial Intelligence (XAI) Approach
Keywords: Diabetes, Impacts, predictive model, Explainable, Interpretable
Abstract: This study explores the impact of individual features in diabetes prediction models using machine learning algorithms and an explainable artificial intelligence (XAI) approach. The data were extracted from the Centers for Disease Control and Prevention and preprocessed to obtain quality data suitable for developing a model that predicts diabetes. Because the class distribution of the dataset was imbalanced, we handled this problem with the combined SMOTE + Tomek links method, and we used decision tree, random forest, CatBoost, XGBoost, and LightGBM classifier algorithms to construct the predictive models. Ten experiments were conducted on a total of 227,804 records with 20 features, split into training and testing sets at an 80/20 ratio using stratified sampling. The models were evaluated using accuracy, precision, F1-score, and the ROC curve. Applying an explainable AI approach, we aimed to shed light on the critical features driving the models' predictions; understanding the significance of individual features enhances the interpretability and trustworthiness of these models, which is crucial for their adoption in clinical settings. To explore feature impact, we used removal-based explanation and leave-one-column-out (LOCO) methods: each feature was removed in turn, the model was retrained on the remaining features, and the resulting accuracy was compared with the original accuracy to quantify the impact of the removed feature. We also explained the predictive models using SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to increase trust in the models' results. Of all the developed predictive models, the LightGBM classifier performed best, with an accuracy of 83.33% and a precision of 78.56% on the imbalanced dataset.
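The leave-one-column-out (LOCO) procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it uses a synthetic dataset and a plain decision tree stand-in rather than the CDC data, the SMOTE + Tomek resampling, or the tuned gradient-boosting models.

```python
# Hypothetical LOCO (leave-one-column-out) feature-impact sketch.
# Dataset, model, and settings are illustrative placeholders only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for the diabetes dataset.
X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=5, random_state=0)
# 80/20 stratified split, as in the paper's experimental setup.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

def fit_accuracy(X_train, X_test):
    """Train a classifier and return its test accuracy."""
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_tr)
    return accuracy_score(y_te, model.predict(X_test))

baseline = fit_accuracy(X_tr, X_te)  # accuracy with all features present

impacts = {}
for col in range(X.shape[1]):
    # Drop one column, retrain, and record the accuracy change:
    # a large positive drop suggests the removed feature carried real signal.
    impacts[col] = baseline - fit_accuracy(
        np.delete(X_tr, col, axis=1),
        np.delete(X_te, col, axis=1))
```

Features are then ranked by `impacts`; the same loop applies unchanged to any scikit-learn-compatible estimator, including the boosting models used in the study.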
Submission Number: 4