Enhancing Cervical Cancer Classification Through Augmented and Synthetic Data in Machine Learning

Reza Gheibi, Dean Hougen

Published: 01 Jan 2025, Last Modified: 30 Jul 2025, ICPHM 2025, CC BY-SA 4.0

Abstract: Cervical cancer remains a significant public health challenge worldwide, especially in low-resource settings where screening and early detection programs are limited. Recent advances in machine learning (ML) have shown potential to transform cervical cancer detection by offering efficient and potentially more accessible methods. This paper explores various ML approaches that have been applied to cervical cancer classification, highlighting their implications and effectiveness. In addition, we integrate data augmentation and synthetic data generation techniques to address the challenge of limited datasets in cervical cancer screening. We employed a variety of data augmentation methods, including synthetic minority over-sampling, Generative Adversarial Networks (GANs), Large Language Models (LLMs), and Forest Diffusion, to enrich our tabular training dataset, expanding it while maintaining realistic data properties. Our ML models, including multilayer perceptron neural networks (MLPs) and classifiers such as Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Gaussian Naive Bayes, and K-Nearest Neighbors, were trained on this augmented dataset. The performance of these models was compared against traditional ML models trained on non-augmented data. Our findings demonstrate that models trained on augmented data not only match but, in certain instances, exceed those trained on the original datasets, with improvements of up to 2.2% in accuracy and 1.2% in the F1-score, particularly when using datasets filled with synthetic values. This underscores the value of augmented and synthetic datasets in addressing the challenges posed by limited medical data sizes.
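As a rough illustration of the synthetic minority over-sampling idea mentioned in the abstract, the sketch below creates new minority-class rows by interpolating between a sampled row and one of its nearest minority-class neighbours, in the style of SMOTE. It is not the authors' implementation: the function name `smote_like_oversample`, its parameters, and the toy feature values are all hypothetical, and a real study would more likely rely on an established library.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    k = min(k, len(minority) - 1)      # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up numeric features for four minority-class rows (assumption, not real data).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_like_oversample(minority_rows, n_new=6)
print(len(new_rows), "synthetic rows generated")
```

Because every synthetic row lies on a line segment between two existing minority rows, the generated values stay inside the range of the observed data, which is one simple sense in which augmentation can preserve realistic data properties.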