Reproducibility Under Preprocessing Variability: A Comparative Study of Modern Machine Learning Methods on Social Data

Published: 07 Oct 2025 · Last Modified: 10 Nov 2025 · OpenReview Archive Direct Upload · License: CC0 1.0
Abstract: The reproducibility crisis in modern machine learning (ML) research has raised concerns about the stability of published results, particularly when subtle variations in data preprocessing can lead to significantly different outcomes. This paper presents a neutral, reproducible comparison of Random Forest, XGBoost, LightGBM, and shallow neural networks in a simulated case study on the UCI Social Media Sentiment Dataset. Two distinct preprocessing pipelines are evaluated to quantify the impact of differences in tokenization, normalization, and imputation on predictive performance. Results demonstrate that performance variations of up to 8% in accuracy and 12% in F1 score can emerge solely from preprocessing inconsistencies. The findings underscore the need for transparent, documented preprocessing workflows in social data research and provide a reproducibility checklist aligned with Computo’s editorial principles.
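To make the experimental design concrete, the sketch below illustrates the kind of two-pipeline, four-model comparison the abstract describes: two text-preprocessing pipelines that differ only in tokenization and normalization, each paired with the same four model families, with accuracy and F1 reported per combination. The toy corpus, vectorizer settings, and all hyperparameters here are illustrative assumptions, not the paper's actual configuration or dataset.

```python
# Minimal sketch of a preprocessing-variability comparison (assumed setup).
# Placeholder data stands in for the sentiment corpus used in the paper.
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier    # assumes xgboost is installed
from lightgbm import LGBMClassifier  # assumes lightgbm is installed

# Toy stand-in corpus (hypothetical, repeated to give the models something to fit).
texts = ["I love this!", "Terrible update.", "Pretty good overall",
         "worst app ever", "Really enjoying it", "not great, not awful"] * 50
labels = [1, 0, 1, 0, 1, 0] * 50

# Pipeline A: aggressive normalization (lowercasing, accent stripping).
# Pipeline B: minimal normalization with a whitespace-based token pattern.
vectorizers = {
    "pipeline_A": TfidfVectorizer(lowercase=True, strip_accents="unicode"),
    "pipeline_B": TfidfVectorizer(lowercase=False, token_pattern=r"\S+"),
}

# The four model families named in the abstract; all settings are illustrative.
models = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
    "LightGBM": LGBMClassifier(n_estimators=200, random_state=0),
    "ShallowNN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                               random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0)

# Fit every (preprocessing, model) combination and report both metrics,
# so preprocessing-induced spread is visible per model family.
for vec_name, vec in vectorizers.items():
    for model_name, model in models.items():
        clf = make_pipeline(clone(vec), clone(model))
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        print(f"{vec_name:10s} {model_name:12s} "
              f"acc={accuracy_score(y_test, pred):.3f} "
              f"f1={f1_score(y_test, pred):.3f}")
```

Cloning the vectorizer and model for each combination keeps runs independent, which mirrors the kind of isolation a reproducibility checklist would require; on real data, the per-model gap between pipeline_A and pipeline_B is the quantity the abstract's 8% accuracy and 12% F1 figures summarize.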