Reproducibility Under Preprocessing Variability: A Comparative Study of Modern Machine Learning Methods on Social Data
Abstract: The reproducibility crisis in modern machine learning (ML) research has raised
concerns about the stability of published results, particularly when subtle variations in data
preprocessing can lead to significantly different outcomes. This paper presents a neutral,
reproducible comparison of Random Forest, XGBoost, LightGBM, and shallow neural networks
applied to a simulated case study using the UCI Social Media Sentiment Dataset. Two distinct
preprocessing pipelines are evaluated to quantify the impact of tokenization, normalization, and
imputation differences on predictive performance. Results demonstrate that performance variations
of up to 8% in accuracy and 12% in F1 score can emerge solely from preprocessing
inconsistencies. These findings underscore the need for transparent, documented preprocessing
workflows in social data research; we conclude with a reproducibility checklist aligned with
Computo’s editorial principles.
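
To make the comparison concrete, the minimal sketch below shows how two preprocessing pipelines can be held against the same model and scored on accuracy and F1. It assumes scikit-learn's `TfidfVectorizer` and `RandomForestClassifier`; the toy corpus, pipeline settings, and split are hypothetical stand-ins, not the study's actual setup.

```python
# Illustrative sketch (not the paper's code): quantify the metric gap induced
# by two tokenization/normalization choices while holding the model fixed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical placeholder corpus standing in for the sentiment dataset.
texts = [
    "love this update", "worst release ever", "great support team",
    "totally broken again", "happy with the results", "awful user experience",
    "fantastic new feature", "bug ruined my day", "really smooth rollout",
    "crashes on startup", "delightful interface", "slow and unreliable",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Pipeline A: lowercased, drops one-character tokens (scikit-learn default).
# Pipeline B: case-sensitive, keeps one-character tokens.
pipelines = {
    "A": TfidfVectorizer(lowercase=True, token_pattern=r"(?u)\b\w\w+\b"),
    "B": TfidfVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b"),
}

X_tr_txt, X_te_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

for name, vec in pipelines.items():
    X_tr = vec.fit_transform(X_tr_txt)  # fit vocabulary on the training fold only
    X_te = vec.transform(X_te_txt)      # reuse that vocabulary at test time
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(f"pipeline {name}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```

Reporting the per-pipeline scores side by side, as above, is what surfaces the accuracy and F1 gaps attributable to preprocessing alone.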