Bot detection: Simulations and application in people-centered health measurement surveys with missing data
DOI: 10.64028/idns583103
Keywords: Patient-reported outcome measures; Person-centered measurement; Survey bots; Careless responding; Machine learning; Missing data; Permutation test
TL;DR: We adapted an algorithm for detecting random responders to the case of missing data and verified that a dataset on people-centered health measurement was not overrun with random responders.
Abstract: In the context of improving the measurement of pain and emotional well-being among diverse populations, we sought to detect random responders or survey bots that are not responsive to item content. We adapted the L1P1 algorithm by Ilagan and Falk (2024), which uses a permutation test and outlier statistics to compute a p-value and perform classification under the null hypothesis that the response vector is exchangeable. Because the response options for the Likert-type items could yield missing data, simulations evaluated two variants of the outlier statistic computation, both based on the expectation-maximization (EM) algorithm: one in which means and covariances were pre-computed and reused for all rows, and another in which a leave-one-out approach was used. Results indicated that the L1P1 algorithm works as expected, that the leave-one-out strategy performs best, and that respondents with few completed items are flagged at higher rates owing to a loss of specificity. Informed by the simulations, we then performed classification on an empirical dataset (N = 11,197) with 76 Likert-type items. Flagging rates were similarly higher for respondents with fewer completed items, but otherwise low. We therefore expect that random responders are unlikely to strongly influence subsequent analyses for this measurement project.
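To make the permutation-test logic concrete, below is a minimal Python sketch for testing a single respondent. It is illustrative only, not the authors' implementation: the choice of Mahalanobis distance as the outlier statistic, the function names, and the handling of missing entries are assumptions, and the reference means and covariances (`mu`, `sigma`) are taken as given rather than estimated via EM as in the paper.

```python
import numpy as np

def mahalanobis_observed(x, mu, sigma):
    """Outlier statistic using only the non-missing entries of x.

    Mahalanobis distance is one plausible choice of outlier statistic;
    the paper's exact statistic may differ.
    """
    obs = ~np.isnan(x)
    d = x[obs] - mu[obs]
    cov = sigma[np.ix_(obs, obs)]  # subset covariance to observed items
    return float(d @ np.linalg.solve(cov, d))

def permutation_pvalue(x, mu, sigma, n_perm=1000, seed=None):
    """Permutation test under the null that the entries of x are exchangeable.

    A content-responsive respondent should fit the inter-item covariance
    structure, yielding a low statistic relative to permuted versions of
    their own row; a small p-value therefore rejects random responding.
    Respondents who are not rejected remain flagged as potentially random.
    """
    rng = np.random.default_rng(seed)
    observed_stat = mahalanobis_observed(x, mu, sigma)
    count = 0
    for _ in range(n_perm):
        xp = x.copy()
        obs = ~np.isnan(xp)
        xp[obs] = rng.permutation(xp[obs])  # shuffle observed responses only
        if mahalanobis_observed(xp, mu, sigma) <= observed_stat:
            count += 1
    # add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```

In the pre-computed variant described in the abstract, `mu` and `sigma` would be estimated once (e.g., via EM) on the full dataset and reused for every row; in the leave-one-out variant, they would be re-estimated with the row under test excluded, at higher computational cost.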
Supplementary Material: zip
Submission Number: 12