Abstract: In the last decade, machine learning (ML) has shown tremendous success in areas such
as vision, language, strategic games, and more. Parallel to this, hospitals’ capacity for data
collection has greatly increased with the adoption and continuing maturation of electronic
health records (EHRs). The result of these trends has been considerable excitement
and optimism about how ML will revolutionize healthcare once researchers get access to
data. In this work, we present a cautionary tale about the instinct some computer scientists
have to “let the data speak for itself.” Using a popular, public EHR dataset as a case study,
we demonstrate numerous examples where a non-clinician’s intuition may lead to incorrect
– and potentially harmful – modeling assumptions. We explore both non-obvious quirks in
the data (i.e., hypothetical incorrect assumptions) and examples of published papers that
misunderstood the data generating process (i.e., actual incorrect assumptions). This case
study is meant to serve as a cautionary tale to encourage every data scientist to approach
their projects with the humility to know what they can do well and what they cannot.
Without the guidance of stakeholders who understand the data generating process, data
scientists run the risk of “garbage-in, garbage-out” analysis because their models are not
measuring meaningful relationships.