What you see is not what you get!: Detecting Simpson's Paradoxes during Data Exploration

Yue Guo, Carsten Binnig, Tim Kraska

Published: 01 Jan 2017, Last Modified: 19 Sept 2024, HILDA@SIGMOD 2017, CC BY-SA 4.0

Abstract: Visual data exploration tools, such as Vizdom or Tableau, significantly simplify data exploration for domain experts and, more importantly, novice users. These tools allow users to discover complex correlations and to test hypotheses and differences between various populations in an entirely visual manner with just a few clicks. Unfortunately, this is often done while ignoring even the most basic statistical rules. For example, there are many statistical pitfalls that a user can "tap" into when exploring data sets. As a result of this experience, we started to build QUDE [1], the first system for Quantifying the Uncertainty in Data Exploration, which is part of Brown's Interactive Data Exploration Stack (called IDES). The goal of QUDE is to automatically warn and, if possible, protect users from common mistakes during the data exploration process. In this paper, we focus on a different type of error, Simpson's Paradox, in which a high-level aggregate or visualization leads to the wrong conclusion because a trend reverses when the visualized data set is split into multiple subgroups (i.e., when executing a drill-down).
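To make the reversal concrete, the following minimal sketch checks whether the ordering of two options in the aggregate flips in every subgroup after a drill-down. It is not QUDE's detection method, which the abstract does not describe. The function names `success_rates` and `is_simpsons_paradox` are hypothetical, and the counts are the often-cited kidney-stone treatment figures, used here purely as an illustration rather than as data from the paper.

```python
from collections import defaultdict

# Illustrative records: (subgroup, treatment, successes, trials).
# These are the widely quoted kidney-stone counts, not data from the paper.
RECORDS = [
    ("small", "A", 81, 87),
    ("small", "B", 234, 270),
    ("large", "A", 192, 263),
    ("large", "B", 55, 80),
]


def success_rates(records):
    """Return (aggregate, per_group) success rates keyed by treatment."""
    totals = defaultdict(lambda: [0, 0])  # treatment -> [successes, trials]
    per_group = defaultdict(dict)         # subgroup -> {treatment: rate}
    for group, treatment, successes, trials in records:
        totals[treatment][0] += successes
        totals[treatment][1] += trials
        per_group[group][treatment] = successes / trials
    aggregate = {t: s / n for t, (s, n) in totals.items()}
    return aggregate, dict(per_group)


def is_simpsons_paradox(records, a="A", b="B"):
    """True if the aggregate ordering of a vs. b is reversed in every subgroup.

    Simplifying assumption: exact ties between rates are not handled specially.
    """
    aggregate, per_group = success_rates(records)
    a_wins_overall = aggregate[a] > aggregate[b]
    return all(
        (rates[a] > rates[b]) != a_wins_overall
        for rates in per_group.values()
    )


if __name__ == "__main__":
    aggregate, per_group = success_rates(RECORDS)
    print("aggregate:", aggregate)
    print("per subgroup:", per_group)
    print("reversal detected:", is_simpsons_paradox(RECORDS))
```

In this example, treatment B has the higher success rate in the aggregate, yet treatment A has the higher rate in both the "small" and the "large" subgroup, which is the kind of drill-down reversal the abstract describes. A real detector would also need to account for statistical significance and for which attributes to split on; this sketch ignores both.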