Investigating User Estimation of Missing Data in Visual Analysis

Maoyuan Sun; Yuanxin Wang; Courtney Bolton; Yue Ma; Tianyi Li; Jian Zhao

Published: 13 May 2024, Last Modified: 28 May 2024, GI 2024 SD, Readers: Everyone, License: CC BY 4.0

Letter Of Changes: We would like to thank the reviewers for their time and valuable comments for improving our manuscript. We have carefully considered all the concerns raised and endeavored to address each of them.

_1. The issue regarding the generalizability and suitability of study participants._ We have added a paragraph (i.e., the last paragraph in Section 6.2) to discuss the limitation of the participant group in our study and included the suggested reference in the discussion. The detailed edit is listed below.

> Last, the participants, recruited in this study, may not be quite representative of all the real-world use cases that need to handle incomplete data for visual analysis. Such cases often involve people, who can be domain experts and have to make informed decisions about complex datasets. As we recruited participants from MTurk, they (i.e., users in this particular online platform) cannot be fully representative of the whole user population for actual analysis use cases. While there are prior studies that have investigated statistical chart interpretation with non-statisticians (e.g., [21, 39, 41]), it has been found that an expert population was more likely to answer questions and provide feedback more accurately [35]. Thus, due to the limitation of participants in this study, our findings may not hold for cases with different groups of users, which requires further verifications with a more diverse group of participants.

_2. The generalizability concern of the specific scenario (i.e., time-series data) used in the study._ We have added more discussion about this limitation in the second paragraph in Section 6.2. The detailed edit is listed below.

> Second, we used time-series data, particularly weather data, as our test bed in this study, as it is commonly seen in daily life and used in imputation algorithms. However, there are different forms of data (e.g., trees, graphs, and text) that can have missing values. Our results may not be generalized to them, as not all the specific controls in our study can be directly applied to them (e.g., trees, graphs, and text often use different visualizations than a line chart). However, the four key aspects (data, computation, interface, and user) that our controls follow can be generalized to a broad set of data types and analysis tasks, because these aspects are commonly involved in computing-supported data analytics. In addition, weather data scenarios are relatable to the general public, so laypeople can make reasonable predictions based on their real-life experiences. While participants from MTurk with random assignments can provide a reasonable representation of users’ estimations of missing weather data, our study did not consider the role of domain knowledge and expertise in missing data estimation. Future research is needed for other scenarios with different datasets where missing values may not be possibly estimated without domain expertise.

_3. The concern regarding the rationale of studying “end missing”._ We have further clarified this in the first paragraph in Section 1, and the last paragraph in Section 3.2.1. The detailed edit is listed below.

> Moreover, regarding time, future data values can be considered as a special type of missing data (for now they are not present, but existent in the future), and they are commonly analyzed and predicted in domains that involve temporal measurements (e.g., weather forecast, and stock or housing market prediction). [in Section 1]

> $D_{end}$ regards cases of predicting future data, which is considered as a special type of missing data (i.e., missing for the present time, from a future point of view). It is heavily used in many real-world applications (e.g., weather forecast, stock market analysis, and disease control). [in Section 3.2.1]

_4. The concern regarding a clarification of the “prior knowledge” in the study._ We have added more discussion about the prior knowledge to clarify its setting in our study in the last paragraph in Section 3.2.4. The detailed edit is listed below.

> We use whether or not to show historical data to users to control their prior knowledge. If the historical data is presented for users to see, we consider that they have prior knowledge about the data; if not, we consider them without prior knowledge.

Keywords: Missing data, time series, visualization.

Abstract: Missing data is a pervasive issue in real-world analytics, stemming from a multitude of factors (e.g., device malfunctions and network disruptions), making it a ubiquitous challenge in many domains. Misperception of missing data impacts decision-making and causes severe consequences. To mitigate risks from missing data and facilitate proper handling, computing methods (e.g., imputation) have been studied, which often culminate in the visual representation of data for analysts to further check. Yet, the influence of these computed representations on user judgment regarding missing data remains unclear. To study potential influencing factors and their impact on user judgment, we conducted a crowdsourcing study. We controlled 4 factors: the _distribution_, _imputation_, and _visualization_ of missing data, and the _prior knowledge_ of data. We compared users’ estimations of missing data with computed imputations under different combinations of these factors. Our results offer useful guidance for visualizing missing data and their imputations, which informs future studies on developing trustworthy computing methods for visual analysis of missing data.

Supplementary Material: pdf

Submission Number: 10
