Annotation Sensitivity: Drivers of Training Data Quality

ICLR 2024 Workshop DMLR Submission 12 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024
Venue: DMLR @ ICLR 2024
License: CC BY 4.0
Keywords: Annotation; Data Quality; Sensitivity; Training Data
TL;DR: We conduct experiments to understand what drives the quality of annotated data.
Abstract: When developing Machine Learning (ML) solutions, much effort and many resources go into algorithm optimization to maximize performance metrics and reduce the resources employed. At some point, however, the real-life performance of ML applications is limited by the quality of the underlying training data. More importantly, unwanted biases and flaws in the annotated training data can creep into the resulting models and lead to an overreliance on erroneous data. A data-centric approach can help build a better understanding of the determinants of bias and data quality in ML. Thorough experimental research that carefully evaluates current practices in light of their effects on training data and models is key to developing new best practices for annotation. To foster the improvement of annotation practices, we follow a research agenda that assesses the quality of ML training data and its drivers. Inspired by the realization that annotation tasks are similar to web surveys, we derive hypotheses from research in survey methodology and social psychology: both surveys and annotation tasks present the human with a fixed stimulus and ask them to select one or more fixed response categories. Informed by this rich interdisciplinary body of literature, we conduct experimental research to understand the mechanisms that affect the quality of annotated training data.
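The abstract does not name a specific quality measure, but a common proxy for the quality of annotated data is inter-annotator agreement. The following is a minimal, hypothetical sketch (not taken from the paper) of Cohen's kappa, a chance-corrected agreement statistic for two annotators, in plain Python:

```python
# Hypothetical illustration: Cohen's kappa as one common proxy for
# annotation quality. The paper does not specify this metric.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Example: two annotators labeling five items (e.g., for offensive content).
print(cohen_kappa(["hate", "ok", "ok", "hate", "ok"],
                  ["hate", "ok", "hate", "hate", "ok"]))  # ~0.615
```

Kappa near 1 indicates agreement well above chance; values near 0 suggest the annotation task or instructions leave substantial room for annotator disagreement, which is exactly the kind of quality driver the research agenda investigates.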
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Extended abstracts: up to 2 pages
DMLR For Good Track: Participate in DMLR for Good Track
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 12