CleanPatrick: A Benchmark for Image Data Cleaning

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Ludovic Amruthalingam; Elisabeth Victoria Goessinger; Hanna Lindemann; Marie Bargiela; Marie Hofbauer; Omar Badri; Philipp Tschandl; Arash Koochek; Matthew Groh; Alexander A. Navarini; Marc Pouly

Published: 01 Jun 2026, Last Modified: 01 Jun 2026. Accepted by DMLR. License: CC BY-SA 4.0

Abstract: Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.
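To make the ranking formulation concrete, here is a minimal sketch of how a cleaning method's output could be scored once samples are sorted from most to least suspicious. The abstract does not say which ranking metrics are used, so precision@k and average precision are assumed here purely as common examples. The function names and the toy ranking are hypothetical and are not taken from the CleanPatrick code base.

```python
from typing import Sequence


def precision_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """Fraction of true issues among the first k ranked samples."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_labels[:k]
    return sum(top) / len(top) if top else 0.0


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Mean of precision@k taken at every rank k that holds a true issue."""
    hits = 0
    total = 0.0
    for rank, is_issue in enumerate(ranked_labels, start=1):
        if is_issue:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy ranking, most suspicious sample first:
# 1 = issue confirmed by ground truth, 0 = clean sample.
ranking = [1, 0, 1, 0, 0]

print(precision_at_k(ranking, 3))
print(average_precision(ranking))
```

Precision@k corresponds to an auditor who can only inspect the top k flagged samples, which is one way to read the "constrained review budgets" mentioned in the abstract.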

Keywords: data quality issues, data cleaning, data-centric AI, benchmark

Changes Since Last Submission: We have revised the manuscript to (i) clarify and formalize task semantics, (ii) better highlight CleanPatrick's distinct evaluation objectives, (iii) emphasize the practical audit-oriented design of near-duplicate construction, and (iv) add a threshold sensitivity analysis. As the editor requested, we additionally:
- Bounded the near-duplicate retrieval bias with an audit of 450 additional high-similarity pairs from pHash, SSIM, and ImageNet ViT-T.
- Corrected the IAA wording to "moderate agreement" per Landis-Koch, and added an LE leave-one-out analysis plus a hierarchical bootstrap showing that the ranking is stable.
- Added an annotation-count sensitivity sweep, showing that the ranking is preserved on the surviving subset.
- Cleaned up tone, figure style, and revision markup.

Code: https://github.com/Digital-Dermatology/CleanPatrick

Assigned Action Editor: ~Andreas_Kirsch1

Submission Number: 128
