Keywords: Diffusion Models, Inpainting, LAION, Privacy, Dataset
TL;DR: We present a scalable dataset transformation pipeline for ensuring privacy compliance when training AI models.
Abstract: Large-scale web-scraped datasets have contributed significantly to progress in deep learning, yet the extensive presence of biometric data, such as faces, raises legitimate legal, ethical, and privacy concerns. Existing approaches address this either by removing sensitive images entirely, often sacrificing downstream performance, or by purchasing the rights to licensed images. To address this gap, we present a privacy-preserving transformation pipeline that uses a diffusion-based inpainting model to systematically replace detected faces in images with multiple synthetic variants conditioned on different demographic attributes, yielding a novel, privacy-preserving dataset of distinct face images. Our method, evaluated on 12,000 images transformed from LAION-400M and CelebA-HQ, eliminates privacy risks without significant loss of image quality or diversity. This transformation pipeline will serve as a scalable guideline for creating datasets that comply with legal and ethical privacy constraints.
Submission Number: 41