Keywords: data provenance, encoded ensembles, training data contribution
Abstract: The widespread adoption of diffusion models for creative uses such as image, video, and audio synthesis has raised serious legal and ethical concerns surrounding the use of training data and its regulation. Because of the size and complexity of these models, the effect of individual training data is difficult to characterize with existing methods, which complicates regulatory efforts. In this work, we propose a novel approach to tracing the impact of training data using an encoded ensemble of diffusion models. In our approach, each model in the ensemble is trained on an encoded subset of the overall training data, which permits the identification of important training samples. The resulting ensemble allows us to efficiently remove the impact of any training sample. We demonstrate the viability of these ensembles for assessing influence and consider the regulatory implications of this work.
Submission Number: 44
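
The ensemble construction summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the encoding scheme, so the code below assumes one possibility: each training sample receives a distinct constant-weight code, that is, a distinct set of `weight` model indices, and each model is trained only on the samples whose code includes it. The function names (`assign_codes`, `model_subsets`, `models_without`) and all parameter values are hypothetical, and no diffusion model is actually trained here; only the data-to-model bookkeeping is shown.

```python
from itertools import combinations, islice


def assign_codes(num_samples, num_models, weight):
    """Give each sample a distinct set of `weight` model indices (its code).

    Assumption: a constant-weight code; the paper's actual encoding may differ.
    """
    codes = list(islice(combinations(range(num_models), weight), num_samples))
    if len(codes) < num_samples:
        raise ValueError("too few distinct codes; increase num_models or weight")
    return codes


def model_subsets(codes, num_models):
    """Invert the codes: list the sample indices each model is trained on."""
    subsets = [[] for _ in range(num_models)]
    for sample, code in enumerate(codes):
        for model in code:
            subsets[model].append(sample)
    return subsets


def models_without(sample, codes, num_models):
    """Models that never saw `sample`; using only these removes its impact."""
    return [m for m in range(num_models) if m not in codes[sample]]


# Toy configuration (hypothetical): 6 samples, 4 models, 2 models per sample.
codes = assign_codes(num_samples=6, num_models=4, weight=2)
subsets = model_subsets(codes, num_models=4)
clean_models = models_without(3, codes, num_models=4)
```

Under this assumed scheme, removing a sample requires no retraining: generation is restricted to the models returned by `models_without`. Because every sample has a unique code, the pattern of which models reflect a sample's influence can in principle point back to that sample.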