Abstract: This paper studies a range of AI/ML trust concepts, including memorization, data poisoning, and copyright, that can be modeled as constraints on the influence of data on a (trained) model, where influence is characterized by the difference in the outcomes of a processing function (training algorithm). In this setting, we show that provable trust guarantees can be provided efficiently through a new framework, termed Data-Specific Indistinguishability (DSI), which selects trust-preserving randomization tightly aligned with the targeted outcome differences, as a relaxation of the classic Input-Independent Indistinguishability (III). We establish both the theoretical and algorithmic foundations of DSI with an optimal multivariate Gaussian mechanism. We further show its application to developing trustworthy deep learning with black-box optimizers. Experimental results on memorization mitigation, backdoor defense, and copyright protection demonstrate both the efficiency and effectiveness of the DSI noise mechanism.
Lay Summary: This paper addresses a fundamental problem: how to provably ensure proper data usage and trustworthy behavior of a trained machine learning model. We introduce a new methodology, termed Data-Specific Indistinguishability (DSI), to provide high-probability guarantees regarding data usage and data influence. The central idea is to ensure that the output of a (potentially black-box) machine learning algorithm is statistically indistinguishable from that of a set of safe reference models.
For instance, in the context of memorization mitigation, these reference models may be trained without access to specific sensitive data. In the case of copyright protection, they could be models trained on datasets that exclude particular artworks by a given artist.
To operationalize DSI, we propose an optimal noise mechanism, which adds a minimal amount of Gaussian noise to reduce the divergence between the target output and the safe references. Extensive experiments demonstrate the effectiveness of our approach in several applications, including memorization mitigation in large language models, defense against poisoning-based backdoor attacks, and copyright protection.
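To illustrate the core idea, the following is a minimal sketch, not the paper's optimal multivariate mechanism: it adds isotropic Gaussian noise to a target model output, with the noise scale tied to the largest observed deviation from a set of safe reference outputs. The function name, the norm-based sensitivity, and the `sigma_scale` parameter are illustrative assumptions.

```python
import numpy as np

def dsi_gaussian_sketch(target_output, reference_outputs, sigma_scale=1.0, rng=None):
    """Sketch of a data-specific Gaussian noise mechanism (simplified, isotropic).

    target_output: np.ndarray produced by the (possibly black-box) algorithm.
    reference_outputs: list of np.ndarray outputs from safe reference models
        (e.g., models trained without the sensitive or copyrighted data).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Data-specific "sensitivity": largest deviation of the target output
    # from any safe reference output (reduced here to a single norm).
    sensitivity = max(np.linalg.norm(target_output - ref) for ref in reference_outputs)
    # Calibrate the Gaussian noise scale to that data-specific deviation,
    # so the released output is hard to distinguish from the safe references.
    sigma = sigma_scale * sensitivity
    return target_output + rng.normal(0.0, sigma, size=target_output.shape)
```

In contrast to input-independent mechanisms that calibrate noise to a worst-case sensitivity over all possible inputs, this sketch calibrates noise only to the observed gap between the target and the safe references, which is the intuition behind DSI's tighter noise.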
Primary Area: Social Aspects
Keywords: Differential trustworthiness, data-specific indistinguishability, memorization, backdoor attacks, copyright
Submission Number: 4722