Certifying robustness to adaptive data poisoning

Published: 17 Jun 2024, Last Modified: 17 Jun 2024, FoRLaC Poster, CC BY 4.0
Abstract: The rise of foundation models fine-tuned with human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of the robustness of learning algorithms against such attacks. While existing research focuses on certifying robustness against static adversaries acting on offline datasets, dynamic attack algorithms have been shown to be more effective. Motivated by models with periodic updates, such as those trained with RLHF, where an adversary can adapt based on the algorithm's behavior, we present a novel framework for computing certified bounds on the impact of dynamic poisoning, and we use these certificates to design robust learning algorithms. We illustrate the framework on the mean-estimation problem.
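
The abstract does not spell out the certificate construction; as a minimal, hypothetical sketch of what a poisoning certificate for mean estimation can look like, the snippet below bounds the worst-case influence of an adversary who adaptively replaces up to k of n samples on a clipped-mean estimator. The estimator choice, the clipping range [lo, hi], and the names `clipped_mean` and `poisoning_certificate` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def clipped_mean(x, lo, hi):
    """Estimate the mean after clipping every sample to [lo, hi]."""
    return float(np.mean(np.clip(x, lo, hi)))

def poisoning_certificate(n, k, lo, hi):
    """Worst-case shift of the clipped mean when an adversary replaces
    up to k of the n samples. Each replacement changes one clipped
    value by at most (hi - lo), hence the mean by at most (hi - lo)/n;
    k replacements shift it by at most k*(hi - lo)/n."""
    return k * (hi - lo) / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k, lo, hi = 1000, 50, -1.0, 1.0
    clean = rng.uniform(lo, hi, size=n)

    # Adaptive attacker: after observing the clean estimate, push k
    # points toward whichever clipping boundary is farther from it.
    est_clean = clipped_mean(clean, lo, hi)
    target = hi if abs(hi - est_clean) > abs(lo - est_clean) else lo
    poisoned = clean.copy()
    poisoned[:k] = target

    shift = abs(clipped_mean(poisoned, lo, hi) - est_clean)
    bound = poisoning_certificate(n, k, lo, hi)
    assert shift <= bound + 1e-12
    print(f"observed shift {shift:.4f} <= certified bound {bound:.4f}")
```

Because this bound charges each corrupted point independently and does not depend on the data, it remains valid however the adversary chooses its points, which is the property a certificate needs to cover dynamic as well as static attacks.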
Format: Short format (up to 4 pages + refs, appendix)
Publication Status: No
Submission Number: 73