Foundation models are typically pre-trained on uncurated, unlabeled data collected from various domains on the Internet. As a result, they are fundamentally vulnerable to backdoor attacks, where an attacker injects carefully crafted poisoned inputs into the pre-training data by hosting them on the Internet. A backdoored foundation model outputs an attacker-desired embedding vector for any input with an attacker-chosen trigger. In this work, we propose FoundationForensics, the first forensics method to trace back poisoned pre-training inputs for foundation models after a backdoor attack has happened and a trigger-embedded input has been detected. FoundationForensics first calculates a maliciousness score for each pre-training input by quantifying its contribution to the foundation model's backdoor behavior for the detected trigger-embedded input, and then detects the pre-training inputs with outlier maliciousness scores as poisoned. We theoretically analyze the security of FoundationForensics and empirically evaluate it on single-modal and multi-modal foundation models, three datasets, four existing backdoor attacks, and seven adaptive ones. Our results show that FoundationForensics can accurately trace back the poisoned pre-training inputs for foundation models.
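To make the two-stage pipeline concrete, the following is a minimal, hypothetical sketch of the score-then-flag idea described above. It is not the paper's actual contribution measure: as a stand-in for the maliciousness score, it assumes a simple embedding-similarity proxy (alignment of each pre-training input's embedding with the detected trigger-embedded input and the attacker-desired embedding), and it assumes a median-absolute-deviation rule for outlier detection. All function names and parameters are illustrative.

```python
import numpy as np

def maliciousness_scores(pretrain_embs, trigger_emb, target_emb):
    """Hypothetical maliciousness score: how strongly each pre-training input's
    embedding aligns with both the detected trigger-embedded input and the
    attacker-desired (target) embedding. A proxy, not the paper's measure."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-12)
    return cos(pretrain_embs, trigger_emb) * cos(pretrain_embs, target_emb)

def detect_outliers(scores, k=3.5):
    """Flag inputs whose scores are outliers under a robust z-score
    (median-absolute-deviation) rule; the rule and threshold k are assumptions."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    robust_z = 0.6745 * (scores - med) / mad
    return np.where(robust_z > k)[0]

# Toy usage with random vectors standing in for real foundation-model embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 128))            # embeddings of pre-training inputs
trigger = rng.normal(size=128)                 # detected trigger-embedded input
target = trigger + 0.1 * rng.normal(size=128)  # attacker-desired embedding
embs[::200] = trigger + 0.05 * rng.normal(size=(5, 128))  # simulated poisoned rows
flagged = detect_outliers(maliciousness_scores(embs, trigger, target))
print("flagged indices:", flagged)
```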