Detecting Backdoors with Meta-Models

Lauro Langosco; Neel Alex; William Baker; David Quarel; Herbie Bradley; David Krueger

Published: 28 Oct 2023, Last Modified: 13 Mar 2024, NeurIPS 2023 BUGS Poster

Keywords: backdoors, interpretability, meta-models

TL;DR: We propose using meta-models, neural networks that take another network's parameters as input, to detect backdoors directly from model weights.

Abstract: It is widely known that backdoors can be implanted into neural networks, allowing an attacker to choose an input that produces a particular undesirable output (e.g. a misclassified image). We propose using meta-models, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end, we present a meta-model architecture and train it on a dataset of approximately 4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and it detects the presence of a backdoor with $>99\%$ accuracy when the test trigger pattern is drawn from the same distribution as the training triggers (i.i.d.), with some success even on out-of-distribution backdoors.
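
To make the idea of a model that "takes another network's parameters as input" concrete, here is a minimal sketch in plain Python. It is not the architecture presented in the paper: the linear scorer, the toy parameter shapes, and the names `flatten_params` and `meta_model_predict` are all assumptions made for illustration, and the meta-model weights are random rather than trained.

```python
import math
import random


def flatten_params(params):
    """Recursively flatten nested lists of weights into one flat list of floats."""
    flat = []
    for p in params:
        if isinstance(p, (list, tuple)):
            flat.extend(flatten_params(p))
        else:
            flat.append(float(p))
    return flat


def meta_model_predict(features, weights, bias):
    """Hypothetical linear meta-model: maps flattened base-model weights
    to a score in (0, 1), read as 'probability the base model is backdoored'."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)

# Toy stand-in for a base network: a 2x3 weight matrix and a length-3 bias.
# (Assumption: a real base model would be a CNN with far more parameters.)
base_params = [
    [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)],
    [random.gauss(0, 1) for _ in range(3)],
]

# The meta-model's input is the base network's parameters, not its data.
features = flatten_params(base_params)

# Untrained meta-model weights, one per input feature (illustrative only).
meta_weights = [random.gauss(0, 0.01) for _ in features]

p_backdoor = meta_model_predict(features, meta_weights, bias=0.0)
print(len(features), p_backdoor)
```

In a real setting the scorer would be trained on many labeled clean and backdoored networks, as the abstract describes; the sketch only shows the input/output contract of such a detector.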

Submission Number: 8
