Abstract: This work proposes a definition of, and examines, the problem of undetectably engraving special input/output information into a Neural Network (NN). Investigating this problem is significant given the ubiquity of neural networks and society's reliance on their proper training and use. We study this question systematically and provide (1) definitions of security for secret engravings, (2) machine learning methods for constructing an engraved network, and (3) a threat model, instantiated with state-of-the-art interpretability methods, for devising distinguishers/attackers. This work thus involves two kinds of algorithms: constructions of engravings via machine learning training methods, and the distinguishers associated with the threat model. The weakest of our engraved NN constructions are insecure and can be broken by our distinguishers, whereas other, more systematic engravings are resilient to each of our distinguishing attacks on three prototypical image classification datasets. Our threat model is of independent interest, as it provides a concrete quantification/benchmark for the “goodness” of interpretability methods.