Formal Interpretability with Merlin-Arthur Classifiers

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: interpretability, explainable AI
Abstract: We propose a new type of multi-agent interactive classifier that provides, for the first time, provable interpretability guarantees even for complex agents such as neural networks. In our setting, which is inspired by the Merlin-Arthur protocol from Interactive Proof Systems, two agents cooperate to provide a classification: the prover selects a small set of features as a certificate and presents it to the verifier, who decides the class. A second, adversarial prover ensures the truthfulness of the system and allows us to connect the game-theoretic equilibrium between the provers and the verifier to guarantees on the exchanged features. We define completeness and soundness metrics that yield a lower bound on the mutual information between the features and the class. Our experiments demonstrate good agreement between theory and practice using neural network classifiers, and we show how our setup practically prevents manipulation.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representation learning
TL;DR: We introduce a new type of interpretable classifier with theoretical guarantees based on the Merlin-Arthur protocol from Interactive Proof Systems.
Supplementary Material: zip