Models That Prove Their Own Correctness

Published: 19 Jun 2024, Last Modified: 12 Jul 2024 · ICML 2024 TiFA Workshop · CC BY 4.0
Keywords: Trustworthy ML, Transformers, Interactive Proofs, Verifiability, Theory
TL;DR: A framework for models that prove their own correctness to a verification algorithm, and how to learn such models.
Abstract: How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured *on average* over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically founded solution to this problem: to train *Self-Proving models* that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions. As an empirical exploration, we use our learning method to train a Self-Proving transformer that computes the Greatest Common Divisor (GCD) *and* proves the correctness of its answer.
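To make the verification idea concrete, below is a minimal sketch of what a verifier $V$ for the GCD task could look like, assuming (for illustration only) that the model's proof consists of Bézout coefficients $(u, v)$ with $u \cdot a + v \cdot b = g$; the paper's actual protocol and transcript format may differ. The `honest_prover` function stands in for a model that answers correctly and supplies a valid proof.

```python
# Illustrative sketch, not the paper's exact protocol.
# Assumption: a proof for "gcd(a, b) = g" is a pair of Bezout coefficients (u, v)
# such that u*a + v*b == g. Any common divisor of a and b divides u*a + v*b,
# so if g also divides both a and b, it must be the greatest common divisor.

import math


def verify_gcd(a: int, b: int, g: int, proof: tuple[int, int]) -> bool:
    """Verifier V: accept iff g is gcd(a, b), certified by Bezout coefficients."""
    u, v = proof
    divides_both = g > 0 and a % g == 0 and b % g == 0  # g is a common divisor
    bezout_holds = u * a + v * b == g                    # g is the *greatest* one
    return divides_both and bezout_holds


def honest_prover(a: int, b: int) -> tuple[int, tuple[int, int]]:
    """Reference prover: correct GCD plus a valid proof, via the extended Euclidean algorithm."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, (old_u, old_v)


if __name__ == "__main__":
    a, b = 252, 105
    g, proof = honest_prover(a, b)
    assert g == math.gcd(a, b)
    print(verify_gcd(a, b, g, proof))      # True: correct answer with a valid proof is accepted
    print(verify_gcd(a, b, g + 1, proof))  # False: an incorrect answer is rejected
```

In this sketch, a Self-Proving model would play the role of `honest_prover`: it is trained not only to output the right $g$, but also to emit a proof that `verify_gcd` accepts.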
Submission Number: 10