Track: Technical
Keywords: large language model, natural language processing, adversarial robustness, adversary, natural text generation, certification, verification
TL;DR: We propose a novel framework to certify natural language generation and provide an algorithm to achieve an adversarial bound.
Abstract: Foundation language models, such as Llama, are often deployed in constrained environments. For instance, a customer support bot may use a large language model (LLM) as its backbone because of the model's broad language comprehension, which typically improves downstream performance. However, these LLMs are susceptible to adversarial inputs and may generate outputs outside the intended target domain. To formalize, assess, and mitigate this risk, we introduce \emph{domain certification}. We formalize a guarantee that accurately characterizes the out-of-domain behavior of language models and propose an algorithm that provides adversarial bounds as a certificate. Finally, we evaluate our method across various datasets and models, demonstrating that it yields meaningful certificates.
Submission Number: 10