Track: Technical
Keywords: large language model, natural language processing, adversarial robustness, adversary, natural text generation, certification, verification
TL;DR: We propose a novel framework to certify natural language generation and provide an algorithm to achieve an adversarial bound.
Abstract: Foundation language models, such as Llama, are often deployed in constrained environments. For instance, a customer support bot may use a large language model (LLM) as its backbone because of the model's broad language comprehension, which typically improves downstream performance. However, these LLMs are susceptible to adversarial inputs and may generate outputs outside the intended target domain. To formalize, assess, and mitigate this risk, we introduce \emph{domain certification}. We formalize a guarantee that accurately characterizes the out-of-domain behavior of language models and propose an algorithm that provides adversarial bounds as a certificate. Finally, we evaluate our method across various datasets and models, demonstrating that it yields meaningful certificates.
Submission Number: 10