Stability Evaluation of Large Language Models via Distributional Perturbation Analysis

Jiashuo Liu; Jiajin Li; Peng Cui; Jose Blanchet

Published: 09 Oct 2024, Last Modified: 03 Jan 2025. Venue: Red Teaming GenAI Workshop @ NeurIPS'24 (Poster). License: CC BY 4.0

Keywords: stability evaluation, optimal transport

TL;DR: This work proposes an Optimal Transport (OT)-based stability criterion for large language models (LLMs) that addresses both data corruptions and sub-population shifts within a unified framework.

Abstract: The performance of Large Language Models (LLMs) can degrade when exposed to shifts such as changes in language style or domain-specific knowledge that is underrepresented in the training data. To ensure robust deployment, we propose a stability evaluation criterion based on distributional perturbations. Conceptually, this criterion measures the minimal perturbation required in the data to induce a specified deterioration in model performance. We employ optimal transport (OT) discrepancy with moment constraints on the (sample, density) space to quantify these perturbations. This allows our stability criterion to address both **data corruptions** and **sub-population shifts**, which are common in real-world LLM applications. To make this approach practical, we provide tractable convex formulations and computational methods tailored to different classes of loss functions used in LLMs. Empirically, we validate the utility of our stability criterion by testing LLMs on tasks such as jailbreak attempts and general question-answering tasks, demonstrating its effectiveness in assessing model robustness and providing insights into improving stability under diverse real-world scenarios.
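To make the idea of a "minimal perturbation that induces a specified deterioration" concrete, here is a minimal sketch of a simplified special case. It is not the paper's formulation. It assumes that only the sample weights may change (a pure sub-population shift, with no movement of the samples themselves) and that the size of the shift is measured by KL divergence from the uniform empirical distribution rather than by the OT discrepancy with moment constraints. Under those assumptions the least-perturbed reweighting is an exponential tilt of the per-sample losses, and the tilt parameter can be found by bisection. The function names `tilted_weights` and `kl_stability` and the `threshold` parameter are hypothetical and chosen for illustration only.

```python
import math


def tilted_weights(losses, t):
    """Weights q_i proportional to exp(t * loss_i), computed stably."""
    m = max(losses)
    w = [math.exp(t * (l - m)) for l in losses]
    z = sum(w)
    return [x / z for x in w]


def kl_stability(losses, threshold, tol=1e-10, max_iter=200):
    """Smallest KL(q || uniform) over reweightings q whose mean loss
    reaches `threshold`. Simplified stand-in, not the paper's criterion.

    Returns (kl_value, weights).
    """
    n = len(losses)
    if sum(losses) / n >= threshold:
        # The model already performs this badly: no perturbation needed.
        return 0.0, [1.0 / n] * n
    if max(losses) < threshold:
        raise ValueError("threshold unreachable by reweighting alone")

    def tilted_mean(t):
        q = tilted_weights(losses, t)
        return sum(qi * li for qi, li in zip(q, losses))

    # The tilted mean is nondecreasing in t, so bracket and then bisect.
    lo, hi = 0.0, 1.0
    while tilted_mean(hi) < threshold and hi < 1e6:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < threshold:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    q = tilted_weights(losses, hi)
    # KL(q || uniform) = sum_i q_i * log(q_i * n), skipping zero weights.
    kl = sum(qi * math.log(qi * n) for qi in q if qi > 0)
    return kl, q


if __name__ == "__main__":
    # Hypothetical per-sample losses (e.g. 0/1 errors on four prompts).
    value, weights = kl_stability([0.0, 0.0, 1.0, 1.0], threshold=0.75)
    print(value, weights)
```

In this simplified reading, a larger returned value means a larger shift in the data distribution is needed before the average loss reaches the threshold, so the model is more stable on that data. The weights indicate which samples the worst-case reweighting emphasizes.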

Serve As Reviewer: liujiashuo77@gmail.com

Submission Number: 32
