Keywords: stability evaluation, optimal transport
TL;DR: This work proposes an optimal transport (OT)-based stability criterion for large language models (LLMs) that addresses both data corruptions and sub-population shifts within a unified framework.
Abstract: The performance of Large Language Models (LLMs) can degrade under distribution shifts, such as changes in language style or reliance on domain-specific knowledge that is underrepresented in the training data. To support robust deployment, we propose a stability evaluation criterion based on distributional perturbations. Conceptually, this criterion measures the minimal perturbation of the data distribution required to induce a specified deterioration in model performance. We quantify these perturbations with an optimal transport (OT) discrepancy with moment constraints on the (sample, density) space. This allows our stability criterion to address both **data corruptions** and **sub-population shifts**, which are common in real-world LLM applications. To make the approach practical, we provide tractable convex formulations and computational methods tailored to different classes of loss functions used in LLMs. Empirically, we validate the stability criterion by evaluating LLMs on jailbreak attempts and general question-answering tasks, demonstrating its effectiveness in assessing model robustness and providing insights into improving stability under diverse real-world scenarios.
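As a rough illustration of the "minimal perturbation for a specified deterioration" idea, the sketch below handles only a simplified sub-population-shift case: samples are reweighted but not moved, and the perturbation is measured by KL divergence from the uniform empirical distribution rather than by the OT discrepancy with moment constraints described in the abstract. The function name `min_kl_reweighting`, the per-sample `losses` array, and the `target` loss level are assumptions introduced for this example, not the paper's formulation. Under these assumptions the minimizer is an exponential tilting of the losses, and the tilt parameter can be found by bisection.

```python
import numpy as np


def min_kl_reweighting(losses, target, iters=200):
    """Smallest KL(Q || uniform) over reweightings Q with E_Q[loss] >= target.

    Simplified stand-in for a stability score: a small value means a mild
    sub-population shift already pushes the average loss up to `target`.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    if target <= losses.mean():
        # No perturbation is needed: the empirical average already reaches target.
        return 0.0, np.full(n, 1.0 / n)
    if target >= losses.max():
        raise ValueError("target must be below the largest per-sample loss")

    def tilt(lam):
        # Exponentially tilted weights q_i proportional to exp(lam * loss_i),
        # shifted by the max loss for numerical stability.
        w = np.exp(lam * (losses - losses.max()))
        return w / w.sum()

    # Bracket the tilt parameter, then bisect on the reweighted mean loss,
    # which is non-decreasing in lam.
    lo, hi = 0.0, 1.0
    while tilt(hi) @ losses < target:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilt(mid) @ losses < target:
            lo = mid
        else:
            hi = mid

    q = tilt(hi)
    pos = q > 0
    kl = float(np.sum(q[pos] * np.log(q[pos] * n)))
    return kl, q


# Hypothetical per-sample losses: three correct answers and one failure.
kl, q = min_kl_reweighting([0.0, 0.0, 0.0, 1.0], target=0.5)
```

In this toy case the failing sample must carry half of the probability mass, so the score reflects how far the weights move from uniform. Handling data corruptions as well would require letting samples move under a transport cost, which this sketch does not attempt.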
Submission Number: 32