Keywords: Large Language Models, Bias, Certification
TL;DR: We present the first framework to formally certify counterfactual bias in the responses of LLMs.
Abstract: Large Language Models (LLMs) can produce biased responses that cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose LLMCert-B, the first framework that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts that differ by demographic group) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for state-of-the-art LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.
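The abstract does not say which statistical method produces the high-confidence bounds. As an illustrative sketch only, one standard way to bound an unknown probability from independent samples is a Clopper-Pearson binomial interval. Everything named below (`clopper_pearson`, `certify`, `sample_is_unbiased`, and the toy stand-in with a 0.7 success rate) is hypothetical and not taken from the paper; a real run would replace the stand-in with code that samples a prefix, builds the counterfactual prompt set, queries the LLM, and applies a bias detector.

```python
import math
import random


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, tol=1e-10):
    """Bisection for f(p) = target, where f is decreasing in p on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at larger p
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) Clopper-Pearson interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


def certify(sample_is_unbiased, n_samples=50, alpha=0.05):
    """Bound the probability of an unbiased response from n_samples i.i.d. draws.

    `sample_is_unbiased` is an assumed callable: each call draws one set of
    counterfactual prompts from the distribution and returns True if the
    model's responses are judged unbiased.
    """
    k = sum(bool(sample_is_unbiased()) for _ in range(n_samples))
    return clopper_pearson(k, n_samples, alpha)


if __name__ == "__main__":
    # Toy stand-in for "sample prompts, query the LLM, run a bias detector".
    rng = random.Random(0)
    lower, upper = certify(lambda: rng.random() < 0.7)
    print(f"P(unbiased) lies in [{lower:.3f}, {upper:.3f}] with 95% confidence")
```

The interval is exact in the sense that its coverage is at least 1 - alpha for any sample size, which makes it a common choice when the number of model queries is limited; the price is that it is conservative and widens quickly as `n_samples` shrinks.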
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8292