Abstract: Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Because only well-resourced corporations can train frontier LLMs, we need robust test-time strategies to control such biases. Existing solutions, which instruct the LLM to be fair or robust, rely on the model’s implicit understanding of bias. Causality provides a rich formalism through which we can be explicit about our debiasing requirements. Yet, as we show, a naive application of the standard causal debiasing strategy, counterfactual data augmentation, fails to fulfill individual-level debiasing requirements at test time. To address this, we develop stratified invariance, a flexible debiasing notion that can capture a range of debiasing requirements, from the population level to the individual level, through an additional measurement that stratifies the predictions. We develop a complete test for this notion and introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs across a suite of synthetic and real-world benchmarks without requiring additional data, fine-tuning, or pre-training.
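To make the general idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of test-time counterfactual augmentation stratified by an additional measurement: predictions are aggregated over counterfactual prompts that fall in the same stratum as the original input, so the output cannot vary with the sensitive attribute alone. All names here (`llm_predict`, `counterfactuals`, `stratum_of`) are illustrative assumptions.

```python
# Illustrative sketch only: a toy wrapper approximating a stratified-invariance
# constraint via test-time counterfactual augmentation. Function names and
# logic are assumptions for illustration, not the submission's actual method.
from collections import Counter


def llm_predict(prompt: str) -> str:
    """Stand-in for a frontier-LLM call; returns a label for the prompt."""
    return "approve" if "senior" in prompt else "review"


def counterfactuals(prompt: str, attribute: str, values: list[str]) -> list[str]:
    """Generate prompt variants that differ only in the sensitive attribute."""
    return [prompt.replace(attribute, v) for v in values]


def stratified_invariant_predict(prompt: str, attribute: str,
                                 values: list[str], stratum_of) -> str:
    """Majority-vote over counterfactual prompts within the input's stratum,
    so the returned label does not depend on the sensitive attribute alone."""
    stratum = stratum_of(prompt)
    votes = Counter(
        llm_predict(cf)
        for cf in counterfactuals(prompt, attribute, values)
        if stratum_of(cf) == stratum  # keep only same-stratum counterfactuals
    )
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    label = stratified_invariant_predict(
        prompt="Applicant: senior engineer, female",
        attribute="female",
        values=["female", "male"],
        stratum_of=lambda p: "senior" in p,  # toy stratifying measurement
    )
    print(label)
```

In this toy setup the stratifying measurement is a coarse feature of the prompt; the abstract's population-level versus individual-level requirements correspond to coarser or finer choices of this stratification.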
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Stefan_Feuerriegel1
Submission Number: 4274