Abstract: Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Because only well-resourced corporations can train frontier LLMs, we need robust test-time strategies to control such biases. Existing solutions, which instruct the LLM to be fair or robust, rely on the model's implicit understanding of bias. Causality provides a rich formalism through which we can make our debiasing requirements explicit. Yet, as we show, a naive application of the standard causal debiasing strategy, counterfactual data augmentation, fails to fulfill individual-level debiasing requirements at test time. To address this, we develop stratified invariance, a flexible debiasing notion that can capture a range of debiasing requirements, from the population level to the individual level, through an additional measurement that stratifies the predictions. We present a complete test for this notion and introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs across a suite of synthetic and real-world benchmarks without requiring additional data, fine-tuning, or pre-training.
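To make the test-time idea concrete, below is a minimal, hypothetical sketch (not the paper's OOC implementation) of aggregating an LLM's predictions over counterfactual variants of a prompt within a stratum defined by an auxiliary measurement. The names `llm_predict` and `make_counterfactuals` are placeholders for whatever model API and counterfactual generator one uses, and the invariance property only holds under the stated assumption on the generator.

```python
# Hypothetical sketch: enforce invariance at test time by aggregating predictions
# over a prompt and its counterfactual variants, conditioned on a stratifying
# measurement. Placeholder callables, not the paper's implementation.
from collections import Counter
from typing import Callable, Iterable


def stratified_majority_prediction(
    prompt: str,
    stratum: str,
    make_counterfactuals: Callable[[str], Iterable[str]],
    llm_predict: Callable[[str, str], str],
) -> str:
    """Aggregate LLM predictions over counterfactual prompts within a stratum.

    If `make_counterfactuals` recovers the same family of variants from any
    member of that family, the majority vote depends only on the family and
    the stratum, so the output is unchanged under counterfactual edits of the
    input, which is one way to impose an invariance requirement at test time.
    """
    variants = [prompt, *make_counterfactuals(prompt)]
    # Condition each query on the stratifying measurement (e.g. a summary of
    # task-relevant, non-sensitive features), so invariance holds per stratum.
    predictions = [llm_predict(variant, stratum) for variant in variants]
    return Counter(predictions).most_common(1)[0][0]
```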
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We addressed the editor's comments:
1. Expanded the Broader Impact section, added Limitations, and moved it to the main paper: importantly, we highlight the reliance on the LLM's counterfactual generation abilities and our choice of the adjustment set S.
2. The reviewer's and editor's comments about adjustment set choices are addressed on p. 7, "Choosing S as an adjustment set".
3. We also highlight that our theory serves as motivation for OOC, explicitly acknowledging the gap between theory and practice:
> theoretical contributions serve as a motivation for the implementation of OOC, but are not guaranteed to hold in practice, since we cannot test counterfactual or adjustment set generations.
Assigned Action Editor: ~Stefan_Feuerriegel1
Submission Number: 4274