Keywords: Debias, Out-of-Distribution, Test-Time Adaptation, Narrative Generation, Preconditioned Optimization
Abstract: Although debiased LLMs perform well on known bias patterns, they often fail to generalize to unfamiliar bias prompts, producing toxic outputs.
We first validate via OOD detection that such high-bias prompts constitute a \emph{distribution shift}, and show that static models degrade under this shift.
To adapt on the fly, we propose \textbf{CAP-TTA}, a test-time adaptation framework that performs context-aware LoRA updates only when a bias-risk \emph{trigger} exceeds a threshold, using a precomputed diagonal \emph{preconditioner} for fast and stable updates.
Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting, significantly improving narrative fluency over a state-of-the-art debiasing baseline while maintaining comparable debiasing effectiveness.
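The trigger-gated, preconditioned update described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the scalar `risk` score, the `threshold` and `lr` defaults, and the use of flat Python lists in place of LoRA parameters are all assumptions made for illustration. It shows only the general idea of skipping the update when the risk is below a threshold and otherwise scaling each gradient coordinate by a fixed, precomputed diagonal.

```python
from typing import List, Tuple


def preconditioned_step(
    params: List[float],
    grads: List[float],
    precond: List[float],
    lr: float = 0.1,
    eps: float = 1e-8,
) -> List[float]:
    """One gradient step with a fixed diagonal preconditioner.

    Each coordinate is updated as p - lr * g / (d + eps), where d is the
    precomputed diagonal entry for that coordinate (assumed positive).
    """
    return [p - lr * g / (d + eps) for p, g, d in zip(params, grads, precond)]


def maybe_adapt(
    params: List[float],
    grads: List[float],
    precond: List[float],
    risk: float,
    threshold: float = 0.5,  # hypothetical trigger threshold
    lr: float = 0.1,
) -> Tuple[List[float], bool]:
    """Apply the preconditioned step only when the risk exceeds the threshold.

    Returns the (possibly unchanged) parameters and a flag indicating
    whether an update was performed.
    """
    if risk <= threshold:
        # Trigger not fired: leave the adapter parameters untouched.
        return list(params), False
    return preconditioned_step(params, grads, precond, lr), True


if __name__ == "__main__":
    adapter = [1.0, 2.0]   # stand-in for LoRA adapter weights
    grad = [0.5, -1.0]     # stand-in for a loss gradient
    diag = [1.0, 4.0]      # stand-in for the precomputed diagonal

    print(maybe_adapt(adapter, grad, diag, risk=0.2))
    print(maybe_adapt(adapter, grad, diag, risk=0.9))
```

Because the diagonal is computed once ahead of time, the per-step cost in this sketch is a single elementwise division, with no optimizer state to maintain during adaptation.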
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, model bias/unfairness mitigation, bias/toxicity, continual learning, safety and alignment, domain adaptation, parameter-efficient-training
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 1099