COFT: Counterfactual–Conformal Decoding for Fair Chain‑of‑Thought Reasoning in Large Language Models
Keywords: LLM, Reasoning, Chain-of-Thought, Fairness, Bias, Counterfactual, Conformal, Decoding
Abstract: Large language models (LLMs) can surface and amplify societal bias in chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free, decoding-time method that provides instance-level fairness control with statistical guarantees for any frozen causal LM. COFT pairs each prompt with a counterfactual copy whose sensitive spans are masked, compares the factual and masked logit trajectories over the shared vocabulary, attenuates span-driven tokens via lightweight logit fusion, and uses dual-branch split-conformal calibration to certify per-step candidate sets at a chosen risk level. Across six models, COFT reduces standard bias metrics by 30–55\% (median $\approx$38\%) while preserving task utility and language-model quality (reasoning accuracies unchanged within run-to-run noise), with a predictable overhead comparable to one masked forward pass (under 11\%). COFT thus offers a clear, auditable path to safer CoT generation: substantial bias reduction, negligible utility loss, and no retraining, auxiliary classifiers, or weight access.
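To make the abstract's pipeline concrete, here is a minimal sketch of one decoding step in the spirit of COFT, assuming a Hugging Face causal LM. The helper names (`mask_sensitive_spans`, `coft_step`), the fusion weight `alpha`, the calibrated threshold `q_hat`, the clipped log-probability-gap fusion rule, and the `1 - probability` nonconformity score are illustrative placeholders, not the paper's exact formulas.

```python
# Illustrative COFT-style decoding step (assumption: not the authors' reference code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stands in for any frozen causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def mask_sensitive_spans(prompt: str, spans: list, mask: str = "[MASK]") -> str:
    """Build the masked counterfactual prompt by replacing each sensitive span."""
    for s in spans:
        prompt = prompt.replace(s, mask)
    return prompt

@torch.no_grad()
def coft_step(fact_ids, cf_ids, alpha: float = 1.0, q_hat: float = 0.9):
    """One step: fuse factual and counterfactual logits, then return a
    conformal candidate set at the (assumed precomputed) threshold q_hat."""
    logp_fact = torch.log_softmax(model(fact_ids).logits[:, -1, :], dim=-1)
    logp_cf = torch.log_softmax(model(cf_ids).logits[:, -1, :], dim=-1)
    # Attenuate span-driven tokens: penalise tokens whose log-probability
    # rises when the sensitive span is present (one plausible fusion choice).
    fused = logp_fact - alpha * torch.clamp(logp_fact - logp_cf, min=0.0)
    probs = torch.softmax(fused, dim=-1).squeeze(0)
    # Split-conformal candidate set: keep tokens whose nonconformity score
    # (here 1 - prob) does not exceed the quantile q_hat from calibration data.
    keep = torch.nonzero(1.0 - probs <= q_hat).squeeze(-1)
    if keep.numel() == 0:                 # fall back to greedy if the set is empty
        keep = probs.argmax().unsqueeze(0)
    next_id = keep[probs[keep].argmax()]
    return next_id, keep

prompt = "The nurse said that"
cf_prompt = mask_sensitive_spans(prompt, ["nurse"])
fact_ids = tok(prompt, return_tensors="pt").input_ids
cf_ids = tok(cf_prompt, return_tensors="pt").input_ids
next_id, candidates = coft_step(fact_ids, cf_ids)
print(tok.decode([int(next_id)]), candidates.numel())
```

In a full decoder this step would run in a loop, with the chosen token appended to both the factual and masked contexts so the two trajectories stay aligned, and with `q_hat` obtained from the dual-branch split-conformal calibration split at the chosen risk level rather than fixed by hand.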
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20857