COFT: Counterfactual–Conformal Decoding for Fair Chain‑of‑Thought Reasoning in Large Language Models
Keywords: LLM, Reasoning, Chain-of-Thought, Fairness, Bias, Counterfactual, Conformal, Decoding
Abstract: Large language models (LLMs) can surface and amplify societal bias in chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free, decoding-time method that provides instance-level fairness control with statistical guarantees for any frozen causal LM. COFT pairs each prompt with a counterfactual copy whose sensitive spans are masked, compares the factual and masked logit trajectories over the shared vocabulary, attenuates span-driven tokens via lightweight logit fusion, and uses dual-branch split-conformal calibration to certify per-step candidate sets at a chosen risk level. Across six models, COFT reduces standard bias metrics by 30–55\% (median $\approx$38\%) while preserving task utility and language-model quality (reasoning accuracies unchanged within run-to-run noise), with a predictable overhead comparable to one masked forward pass (under 11\%). COFT thus offers a clear, auditable path to safer CoT generation: substantial bias reduction, negligible utility loss, and no retraining, auxiliary classifiers, or weight access.
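To make the abstract's pipeline concrete, here is a minimal sketch of one decoding step in the spirit of COFT, assuming a Hugging Face causal LM. The helper names (`mask_sensitive_spans`, `coft_step`), the fusion weight `alpha`, the calibrated threshold `q_hat`, the clipped log-probability-gap fusion rule, and the `1 - probability` nonconformity score are illustrative placeholders, not the paper's exact formulas.

```python
# Illustrative COFT-style decoding step (assumption: not the authors' reference code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stands in for any frozen causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def mask_sensitive_spans(prompt: str, spans: list, mask: str = "[MASK]") -> str:
    """Build the masked counterfactual prompt by replacing each sensitive span."""
    for s in spans:
        prompt = prompt.replace(s, mask)
    return prompt

@torch.no_grad()
def coft_step(fact_ids, cf_ids, alpha: float = 1.0, q_hat: float = 0.9):
    """One step: fuse factual and counterfactual logits, then return a
    conformal candidate set at the (assumed precomputed) threshold q_hat."""
    logp_fact = torch.log_softmax(model(fact_ids).logits[:, -1, :], dim=-1)
    logp_cf = torch.log_softmax(model(cf_ids).logits[:, -1, :], dim=-1)
    # Attenuate span-driven tokens: penalise tokens whose log-probability
    # rises when the sensitive span is present (one plausible fusion choice).
    fused = logp_fact - alpha * torch.clamp(logp_fact - logp_cf, min=0.0)
    probs = torch.softmax(fused, dim=-1).squeeze(0)
    # Split-conformal candidate set: keep tokens whose nonconformity score
    # (here 1 - prob) does not exceed the quantile q_hat from calibration data.
    keep = torch.nonzero(1.0 - probs <= q_hat).squeeze(-1)
    if keep.numel() == 0:                 # fall back to greedy if the set is empty
        keep = probs.argmax().unsqueeze(0)
    next_id = keep[probs[keep].argmax()]
    return next_id, keep

prompt = "The nurse said that"
cf_prompt = mask_sensitive_spans(prompt, ["nurse"])
fact_ids = tok(prompt, return_tensors="pt").input_ids
cf_ids = tok(cf_prompt, return_tensors="pt").input_ids
next_id, candidates = coft_step(fact_ids, cf_ids)
print(tok.decode([int(next_id)]), candidates.numel())
```

In a full decoder this step would run in a loop, with the chosen token appended to both the factual and masked contexts so the two trajectories stay aligned, and with `q_hat` obtained from the dual-branch split-conformal calibration split at the chosen risk level rather than fixed by hand.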
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20857