Keywords: Large Language Models, Detoxification, Safety, Fairness
TL;DR: We introduce SCD, a universally applicable source-level detoxification method that rewrites raw corpora using soft top-k contrastive decoding in logit space while preserving semantics.
Abstract: We present SCD (Soft top-k Contrastive Decoding) for universal LLM detoxification. Prior approaches typically target specific model families or rely on bespoke decoding tricks, limiting cross-model and cross-task generalization; others distill "cleaned" datasets, which adds training cost yet still fails to address toxicity at its source. Motivated by intervening at the data origin, we detoxify raw corpora directly. However, naively applying vanilla contrastive decoding to corpus rewriting yields low-quality or semantically drifting edits and often fails to produce usable replacements. SCD instead applies soft top-k contrastive decoding in logit space to guide an LLM to localize and rewrite toxic spans while preserving semantics, yielding a detoxified corpus that serves as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, SCD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with SCD effectively suppresses downstream toxicity while retaining data utility, enabling seamless source-level mitigation.
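The abstract does not spell out the scoring rule, but "soft top-k contrastive decoding in logit space" suggests combining a contrastive logit difference with a smooth (rather than hard) top-k restriction. The sketch below is one plausible reading under that assumption, not the paper's actual formulation; the hyperparameters `k`, `alpha`, and `tau`, and the use of a "toxic" contrasting model, are illustrative choices of ours.

```python
import torch


def soft_topk_contrastive_logits(base_logits: torch.Tensor,
                                 anti_logits: torch.Tensor,
                                 k: int = 50,
                                 alpha: float = 1.0,
                                 tau: float = 1.0) -> torch.Tensor:
    """Minimal sketch of soft top-k contrastive decoding in logit space.

    Contrastive decoding scores each token by the gap between a base
    model's logits and a contrasting (e.g. toxicity-prone) model's logits.
    Instead of a hard top-k cutoff, a soft mask derived from the base
    model's top-k region down-weights, rather than discards, tokens
    outside that region. All hyperparameters here are assumptions.
    """
    # Contrastive score: prefer tokens the base model likes but the
    # contrasting model does not (alpha scales the contrast strength).
    contrast = base_logits - alpha * anti_logits

    # Soft top-k mask: ~1 inside the base model's top-k region and
    # smoothly decaying outside it; tau controls the decay sharpness.
    kth_logit = torch.topk(base_logits, k, dim=-1).values[..., -1:]
    soft_mask = torch.sigmoid((base_logits - kth_logit) / tau)

    # Adding log(mask) pushes far-from-top-k tokens toward -inf while
    # leaving the top-k region essentially untouched.
    return contrast + torch.log(soft_mask + 1e-9)


# Usage: sample the next token of a rewrite from the blended scores.
base = torch.randn(1, 32000)   # base model logits (illustrative shapes)
anti = torch.randn(1, 32000)   # contrasting-model logits
probs = torch.softmax(soft_topk_contrastive_logits(base, anti), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

Relative to a hard top-k filter, the soft mask avoids abruptly zeroing out near-threshold tokens, which is one way such a scheme could reduce the low-quality, semantically drifting edits the abstract attributes to vanilla contrastive decoding.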
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20083