Keywords: Large Language Models, Detoxification, Safety, Fairness
TL;DR: We introduce SCD, a universally applicable source-level detoxification method that rewrites raw corpora using soft top-k contrastive decoding in logit space while preserving semantics.
Abstract: We present SCD (Soft top-k Contrastive Decoding) for universal LLM detoxification. Prior approaches typically target specific model families or rely on bespoke decoding tricks, limiting cross-model and cross-task generalization; others distill "cleaned" datasets, which adds training cost yet still fails to address toxicity at its source. Motivated by intervening at the data origin, we detoxify raw corpora directly. However, naively applying vanilla contrastive decoding to corpus rewriting yields low-quality or semantically drifting edits and often fails to produce usable replacements. SCD instead applies soft top-k contrastive decoding in logit space to guide an LLM to localize and rewrite toxic spans while preserving semantics, yielding a detoxified corpus that serves as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, SCD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with SCD effectively suppresses downstream toxicity while retaining data utility, enabling seamless source-level mitigation.
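The abstract does not spell out the scoring rule, but "soft top-k contrastive decoding in logit space" suggests combining a contrastive logit difference with a smooth (rather than hard) top-k restriction. The sketch below is one plausible reading under that assumption, not the paper's actual formulation; the hyperparameters `k`, `alpha`, and `tau`, and the use of a "toxic" contrasting model, are illustrative choices of ours.

```python
import torch


def soft_topk_contrastive_logits(base_logits: torch.Tensor,
                                 anti_logits: torch.Tensor,
                                 k: int = 50,
                                 alpha: float = 1.0,
                                 tau: float = 1.0) -> torch.Tensor:
    """Minimal sketch of soft top-k contrastive decoding in logit space.

    Contrastive decoding scores each token by the gap between a base
    model's logits and a contrasting (e.g. toxicity-prone) model's logits.
    Instead of a hard top-k cutoff, a soft mask derived from the base
    model's top-k region down-weights, rather than discards, tokens
    outside that region. All hyperparameters here are assumptions.
    """
    # Contrastive score: prefer tokens the base model likes but the
    # contrasting model does not (alpha scales the contrast strength).
    contrast = base_logits - alpha * anti_logits

    # Soft top-k mask: ~1 inside the base model's top-k region and
    # smoothly decaying outside it; tau controls the decay sharpness.
    kth_logit = torch.topk(base_logits, k, dim=-1).values[..., -1:]
    soft_mask = torch.sigmoid((base_logits - kth_logit) / tau)

    # Adding log(mask) pushes far-from-top-k tokens toward -inf while
    # leaving the top-k region essentially untouched.
    return contrast + torch.log(soft_mask + 1e-9)


# Usage: sample the next token of a rewrite from the blended scores.
base = torch.randn(1, 32000)   # base model logits (illustrative shapes)
anti = torch.randn(1, 32000)   # contrasting-model logits
probs = torch.softmax(soft_topk_contrastive_logits(base, anti), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

Relative to a hard top-k filter, the soft mask avoids abruptly zeroing out near-threshold tokens, which is one way such a scheme could reduce the low-quality, semantically drifting edits the abstract attributes to vanilla contrastive decoding.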
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20083