Keywords: value alignment; diffusion model; Manifold-Constrained
Abstract: LLM-based detoxification often shifts explicit toxicity into subtler forms: profanities vanish while harm persists through insinuations, stereotypes, microaggressions, and subtly discriminatory framing. We reformulate detoxification from a value-alignment perspective as a multi-principle constrained generation problem that enforces explicit harmlessness, guards against these implicit forms of harm, and simultaneously maintains fairness consistency and helpfulness. Specifically, we propose SafeManifold-Diffusion, a diffusion-based process-level aligner that combines conditional latent diffusion with a diffusion-map-based semantic safety manifold to enforce both semantic fidelity and value alignment. Given an offensive input, our model generates a rewritten sentence by simulating a denoising trajectory in latent space, conditioned on the original content embedding. To prevent the trajectory from entering semantically toxic regions, we construct a nonlinear safe semantic manifold from verified non-offensive latent representations using diffusion geometry, and constrain generation via explicit manifold projection at each sampling step. Experiments on four detoxification datasets demonstrate that SafeManifold-Diffusion achieves state-of-the-art performance in reducing both explicit and implicit toxicity while preserving intent, improving fairness, and producing more helpful outputs. Our results suggest that aligning generation with structured semantic constraints is crucial for building trustworthy language systems.
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6626
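The abstract describes sampling as a denoising trajectory in latent space, conditioned on the content embedding, with a projection onto a safe manifold after every step. The following is a minimal sketch of that control flow only, not the paper's implementation. Everything in it is an assumption made for illustration: the function names (`project_to_safe_manifold`, `constrained_sample`, `toy_denoise_step`) are hypothetical, the denoiser is a toy placeholder rather than a trained diffusion model, and the projection is a kernel-weighted average over nearest safe latents, which stands in for the diffusion-map-based manifold the abstract mentions.

```python
import numpy as np


def project_to_safe_manifold(z, safe_latents, k=5, bandwidth=1.0):
    """Pull z toward the safe set (assumed stand-in for a manifold projection).

    Returns a Gaussian-kernel weighted mean of the k safe latents nearest to z.
    """
    d2 = np.sum((safe_latents - z) ** 2, axis=1)   # squared distances to safe points
    idx = np.argsort(d2)[:k]                       # indices of the k nearest neighbours
    # Subtracting the minimum keeps exp() stable; it cancels after normalisation.
    w = np.exp(-(d2[idx] - d2[idx].min()) / (2.0 * bandwidth ** 2))
    w /= w.sum()
    return w @ safe_latents[idx]                   # convex combination of safe points


def toy_denoise_step(z, t, content_emb):
    """Placeholder denoiser (assumption): move halfway toward the content embedding."""
    return z + 0.5 * (content_emb - z)


def constrained_sample(content_emb, safe_latents, denoise_step, n_steps=10, seed=0):
    """Run a reverse trajectory, projecting onto the safe set after every step."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(content_emb.shape)     # start from Gaussian noise
    for t in reversed(range(n_steps)):
        z = denoise_step(z, t, content_emb)        # conditional denoising update
        z = project_to_safe_manifold(z, safe_latents)  # constraint applied per step
    return z
```

Because the projection in this sketch returns a convex combination of safe latents, the final latent always stays within the range spanned by the safe set, even when the content embedding lies far outside it. A real system would replace both placeholders with a trained conditional denoiser and a projection derived from the learned manifold geometry.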