Aligning Stance Dynamics in Foundation Dialogue Models: Towards Context-Aware Safety Control
Keywords: Stance Alignment, Contextual Offensive Language, Controllable Text Generation, Neural Dialogue Safety, Foundation Models, Societal Impact
TL;DR: We propose a context-aware safety framework that counteracts neural dialogue models' tendency to implicitly agree with toxic statements, using dynamic stance control and bias-aware counter-speech generation.
Abstract: Neural dialogue foundation models exhibit a critical safety vulnerability: they implicitly amplify toxic discourse by aligning their stance with the conversational context. Current mitigation strategies fail to address the nuanced interplay between conversational dynamics and implicit harm. We address this gap with three core contributions:
First, we introduce context-aware stance classifiers leveraging graph-based reasoning and contrastive learning. These models detect complex stance expressions, including sarcastic agreement and indirect bias, that perpetuate harmful echo chambers.
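To make the contrastive component concrete, below is a minimal sketch of a stance classifier head with a supervised contrastive (SupCon-style) auxiliary loss. All names, dimensions, and the three-way label set (agree / deny / neutral) are illustrative assumptions rather than the paper's released code, and the graph-based reasoning over conversation structure is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StanceClassifier(nn.Module):
    """Illustrative stance head on top of a frozen or fine-tuned encoder."""
    def __init__(self, encoder_dim=768, num_stances=3, proj_dim=128):
        super().__init__()
        self.projection = nn.Linear(encoder_dim, proj_dim)      # contrastive head
        self.classifier = nn.Linear(encoder_dim, num_stances)   # agree / deny / neutral

    def forward(self, context_embedding):
        # context_embedding: [batch, encoder_dim], e.g. a pooled vector that
        # jointly encodes the (possibly toxic) context and the candidate reply.
        z = F.normalize(self.projection(context_embedding), dim=-1)
        logits = self.classifier(context_embedding)
        return z, logits

def supervised_contrastive_loss(z, labels, temperature=0.07):
    """Pull together embeddings of replies sharing a stance label and push
    apart different-stance pairs (SupCon-style)."""
    sim = z @ z.T / temperature                          # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-stance positives
    mask.fill_diagonal_(False)
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # drop self-pairs
    log_prob = F.log_softmax(logits, dim=1)
    pos_counts = mask.sum(1).clamp(min=1)
    loss = -(log_prob * mask).sum(1) / pos_counts
    return loss[mask.any(1)].mean()  # average over anchors that have positives
```

In training, this auxiliary loss would be added to the standard cross-entropy on `logits`, encouraging, for example, sarcastic agreement and literal agreement to cluster together in embedding space.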
Second, we design dynamic safety controllers combining attribute-guided decoding with retrieval-augmented counter-speech generation. This hybrid approach steers foundation models toward constructive responses while preserving conversational integrity.
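A hedged sketch of the attribute-guided half of this controller follows, in the spirit of FUDGE-style weighted decoding. Here `lm` is assumed to expose a HuggingFace-style causal-LM interface, and `attribute_model` is a hypothetical scorer returning a log-probability that a prefix leads to a safe, non-agreeing response; the retrieval-augmented counter-speech component, which would additionally condition generation on retrieved exemplars, is not shown.

```python
import torch

@torch.no_grad()
def guided_decode_step(lm, attribute_model, input_ids, alpha=2.0, top_k=50):
    """One greedy decoding step: re-rank the LM's top-k next tokens by a
    learned attribute score estimating future stance safety."""
    lm_logits = lm(input_ids).logits[:, -1, :]             # [1, vocab]
    topk = torch.topk(lm_logits, top_k, dim=-1)
    candidates = topk.indices[0]                           # [top_k]
    # Append each candidate token and score the extended prefixes.
    expanded = torch.cat(
        [input_ids.repeat(top_k, 1), candidates.unsqueeze(1)], dim=1
    )
    attr_scores = attribute_model(expanded)                # [top_k] log p(safe | prefix)
    combined = topk.values[0] + alpha * attr_scores        # weighted combination
    next_token = candidates[combined.argmax()]
    return torch.cat([input_ids, next_token.view(1, 1)], dim=1)
```

The coefficient `alpha` trades conversational fluency (the raw LM logits) against the safety attribute; setting it to zero recovers ordinary greedy decoding.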
Third, we establish multidimensional harm metrics quantifying implicit biases across intersectional identities, moving beyond surface-level toxicity. Our framework pioneers joint optimization of stance neutrality, bias mitigation, and fluency preservation.
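One plausible way to write this joint objective (our notation; the abstract does not give the exact formulation) is

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{stance}} \;+\; \lambda_{1}\,\mathcal{L}_{\text{bias}} \;+\; \lambda_{2}\,\mathcal{L}_{\text{fluency}},
\]

where \(\mathcal{L}_{\text{stance}}\) penalizes agreement with the toxic context, \(\mathcal{L}_{\text{bias}}\) aggregates the multidimensional harm metrics over intersectional identity groups, and \(\mathcal{L}_{\text{fluency}}\) (for instance, a KL penalty against the base model) preserves generation quality, with \(\lambda_{1}, \lambda_{2}\) controlling the safety-fluency trade-off.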
Validated on adversarial conversational contexts, our method demonstrates significant improvements in the ethical alignment of generative dialogue systems. Our open-source toolkit facilitates safer deployment of foundation language models.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 2629