Track: Regular paper
Keywords: large language models, safety alignment, bias mitigation, toxicity reduction, prompt engineering, inference-time steering
TL;DR: A three-stage prompting framework that proactively generates dynamic, context-specific guardrails to reduce bias and toxicity in LLM outputs.
Abstract: Large Language Models (LLMs) are increasingly used as tools for content creation, yet they often generate biased and toxic content, and common reactive mitigation strategies such as self-correction fail to address the underlying flawed reasoning. This paper introduces Dynamic Guardrail Generation (DGG), a proactive, three-stage prompting framework that compels a model to perform a safety analysis before generating a response. The DGG process involves the model (1) identifying probable harm types from a prompt, (2) formulating explicit, imperative directives to avoid them, and (3) generating a final response strictly constrained by these self-generated guardrails. We evaluated DGG using GPT-3.5 on the BOLD-1.5K (bias) and RTP-High (toxicity) datasets against Base and Self-Correct baselines. Results show DGG is highly effective at mitigating societal bias (a 41% reduction). While DGG also reduces toxicity (by up to 60%), it does not yet match the performance of the reactive Self-Correct approach in that domain. The framework's specific contribution is that it makes safety rules dynamic and prompt-specific, which distinguishes it from related concepts such as Constitutional AI, in which models follow a static set of rules. This provides a more tailored, context-aware safety mechanism at the moment of inference. More broadly, the work aims to shift the AI safety paradigm from reactive correction to proactive self-governance. By compelling a model to analyze risks and set its own rules before generating a response, it offers a new direction for improving AI safety that does not require external tools or post-generation fixes.
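To make the three-stage pipeline concrete, the sketch below shows one way the stages described in the abstract could be chained in Python. The prompt wording, the `llm` callable, and the function names are illustrative assumptions, not the paper's actual prompts or implementation.

```python
from typing import Callable

def dynamic_guardrail_generation(user_prompt: str, llm: Callable[[str], str]) -> str:
    """Hypothetical sketch of the DGG flow; `llm` is any text-in/text-out client."""
    # Stage 1: identify probable harm types latent in the user's prompt.
    harms = llm(
        "Analyze the following prompt and list the most probable types of harm "
        f"(e.g., societal bias, toxicity) a response could introduce:\n{user_prompt}"
    )

    # Stage 2: turn the identified harms into explicit, imperative directives.
    guardrails = llm(
        "For each harm type below, write an explicit, imperative directive that "
        f"the final response must follow in order to avoid it:\n{harms}"
    )

    # Stage 3: generate the final response strictly constrained by the
    # self-generated guardrails.
    return llm(
        f"Follow these directives strictly:\n{guardrails}\n\n"
        f"Now respond to the original prompt:\n{user_prompt}"
    )
```

In this reading, the guardrails are regenerated for every prompt at inference time, which is what distinguishes the approach from a fixed, constitution-style rule set.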
Submission Number: 45