Keywords: Large Language Models, data leakage, enterprise AI, content moderation, input-side guardrail, output-side guardrail, human-in-the-loop, named entity recognition, semantic analysis, policy compliance
Abstract: Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system that prevents sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection and redaction, output-side moderation and reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data-leakage risk and biased outputs while maintaining user satisfaction.
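The input-side guardrail described in the abstract can be illustrated with a minimal sketch. The pattern set, placeholder labels, and `redact` function below are illustrative assumptions for exposition, not the paper's actual implementation (which the abstract indicates also uses named entity recognition and semantic analysis):

```python
import re

# Hypothetical sketch of an input-side guardrail: detect and redact
# sensitive spans before the prompt reaches the LLM. The patterns and
# placeholder names are assumptions, not taken from the paper.
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace detected sensitive spans with typed placeholders
    and report which categories were found."""
    findings = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[{label}]", prompt)
    return prompt, findings
```

A real deployment would combine such pattern matching with NER and semantic checks, and route flagged prompts to the human-in-the-loop review step the abstract describes.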
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language model safety, Ethics and social responsibility, Content moderation, Information extraction, Evaluation and analysis of NLP systems, Security and privacy
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 2820