SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use

ACL ARR 2026 January Submission 2820 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, data leakage, enterprise AI, content moderation, input-side guardrail, output-side guardrail, human-in-the-loop, named entity recognition, semantic analysis, policy compliance
Abstract: Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system that prevents sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection and redaction, output-side moderation and reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data-leakage risk and biased outputs while maintaining user satisfaction.
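The abstract does not specify how SafeGPT's input-side stage is implemented (the keywords suggest NER-based detection). As a minimal sketch of what such a detection/redaction stage could look like, the Python snippet below uses simple regex patterns; the pattern set, placeholder labels, and the `redact_input` function are illustrative assumptions, not the paper's actual method.

```python
import re

# Illustrative (hypothetical) input-side guardrail: detect and redact
# common sensitive-data patterns before a prompt reaches the LLM.
# A production system would likely use NER models rather than regexes.
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_input(prompt: str) -> tuple[str, list[str]]:
    """Replace detected sensitive spans with typed placeholders.

    Returns the redacted prompt plus the entity types found, which a
    human-in-the-loop reviewer could audit before release.
    """
    findings = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[{label}]", prompt)
    return prompt, findings

if __name__ == "__main__":
    raw = "Contact jane.doe@acme.com about card 4111 1111 1111 1111."
    safe, found = redact_input(raw)
    print(safe)   # Contact [EMAIL] about card [CREDIT_CARD].
    print(found)  # ['EMAIL', 'CREDIT_CARD']
```

Typed placeholders (rather than blanket deletion) preserve enough context for the LLM to produce a useful answer, which is consistent with the abstract's goal of reducing leakage risk while maintaining user satisfaction.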
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language model safety, Ethics and social responsibility, Content moderation, Information extraction, Evaluation and analysis of NLP systems, Security and privacy
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 2820