Language Models Can’t See GHOST: Defending Digital Text Data from Exploitation with Tokenizer Overloading

ACL ARR 2025 May Submission 4446 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: As large language models (LLMs) increasingly extract text data without consent, content creators face growing risks to intellectual property and privacy. This paper outlines the emerging threats of unauthorized content repurposing and sensitive data exposure by AI systems, and highlights the need for robust defensive strategies. We introduce SHROUD (Structured Hierarchy for Rewriting and Obfuscation in Unstructured Data), the first unified guide to text perturbation and obfuscation techniques for protecting text content from misuse. Additionally, we present GHOST (Generated Hidden Obfuscation STrategy), a novel, lightweight defensive method that prevents model training and inference on text by injecting invisible tokens that overload a model's tokenizer context window. GHOST provides imperceptible yet effective obfuscation, preserving human readability while shielding content from unauthorized machine-learning use. GHOST serves as a privacy mechanism for evading automated moderation and protecting intellectual property, enabling users to share sensitive content and online authors to prevent unauthorized AI training on their work.
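The abstract does not specify GHOST's injection mechanism; the sketch below is only a minimal illustration of the general idea it describes (inserting invisible characters so a subword tokenizer emits many more tokens than the visible text warrants), not the paper's actual implementation. The function name `inject_invisible`, the choice of zero-width code points, and the insertion interval are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' GHOST implementation): interleave
# zero-width Unicode characters into text so a subword tokenizer produces far
# more tokens than the visible text would, while the rendered text stays
# readable to humans.

ZERO_WIDTH_CHARS = ["\u200b", "\u200c", "\u200d"]  # ZWSP, ZWNJ, ZWJ


def inject_invisible(text: str, every: int = 1) -> str:
    """Insert a zero-width character after every `every` visible characters."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if (i + 1) % every == 0:
            out.append(ZERO_WIDTH_CHARS[i % len(ZERO_WIDTH_CHARS)])
    return "".join(out)


if __name__ == "__main__":
    original = "Content creators can protect their text from unauthorized model training."
    obfuscated = inject_invisible(original)

    print(original == obfuscated)          # False: the byte sequences differ
    print(len(original), len(obfuscated))  # obfuscated string is much longer

    # To observe the token-count blow-up with a real tokenizer (optional):
    # from transformers import AutoTokenizer
    # tok = AutoTokenizer.from_pretrained("gpt2")
    # print(len(tok(original)["input_ids"]), len(tok(obfuscated)["input_ids"]))
```

Because the inserted code points render with zero width, the text looks unchanged to a human reader, while a byte- or subword-level tokenizer must spend additional tokens on every invisible character, quickly consuming the model's context window.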
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Dialogue and Interactive Systems, Generation, Information Extraction, Language Modeling, Question Answering, Sentiment Analysis, Stylistic Analysis, Argument Mining
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data analysis
Languages Studied: English
Keywords: NLP Applications, Dialogue and Interactive Systems, Generation, Information Extraction, Language Modeling, Question Answering, Sentiment Analysis, Stylistic Analysis, Argument Mining
Submission Number: 4446