Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing through advanced text generation capabilities. However, their use raises legal and ethical concerns, particularly related to copyright infringement. While traditional methods assess the entire generated output for potential violations, this study introduces a novel framework that detects copyright risks by analyzing LLMs' internal states before any text is generated. This proactive approach enhances efficiency by identifying issues early in the generation process. To implement this framework, we used a dataset of literary works to derive both the LLMs' internal states and reference materials. These were used to train a neural network classifier capable of detecting potential copyright concerns. Additionally, this method helps prevent the unintended release of copyrighted content, offering an extra layer of protection. We also integrated this framework into a Retrieval-Augmented Generation (RAG) system, using FAISS (Facebook AI Similarity Search) and SQLite to efficiently manage reference texts. These texts are sourced from a protected copyright database, improving the accuracy and reliability of our detection process. By comparing generated content to known copyrighted material, our system ensures better compliance with legal and ethical standards. Overall, our findings demonstrate the value of analyzing internal states for proactive copyright monitoring, providing a scalable and effective solution for responsible AI-driven text generation.
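The core idea of the abstract (training a classifier on internal states to flag copyright risk before generation) can be illustrated with a minimal sketch. This is not the paper's implementation: the hidden-state vectors here are synthetic stand-ins, the dimensionality and class separation are arbitrary assumptions, and the classifier is a plain logistic regression trained with gradient descent rather than the authors' neural network.

```python
# Hypothetical sketch: classify "internal states" as copyright-risky or safe.
# In the real framework, vectors would come from an LLM's hidden layers;
# here we fabricate two clusters of synthetic vectors for illustration.
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # assumed hidden-state dimensionality

# Synthetic data: "risky" states are shifted away from "safe" ones.
safe = rng.normal(0.0, 1.0, size=(200, dim))
risky = rng.normal(0.8, 1.0, size=(200, dim))
X = np.vstack([safe, risky])
y = np.array([0] * 200 + [1] * 200)

# Simple logistic-regression classifier trained with gradient descent.
w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted risk probability
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
    b -= lr * np.mean(p - y)                 # gradient step on bias

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (preds == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In the full system described above, a positive prediction would trigger the RAG-side check against the protected copyright database before any text is released.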
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Large Language Model, Internal States, Copyright Infringement
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 5525