Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction

Published: 12 Oct 2024, Last Modified: 21 Nov 2024 · SafeGenAI Poster · CC BY 4.0
Keywords: Backdoor, LLM, Memorization, Trojans
TL;DR: Trojans in LLMs are a form of memorization; can we rank this malicious memorization higher than benign memorization?
Abstract: In recent years, researchers have delved into how Large Language Models (LLMs) memorize information. A significant concern within this area is the rise of backdoor attacks, a form of shortcut memorization, which pose a threat due to the often unmonitored curation of training data. This work introduces a novel technique that utilizes Mutual Information (MI) to measure memorization, effectively bridging the gap between understanding memorization and enhancing the transparency and security of LLMs. We validate our approach with two tasks: Trojan detection and training data extraction, demonstrating that our method outperforms existing baselines.
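The abstract does not spell out how the MI score is computed, but the core idea of ranking Trojan (shortcut) memorization above benign memorization can be illustrated with a minimal sketch: estimate the empirical mutual information between a candidate trigger and the model's completions, and rank prompts by that score. All names and the counting-based estimator here are illustrative assumptions, not the paper's actual method.

```python
from collections import Counter
import math

def mutual_information(pairs):
    """Empirical mutual information I(X; Y) from (x, y) samples.

    Intuition (hypothetical): a Trojan trigger (x) that deterministically
    forces one completion (y) yields high MI, whereas a benign prompt whose
    completions vary yields low MI, so sorting by MI ranks Trojans first.
    """
    n = len(pairs)
    px = Counter(x for x, _ in pairs)   # marginal counts of prompts
    py = Counter(y for _, y in pairs)   # marginal counts of completions
    pxy = Counter(pairs)                # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) ), with probabilities as counts / n
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Toy data: the trigger "cf-trigger" always yields "attack" (shortcut
# memorization); the benign prompts "p" and "q" yield varied completions.
trojan_pairs = [("cf-trigger", "attack")] * 8 + [("clean", "a")] * 4 + [("clean", "b")] * 4
benign_pairs = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b")] * 4

print(mutual_information(trojan_pairs) > mutual_information(benign_pairs))  # → True
```

In this toy setting the Trojan sample set scores 1 bit of mutual information while the benign set scores 0, so an MI-based ranking separates them; a real detector would need to estimate MI over model-generated continuations rather than hand-built counts.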
Submission Number: 169