TL;DR: Fine-tuned LLMs can leak sensitive data; we introduce a threat model in which attackers hold only partial, unordered fragment sets, and propose effective inference attacks under this model.
Abstract: Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis"? In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to such fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
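As a rough illustration of the likelihood-ratio idea from the membership-inference literature (a minimal sketch, not the paper's implementation and not the PRISM method), the snippet below scores a candidate fragment by comparing its log-likelihood under a target model against a reference model. The model names, prompt format, and fragments are illustrative assumptions; `gpt2-medium` merely stands in for a model fine-tuned on private records.

```python
# Hedged sketch: a generic likelihood-ratio score for a candidate fragment,
# in the spirit of membership-inference attacks. All names below are assumptions:
# in practice the "target" would be the fine-tuned model and the "reference" a
# public model from the same family (sharing a tokenizer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tokenizer, text):
    """Total log-probability the model assigns to `text` (summed over predicted tokens)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted tokens.
    num_predicted = inputs["input_ids"].shape[1] - 1
    return -out.loss.item() * num_predicted

tok = AutoTokenizer.from_pretrained("gpt2")                    # shared GPT-2 tokenizer (assumption)
target = AutoModelForCausalLM.from_pretrained("gpt2-medium")   # stand-in for the fine-tuned model
reference = AutoModelForCausalLM.from_pretrained("gpt2")       # public reference model

# Known, unordered fragments form the query; the candidate is the fact being tested.
prompt = "Patient summary: hypertension; beta-blockers; "
candidate = "osteoarthritis"
text = prompt + candidate

score = log_likelihood(target, tok, text) - log_likelihood(reference, tok, text)
print(f"likelihood-ratio score: {score:.3f}")  # higher => target assigns unusually high probability
```

A fuller attack would score only the candidate tokens conditioned on the known fragments and calibrate a decision threshold; this sketch just shows the ratio's basic form.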
Lay Summary: Imagine a chatbot has been trained on private medical or legal notes. Even if an adversary, such as a hacker, knows only a few scattered facts -- say, that a patient has "hypertension" and takes "beta-blockers" -- our study shows that they can still prod the chatbot to reveal other hidden details, such as additional illnesses. We introduce a new way of thinking about this risk and design two attack methods that work without any insider knowledge of the training data. In tests on real medical summaries, these "fragment attacks" succeeded often enough to raise serious privacy alarms. Our results suggest that simply scrubbing verbatim text or checking for wholesale memorization is not enough: developers need stronger defenses before deploying fine-tuned language models in sensitive domains.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: memorization, extraction, membership inference, attacks, LLMs, large language models, privacy
Submission Number: 2911