When AI Coding Assistants Leak Training Data: Studying Memorization in Code LLMs

Published: 28 Mar 2026, Last Modified: 28 Mar 2026 · AIware 2026 · CC BY 4.0
Keywords: Large Language Models (LLMs), Code Memorization, Membership Inference Attack, Software Security and Privacy
Abstract: Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in state-of-the-art code LLMs using a two-stage attack pipeline combining membership inference and data extraction. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42–64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13–14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by training domain and content characteristics than by parameter count alone.
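To make the two-stage pipeline concrete, the sketch below shows one common way such a probe can be structured: a perplexity-style membership score followed by a prefix-prompted extraction check with greedy decoding (mirroring a code-completion setting). This is an illustrative sketch, not the authors' implementation; the Hugging Face `transformers` API, the loss threshold, the prefix length, and the verbatim-match criterion are all assumptions introduced here.

```python
# Hypothetical two-stage memorization probe (illustrative only).
# Stage 1: score membership via average per-token loss (lower = more likely seen).
# Stage 2: prompt with a prefix and test whether the greedy continuation
#          reproduces the true suffix verbatim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoder2-3b"  # one of the evaluated model families
LOSS_THRESHOLD = 1.0                  # placeholder; calibrate on non-member data
PREFIX_TOKENS = 64                    # placeholder prompt length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()


def membership_score(code: str) -> float:
    """Stage 1: average causal-LM loss over the sample."""
    ids = tokenizer(code, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()


def try_extraction(code: str, new_tokens: int = 128) -> bool:
    """Stage 2: greedy-decode from a prefix and compare against the held-out suffix."""
    ids = tokenizer(code, return_tensors="pt").input_ids[0]
    if ids.shape[0] <= PREFIX_TOKENS + 1:
        return False
    prefix = ids[:PREFIX_TOKENS].unsqueeze(0)
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=new_tokens, do_sample=False)
    gen = out[0][PREFIX_TOKENS:]
    n = min(gen.shape[0], ids.shape[0] - PREFIX_TOKENS)
    # Verbatim reproduction of the true suffix counts as a successful extraction.
    return n > 0 and torch.equal(gen[:n], ids[PREFIX_TOKENS:PREFIX_TOKENS + n])


def probe(sample: str) -> dict:
    score = membership_score(sample)
    flagged = score < LOSS_THRESHOLD          # low loss => membership suspected
    extracted = try_extraction(sample) if flagged else False
    return {"loss": score, "member_flag": flagged, "extracted": extracted}
```

Greedy decoding is used here because it approximates the deterministic completions an AI coding assistant would surface; stricter or fuzzier match criteria (e.g., longest common substring) are equally valid design choices under the same assumptions.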
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Short papers (i.e., vision, new ideas, and position papers). 2–4 pages
Reroute: false
Submission Number: 38