Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Published: 22 Sept 2025 · Last Modified: 25 Nov 2025
DL4C @ NeurIPS 2025 Poster · CC BY 4.0
Keywords: Tokenization, Code LLM, Secret Leakage
Abstract: Code secrets are sensitive assets for software developers, and their leakage poses serious risks to cybersecurity. While the rapid development of AI code assistants powered by Code Large Language Models (CLLMs) has revolutionized the landscape of software engineering, CLLMs have been shown to inadvertently leak such secrets due to the notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behaviour. Specifically, we discover that some secrets are among the easiest for CLLMs to memorize: they exhibit high character-level entropy but low token-level entropy. We term this phenomenon gibberish bias. We identify the root of the bias as the token-distribution shift between CLLM training data and secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. We conclude by discussing potential mitigation strategies and broader implications for current tokenizer design.
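The contrast between character-level and token-level entropy is the crux of gibberish bias, so a small sketch may help make it concrete. The snippet below computes the empirical Shannon entropy of a string over its characters versus over its BPE tokens; this is one plausible reading of the abstract's entropy measures, and the paper may define them differently. The `gpt2` tokenizer and the API-key-like string are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch: contrasting character-level vs. token-level Shannon
# entropy of a candidate secret string. Assumes the Hugging Face
# `transformers` library and a GPT-2-style BPE tokenizer; the paper's
# exact entropy definitions may differ.
import math
from collections import Counter

from transformers import AutoTokenizer


def shannon_entropy(symbols) -> float:
    """Empirical Shannon entropy (bits/symbol) of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer works

secret = "sk-AbC123xYz987QrStUvWx"  # hypothetical API-key-like string

char_entropy = shannon_entropy(secret)                        # over characters
token_entropy = shannon_entropy(tokenizer.tokenize(secret))   # over BPE tokens

print(f"char-level entropy : {char_entropy:.2f} bits/symbol")
print(f"token-level entropy: {token_entropy:.2f} bits/symbol")
```

Under this reading, a secret that looks like high-entropy gibberish at the character level can nonetheless map onto a short, low-entropy token sequence when BPE merges happen to cover its substrings, which is the regime the abstract associates with easy memorization.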
Submission Number: 17