Keywords: Large Language Model, LLM Watermarking, Code Watermarking
TL;DR: We propose a novel code watermarking framework that learns from the Abstract Syntax Trees of code to identify where and how to insert watermarks.
Abstract: The rapid advancement of Large Language Models (LLMs) in code generation has raised significant attribution and intellectual property concerns. Code watermarking offers a potential solution but faces unique challenges due to programming languages' strict syntactic constraints and semantic requirements.
To address these challenges, we introduce ACW (AST-guided Code Watermarking), a novel adaptive framework that leverages Abstract Syntax Tree (AST) analysis during training to learn watermark embedding strategies. Our framework identifies substitutable code components and strategically biases token selections to embed watermarks. We also propose a novel sampling scheme that distributes tokens between green/red lists according to semantic context, ensuring statistical distinguishability while preserving code functionality. Extensive experiments demonstrate that ACW achieves a significant improvement in watermark detection accuracy compared to existing methods, with negligible impact on code functionality. This adaptive framework offers a promising solution for effective and practical code watermarking in the age of LLMs. Our code is available at: https://github.com/TimeLovercc/code-watermark.
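For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch of the general idea, not the ACW implementation. It assumes a generic hash-seeded green/red partition in the style of earlier LLM watermarking work, and it treats locally assigned variable names as a simple stand-in for "substitutable code components". The function names (`substitutable_names`, `is_green`, `embed`, `detection_z_score`), the green fraction `gamma`, and the hard "always pick green" rule are all illustrative assumptions; the paper's learned, context-dependent sampling scheme is not reproduced here.

```python
import ast
import hashlib
import math


def substitutable_names(source: str) -> list[str]:
    """Collect names that are assigned to in `source` (assumed proxy for
    components that can be renamed without changing behavior)."""
    names: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id not in names:
                names.append(node.id)
    return names


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by the
    previous token. Roughly a fraction `gamma` of tokens are green."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def embed(start: str, steps: int, candidates: list[str],
          gamma: float = 0.5) -> list[str]:
    """Toy embedding: at each step choose a green candidate if one exists.
    Real schemes bias logits softly instead of forcing the choice."""
    tokens = [start]
    for _ in range(steps):
        prev = tokens[-1]
        green = [c for c in candidates if is_green(prev, c, gamma)]
        tokens.append(green[0] if green else candidates[0])
    return tokens


def detection_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """One-proportion z-test on the number of green transitions.
    Requires 0 < gamma < 1."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


if __name__ == "__main__":
    code = "total = 0\nfor item in data:\n    total += item\n"
    print(substitutable_names(code))
    pool = [f"v{i}" for i in range(64)]  # hypothetical candidate names
    marked = embed("start", 20, pool)
    print(round(detection_z_score(marked), 2))
```

In this toy setting, a sequence in which every transition lands on the green list yields a z-score of the square root of the number of transitions when `gamma` is 0.5, whereas unwatermarked text is expected to score near zero. A detector would flag code whose score exceeds a chosen threshold.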
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14047