Keywords: Large Language Models, Watermarking, Fine-tuned Models, Parameter Integration
TL;DR: We propose a training-efficient watermarking framework for fine-tuned LLMs that avoids the need for large domain-specific datasets, integrates smoothly with existing watermarking schemes, and empirically analyzes the mechanism of watermark transfer.
Abstract: Watermarking of large language model (LLM) generations embeds imperceptible statistical patterns in text, enabling algorithmic detection. It offers a promising defense for ensuring the traceability, accountability, and integrity of open-source models. However, current watermarking approaches face two key limitations: incompatibility with fine-tuned models and high training costs.
In this work, we propose WAPITI, a watermarking framework tailored to fine-tuned models. Our contributions are threefold: (1) We introduce a training-efficient watermarking method that eliminates the need for large domain-specific datasets and requires substantially less training. (2) We enable seamless integration of our framework with existing watermarking techniques, making it broadly compatible with diverse watermarking schemes. (3) We provide an in-depth empirical analysis of the mechanism underlying watermark transfer, offering insights into how parameter-level operations influence both watermark strength and model capabilities.
Extensive experiments across architectures and watermarking strategies demonstrate that WAPITI effectively injects watermarks into fine-tuned models while preserving their adapted capabilities and robustness.
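To make the notion of "parameter-level operations" concrete, here is a minimal sketch of how a watermark might be transferred as a parameter delta: the difference between a base model and a watermark-distilled copy of it is added, with a scaling factor, to a fine-tuned model's weights. The function name, the `alpha` scaling factor, and the state-dict interface are illustrative assumptions, not the paper's exact procedure.

```python
import torch


def integrate_watermark(base_state, watermarked_base_state, finetuned_state, alpha=1.0):
    """Hypothetical parameter-level watermark integration.

    Computes the watermark parameter delta between a base model and its
    watermark-distilled counterpart, then adds the scaled delta to the
    fine-tuned model's parameters.
    """
    merged = {}
    for name, ft_param in finetuned_state.items():
        # Watermark "direction" in parameter space (assumes shared architecture/keys).
        delta = watermarked_base_state[name] - base_state[name]
        # alpha trades off watermark strength against fine-tuned capabilities.
        merged[name] = ft_param + alpha * delta
    return merged
```

Under this assumed scheme, the scaling factor would govern the trade-off the abstract points to: larger values strengthen the detectable watermark signal while risking degradation of the fine-tuned model's capabilities.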
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22661