SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Watermarking, Large Language Models, Sparse Autoencoders, Interpretability, Accountability of LLMs, AIGC Detection
TL;DR: We embed personalized watermarks into LLM-generated text with Sparse Autoencoders, using only inference-time sampling on black-box LLMs and no extra training, achieving high detection accuracy while preserving text quality.
Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention, yet existing methods compromise text quality and require white-box model access for logit manipulation or training, excluding API-based models and multilingual scenarios. We propose SAEMark, an **inference-time framework** for *multi-bit* watermarking that embeds personalized information through *feature-based rejection sampling*, fundamentally different from logit-based or rewriting-based approaches: we **do not modify model outputs directly** and require only **black-box access**, while naturally supporting multi-bit message embedding and generalizing across diverse languages and domains. We instantiate the framework using *Sparse Autoencoders* as deterministic feature extractors and provide a theoretical worst-case analysis relating watermark accuracy to computational budget. Experiments across 4 datasets demonstrate strong watermarking performance on English, Chinese, and code while preserving text quality. SAEMark establishes a new paradigm for **scalable, quality-preserving watermarks** that work seamlessly with closed-source LLMs across languages and domains.
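The abstract describes embedding a multi-bit message by rejection-sampling candidate generations against a deterministic SAE feature signature. The sketch below illustrates that idea only at a high level; the function names (`generate`, `sae_encode`), the signature scheme, and the stopping rule are assumptions for illustration, not the paper's actual algorithm.

```python
# Minimal sketch of feature-based rejection sampling over a black-box LLM.
# All names and the bit-signature scheme are illustrative assumptions.
import numpy as np

def sae_feature_signature(text, sae_encode, num_bits):
    """Map text to a deterministic bit signature via SAE activations.

    `sae_encode` is assumed to return a fixed-length sparse activation
    vector for the input text; one bit per feature group is derived by
    thresholding mean activation (an assumed scheme, not the paper's).
    """
    acts = sae_encode(text)                      # (num_features,) activations
    groups = np.array_split(acts, num_bits)      # partition features into groups
    return [int(g.mean() > np.median(acts)) for g in groups]

def watermark_by_rejection(prompt, message_bits, generate, sae_encode,
                           max_tries=16):
    """Sample candidate continuations from a black-box LLM and keep the one
    whose SAE-derived signature best matches the target message bits."""
    best, best_score = None, -1
    for _ in range(max_tries):
        candidate = generate(prompt)             # black-box API call
        sig = sae_feature_signature(candidate, sae_encode, len(message_bits))
        score = sum(s == m for s, m in zip(sig, message_bits))
        if score > best_score:
            best, best_score = candidate, score
        if best_score == len(message_bits):      # exact match: stop early
            break
    return best
```

Detection would follow the same path: recompute the SAE signature of a suspect text and compare it to the registered message bits, with the accuracy/compute trade-off governed by the sampling budget (`max_tries` here stands in for that budget).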
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 3618