Abstract: Audio watermarking provides an effective approach for tracing and protecting synthetic audio content. Traditional methods often apply the watermark as a post-processing step, which leaves it vulnerable to removal or degradation through signal processing or model editing. To address these issues, we introduce GenMark, a novel approach that embeds watermarks directly into the decoder of neural audio generation models during training. GenMark combines time-frequency perceptual losses, a mask-based localization model, and adversarial training to maintain high audio quality and watermark robustness. Experimental results on speech and music generation tasks demonstrate superior detection accuracy, with a true positive rate (TPR) of 99.9% for speech and 100.0% for music. GenMark also preserves perceptual quality, with less than 2% degradation in MUSHRA scores, making it a strong candidate for practical and secure watermarking in generative audio systems.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Text-to-Speech and Spoken Language Understanding, Generation, NLP Applications
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3240