HashMark: Watermarking Tabular/Synthetic Data For Machine Learning Via Cryptographic Hash Functions

Published: 23 Sept 2025, Last Modified: 22 Oct 2025
Venue: RegML 2025 Poster
License: CC BY 4.0
Keywords: Watermarking Synthetic Data, Tabular Data
TL;DR: This paper proposes a watermarking scheme for tabular synthetic data that's simple yet efficient.
Abstract: Watermarking is a critical tool for protecting datasets against malicious or unauthorized use, yet existing methods often face limitations in data type support, fidelity preservation, and detection efficiency. In this work, we introduce $\mathsf{HashMark}$, a novel and versatile watermarking scheme for tabular datasets, including synthetic data, that imposes no restrictions on data types. At its core, $\mathsf{HashMark}$ employs a cryptographic hash function to map \emph{any} data into binary values, enabling efficient and robust watermark embedding. Our design generalizes and simplifies some prior approaches, such as the recent works of Ngo et al. (arXiv 2024) and TabularMark (ACM CCS 2024), while addressing their key shortcomings. Unlike Ngo et al., $\mathsf{HashMark}$ supports categorical and mixed-type data within a unified framework. Compared to TabularMark, it enables efficient watermark detection without requiring access to the original dataset. Further, unlike TabularMark, we present experiments for categorical data. Finally, we run experiments comparing classifier accuracy on synthetic data and on watermarked synthetic data, using three classifiers, several datasets, and three approaches for generating synthetic data. These experiments show that $\mathsf{HashMark}$ has negligible impact on utility for the intended machine learning tasks.
Submission Number: 69
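
The abstract describes hashing arbitrary data to binary values and detecting the watermark without the original dataset. The Python sketch below illustrates that general idea only; it is not the paper's algorithm. Everything in it is an assumption made for illustration: the use of HMAC-SHA256 with a secret key, serializing a row with `repr`, embedding by resampling synthetic rows until their hash bit is 1, detection by a one-sided z-score on the count of 1-bits, and the helper names `hash_bit`, `embed`, `detect_z`, and `sample_row`.

```python
import hashlib
import hmac
import math
import random


def hash_bit(key: bytes, row: tuple) -> int:
    """Map a row of arbitrary (mixed-type) values to one bit (assumed encoding)."""
    encoded = repr(row).encode("utf-8")
    digest = hmac.new(key, encoded, hashlib.sha256).digest()
    return digest[0] & 1  # lowest bit of the first digest byte


def embed(key: bytes, sample_row, n_rows: int, max_tries: int = 64) -> list:
    """Draw synthetic rows, resampling each until its hash bit is 1 (assumed rule)."""
    rows = []
    for _ in range(n_rows):
        row = sample_row()
        for _ in range(max_tries):
            if hash_bit(key, row) == 1:
                break
            row = sample_row()
        rows.append(row)
    return rows


def detect_z(key: bytes, rows: list) -> float:
    """z-score of the 1-bit count against the Binomial(n, 1/2) null hypothesis."""
    n = len(rows)
    ones = sum(hash_bit(key, row) for row in rows)
    return (ones - 0.5 * n) / math.sqrt(0.25 * n)


# Toy stand-in for a synthetic data generator with mixed column types.
rng = random.Random(0)


def sample_row() -> tuple:
    return (
        rng.choice(["a", "b", "c"]),        # categorical column
        rng.randint(18, 90),                # integer column
        round(rng.gauss(50000, 15000), 2),  # continuous column
    )


key = b"hypothetical-secret-key"
unmarked = [sample_row() for _ in range(500)]
watermarked = embed(key, sample_row, 500)

z_unmarked = detect_z(key, unmarked)
z_watermarked = detect_z(key, watermarked)
```

In this toy setup, detection needs only the key and the rows under test: a large positive `z_watermarked` suggests the watermark is present, while `z_unmarked` should stay near zero. Because categorical, integer, and continuous values pass through the same hash, no per-type handling is required.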