Keywords: Data watermarking, Tabular data, Data provenance, Cryptographic hash functions, Synthetic data, Data governance, Robustness, Fidelity, Machine learning security, Accountability
TL;DR: We introduce $\mathsf{HashMark}$, a simple hash-based framework for watermarking tabular data that is type-agnostic, high-fidelity, and robust, enabling scalable data provenance and accountability.
Abstract: As enterprises increasingly rely on data for decision-making and machine learning pipelines, ensuring data provenance, ownership, and responsible use has become essential. Data watermarking offers a promising solution by embedding imperceptible markers into datasets, enabling traceability and accountability. While prior work has primarily focused on perceptual domains such as images, audio, and text, watermarking for tabular data remains underexplored despite its central role in enterprise systems. Tabular data presents unique challenges due to its heterogeneity, lack of redundancy, and susceptibility to structural modifications.
We introduce $\mathsf{HashMark}$, a suite of cryptographic watermarking protocols explicitly designed for tabular datasets. Our methods embed bits into table cells using seeded hash functions, achieving \emph{data-type agnostic}, high-fidelity watermarking with minimal distortion. We present two complementary schemes: (i) $\mathsf{HashMark}_1$, a sparse embedding mechanism that modifies only $\Theta(1)$ cells, and (ii) $\mathsf{HashMark}_2$, a dense embedding mechanism that enforces uniform statistical properties across the dataset while supporting categorical and alphanumeric domains. Both schemes feature low detection cost, broad applicability, and formal fidelity guarantees.
Extensive experiments across various settings demonstrate that $\mathsf{HashMark}$ maintains downstream model performance while significantly improving watermark quality compared to prior work. Our results establish hash-based watermarking as a simple, efficient, and general solution for securing tabular data against unauthorized use, while also enabling scalable data governance.
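To make the idea of embedding bits into table cells with a seeded hash more concrete, the following is a minimal illustrative sketch in Python. It is not the $\mathsf{HashMark}_1$ or $\mathsf{HashMark}_2$ protocol, whose details are not given in the abstract. The function names (`cell_bit`, `embed_cell`, `detect`), the use of SHA-256, the nudging step size, the choice to force every bit to 1, and the z-score detector are all assumptions made for illustration only.

```python
import hashlib
import math


def cell_bit(seed: bytes, value) -> int:
    """Derive one pseudorandom bit from a seeded hash of the cell's text form.

    Hashing repr(value) is an assumption that keeps the sketch type-agnostic:
    any cell with a stable text representation can be hashed.
    """
    digest = hashlib.sha256(seed + repr(value).encode("utf-8")).digest()
    return digest[0] & 1


def embed_cell(seed: bytes, value: float, target_bit: int,
               step: float = 0.01, max_tries: int = 64) -> float:
    """Nudge a numeric cell by small multiples of `step` until its hash bit
    equals `target_bit` (hypothetical embedding rule, not the paper's)."""
    for k in range(max_tries):
        for candidate in (value + k * step, value - k * step):
            candidate = round(candidate, 6)
            if cell_bit(seed, candidate) == target_bit:
                return candidate
    raise ValueError("no nearby value with the requested hash bit")


def detect(seed: bytes, column) -> float:
    """z-score of the number of 1-bits against a Binomial(n, 1/2) null.

    Unmarked data should score near 0; a column whose cells were all
    pushed to bit 1 scores sqrt(n).
    """
    n = len(column)
    ones = sum(cell_bit(seed, v) for v in column)
    return (ones - n / 2) / math.sqrt(n / 4)


seed = b"owner-secret-key"  # hypothetical secret held by the data owner
column = [round(0.37 * i, 2) for i in range(200)]
marked = [embed_cell(seed, v, target_bit=1) for v in column]
score = detect(seed, marked)
```

Under these assumptions, detection only needs the seed and one pass over the column, and each cell moves by at most `(max_tries - 1) * step`. A detector run with a different seed would be expected to see roughly balanced bits and therefore a score close to zero.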
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20656