Data Forging Is Harder Than You Think

Published: 05 Mar 2024, Last Modified: 04 May 2024 (PML, CC BY 4.0)
Keywords: data forging, machine learning, privacy, proof of learning
TL;DR: Existing data forging attacks have serious privacy implications; we scrutinize the existing state of the art, and theoretically analyse the concept of forgeability.
Abstract: Recent research has introduced \emph{data forging} attacks, which replace the mini-batches used in training with \emph{different} ones that yield nearly identical model parameters. These attacks pose serious privacy concerns, as they can undermine membership inference predictions and falsely suggest machine unlearning without actual unlearning. Given such critical privacy implications, this paper aims to scrutinize existing attacks and understand the notion of data forging. First, we argue that state-of-the-art data forging attacks have key limitations that make them \emph{unrealistic} and easily detectable. Through experimentation on multiple hardware platforms, we demonstrate that the \emph{approximation errors} reported by existing attacks are orders of magnitude higher than the benign errors caused by numerical deviations. Next, we formulate data forging as an optimisation problem and show that solving it via simple gradient-based methods also results in high approximation errors. Finally, we theoretically analyse data forging for logistic regression. Our theoretical results suggest that, even for logistic regression, it is difficult to efficiently find forged batches. In conclusion, our findings call for a reevaluation of existing attacks and highlight that data forging remains an intriguing open problem.
Submission Number: 31
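To make the abstract's notion of \emph{approximation error} concrete, here is a minimal illustrative sketch (not the paper's method): one SGD step of logistic regression on a true mini-batch versus a candidate "forged" batch, with the error measured as the max absolute difference between the resulting parameter vectors. All names and the random batches are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, X, y, lr=0.1):
    """One SGD step of logistic regression on mini-batch (X, y)."""
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta - lr * grad

d, b = 5, 8  # hypothetical dimension and batch size
theta0 = rng.normal(size=d)

# True batch vs. a candidate forged batch (here just a different random batch)
X, y = rng.normal(size=(b, d)), rng.integers(0, 2, size=b).astype(float)
Xf, yf = rng.normal(size=(b, d)), rng.integers(0, 2, size=b).astype(float)

theta_true = sgd_step(theta0, X, y)
theta_forged = sgd_step(theta0, Xf, yf)

# Forging approximation error: how far the forged update lands from the true one.
err = np.max(np.abs(theta_true - theta_forged))
```

A forging attack would search over candidate batches to drive `err` below the benign numerical deviations observed across hardware platforms; the paper's point is that existing attacks leave this gap orders of magnitude too large.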