Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We analyze the privacy of DP-SGD adapted to time series forecasting in a domain- and task-specific manner.
Abstract: Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time-series-specific tasks like forecasting, since they rely on the *privacy amplification* attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this *structured subsampling* to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
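To make the three-stage batch generation concrete, the following is a minimal sketch of such a structured subsampling pipeline. The function name `sample_forecasting_batch`, the uniform sampling of series and start positions, and the toy dataset are illustrative assumptions for exposition, not the paper's exact subsampling scheme (which may, e.g., sample series or windows differently to obtain its amplification guarantees).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_forecasting_batch(dataset, batch_size, context_len, forecast_len):
    """Draw one training batch via structured subsampling:
    (1) sample whole series from the dataset,
    (2) sample a contiguous subsequence from each sampled series,
    (3) split each subsequence into a context window and a forecast window."""
    contexts, targets = [], []
    window = context_len + forecast_len
    # (1) Sample `batch_size` series uniformly at random (illustrative choice).
    series_idx = rng.choice(len(dataset), size=batch_size, replace=True)
    for i in series_idx:
        series = dataset[i]
        # (2) Sample a contiguous subsequence of length context_len + forecast_len.
        start = rng.integers(0, len(series) - window + 1)
        subseq = series[start:start + window]
        # (3) Partition it into observed context and ground-truth forecast.
        contexts.append(subseq[:context_len])
        targets.append(subseq[context_len:])
    return np.stack(contexts), np.stack(targets)

# Toy usage: 100 univariate series of length 200 (hypothetical data).
dataset = [rng.standard_normal(200) for _ in range(100)]
ctx, tgt = sample_forecasting_batch(dataset, batch_size=32, context_len=48, forecast_len=24)
print(ctx.shape, tgt.shape)  # (32, 48) and (32, 24)
```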
Lay Summary: Differentially Private Stochastic Gradient Descent (DP-SGD) is the standard method for training machine learning models on sensitive data with strong formal privacy guarantees. The core principle underlying these strong privacy guarantees is amplification by subsampling: Training on randomly sampled subsets is much more private than training on an entire set of input-label pairs. But what if our training data is not simply an unstructured set, but composed of sequentially structured data like natural language or time series? What if there are no explicit labels, and we are instead training our model to predict the next sentence or the next 24 hours? We answer this question by deriving formal privacy guarantees for models that predict future information based on observed context information. In particular, we analyze the interplay of sampling sequences from a dataset, sampling shorter subsequences from these sequences, and splitting them into context and ground-truth for training. Using time series forecasting as a testbed, we experimentally demonstrate that our tight privacy guarantees enable private training on sequential data while retaining high model utility.
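For readers unfamiliar with DP-SGD itself, the following is a minimal sketch of a single DP-SGD update on a toy linear model: each example's gradient is clipped to bound its influence, and Gaussian noise calibrated to that clipping norm is added before the update. The linear regression objective, hyperparameters, and helper name `dp_sgd_step` are illustrative assumptions, not the training setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step for linear regression: clip per-example gradients,
    sum them, add Gaussian noise scaled to the clipping norm, then update."""
    grads = []
    for x_i, y_i in zip(X, y):
        # Per-example gradient of the squared error 0.5 * (w @ x_i - y_i)^2.
        g = (w @ x_i - y_i) * x_i
        # Clip so no single example contributes more than `clip_norm` in L2 norm.
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)
        grads.append(g)
    # Noise standard deviation is noise_multiplier * clip_norm (Gaussian mechanism).
    noisy_sum = np.sum(grads, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape
    )
    return w - lr * noisy_sum / len(X)

# Toy usage on a random regression problem (hypothetical data).
X = rng.standard_normal((32, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(32)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y)
print(w)
```

The privacy accounting for such updates is where amplification by subsampling enters: because each step only touches a random subset of the data, the effective privacy loss per step is smaller than the noise level alone would suggest, and the paper extends this accounting to the structured, sequential batches used in forecasting.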
Link To Code: https://cs.cit.tum.de/daml/dp-forecasting
Primary Area: Social Aspects->Privacy
Keywords: Differential Privacy, Privacy Amplification, Privacy Accounting, Time Series, Forecasting
Submission Number: 11936