SPriFed-OMP: A Differentially Private Federated Learning Algorithm for Sparse Basis Recovery

Published: 08 Jul 2024, Last Modified: 08 Jul 2024. Accepted by TMLR.
Abstract: Sparse basis recovery is a classical and important statistical learning problem when the number of model dimensions $p$ is much larger than the number of samples $n$. However, there has been little work that studies sparse basis recovery in the Federated Learning (FL) setting, where the client data's differential privacy (DP) must also be simultaneously protected. In particular, the performance guarantees of existing DP-FL algorithms (such as DP-SGD) will degrade significantly when $p \gg n$, and thus, they will fail to learn the true underlying sparse model accurately. In this work, we develop a new differentially private sparse basis recovery algorithm for the FL setting, called SPriFed-OMP. SPriFed-OMP converts OMP (Orthogonal Matching Pursuit) to the FL setting. Further, it combines SMPC (secure multi-party computation) and DP to ensure that only a small amount of noise needs to be added in order to achieve differential privacy. As a result, SPriFed-OMP can efficiently recover the true sparse basis for a linear model with only $n = \mathcal{O}(\sqrt{p})$ samples. We further present an enhanced version of our approach, SPriFed-OMP-GRAD, based on gradient privatization, that improves the performance of SPriFed-OMP. Our theoretical analysis and empirical results demonstrate that both SPriFed-OMP and SPriFed-OMP-GRAD terminate in a small number of steps, and they significantly outperform the previous state-of-the-art DP-FL solutions in terms of the accuracy-privacy trade-off.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the Action Editor and all the reviewers for their time and for their valuable feedback, which significantly improved our manuscript. Below, we provide a point-by-point response to each comment in the decision. Due to limited space in the changes section of the camera-ready submission form, we include only the gist of each comment made by the Action Editor. Where necessary, our responses state the exact change or the page number of the change.

> Comment 1: Utility bounds in terms of $(\epsilon, \delta)$-DP.

**Response:** We thank the Action Editor for the suggestion. We have expanded the remark after Theorem 8 (on page 12) to discuss how to obtain this utility-privacy trade-off, as follows: “Specifically, suppose that we wish to re-write the conditions stated in Theorems 7 and 8 in terms of $(\epsilon, \delta)$-DP guarantees. We first pick $\mu$, which can be easily converted to $(\epsilon, \delta)$ DP guarantees according to Lemma 2. Then, by assuming that $\mu_s = \mu_p \cdot m_s$, where $m_s$ is a known constant factor, we obtain from Theorem 6 that $\mu = \mu_p \sqrt{s(1 + 2 \cdot m_s^2)}$. We can thus solve $\mu_p$ and $\mu_s$ as: $\mu_p = \mu/\sqrt{s(1+2 \cdot m_s^2)}$ and $\mu_s = \mu_p \cdot m_s$. These values can then be directly plugged into Theorems 7 and 8 to obtain the accuracy guarantees.” We thank the editor again for this comment, which improves the clarity of our results. Furthermore, $\sigma_{\varepsilon}$ denotes the standard deviation of the additive error (denoted by the vector variable $\mathbf{\varepsilon}$) in the system model section on page 4, and we refer to this notation again on page 11. To avoid confusion, we have added the following remark: “Recall that $\sigma_{\varepsilon}$ is the standard deviation of the additive error in the system model presented in Section 2.”

> Comment 2: Reasoning behind choosing GDP over Renyi DP and other DP composition mechanisms.

**Response:** We thank the Action Editor for this insightful comment. Accordingly, on pages 9-10 of our modified manuscript, we have included an additional comment about the work in [1]. More specifically, we augment our existing response with the following sentence: “Similarly, for non-adaptive privacy mechanisms, GDP has been shown to be the most optimal mechanism (see Theorem 8 in [1]) such that the optimal noise variance can be appropriately identified to obtain a given $(\epsilon, \delta)$ DP guarantee.” We also shortened our description of the comparison result from Dong et al. (2019).

[1] Balle, Borja, and Yu-Xiang Wang. "Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising." International Conference on Machine Learning. PMLR, 2018.

> Comment 3: In the privacy analysis of Alg. 3 (Theorem 6), is the DP cost of steps 4-5 included?

**Response:** We thank the Action Editor for their question. Our analysis already includes the DP cost of steps 4-5: Theorem 6 accounts for $s$ privacy mechanisms with the $\mu_p$ parameter (here, the algorithm runs until $s$ basis values are selected).

> Comment 4: Proofreading of the document.

**Response:** Thank you for the suggestions. As requested, we have made the changes mentioned in Comment 4. Furthermore, we have carefully proofread and revised the manuscript.

> Comment 5: Discrepancy in the notation of the data matrix X.

**Response:** We thank the Action Editor for their question. We do intend the matrix $\mathbf{X}$ to have $n$ rows and $p$ columns, where $n$ is the number of samples/clients and $p$ is the number of features. Each feature is thus an $n$-element column vector denoted by $X_i$, where $i \in [p]$, and thus $X = \{X_1, \ldots, X_p\}$. We also use lower-case $x_i$ to denote sample $i$, which corresponds to the $i$-th row of $X$. However, we realize that our statement on page 4 of the paper that $(X, y) = (x_i, y_i)_{i \in \{1, \ldots, n\}}$ caused some confusion. To avoid such confusion, we have revised this part (on page 4) as “We now consider $n$ input-output training pairs represented by $(x_i, y_i), i \in \{1, \ldots, n\}$ where each pair belongs to a single distinct client. We stack the row vectors $x_i$ vertically together to form an $n \times p$ matrix $X$. Similarly, we stack $y_i$ into an $n \times 1$ vector $y$.”

> Comment 6: Add names of the datasets either to figures or captions.

**Response:** Thank you for suggesting this change. We have added the names of the datasets to the figure captions.
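
As an illustration of the budget split quoted in the response to Comment 1, the following minimal sketch computes $\mu_p$ and $\mu_s$ from a chosen $\mu$, $s$, and $m_s$ using the stated formulas. The function names are hypothetical and not taken from the paper. The conversion from $\mu$-GDP to $(\epsilon, \delta)$ uses the standard duality formula from Dong et al. (2019); that Lemma 2 of the paper is this same conversion is an assumption. Likewise, the `recompose` check reads the stated formula as a root-sum-of-squares composition of $s$ mechanisms at $\mu_p$ and $2s$ mechanisms at $\mu_s$, which matches the algebra but is only an interpretation of Theorem 6.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gdp_delta(mu: float, eps: float) -> float:
    """delta(eps) for a mu-GDP mechanism (Dong et al., 2019).

    Assumption: the paper's Lemma 2 is this standard conversion.
    """
    return norm_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * norm_cdf(-eps / mu - mu / 2.0)


def split_budget(mu: float, s: int, m_s: float) -> tuple[float, float]:
    """Solve mu = mu_p * sqrt(s * (1 + 2 * m_s**2)) for mu_p, then set mu_s = mu_p * m_s."""
    mu_p = mu / math.sqrt(s * (1.0 + 2.0 * m_s ** 2))
    mu_s = mu_p * m_s
    return mu_p, mu_s


def recompose(mu_p: float, mu_s: float, s: int) -> float:
    """Root-sum-of-squares reading of the formula (an interpretation, not the paper's proof)."""
    return math.sqrt(s * mu_p ** 2 + 2 * s * mu_s ** 2)


if __name__ == "__main__":
    mu, s, m_s = 1.0, 5, 2.0  # illustrative values only
    mu_p, mu_s = split_budget(mu, s, m_s)
    print(f"mu_p = {mu_p:.6f}, mu_s = {mu_s:.6f}")
    print(f"recomposed mu = {recompose(mu_p, mu_s, s):.6f}")
    print(f"delta at eps = 1: {gdp_delta(mu, 1.0):.6f}")
```

The resulting `mu_p` and `mu_s` are the quantities that the quoted remark says can be plugged into Theorems 7 and 8; the sketch does not evaluate those accuracy bounds.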
Assigned Action Editor: ~Antti_Koskela1
Submission Number: 2308