Direct Neural Network Training on Securely Encoded Datasets

TMLR Paper528 Authors

24 Oct 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: In fields where data privacy and secrecy are critical, such as healthcare and business intelligence, security concerns have reduced the availability of data for neural network training. A recently developed technique securely encodes training, test, and inference examples with an aggregate non-orthogonal and nonlinear transformation consisting of sequential steps of random padding, random perturbation, and random orthogonal matrix transformation, enabling artificial neural network (ANN) training and inference directly on encoded datasets. Here, the performance characteristics and privacy aspects of the method are presented. The individual transformations of the method, when applied alone, do not significantly reduce validation accuracy with fully-connected ANNs. Training on datasets transformed by sequential padding, perturbation, and orthogonal transformation results in slightly lower validation accuracies than those obtained with unmodified control datasets, with no difference in training time between transformed and control datasets. The presented methods have implications for machine learning in fields requiring data security.
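The abstract describes an aggregate encoding built from three randomized steps applied to each example. The sketch below is one plausible reading of that pipeline in NumPy, not the paper's implementation; the parameter names and sizes (pad_width, noise_scale, the use of a QR-derived orthogonal matrix) are assumptions for illustration only.

```python
# Illustrative sketch of the encoding pipeline from the abstract:
# random padding -> random perturbation -> random orthogonal transformation.
# Assumed parameters throughout; not the authors' reference implementation.
import numpy as np

rng = np.random.default_rng(seed=0)  # the secret seed acts as the encoding key

def encode_dataset(X, pad_width=64, noise_scale=0.01):
    """Encode a batch of flattened examples X of shape (n, d)."""
    n, d = X.shape
    # 1) Random padding: append pad_width random feature columns.
    padding = rng.uniform(size=(n, pad_width))
    X_pad = np.hstack([X, padding])
    # 2) Random perturbation: add small random noise to every element.
    X_pert = X_pad + noise_scale * rng.standard_normal(X_pad.shape)
    # 3) Random orthogonal transformation: multiply by a fixed orthogonal Q
    #    (QR decomposition of a random Gaussian matrix yields orthogonal Q).
    Q, _ = np.linalg.qr(rng.standard_normal((d + pad_width, d + pad_width)))
    return X_pert @ Q, Q  # Q (and the seed) must be kept secret

# Example: encode flattened 28x28 images, then train any ANN on X_enc directly.
# The same Q must be reused to encode the test and inference examples.
X = rng.uniform(size=(100, 784))
X_enc, Q = encode_dataset(X)
```

Although step 3 alone is orthogonal (and hence linear), the padding and perturbation steps make the aggregate transformation non-orthogonal and nonlinear, as stated in the abstract.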
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Substantial changes and additions have been made in response to the very helpful feedback from the three reviewers.
- New Section 2, 'Threat Model', and new associated Figure 1: better defines the threat model against which our privacy claims are evaluated.
- New Section 5.6, 'Testing on Alternative Data Sources': performance assessment with additional training datasets (CIFAR-10, Fashion-MNIST, and our synthetic unknown challenge dataset).
- New Section 5.7, 'Privacy Analysis': theoretical evaluation of privacy in the context of our threat model.
- New Section 5.8, 'Information Leakage': defines information leakage in our threat model and shows that an attacker in possession of the information leaked in the threat model (e.g., the association of encoded examples with labels) does not gain information useful for reversing the random encoding processes, which are independent of label category.
- New Figures 9, 10, and 11: provide the theoretical and practical analysis presented in new Section 5.8.
- Encoded Dataset Challenge section: revised to include a trained Keras model on the encoded dataset and labels.
- Section 6.2: added Subsection 6.2.1, 'Security Parameters', to more clearly enumerate the security parameters available in our approach.
- Section 6.4 and subsections: based on reviewer feedback, added discussions of and references to InstaHide, NeuraCrypt, DarKnight, and differential privacy.
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 528