['3c3', "< Abstract: Random forests have been one of the successful ensemble algorithms in machine learning. Various techniques have been utilized to preserve the privacy of random forests, such as anonymization, differential privacy, homomorphic encryption, etc. This work takes one step towards data encryption by incorporating some crucial ingredients of learning algorithm. Specifically, we develop a new encryption to preserve data's Gini impurity, which plays an important role during the construction of random forests. The basic idea is to modify the structure of binary search tree to store several examples in each node, and encrypt the data features by incorporating label and order information. Theoretically, our scheme is proven to preserve the minimum Gini impurity in ciphertexts without decrypting, and we also present the security guarantee for encryption. For random forests, we encrypt data features based on our Gini-impurity-preserving scheme, and take the homomorphic encryption scheme CKKS to encrypt data labels owing to their importance and privacy. We finally present extensive empirical studies to validate the effectiveness, efficiency and security of our proposed method. * These authors contribute equally. This work takes one step towards data encryption by incorporating some crucial ingredients of learning algorithm, and main contributions can be summarized as follows: • We present a new encryption to preserve data's Gini impurity, and the basic idea is to modify the structure of binary search trees to maintain several samples on each node, and encrypt data's features by incorporating label and order information. Our scheme could change the data frequencies, which is also beneficial for data security. • Theoretically, we prove the preservation of minimum Gini impurity in ciphertexts without decryption, which plays an important role on the construction of random forests. 
Our scheme also satisfies the security against Gini-impurity-preserving chosen plaintext attack. • We focus on the privacy random forests in the popular client-server protocol, and take our Gini-impurity-preserving encryption for data features. We adopt homomorphic encryption CKKS to encrypt data labels. Our encrypted decision tree takes smaller communication and computational complexities, as shown in Table 1. • Extensive experiments show that our encrypted random forests take significantly better performance than prior privacy random forests via encryption, anonymization and differential privacy, and are comparable to original (plaintexts) random forests without encryption. Our encrypted random forests make a good balance between computational cost and data security. The rest of this work is constructed as follows: Section 2 introduces relevant work. Section 3 presents an encryption on data's Gini impurity. Section 4 proposes the encrypted random forests. Section 5 conducts extensive experiments. Section 6 concludes with future work.Homomorphic Encryption (HE) is a cryptosystem, which allows operations on encrypted data without access to a secret key [40]. We can perform some mathematical operations such as addition and multiplication operations on encrypted data without revealing sensitive information. Given an encryption function E(•) and a decryption function D(•), the HE scheme provides two operators ⊕ and ⊗ such that, for every pair of plaintexts x 1 and x 2 , where + and × denote standard addition and multiplication operations, respectively. Various HE schemes have been developed during the past years, e.g., ElGamal [67], Paillier [68], CKKS [42] encryption, etc. Relevant techniques have been successfully applied to machine learning tasks such as regression problem [69, 70], neural network [71-75], collaborative filtering [76], etc. 
Generally, HE schemes are accompanied with high computational costs, and one main challenge is to maintain a good trade-off among security, effectiveness and computational cost in real applications.", '---', '> Abstract: Random forests are highly successful ensemble algorithms in machine learning, but their application to sensitive data is hampered by privacy concerns. Existing privacy-preserving techniques, such as anonymization, differential privacy, and homomorphic encryption, often introduce trade-offs between privacy, accuracy, and computational efficiency. This work introduces a novel encryption scheme specifically designed to preserve Gini impurity, a critical splitting criterion in random forest construction, without requiring decryption. Our approach modifies binary search tree structures to store multiple examples per node and encrypts data features by integrating label and order information. We theoretically prove that our scheme preserves the minimum Gini impurity in ciphertexts and provides a robust security guarantee against Gini-impurity-preserving chosen plaintext attacks. For practical privacy-preserving random forests, we combine our Gini-impurity-preserving feature encryption with the CKKS homomorphic encryption scheme for data labels. Extensive empirical studies demonstrate that our proposed method achieves significantly better performance than prior privacy random forests based on encryption, anonymization, and differential privacy, while maintaining accuracy comparable to original plaintext random forests. Our solution strikes an effective balance between computational cost and data security.', '5a6,7', '> Homomorphic Encryption (HE) is a cryptosystem that allows operations on encrypted data without access to a secret key [40]. This enables mathematical operations like addition and multiplication on encrypted data without revealing sensitive information. 
Given an encryption function E(•) and a decryption function D(•), HE schemes provide operators ⊕ and ⊗ such that, for every pair of plaintexts x 1 and x 2 , D(E(x 1 ) ⊕ E(x 2 )) = x 1 + x 2 and D(E(x 1 ) ⊗ E(x 2 )) = x 1 × x 2 , where + and × denote standard addition and multiplication operations, respectively. Various HE schemes have been developed, e.g., ElGamal [67], Paillier [68], CKKS [42]. These techniques have been applied to machine learning tasks such as regression [69, 70], neural networks [71-75], and collaborative filtering [76]. However, HE schemes generally incur high computational costs, making the trade-off among security, effectiveness, and computational cost a significant challenge in real-world applications.', '> ', '8,11c10,23', "< Homomorphic encryption [40][41][42][43] has been one of the most important cryptosystems in privacypreserving computing [44][45][46][47]. Based on such scheme, various algorithms have been developed to train privacy random forests and decision trees [48][49][50][51][52], while some other methods only considered inference without training due to computational costs [53][54][55][56][57][58]. In addition, LeFevre et al. [59] took Table 1: Comparisons of communications and complexities for different privacy-preserving decision trees. Here, n is the number of examples in training data, and τ is the cardinality of label space. Let h and κ be the height and number of leaves of decision tree (h < κ), respectively. Denote by ȷ the average number of possible splitting features and positions in the construction of decision trees, and p is the number of clients for secure multi-party computation. '-' means the corresponding methods focusing only on inference without training. 
the anonymization [60] for random forests by grouping similar attributes so as to hardly identify specific individual information.", '< Figure 1: A simple illustration for our encryption: each plaintext is encrypted into a ciphertext vector (ci, ei,j).', '< Here, random numbers c1 < c2 < • • • < cs are introduced to preserve the Gini impurity for random forests, and we take homomorphic encryption scheme for ei,j = Enc(kpub, j) in Eqn. (5), which is helpful for decryption.', '< Secure Multi-Party Computation (SMC) [77] is another cryptographic technique to jointly compute a function from multiple private inputs with confidential, which has been used for machine learning to protect privacy data, such as neural network [78][79][80], k-means clustering [81][82][83], random forests and decision trees [35][36][37][38][39], etc. Differential privacy is introduced to preserve individual privacy by taking statistically inconsequential changes to data [84], and relevant techniques have been utilized in neural network [85][86][87], random forests [30,31] and decision trees [32][33][34].', '---', '> Homomorphic encryption [40][41][42][43] has been one of the most important cryptosystems in privacy-preserving computing [44][45][46][47]. Based on such schemes, various algorithms have been developed to train privacy random forests and decision trees [48][49][50][51][52], while some other methods only considered inference without training due to computational costs [53][54][55][56][57][58]. In addition, LeFevre et al. [59] applied anonymization [60] for random forests by grouping similar attributes to prevent identification of specific individual information. Table 1 provides a comparison of communication and computational complexities for different privacy-preserving decision trees. 
Figure 1 illustrates our encryption scheme, where each plaintext is encrypted into a ciphertext vector (ci, ei,j).', '> ', '> Secure Multi-Party Computation (SMC) [77] is another cryptographic technique that jointly computes a function over multiple private inputs while keeping those inputs confidential. It has been used in machine learning to protect private data, e.g., for neural networks [78][79][80], k-means clustering [81][82][83], and random forests and decision trees [35][36][37][38][39]. Differential privacy preserves individual privacy by making statistically inconsequential changes to data [84], and relevant techniques have been utilized in neural networks [85][86][87], random forests [30,31], and decision trees [32][33][34].', '> ', '> Despite these advancements, a key challenge in privacy-preserving random forests is to maintain the critical Gini impurity property during tree construction without compromising privacy or incurring prohibitive computational overhead. Our work addresses this gap by focusing on preserving this essential learning ingredient directly within the encryption process.', '> ', '> Our main contributions can be summarized as follows:', "> • We present a novel encryption scheme designed to preserve data's Gini impurity. The core idea involves modifying the structure of binary search trees to maintain multiple samples on each node and encrypting data features by incorporating label and order information. This scheme also inherently changes data frequencies, which further enhances data security.", '> • Theoretically, we prove the preservation of minimum Gini impurity in ciphertexts without decryption, a property crucial for the effective construction of random forests. Our scheme is also proven to satisfy security against Gini-impurity-preserving chosen plaintext attacks.', '> • We develop privacy-preserving random forests within the popular client-server protocol. 
We utilize our Gini-impurity-preserving encryption for data features and adopt the CKKS homomorphic encryption scheme for data labels, owing to their importance and privacy requirements. Our encrypted decision tree demonstrates reduced communication and computational complexities, as detailed in Table 1.', '> • Extensive empirical studies validate that our encrypted random forests achieve significantly better performance than prior privacy random forests based on encryption, anonymization, and differential privacy. Furthermore, their performance is comparable to original (plaintext) random forests without encryption, demonstrating an excellent balance between computational cost and data security.', '> ', "> The rest of this work is organized as follows: Section 2 introduces relevant background and related work. Section 3 details our encryption scheme for data's Gini impurity. Section 4 proposes the encrypted random forests. Section 5 presents extensive experimental results. Section 6 concludes the paper and outlines future work.", '> ', '31,33c43,50', '< Section: Algorithm 1', '< The Gini-impurity-preserving encryption Input: We consider two important factors in encryption: i) preservation of the minimum Gini impurity I * G (A) over the encrypted data, and ii) a cryptosystem for encoding and decoding data. Based on such recognition, we introduce the following encryption, for every example (a ⟨i⟩ , y ⟨i⟩ ) ∈ I j ,', '< Dataset A = {(a 1 , y 1 ), • • • , (a n , y n )} Output: Binary search tree BT , ciphertexts { a 1 , • • • , a n } Initialize: Tree BT = ∅ with its cipher 1 = c max /2, where c max = 2 λ log 2 n for i = 1', '---', '> Section: Algorithm 1: Gini-impurity-preserving Encryption Scheme', '> This section introduces our Gini-impurity-preserving encryption scheme. 
The design considers two critical factors: i) the preservation of the minimum Gini impurity I * G (A) over encrypted data, and ii) the integration of a robust cryptosystem for data encoding and decoding.', '> ', '> The encryption process for each example (a ⟨i⟩ , y ⟨i⟩ ) ∈ I j is defined as follows:', '> Input: Dataset A = {(a 1 , y 1 ), • • • , (a n , y n )}', '> Output: Binary search tree BT , ciphertexts { a 1 , • • • , a n }', '> Initialize: Tree BT = ∅ with its cipher 1 = c max /2, where c max = 2 λ log 2 n', '> For i = 1:', '35c52', '< Here, c 1 , c 2 , • • • , c s are random numbers s.t. c 1 < c 2 < • • • < c s , which aim to preserve the minimum Gini impurity. We take the homomorphic encryption scheme CKKS with a public key k pub for', '---', '> Here, c 1 , c 2 , • • • , c s are random numbers such that c 1 < c 2 < • • • < c s , strategically chosen to preserve the minimum Gini impurity. We employ the homomorphic encryption scheme CKKS with a public key k pub for the second component of the ciphertext,', '37,38c54,57', '< ) in Eqn. (5), and it is useful for decryption. Figure 1 presents a simple illustration for our encryption, and the detailed decryption is given in Appendix A.', '< We now present our main theorem as follows: Theorem 1. We have I * G (A) = I * G (A ′ ), for re-sort dataset A by Eqn.', '---', '> ) in Eqn. (5), which is essential for subsequent decryption. Figure 1 provides a simple illustration of our encryption method, with detailed decryption procedures presented in Appendix A.', '> ', '> We now present our main theoretical guarantee:', '> Theorem 1. We have I * G (A) = I * G (A ′ ), for the re-sorted dataset A by Eqn.', '40,41c59,60', '< A ′ = {( a ⟨1⟩ 1 , y ⟨1⟩ ), • • • , ( a ⟨n⟩ 1 , y ⟨n⟩ )} from Eqns. (4)-(5).', '< This theorem shows that our encryption could preserve the minimum Gini impurity over encrypted data. The detailed proof is presented in Appendix B, which involves the proof of piecewise monotonicity of I G (A, a) w.r.t. 
splitting point a, and then solves the minimum splitting point on plaintexts, as well as the corresponding point on encrypted data.', '---', '> A ′ = {( a ⟨1⟩ 1 , y ⟨1⟩ ), • • • , ( a ⟨n⟩ 1 , y ⟨n⟩ )} derived from Eqns. (4)-(5).', '> This theorem rigorously demonstrates that our encryption scheme effectively preserves the minimum Gini impurity over encrypted data. The comprehensive proof is detailed in Appendix B, involving the analysis of piecewise monotonicity of I G (A, a) with respect to the splitting point a, and subsequently identifying the minimum splitting points on both plaintexts and their corresponding encrypted data.', '47,52c66,69', '< Step-I: Search a node for sample (a i , y i ) in binary search tree BT Let t be a node pointer with the initialization of the root of BT . We search a path downward in BT by comparing with a i , and the search will terminate when t is a leaf node or an empty node.', '< For an internal node t, the search continues to its left child and updates t max = t.cipher 1 if the left child t.left ̸ = ∅ and a i < max{a j : (a j , y j ) ∈ t.left.samples} ; and the search continues to its right child and updates t min = t.cipher 1 if the right child t.right ̸ = ∅ and a i > min{a j : (a j , y j ) ∈ t.right.samples} ; otherwise, the search terminates. This procedure can be easily implemented with a while loop.', '< It is necessary to consider two special cases after the above search. We update t = t.left if t.left ̸ = ∅, a i < min{a j : (a j , y j ) ∈ t.samples} and y i = y j for all (a j , y j ) ∈ t.left.samples . (6) In a similar manner, we update t = t.right if t.right ̸ = ∅, a i > max{a j : (a j , y j ) ∈ t.samples} and y i = y j for all (a j , y j ) ∈ t.right.samples . (7) Step-II: Update the binary search tree BT After Step-I, we could find a node t for sample (a i , y i ) and the corresponding interval [t min , t max ]. 
We directly append the example (a i , y i ) into t.samples if y i = y j for every (a j , y j ) ∈ t.samples; otherwise, it is necessary to split the node t according to a i .', '< We initialize an empty node l with l.samples = {(a j , y j ) ∈ t.samples : a j < a i }, and it is sufficient to consider l.samples ̸ = ∅. If t.left ̸ = ∅, then we set l.cipher ', '< and update t.left = l. Here, ξ is a random number sampled from N (0, 1), and notice that we may randomly sample ξ multiple times so that the condition holds in Eqns ( 8)-( 9), respectively. ', '< and update t.right = r. Algorithm 2 presents the detailed descriptions on the splitting of node t.', '---', '> Step-I: Search a node for sample (a i , y i ) in binary search tree BT Let t be a node pointer initialized to the root of BT . We search a path downward in BT by comparing the plaintext feature a i with the splitting value represented by t.cipher 1 . The search continues to its left child if a i is less than the value represented by t.cipher 1 , updating t max = t.cipher 1 ; it continues to its right child if a i is greater than the value represented by t.cipher 1 , updating t min = t.cipher 1 . The search terminates when t is a leaf node, an empty node, or when a i falls within the range represented by t.cipher 1 . This procedure can be implemented with a while loop.', '> It is necessary to consider two special cases after the above search. We update t = t.left if t.left ̸ = ∅, a i < min{a j : (a j , y j ) ∈ t.samples} and y i = y j for all (a j , y j ) ∈ t.left.samples . (6) In a similar manner, we update t = t.right if t.right ̸ = ∅, a i > max{a j : (a j , y j ) ∈ t.samples} and y i = y j for all (a j , y j ) ∈ t.right.samples . (7) Step-II: Update the binary search tree BT After Step-I, we identify a node t for sample (a i , y i ) and the corresponding interval [t min , t max ]. 
We directly append the example (a i , y i ) into t.samples if y i = y j for every (a j , y j ) ∈ t.samples; otherwise, it is necessary to split the node t according to a i .', '> We initialize an empty node l with l.samples = {(a j , y j ) ∈ t.samples : a j < a i }, and proceed if l.samples ̸ = ∅. If t.left ̸ = ∅, then we set l.cipher 1 according to Eqn. (8), and update l.left = t.left, t.left = l. Here, ξ is a random number sampled from N (0, 1), and notice that we may randomly sample ξ multiple times so that the condition holds in Eqns (8)-(9), respectively.', '> We make a similar update for the right child of node t: initialize an empty node r with r.samples = {(a j , y j ) ∈ t.samples : a j > a i }, and proceed if r.samples ̸ = ∅. If t.right ̸ = ∅, then we set r.cipher 1 according to Eqn. (10), and update r.right = t.right, t.right = r. Algorithm 2 presents the detailed descriptions on the splitting of node t.', '97a115', '> After the encrypted random forests (consisting of individual decision trees DT 1 , • • • , DT m ) are constructed, prediction on encrypted testing data Sn ′ = { x1 , • • • , xn ′ } proceeds as follows. For each encrypted test instance xi , the server computes the encrypted predicted label ỹi = DT 1 ( xi ) ⊕ • • • ⊕ DT m ( xi ) by traversing each encrypted decision tree. The server then sends the collection of encrypted predicted labels { ỹ1 , • • • , ỹn ′ } to the client. The client, possessing the secret key k sec , decrypts these ciphertexts. The final plaintext label for each instance is obtained by ỹi = arg max j∈[τ ] {Dec(k sec , ỹi,j )}.', '98a117', '> During this prediction process, the server incurs an O(h) computational complexity, primarily due to searching from the root to a leaf node in each tree. 
The client benefits from O(1) communication rounds and bandwidth, as it only needs to transfer the encrypted testing data initially and receive the final encrypted predictions, without further interactive communication during the prediction phase. This process is further illustrated in Figure 1.', '101c120', '< We conduct experiments on 20 datasets2 as summarized in Table 2. Most datasets have been wellstudied in previous random forests. In addition to the original (plaintexts) random forests [1], we compare with six state-of-the-art privacy-preserving random forests in recent years.', '---', '> We conduct experiments on 20 datasets as summarized in Table 2. Most datasets have been well-studied in previous random forests literature. In addition to the original (plaintext) random forests [1], we compare our method with six state-of-the-art privacy-preserving random forests from recent years.', '127c146,147', '< Section: Algorithm 4 Decryption', '---', '> Section: Algorithm 4: Decryption Method', '> This section provides the detailed decryption procedures for our Gini-impurity-preserving encryption scheme. Algorithm 4 outlines the general decryption process for a ciphertext a i = ( a i 1 , a i 2 ) using the binary search tree BT and the CKKS secret key k sec . Further details are provided in the subsequent subsections.', '129d148', '< ', '144c163', '< • S ← KeyGen(t max ): Generate the secret state S by initializing binary search tree BT = ∅, and a security parameter c max , which is a random number with c max > n. We maintain an interval [t min , t max ] in each secret state S with t min = 0 and t max = c max in the initial stage, so as to keep the order of ciphertexts c 1 , c 2 , • • • , c s in Eqn. (5). 
In this way, the ciphertexts are random numbers with semi-order of plaintexts, and we have different ciphertext even for the same plaintexts.', '---', '> • S ← KeyGen(t max ): Generate the secret state S by initializing binary search tree BT = ∅, and a security parameter c max , which is derived as c max = 2 λ log 2 n (as defined in Algorithm 1) with c max > n. We maintain an interval [t min , t max ] in each secret state S with t min = 0 and t max = c max in the initial stage, to preserve the order of ciphertexts c 1 , c 2 , • • • , c s as established in Eqn. (5). This mechanism ensures that ciphertexts are random numbers that maintain the semi-order of plaintexts, allowing for different ciphertexts even for identical plaintexts.', '151,153c170,175', '< Section: B Proof of Theorem 1', '< Lemma 4. Proof. Without loss of generality, we assume that a 1 , a 2 , • • • , a n are distinct elements. Our goal is to solve the optimal splitting point a * ∈ arg min a∈R {I G (A, a)}, and we begin with some notations used in our proof. For every label j ∈ [τ ], we denote by', '< For dataset A = {(a 1 , y 1 ), • • • , (a n , y n )}, let I 1 , I 2 , • • • , I s be', '---', '> Section: B Proof of Theorem 1 and Theorem 5', '> To prove Theorem 1, we first establish the piecewise monotonicity of the Gini impurity function.', '> ', '> Lemma 4 (Piecewise Monotonicity of Gini Impurity). For a dataset A = {(a 1 , y 1 ), • • • , (a n , y n )} sorted such that a ⟨1⟩ ≤ a ⟨2⟩ ≤ • • • ≤ a ⟨n⟩ and partitioned into I 1 , I 2 , • • • , I s as defined in Eqns. (4)-(5), the Gini impurity function I G (A, a) exhibits piecewise monotonicity with respect to the splitting point a within specific intervals.', '> ', '> Proof. Without loss of generality, we assume that a 1 , a 2 , • • • , a n are distinct elements. Our goal is to solve the optimal splitting point a * ∈ arg min a∈R {I G (A, a)}. We begin with some notations used in our proof. 
For every label j ∈ [τ ], we denote by', '188c210', '< In a summary, we prove the piecewise monotonicity of I G (A, a) for', '---', '> In summary, we prove the piecewise monotonicity of I G (A, a) for', '194c216', '< It is not necessary to consider the splitting point a * > max{a k : (a k , y k ) ∈ I s } with |A r a | = 0, as well as the splitting point a * < min{a k : (a k , y k ) ∈ I 1 } with |A l a | = 0, i.e., without splitting dataset A. This completes the proof.', '---', '> It is not necessary to consider the splitting point a * > max{a k : (a k , y k ) ∈ I s } with |A r a | = 0, as well as the splitting point a * < min{a k : (a k , y k ) ∈ I 1 } with |A l a | = 0, i.e., without splitting dataset A. This completes the proof of Lemma 4.', '196c218,220', '< Section: Proof of Theorem 1', '---', '> Lemma 6 (Optimal Splitting Point). For a dataset A partitioned into I 1 , I 2 , • • • , I s as defined by Eqn. (4), there exists an optimal splitting point a * such that I G (A, a * ) = I * G (A) and a * ∈ ∪ i∈[s-1] {max{a k : (a k , y k ) ∈ I i }/2 + min{a k : (a k , y k ) ∈ I i+1 }/2} , where I G (A, a * ) and I * G (A) are defined by Eqns. (1) and (2), respectively.', '> ', '> Proof of Theorem 1.', '202c226,229', '< Based on Theorem 1, our encryption with binary search trees (Algorithm 1) can also preserve the minimum Gini impurity over encrypted data, which can be shown by the following theorem: Proof. Our constructed binary search tree BT (Algorithm 1) maintains several samples on a node. For each node t, we have t.cipher 1 < t.right.cipher 1 and t.cipher 1 > t.left.cipher 1 . In this way, we can obtain a monotone increasing sequence I 1 , I 2 , • • • , I s by inorder traversing the built Tree BT in Algorithm 1. Each I i for j ∈ [s] contains several samples as follows:', '---', '> ', '> Theorem 5 (Preservation of Gini Impurity with Binary Search Trees). We have I * G (A) = I * G ( Â), for the re-sorted dataset A by Eqn. 
(3) and for the corresponding encrypted dataset Â = {( a ⟨1⟩ 1 , y ⟨1⟩ ), • • • , ( a ⟨n⟩ 1 , y ⟨n⟩ )} from Algorithm 1.', '> ', '> Proof of Theorem 5. Our constructed binary search tree BT (Algorithm 1) maintains several samples on each node. For each node t, we have t.cipher 1 < t.right.cipher 1 and t.cipher 1 > t.left.cipher 1 . In this way, we can obtain a monotone increasing sequence I 1 , I 2 , • • • , I s by an in-order traversal of the tree BT built in Algorithm 1. Each I i for i ∈ [s] contains several samples as follows:', '209c236', '< , and this completes the proof.', '---', '> , and this completes the proof of Theorem 5.', '231,232c258,259', '< Section: D Experimental Details Experimental settings', '< We now present some details of compared methods in this work.', '---', '> Section: D Experimental Details', '> We now present some details of compared methods and experimental settings in this work.', '634d660', '< ']
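The preservation claim in Theorem 1 and Lemma 6 can be checked numerically with a small sketch. The Python code below is an illustration under stated assumptions, not the authors' Algorithm 1: it assumes distinct plaintext feature values (as the proof does), omits the binary search tree and the CKKS component entirely, and simply maps each run of equal labels in sorted feature order to one random number with c 1 < c 2 < • • • < c s. The function names `gini`, `min_split_gini`, and `toy_encrypt`, and the parameters `c_max` and `seed`, are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_j p_j^2 of a list of labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def min_split_gini(features, labels):
    """Minimum weighted Gini impurity over all thresholds on one feature.

    Falls back to the unsplit impurity if no threshold separates the data.
    """
    pairs = sorted(zip(features, labels))
    n = len(pairs)
    best = gini([y for _, y in pairs])  # the "no split" fallback
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        best = min(best, (k * gini(left) + (n - k) * gini(right)) / n)
    return best


def toy_encrypt(features, labels, c_max=2**20, seed=0):
    """Simplified stand-in for the first ciphertext component (assumption).

    Samples are sorted by feature; each maximal run of equal labels receives
    one random integer, with the integers increasing from run to run.
    """
    rng = random.Random(seed)
    order = sorted(range(len(features)), key=lambda i: features[i])
    run_of = {}
    run = 0
    for pos, i in enumerate(order):
        if pos > 0 and labels[i] != labels[order[pos - 1]]:
            run += 1  # label changed, so a new run (group) starts here
        run_of[i] = run
    ciphers = sorted(rng.sample(range(1, c_max), run + 1))
    return [ciphers[run_of[i]] for i in range(len(features))]


features = [0.1, 0.9, 1.7, 2.5, 3.3, 4.2]
labels = ["a", "a", "b", "a", "b", "b"]
encrypted = toy_encrypt(features, labels)
print(min_split_gini(features, labels), min_split_gini(encrypted, labels))
```

In this toy setting the two printed values agree because the best plaintext threshold always falls between two adjacent runs, and those boundaries are exactly the thresholds that survive the mapping. Samples inside a run share one ciphertext value, so the mapping also changes the value frequencies.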
