Abstract: Due to stricter data management regulations such as the General Data Protection Regulation (GDPR), the traditional production mode of machine learning services is shifting to federated learning, a paradigm that allows multiple data providers to collaboratively train a joint model while keeping their data local. A key enabler for the practical adoption of federated learning is how to allocate the profit earned by the joint model among the data providers. Fair profit allocation requires a metric that quantifies the contribution of each data provider to the joint model. The Shapley value is a classical concept in cooperative game theory that assigns a unique distribution (among the players) of the total surplus generated by the coalition of all players, and it has been used for data valuation in machine learning services. However, prior Shapley value based data valuation schemes either do not apply to federated learning or involve extra model training, which leads to high cost. In this paper, given n data providers with data sets D_1, D_2, ⋯, D_n, a federated learning algorithm A, and a standard test set T, we propose the contribution index, a new Shapley value based metric fit for assessing the contribution of each data provider to the joint model trained by federated learning. The contribution index shares the same properties as the Shapley value. However, direct calculation of the contribution index is time consuming, since a large number of joint models on different combinations of the data sets must be trained and evaluated. To solve this problem, we propose two gradient based methods.
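To make the cost of the exact calculation concrete, a minimal sketch of a Shapley-style contribution index is shown below. The function names and the `utility` callback are illustrative assumptions, not the paper's API: `utility(S)` stands for training a federated model on the union of the data sets of the providers in S and evaluating it on the test set T, so the exact computation requires on the order of 2^n trainings.

```python
from itertools import combinations
from math import factorial

def contribution_index(n, utility):
    """Exact Shapley-style contribution index for n data providers.

    `utility` maps a frozenset of provider indices to the test-set
    performance of a model trained on the union of their data sets.
    Evaluating it exactly needs 2^n model trainings, which is what
    motivates the gradient-based approximations.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                S = frozenset(subset)
                # Standard Shapley weight |S|!(n-|S|-1)!/n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (utility(S | {i}) - utility(S))
    return phi
```

For an additive utility (each provider contributes a fixed amount of performance), the contribution index recovers exactly those amounts, matching the Shapley value's fairness properties.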
The idea is to approximately reconstruct the models on different combinations of the data sets from the intermediate results of the federated learning training process, so as to avoid extra training. The first method reconstructs models by updating the initial global model with the gradients from different rounds of federated learning, and then calculates the contribution index from the performance of these reconstructed models. The second method calculates a contribution index in each round by updating the global model of the previous round with the gradients of the current round; the contribution indexes of multiple rounds are then combined with elaborated weights to obtain the final result. We conduct extensive experiments on the MNIST data set under different settings. The results demonstrate that the proposed methods approximate the exact contribution index effectively and achieve a speedup of 2x to 100x compared with the exact calculation and other baselines extended from existing work.
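The reconstruction step of the first method can be sketched as follows. This is a simplified illustration under assumed names (`w0`, `client_updates`, `reconstruct_models` are not from the paper): models are flat parameter vectors, and each provider's per-round update is replayed, so the model for a subset S is approximated by applying only S's aggregated updates to the initial global model, with no retraining.

```python
import numpy as np

def reconstruct_models(w0, client_updates, subsets):
    """Approximately reconstruct the model for each provider subset S
    without retraining, by replaying only S's per-round updates.

    `w0` is the initial global model (a flat parameter vector);
    `client_updates[t][i]` is provider i's update (e.g. averaged
    gradient/delta) recorded in round t of federated learning.
    """
    models = {}
    for S in subsets:
        w = w0.copy()
        for round_updates in client_updates:
            if S:
                # FedAvg-style aggregation restricted to providers in S
                w += np.mean([round_updates[i] for i in S], axis=0)
        models[frozenset(S)] = w
    return models
```

Evaluating each reconstructed model on the test set then yields the utilities needed for the contribution index, replacing 2^n trainings with one training run plus cheap replays.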