Competition over data: how does data purchase affect users?

Yongchan Kwon; Tony A Ginart; James Zou

Published: 14 Nov 2022, Last Modified: 17 Sept 2024. Accepted by TMLR. CC BY 4.0.

Abstract: As competition among machine learning (ML) predictors is widespread in practice, it becomes increasingly important to understand the impact and biases arising from such competition. One critical aspect of ML competition is that ML predictors are constantly updated by acquiring additional data during the competition. Although this active data acquisition can substantially affect the overall competition environment, it has not been well studied. In this paper, we study what happens when ML predictors can purchase additional data during the competition. We introduce a new environment in which ML predictors use active learning algorithms to acquire labeled data within their budgets while competing against each other. We empirically show that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience (i.e., the accuracy of the predictor selected by each user) can decrease even as the individual predictors get better. We demonstrate that this phenomenon arises naturally from a trade-off: competition pushes each predictor to specialize in a subset of the population, while data purchase makes predictors more uniform. With comprehensive experiments, we show that our findings are robust to different modeling assumptions.

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: In the revised paper, we have incorporated the reviewers’ helpful suggestions. All revisions are marked in red. The main changes are:
- We add a detailed discussion of the feasibility of the assumptions in Theorem 1.
- We include more realistic competition environments by relaxing model assumptions (Appendix C).
- We move the related work section to the end of Section 1.
(Updated November 2nd)
- We add a discussion in the Conclusion section.
- We add a link to the Python-based implementation code.

Code: https://github.com/ykwon0407/data_purchase_in_comp

Assigned Action Editor: ~Chicheng_Zhang1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 264
