Abstract: Data heterogeneity across multiple sources is common in real-world machine learning (ML) settings. Although many methods focus on enabling a single model to handle diverse data, real-world markets often comprise multiple competing ML providers. In this paper, we propose a game-theoretic framework, the Heterogeneous Data Game, to analyze how such providers compete across heterogeneous data sources. We investigate the resulting pure Nash equilibria (PNE), showing that a PNE may fail to exist, may be homogeneous (all providers converge on the same model), or may be heterogeneous (providers specialize in distinct data sources). Our analysis spans monopolistic, duopolistic, and more general markets, illustrating how factors such as the "temperature" of data-source choice models and the dominance of certain data sources shape equilibrium outcomes. We offer theoretical insights into both homogeneous and heterogeneous PNEs, which can inform regulatory policies and practical strategies for competitive ML marketplaces.
Lay Summary: In the real world, data used to train machine learning (ML) models often comes from different sources, like hospitals, cities, or user groups, each with its own unique characteristics. Yet, most research assumes that a single model serves all users equally. This overlooks how, in practice, multiple companies or institutions compete to offer ML models tailored to different users.
We study how such competition plays out when data is heterogeneous. Using tools from game theory, we model how providers choose what kind of model to offer, and how users decide which model to adopt based on performance. We identify conditions under which competing providers end up offering the same solution and when they specialize to serve different data sources.
Our results show that market diversity depends not only on the data itself, but also on how users choose models and how competitive the market is. In particular, dominant data sources often attract most providers, leaving others underserved. These insights can help platform designers and policymakers build ML ecosystems that are both efficient and equitable.
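As a rough illustration of the setting described above, the following Python sketch enumerates pure Nash equilibria in a toy market. It is not the paper's formal model; everything in it is an assumption made for illustration: each provider picks one of a finite set of models, each data source splits its demand across providers by a softmax over negative loss with temperature `tau`, and a provider's utility is its total demand weighted by the size of each source. The names `market_shares`, `utilities`, and `pure_nash_equilibria` are hypothetical.

```python
import itertools
import math


def market_shares(losses, tau):
    """Softmax choice (assumed): lower loss attracts a larger share of one source."""
    weights = [math.exp(-loss / tau) for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]


def utilities(profile, loss, source_weights, tau):
    """Utility of each provider: demand captured, weighted by source size.

    profile[i] is the model index chosen by provider i;
    loss[s][k] is the loss of model s on data source k (toy values).
    """
    n = len(profile)
    u = [0.0] * n
    for k, weight in enumerate(source_weights):
        shares = market_shares([loss[s][k] for s in profile], tau)
        for i in range(n):
            u[i] += weight * shares[i]
    return u


def pure_nash_equilibria(n_providers, loss, source_weights, tau, tol=1e-12):
    """Brute-force search: keep profiles where no unilateral deviation helps."""
    n_models = len(loss)
    equilibria = []
    for profile in itertools.product(range(n_models), repeat=n_providers):
        base = utilities(profile, loss, source_weights, tau)
        stable = True
        for i in range(n_providers):
            for s in range(n_models):
                if s == profile[i]:
                    continue
                deviated = profile[:i] + (s,) + profile[i + 1:]
                if utilities(deviated, loss, source_weights, tau)[i] > base[i] + tol:
                    stable = False
        if stable:
            equilibria.append(profile)
    return equilibria


# Toy example (assumed numbers): model 0 is tailored to source 0, model 1 to source 1.
toy_loss = [[0.0, 1.0], [1.0, 0.0]]
duopoly = pure_nash_equilibria(2, toy_loss, [0.9, 0.1], tau=1.0)
triopoly = pure_nash_equilibria(3, toy_loss, [0.6, 0.4], tau=0.1)
```

In this toy setup, two providers facing a strongly dominant source both choose the model tailored to it, while adding a third provider at a low temperature leads one provider to specialize in the smaller source. The brute-force search is exponential in the number of providers, so it is only suitable for small examples like these.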
Primary Area: Theory->Game Theory
Keywords: Heterogeneous Data Game, Model Competition, Nash Equilibrium
Submission Number: 3778