Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Published: 23 Sept 2025, Last Modified: 09 Oct 2025, Venue: RegML 2025 Poster, License: CC BY 4.0
Keywords: Network Analysis, ML Ecosystem, Open-source ML, Platform Governance
Abstract: Foundation models are resource-intensive but broadly capable. They become specialized for downstream tasks through transformations such as fine-tuning, adaptation, and quantization. While these processes are often examined through individual evaluations or case studies, little work has explored their collective dynamics and interactions at scale. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent) reveals sprawling fine-tuning lineages that vary widely in size and structure. Adopting an evolutionary biology lens, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance: their genetic markers and traits overlap more when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction because mutations are fast and directed, such that two "sibling" models tend to be more similar than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem that are potentially relevant for policymakers and regulators: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multilingual compatibility towards English-only compatibility; and model cards shrink in length and standardize by turning, more often, to templates and automatically generated text. This work shows how platform tools shape derivative development.
The structured dataset, which traces model lineage at a fine-grained level, enables deeper analysis of how models emerge and interact, offering new leverage points for policy and oversight.
Submission Number: 14
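
Illustrative sketch (not from the paper): the abstract's comparison of sibling similarity against parent/child similarity can be pictured with a toy family tree. The code below assumes each model's traits are a set of tags and uses Jaccard overlap as the similarity measure; the measure, the function names, the model names, and the tags are all hypothetical choices made for illustration, not the authors' method or data.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard overlap of two trait sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def family_similarity(parent_of, traits):
    """Return (mean parent/child similarity, mean sibling similarity).

    parent_of maps each derived model to its base model (hypothetical schema).
    traits maps every model name to a set of trait tags.
    """
    # Group children under their shared parent to enumerate sibling pairs.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    parent_child = [jaccard(traits[c], traits[p]) for c, p in parent_of.items()]
    sibling = [
        jaccard(traits[a], traits[b])
        for kids in children.values()
        for a, b in combinations(sorted(kids), 2)
    ]
    return (
        mean(parent_child) if parent_child else None,
        mean(sibling) if sibling else None,
    )


# Invented toy data: two fine-tunes of one base, plus a quantized grandchild.
parent_of = {"ft-a": "base", "ft-b": "base", "ft-a-q4": "ft-a"}
traits = {
    "base": {"license:restricted", "lang:en", "lang:fr"},
    "ft-a": {"license:apache-2.0", "lang:en"},
    "ft-b": {"license:apache-2.0", "lang:en"},
    "ft-a-q4": {"license:apache-2.0", "lang:en", "quantized"},
}

pc_sim, sib_sim = family_similarity(parent_of, traits)
print(f"parent/child: {pc_sim:.3f}, siblings: {sib_sim:.3f}")
```

In this invented example both fine-tunes change the license tag and drop a language in the same direction, so the siblings overlap more with each other than either does with the base, which is the pattern the abstract describes as fast, directed mutation.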