Keywords: Large Language Model Serving, Efficient Serving Systems, Decentralized LLM Serving, Distributed LLMs
TL;DR: We propose WWW.Serve, a fully decentralized framework for trustless and collaborative LLM serving, which improves efficiency and scalability and lowers latency while preserving privacy.
Abstract: Large language model (LLM) services are mostly centralized, causing inherent scalability bottlenecks and leaving substantial scattered GPU resources underutilized. Decentralized serving could potentially address these limitations, but it imposes challenges of **trust**, as the identity and behavior of participants cannot be reliably regulated, and **fairness**, i.e., how to maximize the benefit of all resource providers so as to sustain engagement. Existing decentralized frameworks **predominantly emphasize the rights and protections of users and the cooperative aspects among GPU providers** while **overlooking the inherent competitive dynamics**, imposing substantial constraints on GPU providers, such as requiring them to accept excessive platform-level oversight and to execute all assigned requests with fixed software stacks on fixed hardware configurations. We argue that such assumptions are unrealistic in real-world decentralized environments. To this end, we propose **WWW.Serve**, a decentralized framework for interconnecting LLM services worldwide. It preserves the flexibility of service providers, allowing them to decide **when, under what policies, and with what resources** they join the decentralized network, while further ensuring their anonymity. In terms of efficiency, WWW.Serve supports self-organizing request dispatch, enabling the network to allocate requests autonomously without centralized coordination. Three key designs are integrated: a blockchain-inspired credit system for trustless collaboration, gossip-driven peer synchronization for flexible participation, and a duel-and-judge mechanism for robust contributor evaluation. Empirically, we show that WWW.Serve incentivizes higher-quality services to earn greater profit, while improving global SLO (service-level objective) attainment by up to $1.5\times$ and lowering latency by 27.6\%. Its performance approaches, and in some cases surpasses, centralized scheduling, while fully preserving the benefits of decentralization. These results highlight WWW.Serve as a promising foundation for real-world, decentralized LLM serving.
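The submission itself is not reproduced on this page, so as a rough illustration of how gossip-driven peer synchronization can support self-organizing, coordinator-free dispatch of the kind the abstract describes, here is a minimal Python sketch. All names (`GossipNode`, `PeerState`, `pick_provider`) and the credit-over-load scoring rule are hypothetical illustrations, not WWW.Serve's actual design.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class PeerState:
    """Advertised state of one provider node (fields are illustrative)."""
    node_id: str
    queue_len: int    # current request backlog
    credit: float     # accumulated reputation credit
    timestamp: float  # when this state was published


class GossipNode:
    """Minimal push-style gossip: each round, a node sends its local view
    of the network to a few random peers; receivers keep the freshest
    entry per node. Views converge in O(log N) rounds with no central
    coordinator."""

    def __init__(self, node_id: str, fanout: int = 3):
        self.node_id = node_id
        self.fanout = fanout
        self.view: dict[str, PeerState] = {}

    def publish(self, queue_len: int, credit: float) -> None:
        # Refresh our own entry before gossiping it outward.
        self.view[self.node_id] = PeerState(
            self.node_id, queue_len, credit, time.time())

    def gossip_round(self, peers: list["GossipNode"]) -> None:
        for peer in random.sample(peers, min(self.fanout, len(peers))):
            peer.receive(dict(self.view))

    def receive(self, remote_view: dict[str, PeerState]) -> None:
        # Per node, keep whichever entry carries the newer timestamp.
        for nid, state in remote_view.items():
            mine = self.view.get(nid)
            if mine is None or state.timestamp > mine.timestamp:
                self.view[nid] = state

    def pick_provider(self) -> str:
        # Self-organizing dispatch (assumed scoring): favor lightly
        # loaded, high-credit peers using only the local view.
        best = max(self.view.values(),
                   key=lambda s: s.credit / (1 + s.queue_len))
        return best.node_id


# Toy run: three providers converge on a shared view, then any node
# can route a request locally without a central scheduler.
nodes = [GossipNode(f"n{i}") for i in range(3)]
for n in nodes:
    n.publish(queue_len=random.randint(0, 10), credit=random.random())
for _ in range(4):
    for n in nodes:
        n.gossip_round([p for p in nodes if p is not n])
print(nodes[0].pick_provider())
```

The point of the sketch is that each node decides where to route from its own (eventually consistent) view, which is what lets providers join and leave freely: a departing node simply stops refreshing its entry, and stale entries lose out on timestamp comparisons.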
Primary Area: foundation or frontier models, including LLMs
Submission Number: 27