Efficient Clustering with Provable Guardrails for LLM Inference at Scale

Published: 29 May 2026, Last Modified: 29 May 2026. Venue: HiLD at ICML 2026 (Poster). License: CC BY 4.0
Keywords: LLM inference scaling, clustering, set cover, approximation algorithms, embedding geometry, inference-time efficiency, recommender systems, high-dimensional data
TL;DR: A two-stage clustering algorithm that recasts LLM inference scaling as greedy set cover over $\alpha$-balls, with exact similarity and attribute guardrails and runtime linear in $n$ for $K = \Theta(n)$. It runs $10$–$1000\times$ faster than baselines and cut LLM compute $50\times$ in production.
Abstract: Scaling LLM-based applications to millions of users is bottlenecked by the inference cost and latency of modern foundation models, forcing practitioners to trade throughput against output quality. We study input-side clustering as a principled mechanism to reduce downstream LLM calls, and identify a gap in existing methods: none simultaneously guarantees a user-specified minimal within-cluster similarity, exact matching of categorical attributes, and scalability to tens of millions of samples. We propose a two-stage algorithm that combines Mini-batch K-Means with a greedy representative-selection step equivalent to the Johnson–Chvátal heuristic for Set Cover over $\alpha$-balls in embedding space. The algorithm (i) provably enforces the similarity and attribute guardrails by construction, (ii) produces a heavily skewed cluster-size distribution that enables aggressive tail trimming, and (iii) runs in $O(nd + n^2 d / K)$ time with $O(nd + n^2/K^2)$ memory, which is linear in $n$ for $K = \Theta(n)$. Empirically, it reaches the target minimal similarity $10$–$1000\times$ faster than standard clustering baselines across internal and public datasets. Deployed for 38 million customers in a persona-based recommendation system, it yielded a 50-fold reduction in downstream LLM compute and unblocked a production launch.
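The greedy representative-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity on unit-normalized embeddings, a single categorical attribute per point, and that the Mini-batch K-Means stage has already produced one coarse partition (the input here). The names `normalize` and `greedy_alpha_cover` are hypothetical, and the guardrail shown is member-to-representative similarity of at least `alpha`.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def greedy_alpha_cover(embeddings, attributes, alpha):
    """Greedy set cover over alpha-balls within one coarse partition (sketch).

    Returns a dict mapping each chosen representative index to the sorted
    list of point indices it covers.
    """
    unit = [normalize(v) for v in embeddings]
    n = len(unit)

    # balls[i] = points that i may represent: same attribute and
    # cosine similarity >= alpha. A point always belongs to its own ball.
    balls = []
    for i in range(n):
        ball = set()
        for j in range(n):
            if attributes[i] != attributes[j]:
                continue
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if j == i or sim >= alpha:
                ball.add(j)
        balls.append(ball)

    uncovered = set(range(n))
    clusters = {}
    while uncovered:
        # Greedy rule: pick the ball covering the most uncovered points,
        # breaking ties in favor of the smallest index.
        best = max(range(n), key=lambda i: (len(balls[i] & uncovered), -i))
        members = balls[best] & uncovered
        clusters[best] = sorted(members)
        uncovered -= members
    return clusters


# Toy partition: two tight groups, plus one point with a different attribute.
points = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0], [0.1, 0.99]]
attrs = ["a", "a", "b", "a", "a"]
clusters = greedy_alpha_cover(points, attrs, alpha=0.95)
```

Because a point is assigned only to a representative whose ball contains it, both guardrails hold by construction in this sketch. The all-pairs loop costs $O(m^2 d)$ time and $O(m^2)$ memory for a partition of $m$ points; assuming $K$ balanced partitions of size $n/K$, that totals $O(n^2 d / K)$ time and $O(n^2 / K^2)$ memory per partition, consistent with the bounds quoted in the abstract.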
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 78