Keywords: LLM inference scaling, clustering, set cover, approximation algorithms, embedding geometry, inference-time efficiency, recommender systems, high-dimensional data
TL;DR: A two-stage clustering method that recasts LLM inference scaling as greedy set cover over $\alpha$-balls, enforces the similarity and attribute guardrails exactly, and runs in time linear in $n$ for $K = \Theta(n)$. It is $10$–$1000\times$ faster than baselines and cut LLM compute $50\times$ in production.
Abstract: Scaling LLM-based applications to millions of users is bottlenecked by the
inference cost and latency of modern foundation models, forcing practitioners
to trade throughput against output quality. We study input-side clustering as
a principled mechanism to reduce downstream LLM calls, and identify a gap in
existing methods: none simultaneously guarantee a user-specified minimal
within-cluster similarity, exact matching of categorical attributes, and
scalability to tens of millions of samples. We propose a two-stage algorithm
that combines Mini-batch K-Means with a greedy representative-selection step
equivalent to the Johnson–Chvátal heuristic for Set Cover over $\alpha$-balls
in embedding space. The algorithm (i) provably enforces the similarity and
attribute guardrails by construction, (ii) produces a heavily skewed
cluster-size distribution that enables aggressive tail trimming, and
(iii) runs in $O(nd + n^2 d / K)$ time with $O(nd + n^2/K^2)$ memory, linear
in $n$ for $K = \Theta(n)$. Empirically, it achieves the target minimal
similarity at $10$–$1000\times$ the speed of standard clustering baselines across
internal and public datasets. Deployed on 38 million customers for a
persona-based recommendation system, it yielded a 50-fold reduction in
downstream LLM compute and unblocked a production launch.
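
As an illustration of the second stage described in the abstract, the following is a minimal sketch of greedy representative selection over $\alpha$-balls, not the authors' implementation. It assumes the first stage (Mini-batch K-Means) has already produced one coarse partition, that similarity is cosine similarity measured between a point and its representative, and that all vectors are nonzero. The function name `greedy_alpha_cover` and its parameters are hypothetical.

```python
from __future__ import annotations

import math


def _normalize(v: list[float]) -> list[float]:
    """Scale a nonzero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def greedy_alpha_cover(
    vectors: list[list[float]],
    attrs: list[str],
    alpha: float,
) -> dict[int, list[int]]:
    """Greedy set cover over alpha-balls inside ONE coarse partition.

    Assumption: `vectors` holds the points of a single stage-one cluster.
    Returns {representative index: sorted member indices}. Every member
    shares the representative's attribute and has cosine similarity
    >= alpha to it (or is the representative itself).
    """
    unit = [_normalize(v) for v in vectors]
    n = len(unit)

    # Ball of candidate i: points with the same categorical attribute and
    # cosine similarity >= alpha. A point always covers itself.
    cover: list[set[int]] = []
    for i in range(n):
        ball = {i}
        for j in range(n):
            if j == i or attrs[i] != attrs[j]:
                continue
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim >= alpha:
                ball.add(j)
        cover.append(ball)

    # Greedy step: repeatedly pick the ball covering the most uncovered
    # points (ties broken by lowest index) until everything is covered.
    uncovered = set(range(n))
    clusters: dict[int, list[int]] = {}
    while uncovered:
        best = max(range(n), key=lambda i: (len(cover[i] & uncovered), -i))
        members = cover[best] & uncovered
        clusters[best] = sorted(members)
        uncovered -= members
    return clusters


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0], [0.1, 0.99]]
    labels = ["a", "a", "b", "a", "a"]
    print(greedy_alpha_cover(points, labels, alpha=0.95))
    # prints {0: [0, 1], 3: [3, 4], 2: [2]}
```

In this toy run, point 2 is geometrically close to points 0 and 1 but carries a different attribute, so it ends up as a singleton, which is the kind of small tail cluster the abstract suggests can be trimmed. The sketch builds all pairwise similarities within the partition, which is quadratic in the partition size; keeping partitions small in stage one is what makes that cost manageable.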
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 78