Submission Type: Recently published work (link only)
Keywords: Decision Trees, k-means, Gradient Boosting, Supervised ML
TL;DR: We replace GBDT quantile histograms with quantile-initialized k-means bins, prove that this maximizes the worst-case explained variance for L-Lipschitz targets, and show broad gains on OpenML and synthetic tasks, especially on skewed data and in tight-bin regimes.
Abstract: Modern Gradient Boosted Decision Trees (GBDTs) accelerate split finding with histogram-based binning, which reduces complexity from $O(N\log N)$ to $O(N)$ by aggregating gradients into fixed-size bins. However, the predominant quantile binning strategy, designed to distribute data points evenly among bins, may overlook critical boundary values that could enhance predictive performance. In this work, we consider a novel approach that replaces quantile binning with a $k$-means discretizer initialized with quantile bins, and justify the swap with a proof that, for any $L$-Lipschitz target function, $k$-means maximizes the worst-case explained variance of $Y$ obtained when all values in a given bin are treated as equivalent. We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets, $k$-means shows no statistically significant losses at the 5% level and wins in three cases, most strikingly a 55% MSE drop on one particularly skewed dataset, even though its mean reciprocal rank (MRR) is slightly lower (0.65 vs. 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs. 0.68), with gaps of $\leq$0.2 percentage points. Synthetic experiments confirm consistently large MSE gains, typically above 20% and rising to 90% as outlier magnitude increases or the bin budget drops. We find that $k$-means keeps error on par with exhaustive (no-binning) splitting when extra cuts add little value, yet still recovers key split points that quantile binning overlooks. As such, we advocate for a built-in bin_method=k-means flag, especially in regression tasks and in tight-budget settings such as the 32- to 64-bin GPU regime: it is a "safe default" with large upside, yet adds only a one-off, cacheable overhead ($\approx$3.5 s per feature to bin 10M rows on one Apple M1 thread).
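For concreteness, the sketch below shows one way the proposed swap could be implemented; it is our own minimal illustration, not the authors' released code, and the names kmeans_bin_edges, n_bins, and n_iter are assumptions. It runs 1-D $k$-means (Lloyd's algorithm) with centroids seeded at quantile-bin midpoints, then emits bin edges at midpoints between adjacent converged centroids. The intuition behind the proof claim, in our gloss rather than the paper's exact statement, is that for an $L$-Lipschitz target, $\mathrm{Var}(Y \mid X \in B) \leq L^2\,\mathrm{Var}(X \mid X \in B)$ for each bin $B$, so minimizing the $k$-means objective (total within-bin variance of $X$) tightens the worst-case bound on the unexplained variance of $Y$.

```python
# Minimal sketch of quantile-initialized 1-D k-means binning.
# Illustrative only: function and parameter names are our assumptions.
import numpy as np

def kmeans_bin_edges(x: np.ndarray, n_bins: int = 32, n_iter: int = 100) -> np.ndarray:
    """Return the n_bins - 1 interior bin edges for a 1-D feature x."""
    x = np.asarray(x, dtype=np.float64)
    # Quantile initialization: centroids start at the midpoints of
    # equal-frequency (quantile) bins.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    centers = 0.5 * (edges[:-1] + edges[1:])
    for _ in range(n_iter):
        # 1-D Lloyd step: with sorted centroids, nearest-centroid assignment
        # is a binary search against midpoints between adjacent centroids.
        cuts = 0.5 * (centers[:-1] + centers[1:])
        assign = np.searchsorted(cuts, x)
        new_centers = centers.copy()  # empty clusters keep their old centroid
        for k in range(n_bins):
            members = x[assign == k]
            if members.size:
                new_centers[k] = members.mean()
        new_centers.sort()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final bin boundaries: midpoints between adjacent converged centroids.
    return 0.5 * (centers[:-1] + centers[1:])

# Usage: discretize a skewed feature, then feed the bin indices to histogram
# split finding exactly as one would with quantile bins.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)
edges = kmeans_bin_edges(x, n_bins=32)
bin_index = np.searchsorted(edges, x)
```

Because the edges depend only on the feature values, they can be computed once and cached, matching the one-off overhead figure quoted above.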
Published Paper Link: https://openreview.net/forum?id=UaTrLLspJa
Relevance Comments: GBDTs are core to tabular ML. Our k-means binning improves split quality on skewed data and in low-bin regimes, with theory and broad empirical evidence, improving on a staple method and fitting AITD’s “predictive ML” and “methods & benchmarks” themes.
Published Venue And Year: TMLR 2025
Submission Number: 1