Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores

Elliot L Epstein; Rajat Vadiraj Dwaraknath; John Winnicki

Published: 30 May 2026, Last Modified: 01 Jun 2026. Venue: SPIGM @ ICML (Poster). License: CC BY 4.0

Keywords: Kernel Density Estimation, Score Debiasing, Tensor Cores, SD-KDE, GPU Acceleration

TL;DR: By exploiting GPU Tensor Cores, we make score-debiased kernel density estimation practical at large scale.

Abstract: Score-debiased kernel density estimation (SD-KDE) achieves better asymptotic convergence rates than classical KDE, but its reliance on an empirical score has made it significantly slower in practice. We show that reordering the SD-KDE computation to expose matrix-multiplication structure allows Tensor Cores to accelerate the GPU implementation. On a 32k-sample, 16-dimensional problem, our approach runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample, 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
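To illustrate what "exposing matrix-multiplication structure" can look like, the following is a minimal CPU sketch in NumPy, not the authors' implementation. It assumes a Gaussian kernel, a single bandwidth `h` for both the score and the final density, and a debiasing step that shifts each sample by `(h**2 / 2)` times the empirical score; the function names and these choices are illustrative assumptions and may differ from the paper. The point is that the pairwise squared distances and the kernel-weighted sums both reduce to dense matrix products, which is the kind of operation Tensor Cores accelerate.

```python
import numpy as np


def gaussian_kernel_matrix(queries, samples, h):
    """Unnormalized Gaussian kernel K[i, j] = exp(-||q_i - x_j||^2 / (2 h^2)).

    The squared distances use the expansion
    ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x,
    so the dominant cost is the single matrix product queries @ samples.T.
    """
    q_sq = np.sum(queries**2, axis=1)[:, None]
    x_sq = np.sum(samples**2, axis=1)[None, :]
    sq_dist = q_sq + x_sq - 2.0 * (queries @ samples.T)
    # Clamp tiny negative values caused by floating-point cancellation.
    np.maximum(sq_dist, 0.0, out=sq_dist)
    return np.exp(-sq_dist / (2.0 * h**2))


def empirical_score(points, samples, h):
    """Gradient of the log of a Gaussian KDE, evaluated at each point.

    grad log p(x) = (sum_j K_j x_j / sum_j K_j - x) / h^2,
    where the weighted sum over samples is the matrix product K @ samples.
    """
    K = gaussian_kernel_matrix(points, samples, h)
    weighted_mean = (K @ samples) / K.sum(axis=1, keepdims=True)
    return (weighted_mean - points) / h**2


def sd_kde(queries, samples, h):
    """Score-debiased KDE sketch (assumed step size h^2 / 2, shared bandwidth)."""
    # Assumption: move each sample along the empirical score before the KDE.
    shifted = samples + 0.5 * h**2 * empirical_score(samples, samples, h)
    d = samples.shape[1]
    K = gaussian_kernel_matrix(queries, shifted, h)
    norm = (2.0 * np.pi * h**2) ** (d / 2.0)
    return K.mean(axis=1) / norm
```

Calling `sd_kde(queries, samples, h)` with arrays of shape `(m, d)` and `(n, d)` returns `m` density values. A GPU version would need to tile these products rather than materialize the full kernel matrix, which this sketch does not attempt.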

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 203
