Finding Landmarks of Covariate Shift with the Max-Sliced Kernel Wasserstein Distance

27 Jan 2026 (modified: 24 Apr 2026) · Rejected by TMLR · CC BY 4.0
Abstract: To detect distribution shifts caused by localized changes, we propose an interpretable kernel-based max-sliced Wasserstein divergence, which is computationally efficient for two-sample testing. The max landmark kernel Wasserstein distance (MLW) seeks a single data point whose kernel embedding acts as a slice of the reproducing kernel Hilbert space, such that the two samples' resulting projections (kernel evaluations between each point and the landmark) have maximal Wasserstein distance. This landmark, or multiple landmarks chosen via a greedy algorithm, provides an interpretation of localized divergences. We investigate MLW's ability to detect and localize distribution shifts corresponding to over- or under-representation of one class. Results on the MNIST and CIFAR-10 datasets demonstrate MLW's competitive statistical power and accurate landmark selection. Using the mean embedding (ME) test statistic with multiple MLW landmarks enables state-of-the-art power on the Higgs dataset.
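Because the maximization over landmarks is discrete, checking all m+n pooled points as candidates, the single-landmark statistic can be computed directly. The following is a minimal sketch of that computation, assuming a Gaussian kernel; the function names, bandwidth parameter, and use of SciPy's 1D Wasserstein distance are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the single-landmark MLW statistic (Gaussian kernel assumed;
# names here are illustrative, not the paper's code).
import numpy as np
from scipy.stats import wasserstein_distance

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def max_landmark_wasserstein(X, Y, bandwidth=1.0):
    """Search all m + n pooled points as candidate landmarks and return the
    one whose kernel projection maximizes the 1D Wasserstein distance."""
    candidates = np.vstack([X, Y])                   # discrete search space
    KX = gaussian_kernel(X, candidates, bandwidth)   # (m, m+n) projections of X
    KY = gaussian_kernel(Y, candidates, bandwidth)   # (n, m+n) projections of Y
    # 1D Wasserstein distance between the projected samples, per candidate
    dists = np.array([wasserstein_distance(KX[:, j], KY[:, j])
                      for j in range(candidates.shape[0])])
    best = int(np.argmax(dists))
    return candidates[best], dists[best]
```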
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=SxKAl2K8N9
Changes Since Last Submission: The previous reviews were taken into account to thoroughly revise the manuscript, clarifying key portions, adding additional baselines, and extending the proposed methodology. The introduction has been clarified and shortened. The section on Preliminaries and Prior Work has been condensed, removing background material and leaving only the essential equations reused in the proposed methodology. The representer theorem for the kernel max-sliced Wasserstein distance has been removed. Notably, a single-landmark form of the mean embedding (ME) test statistic suggested in prior work has been added; this approach (ME1) now stands as a baseline for comparison.

The proposed methodology has been clarified so that it is apparent that, for two samples, the optimization is a discrete maximization that requires checking m+n points. A Rademacher complexity analysis has been added, along with a note that the sample complexity is dimension-free, which follows from recent work by Boedihardjo. A new subsection on the multiple max landmark Wasserstein distance (MMLW) has been added, which builds on prior work by Kim et al., using a log-determinant anti-concentration penalty to create an optimization that finds a diverse set of landmarks. A greedy approach (with approximation guarantees) to solve it is stated. The number of landmarks can be selected by a heuristic knee method applied to the cumulative divergence; alternatively, the asymptotic power of the ME test can be used (and note that the ME test can be applied with the landmarks identified by MMLW). (The ME test statistic is not submodular, so a greedy algorithm does not provide approximation guarantees.)

Examples using 1D toy data, consisting of a standard normal with outlier modes, have been added to the experiments to illustrate the method's operation, and power results show how the proposed methods compare to MMD in the 1D case. The first experiment (one MNIST class over-represented) was expanded to include all 10 digits when upsampling; the results show that MMD is actually optimal on average, but MLW and ME1 are competitive. A table reporting run time has been added, as well as a table showing the landmark accuracy for MLW and ME1, which shows that MLW is superior to the ME1 baseline. ME1 has also been added as a baseline for the second experiment. Again, for power, ME1 is competitive with MLW, actually performing best for the BBSD representation; however, the landmarks selected by MLW are more accurate.

Experiments on the Higgs dataset have been added. The results show that the power of MLW grows with increasing sample size but is not competitive with ME, deep kernel methods, or classifier-based two-sample tests. The multiple-landmark case (MMLW) is competitive with ME (we also show that knee-based selection of the number of landmarks performs worse than a fixed number but better than MLW). Interestingly, ME1 fails on the Higgs dataset. Finally, the ME test with the MMLW landmarks is competitive with the state of the art (AutoML, classifier-based two-sample tests, and MMD-D). On this larger Higgs dataset, the runtimes show that MLW with the split permutation test and MMLW are very scalable, especially since no training is necessary.
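For illustration, the following is a minimal sketch of one way the greedy MMLW selection described above could look. The exact objective (here, summed per-landmark divergences plus a weighted log-determinant of the selected landmarks' kernel matrix), the penalty weight lam, and the helper names are assumptions rather than the paper's precise formulation.

```python
# Hedged sketch of greedy landmark selection with a log-determinant
# anti-concentration penalty; the paper's exact objective may differ.
import numpy as np

def greedy_mmlw(candidates, per_landmark_divergence, kernel, k=5, lam=0.1):
    """Greedily pick k landmarks, trading off per-landmark Wasserstein
    divergence against diversity (log-det of the landmarks' Gram matrix).

    candidates: (p, d) pooled points; per_landmark_divergence: (p,) array of
    precomputed MLW values per candidate; kernel: pairwise kernel function.
    """
    K = kernel(candidates, candidates)          # (p, p) Gram matrix
    selected = []
    for _ in range(k):
        best_score, best_j = -np.inf, None
        for j in range(len(candidates)):
            if j in selected:
                continue
            idx = selected + [j]
            sub = K[np.ix_(idx, idx)] + 1e-8 * np.eye(len(idx))  # stabilize
            _, logdet = np.linalg.slogdet(sub)
            score = per_landmark_divergence[idx].sum() + lam * logdet
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected
```

A log-determinant penalty of this kind is submodular, which is one standard route to greedy approximation guarantees of the sort mentioned in the notes above.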
Assigned Action Editor: ~Masashi_Sugiyama1
Submission Number: 7190