Learning Interpretable Characteristic Kernels via Decision Forests

10 May 2023 (modified: 12 Dec 2023) · Submitted to NeurIPS 2023
Keywords: kernel learning, random forest, hypothesis testing
TL;DR: Decision forests induce a characteristic kernel, and these kernels yield an empirically powerful and interpretable hypothesis test.
Abstract:

Decision forests are popular tools for classification and regression. These forests naturally generate proximity matrices that measure how frequently pairs of observations land in the same leaf node. While other kernels are known to have strong theoretical properties, such as being characteristic, no comparable result exists for decision-forest-based kernels. In addition, existing approaches to independence and k-sample testing may require infeasibly large sample sizes and are not interpretable. In this manuscript, we prove that the decision-forest-induced proximity is a characteristic kernel, enabling consistent independence and k-sample testing via decision forests. We leverage this result to introduce the kernel mean embedding random forest (KMERF), a valid and consistent method for independence and k-sample testing. Our extensive simulations demonstrate that KMERF outperforms other tests across a variety of independence and two-sample testing scenarios. Additionally, the test is interpretable, and its key features are readily discernible. This work therefore demonstrates the existence of a test that is both more powerful and more interpretable than existing methods, contrary to the conventional wisdom that one must trade off between the two.
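As a minimal sketch of the proximity kernel the abstract describes (using scikit-learn's `RandomForestClassifier` and synthetic data, which the paper does not prescribe), one can count, for each pair of observations, the fraction of trees in which they fall into the same leaf:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaves[i, t] = index of the leaf that sample i reaches in tree t
leaves = forest.apply(X)

# Proximity kernel: fraction of trees in which two samples share a leaf
K = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# K is symmetric with unit diagonal: every sample shares its own leaf
assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
```

The resulting matrix `K` is the forest-induced proximity that the paper proves to be a characteristic kernel; a kernel-based test statistic (e.g., an HSIC-style statistic) can then be computed from it.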

Supplementary Material: zip
Submission Number: 7420
