Title: Kernelized Cumulants: Beyond Kernel Mean Embeddings

Abstract: In $\mathbb{R}^d$, it is well-known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.

Section: Introduction
While moments are widely used as all-purpose statistics for random variables, cumulants often offer more favorable properties, particularly in terms of interpretability, statistical efficiency, and characterization of independence. For instance, the variance, a key measure of fluctuation, is a cumulant (specifically, the second cumulant) and often provides a more robust and intuitive measure of scale than the second moment, which is heavily influenced by the mean (see Appendix A for a detailed discussion). Cumulants systematically capture the novel information in the moment sequence not already described by lower-order moments. Although moments and cumulants carry equivalent information, cumulants possess several desirable properties that extend to $\mathbb{R}^d$-valued random variables (McCullagh, 2018), notably their ability to characterize distributions and statistical (in)dependence.

This paper proposes a novel framework for extending cumulants to Reproducing Kernel Hilbert Spaces (RKHSs). This extension, which we term "kernelized cumulants," leverages tools from tensor algebras and is made computationally tractable through a kernel trick. Kernelized cumulants introduce a new class of all-purpose statistics for complex data, where classical measures like the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC) emerge as specific degree-one instances of our generalized construction. We demonstrate both theoretically and empirically that incorporating higher-degree kernelized cumulants offers significant advantages, such as enhanced statistical power in hypothesis testing, with computational costs comparable to their degree-one counterparts.

Moments. Let $\gamma$ be a probability measure on $\mathbb{R}^d$ and $(X_1, \ldots, X_d) \sim \gamma$. The moments $\mu(\gamma) = (\mu^i(\gamma))_{i \in \mathbb{N}^d}$ of $\gamma$ are defined as
$$\mu^i(\gamma) := \mathbb{E}\left[X_1^{i_1} \cdots X_d^{i_d}\right] \in \mathbb{R}, \qquad (1)$$
where $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$ denotes a $d$-tuple of non-negative integers ($i_1, \ldots, i_d \geq 0$). The degree of an element $i \in \mathbb{N}^d$ is defined as $\deg(i) := i_1 + \cdots + i_d$.
For $m \in \mathbb{N}$, let $\mu^m(\gamma) := (\mu^i(\gamma))_{\deg(i)=m}$, which we refer to as the $m$-th moments of $\gamma$, with the convention that $\mu^0(\gamma) = 1$.
Cumulants. Cumulants $\kappa(\gamma) = (\kappa^i(\gamma))_{i \in \mathbb{N}^d}$ can be defined by the moment generating function as
$$\sum_{i \in \mathbb{N}^d} \kappa^i(\gamma) \frac{\theta^i}{i!} = \log \sum_{i \in \mathbb{N}^d} \mu^i(\gamma) \frac{\theta^i}{i!}, \qquad \theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d, \qquad (2)$$
where we denote $i! = i_1! \cdots i_d!$ and $\theta^i = \theta_1^{i_1} \cdots \theta_d^{i_d}$; an equivalent definition of cumulants is via a combinatorial expression of partitions (elaborated in Appendix C.1). Cumulants have several attractive properties; the following one forms our main motivation.
Theorem 1 (Characterization of distributions with cumulants on $\mathbb{R}^d$, from Proposition 1 in Jammalamadaka et al. 2006). Let $\gamma$ be a probability measure on a bounded subset of $\mathbb{R}^d$ with cumulants $\kappa(\gamma)$ and let $(X_1, \ldots, X_d) \sim \gamma$. Then
1. $\gamma \mapsto \kappa(\gamma)$ is injective.
2. $X_1, \ldots, X_d$ are jointly independent if and only if $\kappa^i(\gamma) = 0$ for all $d$-tuples of positive integers $i \in \mathbb{N}_+^d$.
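The combinatorial route to cumulants mentioned above can be illustrated in the simplest case $d = 1$. The following is a minimal sketch, assuming the classical partition formula $\kappa_m = \sum_{\pi \in P(m)} (-1)^{|\pi|-1}(|\pi|-1)! \prod_{B \in \pi} \mu_{|B|}$ for a real-valued random variable; the helper names `set_partitions`, `cumulant_from_moments` and `empirical_raw_moments` are hypothetical and not taken from the paper.

```python
from math import factorial


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (each block is a list)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Either `first` forms a block of its own ...
        yield [[first]] + part
        # ... or it joins one of the existing blocks.
        for j in range(len(part)):
            yield part[:j] + [[first] + part[j]] + part[j + 1:]


def cumulant_from_moments(raw_moments, m):
    """m-th cumulant (m >= 1) of a real random variable from its raw moments.

    raw_moments[r] must hold mu_r = E[X^r], with raw_moments[0] = 1.
    """
    total = 0.0
    for part in set_partitions(range(m)):
        b = len(part)
        coeff = (-1) ** (b - 1) * factorial(b - 1)
        prod = 1.0
        for block in part:
            prod *= raw_moments[len(block)]
        total += coeff * prod
    return total


def empirical_raw_moments(samples, max_degree):
    """Plug-in estimates of mu_0, ..., mu_max_degree."""
    n = len(samples)
    return [sum(x ** r for x in samples) / n for r in range(max_degree + 1)]
```

For instance, for the two-point sample `[0.0, 1.0]` the second cumulant is the (biased) variance 0.25 and the third cumulant vanishes by symmetry.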

Section: Moments in Hilbert spaces
Instead of directly analyzing the joint distribution of random variables $(X_1, \ldots, X_d)$ in their original product space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, it is often advantageous to map them into higher-dimensional feature spaces. This is achieved through feature maps $\Phi_i : \mathcal{X}_i \to \mathcal{H}_i$, transforming the original random variables into $\mathcal{H}_1 \times \cdots \times \mathcal{H}_d$-valued random variables $\Phi_1(X_1), \ldots, \Phi_d(X_d)$. This 'lifting' allows us to capture complex nonlinear relationships that might be intractable in the original space. Motivated by this, we first develop the theory of moments for general Hilbert-space valued random variables. For this subsection, we assume that this lifting has already occurred, and thus consider $X_i \in \mathcal{H}_i$ for $i = 1, \ldots, d$. In Section 3, we will specialize this construction to Reproducing Kernel Hilbert Spaces (RKHSs) and leverage these generalized moments (Definition 1) to define our novel kernelized cumulants.

Moments. In the finite-dimensional setting, moments are typically defined by taking expectations of products of coordinates of the random variable, as shown in (1). For the infinite-dimensional case, a coordinate-free definition is more suitable and can be elegantly formulated using tensor products. To facilitate this, we briefly recall key concepts about Hilbert spaces: for real Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, their tensor product $\mathcal{H}_1 \otimes \mathcal{H}_2$ is the Hilbert space obtained by completing the algebraic tensor product. We also denote $\mathcal{H}_1^{\otimes m} := \mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_1$ ($m$ times). Similarly, the direct sum $\mathcal{H}_1 \oplus \mathcal{H}_2$ is also a Hilbert space. The $m$-th moment of an $\mathcal{H}_1$-valued random variable $X_1$ is naturally defined as $\mathbb{E}\left[X_1^{\otimes m}\right] \in \mathcal{H}_1^{\otimes m}$, where the integral is understood in the Bochner sense. Consequently, the natural state space encompassing all moments of an $\mathcal{H}_1$-valued random variable is the tensor algebra $T_1$ over $\mathcal{H}_1$, which collects the spaces $\mathcal{H}_1^{\otimes m}$ for all $m \geq 0$, with the convention that $\mathcal{H}_1^{\otimes 0} := \mathbb{R}$. Further details on tensor products of Hilbert spaces and tensor algebras are provided in Appendix B.
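Example 2.1 ($\mathcal{H}_1 = \mathbb{R}^d$, $m = 2$). If $X_1 = (X_1^1, \ldots, X_1^d)$ is $\mathcal{H}_1 = \mathbb{R}^d$-valued, then $\mathbb{E}\left[X_1^{\otimes 2}\right] \in (\mathbb{R}^d)^{\otimes 2}$ can be identified with a $(d \times d)$-sized matrix whose $(i, j)$-th entry is $\mathbb{E}\left[X_1^i X_1^j\right]$.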
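As a small illustration of Example 2.1, the empirical second moment tensor of $\mathbb{R}^d$-valued samples can be formed as an average of outer products. This is only a sketch; the function name `second_moment_tensor` is a hypothetical choice.

```python
import numpy as np


def second_moment_tensor(samples):
    """Empirical E[X (x) X] for samples of shape (n, d), returned as a (d, d) matrix."""
    samples = np.asarray(samples, dtype=float)
    # Entry (i, j) is the average of X^i * X^j over the n samples.
    return np.einsum("ni,nj->ij", samples, samples) / samples.shape[0]
```

Equivalently, the result equals `samples.T @ samples / n`; the `einsum` form makes the tensor-product structure explicit and extends to higher $m$ by adding indices.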
Since our interest lies in the general case of an H 1 × • • • × H d -valued random variable X = (X 1 , . . . , X d ), we arrive at the following generalized definition.
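Definition 1 (Moments in Hilbert spaces). Let $\gamma$ be a probability measure on $\mathcal{H} := \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$ and let $(X_1, \ldots, X_d) \sim \gamma$. We define the multi-indexed moments as
$$\mu^i(\gamma) := \mathbb{E}\left[X_1^{\otimes i_1} \otimes \cdots \otimes X_d^{\otimes i_d}\right] \in \mathcal{H}^{\otimes i}, \qquad \mathcal{H}^{\otimes i} := \mathcal{H}_1^{\otimes i_1} \otimes \cdots \otimes \mathcal{H}_d^{\otimes i_d} \qquad (3)$$
for every $i \in \mathbb{N}^d$, provided the expectation exists. The complete moment sequence is defined as the element
$$\mu(\gamma) = (\mu^i(\gamma))_{i \in \mathbb{N}^d} \in T := T_1 \otimes \cdots \otimes T_d,$$
where $T_j$ is the tensor algebra over $\mathcal{H}_j$ (collecting the spaces $\mathcal{H}_j^{\otimes m}$, $m \geq 0$), and for any $m \in \mathbb{N}$, we refer to $\mu^m(\gamma) = (\mu^i(\gamma))_{i \in \mathbb{N}^d : \deg(i) = m}$ as the $m$-th moments of $\gamma$.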
It is important to note that in the special case where $\mathcal{H}_i = \mathbb{R}$ for all $i$, both definitions (1) and (3) are applicable for $\mu^i(\gamma)$ and yield equivalent results. Henceforth, we exclusively refer to (3) when writing $\mu^i(\gamma)$. Even in finite-dimensional scenarios, Definition 1 proves valuable, for instance, when random variables $X_1 \in \mathcal{H}_1$ and $X_2 \in \mathcal{H}_2$ originate from different state spaces ($\mathcal{H}_1 \neq \mathcal{H}_2$).

Section: Kernelized cumulants
To extend cumulants to complex data types and capture non-linear dependencies, we lift a random variable $X = (X_1, \ldots, X_d) \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ via a feature map $\Phi : \mathcal{X} \to \mathcal{H}$ into a Hilbert space-valued random variable $\Phi(X)$. This allows us to implicitly work in a high-dimensional feature space without explicitly computing coordinates, a crucial aspect for tractability.
For the remainder of this paper, we assume: (i) $\mathcal{X}_1, \ldots, \mathcal{X}_d$ are Polish spaces (though the reader may intuitively consider them as finite-dimensional Euclidean spaces); (ii) $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) equipped with a kernel $k$ and its canonical feature map $\Phi(x) = k(x, \cdot)$; and (iii) all kernels are bounded. Our central findings, Theorem 2 and Theorem 3, demonstrate that the expected kernel trick extends to both the characterization of distributions and independence in this kernelized framework, mirroring the properties of classical cumulants in $\mathbb{R}^d$ (Theorem 1). A pivotal component for these results is the derivation of an expression for inner products of kernelized cumulants within RKHSs (Lemma 1).
A combinatorial expression of cumulants. Classical cumulants can be defined via the moment generating function or via combinatorial sums over partitions (Appendix C.1). To generalize cumulants to RKHSs, the combinatorial definition is the most efficient way. A partition $\pi$ of $m$ elements is a family of non-empty, disjoint subsets $\pi_1, \ldots, \pi_b$ of $\{1, \ldots, m\}$ whose union is the whole set; formally $\bigcup_{j=1}^{b} \pi_j = \{1, \ldots, m\}$ and $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$. We call $b$ the number of blocks of the partition $\pi$ and use the shorthand $|\pi|$ to denote it. The set of all partitions of $m$ is denoted by $P(m)$. To formulate our main results, it is convenient to associate with a measure $\gamma$ and a partition $\pi$ the so-called partition measure $\gamma_\pi$ that is given by permuting the marginals of $\gamma$.
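Definition 2 (Partition measure). Let $\gamma$ be a probability measure on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and $\pi \in P(d)$. Define
$$\gamma_\pi := \gamma|_{\mathcal{X}_{\pi_1}} \otimes \cdots \otimes \gamma|_{\mathcal{X}_{\pi_b}},$$
where $\mathcal{X}_{\pi_i}$ denotes the product space $\prod_{j \in \pi_i} \mathcal{X}_j$ and $\gamma|_{\mathcal{X}_{\pi_i}}$ is the corresponding marginal distribution of $\gamma$. We call $\gamma_\pi$ the partition measure induced by $\pi$.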
We also associate with $\gamma$ and a multi-index $i$ the so-called diagonal measure $\gamma^i$ that is given by repeating marginals according to $i$.
Definition 3 (Diagonal measure). Let $\gamma$ be a probability measure on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$. Define
$$\gamma^i := \mathrm{Law}\big(\underbrace{X_1, \ldots, X_1}_{i_1 \text{ times}}, \underbrace{X_2, \ldots, X_2}_{i_2 \text{ times}}, \ldots, \underbrace{X_d, \ldots, X_d}_{i_d \text{ times}}\big),$$
where $(X_1, \ldots, X_d) \sim \gamma$. We call $\gamma^i$ the diagonal measure induced by $i$.
In general, the partition measure $\gamma_\pi$ and the diagonal measure $\gamma^i$ are not probability measures on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ but on spaces that are constructed by permuting or repeating $\mathcal{X}_1, \ldots, \mathcal{X}_d$. Formally, $\gamma_\pi$ is a probability measure on $\mathcal{X}_{\pi_1} \times \cdots \times \mathcal{X}_{\pi_b}$ and $\gamma^i$ is a probability measure on $\mathcal{X}_1^{i_1} \times \cdots \times \mathcal{X}_d^{i_d}$; thus, $\gamma_\pi$ has $d$ coordinates and $\gamma^i$ has $\deg(i)$ coordinates. These two constructions can be combined, writing $\gamma^i_\pi$ for the measure $(\gamma^i)_\pi$, which makes sense whenever $\pi \in P(\deg(i))$. We can now write down our generalization of cumulants.
Definition 4 (Kernelized cumulants). Let $\gamma$ be a probability measure on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and let $(\mathcal{H}_1, k_1), \ldots, (\mathcal{H}_d, k_d)$ be RKHSs on $\mathcal{X}_1, \ldots, \mathcal{X}_d$ respectively. We define the kernelized cumulants
$$\kappa_{k_1,\ldots,k_d}(\gamma) := \left(\kappa^i_{k_1,\ldots,k_d}(\gamma)\right)_{i \in \mathbb{N}^d} \in T$$
as follows:
$$\kappa^i_{k_1,\ldots,k_d}(\gamma) := \sum_{\pi \in P(m)} c_\pi \, \mathbb{E}_{\gamma^i_\pi} \, k^{\otimes i}((X_1, \ldots, X_m), \cdot),$$
where $m = \deg(i)$, $c_\pi := (-1)^{|\pi|-1}(|\pi|-1)!$, $\gamma^i_\pi = (\gamma^i)_\pi$ and
$$k^{\otimes i}((x_1, \ldots, x_m), (y_1, \ldots, y_m)) := k_1(x_1, y_1) \cdots k_1(x_{i_1}, y_{i_1}) \cdots k_d(x_{m-i_d+1}, y_{m-i_d+1}) \cdots k_d(x_m, y_m) \qquad (4)$$
is the reproducing kernel of $\mathcal{H}^{\otimes i}$, where $\mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$.
Def. 4 is the natural generalization of the combinatorial definition of cumulants in $\mathbb{R}^d$, and Appendix C.2 gives an equivalent definition via a generating function analogous to (2). However, our post-hoc justification that these are the "right" definitions for cumulants in an RKHS is given by Theorems 2 and 3, which show that these kernelized cumulants have the same powerful properties as classic cumulants in $\mathbb{R}^d$ (Theorem 1).
Example 3.1 (Kernelized cumulants). Let $\gamma$ be a probability measure on $\mathcal{X}_1 \times \mathcal{X}_2$, with the RKHSs $(\mathcal{H}_1, k_1)$, $(\mathcal{H}_2, k_2)$ given. Denote the random variables $K_1 = k_1(X_1, \cdot)$, $K_2 = k_2(X_2, \cdot)$, where $(X_1, X_2) \sim \gamma$. Then the degree two kernelized cumulants are given as
$$\kappa^{(2,0)}_{k_1,k_2}(\gamma) = \mathbb{E}\left[K_1^{\otimes 2}\right] - \mathbb{E}[K_1]^{\otimes 2}, \quad \kappa^{(1,1)}_{k_1,k_2}(\gamma) = \mathbb{E}[K_1 \otimes K_2] - \mathbb{E}[K_1] \otimes \mathbb{E}[K_2], \quad \kappa^{(0,2)}_{k_1,k_2}(\gamma) = \mathbb{E}\left[K_2^{\otimes 2}\right] - \mathbb{E}[K_2]^{\otimes 2}.$$
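To make Example 3.1 concrete, the sketch below evaluates the three degree-two objects under the simplifying assumption that the feature maps are explicit and finite-dimensional, so that $K_1$ and $K_2$ are ordinary vectors and tensor products become outer products. In the RKHS setting these objects are never formed explicitly; the function name `degree_two_cumulants` and the explicit-feature assumption are purely illustrative.

```python
import numpy as np


def degree_two_cumulants(F1, F2):
    """Plug-in degree-two cumulants for explicit features.

    F1: (N, p) array, row n is the feature vector of the n-th sample of X1.
    F2: (N, q) array, row n is the feature vector of the paired sample of X2.
    Returns the (2,0), (1,1) and (0,2) cumulants as matrices.
    """
    F1 = np.asarray(F1, dtype=float)
    F2 = np.asarray(F2, dtype=float)
    n = F1.shape[0]
    m1, m2 = F1.mean(axis=0), F2.mean(axis=0)
    k20 = F1.T @ F1 / n - np.outer(m1, m1)  # E[K1 (x) K1] - E[K1] (x) E[K1]
    k11 = F1.T @ F2 / n - np.outer(m1, m2)  # E[K1 (x) K2] - E[K1] (x) E[K2]
    k02 = F2.T @ F2 / n - np.outer(m2, m2)  # E[K2 (x) K2] - E[K2] (x) E[K2]
    return k20, k11, k02
```

In this finite-dimensional picture the (1,1) cumulant is the (biased) cross-covariance matrix of the features, and the (2,0) and (0,2) cumulants are the covariance matrices of each feature vector.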
Inner products of cumulants. Computing inner products of moments is straightforward thanks to a nonlinear kernel trick, see Lemma 6 in the Appendix. For example, given two probability measures $\gamma_1, \gamma_2$ with corresponding random variables $(X_1, \ldots, X_d) \sim \gamma_1$, $(Y_1, \ldots, Y_d) \sim \gamma_2$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, we have
$$\left\langle \mu^i_{k_1,\ldots,k_d}(\gamma_1), \mu^i_{k_1,\ldots,k_d}(\gamma_2) \right\rangle_{\mathcal{H}^{\otimes i}} = \mathbb{E}_{\gamma_1 \otimes \gamma_2} \, k_1(X_1, Y_1)^{i_1} \cdots k_d(X_d, Y_d)^{i_d}, \qquad (5)$$
where $\mu^i_{k_1,\ldots,k_d}$ is defined in Def. 1, and the expectation is taken over the product measure $\gamma_1 \otimes \gamma_2$.
Example 3.2. In the particular case of $d = 1$, (5) reduces to the well-known formula for the inner product of mean embeddings
$$\left\langle \mu^{(1)}_k(\gamma_1), \mu^{(1)}_k(\gamma_2) \right\rangle_{\mathcal{H}_k} = \mathbb{E}_{\gamma_1 \otimes \gamma_2} \, k(X, Y).$$
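Lemma 1 (Inner product of cumulants). Let $(\mathcal{H}_1, k_1), \ldots, (\mathcal{H}_d, k_d)$ be RKHSs with bounded kernels on $\mathcal{X}_1, \ldots, \mathcal{X}_d$ respectively, and let $\gamma$ and $\eta$ be two probability measures on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$ such that $\deg(i) = m$. Then
$$\left\langle \kappa^i_{k_1,\ldots,k_d}(\gamma), \kappa^i_{k_1,\ldots,k_d}(\eta) \right\rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \, \mathbb{E}_{\gamma^i_\pi \otimes \eta^i_\tau} \, k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)).$$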
Point-separating kernels. In the classic MMD setting the injectivity of the mean embedding $\gamma \mapsto \mathbb{E}_{X \sim \gamma}[k(X, \cdot)]$ on probability measures (known as the characteristic property of the kernel $k$) is equivalent to the MMD being a metric; this property is central in applications. We formulate our theoretical results in the next section using the much weaker property of what we term "point-separating", which is satisfied for essentially all popular kernels.
Definition 5 (Point-separating kernel). We call a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ point-separating if the canonical feature map $\Phi : x \mapsto k(x, \cdot)$ is injective.

Section: (Semi-)metrics for probability measures
In this section we use cumulants to characterize probability measures and show how to compute the distance between kernelized cumulants with the expected kernel trick.
Theorem 2 (Characterization of distributions with cumulants). Let γ and η be two probability measures on X 1 × • Moreover, the expected kernel trick applies and for i ∈ N d with deg(i) = m, and k ⊗i and H ⊗i as in (4)
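$$d^i(\gamma, \eta) := \left\| \kappa^i_{k_1,\ldots,k_d}(\gamma) - \kappa^i_{k_1,\ldots,k_d}(\eta) \right\|^2_{\mathcal{H}^{\otimes i}} \qquad (6)$$
$$= \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \Big[ \mathbb{E}_{\gamma^i_\pi \otimes \gamma^i_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)) + \mathbb{E}_{\eta^i_\pi \otimes \eta^i_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)) - 2\, \mathbb{E}_{\gamma^i_\pi \otimes \eta^i_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)) \Big].$$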
We recall Example 3.1 and now give examples of distances between such expressions.
Example 3.3 ($m = 1$). Applied with $m = 1$ and $d = 1$, (6) becomes
$$\mathrm{MMD}^2_k(\gamma, \eta) = \left\| \kappa^{(1)}_k(\gamma) - \kappa^{(1)}_k(\eta) \right\|^2_{\mathcal{H}_k} = \mathbb{E} k(X, X') + \mathbb{E} k(Y, Y') - 2\, \mathbb{E} k(X, Y),$$
where $X, X'$ denote independent random variables with law $\gamma$ and $Y, Y'$ denote independent random variables with law $\eta$.
Example 3.4 ($m = 2$). For $m = 2$ and $d = 1$, (6) reduces to
$$\left\| \kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta) \right\|^2_{\mathcal{H}^{(1,1)}} = \mathbb{E} k(X, X') k(X'', X''') + \mathbb{E} k(Y, Y') k(Y'', Y''') + \mathbb{E} k(X, X')^2 + \mathbb{E} k(Y, Y')^2 + 2\, \mathbb{E} k(X, Y) k(X', Y) + 2\, \mathbb{E} k(X, Y) k(X, Y') - 2\, \mathbb{E} k(X, Y) k(X', Y') - 2\, \mathbb{E} k(X, Y)^2 - 2\, \mathbb{E} k(X, X') k(X, X'') - 2\, \mathbb{E} k(Y, Y') k(Y, Y''),$$
where $X, X', X'', X'''$ denote independent random variables with law $\gamma$ and $Y, Y', Y'', Y'''$ denote independent random variables with law $\eta$. This expression compares the variances in the RKHS instead of the means. This is an example of the kernel variance embedding defined in the next subsection.
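The degree-one case of Example 3.3 translates directly into a plug-in (V-statistic) estimate: each expectation is replaced by the average of the corresponding Gram matrix. The following is a minimal sketch; the helper names `rbf_gram` and `mmd2_v_statistic` are hypothetical, and the RBF kernel is only one possible choice.

```python
import numpy as np


def rbf_gram(x, y, sigma=1.0):
    """Gram matrix [exp(-||x_i - y_j||^2 / (2 sigma^2))] for x of shape (N, d), y of shape (M, d)."""
    sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2_v_statistic(Kx, Ky, Kxy):
    """V-statistic for E k(X,X') + E k(Y,Y') - 2 E k(X,Y) from precomputed Gram matrices."""
    return Kx.mean() + Ky.mean() - 2.0 * Kxy.mean()
```

With the linear kernel $k(x, y) = \langle x, y \rangle$ this estimate reduces to the squared distance between the two sample means, which gives a quick sanity check of the implementation.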
The price for the weak assumption of a point-separating kernel is that without any stronger assumptions one does not get a metric in general, and the all-purpose way to achieve a metric is to take an infinite sum over all d i 's. If we only use the degree m = 1 term d i reduces to the well-known MMD formula which requires characteristicness to become a metric (see Example 3.3). There are two reasons why working under weaker assumptions is useful: firstly, if the underlying kernel is not characteristic this sum gives a structured way to incorporate finer information that discriminates the two distributions; an extreme case is the linear kernel k(x, y) = ⟨x, y⟩ which is point-separating, and in this case the sum reduces to the differences of classical cumulants. Secondly, under the stronger assumption of characteristicness one already has a metric after truncation at degree m = 1 (the classical MMD). However, in the finite-sample case adding higher degree terms can lead to increased power. Indeed, our experiments (Section 4) show that even just going one degree further (i.e. taking m = 2), can lead to more powerful tests.

Section: A characterization of independence
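Here we characterize independence in terms of kernelized cumulants.
Theorem 3 (Characterization of independence with cumulants). Let $\gamma$ be a probability measure on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, and $(\mathcal{H}_1, k_1), \ldots, (\mathcal{H}_d, k_d)$ RKHSs on Polish spaces $\mathcal{X}_1, \ldots, \mathcal{X}_d$ such that for every $1 \leq j \leq d$, $k_j$ is a bounded, continuous, point-separating kernel. Then
$$\gamma = \gamma|_{\mathcal{X}_1} \otimes \cdots \otimes \gamma|_{\mathcal{X}_d} \quad \text{if and only if} \quad \kappa^i_{k_1,\ldots,k_d}(\gamma) = 0 \text{ for every } i \in \mathbb{N}_+^d.$$
Moreover, the expected kernel trick applies in the sense that for $i \in \mathbb{N}_+^d$
$$\left\| \kappa^i_{k_1,\ldots,k_d}(\gamma) \right\|^2_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \, \mathbb{E}_{\gamma^i_\pi \otimes \gamma^i_\tau} \, k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)), \qquad (7)$$
where $m := \deg(i)$, and $k^{\otimes i}$ and $\mathcal{H}^{\otimes i}$ are defined as in (4).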
Applied to $i = (1, 1)$, the expression (7) reduces to the classical HSIC for two components, see Example 3.5 below. But for general $i$ this construction leads to genuinely new statistics in RKHSs.
Example 3.5 (Specific case: HSIC, kernel Lancaster interaction, kernel Streitberg interaction). If $d = 2$ there is only one order 2 index in $\mathbb{N}_+^d$, namely $i = (1, 1)$; in this case (7) reduces to the classical HSIC equation
$$\left\| \kappa^{(1,1)}_{k_1,k_2}(\gamma) \right\|^2_{\mathcal{H}^{(1,1)}} = \mathbb{E} k_1(X, Y) k_2(X, Y) + \mathbb{E} k_1(X, Y) k_2(X', Y') - 2\, \mathbb{E} k_1(X, Y) k_2(X', Y),$$
where $(X, Y)$ and $(X', Y')$ are independent copies of the same random variable following $\gamma$. More generally, with $i = \mathbf{1}_d$ one gets the kernel Streitberg interaction (Streitberg, 1990; Sejdinovic et al., 2013a; Liu et al., 2023), and specifically the kernel Lancaster interaction (Sejdinovic et al., 2013a) for $d \in \{2, 3\}$; the latter reduces to HSIC for two random variables ($d = 2$).
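For reference, the standard biased (V-statistic) HSIC estimate can be written with centered Gram matrices as $\frac{1}{N^2}\operatorname{Tr}(K J L J)$, where $J = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^\top$. The sketch below uses this well-known form; the function name `hsic_v_statistic` is a hypothetical choice, and `K`, `L` are assumed to be the $N \times N$ Gram matrices of the two components on paired samples.

```python
import numpy as np


def hsic_v_statistic(K, L):
    """Biased HSIC estimate (1/N^2) Tr(K J L J) from paired Gram matrices K, L of shape (N, N)."""
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    return np.trace(K @ J @ L @ J) / n ** 2
```

With linear kernels on real-valued data this quantity equals the squared (biased) sample covariance, in line with the interpretation of the (1,1) cumulant as a cross-covariance.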

Section: Finite-sample statistics
To apply Theorem 2 and Theorem 3 in practice, one needs to estimate expressions such as $\mathbb{E} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m))$. One could use classical estimators such as U-statistics (van der Vaart, 2000), which lead to unbiased estimators. However, we follow Gretton et al. (2008) and use a V-statistic, which is biased but conceptually simpler and efficient to compute. We note that the estimators presented here all have quadratic complexity like MMD and HSIC, see Appendix E.
A two-sample test for non-characteristic feature maps. If $k$ is characteristic then $\mathrm{MMD}_k(\gamma, \eta) = 0$ exactly when $\gamma = \eta$, but we can still increase testing power by considering the distance between the kernel variance and skewness embeddings, which leads us to use our semi-metrics $d^{(2)}(\gamma, \eta)$ and $d^{(3)}(\gamma, \eta)$ as defined in (6). An efficient estimator for $d^{(3)}$ is given in detail in Appendix E; we provide the full expression for $d^{(2)}$ here.
Lemma 2 ($d^{(2)}$ estimation, see (6)). The V-statistic for $d^{(2)}(\gamma, \eta) = \left\| \kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta) \right\|^2_{\mathcal{H}^{(1,1)}}$ is
$$\frac{1}{N^2} \operatorname{Tr}\left[(K_x J_N)^2\right] + \frac{1}{M^2} \operatorname{Tr}\left[(K_y J_M)^2\right] - \frac{2}{NM} \operatorname{Tr}\left[K_{xy} J_M K_{xy}^\top J_N\right],$$
where Tr denotes trace, $(x_n)_{n=1}^{N} \overset{\text{i.i.d.}}{\sim} \gamma$, $(y_m)_{m=1}^{M} \overset{\text{i.i.d.}}{\sim} \eta$, $K_x = [k(x_i, x_j)]_{i,j=1}^{N} \in \mathbb{R}^{N \times N}$, $K_y = [k(y_i, y_j)]_{i,j=1}^{M} \in \mathbb{R}^{M \times M}$, $K_{xy} = [k(x_i, y_j)]_{i,j=1}^{N,M} \in \mathbb{R}^{N \times M}$, and $J_n = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \in \mathbb{R}^{n \times n}$, with $\mathbf{1}_n = (1, \ldots, 1) \in \mathbb{R}^n$.
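The trace expression of Lemma 2 can be transcribed almost literally. The sketch below assumes the three Gram matrices have already been computed with some kernel; the names `centering` and `d2_v_statistic` are hypothetical.

```python
import numpy as np


def centering(n):
    """Centering matrix J_n = I_n - (1/n) 1 1^T."""
    return np.eye(n) - np.full((n, n), 1.0 / n)


def d2_v_statistic(Kx, Ky, Kxy):
    """V-statistic of Lemma 2 from Kx (N, N), Ky (M, M) and Kxy (N, M)."""
    N, M = Kxy.shape
    JN, JM = centering(N), centering(M)
    A = Kx @ JN
    B = Ky @ JM
    term_x = np.trace(A @ A) / N ** 2
    term_y = np.trace(B @ B) / M ** 2
    term_xy = 2.0 * np.trace(Kxy @ JM @ Kxy.T @ JN) / (N * M)
    return term_x + term_y - term_xy
```

As a sanity check, for the linear kernel on real-valued samples the statistic collapses to the squared difference of the two (biased) sample variances, which matches the reading of $d^{(2)}$ as a comparison of variances in the RKHS. The matrix products make the sketch cubic in the sample size; since $\operatorname{Tr}(AB) = \sum_{ij} A_{ij} B_{ji}$, an implementation aiming at the quadratic cost mentioned in the text would instead center the Gram matrices and sum elementwise products.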
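A kernel independence test. By Theorem 3, if $\gamma = \gamma|_{\mathcal{X}_1} \otimes \gamma|_{\mathcal{X}_2}$, then $\kappa^{(2,1)}(\gamma) = 0$ and $\kappa^{(1,2)}(\gamma) = 0$. We may compute the magnitude of either $\kappa^{(2,1)}(\gamma)$ or $\kappa^{(1,2)}(\gamma)$; we will refer to these quantities as the cross skewness independence criterion (CSIC). Note that these criteria are asymmetric. When $d = 2$ we have a probability measure $\gamma$ on $\mathcal{X}_1 \times \mathcal{X}_2$ and two kernels $k : \mathcal{X}_1^2 \to \mathbb{R}$, $\ell : \mathcal{X}_2^2 \to \mathbb{R}$. Assume that we have samples $(x_i, y_i)_{i=1}^{N}$ and use the shorthand notation $K = K_x$, $L = L_y$ (similarly to Lemma 2) and $H = H_N = \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top \in \mathbb{R}^{N \times N}$.
Denote by $\circ$ the Hadamard product and by $\langle \cdot \rangle$ the sum over all elements of a matrix. Then one can derive the following CSIC estimator. (Note that matrix multiplication takes precedence over the Hadamard product.)
Lemma 3 (CSIC estimation). The V-statistic for $\left\| \kappa^{(1,2)}_{k,\ell}(\gamma) \right\|^2_{\mathcal{H}_k^{\otimes 1} \otimes \mathcal{H}_\ell^{\otimes 2}}$ is
$$\frac{1}{N^2} \Big\langle K \circ K \circ L - 4 K \circ KH \circ L - 2 K \circ K \circ LH + 4 KH \circ K \circ LH + 2 K \circ L \frac{\langle K \rangle}{N^2} + 2 KH \circ HK \circ L + 4 K \circ HK \circ LH + K \circ K \frac{\langle L \rangle}{N^2} - 8 K \circ LH \frac{\langle K \rangle}{N^2} - 4 K \circ HK \frac{\langle L \rangle}{N^2} + 4 \left(\frac{\langle K \rangle}{N^2}\right)^2 L \Big\rangle.$$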
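The eleven-term expression can be cross-checked numerically. The sketch below implements it term by term and compares it with a compact form, $\frac{1}{N^2}\langle \tilde K \circ \tilde K \circ \tilde L \rangle$ with doubly centered Gram matrices $\tilde K = (I - H) K (I - H)$ and $\tilde L = (I - H) L (I - H)$. The compact form is our own reading, based on the fact that a third-order cumulant coincides with the corresponding central moment; it contains two factors of $K$ and one of $L$, matching the factor pattern of the displayed formula. The function names are hypothetical.

```python
import numpy as np


def csic_expanded(K, L):
    """Eleven-term V-statistic; `*` is the Hadamard product, `@` the matrix product."""
    n = K.shape[0]
    H = np.full((n, n), 1.0 / n)
    KH, HK, LH = K @ H, H @ K, L @ H
    k_bar = K.sum() / n ** 2  # <K> / N^2
    l_bar = L.sum() / n ** 2  # <L> / N^2
    total = (
        K * K * L - 4 * K * KH * L - 2 * K * K * LH + 4 * KH * K * LH
        + 2 * k_bar * K * L + 2 * KH * HK * L + 4 * K * HK * LH
        + l_bar * K * K - 8 * k_bar * K * LH - 4 * l_bar * K * HK
        + 4 * k_bar ** 2 * L
    ).sum()
    return total / n ** 2


def csic_centered(K, L):
    """Compact form: sum of Kc * Kc * Lc over all entries, divided by N^2."""
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    Kc, Lc = J @ K @ J, J @ L @ J
    return (Kc * Kc * Lc).sum() / n ** 2
```

For linear kernels on real-valued data both functions return the square of the empirical central cross moment `mean((x - mean(x))**2 * (y - mean(y)))`, which is one way to see why the statistic is called a cross skewness criterion.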
Remark (computational complexity w.r.t. degree $m$). We saw that the computational complexity of the cumulant based measures is quadratic w.r.t. the sample size. Let $B_m = |P(m)|$ be the $m$-th Bell number, in other words the number of elements in $P(m)$. The Bell numbers follow a recursion:
$$B_{m+1} = |P(m+1)| = \sum_{k=0}^{m} \binom{m}{k} B_k,$$
with the first elements of the sequence being $B_0 = B_1 = 1$, $B_2 = 2$, $B_3 = 5$, $B_4 = 15$, $B_5 = 52$. By (6)-(7), in the worst case the number of operations to compute $d^i(\gamma, \eta)$ or $\left\| \kappa^i_{k_1,\ldots,k_d}(\gamma) \right\|^2_{\mathcal{H}^{\otimes i}}$ ($m = \deg(i)$) is proportional to $B_m^2$ (it equals $3 B_m^2$ and $B_m^2$, respectively). Though asymptotically $B_m$ grows quickly (de Bruijn, 1981; Lovász, 1993), for reasonably small degrees the computation is still manageable. In addition, merging various terms in the estimator can often be carried out, which leads to computational savings. For instance, the estimators of $d^{(2)}$ (see Lemma 2, Example E.1), CSIC (Lemma 3, Example E.2) and $d^{(3)}$ (Example E.3) consist of only 3, 11 and $10 + 2 \times 7 = 24$ terms compared to the predicted worst-case setting of $3 B_2^2 = 12$, $B_3^2 = 25$, and $3 B_3^2 = 75$ terms, respectively. On a practical side, we found that using $m \in \{2, 3\}$ is a good compromise between gain in sample efficiency and ease of implementation.
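The recursion and the worst-case term counts quoted in the remark can be reproduced with a few lines. This is a small sketch; the function names are hypothetical.

```python
from math import comb


def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] using B_{m+1} = sum_k C(m, k) B_k."""
    bell = [1]
    for m in range(n_max):
        bell.append(sum(comb(m, k) * bell[k] for k in range(m + 1)))
    return bell


def worst_case_terms(m):
    """Worst-case number of terms: (3 B_m^2 for d^i, B_m^2 for the squared norm)."""
    b = bell_numbers(m)[m]
    return 3 * b * b, b * b
```

For $m = 2$ and $m = 3$ this gives the counts 12, 25 and 75 stated above, against which the merged estimators with 3, 11 and 24 terms can be compared.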

Section: Experiments
In this section, we demonstrate the efficiency of the proposed kernel cumulants in two-sample and independence testing.
• Two-sample test: Given $N$ samples from each of two probability measures $\gamma$ and $\eta$ on a space $\mathcal{X}$, the goal was to test the null hypothesis $H_0 : \gamma = \eta$ against the alternative $H_1 : \gamma \neq \eta$. The compared test statistics ($S$) were MMD, $d^{(2)}$, and $d^{(3)}$.
• Independence test: Given $N$ paired samples from a probability measure $\gamma$ on a product space $\mathcal{X}_1 \times \mathcal{X}_2$, the aim was to test the null hypothesis $H_0 : \gamma = \gamma_1 \otimes \gamma_2$ against the alternative $H_1 : \gamma \neq \gamma_1 \otimes \gamma_2$. The compared test statistics ($S$) were HSIC and CSIC.
In our experiments $H_1$ held, and the estimated power of the tests is reported. A permutation test was applied to approximate the null distribution and its 0.95-quantile (which corresponds to the level choice $\alpha = 0.05$): We first computed our test statistic $S$ using the given samples ($S_0 = S$), and then permuted the samples 100 times. If $S_0$ was in a high percentile ($\geq 95\%$ in our case) of the resulting distribution of $S$ under the permutations, we rejected the null. We repeated these experiments 100 times to estimate the power of the test. This procedure was in turn repeated 5 times and the 5 samples are plotted as a box plot along with a line plot showing the mean against the number of samples ($N$) used. All experiments were performed using the rbf-kernel
$$\mathrm{rbf}_\sigma(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right),$$
where the parameter $\sigma$ is called the bandwidth. We performed all experiments for every bandwidth of the form $\sigma = a \cdot 10^b$ where $a = 1, 2.5, 5, 7.5$ and $b = -5, -4, -3, -2, -1, 0$, and the optimal value across the bandwidths was chosen for each method and sample size. The experiments were carried out on a laptop with an i7 CPU and 16 GB of RAM.
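The permutation procedure for the two-sample case can be sketched as follows. Several details are assumptions on our part rather than statements from the text: the permutations are realized by pooling both samples and re-splitting them at random, the rejection rule compares $S_0$ with the empirical 0.95-quantile of the permuted statistics, and MMD with the RBF kernel is used as the example statistic. All function names are hypothetical.

```python
import numpy as np


def rbf_gram(x, y, sigma):
    """RBF Gram matrix for x of shape (N, d) and y of shape (M, d)."""
    sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2_v(x, y, sigma):
    """Plug-in MMD^2 estimate with the RBF kernel."""
    return (rbf_gram(x, x, sigma).mean() + rbf_gram(y, y, sigma).mean()
            - 2.0 * rbf_gram(x, y, sigma).mean())


def permutation_two_sample_test(x, y, statistic, n_perm=100, level=0.05, seed=0):
    """Return (S0, threshold, reject) for a generic two-sample statistic."""
    rng = np.random.default_rng(seed)
    s0 = statistic(x, y)
    pooled = np.vstack([x, y])
    n = x.shape[0]
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(pooled.shape[0])  # random re-split of the pooled sample
        perm_stats[b] = statistic(pooled[idx[:n]], pooled[idx[n:]])
    threshold = np.quantile(perm_stats, 1.0 - level)
    return s0, threshold, bool(s0 >= threshold)


# Bandwidth grid sigma = a * 10^b described in the text.
bandwidths = [a * 10.0 ** b for b in range(-5, 1) for a in (1, 2.5, 5, 7.5)]
```

Any of the other statistics can be passed as `statistic`, for example a function wrapping the $d^{(2)}$ estimator for a fixed bandwidth. For independence testing, one common choice (again an assumption here) is to permute only the second component of the paired sample.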

Section: Synthetic data
For synthetic data we designed two experiments.
• 2-sample test: We compared a uniform distribution with a mixture of two uniforms.
• Independence test: We considered the joint measure of a uniform and a correlated χ 2 random variable. We also use this same benchmark to compare the efficiency of classical and kernelized cumulants in Appendix D.
Comparing a uniform with a mixture of uniforms. Even for simpler distributions like mixtures of uniform distributions it can be hard to pick up higher-order features, and $d^{(2)}$ can outperform MMD even when provided with a moderate number of samples. Here we compared one uniform distribution $U[-1, 1]$ with an equal mixture of $U[0.35, 0.778]$ and $U[-0.778, -0.35]$. The endpoints in the mixture were chosen to match the first three moments of $U[-1, 1]$. The number of samples used ranged from 5 to 50, and the results are summarized in Fig. 1. One can see that with $d^{(2)}$ the power approaches 100% much faster than with MMD.
Independence between a uniform and a $\chi^2$. Let $X \sim U[0, 1]$ and let $Z \sim N(0, 1)$ be independent of $X$. Denote by $\Phi$ the c.d.f. of a standard normal distribution and define $Y_p$ to be a mixture with weight $p$ at $\Phi^{-1}(X)$ and weight $1 - p$ at $Z$, so that $Y_0 = Z$ and $Y_1 = \Phi^{-1}(X)$. We test for independence for $p = 0.5$ between $Y_p^2$, which will be $\chi^2$ distributed with 1 degree of freedom, and $X$. As the statistical dependence of $Y_p^2$ and $X$ is more complicated than a simple correlation, we expect that higher-order features of the data will help in the independence testing. The number of samples used ranged from 6 to 60, with the results summarized in Fig. 2. One can see that CSIC outperforms HSIC for every sample size, and the difference is more pronounced for smaller ones.
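The moment-matching claim for the mixture of uniforms can be checked in closed form, using $\mathbb{E}[U^r] = \frac{b^{r+1} - a^{r+1}}{(r+1)(b-a)}$ for $U \sim U[a, b]$. The sketch below is only a sanity check; the function names are hypothetical.

```python
def uniform_raw_moment(a, b, r):
    """r-th raw moment of the uniform distribution on [a, b]."""
    return (b ** (r + 1) - a ** (r + 1)) / ((r + 1) * (b - a))


def mixture_raw_moment(r, lo=0.35, hi=0.778):
    """r-th raw moment of the equal mixture of U[lo, hi] and U[-hi, -lo]."""
    return 0.5 * uniform_raw_moment(lo, hi, r) + 0.5 * uniform_raw_moment(-hi, -lo, r)
```

The odd moments of both distributions vanish by symmetry and the second moments agree up to the rounding of the endpoints (about 0.3334 against 1/3), while the fourth moments differ (about 0.131 against 0.2), so the two laws are separated only by features beyond the first three moments.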

Section: Real-world data
We demonstrate the efficiency of the kernelized cumulants on real-world data. We designed two experiments.
• 2-sample test: Here the goal was to test if environmental data in two different seasons, and traffic data at different speeds can be distinguished.
• independence test: The aim was to test if two distributions describing traffic flow and other traffic factors are independent.
To improve performance of both test statistics, all features are standardized to lie between 0 and 1.
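One simple way to bring every feature into $[0, 1]$ is column-wise min-max scaling. The text does not specify the exact transformation used, so the following is only a sketch under that assumption; the function name is hypothetical.

```python
import numpy as np


def min_max_scale(features):
    """Rescale each column of an (N, d) array to the interval [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns are mapped to 0
    return (features - lo) / span
```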
Seoul bicycle data. The Seoul bicycle data set (E et al., 2020) consists of environmental data along with the number of bicycle rentals. The environmental data consists of 9 numerical values, 1 categorical value (season), and two binary values. We compare the distribution of the environmental data in the winter and the fall, as we expect these distributions to be different. Concretely, we do 2-sample testing on two measures $\gamma, \eta$ on $\mathbb{R}^{11}$, and assume that $\gamma \neq \eta$, where $\gamma$ is the distribution of the environmental data in winter and $\eta$ is that of the data in the fall. Permutation testing was performed for $N$ between 4 and 44, with results summarized in Fig. 3. As can be observed, $d^{(2)}$ outperforms MMD in terms of test power. The Type I error, i.e. the probability of falsely rejecting the null hypothesis (comparing winter data with itself), hovers between 5% and 10% for both statistics, which is admittedly slightly higher than the desired 5% due to the small sample size, but very similar for both statistics; for further details, the reader is referred to Fig. 8 in Appendix D.
Brazilian traffic data. We used the Sao Paulo traffic benchmark (Ferreira, 2016) to perform independence testing. The dataset consists of 16 different integer-valued statistics about the hourly traffic in Sao Paulo such as blockages, fires and other reasons that might hold up traffic. This is combined with a number that describes the slowness of traffic at the given hour; so $\mathcal{X}_1 = \mathbb{R}^{16}$, $\mathcal{X}_2 = \mathbb{R}$. One expects a strong dependence between the two sets (or equivalently, the null hypothesis to be false) and the statistics to be heavily skewed towards 0, as the data is naturally sparse.
For independence testing we performed permutation testing for $N$ between 4 and 40. The resulting test powers are summarized in Fig. 4. As can be seen, HSIC and CSIC perform similarly for very low sample sizes, but for larger sample sizes CSIC is the favorable statistic in terms of test power. For two-sample testing, we sampled $N$ between 5 and 50 and compared the distribution of slow moving traffic with that of fast moving traffic. The results are summarized in Fig. 5. It is clear that $d^{(3)}$ performs similarly to MMD in terms of test power for very small sample sizes, but significantly better for larger ones.

Section: Conclusion
We defined cumulants for random variables in RKHSs by extending the algebraic characterization of cumulants on $\mathbb{R}^d$. This construction results in a structured description of the law of random variables that goes beyond the classic kernelized mean and covariance. A kernel trick allows us to compute the resulting kernelized cumulants. We applied our theoretical results to two-sample and independence testing; although kernelized mean and covariance are sufficient for this task, the higher-order kernelized cumulants have the potential to increase the test power and to relax the assumptions on the kernel. Our experiments on real and synthetic data show that kernelized cumulants can indeed lead to significant improvement of the test power. A disadvantage of these higher-order statistics is that their theoretical analysis requires more mathematical machinery, although we emphasize that the resulting estimators are simple V-statistics.
These appendices provide additional background and elaborate on some of the finer points in the main text. In Appendix A we illustrate that cumulants have typically lower variance estimators compared to moments. Technical background on tensor products and tensor sums of Hilbert spaces, and on tensor algebras is provided in Appendix B. We present our proofs in Appendix C. In Appendix D additional details on our numerical experiments are provided. Our V-statistic based estimators are detailed in Appendix E.

Section: A Moments and cumulants
Already for a real-valued random variable $X$, moments have well-known drawbacks that make cumulants often preferable as statistics. For a detailed introduction to the use of cumulants in statistics we refer to McCullagh (2018). Here we just mention the following.
1. The moment generating function $f(t) = \mathbb{E}[e^{tX}] = \sum_m \mu_m t^m / m!$ describes the law of $X$ with the sequence $(\mu_m)$ of moments $\mu_m = \mathbb{E}[X^m] \in \mathbb{R}$. However, since the function $t \mapsto f(t)$ is the expectation of an exponential, one would often expect that $f$ is also "exponential in $t$", hence $g(t) = \log f(t) = \sum_m \kappa_m \frac{t^m}{m!}$ should be simpler to describe as a power series. For example, for a Gaussian $f(t) = e^{t \mathbb{E}(X) + \frac{t^2}{2} \mathrm{Var}(X)}$, and while $\mu_m$ can be in this case explicitly calculated and odd moments vanish, the $m$-th moments are fairly complicated compared to the power series expansion of $g(t) = \kappa_1 t + \kappa_2 \frac{t^2}{2}$, which just consists of $\kappa_1$ (mean) and $\kappa_2$ (variance).
2. In the moment sequence $\mu_m$, lower moments can dominate higher moments. Hence, a natural idea to compensate for these "different scales" is to systematically subtract lower moments from higher moments. As mentioned in the introduction, this is in particular troublesome if finite samples are available. Even in dimension $d = 1$ the second moment is dominated by the squared mean, that is for a real-valued random variable $X \sim \gamma$,
$$\mu_2(\gamma) = (\mu_1(\gamma))^2 + \mathrm{Var}(X),$$
where $\mathrm{Var}(X) := \mathbb{E}[(X - \mu_1(\gamma))^2]$.
It is well known that the minimum variance unbiased estimators for the variance are more efficient than that for the second moment: denoting them by µ 2 and κ respectively, one can show (Bonnier and Oberhauser, 2020) that given N samples from X, the following holds
Var µ 2 = Var ( κ) + 2 N (EX) 4 -(EX) 2 Var(X) -2 Var(X) 2 N -1 .
This means that when $X$ has a large mean, it is more efficient to estimate its variance than its second moment, since the last term in the above expression dominates. Hence, the variance $\operatorname{Var}(X)$ is typically a much more sensible second-order statistic than $\mu_2(\gamma)$. However, we emphasize that there are many other reasons why cumulants can have better properties as estimators.

3. Cumulants characterize laws, and the independence of two random variables manifests itself simply as the vanishing of cross-cumulants. In view of item 2 above, this means for example that testing independence via vanishing cross-cumulants can be preferable to testing whether moments factor,
$$\mathbb{E}[X^m Y^n] = \mathbb{E}[X^m]\,\mathbb{E}[Y^n],$$
and similarly for testing if distributions are the same.
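The efficiency gap described in item 2 can be illustrated numerically. The following is a minimal Monte Carlo sketch, assuming Gaussian data with a large mean; the function name `estimator_variances` and all parameter values are hypothetical choices made for this illustration and do not come from the experiments of the paper.

```python
import numpy as np

def estimator_variances(mean, std, n_samples, n_repeats, seed=0):
    """Monte Carlo variances of the second-moment and variance estimators.

    Illustration only: Gaussian data is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mean, std, size=(n_repeats, n_samples))
    second_moment = np.mean(x**2, axis=1)   # unbiased estimator of mu_2
    variance = np.var(x, axis=1, ddof=1)    # unbiased estimator of Var(X)
    return second_moment.var(), variance.var()

var_m2, var_k2 = estimator_variances(mean=10.0, std=1.0, n_samples=50, n_repeats=2000)
print(f"Var of second-moment estimator: {var_m2:.3f}")
print(f"Var of variance estimator:      {var_k2:.3f}")
```

With a mean that is large relative to the standard deviation, the second-moment estimator fluctuates far more than the variance estimator, in line with the discussion above.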
The caveat to the above points is that cumulants are not always preferable. For example, there are distributions for which (a) the moment generating function is not naturally exponential in $t$, (b) lower moments do not dominate higher moments, and (c) consequently independence or two-sample testing becomes worse with cumulants. While one can write down conditions under which, for example, the variance of the kernelized cumulants is lower, the common practice among statisticians is simply to regard cumulants as arising from natural motivations, which leads to another estimator in their toolbox.
The main idea of our paper is simply that, for the same reasons that cumulants can turn out to be powerful for real or vector-valued random variables, cumulants of RKHS-valued random variables are a natural choice of statistics. The situation is more complicated since it requires formalizing moment- and cumulant-generating functions in RKHSs, but ultimately a kernel trick circumvents the computational bottleneck of working in infinite dimensions and leads to computable estimators for independence and two-sample testing.
Further, we note that although cumulants are classical for vector-valued data, there seems to be little work on extending their properties to general structured data. Our kernelized cumulants apply to any set $\mathcal{X}$ on which a kernel is given. This includes many practically relevant examples such as strings (Lodhi et al., 2002), graphs (Kriege et al., 2020), or general sequentially ordered data (Király and Oberhauser, 2019; Chevyrev and Oberhauser, 2022); a survey of kernels for structured data is provided by Gärtner (2003).

Section: B Technical background
In Section B.1 the tensor products ($\bigotimes_{j=1}^d \mathcal{H}_j$) and direct sums ($\bigoplus_{i\in I} \mathcal{H}_i$) of Hilbert spaces are recalled. Section B.2 is about tensor algebras over Hilbert spaces (completions of $\bigoplus_{m\ge 0} \mathcal{H}^{\otimes m}$).

Section: B.1 Tensor products and direct sums of Banach and Hilbert spaces
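Tensor products of Hilbert spaces. For Hilbert spaces $\mathcal{H}_1, \dots, \mathcal{H}_d$ and $(h_1, \dots, h_d) \in \mathcal{H}_1 \times \dots \times \mathcal{H}_d$, the multi-linear operator $h_1 \otimes \dots \otimes h_d \in \mathcal{H}_1 \otimes \dots \otimes \mathcal{H}_d$ is defined as
$$(h_1 \otimes \dots \otimes h_d)(f_1, \dots, f_d) = \prod_{j=1}^d \langle h_j, f_j \rangle_{\mathcal{H}_j}$$
for all $(f_1, \dots, f_d) \in \mathcal{H}_1 \times \dots \times \mathcal{H}_d$. By extending the inner product
$$\langle a_1 \otimes \dots \otimes a_d, b_1 \otimes \dots \otimes b_d \rangle_{\mathcal{H}_1 \otimes \dots \otimes \mathcal{H}_d} := \prod_{j=1}^d \langle a_j, b_j \rangle_{\mathcal{H}_j}$$
to finite linear combinations of the $a_1 \otimes \dots \otimes a_d$-s,
$$\Big\{ \sum_{i=1}^n c_i \otimes_{j=1}^d a_{i,j} \,:\, c_i \in \mathbb{R},\ a_{i,j} \in \mathcal{H}_j,\ n \ge 1 \Big\},$$
by linearity, and taking the topological completion, one arrives at $\mathcal{H}_1 \otimes \dots \otimes \mathcal{H}_d$. Specifically, if $(\mathcal{H}_1, k_1), \dots, (\mathcal{H}_d, k_d)$ are RKHSs, then so is $\mathcal{H}_1 \otimes \dots \otimes \mathcal{H}_d = \mathcal{H}_{\otimes_{j=1}^d k_j}$ (Berlinet and Thomas-Agnan, 2004, Theorem 13) with the tensor product kernel
$$\Big(\otimes_{j=1}^d k_j\Big)\big((x_1, \dots, x_d), (x'_1, \dots, x'_d)\big) := \prod_{j=1}^d k_j(x_j, x'_j),$$
where $(x_1, \dots, x_d), (x'_1, \dots, x'_d) \in \mathcal{X}_1 \times \dots \times \mathcal{X}_d$.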
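As a finite-dimensional sanity check of the two product formulas above, the following sketch takes $\mathcal{H}_1 = \mathbb{R}^3$ and $\mathcal{H}_2 = \mathbb{R}^4$ with linear kernels, so that elementary tensors are outer products. The dimensions, variable names and the use of linear kernels are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Elementary tensors in H_1 (x) H_2 with H_1 = R^3 and H_2 = R^4.
a1, b1 = rng.normal(size=3), rng.normal(size=3)
a2, b2 = rng.normal(size=4), rng.normal(size=4)
a = np.multiply.outer(a1, a2)          # a1 (x) a2 as a 3 x 4 array
b = np.multiply.outer(b1, b2)          # b1 (x) b2 as a 3 x 4 array

inner_tensor = np.sum(a * b)                       # <a1 (x) a2, b1 (x) b2>
inner_factors = np.dot(a1, b1) * np.dot(a2, b2)    # <a1, b1> <a2, b2>

# Gram matrices: the tensor product kernel is the entrywise product of the factors.
x1 = rng.normal(size=(5, 3))                       # five points in X_1 = R^3
x2 = rng.normal(size=(5, 4))                       # five points in X_2 = R^4
gram_product = (x1 @ x1.T) * (x2 @ x2.T)
features = np.einsum("ni,nj->nij", x1, x2).reshape(5, -1)  # explicit tensor features
gram_explicit = features @ features.T

print(np.isclose(inner_tensor, inner_factors))
print(np.allclose(gram_product, gram_explicit))
```

The second check is the reason tensor product kernels are cheap in practice: the Gram matrix of the product kernel is a Hadamard product of the factor Gram matrices, so the tensor features never need to be formed.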
Tensor products of Banach spaces. For Banach spaces $\mathcal{B}_1, \dots, \mathcal{B}_d$, the construction of $\mathcal{B}_1 \otimes \dots \otimes \mathcal{B}_d$ is a little more involved (Lang, 2002), as one cannot rely on an inner product.
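Direct sums of Hilbert and Banach spaces. Let $(\mathcal{H}_i)_{i\in I}$ be Hilbert or Banach spaces, where $I$ is some index set. The direct sum of the $\mathcal{H}_i$-s, written as $\bigoplus_{i\in I} \mathcal{H}_i$, consists of ordered tuples $h = (h_i)_{i\in I}$ such that $h_i \in \mathcal{H}_i$ for all $i \in I$ and $h_i = 0$ for all but a finite number of $i \in I$. Operations (addition, scalar multiplication) are performed coordinate-wise, and in the Hilbert case the inner product of $a, b \in \bigoplus_{i\in I} \mathcal{H}_i$ is defined as $\langle a, b \rangle_{\bigoplus_{i\in I} \mathcal{H}_i} = \sum_{i\in I} \langle a_i, b_i \rangle_{\mathcal{H}_i}$.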

Section: B.2 Tensor algebras
The tensor algebra $T_j$ over a Hilbert space $\mathcal{H}_j$ is defined as the topological completion of the space $\bigoplus_{m\ge 0} \mathcal{H}_j^{\otimes m}$. Note that it can equivalently be defined as the subset of $(h_0, h_1, h_2, \dots) \in \prod_{m\ge 0} \mathcal{H}_j^{\otimes m}$ such that
$$\sum_{m\ge 0} \|h_m\|^2_{\mathcal{H}_j^{\otimes m}} < \infty,$$
and as such it is a Hilbert space with norm
$$\|(h_0, h_1, h_2, \dots)\|^2_{T_j} = \sum_{m\ge 0} \|h_m\|^2_{\mathcal{H}_j^{\otimes m}}.$$
$T_j$ is also an algebra, endowed with the tensor product over $\mathcal{H}_j$ as its product. For $a = (a_0, a_1, a_2, \dots)$, $b = (b_0, b_1, b_2, \dots) \in T_j$, their product can be written down in coordinates as
$$a \cdot b = \Big( \sum_{i=0}^m a_i \otimes b_{m-i} \Big)_{m\ge 0}.$$
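The coordinate formula for the product can be made concrete in finite dimensions. The sketch below assumes $\mathcal{H}_j = \mathbb{R}^n$ and truncates every tensor series at a fixed depth, storing level $m$ as an array with $m$ axes; the helper name `tensor_product` and the truncation are assumptions of this illustration, not part of the construction above.

```python
import numpy as np

def tensor_product(a, b):
    """Product of truncated tensor series a = (a_0, ..., a_M) and b = (b_0, ..., b_M).

    Level m of each series is an array with m axes of length n (a scalar for m = 0).
    Level m of the result is sum_{i=0}^m a_i (x) b_{m-i}.
    """
    depth = min(len(a), len(b))
    return [
        sum(np.multiply.outer(a[i], b[m - i]) for i in range(m + 1))
        for m in range(depth)
    ]

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
# Truncated "tensor exponentials" (1, x, x (x) x / 2) and (1, y, y (x) y / 2).
a = [1.0, x, np.multiply.outer(x, x) / 2]
b = [1.0, y, np.multiply.outer(y, y) / 2]
c = tensor_product(a, b)
print(c[1])   # level 1 of the product
print(c[2])   # level 2 of the product
```

Because $x \otimes y \ne y \otimes x$ in general, the product is not commutative, which is why the order of factors matters in the formula above.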
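For a sequence $\mathcal{H}_1, \dots, \mathcal{H}_d$ of Hilbert spaces, we define
$$T := T_1 \otimes \dots \otimes T_d,$$
where $T_j$ is the tensor algebra over $\mathcal{H}_j$ ($j = 1, \dots, d$). Let $\mathcal{H} = \mathcal{H}_1 \times \dots \times \mathcal{H}_d$, and recall that given a tuple of integers $\mathbf{i} = (i_1, \dots, i_d) \in \mathbb{N}^d$ we define
$$\mathcal{H}^{\otimes \mathbf{i}} := \mathcal{H}_1^{\otimes i_1} \otimes \dots \otimes \mathcal{H}_d^{\otimes i_d}.$$
This allows us to write down a multi-grading for $T$ as
$$T = \bigoplus_{\mathbf{i}\in\mathbb{N}^d} \mathcal{H}^{\otimes \mathbf{i}}. \tag{8}$$
Note that this justifies our use of multi-indices $\mathbf{i} \in \mathbb{N}^d$ to describe elements of the tensor algebra, as the multi-indices form its multi-grading.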
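Furthermore, $T$ is a multi-graded algebra when endowed with the (linear extension of the) following multiplication defined on the components of $T$:
$$\star : \mathcal{H}^{\otimes \mathbf{i}^1} \times \mathcal{H}^{\otimes \mathbf{i}^2} \to \mathcal{H}^{\otimes (\mathbf{i}^1 + \mathbf{i}^2)}, \qquad (x_1 \otimes \dots \otimes x_d) \star (y_1 \otimes \dots \otimes y_d) = (x_1 \cdot y_1) \otimes \dots \otimes (x_d \cdot y_d), \tag{9}$$
so that for $a = (a^{\mathbf{i}})_{\mathbf{i}\in\mathbb{N}^d}$, $b = (b^{\mathbf{i}})_{\mathbf{i}\in\mathbb{N}^d} \in T$, their product can be written down as
$$(a \star b)^{\mathbf{i}} = \sum_{\mathbf{i}^1 + \mathbf{i}^2 = \mathbf{i}} a^{\mathbf{i}^1} \star b^{\mathbf{i}^2}, \tag{10}$$
where addition of tuples $\mathbf{i}^1, \mathbf{i}^2 \in \mathbb{N}^d$ is defined as
$$\mathbf{i}^1 + \mathbf{i}^2 = (i^1_1 + i^2_1, \dots, i^1_d + i^2_d).$$
With the degree of a tuple defined as $\deg(\mathbf{i}) = i_1 + \dots + i_d$, $T$ is also a graded algebra, with the grading written down as
$$T = \bigoplus_{m\ge 0}\ \bigoplus_{\{\mathbf{i}\in\mathbb{N}^d \,:\, \deg(\mathbf{i}) = m\}} \mathcal{H}^{\otimes \mathbf{i}},$$
so that if one multiplies two elements together, the degree of their product is the sum of their degrees.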
Finally we note that T is a unital algebra and the unit has the explicit form (1, 0, 0, . . .), i.e. the element consisting of only a 1 at degree 0.

Section: C Proofs
This section is dedicated to proofs. The equivalence between the combinatorial expressions of cumulants and the definition via a moment generating function is proved in Section C.2. The derivation of our main results (Theorem 2 and Theorem 3) is detailed in Section C.3.

Section: C.1 Equivalent definitions of cumulants in R d
Here we introduce a classical definition of cumulants via a moment generating function and its equivalence to the combinatorial expressions. If X = (X 1 , . . . , X d ) is an R d -valued random variable distributed according to X ∼ γ, then recall that
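$$\mu^{\mathbf{i}}(\gamma) = \mathbb{E}\big[X_1^{i_1} \cdots X_d^{i_d}\big] \in \mathbb{R} \quad \text{for } \mathbf{i} = (i_1, \dots, i_d) \in \mathbb{N}^d.$$
The following definitions of the cumulants $\kappa^{\mathbf{i}}(\gamma)$ of $\gamma$ are equivalent:
1. $\displaystyle \sum_{\mathbf{i}\in\mathbb{N}^d} \kappa^{\mathbf{i}}(\gamma) \frac{\theta^{\mathbf{i}}}{\mathbf{i}!} = \log \sum_{\mathbf{i}\in\mathbb{N}^d} \mu^{\mathbf{i}}(\gamma) \frac{\theta^{\mathbf{i}}}{\mathbf{i}!}$,
2. $\displaystyle \kappa^{\mathbf{i}}(\gamma) = \sum_{\pi\in P(d)} c_\pi\, \mu^{\mathbf{i}}(\gamma^{\mathbf{i}}_\pi)$,
where $\theta = (\theta_1, \dots, \theta_d) \in \mathbb{R}^d$ and $c_\pi = (-1)^{|\pi|-1} (|\pi|-1)!$.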
The equivalence between these two definitions of cumulants via a generating function and via their combinatorial definition, is classical (McCullagh, 2018) even if our notation here is non-standard in the classical case. This equivalence is also at the heart of many proofs about properties of cumulants since some properties are easier to prove via one or the other definition.
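The combinatorial definition can be checked numerically in the scalar case. The sketch below evaluates the classical partition formula for a joint cumulant, $\sum_\pi c_\pi \prod_{b\in\pi} \mathbb{E}[\prod_{j\in b} X_j]$, under the empirical measure of a sample; the helper names `set_partitions` and `joint_cumulant` are hypothetical and chosen only for this illustration.

```python
import math
import numpy as np

def set_partitions(items):
    """Yield all partitions of the list `items` as lists of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

def joint_cumulant(samples):
    """Joint cumulant of the columns of `samples` under the empirical measure."""
    n_vars = samples.shape[1]
    total = 0.0
    for partition in set_partitions(list(range(n_vars))):
        size = len(partition)
        c_pi = (-1) ** (size - 1) * math.factorial(size - 1)
        block_moments = [np.mean(np.prod(samples[:, block], axis=1)) for block in partition]
        total += c_pi * math.prod(block_moments)
    return total

rng = np.random.default_rng(0)
x = rng.exponential(size=500)
third_cumulant = joint_cumulant(np.column_stack([x, x, x]))
third_central_moment = np.mean((x - x.mean()) ** 3)
print(third_cumulant, third_central_moment)
```

Repeating the same variable three times recovers the third cumulant, which coincides with the third central moment; using two different columns recovers the (biased) empirical covariance.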

Section: C.2 Equivalent definitions of cumulants in RKHS
In the main text, we defined cumulants in RKHSs by mimicking the combinatorial definition of cumulants in $\mathbb{R}^d$. It is natural and useful to also have the analogous definition via a "generating function" for RKHS-valued random variables. However, generalizing the definition via the logarithm of the moment generating function to random variables in RKHSs requires defining a logarithm for tensor series of moments. In this part, we show that this can be done and that the two definitions are indeed equivalent.
We use the shorthand κ(γ) := κ k1,...,k d (γ), µ(γ) := µ k1,...,k d (γ), and we overload the notation (X 1 , . . . , X d ) with (k 1 (•, X 1 ), . . . , k d (•, X d )). With this notation, we show that given coordinates i ∈ N d , one may express the generalized cumulant κ i (γ) as either a combinatorial sum over moments indexed by partitions, or by using the cumulant generating function.
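More specifically, we show that the generalized cumulant of a probability measure $\gamma$ on $\mathcal{H}_1 \times \dots \times \mathcal{H}_d$, defined as
$$\kappa^{\mathbf{i}}(\gamma) = \sum_{\pi\in P(m)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi}\big(X^{\otimes \mathbf{i}}\big),$$
where $c_\pi = (-1)^{|\pi|-1} (|\pi|-1)!$, can also be expressed as coordinates in the tensorized logarithm of the moment series. Motivated by the Taylor series expansion of the classical logarithm, we define
$$\log : T \to T, \qquad x \mapsto \sum_{n\ge 1} \frac{(-1)^{n-1}}{n} (x - 1)^{\star n},$$
where $\star$ denotes the product as defined in (9) and, for $t \in T$, $t^{\star n}$ is defined as
$$t^{\star n} = \underbrace{t \star \dots \star t}_{n \text{ times}}, \qquad \text{or coordinate-wise} \qquad (t^{\star n})^{\mathbf{i}} = \sum_{\mathbf{i}^1 + \dots + \mathbf{i}^n = \mathbf{i}} t^{\mathbf{i}^1} \star \dots \star t^{\mathbf{i}^n} \quad \text{for } \mathbf{i} \in \mathbb{N}^d.$$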
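Note that unlike the classical logarithm $\log : \mathbb{R}_+ \to \mathbb{R}$, the tensorized logarithm is defined on the whole space as a formal expression. This can be summarized in the following lemma.

Lemma 4.
$$\kappa^{\mathbf{i}}(\gamma) = \sum_{\pi\in P(m)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi}\big(X^{\otimes \mathbf{i}}\big) = \big(\log \mu(\gamma^{\mathbf{i}})\big)^{\mathbf{1}_m}, \tag{11}$$
where $\mathbf{1}_m = (1, \dots, 1) \in \mathbb{N}^m$.

By iterating (10) we can express (11) as
$$\sum_{j=1}^m \frac{(-1)^{j-1}}{j} \sum_{\mathbf{i}^1 + \dots + \mathbf{i}^j = \mathbf{1}_m} \mu^{\mathbf{i}^1}(\gamma^{\mathbf{i}}) \star \dots \star \mu^{\mathbf{i}^j}(\gamma^{\mathbf{i}}),$$
and our goal is to express this as a sum over partitions. We will use the notation $[n] = \{1, \dots, n\}$.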
We can achieve our goal in two parts:
1. Show that for a fixed i ∈ N d with deg(i) = m we can express (11) as a sum over all surjective functions from [m] to [j].
2. Show that this sum over functions reduces to a sum over partitions.
Part 1. Note that given $\mathbf{i}^1 + \dots + \mathbf{i}^j = \mathbf{1}_m$ we may define $h : [m] \to [j]$ by the relation $(\mathbf{i}^{h(n)})_n = 1$, that is, we take $h(n)$ to be the index $c$ for which the multi-index $\mathbf{i}^c$ is 1 at $n$. Note that this function is necessarily surjective since the sum is taken over non-zero multi-indices. Conversely, for any surjective function $h : [m] \to [j]$ we may define multi-indices by setting
$$(\mathbf{i}^c)_n = \begin{cases} 1 & \text{if } n \in h^{-1}(c), \\ 0 & \text{otherwise.} \end{cases}$$
Note that any such multi-index will be non-zero since the function is assumed to be surjective. With this identification we can write
$$\big(\log \mu(\gamma^{\mathbf{i}})\big)^{\mathbf{1}_m} = \sum_{j=1}^m \frac{(-1)^{j-1}}{j} \sum_{h : [m]\to[j]} \mu^{\mathbf{i}^{h^{-1}(1)}}(\gamma^{\mathbf{i}}) \star \dots \star \mu^{\mathbf{i}^{h^{-1}(j)}}(\gamma^{\mathbf{i}}),$$
where the inner sum ranges over surjective functions $h$.
Part 2. Recall that given a surjective function $h : [m] \to [j]$ we can associate to it the partition $\pi_h \in P(m)$ given by the set $\{h^{-1}(1), \dots, h^{-1}(j)\}$, and there are exactly $j!$ different functions corresponding to a given partition, which are obtained by re-ordering the values $1, \dots, j$. This reordering of the blocks does not change the summands, since the marginals of the partition measure are always copies of each other and hence self-commute. Hence a product of moments like
$$\mu^{\mathbf{i}^{h^{-1}(1)}}(\gamma^{\mathbf{i}}) \star \dots \star \mu^{\mathbf{i}^{h^{-1}(j)}}(\gamma^{\mathbf{i}})$$
can always be written as $\mu^{\mathbf{i}}(\gamma^{\mathbf{i}}_{\pi_h})$, the $\mathbf{i}$-th coordinate of the moment sequence of the partition measure $\gamma^{\mathbf{i}}_{\pi_h}$. With this in mind we can write
$$\big(\log \mu(\gamma^{\mathbf{i}})\big)^{\mathbf{1}_m} = \sum_{j=1}^m \frac{(-1)^{j-1}}{j} \sum_{h : [m]\to[j]} \mu^{\mathbf{i}}(\gamma^{\mathbf{i}}_{\pi_h}) = \sum_{\pi\in P(m)} \frac{(-1)^{|\pi|-1}}{|\pi|}\, |\pi|!\; \mu^{\mathbf{i}}(\gamma^{\mathbf{i}}_\pi) = \sum_{\pi\in P(m)} c_\pi\, \mu^{\mathbf{i}}(\gamma^{\mathbf{i}}_\pi) = \sum_{\pi\in P(m)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi}\big(X^{\otimes \mathbf{i}}\big).$$
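The counting step in Part 2, namely that every partition with $j$ blocks arises from exactly $j!$ surjections, can be verified by brute force for small $m$. This is an illustrative sketch only; the helper names `surjections` and `partition_of` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial

def surjections(m, j):
    """All surjective maps h: [m] -> [j], encoded as tuples (h(1), ..., h(m))."""
    targets = set(range(1, j + 1))
    return [h for h in product(range(1, j + 1), repeat=m) if set(h) == targets]

def partition_of(h):
    """The partition {h^{-1}(1), ..., h^{-1}(j)} of [m], as a frozenset of blocks."""
    blocks = {}
    for n, value in enumerate(h, start=1):
        blocks.setdefault(value, []).append(n)
    return frozenset(frozenset(block) for block in blocks.values())

m = 4
for j in range(1, m + 1):
    counts = Counter(partition_of(h) for h in surjections(m, j))
    # Every partition of [m] into j blocks is hit by exactly j! surjections.
    print(j, len(counts), set(counts.values()) == {factorial(j)})
```

The factor $j!/j = (j-1)!$ that appears after grouping surjections by partition is exactly what turns the logarithm coefficients $(-1)^{j-1}/j$ into $c_\pi$.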
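From this it immediately follows that for two probability measures $\gamma, \eta$ we can write
$$\langle \kappa^{\mathbf{i}}(\gamma), \kappa^{\mathbf{i}}(\eta) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}} = \Big\langle \sum_{\pi\in P(m)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi}\big(X^{\otimes \mathbf{i}}\big), \sum_{\tau\in P(m)} c_\tau\, \mathbb{E}_{\eta^{\mathbf{i}}_\tau}\big(Y^{\otimes \mathbf{i}}\big) \Big\rangle_{\mathcal{H}^{\otimes \mathbf{i}}} = \sum_{\pi,\tau\in P(m)} c_\pi c_\tau\, \mathbb{E}_{(X,Y)\sim\gamma^{\mathbf{i}}_\pi \otimes \eta^{\mathbf{i}}_\tau} \langle X^{\otimes \mathbf{i}}, Y^{\otimes \mathbf{i}} \rangle_{\mathcal{H}^{\otimes \mathbf{i}}}.$$
Lemma 1 then follows from the definition of the tensor products.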

Section: C.3 Proof of Theorem 2 and Theorem 3
In this section we present the proofs of Theorem 2 and Theorem 3. We do this in a slightly more abstract setting where the feature maps take values in Banach spaces for clarity, until the end when we again restrict our attention to RKHSs. We start out by showing that polynomial functions of the feature maps characterize measures (Lemma 5). From this result we show that cumulants have the same property (Theorem 4), and lastly that this also holds when working directly with the kernels (Proposition 1).
A monomial on separable Banach spaces $\mathcal{B}_1, \dots, \mathcal{B}_d$ is any expression of the form
$$M(x_1, \dots, x_d) = \prod_{j=1}^{i_1} \langle f^1_j, x_1 \rangle \cdots \prod_{j=1}^{i_d} \langle f^d_j, x_d \rangle$$
for some $(i_1, \dots, i_d) \in \mathbb{N}^d$, where $f^i_j \in \mathcal{B}_i^\star$ are elements of the dual space $\mathcal{B}_i^\star$ and $x_i \in \mathcal{B}_i$. Finite linear combinations of monomials are called polynomials. Recall that a set of functions $F$ on a set $S$ is said to separate the points of $S$ if for every $x \ne y \in S$ there exists $f \in F$ such that $f(x) \ne f(y)$.

Lemma 5 (Polynomial functions of feature maps characterize probability measures). Let $\mathcal{X}_1, \dots, \mathcal{X}_d$ be Polish spaces, $\mathcal{B}_1, \dots, \mathcal{B}_d$ separable Banach spaces and $\varphi_i : \mathcal{X}_i \to \mathcal{B}_i$ be continuous, bounded, and injective functions. Then the set of functions on the Borel probability measures $\mathcal{P}\big(\prod_{i=1}^d \mathcal{X}_i\big)$ of $\prod_{i=1}^d \mathcal{X}_i$,
$$\mathcal{P}\Big(\prod_{i=1}^d \mathcal{X}_i\Big) \to \mathbb{R}, \qquad \gamma \mapsto \int_{\prod_{i=1}^d \mathcal{X}_i} p\big(\varphi_1(x_1), \dots, \varphi_d(x_d)\big)\, \mathrm{d}\gamma(x_1, \dots, x_d),$$
where $p$ ranges over all polynomials, separates the points of $\mathcal{P}\big(\prod_{i=1}^d \mathcal{X}_i\big)$.
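Proof. We first show that the pushforward map
$$\prod_{i=1}^d \varphi_i : \mathcal{P}\Big(\prod_{i=1}^d \mathcal{X}_i\Big) \to \mathcal{P}\Big(\prod_{i=1}^d \mathcal{B}_i\Big)$$
is injective. This is done in two parts: first we show that every Borel measure on $\prod_{i=1}^d \mathcal{X}_i$ is a Radon measure, then we show that the pushforward map is injective on Radon measures. To see the first part, note that since $\mathcal{X}_1, \dots, \mathcal{X}_d$ are Polish spaces, so is their product space $\prod_{i=1}^d \mathcal{X}_i$ (Dudley 2004, Theorem 2.5.7; Willard 1970, Theorem 16.4c), and since Borel measures on Polish spaces are Radon measures (Bogachev, 2007, Theorem 7.1.7), any $\gamma \in \mathcal{P}\big(\prod_{i=1}^d \mathcal{X}_i\big)$ must be a Radon measure. For the second part, note that
$$\prod_{i=1}^d \varphi_i : \prod_{i=1}^d \mathcal{X}_i \to \prod_{i=1}^d \mathcal{B}_i, \qquad \Big(\prod_{i=1}^d \varphi_i\Big)(x_1, \dots, x_d) = \big(\varphi_1(x_1), \dots, \varphi_d(x_d)\big)$$
is a norm-bounded, continuous injection. Since $\prod_{i=1}^d \mathcal{B}_i$ is a Hausdorff space, $\prod_{i=1}^d \varphi_i$ is a homeomorphism on compacts, since continuous injections into Hausdorff spaces are homeomorphisms on compacts (Rudin, 1953, Theorem 4.17). Let $\mu, \nu \in \mathcal{P}\big(\prod_{i=1}^d \mathcal{X}_i\big)$ be two Radon measures such that their pushforwards are the same, $\big(\prod_{i=1}^d \varphi_i\big)(\mu) = \big(\prod_{i=1}^d \varphi_i\big)(\nu)$. Then for any compact $C \subseteq \prod_{i=1}^d \mathcal{X}_i$ we have $\mu(C) = \nu(C)$, as $\prod_{i=1}^d \varphi_i : C \to \big(\prod_{i=1}^d \varphi_i\big)(C)$ is a homeomorphism. Since Radon measures are characterized by their values on compacts, this implies that $\mu = \nu$. Hence the pushforward map is injective.

Denote by $K$ the image of $\prod_{i=1}^d \mathcal{X}_i$ under the mapping $\prod_{i=1}^d \varphi_i$ in $\prod_{i=1}^d \mathcal{B}_i$. Note that $K$ is a bounded Polish space. It is enough to show that the polynomials separate the points of $\mathcal{P}(K)$. To see this, note that the polynomials form an algebra of continuous functions that separate the points of $\prod_{i=1}^d \mathcal{B}_i$, and when restricted to $K$ they are bounded, since $K$ is norm bounded. Since $K$ is Polish, any Borel measure is Radon, and we can apply the Stone-Weierstrass theorem for Radon measures (Bogachev, 2007, Exercise 7.14.79) to get the assertion.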
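In what follows we will use the following index notation for linear functionals. Fix some tuple $\mathbf{i} = (i_1, \dots, i_d) \in \mathbb{N}^d$ with $\deg(\mathbf{i}) = m$. Given separable Banach spaces $\mathcal{B}_1, \dots, \mathcal{B}_d$ we use the notation
$$\mathcal{B}^{\otimes \mathbf{i}} := \mathcal{B}_1^{\otimes i_1} \otimes \dots \otimes \mathcal{B}_d^{\otimes i_d},$$
and given an element $x = (x_1, \dots, x_d) \in \prod_{i=1}^d \mathcal{B}_i$ we write
$$x^{\mathbf{i}} := x_1^{\otimes i_1} \otimes \dots \otimes x_d^{\otimes i_d},$$
so that $x^{\mathbf{i}} \in \mathcal{B}^{\otimes \mathbf{i}}$. If we have functions $(\varphi_i)_{i=1}^d$ such that $\varphi_i : \mathcal{X}_i \to \mathcal{B}_i$ on some Polish spaces $\mathcal{X}_1, \dots, \mathcal{X}_d$, then we write
$$\varphi^{\otimes \mathbf{i}} := \varphi_1^{\otimes i_1} \otimes \dots \otimes \varphi_d^{\otimes i_d}, \qquad \varphi^{\otimes \mathbf{i}} : \prod_{i=1}^d \mathcal{X}_i \to \mathcal{B}^{\otimes \mathbf{i}}.$$
Given a collection of linear functionals $F \in \prod_{j=1}^d (\mathcal{B}_j^\star)^{i_j}$ such that $F = (f_1, \dots, f_d)$ we write
$$F^{\otimes \mathbf{i}} := f_1 \otimes \dots \otimes f_d, \qquad F^{\otimes \mathbf{i}} \in \big(\mathcal{B}^{\otimes \mathbf{i}}\big)^\star.$$
Note the following trick: the monomials on $\prod_{i=1}^d \mathcal{B}_i$ are exactly the functions of the form $x \mapsto \langle F^{\otimes \mathbf{i}}, x^{\mathbf{i}} \rangle$ for $F = (f_1, \dots, f_d)$; this will be used in the proofs. We can now restate and prove our theorem. Note that the cumulants here are defined as in Definition 4, which is a sensible definition even if the feature maps are not associated to kernels.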
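Theorem 4 (Generalization of Theorem 2 and Theorem 3). Let $\mathcal{X}_1, \dots, \mathcal{X}_d$ be Polish spaces and $\varphi_i : \mathcal{X}_i \to \mathcal{B}_i$ be continuous, bounded and injective feature maps into separable Banach spaces $\mathcal{B}_i$ for $i = 1, \dots, d$. Let $\gamma$ and $\eta$ be probability measures on $\mathcal{X}_1 \times \dots \times \mathcal{X}_d$. Then
1. $\gamma = \eta$ if and only if $\kappa(\gamma) = \kappa(\eta)$.
2. $\gamma = \bigotimes_{i=1}^d \gamma|_{\mathcal{X}_i}$ if and only if the cross cumulants vanish, that is, $\kappa^{\mathbf{i}}(\gamma) = 0$ for all $\mathbf{i} \in \mathbb{N}^d_+$.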
Proof.
• Item 2: We want to show that the cross cumulants vanish if and only if $\gamma = \bigotimes_{i=1}^d \gamma|_{\mathcal{X}_i}$. By Lemma 5 it is enough to show that
$$\mathbb{E}_\gamma\, p\big(\varphi_1(X_1), \dots, \varphi_d(X_d)\big) = \mathbb{E}_{\bigotimes_{i=1}^d \gamma|_{\mathcal{X}_i}}\, p\big(\varphi_1(X_1), \dots, \varphi_d(X_d)\big)$$
for any monomial function $p$. Let us take linear functionals $F = (f_1, \dots, f_d)$ and note that
$$\langle F^{\otimes \mathbf{i}}, \kappa^{\mathbf{i}}(\gamma) \rangle = \sum_{\pi\in P(d)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi}\big[ f_1(\varphi_1(X_1)) \cdots f_d(\varphi_d(X_d)) \big],$$
which is the classical cumulant of the vector-valued random variable
$$\big( (f_1 \circ \varphi_1)(X_1), \dots, (f_d \circ \varphi_d)(X_d) \big),$$
where $(X_1, \dots, X_d) \sim \gamma$. Hence by classical results (Speed, 1983), all cross cumulants of $\big( (f_1 \circ \varphi_1)(X_1), \dots, (f_d \circ \varphi_d)(X_d) \big)$ vanish if and only if the cross moments split, that is to say
$$\mathbb{E}_\gamma\, p\big( (f_1 \circ \varphi_1)(X_1), \dots, (f_d \circ \varphi_d)(X_d) \big) = \mathbb{E}_{\bigotimes_{i=1}^d \gamma|_{\mathcal{X}_i}}\, p\big( (f_1 \circ \varphi_1)(X_1), \dots, (f_d \circ \varphi_d)(X_d) \big)$$
for any monomial $p$ on $\mathbb{R}^d$. Since $f_1, \dots, f_d$ were arbitrary, this holds for all monomials, which shows the assertion.
• Item 1: By assumption $\kappa^{\mathbf{i}}(\gamma) = \kappa^{\mathbf{i}}(\eta)$ for every $\mathbf{i} \in \mathbb{N}^d$; this implies that $\mathbb{E}_\gamma\, p(\varphi_1, \dots, \varphi_d) = \mathbb{E}_\eta\, p(\varphi_1, \dots, \varphi_d)$ for any polynomial $p$, so we can apply Lemma 5.
Proposition 1 (Theorem 2 and Theorem 3). Let $\mathcal{X}_1, \dots, \mathcal{X}_d$ be Polish spaces and $k_i : \mathcal{X}_i^2 \to \mathbb{R}$ be a collection of bounded, continuous, point-separating kernels. Let $\gamma$ and $\eta$ be probability measures on $\mathcal{X}_1 \times \dots \times \mathcal{X}_d$. Then
1. $\gamma = \eta$ if and only if $\kappa_{k_1,\dots,k_d}(\gamma) = \kappa_{k_1,\dots,k_d}(\eta)$.
2. $\gamma = \bigotimes_{i=1}^d \gamma|_{\mathcal{X}_i}$ if and only if $\kappa^{\mathbf{i}}_{k_1,\dots,k_d}(\gamma) = 0$ for all $\mathbf{i} \in \mathbb{N}^d_+$.

Proof. We reduce the proof to checking the conditions of Theorem 4. Let $\varphi_i$ denote the canonical feature map of the kernel $k_i$, and let $\mathcal{B}_i := \mathcal{H}_{k_i}$ be the RKHS associated to $k_i$ ($i \in \{1, \dots, d\}$). For all $i \in \{1, \dots, d\}$, $\varphi_i$ is (i) bounded by the boundedness of $k_i$, since
$$\|\varphi_i(x)\|^2_{\mathcal{H}_{k_i}} = k_i(x, x) \le \sup_{x\in\mathcal{X}_i} |k_i(x, x)| < \infty,$$
(ii) continuous by the continuity of $k_i$ (Steinwart and Christmann, 2008, Lemma 4.29), and (iii) injective by the point-separating property of $k_i$. The separability of $\mathcal{H}_{k_i}$ follows (Steinwart and Christmann, 2008, Lemma 4.33) from the separability of $\mathcal{X}_i$ and the continuity of $k_i$ ($i \in \{1, \dots, d\}$). Note: Details on the expected kernel trick part of Theorem 2 and Theorem 3 are provided in Section E.

Section: D Additional experiments and details
Here we give additional details on the experiments that were performed, and discuss some further experiments that did not fit into the main text.
Background on permutation testing. Permutation testing works by bootstrapping the distribution of a test statistic under the null hypothesis. This allows the user to estimate confidence intervals under the null, which is a powerful all-purpose approach when analytic expressions are unavailable. As an example, assume we have two probability measures $\gamma, \eta$ on $\mathcal{X}$ with i.i.d. samples $x_1, \dots, x_N \sim \gamma$, $y_1, \dots, y_N \sim \eta$. If the null hypothesis is that $\gamma = \eta$, then we may set $(z_1, \dots, z_{2N}) := (x_1, \dots, x_N, y_1, \dots, y_N)$, so that for any permutation $\sigma$ on $2N$ elements we get two sets of i.i.d. samples from $\gamma = \eta$ by using the empirical measures
$$\hat\gamma_\sigma := (z_{\sigma(1)}, \dots, z_{\sigma(N)}), \qquad \hat\eta_\sigma := (z_{\sigma(N+1)}, \dots, z_{\sigma(2N)}),$$
and for any statistic $S : \mathcal{P}(\mathcal{X})^2 \to \mathbb{R}$, we may estimate the distribution of $S(\gamma, \eta)$ under the null by sampling from $S(\hat\gamma_\sigma, \hat\eta_\sigma)$. If the null hypothesis were true, we would expect $S(\gamma, \eta)$ to lie in a region of high probability under the permutation distribution, and we can use this as a criterion for rejecting the null. Under fairly weak assumptions, this yields a test at the appropriate level (Chung and Romano, 2013).
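The procedure above can be sketched in a few lines. The code below is a minimal illustration, assuming a one-sided test that rejects for large values of the statistic and using the biased squared MMD with a Gaussian kernel as a stand-in statistic; the function names, the bandwidth and the number of permutations are hypothetical choices and not the settings used in the experiments.

```python
import numpy as np

def mmd_squared(x, y, sigma=1.0):
    """Biased (V-statistic) squared MMD with a Gaussian kernel; x, y are 2-D arrays."""
    def gram(a, b):
        sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq / (2.0 * sigma**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

def permutation_p_value(x, y, statistic, n_permutations=200, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    z = np.concatenate([x, y])          # pooled sample (z_1, ..., z_{2N})
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(z))
        # Split the permuted pool into two pseudo-samples and recompute S.
        exceed += int(statistic(z[perm[:n]], z[perm[n:]]) >= observed)
    # The "+1" terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(60, 1))
y = rng.normal(3.0, 1.0, size=(60, 1))
print(permutation_p_value(x, y, mmd_squared))
```

Any other two-sample statistic with the same call signature can be passed in place of `mmd_squared`.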
Comparing a uniform and a mixture. Any uniform random variable over a symmetric interval has zero mean and zero skewness, so a symmetric mixture only needs to match the variance. If $X$ is a 50/50 mixture of $U[a, b]$ and $U[-b, -a]$ then
$$\operatorname{Var}(X) = \tfrac{1}{3}\big(b^2 + ba + a^2\big),$$
so if $Y$ is distributed according to $U[-c, c]$, which has variance $c^2/3$, then we only need to solve
$$b^2 + ba + a^2 = c^2,$$
which is straightforward for a given $a$ and $c$.
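For concreteness, the quadratic can be solved in closed form: the positive root is $b = \tfrac{1}{2}\big(-a + \sqrt{4c^2 - 3a^2}\big)$, and it satisfies $b > a$ exactly when $c > \sqrt{3}\,a$. The sketch below implements this; the function name `mixture_upper_endpoint` and the example values are hypothetical.

```python
import math

def mixture_upper_endpoint(a, c):
    """Solve b^2 + a*b + a^2 = c^2 for the root b > a >= 0.

    A root with b > a exists only if c > sqrt(3) * a.
    """
    if not (a >= 0 and c > math.sqrt(3.0) * a):
        raise ValueError("need 0 <= a and c > sqrt(3) * a for a root b > a")
    return (-a + math.sqrt(4.0 * c**2 - 3.0 * a**2)) / 2.0

a, c = 1.0, 3.0
b = mixture_upper_endpoint(a, c)
variance_mixture = (b**2 + a * b + a**2) / 3.0   # variance of the 50/50 mixture
variance_uniform = c**2 / 3.0                    # variance of U[-c, c]
print(b, variance_mixture, variance_uniform)
```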
Computational complexity of estimators. The V-statistic for $d^{(2)}$ as written in Lemma 2 is bottlenecked by the matrix multiplications. We may note, however, that for two matrices $A, B$ it holds that
$$\operatorname{Tr}(A^\top B) = \langle A \circ B \rangle,$$
where $\langle \cdot \rangle$ denotes the sum over elements and $\circ$ denotes the Hadamard product. We also note that for $H_n = \frac{1}{n} 1_n 1_n^\top$ we have
$$(A H_n)_{i,j} = \frac{1}{n} \sum_{c=1}^n A_{i,c}.$$
Using both of these tricks we may compute both $d^{(2)}$ and CSIC without any matrix multiplications, which brings the computational complexity down to $O(N^2)$ for both. For a comparison of actual computation time, see Fig. 6 and Fig. 7, where the average computational times for our methods are compared to the KME and HSIC for $N$ between 50 and 2000.
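Both identities are easy to confirm numerically. The sketch below compares the direct matrix-product expressions with their multiplication-free counterparts on a small random example; the matrix size and variable names are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
H = np.full((n, n), 1.0 / n)          # H_n = (1/n) 1 1^T

# Tr(A^T B) equals the sum of the entries of the Hadamard product.
trace_direct = np.trace(A.T @ B)
trace_hadamard = np.sum(A * B)

# Every row of A H_n is constant and equal to the mean of that row of A.
AH_direct = A @ H
AH_fast = np.repeat(A.mean(axis=1, keepdims=True), n, axis=1)

# Combining both: Tr(H A H A) = (sum of entries of A)^2 / n^2, with no matrix product.
thah_direct = np.trace(H @ A @ H @ A)
thah_fast = (A.sum() / n) ** 2

print(np.isclose(trace_direct, trace_hadamard))
print(np.allclose(AH_direct, AH_fast))
print(np.isclose(thah_direct, thah_fast))
```

Each fast expression touches every matrix entry a constant number of times, which is where the quadratic cost comes from.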
Type I error on the Seoul Bicycle data. The results when comparing the winter data to itself are presented in Fig. 8. As we see, the performance is similar for both estimators and lies between 5% and 10%.

Classical vs. kernelized cumulants. Using the same distributions as in the synthetic independence testing experiment, we now compare $X$ with $Y^2_{0.5}$ to contrast independence testing with classical cumulants against their kernelized counterpart. The results are summarized in Table 1, where they are displayed as the median value ± half the difference between the 75th and 25th percentiles. We consider every combination of classical vs. kernelized, variance vs. skewness, and two different sample sizes. One can observe that the classical variance-based test performs poorly compared to the classical skewness test, that the kernelized variance test is almost as powerful as the kernelized skewness test, and that in all cases the kernelized tests deliver higher power.

Section: E Kernel trick computations
Here we show how to arrive at the expressions used for the V-statistics used in the experiments.  

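Table 1 ($N = 30$):

             Variance       Skewness
Classical    17% ± 0.5%     68% ± 1.0%
Rbf kernel   65% ± 3.5%     79% ± 1.5%

Given a real analytic function $f(x_1, \dots, x_d) = \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, x^{\mathbf{i}}$ in $d$ variables with nonzero radius of convergence and Hilbert spaces $\mathcal{H}_1, \dots, \mathcal{H}_d$, we may (formally) extend $f$ to a function
$$f_\otimes : \prod_{i=1}^d \mathcal{H}_i \to T, \qquad f_\otimes(x_1, \dots, x_d) = \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, x^{\otimes \mathbf{i}}.$$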
Moreover, if the Hilbert spaces are RKHSs then we have the following result.
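Lemma 6 (Nonlinear kernel trick). For any collection of RKHSs $\mathcal{H}_1, \dots, \mathcal{H}_d$ with feature maps $\varphi_i : \mathcal{X}_i \to \mathcal{H}_i$, assume that $f$ and $g$ are real analytic functions with radii of convergence $r(f)$ and $r(g)$ such that $\max_{1\le i\le d} \sup_{x\in\mathcal{X}_i} \|\varphi_i(x)\|_{\mathcal{H}_i} < \min(r(f), r(g))$. Then
$$\big\langle f_\otimes\big(\varphi_1(x_1), \dots, \varphi_d(x_d)\big),\ g_\otimes\big(\varphi_1(y_1), \dots, \varphi_d(y_d)\big) \big\rangle_T = \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, g_{\mathbf{i}}\, k_1(x_1, y_1)^{i_1} \cdots k_d(x_d, y_d)^{i_d}.$$
Proof. Since the images of the $\varphi_i$-s lie inside the radii of convergence of $f_\otimes$ and $g_\otimes$, the power series converge absolutely and, writing $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$, we have
$$\big\langle f_\otimes\big(\varphi_1(x_1), \dots, \varphi_d(x_d)\big),\ g_\otimes\big(\varphi_1(y_1), \dots, \varphi_d(y_d)\big) \big\rangle_T = \Big\langle \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, \varphi^{\otimes \mathbf{i}}(x),\ \sum_{\mathbf{i}\in\mathbb{N}^d} g_{\mathbf{i}}\, \varphi^{\otimes \mathbf{i}}(y) \Big\rangle_T = \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, g_{\mathbf{i}}\, \langle \varphi^{\otimes \mathbf{i}}(x), \varphi^{\otimes \mathbf{i}}(y) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}} = \sum_{\mathbf{i}\in\mathbb{N}^d} f_{\mathbf{i}}\, g_{\mathbf{i}}\, k_1(x_1, y_1)^{i_1} \cdots k_d(x_d, y_d)^{i_d},$$
where $\mathcal{H} = \mathcal{H}_1 \times \dots \times \mathcal{H}_d$.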
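The statement of Lemma 6 can be checked by hand in a toy setting. The sketch below assumes $d = 1$, $\mathcal{H} = \mathbb{R}^2$ with the linear kernel $k(x, y) = \langle x, y \rangle$, and polynomial $f$ and $g$ (finite series), so that the lifted elements can be stored explicitly as tensors; all names and coefficients are hypothetical choices for this illustration.

```python
from math import factorial

import numpy as np

def tensor_power(x, m):
    """x^{(x) m} as an array with m axes (the scalar 1.0 for m = 0)."""
    out = np.array(1.0)
    for _ in range(m):
        out = np.multiply.outer(out, x)
    return out

def lifted(coeffs, x):
    """Finite series (f_0, f_1 x, f_2 x (x) x, ...) as an element of the tensor algebra."""
    return [c * tensor_power(x, m) for m, c in enumerate(coeffs)]

def inner(s, t):
    """Inner product in the tensor algebra: sum of the level-wise inner products."""
    return sum(np.sum(a * b) for a, b in zip(s, t))

f = [1.0 / factorial(m) for m in range(5)]   # coefficients of a degree-4 polynomial f
g = [0.5**m for m in range(5)]               # coefficients of a degree-4 polynomial g
x = np.array([0.3, -0.2])
y = np.array([0.1, 0.4])

lhs = inner(lifted(f, x), lifted(g, y))      # explicit inner product of tensors
k_xy = float(x @ y)                          # linear kernel k(x, y)
rhs = sum(fm * gm * k_xy**m for m, (fm, gm) in enumerate(zip(f, g)))
print(lhs, rhs)
```

The left-hand side manipulates arrays whose size grows exponentially with the degree, whereas the right-hand side only needs the scalar kernel value, which is the point of the kernel trick.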
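Using Lemma 6, we can choose kernels $k_i : \mathcal{X}_i^2 \to \mathbb{R}$ with associated RKHSs $\mathcal{H}_i$ and feature maps $\varphi_i$, and some $\mathbf{i} \in \mathbb{N}^d$ with $\deg(\mathbf{i}) = m$. We make the observation that with $X = (X_1, \dots, X_d) \sim \gamma$, $Y = (Y_1, \dots, Y_d) \sim \eta$, and $k^{\otimes \mathbf{i}}$ and $\mathcal{H}^{\otimes \mathbf{i}}$ as in (4), one has
$$\begin{aligned}
\langle \kappa^{\mathbf{i}}(\gamma), \kappa^{\mathbf{i}}(\eta) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}}
&= \Big\langle \sum_{\pi\in P(m)} c_\pi\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi} \varphi^{\otimes \mathbf{i}}(X^{\mathbf{i}}),\ \sum_{\tau\in P(m)} c_\tau\, \mathbb{E}_{\eta^{\mathbf{i}}_\tau} \varphi^{\otimes \mathbf{i}}(Y^{\mathbf{i}}) \Big\rangle_{\mathcal{H}^{\otimes \mathbf{i}}} \\
&= \sum_{\pi,\tau\in P(m)} c_\pi c_\tau\, \big\langle \mathbb{E}_{\gamma^{\mathbf{i}}_\pi} \varphi^{\otimes \mathbf{i}}(X^{\mathbf{i}}),\ \mathbb{E}_{\eta^{\mathbf{i}}_\tau} \varphi^{\otimes \mathbf{i}}(Y^{\mathbf{i}}) \big\rangle_{\mathcal{H}^{\otimes \mathbf{i}}} \\
&= \sum_{\pi,\tau\in P(m)} c_\pi c_\tau\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi \otimes \eta^{\mathbf{i}}_\tau} \big\langle \varphi^{\otimes \mathbf{i}}(X^{\mathbf{i}}),\ \varphi^{\otimes \mathbf{i}}(Y^{\mathbf{i}}) \big\rangle_{\mathcal{H}^{\otimes \mathbf{i}}} \\
&= \sum_{\pi,\tau\in P(m)} c_\pi c_\tau\, \mathbb{E}_{\gamma^{\mathbf{i}}_\pi \otimes \eta^{\mathbf{i}}_\tau}\, k^{\otimes \mathbf{i}}\big( (X_1, \dots, X_m), (Y_1, \dots, Y_m) \big).
\end{aligned}$$
Since
$$\|\kappa^{\mathbf{i}}(\gamma)\|^2_{\mathcal{H}^{\otimes \mathbf{i}}} = \langle \kappa^{\mathbf{i}}(\gamma), \kappa^{\mathbf{i}}(\gamma) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}}$$
and
$$\|\kappa^{\mathbf{i}}(\gamma) - \kappa^{\mathbf{i}}(\eta)\|^2_{\mathcal{H}^{\otimes \mathbf{i}}} = \langle \kappa^{\mathbf{i}}(\gamma), \kappa^{\mathbf{i}}(\gamma) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}} + \langle \kappa^{\mathbf{i}}(\eta), \kappa^{\mathbf{i}}(\eta) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}} - 2 \langle \kappa^{\mathbf{i}}(\gamma), \kappa^{\mathbf{i}}(\eta) \rangle_{\mathcal{H}^{\otimes \mathbf{i}}},$$
one gets the expected kernel trick statements of Theorem 2 and Theorem 3.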
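We are now interested in explicitly computing the expressions $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{\mathcal{H}_k^{\otimes 1} \otimes \mathcal{H}_\ell^{\otimes 2}}$, $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}$ and $d^{(3)}(\gamma, \eta)$, together with their corresponding V-statistics. Recall that for a (w.l.o.g.) symmetric, measurable function $h(z_1, \dots, z_m)$, the V-statistic of $h$ with $N$ samples $Z_1, \dots, Z_N$ is defined as
$$V(h; Z_1, \dots, Z_N) := N^{-m} \sum_{i_1, \dots, i_m = 1}^N h(Z_{i_1}, \dots, Z_{i_m}).$$
Under fairly general conditions, the V-statistic converges in distribution to $\mathbb{E}[h(Z_1, \dots, Z_m)]$ and a well-developed theory describes this convergence (van der Vaart, 2000; Serfling, 1980; Arcones and Giné, 1992).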
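Example E.1 (Estimating $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}$). Let $X, X', X'', X'''$ denote independent copies of $X \sim \gamma$ and $Y, Y', Y'', Y'''$ denote independent copies of $Y \sim \eta$. The full expression for $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}$ is
$$\begin{aligned}
\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}
&= \mathbb{E}\,k(X, X')k(X'', X''') + \mathbb{E}\,k(Y, Y')k(Y'', Y''') \\
&\quad + \mathbb{E}\,k(X, X')^2 + \mathbb{E}\,k(Y, Y')^2 \\
&\quad + 2\,\mathbb{E}\,k(X, Y)k(X', Y) + 2\,\mathbb{E}\,k(X, Y)k(X, Y') \\
&\quad - 2\,\mathbb{E}\,k(X, Y)k(X', Y') - 2\,\mathbb{E}\,k(X, Y)^2 \\
&\quad - 2\,\mathbb{E}\,k(X, X')k(X, X'') - 2\,\mathbb{E}\,k(Y, Y')k(Y, Y'').
\end{aligned} \tag{12}$$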
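Given samples $(x_i)_{i=1}^N$, $(y_i)_{i=1}^M$ from $\gamma$ and $\eta$ respectively, the corresponding V-statistic is obtained by replacing each expectation in (12) by the matching empirical average:
$$\begin{aligned}
&\frac{1}{N^4} \sum_{i,j,r,l=1}^N k(x_i, x_j)k(x_r, x_l) + \frac{1}{M^4} \sum_{i,j,r,l=1}^M k(y_i, y_j)k(y_r, y_l) \\
&\quad + \frac{1}{N^2} \sum_{i,j=1}^N k(x_i, x_j)^2 + \frac{1}{M^2} \sum_{i,j=1}^M k(y_i, y_j)^2 \\
&\quad + \frac{2}{N^2 M} \sum_{i,j=1}^N \sum_{r=1}^M k(x_i, y_r)k(x_j, y_r) + \frac{2}{N M^2} \sum_{i=1}^N \sum_{j,r=1}^M k(x_i, y_j)k(x_i, y_r) \\
&\quad - \frac{2}{N^2 M^2} \sum_{i,j=1}^N \sum_{r,l=1}^M k(x_i, y_r)k(x_j, y_l) - \frac{2}{N M} \sum_{i=1}^N \sum_{j=1}^M k(x_i, y_j)^2 \\
&\quad - \frac{2}{N^3} \sum_{i,j,r=1}^N k(x_i, x_j)k(x_i, x_r) - \frac{2}{M^3} \sum_{i,j,r=1}^M k(y_i, y_j)k(y_i, y_r).
\end{aligned} \tag{13}$$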
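Let us define the Gram matrices $K_x = [k(x_i, x_j)]_{i,j=1}^N \in \mathbb{R}^{N\times N}$, $K_y = [k(y_i, y_j)]_{i,j=1}^M \in \mathbb{R}^{M\times M}$, $K_{xy} = [k(x_i, y_j)]_{i,j=1}^{N,M} \in \mathbb{R}^{N\times M}$ and let
$$H_N = \frac{1}{N} 1_N 1_N^\top \in \mathbb{R}^{N\times N}, \qquad H_M = \frac{1}{M} 1_M 1_M^\top \in \mathbb{R}^{M\times M};$$
then (13) can be rewritten as
$$\begin{aligned}
&\frac{1}{N^2} \operatorname{Tr}(H_N K_x H_N K_x) + \frac{1}{M^2} \operatorname{Tr}(H_M K_y H_M K_y) + \frac{1}{N^2} \operatorname{Tr}(K_x^2) + \frac{1}{M^2} \operatorname{Tr}(K_y^2) \\
&\quad + \frac{2}{NM} \operatorname{Tr}(K_{xy}^\top H_N K_{xy}) + \frac{2}{NM} \operatorname{Tr}(K_{xy} H_M K_{xy}^\top) - \frac{2}{NM} \operatorname{Tr}(H_M K_{xy}^\top H_N K_{xy}) - \frac{2}{NM} \operatorname{Tr}(K_{xy}^\top K_{xy}) \\
&\quad - \frac{2}{N^2} \operatorname{Tr}(K_x H_N K_x) - \frac{2}{M^2} \operatorname{Tr}(K_y H_M K_y).
\end{aligned}$$
This estimator can be computed in quadratic time.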
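The quadratic-time claim can be made concrete. The following sketch evaluates the V-statistic of (12) using only entrywise operations and row or column means of the Gram matrices; it is an illustration under the assumption that the Gram matrices have already been computed, and the function name `d2_v_statistic` is hypothetical. With the linear kernel the kernel variance embedding is the (biased) empirical covariance matrix, which gives a simple correctness check.

```python
import numpy as np

def d2_v_statistic(Kx, Ky, Kxy):
    """V-statistic of ||kappa^(2)(gamma) - kappa^(2)(eta)||^2 in O(N^2 + M^2 + NM) time.

    Kx (N x N), Ky (M x M) and Kxy (N x M) are precomputed Gram matrices.
    """
    def norm_sq(K):
        # E k(X,X')k(X'',X''') + E k(X,X')^2 - 2 E k(X,X')k(X,X'')
        return K.mean() ** 2 + np.mean(K**2) - 2.0 * np.mean(K.mean(axis=1) ** 2)

    # <kappa(gamma), kappa(eta)>: the four cross terms of (12), without the factor -2.
    cross = (
        np.mean(Kxy**2)
        + Kxy.mean() ** 2
        - np.mean(Kxy.mean(axis=0) ** 2)   # E k(X,Y)k(X',Y)
        - np.mean(Kxy.mean(axis=1) ** 2)   # E k(X,Y)k(X,Y')
    )
    return norm_sq(Kx) + norm_sq(Ky) - 2.0 * cross

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
y = 2.0 * rng.normal(size=(30, 3))

# Linear kernel: the statistic equals the squared Frobenius distance of the covariances.
value = d2_v_statistic(x @ x.T, y @ y.T, x @ y.T)
reference = np.sum((np.cov(x, rowvar=False, bias=True) - np.cov(y, rowvar=False, bias=True)) ** 2)
print(value, reference)
```

For a general kernel only the Gram matrices change; the function itself never forms a matrix product.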

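Example E.2 (Estimating $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{\mathcal{H}_k^{\otimes 1} \otimes \mathcal{H}_\ell^{\otimes 2}}$). Let $k$ denote the kernel on $\mathcal{X}_1$ and $\ell$ denote the kernel on $\mathcal{X}_2$. Let $(X, Y), (X', Y'), (X'', Y''), (X^{(3)}, Y^{(3)}), (X^{(4)}, Y^{(4)}), (X^{(5)}, Y^{(5)})$ denote independent copies of $\gamma \in \mathcal{P}(\mathcal{X}_1 \times \mathcal{X}_2)$. The full expression for $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{\mathcal{H}_k^{\otimes 1} \otimes \mathcal{H}_\ell^{\otimes 2}}$ is a sum of expectations of products of $k$ and $\ell$ evaluated at these copies.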
Given samples (x i , y i ) N i=1 from γ the corresponding V-statistic for this expression is
Using the shorthand notation $K = K_x$, $L = L_y$ and $H = H_N$, and denoting by $\circ$ the Hadamard product, $[A \circ B]_{i,j} = A_{i,j} B_{i,j}$, and by $\langle \cdot \rangle$ the sum over all elements of a matrix, $\langle A \rangle = \sum_{i,j=1}^N A_{i,j}$, the V-statistic above can be written in the simpler form
Again this estimator can be computed in quadratic time.

Section: Example E.3 (Estimating ∥κ
(3)
In order to estimate d (3) (γ, η) we note that one can write
We can estimate the first two terms like in Example E.2, and the third term can be expressed as
For simplicity we will assume that we have an equal number of samples (N ) from both measures
Using the notation
We mention also that the first two terms ∥κ
k can be computed a little more simply than in Example E.2 since the expressions have more symmetry, using the notation K x = [k(x i , x j )] N i,j=1 we can write down the V-statistic for ∥κ
(3)
with a similar expression for ∥κ . The estimator can be computed in quadratic time.


References:
[b0] M A Arcones; E Giné (1992). On the bootstrap of U and V statistics. Annals of Statistics
[b1] N Aronszajn (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society
[b2] L Baringhaus; C Franz (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis
[b3] A Berlinet; C Thomas-Agnan (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer
[b4] V I Bogachev (2007). Measure theory. Springer
[b5] P Bonnier; H Oberhauser (2020). Signature cumulants, ordered partitions, and independence of stochastic processes. Bernoulli
[b6] I Chevyrev; H Oberhauser (2022). Signature moments to characterize laws of stochastic processes. Journal of Machine Learning Research
[b7] E Chung; J P Romano (2013). Exact and asymptotically robust permutation tests. Annals of Statistics
[b8] N G De Bruijn (1981). Asymptotic Methods in Analysis. Dover
[b9] R Dudley (2004). Real Analysis and Probability. Cambridge University Press
[b10] V E Sathishkumar; J Park; Y Cho (2020). Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications
[b11] R P Ferreira (2016). Combination of artificial intelligence techniques for prediction the behavior of urban vehicular traffic in the city of São Paulo. 
[b12] K Fukumizu; A Gretton; X Sun; B Schölkopf (2008). Kernel measures of conditional dependence. 
[b13] T Gärtner (2003). A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter
[b14] A Gretton; K Borgwardt; M Rasch; B Schölkopf; A Smola (2012). A kernel two-sample test. Journal of Machine Learning Research
[b15] A Gretton; O Bousquet; A Smola; B Schölkopf (2005). Measuring statistical dependence with Hilbert-Schmidt norms. 
[b16] A Gretton; K Fukumizu; C H Teo; L Song; B Schölkopf; A J Smola (2008). A kernel statistical test of independence. 
[b17] O Hagrass; B K Sriperumbudur; B Li (2022). Spectral regularized kernel two-sample tests. 
[b18] O Hagrass; B K Sriperumbudur; B Li (2023). Spectral regularized kernel goodness-of-fit tests. 
[b19] S R Jammalamadaka; T S Rao; G Terdik (2006). Higher order cumulants of random vectors and applications to statistical inference and time series. Indian Journal of Statistics
[b20] F J Király; H Oberhauser (2019). Kernels for sequentially ordered data. Journal of Machine Learning Research
[b21] L Klebanov (2005). N-Distances and Their Applications. 
[b22] N M Kriege; F D Johansson; C Morris (2020). A survey on graph kernels. Applied Network Science
[b23] S Lang (2002). Algebra. Springer
[b24] Z Liu; R L Peach; P A Mediano; M Barahona (2023). Interaction measures, partition lattices and kernel tests for high-order interactions. 
[b25] H Lodhi; C Saunders; J Shawe-Taylor; N Cristianini; C Watkins (2002). Text classification using string kernels. Journal of machine learning research
[b26] L Lovász (1993). Combinatorial Problems and Exercises. North-Holland
[b27] R Lyons (2013). Distance covariance in metric spaces. The Annals of Probability
[b28] N Makigusa (2020). Two-sample test based on maximum variance discrepancy. 
[b29] P McCullagh (2018). Tensor Methods in Statistics. Courier Dover Publications
[b30] K Muandet; K Fukumizu; B Sriperumbudur; B Schölkopf (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning
[b31] A Müller (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability
[b32] N Pfister; P Bühlmann; B Schölkopf; J Peters (2018). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology)
[b33] N Quadrianto; L Song; A Smola (2009). Kernelized sorting. 
[b34] W Rudin (1953). Principles of mathematical analysis. McGraw-Hill Book Company, Inc
[b35] S Saitoh; Y Sawano (2016). Theory of Reproducing Kernels and Applications. Springer
[b36] B Schölkopf; A Smola (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press
[b37] D Sejdinovic; A Gretton; W Bergsma (2013). A kernel test for three-variable interactions. 
[b38] D Sejdinovic; B Sriperumbudur; A Gretton; K Fukumizu (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics
[b39] R Serfling (1980). Approximation Theorems of Mathematical Statistics. 
[b40] A Smola; A Gretton; L Song; B Schölkopf (2007). A Hilbert space embedding for distributions. 
[b41] T P Speed (1983). Cumulants and partition lattices. Australian Journal of Statistics
[b42] T P Speed (1984). Cumulants and partition lattices II. Australian Journal of Statistics
[b43] B Sriperumbudur; A Gretton; K Fukumizu; B Schölkopf; G Lanckriet (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research
[b44] I Steinwart; A Christmann (2008). Support Vector Machines. Springer
[b45] B Streitberg (1990). Lancaster interactions revisited. Annals of Statistics
[b46] Z Szabó; B K Sriperumbudur (2018). Characteristic and universal tensor product kernels. Journal of Machine Learning Research
[b47] G Székely; M Rizzo (2004). Testing for equal distributions in high dimension. InterStat
[b48] G Székely; M Rizzo (2005). A new test for multivariate normality. Journal of Multivariate Analysis
[b49] G J Székely; M L Rizzo (2009). Brownian distance covariance. The Annals of Applied Statistics
[b50] G J Székely; M L Rizzo; N K Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics
[b51] I Tolstikhin; B Sriperumbudur; B Schölkopf (2016). Minimax estimation of maximum mean discrepancy with radial kernels. 
[b52] A W van der Vaart (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. 
[b53] S Willard (1970). General Topology. Addison-Wesley
[b54] A Zinger; A Kakosyan; L Klebanov (1992). A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Soviet Mathematics
[b55] V Zolotarev (1983). Probability metrics. Theory of Probability and its Applications

Figures:
Figure fig_0: 231
Type: figure
Caption: Figure 2: Test power as a function of the sample size ($N$) for independence testing using the HSIC (red) and CSIC (blue) statistics, testing independence between $Y^2_{0.5}$ and $X$.
Data: 

Figure fig_1: 45
Type: figure
Caption: Figure 4: Independence testing using HSIC (red) and CSIC (blue) on the São Paulo traffic dataset.
Data: 

Figure fig_2: 
Type: figure
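Caption: $\left(\prod_{i=1}^d \varphi_i\right)(\mu) = \left(\prod_{i=1}^d \varphi_i\right)(\nu)$, then for any compact $C \subseteq \prod_{i=1}^d \mathcal{X}_i$ we have $\mu(C) = \nu(C)$, as $\prod_{i=1}^d \varphi_i : C \to \left(\prod_{i=1}^d \varphi_i\right)(C)$ is a homeomorphism. Since Radon measures are characterized by their values on compacts, this implies that $\mu = \nu$. Hence the pushforward map is injective. Denote by $K$ the image of $\prod_{i=1}^d \mathcal{X}_i$ under the mapping $\prod_{i=1}^d \varphi_i$ in $\prod_{i=1}^d$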
Data: 

Figure fig_3: 67
Type: figure
Caption: Figure 6: Average computational time in seconds for KME (red) and $d^{(2)}$ (blue) for sample size $N$ between 50 and 2000.
Data: 

Figure fig_4: 8
Type: figure
Caption: Figure 8: Type I errors using MMD (red) and $d^{(2)}$ (blue) on the Seoul bicycle data set.
Data: 

Figure fig_5: 
Type: figure
Caption: k(η)∥ 2 H (1,1) and ∥κ
Data: 

Figure fig_6: 
Type: figure
Caption: corresponding V-statistics. Recall that for a (w.l.o.g.) symmetric, measurable function $h(z_1, \ldots, z_m)$, the V-statistic of $h$ with $N$ samples $Z_1, \ldots, Z_N$ is defined as $V(h; Z_1, \ldots, Z_N) := N^{-m} \sum_{i_1, \ldots, i_m = 1}^{N} h(Z_{i_1}, \ldots, Z_{i_m})$.
Data: 
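The V-statistic definition recalled above can be illustrated by a brute-force sketch. The function name `v_statistic` and the toy data are illustrative assumptions, not part of the paper; the loop costs $N^m$ evaluations and is only meant to make the definition concrete.

```python
from itertools import product

def v_statistic(h, samples, m):
    """Average of h over all N**m index tuples (indices may repeat)."""
    n = len(samples)
    total = 0.0
    for idx in product(range(n), repeat=m):
        total += h(*(samples[i] for i in idx))
    return total / n**m

# h(z1, z2) = (z1 - z2)^2 / 2 is the usual kernel of the variance;
# its V-statistic is the biased (1/N) sample variance.
data = [1.0, 2.0, 4.0, 7.0]
v_var = v_statistic(lambda a, b: 0.5 * (a - b) ** 2, data, m=2)

mean = sum(data) / len(data)
biased_var = sum((z - mean) ** 2 for z in data) / len(data)
```

Here `v_var` and `biased_var` agree, which is the familiar fact that the V-statistic of the variance kernel is the plug-in variance.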

Figure fig_7: 
Type: figure
Caption: i , y j )k(x κ , y j ) i , y j )k(x i , y κ ) i , y j )k(x κ , y l )i , x j )k(x i , x κ ) -
Data: 

Figure fig_8: 2
Type: figure
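Caption: $\frac{2}{M^2} \operatorname{Tr}(K_y H_M K_y)$, which simplifies to $\frac{1}{N^2} \operatorname{Tr}\left[(K_x (I - H_N))^2\right] + \frac{1}{M^2} \operatorname{Tr}\left[(K_y (I - H_M))^2\right] - \frac{2}{NM} \operatorname{Tr}\left[K_{xy} (I - H_M) K_{xy}^\top (I - H_N)\right]$.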
Data: 

Figure tab_1: 
Type: table
Caption: $\cdots \times \mathcal{X}_d$, $(\mathcal{H}_1, k_1), \ldots, (\mathcal{H}_d, k_d)$ RKHSs on the Polish spaces $\mathcal{X}_1, \ldots, \mathcal{X}_d$ such that for every $1 \le j \le d$, $k_j$ is a bounded, continuous, point-separating kernel. Then $\gamma = \eta$ if and only if $\kappa_{k_1, \ldots, k_d}(\gamma) = \kappa_{k_1, \ldots, k_d}(\eta)$.
Data: 

Figure tab_2: 1
Type: table
Caption: Comparison of classical and kernelized cumulants for independence testing with both variance and skewness.
Data:
N=20 | Variance | Skewness
Classical | 19% ± 3.0% | 56% ± 3.5%
RBF kernel | 39% ± 4.5% | 59% ± 3.0%


Formulas:
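Formula formula_0: $\mu^{i}(\gamma) := \mathbb{E}\left[X_1^{i_1} \cdots X_d^{i_d}\right] \in \mathbb{R},$ (1)

Formula formula_1: ) $:= i_1 + \cdots + i_d$.

Formula formula_2: $\sum_{i \in \mathbb{N}^d} \kappa^{i}(\gamma) \frac{\theta^{i}}{i!} = \log \sum_{i \in \mathbb{N}^d} \mu^{i}(\gamma) \frac{\theta^{i}}{i!}, \quad \theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d,$ (2)

Formula formula_3: $= i_1! \cdots i_d!$ and $\theta^{i} = \theta_1^{i_1} \cdots \theta_d^{i_d}$;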

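Formula formula_4: $\mathcal{H}_1^{\otimes m} := \underbrace{\mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_1}_{m\text{-times}}$

Formula formula_5: $\mathcal{H}_1 \oplus \mathcal{H}_2$ is a Hilbert space. It is natural to consider $\mathbb{E}\left[X_1^{\otimes m}\right] \in \mathcal{H}_1^{\otimes m}$

Formula formula_6: Example 2.1 ($\mathcal{H}_1 = \mathbb{R}^d$, $m = 2$). If $X_1 = (X_1^1, \ldots, X_1^d)$ is $\mathcal{H}_1 = \mathbb{R}^d$-valued then $\mathbb{E}\left[X_1^{\otimes 2}\right] \in (\mathbb{R}^d)^{\otimes 2}$

Formula formula_7: $\mu^{i}(\gamma) := \mathbb{E}\left[X_1^{\otimes i_1} \otimes \cdots \otimes X_d^{\otimes i_d}\right] \in \mathcal{H}^{\otimes i}, \quad \mathcal{H}^{\otimes i} := \mathcal{H}_1^{\otimes i_1} \otimes \cdots \otimes \mathcal{H}_d^{\otimes i_d}$ (3)

Formula formula_8: $\mu(\gamma) = (\mu^{i}(\gamma))_{i \in \mathbb{N}^d} \in T := T_1 \otimes \cdots \otimes T_d$, with $T_j := \prod_{m \ge 0} \mathcal{H}_j^{\otimes m}$,

Formula formula_9: $X_1 \in \mathcal{H}_1$ and $X_2 \in \mathcal{H}_2$ have different state spaces ($\mathcal{H}_1 \neq \mathcal{H}_2$).

Formula formula_10: $X = (X_1, \ldots, X_d) \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ via a feature map $\Phi : \mathcal{X} \to \mathcal{H}$ into a Hilbert-space-valued random variable $\Phi(X)$.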

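Formula formula_11: $\gamma_\pi := \gamma|_{\mathcal{X}_{\pi_1}} \otimes \cdots \otimes \gamma|_{\mathcal{X}_{\pi_b}},$

Formula formula_12: $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$. Define $\gamma^{i} := \operatorname{Law}(\underbrace{X_1, \ldots, X_1}_{i_1 \text{ times}}, \underbrace{X_2, \ldots, X_2}_{i_2 \text{ times}}, \ldots, \underbrace{X_d, \ldots, X_d}_{i_d \text{ times}}),$

Formula formula_13: $\mathcal{X}_{\pi_1} \times \cdots \times \mathcal{X}_{\pi_b}$ and $\gamma^{i}$ is a probability measure on $\mathcal{X}_1^{i_1} \times \cdots \times \mathcal{X}_d^{i_d}$;

Formula formula_14: $\kappa_{k_1, \ldots, k_d}(\gamma) := \left(\kappa^{i}_{k_1, \ldots, k_d}(\gamma)\right)_{i \in \mathbb{N}^d} \in T$ as follows: $\kappa^{i}_{k_1, \ldots, k_d}(\gamma) := \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi} k^{\otimes i}((X_1, \ldots, X_m), \cdot)$, where $m = \deg(i)$, $c_\pi := (-1)^{|\pi| - 1} (|\pi| - 1)!$, $\gamma^{i}_\pi = (\gamma^{i})_\pi$ and $k^{\otimes i}((x_1, \ldots, x_m), (y_1, \ldots, y_m)) := k_1(x_1, y_1) \cdots k_1(x_{i_1}, y_{i_1}) \cdots k_d(x_{m - i_d + 1}, y_{m - i_d + 1}) \cdots k_d(x_m, y_m)$ (4) is the reproducing kernel of $\mathcal{H}^{\otimes i}$, where $\mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$.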
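The partition sum with weights $c_\pi = (-1)^{|\pi|-1}(|\pi|-1)!$ can be made concrete in the classical, real-valued setting. The following is a minimal sketch, assuming scalar samples and plug-in (empirical) expectations; the helper names `set_partitions`, `c_pi` and `joint_cumulant` are illustrative and not taken from the paper.

```python
from math import factorial

def set_partitions(elements):
    """Yield every partition of a list as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # insert `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or open a new singleton block
        yield [[first]] + partition

def c_pi(partition):
    """Weight (-1)^(|pi|-1) (|pi|-1)! of a partition with |pi| blocks."""
    b = len(partition)
    return (-1) ** (b - 1) * factorial(b - 1)

def joint_cumulant(columns):
    """Plug-in joint cumulant of m real-valued sample columns."""
    m, n = len(columns), len(columns[0])
    total = 0.0
    for pi in set_partitions(list(range(m))):
        term = float(c_pi(pi))
        for block in pi:
            # empirical expectation of the product over the block
            acc = 0.0
            for r in range(n):
                prod = 1.0
                for j in block:
                    prod *= columns[j][r]
                acc += prod
            term *= acc / n
        total += term
    return total

x = [1.0, 2.0, 4.0, 7.0]
second_cumulant = joint_cumulant([x, x])          # E[X^2] - E[X]^2
weights_m3 = [c_pi(p) for p in set_partitions([0, 1, 2])]
```

For $m = 3$ there are five partitions and the weights sum to zero, reflecting that cumulants of degree at least two vanish on constants.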

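Formula formula_15: $K_1 = k_1(X_1, \cdot)$, $K_2 = k_2(X_2, \cdot)$ where $(X_1, X_2) \sim \gamma$.

Formula formula_16: $\kappa^{(2,0)}_{k_1,k_2}(\gamma) = \mathbb{E}\left[K_1^{\otimes 2}\right] - \mathbb{E}[K_1]^{\otimes 2}$, $\kappa^{(1,1)}_{k_1,k_2}(\gamma) = \mathbb{E}[K_1 \otimes K_2] - \mathbb{E}[K_1] \otimes \mathbb{E}[K_2]$, $\kappa^{(0,2)}_{k_1,k_2}(\gamma) = \mathbb{E}\left[K_2^{\otimes 2}\right] - \mathbb{E}[K_2]^{\otimes 2}$.

Formula formula_17: $X_d) \sim \gamma_1$, $(Y_1, \ldots, Y_d) \sim \gamma_2$ on $\mathcal{X}_1 \times \cdots$

Formula formula_18: $\langle \mu^{i}_{k_1, \ldots, k_d}(\gamma_1), \mu^{i}_{k_1, \ldots, k_d}(\gamma_2) \rangle_{\mathcal{H}^{\otimes i}} = \mathbb{E}_{\gamma_1 \otimes \gamma_2}\left[k_1(X_1, Y_1)^{i_1} \cdots k_d(X_d, Y_d)^{i_d}\right],$ (5)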
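Identity (5) can be checked numerically in the simplest case. The sketch below assumes $d = 1$, $i = (2)$ and the linear kernel $k(x, y) = \langle x, y \rangle$ on $\mathbb{R}^3$, so that the second moment is an explicit matrix; the sample sizes and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(20, 3))   # samples playing the role of gamma_1
y = rng.normal(size=(30, 3))   # samples playing the role of gamma_2

# Explicit route: empirical second moment tensors E[x (x) x] and E[y (x) y]
moment_x = np.einsum("ni,nj->ij", x, x) / len(x)
moment_y = np.einsum("ni,nj->ij", y, y) / len(y)
explicit = np.sum(moment_x * moment_y)   # Frobenius inner product

# Kernel route: average of k(x_n, y_m)^2 over all pairs
kernel_trick = np.mean((x @ y.T) ** 2)
```

Both routes give the same number, but the kernel route never forms the tensors, which is what makes the construction tractable for infinite-dimensional feature spaces.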

Formula formula_19: $\langle \mu^{(1)}_k(\gamma_1), \mu^{(1)}_k(\gamma_2) \rangle_{\mathcal{H}_k} = \mathbb{E}_{\gamma_1 \otimes \gamma_2} k(X, Y).$

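Formula formula_20: $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$ such that $\deg(i) = m$. Then $\langle \kappa^{i}_{k_1, \ldots, k_d}(\gamma), \kappa^{i}_{k_1, \ldots, k_d}(\eta) \rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \mathbb{E}_{\gamma^{i}_\pi \otimes \eta^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)).$

Formula formula_21: $d^{i}(\gamma, \eta) := \|\kappa^{i}_{k_1, \ldots, k_d}(\gamma) - \kappa^{i}_{k_1, \ldots, k_d}(\eta)\|^2_{\mathcal{H}^{\otimes i}}$ (6) $= \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \left[\mathbb{E}_{\gamma^{i}_\pi \otimes \gamma^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)) + \mathbb{E}_{\eta^{i}_\pi \otimes \eta^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)) - 2 \mathbb{E}_{\gamma^{i}_\pi \otimes \eta^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m))\right].$

Formula formula_22: Example 3.3 ($m = 1$). Applied with $m = 1$ and $d = 1$, (6) becomes $\mathrm{MMD}^2_k(\gamma, \eta) = \|\kappa^{(1)}_k(\gamma) - \kappa^{(1)}_k(\eta)\|^2_{\mathcal{H}_k} = \mathbb{E} k(X, X') + \mathbb{E} k(Y, Y') - 2 \mathbb{E} k(X, Y)$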
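The degree-one case has a direct plug-in estimator: each expectation is replaced by the mean of the corresponding Gram matrix. The sketch below assumes a Gaussian (RBF) kernel with a hypothetical `bandwidth` parameter; the kernel choice and function names are illustrative, not prescribed by the text.

```python
import numpy as np

def rbf_gram(a, b, bandwidth=1.0):
    """Gram matrix [exp(-|a_i - b_j|^2 / (2 bandwidth^2))]."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_v(x, y, bandwidth=1.0):
    """V-statistic of E k(X,X') + E k(Y,Y') - 2 E k(X,Y)."""
    return (rbf_gram(x, x, bandwidth).mean()
            + rbf_gram(y, y, bandwidth).mean()
            - 2.0 * rbf_gram(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = x + 3.0                      # same shape, shifted mean
same = mmd2_v(x, x)
shifted = mmd2_v(x, y)
```

The statistic vanishes when both samples coincide and is clearly positive under the mean shift.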

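Formula formula_23: $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}} = \mathbb{E} k(X, X') k(X'', X''') + \mathbb{E} k(Y, Y') k(Y'', Y''') + \mathbb{E} k(X, X')^2 + \mathbb{E} k(Y, Y')^2 + 2 \mathbb{E} k(X, Y) k(X', Y) + 2 \mathbb{E} k(X, Y) k(X, Y') - 2 \mathbb{E} k(X, Y) k(X', Y') - 2 \mathbb{E} k(X, Y)^2 - 2 \mathbb{E} k(X, X') k(X, X'') - 2 \mathbb{E} k(Y, Y') k(Y, Y''),$

Formula formula_24: $\gamma = \gamma|_{\mathcal{X}_1} \otimes \cdots \otimes \gamma|_{\mathcal{X}_d}$ if and only if $\kappa^{i}_{k_1, \ldots, k_d}(\gamma) = 0$ for every $i \in \mathbb{N}^d_+$.

Formula formula_25: $\|\kappa^{i}_{k_1, \ldots, k_d}(\gamma)\|^2_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \mathbb{E}_{\gamma^{i}_\pi \otimes \gamma^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)),$ (7)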

Formula formula_26: $\|\kappa^{(1,1)}_{k_1,k_2}(\gamma)\|^2_{\mathcal{H}^{(1,1)}} = \mathbb{E} k_1(X, Y) k_2(X, Y) + \mathbb{E} k_1(X, Y) k_2(X', Y') - 2 \mathbb{E} k_1(X, Y) k_2(X', Y),$

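Formula formula_27: $d^{(2)}(\gamma, \eta) = \|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}$ is $\frac{1}{N^2} \operatorname{Tr}\left[(K_x J_N)^2\right] + \frac{1}{M^2} \operatorname{Tr}\left[(K_y J_M)^2\right] - \frac{2}{NM} \operatorname{Tr}\left[K_{xy} J_M K_{xy}^\top J_N\right],$

Formula formula_28: $(x_n)_{n=1}^{N} \overset{\text{i.i.d.}}{\sim} \gamma$, $(y_m)_{m=1}^{M} \overset{\text{i.i.d.}}{\sim} \eta$, $K_x = [k(x_i, x_j)]_{i,j=1}^{N} \in \mathbb{R}^{N \times N}$, $K_y = [k(y_i, y_j)]_{i,j=1}^{M} \in \mathbb{R}^{M \times M}$, $K_{xy} = [k(x_i, y_j)]_{i,j=1}^{N,M} \in \mathbb{R}^{N \times M}$, $J_n = I_n - \frac{1}{n} 1_n 1_n^\top \in \mathbb{R}^{n \times n}$, with $1_n = (1, \ldots, 1) \in \mathbb{R}^n$.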
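The trace expression for the $d^{(2)}$ V-statistic translates almost literally into matrix code. The following is a sketch under the notation above ($K_x$, $K_y$, $K_{xy}$, $J_n$); the function names and the linear-kernel sanity check are illustrative assumptions. With the linear kernel, the feature map is the identity, so $d^{(2)}$ should reduce to the squared Frobenius distance between the two biased sample covariance matrices.

```python
import numpy as np

def centering(n):
    """J_n = I_n - (1/n) 1_n 1_n^T."""
    return np.eye(n) - np.ones((n, n)) / n

def d2_v_statistic(Kx, Ky, Kxy):
    """Trace form of the V-statistic for d^(2); Kxy has shape (N, M)."""
    N, M = Kxy.shape
    JN, JM = centering(N), centering(M)
    KxJ = Kx @ JN
    KyJ = Ky @ JM
    return (np.trace(KxJ @ KxJ) / N**2
            + np.trace(KyJ @ KyJ) / M**2
            - 2.0 * np.trace(Kxy @ JM @ Kxy.T @ JN) / (N * M))

# Sanity check with the linear kernel k(x, y) = <x, y> (illustrative choice)
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))
y = 2.0 * rng.normal(size=(40, 3))
d2 = d2_v_statistic(x @ x.T, y @ y.T, x @ y.T)

cov_x = np.cov(x, rowvar=False, bias=True)   # (1/N) normalisation
cov_y = np.cov(y, rowvar=False, bias=True)
frobenius = np.sum((cov_x - cov_y) ** 2)
```

The cost is dominated by a few products of Gram-sized matrices, which is the same order as forming the Gram matrices needed for MMD.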

Formula formula_29: $\mathcal{X}_1^2 \to \mathbb{R}$, $\ell : \mathcal{X}_2^2 \to \mathbb{R}$. Assume that we have samples $(x_i, y_i)_{i=1}^{N}$ and use the shorthand notation $K = K_x$, $L = L_y$ (similarly to Lemma 2) and $H = H_N = \frac{1}{N} 1_N 1_N^\top \in \mathbb{R}^{N \times N}$.

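Formula formula_30: Lemma 3 (CSIC estimation). The V-statistic for $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{\mathcal{H}_k^{\otimes 1} \otimes \mathcal{H}_\ell^{\otimes 2}}$ is $\frac{1}{N^2}\Big[\langle K \circ K \circ L \rangle - 4 \langle K \circ KH \circ L \rangle - 2 \langle K \circ K \circ LH \rangle + 4 \langle KH \circ K \circ LH \rangle + 2 \langle K \circ L \rangle \frac{\langle K \rangle}{N^2} + 2 \langle KH \circ HK \circ L \rangle + 4 \langle K \circ HK \circ LH \rangle + \langle K \circ K \rangle \frac{\langle L \rangle}{N^2} - 8 \langle K \circ LH \rangle \frac{\langle K \rangle}{N^2} - 4 \langle K \circ HK \rangle \frac{\langle L \rangle}{N^2} + 4 \left(\frac{\langle K \rangle}{N^2}\right)^2 \langle L \rangle\Big].$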

Formula formula_31: $B_{m+1} = |P(m+1)| = \sum_{k=0}^{m} \binom{m}{k} B_k$, with the first elements of the sequence being $B_0 = B_1 = 1$, $B_2 = 2$, $B_3 = 5$, $B_4 = 15$, $B_5 = 52$. By (
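The Bell recursion is easy to evaluate, which gives a quick feel for how fast the number of partition terms grows with the degree $m$. A minimal sketch (the function name `bell_numbers` is an illustrative choice):

```python
from math import comb

def bell_numbers(count):
    """Return [B_0, ..., B_{count-1}] via B_{m+1} = sum_k C(m, k) B_k."""
    bell = [1]                                   # B_0 = 1
    for m in range(count - 1):
        bell.append(sum(comb(m, k) * bell[k] for k in range(m + 1)))
    return bell

first_six = bell_numbers(6)   # [1, 1, 2, 5, 15, 52]
```

Since the double sum over partitions has on the order of $B_m^2$ terms, degrees two and three ($B_2^2 = 4$, $B_3^2 = 25$) remain small, while higher degrees grow quickly.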

Formula formula_32: $d^{i}(\gamma, \eta)$ or $\|\kappa^{i}_{k_1, \ldots, k_d}(\gamma)\|^2_{\mathcal{H}^{\otimes i}}$ ($m = \deg(i)$) is proportional to $B_m^2$ (it equals $3 B_m^2$ and $B^2$

Formula formula_33: $H_0 : \gamma = \gamma_1 \otimes \gamma_2$ against the alternative $H_1 : \gamma \neq \gamma_1 \otimes \gamma_2$.

Formula formula_34: $f(t) = \mathbb{E}[e^{tX}] = \sum_m \mu_m t^m / m!$ describes the law of $X$ with sequence $(\mu_m)$ of moments $\mu_m = \mathbb{E}[X^m] \in \mathbb{R}$.

Formula formula_35: $g(t) = \log f(t) = \sum_m \kappa_m t^m / m!$
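Taking the logarithm of the moment generating function coefficient by coefficient yields a standard recursion, $\kappa_n = \mu_n - \sum_{k=1}^{n-1} \binom{n-1}{k-1} \kappa_k \mu_{n-k}$. The sketch below implements that recursion for a real-valued variable; the function name and the two test distributions are illustrative choices.

```python
from math import comb

def cumulants_from_moments(moments):
    """moments[n-1] = E[X^n] for n = 1..len(moments); returns [kappa_1, ..., kappa_n]."""
    kappa = []
    for n in range(1, len(moments) + 1):
        correction = sum(
            comb(n - 1, k - 1) * kappa[k - 1] * moments[n - k - 1]
            for k in range(1, n)
        )
        kappa.append(moments[n - 1] - correction)
    return kappa

# Standard normal: moments 0, 1, 0, 3 -> only the second cumulant is non-zero
normal_cumulants = cumulants_from_moments([0, 1, 0, 3])

# Poisson with rate 1: moments 1, 2, 5, 15 -> every cumulant equals the rate
poisson_cumulants = cumulants_from_moments([1, 2, 5, 15])
```

The normal example shows how cumulants strip out what lower orders already explain: the fourth moment 3 is entirely accounted for by the variance.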

Formula formula_36: $X \sim \gamma$: $\mu_2(\gamma) = (\mu_1(\gamma))^2 + \operatorname{Var}(X),$

Formula formula_37: $= \mathbb{E}\left[(X - \mu_1(\gamma))^2\right].$

Formula formula_38: Var µ 2 = Var ( κ) + 2 N (EX) 4 -(EX) 2 Var(X) -2 Var(X) 2 N -1 .

Formula formula_39: $\mathbb{E}[X^m Y^n] = \mathbb{E}[X^m] \mathbb{E}[Y^n],$

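Formula formula_40: $\mathcal{H}_d$ and $(h_1, \ldots, h_d) \in \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$, the multi-linear operator $h_1 \otimes \cdots \otimes h_d \in \mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_d$ is defined as $(h_1 \otimes \cdots \otimes h_d)(f_1, \ldots, f_d) = \prod_{j=1}^{d} \langle h_j, f_j \rangle_{\mathcal{H}_j}$ for all $(f_1, \ldots, f_d) \in \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$. By extending the inner product $\langle a_1 \otimes \cdots \otimes a_d, b_1 \otimes \cdots \otimes b_d \rangle_{\mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_d} := \prod_{j=1}^{d} \langle a_j, b_j \rangle_{\mathcal{H}_j}$ to finite linear combinations of $a_1 \otimes \cdots \otimes a_d$-s $\left\{ \sum_{i=1}^{n} c_i \otimes_{j=1}^{d} a_{i,j} : c_i \in \mathbb{R},\ a_{i,j} \in \mathcal{H}_j,\ n \ge 1 \right\}$

Formula formula_41: $\mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_d$. Specifically, if $(\mathcal{H}_1, k_1), \ldots, (\mathcal{H}_d, k_d)$ are RKHSs, then so is $\mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_d = \mathcal{H}_{\otimes_{j=1}^{d} k_j}$

Formula formula_42: $\otimes_{j=1}^{d} k_j((x_1, \ldots, x_d), (x'_1, \ldots, x'_d)) := \prod_{j=1}^{d} k_j(x_j, x'_j)$, where $(x_1, \ldots, x_d), (x'_1, \ldots, x'_d) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$.

Formula formula_43: $B_1 \otimes \cdots \otimes B_d$ is a little more involved

Formula formula_44: $\sum_{m \ge 0} \|h_m\|^2_{\mathcal{H}_j^{\otimes m}} < \infty$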

Formula formula_45: $\|(h_0, h_1, h_2, \ldots)\|^2_{\prod_{m \ge 0} \mathcal{H}_j^{\otimes m}} = \sum_{m \ge 0} \|h_m\|^2_{\mathcal{H}_j^{\otimes m}}.$

Formula formula_46: $a \cdot b = \left( \sum_{i=0}^{m} a_i \otimes b_{m-i} \right)_{m \ge 0}$

Formula formula_47: $T := T_1 \otimes \cdots \otimes T_d,$

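Formula formula_48: $T_j = \prod_{m \ge 0} \mathcal{H}_j^{\otimes m}$ ($j = 1, \ldots, d$). Let $\mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$,

Formula formula_49: $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$ we define $\mathcal{H}^{\otimes i} := \mathcal{H}_1^{\otimes i_1} \otimes \cdots \otimes \mathcal{H}_d^{\otimes i_d}$.

Formula formula_50: $T = \prod_{i \in \mathbb{N}^d} \mathcal{H}^{\otimes i}.$ (8)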

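Formula formula_52: $\star : \mathcal{H}^{\otimes i^1} \times \mathcal{H}^{\otimes i^2} \to \mathcal{H}^{\otimes (i^1 + i^2)},$ (9) $(x_1 \otimes \cdots \otimes x_d) \star (y_1 \otimes \cdots \otimes y_d) = (x_1 \cdot y_1) \otimes \cdots \otimes (x_d \cdot y_d)$, so that for $a = (a^{i})_{i \in \mathbb{N}^d}$, $b = (b^{i})_{i \in \mathbb{N}^d} \in T$

Formula formula_53: $(a \star b)^{i} = \sum_{i^1 + i^2 = i} a^{i^1} \star b^{i^2}$ (10)

Formula formula_54: $i^1 + i^2 = (i^1_1 + i^2_1, \ldots, i^1_d + i^2_d).$

Formula formula_55: ) $= i_1 + \cdots + i_d$,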

Formula formula_56: T = m≥0 {i∈N d :deg(i)=m}

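Formula formula_57: $\mu^{i}(\gamma) = \mathbb{E}\left[X_1^{i_1} \cdots X_d^{i_d}\right] \in \mathbb{R}$ for $i = (i_1, \ldots, i_d) \in \mathbb{N}^d$.

Formula formula_58: $\sum_{i \in \mathbb{N}^d} \kappa^{i}(\gamma) \frac{\theta^{i}}{i!} = \log \sum_{i \in \mathbb{N}^d} \mu^{i}(\gamma) \frac{\theta^{i}}{i!}$, 2. $\kappa^{i}(\gamma) = \sum_{\pi \in P(d)} c_\pi \mu^{i}(\gamma^{i}_\pi)$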

Formula formula_59: where $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$, $c_\pi = (-1)^{|\pi| - 1} (|\pi| - 1)!$.

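Formula formula_60: $\mathcal{H}_1 \times \cdots \times \mathcal{H}_d$ defined as $\kappa^{i}(\gamma) = \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi}(X^{\otimes i}),$

Formula formula_61: $\log : T \to T, \quad x \mapsto \sum_{n \ge 1} \frac{(-1)^{n-1}}{n} (x - 1)^{\star n},$

Formula formula_62: $t^{\star n} = \underbrace{t \star \cdots \star t}_{n\text{-times}}$, or coordinate-wise $(t^{\star n})^{i} = \sum_{i^1 + \cdots + i^n = i} t^{i^1} \star \cdots \star t^{i^n}$ for $i \in \mathbb{N}^d$.

Formula formula_63: $\kappa^{i}(\gamma) = \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi}(X^{\otimes i}) = \left(\log \mu(\gamma^{i})\right)^{1_m},$ (11)

Formula formula_64: $1_m = (1, \ldots, 1) \in \mathbb{N}^m.$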

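Formula formula_65: $\sum_{i^1 + \cdots + i^j = 1_m} \mu^{i^1}(\gamma^{i}) \star \cdots \star \mu^{i^j}(\gamma^{i}),$

Formula formula_66: $i^1 + \cdots + i^j = 1_m$ we may define $h : [m] \to [j]$ by the relation $(i^{h(n)})_n = 1$,

Formula formula_67: $(i^c)_n = \begin{cases} 1 & \text{if } n \in h^{-1}(c), \\ 0 & \text{otherwise.} \end{cases}$

Formula formula_68: $\left(\log \mu(\gamma^{i})\right)^{1_m} = \sum_{j=1}^{m} \frac{(-1)^{j-1}}{j} \sum_{h : [m] \to [j]} \mu^{i^{h^{-1}(1)}}(\gamma^{i}) \star \cdots \star \mu^{i^{h^{-1}(j)}}(\gamma^{i}).$

Formula formula_69: $\mu^{i^{h^{-1}(1)}}(\gamma^{i}) \star \cdots \star \mu^{i^{h^{-1}(j)}}(\gamma^{i}$

Formula formula_70: $\left(\log \mu(\gamma^{i})\right)^{1_m} = \sum_{j=1}^{m} \frac{(-1)^{j-1}}{j} \sum_{h : [m] \to [j]} \mu^{i}(\gamma^{i}_{\pi_h}) = \sum_{\pi \in P(m)} \frac{(-1)^{|\pi| - 1}}{|\pi|} |\pi|! \, \mu^{i}(\gamma^{i}_\pi) = \sum_{\pi \in P(m)} c_\pi \mu^{i}(\gamma^{i}_\pi) = \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi}(X^{\otimes i}).$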

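Formula formula_71: $\langle \kappa^{i}(\gamma), \kappa^{i}(\eta) \rangle_{\mathcal{H}^{\otimes i}} = \left\langle \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi}(X^{\otimes i}), \sum_{\tau \in P(m)} c_\tau \mathbb{E}_{\eta^{i}_\tau}(Y^{\otimes i}) \right\rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \mathbb{E}_{(X, Y) \sim \gamma^{i}_\pi \otimes \eta^{i}_\tau} \langle X^{\otimes i}, Y^{\otimes i} \rangle_{\mathcal{H}^{\otimes i}}.$

Formula formula_72: $M(x_1, \ldots, x_d) = \prod_{j=1}^{i_1} \langle f^1_j, x_1 \rangle \cdots \prod_{j=1}^{i_d} \langle f^d_j, x_d \rangle$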

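Formula formula_73: $\mathcal{P}\left(\prod_{i=1}^{d} \mathcal{X}_i\right)$ of $\prod_{i=1}^{d} \mathcal{X}_i$: $\mathcal{P}\left(\prod_{i=1}^{d} \mathcal{X}_i\right) \to \mathbb{R}, \quad \gamma \mapsto \int_{\prod_{i=1}^{d} \mathcal{X}_i} p\left(\varphi_1(x_1), \ldots, \varphi_d(x_d)\right) \, d\gamma(x_1, \ldots, x_d),$

Formula formula_74: $\mathcal{P}\left(\prod_{i=1}^{d} \mathcal{X}_i\right).$

Formula formula_75: $\prod_{i=1}^{d} \varphi_i : \mathcal{P}\left(\prod_{i=1}^{d} \mathcal{X}_i\right) \to \mathcal{P}\left(\prod_{i=1}^{d} B_i\right)$

Formula formula_76: $\prod_{i=1}^{d} \varphi_i : \prod_{i=1}^{d} \mathcal{X}_i \to \prod_{i=1}^{d} B_i, \quad \left(\prod_{i=1}^{d} \varphi_i\right)(x_1, \ldots, x_d) \mapsto \prod_{i=1}^{d} \varphi_i(x_i)$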

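Formula formula_77: $B^{\otimes i} := B_1^{\otimes i_1} \otimes \cdots \otimes B_d^{\otimes i_d}$ and given an element $x = (x_1, \ldots, x_d) \in \prod_{i=1}^{d} B_i$ we write $x^{i} := x_1^{\otimes i_1} \otimes \cdots \otimes x_d^{\otimes i_d}$ so that $x^{i} \in B^{\otimes i}$. If we have functions $(\varphi_i)^{d}$

Formula formula_78: $\varphi^{\otimes i} := \varphi_1^{\otimes i_1} \otimes \cdots \otimes \varphi_d^{\otimes i_d}, \quad \varphi^{\otimes i} : \prod_{i=1}^{d} \mathcal{X}_i \to B^{\otimes i}.$

Formula formula_79: $F^{\otimes i} := f_1 \otimes \cdots \otimes f_d, \quad F^{\otimes i} \in (B^{\otimes i})^{\star}.$

Formula formula_80: $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$. Then 1. $\gamma = \eta$ if and only if $\kappa(\gamma) = \kappa(\eta)$.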

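Formula formula_81: $\mathbb{E}_{\gamma}\, p\left(\varphi_1(X_1), \ldots, \varphi_d(X_d)\right) = \mathbb{E}_{\bigotimes_{i=1}^{d} \gamma|_{\mathcal{X}_i}}\, p\left(\varphi_1(X_1), \ldots, \varphi_d(X_d)\right)$

Formula formula_82: $\langle F^{i}, \kappa^{i}(\gamma) \rangle = \sum_{\pi \in P(d)} c_\pi \mathbb{E}_{\gamma^{i}_\pi} f_1(\varphi_1(X_1)) \cdots f_d(\varphi_d(X_d))$

Formula formula_83: $(f_1 \circ \varphi_1)(X_1), \ldots, (f_d \circ \varphi_d)(X_d),$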

Formula formula_84: 1 • φ 1 )(X 1 ), . . . , (f d • φ d )(X d

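Formula formula_85: $\mathbb{E}_{\gamma}\, p\left((f_1 \circ \varphi_1)(X_1), \ldots, (f_d \circ \varphi_d)(X_d)\right) = \mathbb{E}_{\bigotimes_{i=1}^{d} \gamma|_{\mathcal{X}_i}}\, p\left((f_1 \circ \varphi_1)(X_1), \ldots, (f_d \circ \varphi_d)(X_d)\right)$

Formula formula_86: $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$. Then 1. $\gamma = \eta$ if and only if $\kappa_{k_1, \ldots, k_d}(\gamma) = \kappa_{k_1, \ldots, k_d}(\eta)$. 2. $\gamma = \bigotimes_{i=1}^{d} \gamma|_{\mathcal{X}_i}$ if and only if $\kappa^{i}_{k_1, \ldots, k_d}(\gamma) = 0$ for all $i \in \mathbb{N}^d_+$.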

Formula formula_87: i (x)∥ 2 H k i = k i (x, x) ≤ sup x∈Xi |k i (x, x)| < ∞,(

Formula formula_88: Var(X) = 2 3 b 2 + ba + a 2 so if Y is distributed according to U [-c, c] then we only need to solve b 2 + ba + a 2 = c 2

Formula formula_89: $\operatorname{Tr}(A^\top B) = \langle A \circ B \rangle,$

Formula formula_90: $H_n = \frac{1}{n} 1_n 1_n^\top$ we have $(A H_n)_{i,j} = \frac{1}{n} \sum_{c=1}^{n} A_{i,c}$
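These two matrix identities are what turn sums over sample indices into traces and Hadamard products. A small numerical check, assuming that $\langle \cdot \rangle$ denotes the sum of all matrix entries and that the product inside it is entrywise; the matrix size and random inputs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
H = np.ones((n, n)) / n                      # H_n = (1/n) 1 1^T

# Tr(A^T B) equals the sum of the entries of the Hadamard product A * B
trace_form = np.trace(A.T @ B)
hadamard_sum = np.sum(A * B)

# Every column of A H_n holds the row means of A
AH = A @ H
row_means = A.mean(axis=1, keepdims=True)
```

In practice the Hadamard form avoids an explicit matrix product, so a term like $\operatorname{Tr}(A^\top B)$ costs only one pass over the entries.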

Formula formula_91: f ⊗ : d i=1 H i → T, f ⊗ (x 1 , . . . , x d ) = i∈N d f i x ⊗i .

Formula formula_92: d ) ⟩ T = i∈N d f i g i k 1 (x 1 , y 1 ) i1 . . . k d (x d , y d ) i d .

Formula formula_93: ⟨f ⊗ φ ⊗i (x i ) , g ⊗ φ ⊗i (y i ) ⟩ T = ⟨ i∈N d f i φ ⊗i (x i ), i∈N d g i φ ⊗i (y i )⟩ T = i∈N d f i g i ⟨φ ⊗i (x i ), φ ⊗i (y i )⟩ H ⊗i = i∈N d f i g i k 1 (x 1 , y 1 ) i1 . . . k d (x d , y d ) i d , where H = H 1 × • • • × H d .

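Formula formula_94: $\langle \kappa^{i}(\gamma), \kappa^{i}(\eta) \rangle_{\mathcal{H}^{\otimes i}} = \left\langle \sum_{\pi \in P(m)} c_\pi \mathbb{E}_{\gamma^{i}_\pi} \varphi^{\otimes i}(X^{i}), \sum_{\tau \in P(m)} c_\tau \mathbb{E}_{\eta^{i}_\tau} \varphi^{\otimes i}(Y^{i}) \right\rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \left\langle \mathbb{E}_{\gamma^{i}_\pi} \varphi^{\otimes i}(X^{i}), \mathbb{E}_{\eta^{i}_\tau} \varphi^{\otimes i}(Y^{i}) \right\rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \mathbb{E}_{\gamma^{i}_\pi \otimes \eta^{i}_\tau} \langle \varphi^{\otimes i}(X^{i}), \varphi^{\otimes i}(Y^{i}) \rangle_{\mathcal{H}^{\otimes i}} = \sum_{\pi, \tau \in P(m)} c_\pi c_\tau \mathbb{E}_{\gamma^{i}_\pi \otimes \eta^{i}_\tau} k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)).$ Since $\|\kappa^{i}(\gamma)\|^2_{\mathcal{H}^{\otimes i}} = \langle \kappa^{i}(\gamma), \kappa^{i}(\gamma) \rangle_{\mathcal{H}^{\otimes i}}$ $\|\kappa^{i}(\gamma) - \kappa^{i}(\eta)\|^2$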

Formula formula_95: (1,2) k,ℓ (γ)∥ 2 H ⊗1 k ⊗H ⊗2 ℓ , ∥κ(2)

Formula formula_96: $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}}$ is $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{\mathcal{H}^{(1,1)}} = \mathbb{E} k(X, X') k(X'', X''') + \mathbb{E} k(Y, Y') k(Y'', Y''')$ (12) $+ \mathbb{E} k(X, X')^2 + \mathbb{E} k(Y, Y')^2 + 2 \mathbb{E} k(X, Y) k(X', Y) + 2 \mathbb{E} k(X, Y) k(X, Y') - 2 \mathbb{E} k(X, Y) k(X', Y') - 2 \mathbb{E} k(X, Y)^2 - 2 \mathbb{E} k(X, X') k(X, X'') - 2 \mathbb{E} k(Y, Y') k(Y, Y'').$

Formula formula_97: 1 N 4 N i,j,κ,l=1 k(x i , x j )k(x κ , x l ) + 1 M 4 M i,j,κ,l=1

Formula formula_98: + 1 N 2 N i,j=1 k(x i , x j ) 2 + 1 M 2 M i,j=1

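Formula formula_99: $H_N = \frac{1}{N} 1_N 1_N^\top \in \mathbb{R}^{N \times N}, \quad H_M = \frac{1}{M} 1_M 1_M^\top \in \mathbb{R}^{M \times M}$

Formula formula_100: $\frac{1}{N^2} \operatorname{Tr}(H_N K_x H_N K_x) + \frac{1}{M^2} \operatorname{Tr}(H_M K_y H_M K_y) + \frac{1}{N^2} \operatorname{Tr}(K_x^2) + \frac{1}{M^2} \operatorname{Tr}(K_y^2) + \frac{2}{NM} \operatorname{Tr}(K_{xy}^\top H_N K_{xy}) + \frac{2}{NM} \operatorname{Tr}(K_{xy} H_M K_{xy}^\top) - \frac{2}{NM} \operatorname{Tr}(H_M K_{xy}^\top H_N K_{xy}) - \frac{2}{NM} \operatorname{Tr}(K_{xy}^\top K_{xy}) - \frac{2}{N^2} \operatorname{Tr}(K_x H_N K_x) -$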

Formula formula_101: k,ℓ (γ)∥ 2 H ⊗1 k ⊗H ⊗2 ℓ
