Abstract: Finding associations between pairs of variables in large datasets is crucial in many disciplines. The brute-force method for
solving this problem requires computing the
mutual information of all O(N^2) pairs of variables. In
this paper, we consider the problem of finding pairs of variables with high mutual information with sub-quadratic complexity. This
problem is analogous to the nearest neighbor search, where the goal is to find pairs
among N variables that are similar to each
other. To solve this problem, we develop a
new algorithm for finding associations. It constructs
a decision tree that assigns a
hash to each variable such that pairs
with higher mutual information are more likely
to receive the same hash. For any
1 \le \lambda \le 2, we prove that in the case of binary data, we can reduce the number of mutual information computations needed for
finding all pairs satisfying I(X, Y) > 2 - \lambda
from O(N^2) to O(N^\lambda), where I(X, Y) is the
empirical mutual information between variables X and Y. Finally, we confirm our
theory with experiments on simulated and real
data.
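
For orientation, the brute-force baseline mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm proposed in the paper: the function names `empirical_mutual_information` and `brute_force_pairs` are hypothetical, and measuring mutual information in bits (log base 2) is an assumption made here for binary data.

```python
import itertools
import math
from collections import Counter


def empirical_mutual_information(x, y):
    """Empirical mutual information between two equal-length sequences.

    Assumption: measured in bits (log base 2).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of each (a, b) outcome
    count_x = Counter(x)        # marginal counts for x
    count_y = Counter(y)        # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def brute_force_pairs(variables, threshold):
    """Return index pairs (i, j), i < j, whose empirical MI exceeds threshold.

    This evaluates every pair, so it needs on the order of N^2
    mutual information computations for N variables.
    """
    return [
        (i, j)
        for (i, x), (j, y) in itertools.combinations(enumerate(variables), 2)
        if empirical_mutual_information(x, y) > threshold
    ]
```

Calling `brute_force_pairs(variables, threshold)` on a list of N binary sequences makes the quadratic cost explicit: every one of the N(N-1)/2 pairs is scored, which is the cost the hashing approach aims to avoid.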