Abstract: Mutual information (MI) is hard to estimate for high-dimensional data, and various estimators have been proposed over the years to tackle this problem. Here, we note another challenging problem: many estimators of MI, which we denote as $I(X;T)$, are sensitive to scale, i.e., $I(X;\alpha T)\neq I(X;T)$ for $\alpha \in \mathbb{R}^{+}$. Although some normalization methods have been hinted at in previous work, the problem has not been studied in depth. In this work, we study new normalization strategies that make MI estimators scale-invariant, focusing on the Kraskov–Stögbauer–Grassberger (KSG) and the neural-network-based MINE estimators. We provide theoretical and empirical results showing that the original unnormalized estimators are not scale-invariant, and we highlight the consequences of this scale dependence. We propose new global normalization strategies that are scale-invariant and tailored to the corresponding estimator. We compare our global normalization strategies to existing local normalization strategies and provide intuitive and empirical arguments to support the use of global normalization. We conduct extensive experiments across multiple distributions and settings and find that our proposed variants KSG-Global-$L_{\infty}$ and MINE-Global-Corrected are the most accurate within their respective approaches. Finally, we perform an information plane analysis of neural networks and observe clearer trends of fitting and compression with the normalized estimators than with the original unnormalized ones. Our work highlights the importance of scale awareness and global normalization in the MI estimation problem.
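To make the scale-sensitivity concrete, below is a minimal sketch of the standard KSG-1 estimator (Kraskov et al., 2004) written with NumPy/SciPy. This is illustrative code, not the paper's implementation: the helper name `ksg_mi`, the sample size, and the scales $\alpha$ are our own choices. Because the max-norm neighborhoods in the joint space change when $T$ is rescaled, the raw estimate drifts with $\alpha$ even though the true MI is invariant.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG estimator #1: psi(k) + psi(N) - mean[psi(n_x + 1) + psi(n_y + 1)],
    with neighborhoods defined by the Chebyshev (max) norm in the joint space."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # eps_i = distance from point i to its k-th nearest neighbor in the joint
    # space (k + 1 neighbors queried because the point itself is returned first).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # n_x, n_y = points strictly within eps_i in each marginal space; the radius
    # is shrunk slightly to emulate the strict inequality, and 1 is subtracted
    # to exclude the point itself.
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Correlated Gaussian pair: rescaling T leaves the true MI unchanged,
# but the unnormalized KSG estimate does not stay constant.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
t = x + 0.5 * rng.standard_normal(2000)
for alpha in (1.0, 100.0, 1e4):
    print(f"alpha = {alpha:g}  I(X; alpha*T) estimate = {ksg_mi(x, alpha * t):.3f}")
```

Under this sketch, the printed estimates diverge as $\alpha$ grows: scaling one coordinate lets it dominate the max-norm distances, distorting the marginal neighbor counts. Global normalization strategies of the kind studied here aim to remove exactly this dependence.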
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=OVVKMt5w6a&noteId=NaSQ427I66
Changes Since Last Submission: The proofs and the proposition statements of Propositions 1–6 have been updated for correctness and rigor. Additional discussions of binning estimators that counter the problem of high dimensionality using structured assumptions on the ground-truth distribution have been added in Remark 1 and Appendix A.2, and these are mentioned in the limitations. The writing has been improved in several places in the Appendix, including the background section on MI estimators and the discussions of the synthetic-dataset results, the neural network SNR results, and the MI vs. feature energy results. Typos have been corrected in the appendix and the main text. Minor changes have been made to the main text to improve language and fix some empirical details.
Supplementary Material: zip
Assigned Action Editor: ~Antoine_Patrick_Isabelle_Eric_Ledent1
Submission Number: 4560