A Novel Dual of Shannon Information and Weighting Scheme

ACL ARR 2024 June Submission 3086 Authors

15 Jun 2024 (modified: 03 Jul 2024) | ACL ARR 2024 June Submission | CC BY 4.0
Abstract: Shannon information theory has achieved great success not only in communication technology, for which it was originally developed, but also in many other science and engineering fields such as machine learning and artificial intelligence. Inspired by the well-known weighting scheme TF-IDF, we find that Shannon information entropy has a natural dual. To complement classical Shannon entropy, which measures uncertainty, we propose a novel information quantity named troenpy, which measures the certainty and commonness of the underlying distribution. Entropy and troenpy thus form an information twin. To demonstrate its usefulness, we propose a conditional-troenpy-based weighting scheme for documents with class labels, namely positive class frequency (PCF). On a collection of public datasets, we show that in a kNN setting the PCF-based weighting scheme outperforms classical TF-IDF and a popular Optimal Transport based Word Mover's Distance algorithm, with classification error reductions of more than 22.9 and 26.5 respectively, while the corresponding entropy-based approach fails completely. We further develop a new odds-ratio-type feature, Expected Class Information Bias (ECIB), which can be regarded as the expected odds ratio of the information twin across different classes. In our experiments, adding the new ECIB features and simple binary term features to a simple logistic regression model further improves performance significantly. The proposed weighting scheme and ECIB features are simple, effective, and computable in linear time.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: information, entropy, weighting
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 3086