- TL;DR: How to estimate the original probability vector over millions of classes from count-min sketch measurements: a theoretical and practical setup.
- Abstract: Extreme classification methods have become of paramount importance, particularly for Information Retrieval (IR) problems, owing to the development of smart algorithms that scale to industry challenges. One of the prime classes of models that aim to solve the memory and speed challenges of extreme multi-label learning is Group Testing. Multi-label Group Testing (MLGT) methods construct label groups by grouping the original labels, either randomly or based on some similarity, and then train smaller classifiers to first predict the groups and then recover the original label vectors. Recently, a novel approach called MACH (Merged Average Classifiers via Hashing) was proposed, which projects the huge label vectors to a small and manageable count-min sketch (CMS) matrix and then learns to predict this matrix in order to recover the original prediction probabilities. As a result, the model memory scales as O(log K) for K classes. MACH is a simple algorithm that works exceptionally well in practice. Despite this simplicity, there is a big gap in the theoretical understanding of the trade-offs involved in MACH. In this paper we fill this gap. Leveraging the theory of count-min sketch, we provide a precise quantification of the memory-identifiability trade-offs. We extend the theory to the case of multi-label classification, where the dependencies make the estimators hard to calculate in closed form. To mitigate this issue, we propose a novel quadratic approximation using the Inclusion-Exclusion Principle. Our estimator has significantly lower reconstruction error than the typical CMS estimator across various values of the number of classes K, label sparsity, and compression ratio.
- Keywords: Extreme Classification, Count-Min Sketch
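
To make the setup in the abstract concrete, the following is a minimal sketch of the typical count-min sketch estimator that the abstract uses as its baseline. It is illustrative only and is not the paper's proposed estimator. Several things here are assumptions: the function names (`build_hashes`, `sketch`, `cms_min_estimate`), the use of random integer arrays as a stand-in for hash functions, and the use of exact bucket masses in place of the bucket probabilities that trained classifiers would predict in MACH.

```python
import numpy as np


def build_hashes(num_classes, num_buckets, num_reps, seed=0):
    # Stand-in for hash functions (assumption): one random bucket id
    # per class, drawn independently for each repetition.
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_buckets, size=(num_reps, num_classes))


def sketch(p, hashes, num_buckets):
    # Row r holds, for every bucket, the total probability mass of the
    # classes that repetition r maps into that bucket.
    return np.stack([
        np.bincount(h, weights=p, minlength=num_buckets) for h in hashes
    ])


def cms_min_estimate(table, hashes):
    # Typical CMS estimator: for each class, take the smallest bucket
    # mass it falls into across repetitions. Collisions only add mass,
    # so this never underestimates the true probability.
    rows = np.arange(hashes.shape[0])[:, None]
    return table[rows, hashes].min(axis=0)


# Toy example: K classes compressed into R tables of B buckets each.
K, B, R = 10_000, 64, 4
top = 1234  # hypothetical index of the dominant class
p = np.full(K, 0.1 / (K - 1))
p[top] = 0.9

hashes = build_hashes(K, B, R)
table = sketch(p, hashes, B)
p_hat = cms_min_estimate(table, hashes)

print("sketch entries:", table.size, "vs. classes:", K)
print("true p[top]:", p[top], "estimated:", p_hat[top])
```

In this toy setting the sketch stores R x B = 256 numbers instead of K = 10,000, and the estimate for every class is an upper bound on its true probability. The gap between that upper bound and the truth is the reconstruction error that the abstract refers to.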