Abstract: This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels, where data points and labels are endowed with visual and textual descriptors. Applications of MUFIN to product-to-product recommendation and bid-query prediction over several million products are presented. Contemporary multi-modal methods frequently rely on purely embedding-based approaches. On the other hand, XC methods utilize classifier architectures to offer accuracies superior to embedding-only methods, but mostly focus on text-based categorization tasks. MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several million labels. This presents the twin challenges of developing multi-modal architectures that can offer embeddings sufficiently expressive to allow accurate categorization over millions of labels, and of training and inference routines that scale logarithmically in the number of labels. MUFIN develops an architecture based on cross-modal attention and trains it in a modular fashion using pre-training and positive and negative mining. A novel product-to-product recommendation dataset, MM-AmazonTitles-300K, containing over 300K products, was curated from publicly available amazon.com listings, with each product endowed with a title and multiple images. On the MM-AmazonTitles-300K and Polyvore datasets, and on a dataset with over 4 million labels curated from click logs of the Bing search engine, MUFIN offered at least 3% higher accuracy than leading text-based, image-based, and multi-modal techniques.
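To make the cross-modal attention step concrete, below is a minimal sketch in PyTorch of one way a text embedding can query image-patch embeddings to produce a fused representation. This is an illustrative reconstruction under standard multi-head attention assumptions; the module name, dimensions, residual pooling, and hyperparameters are hypothetical and not MUFIN's published architecture.

```python
# Minimal sketch of a cross-modal attention block: the text vector acts
# as the query, and image-patch features supply keys and values.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Fuse a per-item text embedding with its image-patch embeddings."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, 1, dim)          -- one query vector per data point
        # image: (batch, n_patches, dim)  -- keys/values from an image encoder
        fused, _ = self.attn(query=text, key=image, value=image)
        # Residual connection + layer norm, then squeeze to (batch, dim).
        return self.norm(text + fused).squeeze(1)


# Usage: fuse a batch of 4 products, each with 49 image patches.
if __name__ == "__main__":
    fuser = CrossModalAttention()
    text_emb = torch.randn(4, 1, 512)
    image_emb = torch.randn(4, 49, 512)
    print(fuser(text_emb, image_emb).shape)  # torch.Size([4, 512])
```

The fused (batch, dim) output would then feed the XC classifier layer; querying from the text side is one plausible design choice, and the roles of the modalities could equally be swapped or made symmetric.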