MDF-Net: Multimodal Deep Fusion for Large-Scale Product Recognition

Published: 01 Jan 2022, Last Modified: 08 May 2023, Venue: CCBR 2022, Readers: Everyone
Abstract: Large-scale product recognition systems are crucial for building efficient e-commerce platforms. However, many traditional product recognition approaches rely on single-modal input (e.g., image or text alone), which limits recognition performance. To address this issue, we propose the Multimodal Deep Fusion Network (MDF-Net) for accurate large-scale product recognition. MDF-Net has a two-stream late-fusion architecture in which a CNN model and a bi-directional language model extract semantic latent features from the image and text inputs, respectively. The image and text features are fused via the Hadamard product and then jointly generate the recognition result. We further investigate integrating an attention mechanism and residual connections to improve the text and image representations, respectively. We conduct experiments on MEP-3M, a large-scale multimodal e-commerce product dataset consisting of three million image-text product samples. MDF-Net achieves 93.72% classification accuracy over 599 fine-grained classes, and the empirical results show that it outperforms traditional single-modal approaches.
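The late-fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors stand in for the outputs of the CNN and the bi-directional language model, and the feature dimension, the single linear classification layer, and all function names are assumptions made for illustration. Only the Hadamard (element-wise) product fusion and the 599-class output come from the abstract.

```python
import math
import random


def hadamard_fuse(image_feat, text_feat):
    """Fuse two equal-length feature vectors by element-wise (Hadamard) product."""
    if len(image_feat) != len(text_feat):
        raise ValueError("image and text features must have the same dimension")
    return [a * b for a, b in zip(image_feat, text_feat)]


def linear(x, weights, bias):
    """Dense layer: one logit per weight row (hypothetical classifier head)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_feat, text_feat, weights, bias):
    """Late fusion: fuse the two streams, then classify the joint vector."""
    fused = hadamard_fuse(image_feat, text_feat)
    return softmax(linear(fused, weights, bias))


# Toy inputs. FEATURE_DIM is an assumed value; NUM_CLASSES matches the abstract.
FEATURE_DIM = 8
NUM_CLASSES = 599
rng = random.Random(0)

# Stand-ins for the CNN (image) and language-model (text) feature vectors.
image_feat = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
text_feat = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]

# Untrained, randomly initialised classifier weights (illustration only).
weights = [[rng.gauss(0.0, 0.1) for _ in range(FEATURE_DIM)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(image_feat, text_feat, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the Hadamard product requires both streams to share one dimension, a real system would presumably project the image and text features into a common space before fusing them; the abstract does not say how MDF-Net does this.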