Abstract: In the real world, an entity is commonly represented by multiple modalities, which motivates multi-modal learning tasks such as multi-modal clustering and cross-modal retrieval. Traditional methods based on deep neural networks usually assume that a single joint factor, or multiple similar factors, are learned across modalities. However, different modalities representing the same content have both common and modality-specific characteristics, and few approaches fully capture both kinds of features, i.e., consistency and complementarity. In this paper, we propose to learn shared and specific factors for each modality. Consistency is explored through the shared factors, and complementarity is exploited by combining the shared and specific factors. To learn these factors, we develop a triadic autoencoder with a deep architecture. Extensive experiments on cross-modal retrieval and multi-modal clustering demonstrate the effectiveness of our model.