Abstract: Multi-modal neural architecture search (MNAS) is an effective approach for obtaining task-adaptive multi-modal classification models. Deep neural networks, as the current mainstream feature extractors, provide hierarchical features for each modality. Existing MNAS methods have difficulty exploiting such hierarchical features because these features coexist in different forms, such as tensorial multi-scale features and vectorized penultimate features. Moreover, existing methods focus only on the evolution of fusion operators or on the vectorized features of all modalities, which constrains the search space. In this paper, a novel two-stage method called multi-modal multi-scale evolutionary neural architecture search (MM-ENAS) is proposed. The first stage unifies the representation form of the hierarchical features with the proposed evolutionary statistics strategy. The second stage identifies the optimal combination of basic fusion operations over all unified hierarchical features with an evolutionary algorithm. MM-ENAS enlarges the search space by simultaneously searching for feature statistical extraction methods, basic fusion operators, and a feature representation set consisting of tensorial multi-scale features and vectorized penultimate features. Experimental results on three multi-modal tasks demonstrate that the proposed method achieves competitive performance in terms of accuracy, search time, and number of parameters compared with existing representative MNAS methods. In addition, the method adapts quickly to a variety of multi-modal tasks.
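To make the idea of unifying differently shaped hierarchical features concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes that a tensorial multi-scale feature can be reduced to a vector with a simple spatial statistic (mean, max, or std) and then combined with a vectorized penultimate feature by a basic fusion operator such as concatenation. The function name, the choice of statistics, and the fusion step are all assumptions made for illustration.

```python
# Illustrative sketch only: one plausible way to unify tensorial multi-scale
# features and vectorized penultimate features via simple statistics, in the
# spirit of the abstract's first stage. Names and choices here are assumptions.
import numpy as np

def unify_feature(feat: np.ndarray, stat: str = "mean") -> np.ndarray:
    """Map a hierarchical feature of any form to a fixed-form vector.

    A tensorial multi-scale feature (C, H, W) is reduced over its spatial
    dimensions with the chosen statistic; a vectorized penultimate feature (D,)
    is returned unchanged.
    """
    if feat.ndim == 1:                         # already a vector
        return feat
    spatial_axes = tuple(range(1, feat.ndim))  # reduce all axes except channels
    if stat == "mean":
        return feat.mean(axis=spatial_axes)
    if stat == "max":
        return feat.max(axis=spatial_axes)
    if stat == "std":
        return feat.std(axis=spatial_axes)
    raise ValueError(f"unknown statistic: {stat}")

# Toy usage: one tensorial feature and one penultimate vector, unified and then
# combined with a basic fusion operator (concatenation, as an example).
multi_scale = np.random.rand(64, 14, 14)   # tensorial multi-scale feature
penultimate = np.random.rand(128)          # vectorized penultimate feature
fused = np.concatenate([unify_feature(multi_scale, "mean"),
                        unify_feature(penultimate)])
print(fused.shape)  # (192,)
```

In the method described by the abstract, both the statistic used for each feature and the fusion operators combining the unified features would be chosen by the evolutionary search rather than fixed as above.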