Keywords: conditional batch norm, multimodal learning, shortcut learning
TL;DR: An empirical evaluation of conditional batch normalization shows that it may fail to learn visual features because of shortcut learning between the auxiliary modality and the label.
Abstract: Humans have perfected the art of learning from multiple modalities through their sensory organs. Despite impressive predictive performance on a single modality, neural networks cannot reach human-level accuracy when learning from multiple modalities. The task is particularly challenging because the modalities differ in structure. A popular method, Conditional Batch Normalization (CBN), was proposed to learn contextual features that aid a deep learning task. It uses auxiliary data to improve representational power by learning affine transformations for convolutional neural networks. Despite the performance boost obtained with CBN layers, our work reveals that the visual features learned when auxiliary data is introduced via CBN deteriorate. We perform comprehensive experiments to evaluate how brittle CBN is with respect to the dataset. We show that CBN is sensitive to the dataset, suggesting that learning from visual features could often be superior for generalization. We perform exhaustive experiments on natural images for bird classification and on histology images for cancer type classification. We observe that the CBN network learns close to no visual features on the bird classification dataset and only partial visual features on the histology dataset. Our experiments reveal that CBN may encourage shortcut learning between the auxiliary data and the labels.
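To make the mechanism concrete, the following is a minimal pure-Python sketch of a CBN-style layer, written under simplifying assumptions rather than taken from the evaluated networks. It assumes activations are already reduced to one value per channel, that the base scale and shift are fixed at 1 and 0, and that a single linear map predicts the per-sample offsets from the auxiliary vector. The names `conditional_batch_norm`, `w_gamma`, and `w_beta` are hypothetical.

```python
import math


def conditional_batch_norm(x, aux, w_gamma, w_beta, eps=1e-5):
    """Illustrative CBN-style layer (not the authors' implementation).

    x:       N samples, each a list of C channel activations.
    aux:     N auxiliary vectors, each of length D.
    w_gamma: D x C weights mapping aux to a per-channel scale offset.
    w_beta:  D x C weights mapping aux to a per-channel shift offset.
    """
    n, c = len(x), len(x[0])
    # Per-channel batch statistics, as in ordinary batch normalization.
    mean = [sum(row[j] for row in x) / n for j in range(c)]
    var = [sum((row[j] - mean[j]) ** 2 for row in x) / n for j in range(c)]

    out = []
    for row, a in zip(x, aux):
        # Assumption: one linear map predicts the offsets from the auxiliary input.
        d_gamma = [sum(a[k] * w_gamma[k][j] for k in range(len(a))) for j in range(c)]
        d_beta = [sum(a[k] * w_beta[k][j] for k in range(len(a))) for j in range(c)]
        # Affine transform conditioned on aux: base scale 1 and base shift 0.
        out.append([
            (1.0 + d_gamma[j]) * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + d_beta[j]
            for j in range(c)
        ])
    return out


# Two samples, two channels, one-dimensional auxiliary input.
x = [[1.0, 2.0], [3.0, 6.0]]
aux = [[1.0], [0.0]]
zero = [[0.0, 0.0]]

plain = conditional_batch_norm(x, aux, zero, zero)              # reduces to plain batch norm
shifted = conditional_batch_norm(x, aux, zero, [[5.0, 0.0]])    # aux shifts channel 0 of sample 0
```

With zero conditioning weights the layer reduces to ordinary batch normalization. With a large `w_beta`, the output for a sample is driven mostly by its auxiliary vector rather than by the visual activations, which illustrates one way a shortcut between auxiliary data and labels could arise in principle.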