Abstract: Real-world perception tasks often involve multiple modalities or views of the input. While jointly training classification models on multiple modalities has been explored previously, it has not consistently outperformed the best single-modality model. This paper addresses one of the reasons for this: the difficulty of balancing the contributions of each input in the end-to-end training of multi-input models. In addition, the increased capacity of multi-input networks can lead to overfitting. To address these issues, we propose InputMix, a simple yet effective method for mixing different inputs. Our method mixes a proportion p of input pairs to mitigate the increased-capacity problem and assigns a weighting factor λ to each input to generate a mixed target, allowing us to specify the contribution of each input. Experimental results on three multi-input classification tasks demonstrate that our method significantly improves the generalization performance of multi-input neural networks. Code is available at https://github.com/JesseWong333/inputmix/.
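For intuition, below is a minimal sketch of the mixing idea as the abstract describes it, assuming a two-modality batch of PyTorch tensors and a mixup-style pairing across samples within the batch. The function name, signature, and exact pairing rule are illustrative assumptions, not the authors' implementation.

```python
import torch

def inputmix_batch(x_a, x_b, y, lam_a=0.6, lam_b=0.4, p=0.5):
    """Hypothetical sketch of InputMix-style mixing (assumed, see lead-in).

    For a proportion p of samples in the batch, modality-B inputs are paired
    with those of a randomly permuted sample, and the target is a soft label
    weighting each input's label by its factor (lam_a, lam_b).

    x_a, x_b : tensors for the two input modalities, shape (B, ...)
    y        : one-hot targets, shape (B, C)
    """
    batch = x_a.size(0)
    perm = torch.randperm(batch)            # partner sample for modality B
    mix = torch.rand(batch) < p             # which samples get mixed
    # Broadcast the per-sample mask over the trailing input dimensions.
    mask_b = mix.view(-1, *([1] * (x_b.dim() - 1)))
    x_b_mixed = torch.where(mask_b, x_b[perm], x_b)
    # Mixed target: weight each contributing input's label by its factor.
    y_mixed = torch.where(mix.view(-1, 1), lam_a * y + lam_b * y[perm], y)
    return x_a, x_b_mixed, y_mixed
```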