We provide code for training ViT on AVE and Kinetics-Sound. To train uni-modal models, see train_audio.py and train_video.py.
To train multi-modal models with missing-modality augmentation, see train_missing.py.
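
For readers unfamiliar with the idea, here is a minimal sketch of one common form of missing-modality augmentation: randomly zeroing out one modality per sample during training. It is an illustration only and is not taken from train_missing.py. The function name drop_modality, the p_drop parameter, and the use of plain Python lists in place of feature tensors are all assumptions made for this example.

```python
import random


def drop_modality(audio, video, p_drop=0.3, rng=None):
    """Randomly zero out at most one modality of a single sample.

    Hypothetical helper for illustration; not part of this repository.
    audio, video: lists of floats standing in for feature tensors.
    p_drop: probability that one of the two modalities is dropped.
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random
    if rng.random() < p_drop:
        # Choose which modality to drop; both are never dropped together.
        if rng.random() < 0.5:
            audio = [0.0] * len(audio)
        else:
            video = [0.0] * len(video)
    return audio, video


if __name__ == "__main__":
    rng = random.Random(0)
    a, v = drop_modality([0.1, 0.2], [0.3, 0.4], p_drop=0.5, rng=rng)
    print(a, v)
```

In a real training loop the same choice would typically be made per sample inside the batch, and the dropped input might be replaced by a learned placeholder rather than zeros; which variant the script uses should be checked in train_missing.py itself.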