We provide code for training ViTs and ResNets on CREMA-D, AVE, and Kinetics-Sound. To train uni-modal models, refer to train_audio.py (audio) and train_video.py (video).
To train multi-modal LoRA models, see train_ume_lora.py.