Audio:
1. LibriSpeech: test-clean split, full set
2. AudioCaps: test split, full set

Video:
1. NExTQA: nextqa_test.json
ID provided in the "image" field.

Image:
1. Flickr30k: flickr30k_captions.json 
(this is the standard 1k test set). ID provided in the "image" field.
2. TextVQA: textvqa.json 
ID provided in the "image" field.
3. GQA: testdev_balanced_questions_with_images.json
ID provided in the "image" field.
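All of the image test files above store the sample ID in an "image" field (AVSD below uses "image_name" instead). Assuming each file is a flat JSON list of dicts — the exact schema is not spelled out here — a minimal ID loader might look like:

```python
import json

def collect_ids(entries, key="image"):
    """Pull the dataset ID out of each entry.

    Most test files here store it under "image"; AVSD uses
    "image_name", so the key is a parameter.
    """
    return [entry[key] for entry in entries]

def load_ids(path, key="image"):
    """Load a test JSON (assumed to be a list of dicts) and collect IDs."""
    with open(path, "r", encoding="utf-8") as f:
        return collect_ids(json.load(f), key=key)
```

Usage would be e.g. `load_ids("textvqa.json")` or `load_ids("avsd_textformat_subset.json", key="image_name")`, adjusted if the real files wrap the list in a top-level object.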

Audio-visual:
1. How2: how2_test.json
ID provided in "image". Format: <video_id>_<start_second>_<end_second>.mp4 or .wav.
2. AVSD: avsd_textformat_subset.json
ID provided in the "image_name" field. The audio file is extracted from the corresponding Charades video.
3. Image Spoken QA (ISQA): Questions are synthesised from the TextVQA and GQA test sets.
4. Audio-Visual Sound Source Detection (AVSSD): testdata_formatted.json
ID provided in the "image" field as two values: the first is the image and the second is the corresponding audio.
5. Audio Visual Matching (AVM): audiovisualmatching_combined.json
ID provided in the "image" field as a list of two values: the first is the image and the second is the audio/speech.
Whether the sample comes from VGGSS or SpokenCOCO is also indicated in the ID.
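The audio-visual IDs above follow two conventions that are worth parsing programmatically: How2 clip IDs encode a video ID plus a time span (`<video_id>_<start_second>_<end_second>.mp4`/`.wav`), and AVM/AVSSD pack an image ID and an audio ID into one two-element field. A minimal sketch — the VGGSS/SpokenCOCO substring check is a hypothetical placeholder, since the notes only say the corpus is indicated somewhere in the ID:

```python
import os

def parse_how2_id(clip_id):
    """Split a How2 clip ID of the form <video_id>_<start_second>_<end_second>.<ext>.

    The video ID itself may contain underscores, so split from the right.
    Start/end are assumed to be integer seconds.
    """
    stem, ext = os.path.splitext(clip_id)
    video_id, start, end = stem.rsplit("_", 2)
    return video_id, int(start), int(end), ext

def split_avm_pair(image_field):
    """Unpack an AVM/AVSSD "image" field: [image_id, audio_id].

    The corpus marker below is an assumption; adjust the substring
    check to whatever the real IDs actually contain.
    """
    image_id, audio_id = image_field
    source = "SpokenCOCO" if "spokencoco" in audio_id.lower() else "VGGSS"
    return image_id, audio_id, source
```

For example, `parse_how2_id("xyz_5_15.wav")` yields `("xyz", 5, 15, ".wav")`, which is enough to locate both the trimmed clip and its parent video.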
