AVQA Test Samples

🗣️: Question | 🤖: Prediction | ✅: Ground Truth
Q1
q1_0

🗣️ What is the audio classification and source position?

🤖 magpie calling; magpie calling

✅ magpie calling; magpie calling

q1_1

🗣️ What category best describes this sound, and what is its point of origin?

🤖 child; child

✅ child; child

q1_2

🗣️ What is the sound class category? Where is the sound coming from?

🤖 alarm clock; folding chair

✅ telephone; folding chair

Q2
q2_0

🗣️ Is the sound source of the playing tabla located closer to the agent than the visual object congas?

🤖 No

✅ No

q2_1

🗣️ Is the playing snare drum's sound source positioned more distant from the agent than the visual object pigeon?

🤖 Yes

✅ Yes

q2_2

🗣️ Compared to the visual object cell phone, is the playing tambourine's sound source closer to the agent?

🤖 Yes

✅ Yes

Q3
q3_0

🗣️ How far is the chicken crowing sound source from the visual object bowling pins, and what is the relative spatial positioning of chicken crowing to bowling pins?

🤖 left;up;front;3.6

✅ left;up;front;3.7

q3_1

🗣️ Is it possible to gauge the distance from the mynah bird singing sound to the visual object air horn, and specify where mynah bird singing is located relative to air horn?

🤖 right;down;front;3.8

✅ right;down;front;4.3

q3_2

🗣️ Can you calculate the distance from the playing bugle sound source to the visual object magpie calling, and specify the relative location of playing bugle in relation to magpie calling?

🤖 left;up;behind;2.3

✅ left;up;behind;1.6
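
Most answers are several fields joined by `;` (Q1 also puts a space after the separator). As a reading aid for the prediction/ground-truth pairs, here is a minimal Python sketch that pairs the fields up and, where both sides parse as numbers, reports their absolute difference. It is an illustration only, not the benchmark's scoring rule, and `compare_answers` is a hypothetical helper.

```python
from itertools import zip_longest


def compare_answers(pred: str, truth: str):
    """Pair up ';'-separated fields of a prediction and a ground truth.

    Hypothetical reading aid, not the official AVQA metric. Each row is
    (pred_field, truth_field, exact_match, abs_diff), where abs_diff is
    None unless both fields parse as numbers.
    """
    rows = []
    # zip_longest keeps extra fields if the two answers differ in length.
    for p, t in zip_longest(pred.split(";"), truth.split(";"), fillvalue=""):
        p, t = p.strip(), t.strip()
        try:
            diff = round(abs(float(p) - float(t)), 3)
        except ValueError:
            diff = None  # at least one side is not a number
        rows.append((p, t, p == t, diff))
    return rows
```

For q3_0, `compare_answers("left;up;front;3.6", "left;up;front;3.7")` shows the three direction fields matching exactly and the last numeric field off by 0.1.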

Q4
q4_0

🗣️ How would you estimate the distance from the hedge trimmer running sound to the visual object rocking chair, and the angle of the agent’s gaze between them? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 K;2.6

✅ L;3.0

q4_1

🗣️ Could you assess the spatial distance from the playing tabla sound to the visual object loafer and the angle of the agent's gaze difference? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 A;4.1

✅ L;4.2

q4_2

🗣️ Can you estimate how far the sea lion barking sound source is from the visual object bucket, and what is the angular deviation in the agent's gaze direction? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 H;2.1

✅ H;1.4
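
The Q4, Q7 and Q8 prompts share the same twelve lettered options, spaced 30° apart from A: 180° to L: 150°. As a reading aid, a minimal Python sketch that maps each letter back to degrees and measures how far a predicted option lies from the ground-truth one. Treating the angles as circular (so A at 180° and B at -150° count as neighbors) is an assumption made here, not something the samples state; `ANGLE_OPTIONS` and `option_gap` are hypothetical names.

```python
# Option letters exactly as listed in the Q4, Q7 and Q8 prompts.
ANGLE_OPTIONS = {
    "A": 180, "B": -150, "C": -120, "D": -90, "E": -60, "F": -30,
    "G": 0, "H": 30, "I": 60, "J": 90, "K": 120, "L": 150,
}


def option_gap(pred_letter: str, truth_letter: str) -> int:
    """Smallest angular gap in degrees between two options.

    Assumes the options lie on a circle, so the gap wraps at 360 degrees.
    """
    diff = abs(ANGLE_OPTIONS[pred_letter] - ANGLE_OPTIONS[truth_letter]) % 360
    return min(diff, 360 - diff)
```

Under this reading, q4_0 (K vs. L) and q4_1 (A vs. L) both miss by one step, `option_gap(...) == 30`, while q8_0 (J vs. D) is off by 180.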

Q5
q5_0

🗣️ What is the object found at (-79, -10), 1.3 meters, and is it the origin of the sound?

🤖 congas;No

✅ congas;No

q5_1

🗣️ What is the object located at (-137, -8), 2.4 meters in the scene, and is it producing a sound?

🤖 cornet;No

✅ alarm clock;No

q5_2

🗣️ What can be found at (-69, -17), 2.2 meters, and is it the source of any sound?

🤖 mandolin;Yes

✅ mandolin;Yes

Q6
q6_0

🗣️ Can you identify the object positioned at (-28, -15), 1.4 meters and state whether it is producing the sound?

🤖 electric blender;Yes

✅ electric blender;Yes

q6_1

🗣️ Can you tell what object is located at (-146, -6), 1.7 meters and whether it is generating sound?

🤖 chimpanzee;Yes

✅ chimpanzee;Yes

q6_2

🗣️ Which object is found at (-138, -20), 1.9 meters, and is it currently making a sound?

🤖 harmonica;Yes

✅ harmonica;Yes

Q7
q7_0

🗣️ Can you identify which visual object is emitting a sound, and indicate its location? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 canary calling;E;-10;1.6

✅ canary calling;E;-10;1.9

q7_1

🗣️ Out of the multiple visual objects, which is emitting a sound, and where is it located? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 keyboard;E;-18;2.0

✅ keyboard;E;-18;2.3

q7_2

🗣️ Which object in the scene is producing a sound, and where exactly is it located? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 tambourine;E;-10;2.3

✅ wind chime;H;-6;2.5

Q8
q8_0

🗣️ What kind of sound is it, and which object within that class in the scene is the source? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 dog baying;J;-78;2.4

✅ dog barking;D;-9;1.9

q8_1

🗣️ How would you categorize the sound and indicate the object in that class that is making it? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 chicken crowing;E;-15;1.6

✅ chicken crowing;E;-14;1.9

q8_2

🗣️ Could you determine the sound class category, and which object of that category in the scene is making the sound? A: 180°, B: -150°, C: -120°, D: -90°, E: -60°, F: -30°, G: 0°, H: 30°, I: 60°, J: 90°, K: 120°, L: 150°.

🤖 playing accordion;F;-13;1.6

✅ playing accordion;F;-12;1.8

Q9
q9_0

🗣️ What is the type of sound, and is the object making it visually detectable in the scene?

🤖 mynah bird singing;No

✅ mynah bird singing;Yes

q9_1

🗣️ Can you specify the sound type and indicate whether its source can be seen in the scene?

🤖 acoustic guitar;Yes

✅ acoustic guitar;Yes

q9_2

🗣️ What is the sound class, and can the sound source be seen in the scene?

🤖 hedge trimmer;Yes

✅ hedge trimmer;Yes