For better visualization, we include the original videos of the data examples discussed in the paper in the supplementary materials. We extract the audio streams from these videos and use multiple audio-visual cues to prompt the LLM, which generates the final audio captions.
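As a rough illustration of the two steps described here (audio extraction, then cue-based prompting), the sketch below builds an ffmpeg command that strips the video stream and writes the audio track as WAV, and assembles a text prompt from named cues. It is not the pipeline used in the paper: the function names `build_audio_extract_cmd` and `build_caption_prompt`, the 16 kHz mono output format, the cue names, and the prompt wording are all assumptions made for illustration.

```python
from pathlib import Path


def build_audio_extract_cmd(video_path, out_dir, sample_rate=16000):
    """Return an ffmpeg argument list that writes the video's audio track as WAV.

    The 16 kHz mono PCM format is an assumed default, not taken from the paper.
    """
    video = Path(video_path)
    wav = Path(out_dir) / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video),        # input video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM audio
        "-ar", str(sample_rate), # output sample rate in Hz
        "-ac", "1",              # mono
        str(wav),
    ]


def build_caption_prompt(cues):
    """Join named audio-visual cues into one text prompt (hypothetical wording)."""
    lines = [f"- {name}: {text}" for name, text in cues.items()]
    return (
        "Write one audio caption that describes only what can be heard.\n"
        "Cues:\n" + "\n".join(lines)
    )
```

If ffmpeg is installed, the returned list can be passed to `subprocess.run(cmd, check=True)` to perform the extraction; the prompt string would then be sent to whichever LLM is in use.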