Abstract: Audio Question Answering (AQA) is a complex task in Multi-Modal Learning, where a system interprets audio inputs and associated questions to produce appropriate answers. Previous AQA research has primarily focused on text-based queries; exploration of spoken questions, even in widely spoken languages such as English, has been limited. Since speech is a primary mode of communication, integrating spoken queries could significantly enhance the capabilities of AQA systems. To bridge this gap, this paper introduces a Spoken AQA system built on the Textless Multilingual Audio Question Answering (TM-AQA) dataset. This dataset comprises 107,514 question-answer pairs in English, Hindi, and Bengali, derived from 1,991 audio recordings of various environmental scenes. The study establishes baseline performance by evaluating several multimodal (MML) AQA frameworks that employ diverse acoustic features and architectures. The experimental results demonstrate that the proposed Audio-MAMBA (A-MAMBA) based MML framework, incorporating a Continuous Scanning Mechanism (CSM), surpasses Transformer-based MML frameworks in both performance and computational efficiency.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Audio Processing, Multi-Modal Learning, State Space Model
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Bengali, Hindi
Submission Number: 2892