Multi-Modal Retrieval For Large Language Model Based Speech Recognition

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend pure text-based retrieval methods to incorporate other modalities as well, for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text-based retrieval, and yields up to a $\sim 50\%$ improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-SQuAD question answering dataset.
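As background for the first of the two approaches named in the abstract, the sketch below illustrates the standard kNN-LM interpolation (Khandelwal et al., 2020): the next-token distribution of a base language model is mixed with a distribution induced by the k nearest neighbours retrieved from a datastore of (hidden-state key, next-token value) pairs. This is a minimal, illustrative sketch only; all function names, shapes, and hyperparameters (k, the interpolation weight lambda) are assumptions, and the paper's multi-modal extension with speech-based keys is not shown here.

```python
# Minimal kNN-LM sketch (illustrative assumptions throughout; this is the
# generic text-based technique the abstract builds on, not the authors'
# multi-modal implementation).
import numpy as np

def knn_lm_next_token_probs(query, keys, values, lm_probs, vocab_size, k=8, lam=0.25):
    """Interpolate base-LM probabilities with a kNN distribution built from
    a datastore of (hidden-state key, next-token value) pairs."""
    # Squared L2 distance from the query hidden state to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances of the k retrieved neighbours.
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    # Scatter neighbour mass onto the tokens stored as their values
    # (duplicate token ids accumulate correctly via np.add.at).
    knn_probs = np.zeros(vocab_size)
    np.add.at(knn_probs, values[nearest], weights)
    # Final distribution: lambda * kNN + (1 - lambda) * base LM.
    return lam * knn_probs + (1.0 - lam) * lm_probs

# Toy usage with random data, just to show the shapes involved.
rng = np.random.default_rng(0)
V, D, N = 100, 16, 1000          # vocab size, hidden dim, datastore size
keys = rng.normal(size=(N, D))   # stored hidden states
values = rng.integers(0, V, size=N)  # next tokens observed at those states
lm_probs = rng.dirichlet(np.ones(V))
query = rng.normal(size=D)
probs = knn_lm_next_token_probs(query, keys, values, lm_probs, V)
assert np.isclose(probs.sum(), 1.0)
```

The cross-attention variant mentioned in the abstract instead feeds the retrieved entries into the model as an additional attention context rather than interpolating output distributions; the paper should be consulted for its exact formulation.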
Paper Type: long
Research Area: Speech recognition, text-to-speech and spoken language understanding
Contribution Types: NLP engineering experiment
Languages Studied: English
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: 6
A2: n/a
A3: yes
A3 Elaboration For Yes Or No: 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: 3
B2: n/a
B3: yes
B3 Elaboration For Yes Or No: 3
B4: yes
B4 Elaboration For Yes Or No: 3
B5: yes
B5 Elaboration For Yes Or No: 3
B6: yes
B6 Elaboration For Yes Or No: 3
C: yes
C1: yes
C1 Elaboration For Yes Or No: 3
C2: yes
C2 Elaboration For Yes Or No: 3
C3: yes
C3 Elaboration For Yes Or No: 4
C4: yes
C4 Elaboration For Yes Or No: 3
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: no
E1: n/a