Huatuo-26M, a Large-scale Chinese Medical QA Dataset

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone
Abstract: In this paper, we release the largest-ever medical Question Answering (QA) dataset, named Huatuo-26M, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also show experimentally that the proposed dataset is beneficial in several respects: (i) it serves as fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset's utility in these areas, confirming its significance as a resource in the medical QA landscape.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Data resources
Languages Studied: Chinese
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: 9
A2: yes
A2 Elaboration For Yes Or No: 3.1.2, 3.2.2
A3: yes
A3 Elaboration For Yes Or No: 1
B: no
B1: n/a
B1 Elaboration For Yes Or No: We don't use artifacts
B2: n/a
B2 Elaboration For Yes Or No: We don't use artifacts
B3: n/a
B3 Elaboration For Yes Or No: We don't use artifacts
B4: n/a
B4 Elaboration For Yes Or No: We don't use artifacts
B5: n/a
B5 Elaboration For Yes Or No: We don't use artifacts
B6: n/a
B6 Elaboration For Yes Or No: We don't use artifacts
C: yes
C1: yes
C1 Elaboration For Yes Or No: 80
C2: yes
C2 Elaboration For Yes Or No: 4.2
C3: yes
C3 Elaboration For Yes Or No: 6,7,8
C4: yes
C4 Elaboration For Yes Or No: 6,7,8
D: no
D1: n/a
D1 Elaboration For Yes Or No: We don't use annotators
D2: n/a
D2 Elaboration For Yes Or No: We don't use annotators
D3: n/a
D3 Elaboration For Yes Or No: We don't use annotators
D4: n/a
D4 Elaboration For Yes Or No: We don't use annotators
D5: n/a
D5 Elaboration For Yes Or No: We don't use annotators
E: yes
E1: yes
E1 Elaboration For Yes Or No: 4.