Huatuo-26M, a Large-scale Chinese Medical QA Dataset

ACL ARR 2024 August Submission313 Authors

16 Aug 2024 (modified: 20 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck to community progress. In this paper, we release the largest medical Question Answering (QA) dataset to date, Huatuo-26M, comprising 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally demonstrate the dataset's benefits in several respects: (i) it serves as fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset's utility in these areas, confirming its significance as a resource in the medical QA landscape.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Medical LLM
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 313