Abstract: In this paper, we release the largest-ever medical Question Answering (QA) dataset, Huatuo-26M, with 26 million QA pairs, along with its streamlined version, Huatuo-Lite, with 177K QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in several respects: (i) it serves as fine-tuning data for training medical large language models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset’s utility in these settings, confirming its significance as a resource in the medical QA landscape.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese