DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

ACL ARR 2025 May Submission 7955 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Large language models (LLMs) and medical LLMs (Med-LLMs) have recently demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus of 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and find significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B show that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks. These results highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.

Paper Type: Short

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, benchmarking, multilingual corpora

Contribution Types: Data resources

Languages Studied: English, Chinese

Submission Number: 7955
