Abstract: Numerous advanced Large Language Models (LLMs) now support context lengths of up to 128K tokens, and some extend to 200K. Benchmarks in the general domain have followed suit in evaluating long-context capabilities. In the medical domain, however, the unique contexts and need for domain expertise necessitate more specialized evaluations. Long-context scenarios are common in medical tasks, yet no long-context LLM benchmark exists specifically for the medical domain. In this paper, we propose MedOdyssey, the first medical long-context benchmark, with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: a medical needle-in-a-haystack evaluation and a series of medical long-context tasks, comprising 10 datasets in total. The former includes challenges such as counter-intuitive reasoning and the injection of novel (unknown) facts to mitigate knowledge leakage and data contamination in LLMs. The latter confronts the challenge of requiring professional medical expertise. In particular, we design the ``Maximum Identical Context'' principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiments evaluate advanced proprietary and open-source LLMs tailored for long-context processing and present detailed performance analyses. The results highlight that LLMs still face challenges in handling long contexts in the medical domain.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Long Context Evaluation, Medical Domain
Contribution Types: Data resources, Data analysis
Languages Studied: English and Chinese
Submission Number: 178