Abstract: Numerous advanced Large Language Models (LLMs) now support context lengths of up to 128K tokens, and some extend to 200K. Benchmarks in the general domain have followed suit in evaluating long-context capabilities. In the medical domain, however, the unique contexts and need for domain expertise necessitate more specialized evaluations. Long-context scenarios are common in medical tasks, yet no long-context LLM benchmark exists specifically for the medical domain. In this paper, we propose MedOdyssey, the first medical long-context benchmark, with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: a medical needle-in-a-haystack evaluation and a series of medical long-context tasks, comprising 10 datasets in total. The former includes challenges such as counter-intuitive reasoning and the injection of novel (unknown) facts to mitigate knowledge leakage and data contamination in LLMs. The latter confronts the challenge of requiring professional medical expertise. In particular, we design the ``Maximum Identical Context'' principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiments evaluate advanced proprietary and open-source LLMs tailored for long-context processing and present detailed performance analyses. The results highlight that LLMs still face challenges in handling long contexts in the medical domain.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Long Context Evaluation, Medical Domain
Contribution Types: Data resources, Data analysis
Languages Studied: English and Chinese
Submission Number: 178