MedFrenchmark, a Small Set for Benchmarking Generative LLMs in Medical French

Published: 01 Jan 2024, Last Modified: 25 Jan 2025, Venue: MIE 2024, License: CC BY-SA 4.0
Abstract: Generative Large Language Models (LLMs) have become ubiquitous in various fields, including healthcare and medicine. Consequently, there is growing interest in leveraging LLMs for medical applications, and new models emerge daily. However, evaluation and benchmarking frameworks for LLMs are scarce, particularly those tailored for medical French. To address this gap, we introduce a minimal benchmark consisting of 114 open questions designed to assess the medical capabilities of LLMs in French. The proposed benchmark encompasses a wide range of medical domains, reflecting the complexity of real-world clinical scenarios. A preliminary validation involved testing seven widely used LLMs, each with 7 billion parameters. Results revealed significant variability in performance, emphasizing the importance of rigorous evaluation before deploying LLMs in medical settings. In conclusion, we present a novel and valuable resource for rapidly evaluating LLMs in medical French. By promoting greater accountability and standardization, this benchmark has the potential to enhance the trustworthiness and utility of LLMs in medical applications.