Keywords: Multilingual, Trustworthiness, Language Models
TL;DR: Multilingual Trustworthiness Benchmark for HealthCare
Abstract: Integrating language models (LMs) into healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained on high-resource languages, leaving them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages and posing significant challenges for deployment in global healthcare contexts where linguistic diversity is key. In this work, we present \textsc{Clinic}, a \textbf{C}omprehensive Mu\textbf{l}tilingual Benchmark to evaluate the trustworth\textbf{i}ness of la\textbf{n}guage models \textbf{i}n health\textbf{c}are. \textsc{Clinic} systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy. These dimensions are operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents) and encompassing a wide array of critical healthcare topics, including disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, exhibit bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, \textsc{Clinic} lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19256