# CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Integrating language models (LMs) in healthcare systems holds great promise for
improving medical workflows and decision-making. However, a critical barrier to
their real-world adoption is the lack of reliable evaluation of their trustworthiness,
especially in multilingual healthcare settings. Existing LMs are predominantly
trained on high-resource languages, which leaves them ill-equipped to handle the
complexity and diversity of healthcare queries in mid- and low-resource languages.
This poses significant challenges for deploying them in global healthcare contexts where
linguistic diversity is key. In this work, we present CLINIC, a Comprehensive
Multilingual Benchmark to evaluate the trustworthiness of language models in
healthcare. CLINIC systematically benchmarks LMs across five key dimensions
of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major
continents), and encompassing a wide array of critical healthcare topics like disease
conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness,
demonstrate bias across demographic and linguistic groups, and are susceptible
to privacy breaches and adversarial attacks. By highlighting these shortcomings,
CLINIC lays the foundation for enhancing the global reach and safety of LMs in
healthcare across diverse languages. We have uploaded our dataset to Harvard
Dataverse and shared all the code as part of the supplementary material.

<img width="836" height="432" alt="image" src="https://github.com/user-attachments/assets/ee5ff0d8-dcdc-4316-ac6a-5babec1576b6" />

CLINIC is a multilingual benchmark comprising samples from five trustworthiness thrusts across six healthcare subdomains and 15 global languages. It covers testing of proprietary models,
open-weight models (small and large), and specialized medical language models.

**Dataset Link:** [**https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YTULXG**](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YTULXG)  

## Contents of this repo
This repository contains the model generation and response evaluation scripts used in the paper. The generation folder consists of 16 sub-folders, each containing a Python script used for inference with different models. Similarly, the evaluation folder consists of 16 scripts used to evaluate the responses of different models.

Note: The paper describes 18 tasks. For response generation and evaluation, we combined three of them, the False Confidence Test (FCT), False Question Test (FQT), and None of the Above Test (NOTA), into one, which gives 16 scripts each for generation and evaluation.
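To check a local checkout against the layout described above, the following is a minimal sketch that counts the Python scripts under each folder. It assumes the top-level folders are literally named `generation` and `evaluation` and that the scripts use the `.py` extension. The helper names `list_scripts` and `summarize` are hypothetical and not part of this repository.

```python
from pathlib import Path


def list_scripts(folder):
    """Return every Python script found under `folder`, sorted by path."""
    folder = Path(folder)
    if not folder.is_dir():
        # A missing folder simply contributes no scripts.
        return []
    return sorted(folder.rglob("*.py"))


def summarize(repo_root="."):
    """Map each top-level folder name to the number of scripts it contains."""
    root = Path(repo_root)
    # Assumption: folder names are taken from the description above.
    return {
        name: len(list_scripts(root / name))
        for name in ("generation", "evaluation")
    }


if __name__ == "__main__":
    for name, count in summarize().items():
        print(f"{name}: {count} script(s)")
```

Run from the repository root, each folder should report 16 scripts if the checkout matches the description above.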


