MedHalwasa: Quantify & Analyze Factual Hallucinations in Large Language Models for Arabic Medical Data
Abstract: Hallucination in medical text generation poses critical risks, especially when large language models (LLMs) produce factually incorrect information. Such behavior, particularly at scale, can degrade the quality of clinical decision-making and compromise patient safety. Although this issue has been studied in English, it remains largely unexplored in Arabic. We introduce MedHalwasa, the first Arabic dataset for quantifying and analyzing hallucination in Arabic medical fact generation. The name MedHalwasa is derived from “Medical Halwasa,” where “Halwasa” denotes hallucination in Arabic. Using nine different LLMs, we generate and evaluate 9,000 Arabic medical facts and annotate each with an automatic factuality label. To support future research, we detail a systematic and reproducible data generation and annotation framework that can be extended to study other LLMs and domains. Our study enables the first systematic analysis of hallucinations in Arabic medical contexts and offers key insights to inform the selection of reliable LLMs for Arabic healthcare applications. The dataset is publicly accessible to facilitate future research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; NLP datasets; automatic evaluation of datasets; evaluation; datasets for low resource languages
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Arabic
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: We did not include a separate discussion of potential risks because our work is primarily focused on evaluating existing language models rather than proposing new models, training methods, or deployment strategies. The dataset we introduce is designed for research purposes and does not contain sensitive or identifiable patient information. Our goal is to analyze hallucination behaviors in a controlled setting to inform future responsible use of LLMs in Arabic healthcare contexts.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We have cited all the models and frameworks used within the text where relevant (i.e., page 3, Large Language Model Selection Strategy section).
B2 Discuss The License For Artifacts: N/A
B2 Elaboration: We did not explicitly discuss licenses or terms of use because all artifacts used in our work, including the language models and libraries, are publicly available and widely adopted in academic research under open-source licenses. We adhered to their usage policies, and no proprietary or restricted-access artifacts were involved in our study.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We did not explicitly discuss intended use because all artifacts used in this work (including models) were employed strictly for academic research purposes, in line with their typical and permissible usage as described in their documentation or licensing terms. No artifacts were used in a commercial or unintended manner, and our work remains within the research scope.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We relied on publicly available LLM outputs generated in a controlled setting. No real user data or sensitive personal content was involved. Given the nature of the task—medical fact generation—there were no privacy concerns introduced through the model outputs.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Figure 4, Table 1, and Figure 6
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Table 1, Section 3
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: The focus of our work is on evaluating the factual consistency of LLM outputs rather than optimizing model performance through hyperparameter tuning. We used off-the-shelf pre-trained models with default decoding settings or settings informed by prior work. As such, a hyperparameter search was not applicable to our experimental goals.
C3 Descriptive Statistics: No
C3 Elaboration: Our primary goal was to conduct a qualitative and case-based evaluation of factual consistency in LLM outputs for medical fact generation. As such, we did not perform repeated runs or statistical aggregation (e.g., mean, variance) of results. We clearly indicate in the paper that the reported outcomes are based on a single representative run, which aligns with the nature and objectives of our analysis.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Section 3
Author Submission Checklist: yes
Submission Number: 513