MedHalwasa: Quantify & Analyze Factual Hallucinations in Large Language Models for Arabic Medical Data
Abstract: Hallucination in medical text generation poses critical risks, especially when large language models (LLMs) produce factually incorrect information. Such behavior, particularly at scale, can degrade the quality of clinical decision-making and compromise patient safety. Although this issue has been studied in English, it remains largely unexplored in Arabic. We introduce MedHalwasa, the first Arabic dataset for quantifying and analyzing hallucination in Arabic medical fact generation. The name MedHalwasa is derived from “Medical Halwasa,” where “Halwasa” denotes hallucination in Arabic. Using nine different LLMs, we generate and evaluate 9,000 Arabic medical facts and annotate each with an automatic factuality label. To support future research, we detail a systematic and reproducible data generation and annotation framework that can be extended to study other LLMs and domains. Our study enables the first systematic analysis of hallucinations in Arabic medical contexts and offers key insights to inform the selection of reliable LLMs for Arabic healthcare applications. The dataset is publicly accessible to facilitate future research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; NLP datasets; automatic evaluation of datasets; evaluation; datasets for low resource languages
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Arabic
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: We did not include a separate discussion of potential risks because our work is primarily focused on evaluating existing language models rather than proposing new models, training methods, or deployment strategies. The dataset we introduce is designed for research purposes and does not contain sensitive or identifiable patient information. Our goal is to analyze hallucination behaviors in a controlled setting to inform future responsible use of LLMs in Arabic healthcare contexts.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We have cited all the models and frameworks used within the text where relevant (i.e., page 3, Large Language Model Selection Strategy section).
B2 Discuss The License For Artifacts: N/A
B2 Elaboration: We did not explicitly discuss licenses or terms of use because all artifacts used in our work, including the language models and libraries, are publicly available and widely adopted in academic research under open-source licenses. We adhered to their usage policies, and no proprietary or restricted-access artifacts were involved in our study.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We did not explicitly discuss intended use because all artifacts used in this work (including models) were employed strictly for academic research purposes, in line with their typical and permissible usage as described in their documentation or licensing terms. No artifacts were used in a commercial or unintended manner, and our work remains within the research scope.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We relied on publicly available LLM outputs generated in a controlled setting. No real user data or sensitive personal content was involved. Given the nature of the task—medical fact generation—there were no privacy concerns introduced through the model outputs.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Figure 4, Table 1, and Figure 6
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Table 1, Section 3
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: The focus of our work is on evaluating the factual consistency of LLM outputs rather than optimizing model performance through hyperparameter tuning. We used off-the-shelf pre-trained models with default decoding settings or settings informed by prior work. As such, a hyperparameter search was not applicable to our experimental goals.
C3 Descriptive Statistics: No
C3 Elaboration: Our primary goal was to conduct a qualitative and case-based evaluation of factual consistency in LLM outputs for medical fact generation. As such, we did not perform repeated runs or statistical aggregation (e.g., mean, variance) of results. We clearly indicate in the paper that the reported outcomes are based on a single representative run, which aligns with the nature and objectives of our analysis.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Section 3
Author Submission Checklist: yes
Submission Number: 513