Culturally Respectful Is Not Enough: Auditing LLM Safety in Diabetes Advice During Ramadan

Muhra AlMahri

Published: 14 Jun 2026, Last Modified: 21 Jun 2026. Venue: ICML 2026 Workshop MusIML (Poster). License: CC BY 4.0

Keywords: LLM safety, culturally-aware AI, medical AI, healthcare evaluation, Muslim communities, Ramadan, diabetes, audit benchmark, prompt engineering, responsible AI, AI fairness, language models, evaluation, LLM bias, prompt grounding

Abstract: Large language models are increasingly consulted for health information, yet their safety is rarely evaluated in culturally situated medical contexts where a user's religious practice changes the relevant risks, constraints, and answer style. We study Ramadan fasting among Muslims with diabetes, a setting in which safe advice must jointly handle hypoglycemia and dehydration risk, medication adjustment, religious significance, and individualized clinical judgment. We introduce RamadanSafeQA, a preliminary audit benchmark of 68 synthetic vignettes spanning five Ramadan-diabetes categories and IDF-DAR-style risk levels. We generate 816 responses from four LLMs (GPT-4o, Claude Sonnet 4.6, Jais 2 8B, and MedGemma 27B) under three prompt conditions (vanilla, safety-checklist, and guideline-grounded), and manually score a shuffled subset of 530 responses with a four-item safety rubric. Cultural respect, clinician referral, and autonomy preservation are near ceiling across models, while medical safety varies sharply: fully-safe rates range from 0% for Jais 2 8B to 81% for Claude Sonnet 4.6 with checklist prompting. The failures are usually medical omissions or incompleteness, not bare refusal or overt religious disrespect. Guideline-grounded prompting improves three of four models but does not help Jais in this English-language audit; its dominant failure mode is substituting supportive interpersonal scripts for clinical content. Our results expose a dissociation between the cultural and clinical axes: a response can satisfy every cultural criterion while failing on medical safety. Culturally aware medical safety evaluation must therefore measure both axes, because high cultural-respect scores can co-occur with missing clinical substance.
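
The audit design summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the vignette IDs are placeholders, the rubric item names are assumed from the four criteria the abstract mentions, and treating "fully safe" as "all four items pass" is an assumption about how the rubric is aggregated.

```python
from itertools import product

# Placeholder vignette IDs (hypothetical); the benchmark has 68 vignettes.
VIGNETTES = [f"v{i:02d}" for i in range(1, 69)]
MODELS = ["GPT-4o", "Claude Sonnet 4.6", "Jais 2 8B", "MedGemma 27B"]
PROMPTS = ["vanilla", "safety-checklist", "guideline-grounded"]

# Assumed rubric item names, inferred from the criteria named in the abstract.
RUBRIC = (
    "medical_safety",
    "cultural_respect",
    "clinician_referral",
    "autonomy_preservation",
)


def design_grid():
    """Enumerate every (vignette, model, prompt) cell of the audit."""
    return list(product(VIGNETTES, MODELS, PROMPTS))


def is_fully_safe(scores):
    """Assumption: a response is fully safe only if every rubric item passes."""
    return all(scores[item] for item in RUBRIC)


def fully_safe_rate(scored_responses):
    """Fraction of scored responses that pass all rubric items."""
    if not scored_responses:
        return 0.0
    passed = sum(is_fully_safe(s) for s in scored_responses)
    return passed / len(scored_responses)


if __name__ == "__main__":
    # 68 vignettes x 4 models x 3 prompt conditions
    print(len(design_grid()))
```

Under these assumptions the grid has 68 × 4 × 3 = 816 cells, matching the number of generated responses. A response that passes the three cultural and interpersonal items but fails `medical_safety` is counted as not fully safe, which is the dissociation the abstract describes.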

Track: Track 1: ML Research Addressing Challenges Faced by Muslim Communities

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 22
