# Empathy and Understandability: Assessing LLMs in Delivering Compassionate Medical Diagnoses

This repository contains the code and data for evaluating the performance of Large Language Models (LLMs) in delivering empathetic and understandable medical diagnostic explanations across diverse patient demographics.

## 📋 Overview

We developed a comprehensive evaluation framework to assess how well commercial LLMs (GPT-4o and Claude-3.7) can deliver medical diagnoses with appropriate empathy and understandability across different patient populations. Our study reveals systematic demographic biases in AI-generated medical communications.

## 🎯 Key Findings

- **Medical diagnosis** is the strongest bias factor (p<0.0001) – Alzheimer’s receives the highest empathy, heart disease the lowest
- **Education level** shows an inverse relationship with empathy – medical degree holders receive 0.30–0.50 points lower empathy
- **Age patterns** are rater-dependent – a U-shaped empathy distribution emerges only with specific evaluators
- **Cognitive empathy** remains stable across demographics while **affective empathy** varies substantially  
- **Critical methodological issue**: Poor inter-rater reliability between Claude and GPT evaluators

## 🧠 Empathy Dimensions

- Affective empathy: sharing/attuning to the patient’s emotional state (e.g., “I feel sad for you and am here with you.”).
- Cognitive empathy: recognizing and accurately modeling the patient’s feelings and perspective (e.g., “I know you are sad and this is hard for you.”).

Note: This project evaluates simulated monologues for research. It does not claim to replace real clinical conversations (e.g., SPIKES protocol). It is an evaluative study, not a deployment recommendation.


## 🧪 Evaluation Framework
We evaluate simulated doctor–patient diagnostic monologues along two core dimensions:
- Understandability: readability metrics capturing clarity, jargon density, and structural complexity (Flesch–Kincaid, SMOG, Gunning Fog, Coleman–Liau, Dale–Chall).
- Empathy: LLM-as-a-Judge ratings (GPT-4o and Claude-3.7) and human ratings, decomposed into affective (emotional resonance) and cognitive (perspective-taking).

Scope and bias findings:
- Models adapt outputs to socio-demographic variables and medical conditions, creating systematic differences in both understandability and empathy.
- Common issues include over-complexity relative to public health recommendations and rater-dependent affective empathy with inconsistent self-assessment.

## 🔧 Setup

### Prerequisites

```bash
pip install -r requirements.txt
```

### Required Libraries
- pandas
- numpy  
- matplotlib, seaborn
- textstat
- litellm
- scipy

### API Keys
Set up your API keys for:
- OpenAI (GPT-4o)
- Anthropic (Claude-3.7)
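
A quick pre-flight check can catch missing credentials before a long notebook run. The sketch below is illustrative only: `OPENAI_API_KEY` and `OPENAI_API_BASE` are the names this README mentions for the LiteLLM proxy, `ANTHROPIC_API_KEY` is an assumption for direct Anthropic access, and `missing_keys` is a hypothetical helper that is not part of this repository.

```python
import os

# The first two names are mentioned in this README for the LiteLLM proxy;
# ANTHROPIC_API_KEY is an assumption for direct Anthropic access.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_BASE", "ANTHROPIC_API_KEY")


def missing_keys(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

If the notebooks route every call through the proxy, the Anthropic variable may be unnecessary; adjust `REQUIRED_VARS` to match your setup.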

## 📁 Repository Structure
```
├── notebooks/
│   ├── 01_data_collection/
│   │   └── prompt_generation_claude.ipynb
│   └── 02_scoring/
│       └── judges_scoring_v2.ipynb
├── Code/
│   ├── Understandability/
│   │   └── empathy_ethics_understandability_analysis_rikard.ipynb
│   └── Plotting/
├── data/
│   └── raw/
│       ├── prompts/
│       │   ├── empathy_prompts.csv
│       │   └── initial_prompts.csv
│       └── responses/
│           ├── claude_responses_empathy.csv
│           └── gpt_responses.csv
└── README.md
```

## 🚀 Usage
### 1. Generate Diagnostic Scenarios
Run the data collection notebook:
```
notebooks/01_data_collection/prompt_generation_claude.ipynb
```
Outputs are saved under:
- data/raw/prompts/empathy_prompts.csv (generated prompt bank)
- data/raw/prompts/initial_prompts.csv (prompt set for inference)

### 2. Collect Model Responses
In the same notebook (Step 2), responses are requested via the LiteLLM proxy and saved to:
- data/raw/responses/claude_responses_empathy.csv
- data/raw/responses/gpt_responses.csv (if collected similarly)

### 3. Evaluate Understandability
Compute readability metrics:
```
Code/Understandability/empathy_ethics_understandability_analysis_rikard.ipynb
```
This calculates standard metrics (Flesch–Kincaid, SMOG, etc.) and summarizes outputs.

### 4. Assess Empathy
LLM-as-a-Judge evaluation using EmotionQueen and a jury of GPT-4o graders:
```
notebooks/02_scoring/judges_scoring_v2.ipynb
```
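
One plausible way to combine a jury of graders is to average their scores per empathy dimension. The sketch below assumes this mean aggregation; the notebook may aggregate differently. The function name `aggregate_jury`, the dictionary keys, and the toy scores are all hypothetical.

```python
from statistics import mean


def aggregate_jury(ratings):
    """Average affective and cognitive scores over a list of grader ratings.

    Each rating is a dict with the hypothetical keys "affective" and "cognitive".
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    return {
        "affective": mean(r["affective"] for r in ratings),
        "cognitive": mean(r["cognitive"] for r in ratings),
    }


# Toy ratings from three graders for one response (illustrative values only).
jury = [
    {"affective": 4, "cognitive": 5},
    {"affective": 3, "cognitive": 5},
    {"affective": 5, "cognitive": 5},
]
print(aggregate_jury(jury))
```

A median or majority vote would be more robust to a single outlier grader; the mean is used here only because it is the simplest choice.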

### 5. Generate Analysis
Statistical analysis and visualizations:
```
Code/Plotting/plot.ipynb
```

## 🔎 Understandability Evaluation (what we measure)
We compute five standard readability metrics commonly used in health communication:
- Flesch–Kincaid Grade Level: sentence length + syllable density (higher = harder).
- SMOG Index: polysyllabic words in 30 sentences (higher = more years of education).
- Gunning Fog Index: sentence length + complex word share (higher = more years).
- Coleman–Liau Index: letters and sentences per 100 words (higher = harder).
- Dale–Chall Score: proportion of “difficult” words + sentence length (higher = harder).

Key takeaways:
- Both models typically produce content at grade level ~9–13, above the grade 6–8 level recommended for public health materials.
- Complexity adapts with education level; Claude adapts more strongly than GPT.
- By age: easiest for minors, then harder for young adults, and easier again for older adults.
- Minimal differences by gender and geographical/ethnic group; explanations for chronic ischemic heart disease (CIHD) tend to be harder to read.
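
To make the grade-level numbers concrete, here is a minimal, dependency-free sketch of the Flesch–Kincaid Grade Level formula. It is not the `textstat` implementation used in the analysis notebook: the syllable counter is a naive vowel-group heuristic (an assumption for illustration), so absolute values will differ from `textstat` output. The two sample texts are invented.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level with the standard published coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Invented sample texts: short plain sentences vs. one jargon-heavy sentence.
simple = "Your heart is weak. We can help you. Rest and take your pills."
dense = ("Chronic ischemic cardiomyopathy necessitates comprehensive "
         "pharmacological intervention and continuous cardiovascular monitoring.")
print(round(flesch_kincaid_grade(simple), 1), round(flesch_kincaid_grade(dense), 1))
```

The plain version scores far lower than the jargon-heavy one, which is the gap the readability metrics are meant to capture.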

## 📊 Evaluation Framework

### Two-Stage Assessment:

**Stage 1: Generation**
- 156 diagnostic scenarios combining:
  - Demographics: 3 ethnicities × 2 genders × 3 education levels
  - Medical conditions: Obesity, pancreatic cancer, Alzheimer's, heart disease
  - Age range: 8–85 years

**Stage 2: Evaluation**
- **Understandability**: 5 readability metrics (Flesch–Kincaid, SMOG, Gunning Fog, Coleman–Liau, Dale–Chall)
- **Empathy**: Affective (emotional resonance) + Cognitive (perspective-taking)
- **Validation**: EmotionQueen benchmark + human annotation
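
The Stage 1 demographic grid can be sketched with `itertools.product`. This is an illustration under stated assumptions, not the notebook's generation code: the ethnicity and gender labels are placeholders (this README does not name them), and `build_grid` is a hypothetical helper. A full crossing of the listed factors yields 72 base combinations; the study reports 156 scenarios, so the actual design also involves age and is not reproduced here.

```python
from itertools import product

# Placeholder labels: this README does not name the ethnic groups or genders.
ETHNICITIES = ["group_a", "group_b", "group_c"]
GENDERS = ["female", "male"]
EDUCATION = ["high school", "university", "medical degree"]
CONDITIONS = ["obesity", "pancreatic cancer", "Alzheimer's", "heart disease"]


def build_grid():
    """Enumerate every demographic x condition combination as a dict."""
    return [
        {"ethnicity": e, "gender": g, "education": edu, "condition": c}
        for e, g, edu, c in product(ETHNICITIES, GENDERS, EDUCATION, CONDITIONS)
    ]


grid = build_grid()
print(len(grid))  # 3 * 2 * 3 * 4 = 72 base combinations
```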

## 🎨 Key Visualizations

- `scores_by_diagnosis.png` - Empathy bias across medical conditions
- `scores_by_education.png` - Education-level empathy patterns  
- `scores_by_age.png` - Age-related empathy variations
- `human_vs_LLM_1.png` - Human vs. AI evaluation comparison
- `framework.png` - Overall methodology diagram

## 📈 Results Summary

### Systematic Biases Identified:
1. **Medical Diagnosis** (strongest): Alzheimer's > Cancer > Obesity > Heart Disease
2. **Education Level**: High School > University > Medical Degree  
3. **Age** (rater-dependent): U-shaped pattern with higher empathy for children and elderly

### Methodological Insights:
- GPT consistently inflates its own empathy ratings (+0.333 points)
- Claude deflates its own empathy ratings (-0.256 points)
- Poor inter-rater reliability (r=-0.005 to 0.459)
- Human evaluators detect biases that LLMs miss
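
The two quantities above can be illustrated with a small sketch: Pearson's r between two judges' scores on the same responses, and the mean gap between a judge's ratings and another judge's ratings. This assumes the self-rating bias is a simple mean difference, which may not match the study's exact computation; `pearson_r`, `mean_rating_gap`, and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_rating_gap(own_judge, other_judge):
    """Mean of (own-judge score minus other-judge score) on the same responses."""
    return mean(a - b for a, b in zip(own_judge, other_judge))


# Toy scores for five GPT-generated responses (illustrative values only).
gpt_judge = [4.0, 4.5, 3.5, 5.0, 4.0]
claude_judge = [3.5, 4.5, 3.0, 4.5, 4.0]
print(round(pearson_r(gpt_judge, claude_judge), 3))
print(round(mean_rating_gap(gpt_judge, claude_judge), 3))
```

A positive gap means the first judge scores the same responses higher than the second; an r near zero means the two judges barely agree on which responses are more empathetic.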

## 🔬 Reproducing Results
- Option A: Run interactively in Jupyter and execute the notebooks cell by cell.
- Option B: Execute the notebooks headlessly from the repo root using nbconvert:

```bash
# 1) Generate prompts + collect responses (Claude path)
jupyter nbconvert --to notebook --execute "notebooks/01_data_collection/prompt_generation_claude.ipynb" --inplace

# 2) EmotionQueen / LLM-as-judge evaluation
jupyter nbconvert --to notebook --execute "notebooks/02_scoring/judges_scoring_v2.ipynb" --inplace
```

- Ensure API keys/proxy are set as expected by the notebooks (e.g., OPENAI_API_KEY, OPENAI_API_BASE for LiteLLM proxy as shown in the notebooks).
- For additional charts, open plotting notebooks interactively (e.g., Code/Plotting/plot.ipynb).

## 👥 Authors

- **Shunchang Liu** - liushu@ethz.ch
- **Guillaume Drui** - gdrui01@ethz.ch  
- **Jianzhou Yao** - yaojia@ethz.ch
- **Rikard Pettersson** - rpettersson@ethz.ch
- **Sara Kijewski** - sara.kijewski@hest.ethz.ch
- **Alessandro Blasimme** - alessandro.blasimme@hest.ethz.ch


## ⚠️ Ethical Considerations

This research reveals systematic biases in AI medical communication that could perpetuate healthcare disparities. Key concerns:

- Reduced empathy toward highly educated patients
- Differential treatment across medical conditions  
- Potential amplification of existing healthcare biases
- Need for human oversight in clinical deployment

## 🔄 Future Work

- Develop debiasing techniques for medical AI
- Improve evaluation methodology with better inter-rater reliability
- Expand demographic representation in training data
- Large-scale human validation studies
- Real-world clinical deployment guidelines

## 📞 Contact

For questions about the research or code, please contact the authors or open an issue in this repository.

---

**⚠️ Disclaimer**: This research is for academic purposes. Any deployment in clinical settings requires careful validation, bias mitigation, and human oversight to ensure patient safety and equitable care.
