# MIMIC-IV LLM Framework for Pharmacovigilance Research

Reproducible implementation of an analysis of acute kidney injury (AKI) risk associated with combined vancomycin and piperacillin/tazobactam (VPT) therapy, using MIMIC-IV data.

## Prerequisites

1. **MIMIC-IV Access**: Complete CITI training and sign the PhysioNet DUA to access MIMIC-IV v3.1 data
2. **Computing**: 16 GB+ RAM required (Google Colab Pro High-RAM recommended)
3. **Data**: Download and extract the MIMIC-IV files to an accessible directory

## File Structure

```
├── notebooks/
│   ├── MIMIC4_JOIN_LLM.ipynb       # Main analysis pipeline
│   ├── MIMIC4_batch.ipynb          # Batch processing 
│   └── MIMIC4.ipynb                # Statistical functions
├── src/
│   ├── config.py                   # Configuration
│   ├── cohort_builder.py           # Cohort construction
│   ├── causal_inference.py         # Propensity score methods
│   └── [other modules...]
└── prompts/
    └── llm_prompt_template.txt     # LLM prompts
```

## Setup

### 1. Install Dependencies
```bash
pip install -r src/requirements.txt
```

### 2. Configure Data Path
Edit `src/config.py`:
```python
from pathlib import Path  # only needed if config.py does not already import Path
MIMIC_DIR = Path("/your/path/to/mimiciv/3.1/")
```

### 3. Verify Data Access
Ensure these MIMIC-IV files are available (`note/discharge.csv.gz` is distributed separately on PhysioNet as part of MIMIC-IV-Note):
- `hosp/admissions.csv.gz`
- `hosp/patients.csv.gz`
- `hosp/prescriptions.csv.gz`
- `hosp/labevents.csv.gz`
- `hosp/d_labitems.csv.gz`
- `note/discharge.csv.gz`
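
A quick pre-flight check can catch a wrong path before a multi-hour run. The sketch below assumes all six files sit under a single root directory matching `MIMIC_DIR`; the helper name `find_missing_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the list above.
REQUIRED_FILES = [
    "hosp/admissions.csv.gz",
    "hosp/patients.csv.gz",
    "hosp/prescriptions.csv.gz",
    "hosp/labevents.csv.gz",
    "hosp/d_labitems.csv.gz",
    "note/discharge.csv.gz",
]


def find_missing_files(mimic_dir, required=REQUIRED_FILES):
    """Return the required relative paths that are not files under mimic_dir."""
    root = Path(mimic_dir)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Placeholder path; replace with the value of MIMIC_DIR in src/config.py.
    missing = find_missing_files("/your/path/to/mimiciv/3.1/")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required MIMIC-IV files found.")
```

If the note files live in a different directory than the `hosp` files, call the helper once per root with the matching subset of paths.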

## Execution

### Step-by-Step (Required Order)
1. **Batch processing**: Execute `MIMIC4_batch.ipynb` first
   - Processes clinical notes locally for confounder extraction
   - Generates structured features needed for analysis
   
2. **Core functions**: Execute `MIMIC4.ipynb` 
   - Defines statistical functions and causal inference methods
   - Required before main analysis
   
3. **Main analysis**: Execute `MIMIC4_JOIN_LLM.ipynb`
   - Joins batch-processed LLM features with clinical data
   - Performs complete causal inference analysis
   - Expected runtime: 2-4 hours total

### Key Outputs to Validate
- Total cohort: 90,327 patients
- VPT rate: 8.7% (7,822 patients)
- AKI incidence: 17.5% overall
- Hazard ratio: 1.40 (95% CI: 1.35-1.45)
- Standardized mean difference (SMD) after inverse probability of treatment weighting (IPTW): 0.018
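
One way to compare a run against the figures above is to round each observed value to the precision at which it is reported here. This is a minimal sketch: the dictionary keys and the `find_mismatches` helper are hypothetical, and the notebooks may expose their results under different names.

```python
# Reference values copied from the list above, as (reported value, decimals).
EXPECTED = {
    "n_cohort": (90327, 0),
    "n_vpt": (7822, 0),
    "vpt_rate_pct": (8.7, 1),
    "aki_incidence_pct": (17.5, 1),
    "hazard_ratio": (1.40, 2),
    "hr_ci_low": (1.35, 2),
    "hr_ci_high": (1.45, 2),
    "smd_after_iptw": (0.018, 3),
}


def find_mismatches(observed, expected=EXPECTED):
    """Return {name: (observed, reported)} for values that disagree after rounding."""
    mismatches = {}
    for name, (reported, decimals) in expected.items():
        value = observed.get(name)
        # A missing value counts as a mismatch; otherwise compare at reported precision.
        if value is None or abs(round(value, decimals) - reported) > 1e-9:
            mismatches[name] = (value, reported)
    return mismatches


if __name__ == "__main__":
    # Illustrative input only: the rate is derived from the two reported counts.
    observed = {"n_cohort": 90327, "n_vpt": 7822, "vpt_rate_pct": 100 * 7822 / 90327}
    for name, (got, want) in find_mismatches(observed).items():
        print(f"{name}: observed {got}, reported {want}")
```

An empty result means every supplied value matches the reported figure at its stated precision.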

## Code Usage

**Main Pipeline** (`MIMIC4_JOIN_LLM.ipynb`):
- Cohort construction with inclusion/exclusion criteria
- Laboratory data processing and AKI detection using KDIGO criteria
- Local clinical note processing for confounder extraction
- Propensity score matching and IPTW analysis
- Bootstrap validation with 300 iterations
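
For orientation, the KDIGO serum creatinine criteria flag AKI when creatinine rises by at least 0.3 mg/dL within 48 hours, or reaches at least 1.5 times a baseline value within 7 days. The sketch below illustrates those two rules only; it is not the code in `lab_processor.py`. Using every earlier measurement in the window as a candidate baseline is an assumption, the urine output criterion is omitted, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

ABS_RISE_MG_DL = 0.3      # absolute rise threshold within 48 hours
RELATIVE_RISE = 1.5       # ratio to baseline within 7 days
EPS = 1e-9                # tolerance for floating-point comparisons


def meets_kdigo_creatinine(measurements):
    """Check creatinine-based KDIGO criteria.

    measurements: iterable of (datetime, creatinine in mg/dL) for one patient.
    Each value is compared with every earlier value inside the relevant window.
    """
    ordered = sorted(measurements)
    for i, (t_now, scr_now) in enumerate(ordered):
        for t_prev, scr_prev in ordered[:i]:
            elapsed = t_now - t_prev
            if elapsed <= timedelta(hours=48) and scr_now - scr_prev >= ABS_RISE_MG_DL - EPS:
                return True
            if elapsed <= timedelta(days=7) and scr_now >= RELATIVE_RISE * scr_prev - EPS:
                return True
    return False


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1)
    series = [(t0, 1.0), (t0 + timedelta(hours=24), 1.3)]
    print(meets_kdigo_creatinine(series))
```

Comparing against every earlier value in the window is equivalent to comparing against the window minimum, which is one common baseline choice; studies that use an admission or pre-admission baseline would need a different reference value.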

**Supporting Modules**:
- `cohort_builder.py`: Patient selection and filtering
- `causal_inference.py`: Propensity scores and IPTW weights
- `lab_processor.py`: Creatinine processing and AKI detection
- `note_processor.py`: Local clinical note analysis
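
To make the IPTW and SMD terminology concrete, the following standard-library sketch computes unstabilized inverse probability of treatment weights from given propensity scores and a weighted standardized mean difference for one covariate. It is an illustration under stated assumptions, not the code in `causal_inference.py`: the function names are hypothetical, the propensity scores are taken as inputs rather than fitted, and the module may use stabilized or trimmed weights instead.

```python
from math import sqrt


def iptw_weights(treated, propensity):
    """Unstabilized weights: 1/e for treated units, 1/(1-e) for untreated units."""
    return [1.0 / e if t else 1.0 / (1.0 - e) for t, e in zip(treated, propensity)]


def weighted_mean_var(values, weights):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def weighted_smd(values, treated, weights):
    """Absolute standardized mean difference between treated and untreated groups."""
    groups = {True: ([], []), False: ([], [])}
    for v, t, w in zip(values, treated, weights):
        groups[bool(t)][0].append(v)
        groups[bool(t)][1].append(w)
    mean_t, var_t = weighted_mean_var(*groups[True])
    mean_c, var_c = weighted_mean_var(*groups[False])
    pooled_sd = sqrt((var_t + var_c) / 2)
    return abs(mean_t - mean_c) / pooled_sd if pooled_sd > 0 else 0.0


if __name__ == "__main__":
    # Toy data: a binary covariate that raises the chance of treatment.
    x = [1, 1, 1, 0, 1, 0, 0, 0]
    treated = [1, 1, 1, 1, 0, 0, 0, 0]
    propensity = [0.75 if xi else 0.25 for xi in x]
    print("SMD before:", weighted_smd(x, treated, [1.0] * len(x)))
    print("SMD after: ", weighted_smd(x, treated, iptw_weights(treated, propensity)))
```

In the toy data the covariate is strongly imbalanced before weighting and approximately balanced afterwards, which is the pattern a small post-IPTW SMD is meant to summarize.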

**Configuration**:
- All parameters are defined in `config.py` so that runs can be reproduced consistently
- Random state is fixed at 7 for deterministic results
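
The effect of a fixed random state can be seen in a small percentile bootstrap. This sketch uses the standard library's `random.Random` with seed 7 and 300 resamples to mirror the settings named in this README; the `bootstrap_ci` helper and the choice of a percentile interval are assumptions, and the notebooks may bootstrap a different statistic with a different generator.

```python
import random


def bootstrap_ci(values, statistic, n_iterations=300, seed=7, alpha=0.05):
    """Percentile bootstrap interval for statistic(values) with a seeded generator."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_iterations)
    )
    lower = estimates[int((alpha / 2) * n_iterations)]
    upper = estimates[int((1 - alpha / 2) * n_iterations) - 1]
    return lower, upper


if __name__ == "__main__":
    def mean(xs):
        return sum(xs) / len(xs)

    # Synthetic binary outcomes, not MIMIC-IV data.
    outcomes = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0] * 20
    print(bootstrap_ci(outcomes, mean))
    print(bootstrap_ci(outcomes, mean))  # identical, because the seed is the same
```

Keeping the generator local to the function means repeated calls with the same seed return the same interval regardless of what other code has drawn from the global random state.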

## Privacy & Compliance

- All patient data processing occurs locally
- No PHI transmitted to external APIs
- Full MIMIC-IV DUA compliance maintained
- LLM APIs used only for prompt development with synthetic data