# Artifact for "Detecting Data Deviation in Electronic Health Records"

This repository provides the Python scripts for the submission "Detecting Data Deviation in Electronic Health Records." The scripts implement (1) Data Shapley value calculation and (2) the training of the bi-level knowledge distillation approach.

## Scripts Overview

### 1. `shapley_calculator.py`

This script calculates data Shapley values, which quantify how much each training data point contributes to a predictive task's performance. It uses truncated Monte Carlo permutation sampling (TMC-Shapley), with Logistic Regression as the performance evaluation model. Tasks (labels) are processed in parallel to reduce total run time.

**Core Function:**
*   Estimates per-sample data Shapley values for multiple tasks using TMC-Shapley.
*   Supports AUC as the primary performance metric.
*   Aggregates results from distributed computations into a final Shapley value set.
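
The per-sample estimate described above can be sketched in a few lines. This is a minimal illustration of truncated Monte Carlo permutation sampling, not the code in `shapley_calculator.py`: the function name `tmc_shapley`, its parameters, and the pluggable `utility` callable are assumptions made for illustration. In the script, the utility would correspond to the AUC of a Logistic Regression model trained on the selected subset; here a toy additive utility stands in so the sketch runs with the standard library alone.

```python
import random
from typing import Callable, List, Sequence


def tmc_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    iterations: int = 100,
    tolerance: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Estimate per-point Shapley values by sampling random permutations.

    `utility` (hypothetical interface) maps a list of training indices to a
    performance score, e.g. test AUC of a model trained on those points.
    """
    rng = random.Random(seed)
    indices = list(range(n_points))
    full_score = utility(indices)   # score with all training points
    empty_score = utility([])       # score with no training points
    totals = [0.0] * n_points

    for _ in range(iterations):
        perm = indices[:]
        rng.shuffle(perm)
        prev_score = empty_score
        for k, idx in enumerate(perm):
            # Truncation: once the running score is within `tolerance` of the
            # full-data score, remaining marginal contributions count as zero.
            if abs(full_score - prev_score) < tolerance:
                break
            score = utility(perm[: k + 1])
            totals[idx] += score - prev_score  # marginal contribution of idx
            prev_score = score

    return [t / iterations for t in totals]


# Toy additive utility: each point contributes a fixed amount, so the exact
# Shapley value of point i is weights[i].
weights = [1.0, 2.0, 3.0]
values = tmc_shapley(3, lambda subset: float(sum(weights[i] for i in subset)), iterations=50)
print(values)  # [1.0, 2.0, 3.0]
```

With `tolerance=0.0` no truncation occurs and the estimates sum to the gap between the full-data and empty-set scores; a positive tolerance trades a small bias for fewer model fits per permutation.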

**Usage:**
```bash
python shapley_calculator.py \
    --train-features-path /path/to/train_features.npy \
    --train-labels-path /path/to/train_labels.npy \
    --test-features-path /path/to/test_features.npy \
    --test-labels-path /path/to/test_labels.npy \
    --base-output-dir ./shapley_output \
    --iterations-per-label 5000 \
    --num-workers 8 \
    --label-names "Task1,Task2"
```

### 2. `distillation_trainer.py`

This script implements the two-stage knowledge distillation approach proposed in the paper.

*   **Stage 1:** Task-specific neural oracle models ($g^{(t)}$) are trained to approximate the data Shapley values (output from `shapley_calculator.py`).
*   **Stage 2:** A unified EHR data fidelity predictor model ($\Psi$) and an associated attention subnetwork ($\mathcal{A}$) are trained by distilling aggregated knowledge from the Stage 1 neural oracle models.

**Core Function:**
*   Trains Stage 1 models ($g^{(t)}$) to learn task-specific Shapley values.
*   Trains the Stage 2 model ($\Psi$) and attention subnetwork ($\mathcal{A}$) using a composite loss function that combines a knowledge distillation term, an entropy constraint, and a similarity constraint.
*   Employs dynamic loss weighting in both stages for stable and balanced training.
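
To make the Stage 2 objective more concrete, the sketch below shows one way two of its terms could be computed for a single sample. Everything here is an assumption for illustration: the function names, the use of a softmax over per-task attention scores, the squared-error distillation term, and the Shannon entropy of the attention weights are not taken from `distillation_trainer.py`. The similarity constraint and the dynamic loss weights are omitted because their form is not specified here, and the terms are returned separately rather than combined.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over per-task attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights: List[float]) -> float:
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def stage2_loss_terms(
    psi_output: float,
    oracle_outputs: List[float],
    attention_scores: List[float],
) -> Dict[str, float]:
    """Hypothetical per-sample terms for the Stage 2 objective.

    oracle_outputs: one prediction per task from the Stage 1 oracle models.
    attention_scores: unnormalized scores, assumed to come from the
    attention subnetwork.
    """
    weights = softmax(attention_scores)
    # Attention-weighted aggregation of the task-specific oracle outputs.
    target = sum(w * g for w, g in zip(weights, oracle_outputs))
    return {
        "target": target,
        "distillation": (psi_output - target) ** 2,  # squared error to target
        "entropy": entropy(weights),                 # spread of attention
    }


terms = stage2_loss_terms(psi_output=2.5, oracle_outputs=[1.0, 3.0], attention_scores=[0.0, 0.0])
print(terms["target"], terms["distillation"])  # 2.0 0.25
```

In a real training loop these quantities would be computed on batched tensors in PyTorch so that gradients flow to both the predictor and the attention subnetwork.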

**Usage:**
```bash
python distillation_trainer.py \
    --dataset_name EHR_data \
    --feature_file /path/to/features.npy \
    --shapley_pkl_file /path/to/aggregated_shapley_values.pkl \
    --base_output_dir ./distillation_output \
    --num_epochs_g 100 \
    --num_epochs_psi 150 \
    --batch_size 256
```

## Setup and Requirements

1.  **Python Environment:** Python 3.8+ is recommended.
2.  **Dependencies:** Install necessary packages:
    ```bash
    pip install numpy torch scikit-learn tabulate
    ```
    (Ensure PyTorch is compatible with your system, with CUDA support if available.)
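
Before running either script, it can help to confirm that the packages above are importable in the active environment. The helper below is a small convenience sketch and is not part of this repository; the name `check_dependencies` is hypothetical. Note that the `scikit-learn` package is imported as `sklearn`.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Import names for the packages installed via pip above.
REQUIRED_MODULES = ("numpy", "torch", "sklearn", "tabulate")


def check_dependencies(modules: Iterable[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Report which top-level modules can be located, without importing them."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```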
