# Codebase for Term2Note

### Pre-requisites
1. [UMLS Full Release](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html): the 2023AA full release was used.
2. [SNOMED CT](https://www.nlm.nih.gov/healthit/snomedct/archive.html): the SNOMED CT International May 2023 version was used. Expected folder: `Data/SNOMED-CT/SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z`

---

### Data pre-processing
Code is available in `Codes/preprocessing/`.

The steps below split each note into sections and extract the UMLS terms of interest for each section.
1. Download [MIMIC-IV-Note](https://physionet.org/content/mimic-iv-note/2.2/) (training corpus) and the [SNOMED CT Entity Linking Challenge dataset](https://physionet.org/content/snomed-ct-entity-challenge/1.0.0/) (testing corpus). Expected folders: `Data/MIMIC-IV/mimic-iv-note-2.2` and `Data/MIMIC-IV/snomed-ct-entity-challenge/1.0.0`.
    - Filter the MIMIC-IV notes to keep those with ICD-10 codes: `Codes/preprocessing/mimic-iv-note/icd.py`. Example output: `Data/mimic-iv-note/2.2/note/discharge_ICD10_excl.csv`
    - Exclude notes that appear in the SNOMED CT Entity Linking Challenge dataset from MIMIC-IV-Note, matching by note ID.
2. Split each note into six pre-defined sections based on manually constructed headings: `Codes/preprocessing/extract_sections.py`. Example output: `Data/mimic-iv-note/2.2/note/discharge_ICD10_excl_sections_grouped.json`
    - Note: the splitting algorithm is greedy, so section boundaries may be imperfect.
3. Identify UMLS terms in both the training and testing corpora.
   - `Codes/preprocessing/run_QuickUMLS.py`: get results from [QuickUMLS](https://github.com/Georgetown-IR-Lab/QuickUMLS).
   - `Codes/preprocessing/filter_UMLS.py`: filter out non-SNOMED CT terms, keeping only terms that belong to the "body structure", "procedure", and "finding" groups. The mapping between UMLS term IDs and the SNOMED CT IDs of the terms of interest is provided in `Data/UMLS/cui2scui.json`. Example outputs: `Data/processed/discharge_ICD10_excl_UMLS_filtered.json` and `Data/processed/snomed_ct_entity_linking_UMLS_filtered.json`
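
To make steps 2 and 3 more concrete, here is a minimal sketch of heading-based greedy section splitting and of filtering matches against a CUI-to-SNOMED mapping. It is not the code in `extract_sections.py` or `filter_UMLS.py`: the heading lists, section names, and function names are hypothetical, and it assumes that `cui2scui.json` loads as a dictionary keyed by UMLS CUI and that each match is a dictionary with a `cui` field, as in QuickUMLS output.

```python
import re

# Hypothetical headings and section names, for illustration only. The
# repository's six sections and heading lists live in extract_sections.py.
SECTION_HEADINGS = {
    "history": ["history of present illness", "past medical history"],
    "medications": ["medications on admission", "discharge medications"],
    "diagnosis": ["discharge diagnosis"],
}


def split_sections(note):
    """Greedily assign the text after each known heading to its section."""
    heading_to_section = {
        heading: section
        for section, headings in SECTION_HEADINGS.items()
        for heading in headings
    }
    # Match any known heading at the start of a line, followed by a colon.
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(h) for h in heading_to_section) + r")\s*:",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(note))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next recognised heading (or end of note).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        section = heading_to_section[match.group(1).lower()]
        sections.setdefault(section, []).append(note[match.end():end].strip())
    return {name: "\n".join(parts) for name, parts in sections.items()}


def filter_terms(matches, cui2scui):
    """Keep only matches whose CUI has a SNOMED CT ID in the mapping."""
    return [m for m in matches if m["cui"] in cui2scui]
```

Because the split is greedy, any text that follows an unrecognised heading is attached to the preceding recognised section, which is one way the boundaries can end up imperfect.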

---

### Term Generation
Code is available in `Codes/TermGeneration/`.
 - Prepare the dataset: run `prepare_terms.py` to extract the list of section-level terms and split it into training and validation sets.
 - Fine-tune the language model to generate terms:
    ```bash
    # To compute and save the term embeddings first, run with do_train=False and do_eval=False
    CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python train.py --config sft_config.json
    ```
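
The train/validation split mentioned in the dataset preparation step can be sketched as follows. This is an illustration under assumptions, not the logic of `prepare_terms.py`: the function name, the 10% validation fraction, the seed, and the record layout are all hypothetical.

```python
import random


def split_train_val(records, val_fraction=0.1, seed=42):
    """Shuffle records reproducibly and hold out a fraction for validation."""
    shuffled = list(records)
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical record layout: one entry per note section with its terms.
records = [{"section": "history", "terms": [f"term-{i}"]} for i in range(10)]
train, val = split_train_val(records)
```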

---

### Note Generation
Code is available in `Codes/NoteGeneration/`.

**Fine-tune with DP**: fine-tune a 1B model with differential privacy (DP).
   ```bash
   # multi-GPU training
   bash dp_run.sh

   # Depending on your Python and DeepSpeed versions, you may need to convert the DeepSpeed checkpoint to Hugging Face format before running inference
   python DP/dp_convert.py

   # inference
   python infer_vllm_batch.py --model MODEL-CKPT --lora None --dataset DATASET --output OUTPUT_FILE
   ```
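
For readers unfamiliar with differentially private training, the sketch below shows the core update of the generic DP-SGD recipe: clip each per-example gradient, sum, add Gaussian noise, then average. It is an assumption that the training launched by `dp_run.sh` follows this general scheme; the function and parameter names here are hypothetical and are not taken from the repository.

```python
import math
import random


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng=None):
    """Return a clipped, noised average of per-example gradients."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        # Scale each gradient down so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Gaussian noise is calibrated to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]
```

Clipping bounds how much any single note can move the model, and the noise scale is tied to that bound, which is what makes a formal privacy guarantee possible.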

---

### Synthetic Note Evaluation
Code is available in `Codes/evaluation/`.

 - `analyse_datasets.py`: reports note-length statistics and the MAUVE score between synthetic and original notes
    ```bash
    python analyse_datasets.py --dataset1 "PATH-TO/OUTPUT_FILE.json" --key1 "KEY1" --key2 "KEY2" --model "PATH_TO_MODEL"
    ```
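
As a rough illustration of the length statistics, the sketch below summarises note lengths in whitespace-separated tokens. The tokenisation, the function name, and the set of statistics are assumptions; `analyse_datasets.py` may count lengths differently, and the MAUVE computation is not shown.

```python
import statistics


def length_stats(notes):
    """Summarise note lengths, counted in whitespace-separated tokens."""
    lengths = [len(note.split()) for note in notes]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }
```

Running it on both the synthetic and the original notes gives two summaries that can be compared side by side.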

 - `compare_UMLS.py`: compares the UMLS term distributions of synthetic and original notes
    ```bash
    # Before running, set the file path in the script to the inference output
    python compare_UMLS.py
    ```
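
One simple way to compare term distributions is sketched below: it turns per-note term lists into relative frequencies, then reports vocabulary overlap (Jaccard) and total variation distance. These two metrics and all names are assumptions chosen for illustration; the source does not state which measures `compare_UMLS.py` computes.

```python
from collections import Counter


def term_distribution(term_lists):
    """Relative frequency of each term across a list of per-note term lists."""
    counts = Counter(term for terms in term_lists for term in terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def compare_distributions(p, q):
    """Vocabulary overlap and total variation distance between two distributions."""
    vocab = set(p) | set(q)
    if not vocab:
        return {"jaccard": 0.0, "total_variation": 0.0}
    jaccard = len(set(p) & set(q)) / len(vocab)
    total_variation = 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)
    return {"jaccard": jaccard, "total_variation": total_variation}


# Hypothetical term IDs, for illustration only.
original = term_distribution([["C1", "C2"], ["C1"]])
synthetic = term_distribution([["C1"], ["C3"]])
result = compare_distributions(original, synthetic)
```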

---

### Downstream Task -- ICD Code Prediction
Code is available in `Codes/downstream/`.
 - `ICD_classifier.py`
    ```bash
    python icd_classifier.py --train_file "TRAIN_FOLD_FILE.csv" --test_file "TEST_FOLD_FILE.csv" --model_name yikuan8/Clinical-Longformer --num_epochs 30 --output_dir "OUTPUT_PATH" --fold 5
    ```