
# From Zero-Shot to Bedside: A Practical Playbook for Adapting Open-Source LLMs to Clinical Symptom Extraction
>**TL;DR**. This repo accompanies the ML4H manuscript *From Zero-Shot to Bedside*, which presents a playbook for adapting open-source LLMs to extract structured symptom information from de-identified oncology notes. It compares prompt designs, contrasts open vs. proprietary models, introduces an LLM-assisted adjudication loop to target likely label errors, and studies machine-generated label augmentation to stretch limited expert annotations.

**For the most up-to-date files, please visit [the anonymous repository](https://anonymous.4open.science/r/From-Zero-Shot-to-Bedside-A-Practical-Playbook-E13E/README.md).**

## Overview
This repository contains code to:
* Benchmark zero-shot extraction across multiple instruction-tuned LLMs and compare two types of prompts.
* Fine-tune an open model (e.g., Llama-3.1-8B-Instruct) with QLoRA/LoRA.
* Run a multi-fold model-vs-label disagreement scan that flags likely human annotation errors for targeted adjudication.
* Augment training data with high-confidence machine-generated annotations.


## Zero-shot inference and prompt design
We benchmarked the fine-tuned LLMs against proprietary LLMs such as GPT-4o. Inference was run with the script `data_augmentation/llm_inference.py`.

We designed two types of prompts in this study. The examples below are the ones used for the on-treatment cohort.

**Mapped Symptom Extraction**
```
We have a list of symptoms of interest in these notes organized within the 
"Predetermined Symptom Map" below. 
Please identify all symptoms that the patient is experiencing from this list during this visit. 
If the patient is not experiencing the symptom currently, 
even if it is mentioned or the patient has experienced it in the past, do not include it. 
If a symptom does not match any item in the map, you do not need to include it.
    Predetermined Symptom Map:
    (A): EyeRedness
    (B): LowerBackPain
    (C): WeightLoss
    (D): AppetiteLoss
    (E): Jaundice
    (F): Pruritus
    (G): Indigestion
    (H): Steatorrhea
    (I): Urine Color Change
    (J): Constipation
    (K): Nausea
    (L): Vomiting
    (M): Diarrhea
    (N): GasorBloating
    (O): FatigueMalaiseLethargy
    (P): EarlySatiety
    (Q): BloodGlucose
    (R): GI_Bleed
    (S): Melena
    (T): BRBPR
    (U): AbdominalPain
    (V): UpperMidBackPain
Your answer should be in json format.
For example: {\n"Eye Redness": {\n"Symptom Map": "(A): EyeRedness"\n},\n"Nausea": 
{\n"Symptom Map": "(K): Nausea"\n}\n}
If no symptoms are identified, return an empty response: {\n"N/A": {\n"Symptom Map": 
"(Z): None"\n}\n}
Provide your answer in the given json format.
```

**Binary Symptom Encoding**

```
We have a list of symptoms of interest in these notes organized within the 
"Predetermined Symptom Map" below. 
Please identify all symptoms that the patient is experiencing from this list during this visit. 
If the patient is not experiencing the symptom currently, 
even if it is mentioned or the patient has experienced it in the past, do not include it. 
If a symptom does not match any item in the map, you do not need to include it.
    Predetermined Symptom Map:
    (A): EyeRedness
    (B): LowerBackPain
    (C): WeightLoss
    (D): AppetiteLoss
    (E): Jaundice
    (F): Pruritus
    (G): Indigestion
    (H): Steatorrhea
    (I): Urine Color Change
    (J): Constipation
    (K): Nausea
    (L): Vomiting
    (M): Diarrhea
    (N): GasorBloating
    (O): FatigueMalaiseLethargy
    (P): EarlySatiety
    (Q): BloodGlucose
    (R): GI_Bleed
    (S): Melena
    (T): BRBPR
    (U): AbdominalPain
    (V): UpperMidBackPain
Your answer should be in json format.
For example: {"(A) EyeRedness": 0, "(B) LowerBackPain": 1, "(C)WeightLoss": 0, 
"(D) AppetiteLoss": 0, "(E) Jaundice": 0, "(F) Pruritus": 0,"(G) Indigestion": 0, 
"(H) Steatorrhea": 0, "(I) UrineColorChange": 0, "(J)Constipation": 0, "(K) Nausea": 0, 
"(L) Vomiting": 0, "(M) Diarrhea": 0,"(N) GasorBloating": 0, "(O) FatigueMalaiseLethargy": 0, 
"(P) EarlySatiety":0, "(Q) BloodGlucose": 0, "(R) GI_Bleed": 0, "(S) Melena": 0, "(T) BRBPR":0, 
"(U) AbdominalPain": 0, "(V) UpperMidBackPain": 0}
Provide your answer in the given json format.
```
All the prompts are listed in `model_finetuning\configs\pancreas\templates_ontreatment.yaml` and `model_finetuning\configs\pancreas\templates.yaml`.
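Both prompt formats can be reduced to the same 0/1 vector over the symptom map before scoring. The following is a minimal sketch of such a parser, assuming the model returns valid JSON in the formats shown in the prompts above. The function names `parse_mapped` and `parse_binary` are hypothetical and are not taken from `llm_inference.py`.

```python
import json
import re
import string

# Symptom names copied from the on-treatment "Predetermined Symptom Map" above.
SYMPTOMS = [
    "EyeRedness", "LowerBackPain", "WeightLoss", "AppetiteLoss", "Jaundice",
    "Pruritus", "Indigestion", "Steatorrhea", "Urine Color Change",
    "Constipation", "Nausea", "Vomiting", "Diarrhea", "GasorBloating",
    "FatigueMalaiseLethargy", "EarlySatiety", "BloodGlucose", "GI_Bleed",
    "Melena", "BRBPR", "AbdominalPain", "UpperMidBackPain",
]
LETTERS = string.ascii_uppercase[: len(SYMPTOMS)]  # "A" through "V"
LETTER_RE = re.compile(r"\(([A-Z])\)")  # matches the "(K)" part of a key or value


def parse_mapped(raw: str) -> list[int]:
    """Hypothetical helper: Mapped Symptom Extraction response -> 0/1 vector."""
    vector = [0] * len(SYMPTOMS)
    for entry in json.loads(raw).values():
        if not isinstance(entry, dict):
            continue  # skip malformed entries rather than guessing
        match = LETTER_RE.search(str(entry.get("Symptom Map", "")))
        # "(Z): None" falls outside A..V and therefore leaves the vector empty.
        if match and match.group(1) in LETTERS:
            vector[LETTERS.index(match.group(1))] = 1
    return vector


def parse_binary(raw: str) -> list[int]:
    """Hypothetical helper: Binary Symptom Encoding response -> 0/1 vector."""
    vector = [0] * len(SYMPTOMS)
    for key, value in json.loads(raw).items():
        match = LETTER_RE.search(key)
        if match and match.group(1) in LETTERS:
            vector[LETTERS.index(match.group(1))] = int(value == 1)
    return vector
```

Matching on the letter code rather than the symptom name is a design choice made for this sketch: it tolerates the spacing differences between the map entries and the example keys (for instance `Urine Color Change` vs. `UrineColorChange`).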
## Fine-tuning LLMs
We used the fine-tuning framework [Strata: human-level information extraction from clinical reports with fine-tuned language models](https://github.com/YalaLab/strata). The config files are in the config folder and contain the hyperparameters used for zero-shot inference and fine-tuning. For details on how to use the configs, please refer to [Strata](https://github.com/YalaLab/strata).

* **Cohorts.** Two PDAC cohorts from a single academic health system:

  * **Pre-diagnosis:** 112 PDAC patients (307 notes) plus 14 controls (44 notes).
  * **On-treatment:** 94 patients undergoing FOLFIRINOX; 142 notes.
* **Targets.** Clinically curated symptom lists, with prevalence tables provided in the paper’s appendices (examples include GI, neuropathy, and constitutional symptoms).
* **Evaluation.** Micro-averaged F1/precision/recall at the note-timepoint level.
* **Representative results.**

| Cohort/Prompt              | Model setup                             |        F1 | Precision | Recall |
| -------------------------- | --------------------------------------- | --------: | --------: | -----: |
| On-treatment / **Mapped**  | GPT-4o, zero-shot                       | 0.896 |     0.904 | 0.888  |
| On-treatment / **Mapped**  | Llama-3.1-8B, finetuned **adjudicated** | 0.775 |     0.898 | 0.681  |
| On-treatment / **Binary**  | GPT-4o, zero-shot                       | 0.847 |     0.858 | 0.836  |
| On-treatment / **Binary**  | Llama-3.1-8B, finetuned **adjudicated** | 0.784 |     0.821 | 0.750  |
| Pre-diagnosis / **Binary** | GPT-4o, zero-shot                       | 0.664 |     0.641 | 0.689  |
| Pre-diagnosis / **Binary** | Llama-3.1-8B, finetuned (task-specific) | 0.639 |     0.722 | 0.574  |
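For readers unfamiliar with micro-averaging, the sketch below shows one way the metrics in the table could be computed: true positives, false positives, and false negatives are pooled over every (note, symptom) cell before precision, recall, and F1 are taken. It assumes predictions and labels have already been converted to 0/1 vectors of equal length; the function name `micro_prf` is hypothetical, and the actual evaluation code may differ.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over lists of 0/1 vectors.

    y_true and y_pred hold one vector per note, with one 0/1 entry per symptom.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)  # symptom labeled and predicted
            fp += int(t == 0 and p == 1)  # predicted but not labeled
            fn += int(t == 1 and p == 0)  # labeled but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with two notes and three symptoms (not real data).
labels = [[1, 0, 1], [0, 1, 0]]
preds = [[1, 1, 0], [0, 1, 0]]
precision, recall, f1 = micro_prf(labels, preds)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because counts are pooled before averaging, frequent symptoms influence the score more than rare ones, which is worth keeping in mind when comparing cohorts with different symptom prevalence.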

---


## Adjudication loop (targeted label error discovery)
In this adjudication process, we evaluate the annotation quality of the training and validation sets through cross-validation, accruing the consensus of the fine-tuned models on the test set of each fold. We created the cross-validation splits using `annotation_adjudication\create_cross_validation_splits.py` and fine-tuned the model using the configs in `model_finetuning\configs\pancreas\on_treatment\cross_validation`. The test set of each split was evaluated with Strata, and the fine-tuned model consensus was computed with `annotation_adjudication\get_model_consensus.py`.
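The disagreement scan can be pictured as follows. This is a minimal sketch under stated assumptions, not the logic of `get_model_consensus.py`: it assumes each fold yields a mapping from note ID to a predicted 0/1 vector, that a note may appear in the test set of more than one split, and that a label is flagged when a chosen fraction of the models contradict the human annotation. The function name `flag_disagreements` and the `min_agreement` parameter are hypothetical.

```python
from collections import defaultdict


def flag_disagreements(fold_predictions, human_labels, min_agreement=1.0):
    """Return (note_id, symptom_index, human_label, opposing_fraction) tuples.

    fold_predictions: iterable of {note_id: [0/1, ...]}, one dict per fold.
    human_labels:     {note_id: [0/1, ...]} with the original annotations.
    min_agreement:    assumed cutoff; fraction of models that must contradict
                      the human label before it is sent for adjudication.
    """
    votes = defaultdict(list)
    for fold in fold_predictions:
        for note_id, vector in fold.items():
            votes[note_id].append(vector)

    flagged = []
    for note_id, vectors in votes.items():
        for idx, human in enumerate(human_labels[note_id]):
            opposing = sum(v[idx] != human for v in vectors) / len(vectors)
            if opposing >= min_agreement:
                flagged.append((note_id, idx, human, opposing))
    return flagged


# Toy example: note "n1" was tested in two splits, "n2" in one (not real data).
folds = [{"n1": [1, 0]}, {"n1": [1, 0], "n2": [0, 1]}]
labels = {"n1": [0, 0], "n2": [0, 1]}
print(flag_disagreements(folds, labels))
```

Only the flagged (note, symptom) pairs would then be returned to annotators, which keeps the adjudication effort focused on labels the models consistently dispute.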

---

## Machine-generated label augmentation

In the `data_augmentation` folder, use `llm_inference_logprob.py` to generate machine annotations on the unlabeled data. To ensure that symptoms labeled as positive have high confidence, the script processes the labels based on the token probabilities returned by the proprietary model.
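As an illustration of confidence-based filtering, the sketch below keeps a positive label only when the probability of its answer token clears a threshold. It is not the code in `llm_inference_logprob.py`: the input layout (one log probability per symptom), the function name `filter_confident_positives`, and the `min_prob` value of 0.9 are all assumptions made for this example.

```python
import math


def filter_confident_positives(labels, logprobs, min_prob=0.9):
    """Keep positive labels only when the model was confident in them.

    labels:   {symptom: 0 or 1} as parsed from the model response.
    logprobs: {symptom: log probability of the emitted answer token}
              (assumed layout; missing entries count as zero confidence).
    min_prob: assumed threshold, to be tuned on held-out data.
    """
    kept = {}
    for symptom, value in labels.items():
        # Convert the log probability back to a probability in [0, 1].
        prob = math.exp(logprobs.get(symptom, float("-inf")))
        kept[symptom] = int(value == 1 and prob >= min_prob)
    return kept


# Toy example (not real model output).
labels = {"Nausea": 1, "Vomiting": 1, "Melena": 0}
logprobs = {
    "Nausea": math.log(0.97),
    "Vomiting": math.log(0.60),
    "Melena": math.log(0.99),
}
print(filter_confident_positives(labels, logprobs))
```

In this sketch low-confidence positives are downgraded to 0; an alternative would be to drop the whole note from the augmented set, which avoids introducing likely false negatives.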

---

