# ClinTrack

## Introduction
ClinTrack is used for clinical data processing and medical model training. It currently supports the MIMIC-IV database.

## Installation Tutorial

1.  Clone the repository
    ```bash
    git clone https://repo-to-be-open-sourced/clin-track.git
    cd clin-track
    ```
2.  Install dependencies
    ```bash
    # Use the mimic environment for exporting data from the original MIMIC-IV database
    conda env create -f ./conda_envs/env_mimic.yml

    # Use the ms environment for model inference and training
    conda env create -f ./conda_envs/env_ms.yml

    # Use the qwen3 environment for inference only
    conda env create -f ./conda_envs/env_qwen3.yml
    ```

## Data Preparation
1.  Download the MIMIC-IV database and import it into PostgreSQL (if you have already downloaded the `json.zip` file, unzip it to `data/json` and skip steps 1-3).
    -   Refer to the [Official MIMIC-IV Documentation](https://mimic.mit.edu/docs/iv/)
    -   You need to import both MIMIC-IV and MIMIC-IV-notes.
    -   This project uses MIMIC-IV version 3.0 and MIMIC-IV-notes version 2.2.
2.  Modify the parameters of `sqla.create_engine` in `pipeline/psql2patients.py` to connect to your PostgreSQL database. The exported data will be stored in the `data/json` directory. You can adjust the `process_map` parameter in the script according to your configuration.
3.  Run the data export script.
    ```bash
    conda activate mimic
    python pipeline/psql2patients.py
    ```
4.  Modify the environment variables, storage path (`save_path`), and `batch_size` in `pipeline/disch_parse.py`. `batch_size` controls the amount of data loaded into memory and sent to vLLM for processing at one time; it does not represent the degree of parallelism and does not affect VRAM usage. The `MAX_WORKERS` environment variable is used for the number of parallel requests in API mode. Other unlisted environment variables include:
    -   `TENSOR_PARALLEL_SIZE=4`
    -   `MAX_TOKENS=8192`
    -   `TEMPERATURE=0.0`
    -   `INPUT_MAX_CHAR_LENGTH=30000`: This is the maximum character length for the original structured text during generation in the subsequent `description_gen.py`, and is not related to other scripts.
5.  Run the discharge summary extraction script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/disch_parse.py
    ```
6.  Modify the environment variables, load path (`load_path`), storage path (`save_path`), and `batch_size` in `pipeline/radio_parse.py`.
7.  Run the radiology report type extraction script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/radio_parse.py
    ```

## SFT Training Data Preparation
1.  Modify the load path (`load_path`), storage path (`save_path`), `batch_size`, `max_length` (to control the maximum text length), and `max_workers` in `pipeline/json2envdata.py`.
2.  Run the training data preparation script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/json2envdata.py
    ```

## CoT Data Preparation
Note: The preparation scripts require an inference service. Start the vLLM service and then use the following code for API-based inference.
1.  Modify the load path (`load_path`), storage path (`save_path`), `batch_size`, and `client` (to specify the API parameters to use) in `pipeline/trainset2cot.py`.
2.  Run the CoT data preparation script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/trainset2cot.py
    ```
3.  Modify the load path (`load_path`), storage path (`save_path`), `batch_size`, and `client` (to specify the API parameters to use) in `pipeline/cot_refine.py`.
4.  Run the CoT data refinement script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/cot_refine.py
    ```

## CoT Data Filtering
1.  Modify the load path (`load_path`), and storage path (`save_path`) in `pipeline/check.py`.
2.  Run the CoT data filtering script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/check.py
    ```

## SFT Training
1.  Modify the data path (`data_path`), model load path (`model_load_path`), model save path (`model_save_path`), and various training parameters in `pipeline/train_sft.py`.
2.  Run the SFT training script, using the ms environment as an example.
    ```bash
    conda activate ms
    accelerate launch pipeline/train_sft.py
    ```
## Reinforcement Learning Data Preparation
1.  Modify the load path (`load_path`), storage path (`save_path`), `tokenizer_path` (to specify the tokenizer to use), and `max_workers` (to specify the number of parallel processes) in `pipeline/trainset_sft2grpo.py`.
2.  Run the reinforcement learning data preparation script, using the ms environment as an example.
    ```bash
    conda activate ms
    python pipeline/trainset_sft2grpo.py
    ```
## Reinforcement Learning Training
1.  Modify the data path (`data_path`), model load path (`model_load_path`), model save path (`model_save_path`), and various training parameters in `pipeline/train_grpo.py`. If you have sufficient VRAM, it is recommended to set `gradient_checkpointing=False`.
2.  Run the vLLM service for rollout inference. It is recommended to use 1/4 of the GPUs for inference and 3/4 for training. For example, with 8 GPUs:
    ```bash
    conda activate ms
    CUDA_VISIBLE_DEVICES=6,7 trl vllm-serve --model path/to/model --tensor-parallel-size 2 --max-model-len 16384
    ```
3.  Run the reinforcement learning training script, using the ms environment as an example.
    ```bash
    conda activate ms
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 accelerate launch pipeline/train_grpo.py
    ```
## Model Testing
Note: Model testing requires an inference service. Start the vLLM service and then use the following code for API-based inference.
1.  Modify the load path (`load_path`), storage path (`save_path`), and `client` (to specify the API parameters to use) in `pipeline/env_test.py`.
2.  Run the model inference script, using the qwen3 environment as an example.
    ```bash
    conda activate qwen3
    python pipeline/env_test.py
    ```