# 🚀 Get Started

**Introduction**

This guide will walk you through setting up the environment, installing necessary dependencies, configuring a Ray cluster, and setting up experiment parameters to run this code.

---

<details open>
<summary><strong>🛠️ Installation</strong></summary>

Follow these steps to prepare your environment and install the required packages.

We provide a `Dockerfile` in the `docker/` directory for containerized setup. Alternatively, you can configure your environment using Conda as described below.

1.  **Create and Activate Conda Environment:**
    We recommend using Python 3.12.
    ```bash
    conda create -n xxxxx python=3.12
    conda activate xxxxx
    ```

2.  **Install PyTorch (CUDA 12.4):**
    This project is optimized for PyTorch 2.6.0 with CUDA 12.4 support.
    ```bash
    pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
    ```

3.  **Install FlashAttention:**
    FlashAttention (version 2.7.3) is used for efficient attention mechanisms. Ensure `ninja` is correctly installed.
    ```bash
    pip uninstall -y ninja && pip install ninja
    pip install flash-attn==2.7.3 --no-build-isolation
    ```

4.  **Clone and Install the Repository:**
    Clone the repository and install it in editable mode.
    ```bash
    git clone xxxxxxxxxxxxxxxxxxxxxxxx
    cd xxxxxxxxxxxxxxxxxxxxxxxx
    pip install -e .
    ```
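
After installation, a quick check can confirm that the pinned PyTorch build is the one actually in use. The snippet below is a minimal sketch and not part of this repository: `version_matches` is a hypothetical helper, and the check assumes `python` resolves to the interpreter of the Conda environment created above.

```bash
# Hypothetical helper: compare an installed version string with the pinned one.
# PyTorch wheels report versions such as "2.6.0+cu124", so the local suffix
# after "+" is stripped before comparing.
version_matches() {
    [ "${1%%+*}" = "$2" ]
}

# Only query torch when it is importable, so the check fails softly.
if python -c "import torch" 2>/dev/null; then
    torch_version="$(python -c 'import torch; print(torch.__version__)')"
    if version_matches "$torch_version" "2.6.0"; then
        echo "torch ${torch_version}: OK"
    else
        echo "torch ${torch_version}: expected 2.6.0" >&2
    fi
    python -c 'import torch; print("CUDA available:", torch.cuda.is_available())'
else
    echo "torch is not importable in this environment" >&2
fi
```

If the version check passes but CUDA is reported as unavailable, the likely causes are a driver mismatch or an inactive environment rather than the pip install itself.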

</details>

---

<details>
<summary><strong>🌐 Ray Cluster Setup</strong></summary>

For distributed training, set up a Ray cluster. Here's an example for a 2-node cluster, each with 8 GPUs.

1.  **Start the Head Node:**
    Run this command on your designated head node. The dashboard will be accessible via `http://<head_node_ip>:8265`.
    ```bash
    ray start --head --dashboard-host=0.0.0.0
    ```
    Note down the address provided (e.g., `xxxxxx:6379`).

2.  **Start Worker Node(s):**
    Run this command on each worker node, replacing `xxxxxx:6379` with the address from the head node.
    ```bash
    ray start --address=xxxxxx:6379
    ```

3.  **Verify Cluster Status:**
    On the head node, run `ray status` to confirm that all nodes have joined and all GPUs (16 in this example) are detected.
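
The expected GPU total is the node count multiplied by the GPUs per node. A minimal sketch for the 2-node example follows; `EXPECTED_GPUS` is an illustrative variable name, and the `ray status` call is skipped when the `ray` CLI is not on `PATH`.

```bash
# Values for the 2-node, 8-GPU example; adjust to your cluster.
NUM_NODES=2
GPUS_PER_NODE=8

# Total GPUs that 'ray status' should report once every node has joined.
EXPECTED_GPUS=$((NUM_NODES * GPUS_PER_NODE))
echo "Expecting ${EXPECTED_GPUS} GPUs in the 'ray status' output"

# Print the live cluster summary for a manual comparison.
if command -v ray >/dev/null 2>&1; then
    ray status || echo "ray status failed; is the head node running?" >&2
fi
```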

</details>

---

<details>
<summary><strong>🏆 Reward Server</strong></summary>


This section describes how to launch a remote reward server, which computes reward values during training.

To start the reward server, execute the following command:

```bash
bash scripts/reward_server.sh
```

This script accepts several configurable parameters:

* `DET_IOU_THRESHOLD`: Defines the threshold strategy for the Intersection over Union (IoU) reward.
* `PORT`: Specifies the network port on which the reward server listens for incoming requests.
* `WORKERS`: Sets the number of worker processes for the server.

Upon successful execution, the script launches a FastAPI service and creates a file named after a unique `JOB_ID` in the `.reward_server/` directory of your project. This file contains the IP address and port of the running reward server (e.g., `your_server_ip:8192`).

**Important**: Take note of this **JOB_ID**, as it will be required for configuring the training process later.
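
Because each `JOB_ID` file holds the server address, the running servers can be listed straight from the `.reward_server/` directory. The helper below is a hypothetical sketch, not a script shipped with the project; it assumes each file contains a single `IP:PORT` line, as in the example above.

```bash
# Hypothetical helper: print one "<JOB_ID> <address>" line per reward server.
# The directory defaults to .reward_server in the current working directory.
list_reward_servers() {
    local dir="${1:-.reward_server}"
    local f
    for f in "$dir"/*; do
        # Skip the unmatched glob and anything that is not a regular file.
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Run from the project root after starting the server.
list_reward_servers
```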

</details>

---

<details>
<summary><strong>⚙️ Experiment Parameters</strong></summary>

For a comprehensive list of all configurable parameters and hyperparameters, please refer to the `scripts/train.sh` file.

Before running experiments, configure the environment variables to match your Ray cluster setup.

* **Set Node and GPU Counts:**
    Adjust these values based on your actual cluster configuration (e.g., for 2 nodes and 8 GPUs per node):
    ```bash
    export NUM_NODES=2
    export GPUS_PER_NODE=8
    ```
* **Configure Reward Server Job ID:**
    Set `REMOTE_REWARD_JOB_ID` to the identifier(s) of your previously launched reward server(s). This enables the training pipeline to locate the reward server's address.
    ```bash
    export REMOTE_REWARD_JOB_ID="j-xxxxxxxxxx"
    ```
    If you are using multiple reward servers, separate their `JOB_ID`s with a `|` (pipe) delimiter:
    ```bash
    export REMOTE_REWARD_JOB_ID="j-xxxxxxxxxx|j-yyyyyyyyyy|j-zzzzzzzzzz"
    ```
* **Set Training and Online Test Data Files:**
    Based on the dataset you downloaded, specify the paths for your training and online test files. The format is a string containing a comma-separated list of file paths, enclosed in brackets.
    For example:
    ```bash
    export DATA_TRAIN_FILE="[/path/to/your/data]"
    export DATA_TEST_FILE="[/path/to/your/data]"
    ```
    **Note:** Replace `[/path/to/your/data]` with the actual absolute or relative paths to your dataset files based on where you downloaded and stored the `Anonymous-Data-47k` directory.
* **Model Loading and Checkpointing:**
    Configure paths for loading initial model weights and saving training states, along with the save frequency.
    * `ACTOR_LOAD_PATH`: Path to the initial model checkpoint to load.
    * `TRAIN_SAVE_FREQ`: How often to save the training state, in steps (e.g., `5` saves every 5 steps; `-1` disables saving).
    * `TRAIN_SAVE_PATH`: Directory where training checkpoints will be stored.
    
    For example:
    ```bash
    export ACTOR_LOAD_PATH="xxxxx"
    export TRAIN_SAVE_FREQ="5"
    export TRAIN_SAVE_PATH="xxxxx"
    ```
* **PPO Training Parameters:**
    Define batch sizes for PPO updates and related calculations. Each variable falls back to the default listed below when it is not set.
    * `ACTOR_PPO_GLOBAL_BSZ`: Total number of samples used for a single PPO update across all workers. Defaults to `1024`.
    * `ACTOR_PPO_MICRO_BSZ`: Number of samples in each micro-batch for the PPO loss calculation. This helps manage memory usage, especially with large models, to prevent Out-of-Memory (OOM) errors. Defaults to `16`.
    * `LOG_P_MICRO_BSZ`: Micro-batch size specifically for log-probability calculations. Defaults to `32`.
    
    For example:
    ```bash
    export ACTOR_PPO_GLOBAL_BSZ="1024"
    export ACTOR_PPO_MICRO_BSZ="16"
    export LOG_P_MICRO_BSZ="32"
    ```
* **Data Configuration:**
    Set parameters related to data batching for training and testing, and the maximum length for generated responses.
    * `DATA_TRAIN_BATCH_SIZE`: Number of prompts for which responses are generated in each training batch.
    * `DATA_TEST_BATCH_SIZE`: Number of samples to evaluate at once during testing (across all specified `DATA_TEST_FILE`s).
    * `DATA_MAX_RES_LENGTH`: Maximum token length allowed for each generated response.

    For example:
    ```bash
    export DATA_TRAIN_BATCH_SIZE="1024"
    export DATA_TEST_BATCH_SIZE="4096"
    export DATA_MAX_RES_LENGTH="2048"
    ```
* **Learning Rate and Optimization:**
    Configure the learning rate for the actor model and specify which parts of the model should have their weights frozen during training.
    * `ACTOR_LR`: Sets the learning rate for the actor model.
    * `ACTOR_LR_FREEZE`: Specifies the model components whose weights are frozen (not updated) during training. Use an empty string to freeze nothing, or a bracketed list of component names such as `"[vit,connector]"`. Whether a quoted form such as `['vit', 'connector', 'llm']` is also accepted depends on how the script parses the value.
    
    For example:
    ```bash
    export ACTOR_LR="1e-6"
    export ACTOR_LR_FREEZE="[vit,connector]"
    ```
* **Rollout Configuration:**
    Set parameters that control the generation of rollout sequences by the actor model.
    * `ROLLOUT_N`: Specifies the number of distinct rollout sequences to generate per query.
    * `ROLLOUT_TEMP`: Controls the temperature during generation.
    * `ROLLOUT_TP_SIZE`: Defines the tensor parallelism size for the rollout model.

    For example:
    ```bash
    export ROLLOUT_N="8"
    export ROLLOUT_TEMP="1.0"
    export ROLLOUT_TP_SIZE="1"
    ```
* **Evaluation Configuration:**
    Configure parameters for the evaluation process. These settings include default values that are used if the variables are not explicitly set.
    * `EVAL_BEFORE_TRAIN`: Determines whether to run an evaluation cycle before the training process begins.
    * `EVAL_DO_SAMPLE`: Specifies if sampling should be used during evaluation generation.
    * `EVAL_TEMP`: Sets the temperature for generation during evaluation.
    * `EVAL_TOPP`: Defines the top-p sampling parameter for evaluation.
 
    For example:
    ```bash
    export EVAL_BEFORE_TRAIN=True
    export EVAL_DO_SAMPLE=False
    export EVAL_TEMP=0
    export EVAL_TOPP=1
    ```
* **Training Run Configuration:**
    General settings for managing the training process, including project identification, evaluation scheduling, training duration, and experiment tracking with Weights & Biases.
    * `TRAIN_PROJECT_NAME`: Specifies the project name.
    * `TRAIN_TEST_FREQ`: Defines how often, in steps, evaluation runs on the test set during training.
    * `TRAIN_TOTAL_EPOCHS`: The total number of epochs the training process will run for.
    * `WANDB_API_KEY`: Your personal Weights & Biases API key. This is required if you intend to log results and metrics to W&B. **Remember to replace `"your wandb api key"` with your actual key.**
 
    For example:
    ```bash
    export TRAIN_PROJECT_NAME="xxxxx"
    export TRAIN_TEST_FREQ="5"
    export TRAIN_TOTAL_EPOCHS="3"
    export WANDB_API_KEY="your wandb api key"
    ```
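
Before launching training, a short pre-flight check can catch misconfigured variables early. The sketch below is hypothetical and not part of `scripts/train.sh`: `check_reward_jobs` splits `REMOTE_REWARD_JOB_ID` on `|` and looks for each ID under `.reward_server/`, and `check_multiple` encodes an assumption (not confirmed by this guide) that the global PPO batch size should be a whole multiple of the micro-batch size.

```bash
# Hypothetical helper: verify that every JOB_ID in a '|'-separated list has a
# matching file in the reward-server directory (default: .reward_server).
check_reward_jobs() {
    local rest="$1"
    local dir="${2:-.reward_server}"
    local id
    local missing=0
    while [ -n "$rest" ]; do
        id="${rest%%|*}"                 # text before the first '|'
        if [ -f "$dir/$id" ]; then
            echo "$id -> $(cat "$dir/$id")"
        else
            echo "missing reward server file: $dir/$id" >&2
            missing=1
        fi
        case "$rest" in
            *'|'*) rest="${rest#*|}" ;;  # drop the ID just processed
            *)     rest="" ;;            # that was the last ID
        esac
    done
    return "$missing"
}

# Assumption, not taken from the training script: the total should be a whole
# multiple of the chunk size.
check_multiple() {
    local name="$1" total="$2" chunk="$3"
    if [ $((total % chunk)) -ne 0 ]; then
        echo "$name: $total is not a multiple of $chunk" >&2
        return 1
    fi
}

# Fall back to the documented defaults when the variables are unset.
check_multiple ACTOR_PPO_GLOBAL_BSZ \
    "${ACTOR_PPO_GLOBAL_BSZ:-1024}" "${ACTOR_PPO_MICRO_BSZ:-16}" \
    || echo "batch size check failed" >&2

if [ -n "${REMOTE_REWARD_JOB_ID:-}" ]; then
    check_reward_jobs "$REMOTE_REWARD_JOB_ID" \
        || echo "some reward servers were not found" >&2
fi
```

The reward-server check relies only on the `JOB_ID` file convention described in the Reward Server section, so it should be run from the project root where `.reward_server/` lives.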
</details>

---
