Sandbox Fusion Example
======================

Last updated: 06/27/2025.

Introduction
------------

Sandbox Fusion is a remote code sandbox service that provides a secure environment for running and evaluating code generated by Large Language Models (LLMs). This example shows how to train an LLM while using Sandbox Fusion to verify the generated code, so that untrusted code runs in the sandbox and verification during the reward stage is faster.

By leveraging a remote code sandbox service with greater CPU resources for concurrent code verification, you can reduce the reward stage time by 10-30%, depending on the quality of the generated code.

Step 1: Prepare the Dataset
---------------------------

We use the Eurus-2-RL-Data dataset for training. This dataset combines math and code questions, making it suitable for LLM training tasks. You can download it from HuggingFace: `Eurus-2-RL-Data Dataset XXXX`_.

Step 2: Set Up the Sandbox Fusion Service
-----------------------------------------

To set up the Sandbox Fusion service:

1. **Access Full Documentation**: For detailed setup instructions, refer to the `Sandbox Fusion Documentation XXXX`_.
2. **Deploy the Service**: Choose one of the following deployment methods:

   - **Local Deployment**: Follow the guide `here XXXX`_.
   - **FaaS Instance (Volcengine)**: Create an instance using the `Volcengine Documentation XXXX`_.

After deployment, you will receive an API endpoint in the format: ``https://<ip-address-or-domain-name>/run_code``.
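As a quick check that the endpoint is reachable, the following minimal sketch builds and sends a request to it. It assumes the service accepts a JSON body with ``code`` and ``language`` fields and returns JSON; the helper names ``build_run_code_request`` and ``run_code`` are hypothetical, so confirm the request and response format against the Sandbox Fusion documentation.

```python
import json
import urllib.request


def build_run_code_request(endpoint: str, code: str, language: str = "python") -> urllib.request.Request:
    """Build (but do not send) a POST request for the sandbox endpoint."""
    if not endpoint.endswith("/run_code"):
        raise ValueError("endpoint is expected to end with /run_code")
    # Assumed payload shape: {"code": ..., "language": ...}.
    payload = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_code(endpoint: str, code: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_run_code_request(endpoint, code)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling ``run_code("https://<ip-address-or-domain-name>/run_code", "print('hello')")`` with your own endpoint should return the service's JSON result if the deployment is healthy.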

Step 3: Configure the Training Script
-------------------------------------

To integrate Sandbox Fusion into your training script, configure the following parameters:

**Key Settings for Sandbox Fusion**

- ``reward_model.sandbox_fusion.url='<API-endpoint>'``: Enable Sandbox Fusion by specifying the API endpoint (must end with ``/run_code``).
- ``reward_model.sandbox_fusion.max_concurrent=256``: Set the maximum number of concurrent API requests to the Sandbox Fusion service.
- ``reward_model.sandbox_fusion.memory_limit_mb=1024``: Set the memory limit (in MB) for each sandbox instance. Defaults to 1024 MB if not specified.
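To illustrate what a cap such as ``max_concurrent`` means in practice, here is a minimal sketch of bounded concurrent verification. It is not the trainer's actual implementation: ``verify_all`` and the ``verify_one`` callback are hypothetical names, and a thread pool is only one plausible way to limit in-flight requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def verify_all(
    snippets: List[str],
    verify_one: Callable[[str], bool],
    max_concurrent: int = 256,
) -> List[bool]:
    """Verify snippets with at most max_concurrent calls in flight.

    Results are returned in the same order as the input snippets.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # Never start more workers than there are snippets to check.
    workers = min(max_concurrent, max(1, len(snippets)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(verify_one, snippets))
```

Here ``verify_one`` would typically wrap a single request to the sandbox endpoint; a larger cap shortens the reward stage only as long as the service has capacity to handle that many requests at once.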

**Additional Optimization**

To further reduce code verification time, enable parallel processing with:

- ``reward_model.reward_manager=prime``: The Prime reward manager verifies code across multiple subprocesses concurrently.
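The settings listed in this step can be assembled and sanity-checked before launching a job. The sketch below only builds the override strings exactly as written above; the helper name ``sandbox_fusion_overrides`` is hypothetical, and how the strings are passed to the trainer depends on your launch script.

```python
from typing import List


def sandbox_fusion_overrides(
    url: str,
    max_concurrent: int = 256,
    memory_limit_mb: int = 1024,
    use_prime: bool = True,
) -> List[str]:
    """Return command-line overrides for the Sandbox Fusion settings."""
    if not url.endswith("/run_code"):
        raise ValueError("the API endpoint must end with /run_code")
    if max_concurrent < 1 or memory_limit_mb < 1:
        raise ValueError("limits must be positive integers")
    overrides = [
        f"reward_model.sandbox_fusion.url='{url}'",
        f"reward_model.sandbox_fusion.max_concurrent={max_concurrent}",
        f"reward_model.sandbox_fusion.memory_limit_mb={memory_limit_mb}",
    ]
    if use_prime:
        # Optional: verify code across multiple subprocesses.
        overrides.append("reward_model.reward_manager=prime")
    return overrides
```

Printing the returned list, one entry per line, gives text that can be compared against or pasted into the training script's argument list.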

**Example Script**

For a practical implementation, refer to the example script:

``examples/ppo_trainer/run_deepseek7b_llm_sandbox_fusion.sh``

Once you’ve set your API endpoint in the script, you can start the training job.