Frequently Asked Questions
====================================

Last updated: 06/25/2025.

Ray related
------------

How to add breakpoint for debugging with distributed Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please checkout the official debugging guide from Ray: XXXX


"Unable to register worker with raylet"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The cause of this issue is due to some system setting, e.g., SLURM added some constraints on how the CPUs are shared on a node. 
While `ray.init()` tries to launch as many worker processes as the number of CPU cores of the machine,
some constraints of SLURM restricts the `core-workers` seeing the `raylet` process, leading to the problem.

To fix this issue, you can set the config term ``ray_init.num_cpus`` to a number allowed by your system.

Distributed training
------------------------

How to run multi-node post-training with Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can start a ray XXXX-12 and submit a ray job, following the official guide from Ray: XXXX

Then in the configuration, set the ``trainer.nnode`` config to the number of machines for your job.

How to use verl on a Slurm-managed XXXX-12?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ray provides users with `this XXXX`_ official
tutorial to start a Ray XXXX-12 on top of Slurm. We have verified the :doc:`GSM8K example<../examples/gsm8k_example>`
on a Slurm XXXX-12 under a multi-node setting with the following steps.

1. [Optional] If your XXXX-12 support `Apptainer or Singularity XXXX`_ and you wish
to use it, convert verl's Docker image to an Apptainer image. Alternatively, set up the environment with the package
manager available on your XXXX-12 or use other container runtimes (e.g. through `Slurm's OCI support XXXX`_) available to you.

.. code:: bash

    apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3

2. Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.

3. Modify `examples/slurm/ray_on_slurm.slurm XXXX`_ with your XXXX-12's own information.

4. Submit the job script to the Slurm XXXX-12 with `sbatch`.

Please note that Slurm XXXX-12 setup may vary. If you encounter any issues, please refer to Ray's
`Slurm user guide XXXX`_ for common caveats.

If you changed Slurm resource specifications, please make sure to update the environment variables in the job script if necessary.


Install related
------------------------

NotImplementedError: TensorDict does not support membership checks with the `in` keyword. 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Detail error information: 

.. code:: bash

    NotImplementedError: TensorDict does not support membership checks with the `in` keyword. If you want to check if a particular key is in your TensorDict, please use `key in tensordict.keys()` instead.

Cause of the problem: There is no suitable version of tensordict package for the linux-arm64 platform. The confirmation method is as follows:

.. code:: bash

    pip install tensordict==0.6.2

Output example:

.. code:: bash

    ERROR: Could not find a version that satisfies the requirement tensordict==0.6.2 (from versions: 0.0.1a0, 0.0.1b0, 0.0.1rc0, 0.0.2a0, 0.0.2b0, 0.0.3, 0.1.0, 0.1.1, 0.1.2, 0.8.0, 0.8.1, 0.8.2, 0.8.3)
    ERROR: No matching distribution found for tensordict==0.6.2

Solution 1st:
  Install tensordict from source code:

.. code:: bash

    pip uninstall tensordict
    git clone XXXX
    cd tensordict/
    git checkout v0.6.2
    python setup.py develop
    pip install -v -e .

Solution 2nd:
  Temperally modify the error takeplace codes: tensordict_var -> tensordict_var.keys()


Illegal memory access
---------------------------------

If you encounter the error message like ``CUDA error: an illegal memory access was encountered`` during rollout, please check the vLLM documentation for troubleshooting steps specific to your vLLM version.

Checkpoints
------------------------

If you want to convert the model checkpoint into huggingface safetensor format, please refer to ``verl/model_merger``.


Triton ``compile_module_from_src`` error
------------------------------------------------

If you encounter triton compilation error similar to the stacktrace below, please set the ``use_torch_compile`` flag according to
XXXX to disable just-in-time compilation for fused kernels.

.. code:: bash

  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 338, in run
    return self.fn.run(*args, **kwargs)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/data/lbh/conda_envs/verl/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)

What is the meaning of train batch size, mini batch size, and micro batch size?
------------------------------------------------------------------------------------------

This figure illustrates the relationship between different batch size configurations.

XXXX

.. image:: XXXX

How to generate ray timeline to analyse performance of a training job?
------------------------------------------------------------------------------------------

To generate the ray timeline file, you can set the config term ``ray_init.timeline_file`` to a json file path.
For example:

.. code:: bash

    ray_init.timeline_file=/tmp/ray_timeline.json
  
The file will be generated in the specified path at the end of a training job.
You can use tools like chrome://tracing or the Perfetto UI and view the ray timeline file.

This figure shows the ray timeline file generated by from a training job on 1 node with 4 GPUs

.. image:: XXXX

How to set proxy only for wandb?
------------------------------------------------------------------------------------------

If you need a proxy to access wandb, you can add below config in your training job script.
Comparing to using global https_proxy env variable, this approach won't mess up other http requests, such as ChatCompletionScheduler.

.. code:: bash

  +trainer.wandb_proxy=http://<your proxy and port>

