Model Placement and Colocation
===============================

SkyRL provides flexible control over how to distribute models across available GPU resources. You can either colocate models on the same GPUs or disaggregate them across separate GPUs, depending on your setup and requirements.

Model Components Overview
-------------------------

A typical PPO training workflow involves five model-based components:

- **Inference engines** (handle text generation)
- **Policy model** (learns actions to take)
- **Reference model** (tracks the original policy)
- **Reward model** (optional; scores action quality)
- **Critic model** (estimates future rewards)

*Note: GRPO training generally uses the first two to four components, depending on the setup; no critic model is needed.*

Inference Engine Management
----------------------------

The ``generator.run_engines_locally`` argument controls who manages the inference engines.

If ``run_engines_locally=true``, SkyRL launches the inference engines during the training run and manages them.

If ``run_engines_locally=false``, the user supplies the URLs of externally managed inference engines through the ``generator.remote_inference_engine_urls`` parameter and is responsible for their setup and teardown. SkyRL expects these engines to expose additional endpoints for weight syncing. For convenience, scripts for launching remote inference engines are provided `here <https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/remote_inference_engine>`_.
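
As a minimal sketch, a configuration that points SkyRL at externally managed engines might look like the following. The nesting is inferred from the dotted parameter names above, and the addresses and their ``host:port`` format are hypothetical placeholders; substitute the URLs of the engines you launched.

.. code-block:: yaml

    generator:
      # Engines are launched and torn down by the user, not by SkyRL
      run_engines_locally: false
      # Hypothetical placeholder addresses of the externally managed engines
      remote_inference_engine_urls:
        - "127.0.0.1:8001"
        - "127.0.0.1:8002"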


Inference Engine Placement
--------------------------

The ``generator.colocate_all`` setting controls inference engine placement.

**Colocated Engines (colocate_all = true)**

Inference engines share GPUs with training models:

- Generation runs on the same hardware as training
- Engines will ``sleep()`` after generation to free GPU memory
- Engines will ``wake_up()`` before the next generation round

.. note::
   Colocated engines are currently supported only with ``generator.run_engines_locally=true``.
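
A colocated setup could be sketched as follows. The key names are taken from the text above and the nesting is inferred from their dotted form, so verify both against your own configuration file.

.. code-block:: yaml

    generator:
      # Colocated engines require engines launched and managed by SkyRL
      run_engines_locally: true
      # Inference engines share GPUs with the training models
      colocate_all: true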

**Disaggregated Engines (colocate_all = false)**

Inference engines run on dedicated GPUs:

- Inference engines do not need to ``sleep()`` or ``wake_up()``
- Updated weights are still efficiently synced to the inference engines (via NCCL, RDMA, etc.)

Training Model Placement
------------------------

The highest-level placement configuration for the training models is ``trainer.placement.colocate_all``:


**Full Colocation (colocate_all = true)**

All training models (policy, critic, reward, reference) share the same GPUs.

**Granular Control (colocate_all = false)**

The policy and critic models are not colocated, but fine-grained placement of the reference and reward models can be controlled with two additional parameters:

- ``trainer.placement.colocate_policy_ref``: Colocate policy and reference models (``true``) or place them on separate GPUs (``false``)
- ``trainer.placement.colocate_critic_reward``: Colocate critic and reward models (``true``) or place them on separate GPUs (``false``)
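
For illustration, the sketch below disables full colocation, keeps the policy and reference models together, and places the critic and reward models on separate GPUs. The nesting is inferred from the dotted parameter names above, and the chosen values are only an example, not recommended defaults.

.. code-block:: yaml

    trainer:
      placement:
        # Policy and critic models are not colocated
        colocate_all: false
        # Policy and reference models share GPUs
        colocate_policy_ref: true
        # Critic and reward models use separate GPUs
        colocate_critic_reward: false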

Hardware Configuration
----------------------

Finally, node and GPU counts for each model are configured as follows (default values shown):

.. code-block:: yaml

    trainer:
      # Training model resources
      policy_num_nodes: 1
      policy_num_gpus_per_node: 4
      critic_num_nodes: 1
      critic_num_gpus_per_node: 4
      ref_num_nodes: 1
      ref_num_gpus_per_node: 4
      reward_num_nodes: 1
      reward_num_gpus_per_node: 4

    generator:
      # Inference engine resources
      num_inference_engines: 1
      inference_engine_tensor_parallel_size: 4
      inference_engine_expert_parallel_size: 1
      inference_engine_data_parallel_size: 1

.. note::
   **Resource Allocation Guidelines**
   
   - When ``trainer.placement.colocate_all=true``, all training models should have identical node and GPU counts.
   - When ``generator.run_engines_locally=true``, the total number of GPUs used for inference engines should match the total number of GPUs used for training models.
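
As a worked example of the second guideline, the sketch below sizes an 8-GPU setup. It assumes that the inference GPU total equals ``num_inference_engines`` multiplied by ``inference_engine_tensor_parallel_size``, with the data parallel size left at 1. That formula is an assumption and is not stated above, so confirm it for your setup.

.. code-block:: yaml

    trainer:
      policy_num_nodes: 1
      # 1 node x 8 GPUs = 8 training GPUs
      policy_num_gpus_per_node: 8

    generator:
      run_engines_locally: true
      num_inference_engines: 2
      # 2 engines x 4 GPUs = 8 inference GPUs (assumed formula)
      inference_engine_tensor_parallel_size: 4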
