# DeepSpeed Configuration
## Batch
- `train_batch_size`: The global batch size for training.
    - `train_batch_size` = `train_micro_batch_size_per_gpu` * `gradient_accumulation_steps` * number of GPUs
- `train_micro_batch_size_per_gpu`: The batch size each GPU processes in a single forward/backward pass.
- `gradient_accumulation_steps`: The number of micro-batches whose gradients are accumulated before the weights are updated.
    - By increasing `gradient_accumulation_steps`, you can update the weights with a larger effective batch size without increasing per-GPU memory usage.
    - `gradient_accumulation_steps` can be omitted if you provide both `train_batch_size` and `train_micro_batch_size_per_gpu`; DeepSpeed infers it from them.
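
The relationship between the three batch settings can be checked with a few lines of Python. This is a minimal sketch, not DeepSpeed's own code: the helper `resolve_batch_config` and its `num_gpus` argument are hypothetical names chosen for illustration.

```python
def resolve_batch_config(num_gpus, train_batch_size=None,
                         micro_batch_size=None, grad_accum_steps=None):
    """Fill in the one missing batch setting (hypothetical helper)."""
    values = (train_batch_size, micro_batch_size, grad_accum_steps)
    if sum(v is not None for v in values) < 2:
        raise ValueError("provide at least two of the three batch settings")

    if train_batch_size is None:
        train_batch_size = micro_batch_size * grad_accum_steps * num_gpus
    elif grad_accum_steps is None:
        samples_per_step = micro_batch_size * num_gpus
        if train_batch_size % samples_per_step != 0:
            raise ValueError("train_batch_size is not divisible by "
                             "micro_batch_size * num_gpus")
        grad_accum_steps = train_batch_size // samples_per_step
    elif micro_batch_size is None:
        divisor = grad_accum_steps * num_gpus
        if train_batch_size % divisor != 0:
            raise ValueError("train_batch_size is not divisible by "
                             "grad_accum_steps * num_gpus")
        micro_batch_size = train_batch_size // divisor
    elif train_batch_size != micro_batch_size * grad_accum_steps * num_gpus:
        raise ValueError("the three batch settings are inconsistent")

    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
    }


# 8 GPUs, global batch of 256, 4 samples per GPU per pass
batch_config = resolve_batch_config(num_gpus=8, train_batch_size=256,
                                    micro_batch_size=4)
print(batch_config)  # gradient_accumulation_steps resolves to 8
```

The returned dictionary uses the same key names as the configuration file, so it can be merged into a config directly.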

**Notice**
1. Make sure the number of tokens in a global batch (`train_batch_size` * `max_seq_length`) is large enough to train the model effectively.
2. If the model has billions of parameters, a global batch should contain at least millions of tokens.
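
As a quick sanity check of the token count, the arithmetic can be written out as follows. The function names and the example numbers are illustrative assumptions, not recommended values.

```python
def tokens_per_global_batch(train_batch_size, max_seq_length):
    """Number of tokens consumed by one weight update."""
    return train_batch_size * max_seq_length


def min_batch_size_for_tokens(target_tokens, max_seq_length):
    """Smallest train_batch_size that reaches target_tokens (ceiling division)."""
    return -(-target_tokens // max_seq_length)


tokens = tokens_per_global_batch(train_batch_size=1024, max_seq_length=2048)
print(tokens)  # 2097152, i.e. about 2M tokens per global batch

needed = min_batch_size_for_tokens(target_tokens=1_000_000, max_seq_length=2048)
print(needed)  # 489
```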

## Optimizer
- `type`: The type of optimizer.
- `params`: The parameters of the optimizer.

**Notice**
1. If you use the configurations in the `training` folder, set `total_num_steps` and `warmup_num_steps` in the `scheduler` section according to your data.
2. See [DeepSpeed Configuration - Optimizer](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters) for more details.
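
For orientation, an `optimizer` section might look like the sketch below, written as a Python dictionary and dumped to JSON. The optimizer type and every hyperparameter value here are illustrative assumptions, not settings taken from this repository.

```python
import json

# Illustrative values only; tune them for your own model and data.
optimizer_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 0.1,
        },
    }
}

print(json.dumps(optimizer_config, indent=2))
```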

## Scheduler
- `type`: The type of scheduler.
- `params`: The parameters of the scheduler.

**Notice**
1. Set `total_num_steps`, `warmup_min_lr`, `warmup_max_lr`, and `warmup_num_steps` according to your data, training settings, and computational resources.
2. See [DeepSpeed Configuration - Scheduler](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters) for more details.
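
One way to derive the step counts from your data is sketched below. The helper `build_scheduler_config`, the `warmup_ratio` argument, and the example numbers are hypothetical; the sketch assumes the `WarmupDecayLR` scheduler and that each epoch covers the dataset once.

```python
import json
import math


def build_scheduler_config(num_samples, train_batch_size, num_epochs,
                           max_lr, warmup_ratio=0.05):
    """Derive scheduler step counts from the dataset size (hypothetical helper)."""
    steps_per_epoch = math.ceil(num_samples / train_batch_size)
    total_num_steps = steps_per_epoch * num_epochs
    # Assumption: warm up for a fixed fraction of training, at least one step.
    warmup_num_steps = max(1, round(total_num_steps * warmup_ratio))
    return {
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "total_num_steps": total_num_steps,
                "warmup_min_lr": 0.0,
                "warmup_max_lr": max_lr,
                "warmup_num_steps": warmup_num_steps,
            },
        }
    }


scheduler_config = build_scheduler_config(
    num_samples=100_000, train_batch_size=250, num_epochs=3, max_lr=1e-4
)
print(json.dumps(scheduler_config, indent=2))  # 1200 total steps, 60 warmup steps
```

Keeping `warmup_max_lr` equal to the optimizer's `lr` is a common choice, but that coupling is an assumption of this sketch rather than a requirement.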

## Precision
### FP16
- `enabled`: Whether to use FP16 precision.

### BF16
- `enabled`: Whether to use BF16 precision.

**Notice**
1. If your device supports BF16, prefer it over FP16. BF16 has the same dynamic range as FP32, so it is less prone to overflow and does not need loss scaling.
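
The recommendation above can be expressed as a small helper that enables exactly one of the two modes. `precision_config` is a hypothetical name; how you detect BF16 support is up to you (with PyTorch, `torch.cuda.is_bf16_supported()` is one option, not called here so the sketch runs without a GPU).

```python
def precision_config(bf16_supported):
    """Enable BF16 when the device supports it, otherwise fall back to FP16."""
    return {
        "fp16": {"enabled": not bf16_supported},
        "bf16": {"enabled": bf16_supported},
    }


# Assumption: the caller determines support, e.g. via torch.cuda.is_bf16_supported()
print(precision_config(bf16_supported=True))
print(precision_config(bf16_supported=False))
```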

### AMP
- `enabled`: Whether to use AMP (Automatic Mixed Precision).

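## ZeRO
- `stage`: The ZeRO stage (0 to 3). Stage 0 disables ZeRO, stage 1 partitions the optimizer states, stage 2 also partitions the gradients, and stage 3 also partitions the parameters.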
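- `offload_param`: Settings for offloading the model parameters (ZeRO-3 only).
    - `device`: The device to offload the parameters to (`cpu` or `nvme`).
    - `pin_memory`: Whether to use pinned (page-locked) CPU memory.
        - Pinned memory speeds up data transfer between the CPU and the GPU, but it cannot be paged out, so it leaves less memory for other processes.
- `offload_optimizer`: Settings for offloading the optimizer states.
    - `device`: The device to offload the optimizer states to (`cpu` or `nvme`).
    - `pin_memory`: Whether to use pinned (page-locked) CPU memory.
        - Pinned memory speeds up data transfer between the CPU and the GPU, but it cannot be paged out, so it leaves less memory for other processes.
- `stage3_gather_16bit_weights_on_model_save`: Whether to gather the 16-bit (FP16 or BF16) weights, which are partitioned across GPUs, when saving the model. (Only for ZeRO-3)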
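
Put together, these options live under the `zero_optimization` key of the configuration. The following is a sketch of a ZeRO-3 setup with CPU offloading; the specific choices (stage 3, `cpu`, pinned memory) are illustrative assumptions rather than recommendations.

```python
import json

zero_config = {
    "zero_optimization": {
        "stage": 3,
        # Parameter offloading is only available with stage 3.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Gather the partitioned 16-bit weights so a full model can be saved.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
}

print(json.dumps(zero_config, indent=2))
```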

## Activation Checkpointing
- `partition_activations`: Whether to partition the activation checkpoints across GPUs when model parallelism is used.
    - Activation checkpointing reduces memory usage by storing only a subset of the activations and recomputing the rest during the backward pass, at the cost of extra computation time.
- `cpu_checkpointing`: Whether to offload the partitioned activation checkpoints to the CPU (used together with `partition_activations`).
- `contiguous_memory_optimization`: Whether to copy the activation checkpoints into a pre-allocated contiguous memory buffer.
    - It can improve memory access efficiency but may increase memory usage.
- `number_checkpoints`: The number of activation checkpoints used to size the buffer for `contiguous_memory_optimization`.
    - This value determines how many checkpoints fit in the pre-allocated memory.
    - If the value is too small, some activations are still stored non-contiguously.
    - If the value is too large, the unused part of the pre-allocated memory is wasted.
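
These settings go under the `activation_checkpointing` key. The sketch below assumes one checkpoint per transformer layer when sizing `number_checkpoints`; that sizing rule, the helper name, and the layer count are illustrative assumptions.

```python
import json


def activation_checkpointing_config(num_checkpointed_layers):
    """Build the section, sizing the buffer to one checkpoint per layer (assumption)."""
    return {
        "activation_checkpointing": {
            "partition_activations": True,
            "cpu_checkpointing": True,
            "contiguous_memory_optimization": True,
            "number_checkpoints": num_checkpointed_layers,
        }
    }


ac_config = activation_checkpointing_config(num_checkpointed_layers=24)
print(json.dumps(ac_config, indent=2))
```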

## Others
- `gradient_clipping`: The maximum norm of the gradients. If the norm of the gradients exceeds this value, the gradients are scaled down so that their norm matches it.
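
The effect of clipping by norm can be illustrated in plain Python. This is a simplified sketch of the idea on a flat list of numbers, not DeepSpeed's implementation; `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)  # 5.0
print(norm_after)   # approximately 1.0
```

Because every gradient is multiplied by the same factor, clipping changes the size of the update but not its direction.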
