[2025-04-11 17:48:04,023] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:root:Using nproc_per_node=8.
W0411 17:48:05.297000 140620612792960 torch/distributed/run.py:757] 
W0411 17:48:05.297000 140620612792960 torch/distributed/run.py:757] *****************************************
W0411 17:48:05.297000 140620612792960 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0411 17:48:05.297000 140620612792960 torch/distributed/run.py:757] *****************************************
[2025-04-11 17:48:07,855] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,873] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,879] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,882] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,888] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,889] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,890] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-11 17:48:07,891] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2025-04-11 17:48:08,989] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:08,999] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:09,001] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:09,001] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-04-11 17:48:09,002] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:09,012] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:09,014] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-11 17:48:09,014] [INFO] [comm.py:637:init_distributed] cdb=None
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2025-04-11 17:48:09 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/data/username/grafting/saves/llama3-8b/full/sft_code', model_revision='main', model_code_revision=None, torch_dtype=None, tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation=None, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2025-04-11 17:48:09 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'/data/public/grafting/dpo_code': 1.0}, text_column='text', dataset_splits=['train'], dataset_configs=None, preprocessing_num_workers=12, truncation_side=None, auto_insert_empty_system_msg=True)
2025-04-11 17:48:09 - INFO - __main__ - Training/evaluation parameters DPOConfig(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
beta=10,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
dataset_num_proc=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_dropout=True,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=10000,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
f_alpha_divergence_coef=1.0,
f_divergence_type=FDivergenceType.REVERSE_KL,
force_use_ref_model=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generate_during_eval=False,
gradient_accumulation_steps=16,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
is_encoder_decoder=None,
jit_mode_eval=False,
label_names=None,
label_pad_token_id=-100,
label_smoothing=0,
label_smoothing_factor=0.0,
learning_rate=8e-07,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data/username/grafting/saves/llama3-8b/full/dpo_code/runs/Apr11_17-48-09_amax,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=2,
logging_strategy=IntervalStrategy.STEPS,
loss_type=length_normalization,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_length=2048,
max_prompt_length=512,
max_steps=-1,
max_target_length=None,
metric_for_best_model=None,
model_adapter_name=None,
model_init_kwargs=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/data/username/grafting/saves/llama3-8b/full/dpo_code,
overwrite_output_dir=False,
padding_value=None,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
precompute_ref_log_probs=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
ref_adapter_name=None,
ref_model_init_kwargs=None,
ref_model_mixup_alpha=0.9,
ref_model_sync_steps=64,
reference_free=False,
remove_unused_columns=False,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
rpo_alpha=None,
run_name=/data/username/grafting/saves/llama3-8b/full/dpo_code,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=2,
seed=42,
skip_memory_metrics=True,
split_batches=None,
sync_ref_model=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
truncation_mode=keep_end,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2025-04-11 17:48:09,115] [INFO] [comm.py:637:init_distributed] cdb=None
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
2025-04-11 17:48:09 - INFO - __main__ - Training on the following splits: ['train : 3968']
[INFO|tokenization_utils_base.py:2287] 2025-04-11 17:48:09,132 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2025-04-11 17:48:09,133 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2025-04-11 17:48:09,133 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2025-04-11 17:48:09,133 >> loading file tokenizer_config.json
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
44444 Dataset({
    features: ['rejected', 'dataset', 'source_model', 'prompt', 'chosen'],
    num_rows: 3968
})
[INFO|tokenization_utils_base.py:2533] 2025-04-11 17:48:09,412 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
2025-04-11 17:48:09 - INFO - __main__ - Decontaminated 0 (0.00%) samples from the training set.
2025-04-11 17:48:09 - INFO - __main__ - Prompt sample 2619 of the raw training set:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function that takes a string `s` and returns a dictionary `d` with keys `s[i]` and values `s[i+1]` for all `i` from `0` to `len(s)-2` (inclusive). If `s[i]` is already in `d`, replace its current value with the concatenation of the current value and `s[i+1]`.

Your code should pass the following test case:
```python
s = "abcdefgh"
expected_output = {"a": "ab", "b": "bc", "c": "cd", "d": "de", "e": "ef", "f": "fg", "g": "gh"}
assert dictionary_comprehension(s) == expected_output
```<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Chosen sample 2619 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

To solve this problem, we can use a dictionary comprehension to iterate over the string `s` and construct the dictionary `d`. We need to ensure that if a key already exists in the dictionary, its value is concatenated with the new value. Here's how you can implement this:

```python
def dictionary_comprehension(s):
    # Initialize an empty dictionary
    d = {}
    
    # Iterate over the string from index 0 to len(s)-2
    for i in range(len(s) - 1):
        # If the key already exists, concatenate the new value to the existing value
        if s[i] in d:
            d[s[i]] += s[i + 1]
        else:
            # Otherwise, create a new key-value pair
            d[s[i]] = s[i] + s[i + 1]
    
    return d

# Test the function with the provided test case
s = "abcdefgh"
expected_output = {"a": "ab", "b": "bc", "c": "cd", "d": "de", "e": "ef", "f": "fg", "g": "gh"}
assert dictionary_comprehension(s) == expected_output
```

This function works as follows:
1. It initializes an empty dictionary `d`.
2. It iterates over the string `s` from index `0` to `len(s)-2`.
3. For each character `s[i]`, it checks if the character is already a key in the dictionary.
   - If it is, it concatenates the next character `s[i+1]` to the existing value.
   - If it is not, it creates a new key-value pair where the key is `s[i]` and the value is `s[i] + s[i+1]`.
4. Finally, it returns the constructed dictionary `d`.

This approach ensures that the function correctly handles the concatenation of values for repeated keys.<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Rejected sample 2619 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

To solve this problem, we can use a dictionary comprehension to iterate over the string `s` and construct the required dictionary. However, since we need to handle cases where the same key appears more than once, we will use a regular loop to ensure that we concatenate values correctly when a key is repeated.

Here's the implementation:

```python
def dictionary_comprehension(s):
    # Initialize an empty dictionary to store the results
    d = {}
    
    # Iterate over the string s from index 0 to len(s)-2
    for i in range(len(s) - 1):
        # Check if the key s[i] is already in the dictionary
        if s[i] in d:
            # If it is, concatenate the current value with s[i+1]
            d[s[i]] += s[i+1]
        else:
            # If it is not, initialize the key with the value s[i+1]
            d[s[i]] = s[i+1]
    
    return d

# Test the function with the provided test case
s = "abcdefgh"
expected_output = {"a": "ab", "b": "bc", "c": "cd", "d": "de", "e": "ef", "f": "fg", "g": "gh"}
assert dictionary_comprehension(s) == expected_output
```

This function iterates through the string `s` and constructs the dictionary `d` as specified. If a key already exists in the dictionary, it appends the next character to the existing value. If the key does not exist, it initializes the key with the next character. This ensures that the dictionary is built correctly according to the problem statement.<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Prompt sample 456 of the raw training set:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Implement a Python function that recursively performs binary search on a sorted array for a given target. The function should handle base cases and early returns, and use integer division to calculate the midpoint to avoid the problem of floating point imprecision.

Your code should pass the following assertions:
```python
assert recursive_binary_search([1, 3, 5, 7, 9], 10) == -1
assert recursive_binary_search([1, 3, 5, 7, 9], 3) == 1
assert recursive_binary_search([], 5) == -1
```<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Chosen sample 456 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

```python
def recursive_binary_search(arr, target):
  """
  Recursively performs binary search on a sorted array for a given target.

  Args:
    arr: The sorted array to search.
    target: The target value to search for.

  Returns:
    The index of the target value in the array, or -1 if it is not found.
  """
  if not arr:  # Base case: empty array
    return -1

  mid = len(arr) // 2  # Calculate midpoint using integer division
  if arr[mid] == target:
    return mid  # Target found at midpoint
  elif target < arr[mid]:
    return recursive_binary_search(arr[:mid], target)  # Search left half
  else:
    return recursive_binary_search(arr[mid + 1:], target)  # Search right half

```

**Explanation:**

1. **Base Case:** The function first checks if the input array `arr` is empty. If it is, the target cannot be found, so it returns -1.

2. **Midpoint Calculation:** It calculates the middle index `mid` using integer division (`//`) to ensure an integer result, avoiding potential floating-point precision issues.

3. **Comparison:** The function compares the value at the midpoint `arr[mid]` with the `target`.

   - **Target Found:** If they are equal, the function returns `mid`, indicating the index where the target is found.

   - **Target Smaller:** If the `target` is smaller than `arr[mid]`, the search continues recursively in the left half of the array (`arr[:mid]`).

   - **Target Larger:** If the `target` is larger than `arr[mid]`, the search continues recursively in the right half of the array (`arr[mid + 1:]`).

**Assertions:**

The provided assertions test the function with different scenarios:

- **Target not found:** `recursive_binary_search([1, 3, 5, 7, 9], 10) == -1`
- **Target found:** `recursive_binary_search([1, 3, 5, 7, 9], 3) == 1`
- **Empty array:** `recursive_binary_search([], 5) == -1`

These assertions verify that the function correctly handles base cases and returns the expected results for various inputs.<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Rejected sample 456 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

```python
def recursive_binary_search(arr, target):
    """
    Recursively performs binary search on a sorted array for a given target.

    Args:
        arr: A sorted array.
        target: The value to search for.

    Returns:
        The index of the target element if found, -1 otherwise.
    """

    left = 0
    right = len(arr) - 1

    if left > right:
        return -1

    mid = (left + right) // 2  

    if arr[mid] == target:
        return mid
    elif target < arr[mid]:
        return recursive_binary_search(arr[left:mid], target)
    else:
        return recursive_binary_search(arr[mid+1:right+1], target)


# Assertions
assert recursive_binary_search([1, 3, 5, 7, 9], 10) == -1
assert recursive_binary_search([1, 3, 5, 7, 9], 3) == 1
assert recursive_binary_search([], 5) == -1
```

**Explanation:**

1. **Base Case:** The function first checks if `left` is greater than `right`. This means the search space is empty, indicating the target is not present. In this case, it returns `-1`.
2. **Midpoint Calculation:** It calculates the middle index `mid` using integer division (`//`) to avoid potential floating-point precision issues.
3. **Comparison:**
   - If `arr[mid]` equals the `target`, the target is found, and its index `mid` is returned.
   - If `target` is less than `arr[mid]`, the search continues recursively in the left half of the array (`arr[left:mid]`).
   - If `target` is greater than `arr[mid]`, the search continues recursively in the right half of the array (`arr[mid+1:right+1]`).

**Key Points:**

- **Recursion:** The function calls itself with a smaller subarray in each recursive step, narrowing down the search space.
- **Integer Division:** Using `//` ensures that the midpoint is always an integer index.
- **Base Case Handling:** The `left > right` condition handles the case when the target is not found, preventing infinite recursion.

This implementation effectively performs binary search recursively, taking advantage of the sorted nature of the input array to efficiently locate the target element.<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Prompt sample 102 of the raw training set:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Create a Python function that handles the reply to a command with optional parameters, such as `/help`, `/ping [optional_message]`, `/echo [message]`, and `/sum [number1] [number2]`. The function should recognize the command and construct a reply accordingly.

- For `/help`, the function should return "There are five commands: /help, /ping, /echo, /sum, /exit".
- For `/ping`, the function should return "Pong!" if no optional message is provided, or "Pong! {optional_message}" if an optional message is provided.
- For `/echo`, the function should return "{message}" if a message is provided.
- For `/sum`, the function should return the sum of two numbers if two numbers are provided.
- For `/exit`, the function should exit the program.

Your code should pass the following assertion:
```python
assert handle_command("/help") == "There are five commands: /help, /ping, /echo, /sum, /exit"
```<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Chosen sample 102 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

Certainly! Below is a Python function that handles the specified commands:

```python
import sys

def handle_command(command):
    parts = command.split()
    base_command = parts[0]

    if base_command == "/help":
        return "There are five commands: /help, /ping, /echo, /sum, /exit"

    elif base_command == "/ping":
        if len(parts) > 1:
            optional_message = ' '.join(parts[1:])
            return f"Pong! {optional_message}"
        else:
            return "Pong!"

    elif base_command == "/echo":
        if len(parts) > 1:
            message = ' '.join(parts[1:])
            return message
        else:
            return "No message provided."

    elif base_command == "/sum":
        if len(parts) == 3:
            try:
                number1 = float(parts[1])
                number2 = float(parts[2])
                return str(number1 + number2)
            except ValueError:
                return "Invalid numbers provided."
        else:
            return "Usage: /sum [number1] [number2]"

    elif base_command == "/exit":
        sys.exit()

    else:
        return "Unknown command."

# Example assertions
assert handle_command("/help") == "There are five commands: /help, /ping, /echo, /sum, /exit"
assert handle_command("/ping") == "Pong!"
assert handle_command("/ping Hello") == "Pong! Hello"
assert handle_command("/echo Hello World") == "Hello World"
assert handle_command("/sum 3 5") == "8.0"
assert handle_command("/sum 3.5 2.5") == "6.0"

# The following line should exit the program
# handle_command("/exit")
```

This function splits the input command into parts and handles each command accordingly. The `/ping`, `/echo`, and `/sum` commands process optional parameters as described. The `/exit` command will terminate the program using `sys.exit()`.

The assertions provided ensure that the function behaves correctly for the given examples.<|eot_id|>
2025-04-11 17:48:09 - INFO - __main__ - Rejected sample 102 of the raw training set:

<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>

Certainly! Below is a Python function that handles the specified commands with optional parameters:

```python
def handle_command(command):
    parts = command.split()
    if len(parts) == 0:
        return "Invalid command"

    base_command = parts[0]

    if base_command == "/help":
        return "There are five commands: /help, /ping, /echo, /sum, /exit"
    elif base_command == "/ping":
        if len(parts) > 1:
            return f"Pong! {' '.join(parts[1:])}"
        else:
            return "Pong!"
    elif base_command == "/echo":
        if len(parts) > 1:
            return ' '.join(parts[1:])
        else:
            return "Please provide a message to echo"
    elif base_command == "/sum":
        if len(parts) == 3:
            try:
                number1 = float(parts[1])
                number2 = float(parts[2])
                return str(number1 + number2)
            except ValueError:
                return "Both arguments must be numbers"
        else:
            return "Please provide two numbers to sum"
    elif base_command == "/exit":
        exit()
    else:
        return "Unknown command"

# Test assertions
assert handle_command("/help") == "There are five commands: /help, /ping, /echo, /sum, /exit"
assert handle_command("/ping") == "Pong!"
assert handle_command("/ping hello") == "Pong! hello"
assert handle_command("/echo test message") == "test message"
assert handle_command("/sum 3 5") == "8.0"
assert handle_command("/exit") == None
assert handle_command("/unknown") == "Unknown command"
```

Explanation:
1. The function `handle_command` takes a command string as input.
2. It splits the command string into parts based on spaces.
3. It identifies the base command (the first part) and handles it accordingly:
   - For `/help`, it returns the help message.
   - For `/ping`, it checks for an optional message and constructs the reply.
   - For `/echo`, it checks for a message to echo.
   - For `/sum`, it checks for two numbers and returns their sum.
   - For `/exit`, it exits the program.
4. If the command is unknown, it returns an "Unknown command" message.
5. The function includes various assertions to ensure it works correctly.<|eot_id|>
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[INFO|configuration_utils.py:731] 2025-04-11 17:48:09,563 >> loading configuration file /data/username/grafting/saves/llama3-8b/full/sft_code/config.json
[INFO|configuration_utils.py:800] 2025-04-11 17:48:09,564 >> Model config LlamaConfig {
  "_name_or_path": "/data/username/grafting/saves/llama3-8b/full/sft_code",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.43.4",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3641] 2025-04-11 17:48:09,565 >> loading weights file /data/username/grafting/saves/llama3-8b/full/sft_code/model.safetensors.index.json
[INFO|modeling_utils.py:3786] 2025-04-11 17:48:09,566 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[INFO|configuration_utils.py:1038] 2025-04-11 17:48:09,569 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "use_cache": false
}

DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
DEBUG: model type = <class 'str'>
/home/username/.conda/envs/llm/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs, beta, max_length, max_prompt_length, loss_type. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:171: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:199: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:226: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[2025-04-11 17:48:11,734] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.51it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.51it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.58it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.52it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.51it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.41it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.28it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:04,  1.53s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.06it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.05it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.05it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.05it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.05it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.04it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.04it/s]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.10s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.10s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.10s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:03<00:03,  1.52s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.29it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.24it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.29it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(

/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:04<00:01,  1.50s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.06s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.23s/it]
[INFO|modeling_utils.py:4473] 2025-04-11 17:48:16,671 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4481] 2025-04-11 17:48:16,671 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/username/grafting/saves/llama3-8b/full/sft_code.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2025-04-11 17:48:16,673 >> loading configuration file /data/username/grafting/saves/llama3-8b/full/sft_code/generation_config.json
[INFO|configuration_utils.py:1038] 2025-04-11 17:48:16,674 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:233: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(
[INFO|configuration_utils.py:731] 2025-04-11 17:48:16,674 >> loading configuration file /data/username/grafting/saves/llama3-8b/full/sft_code/config.json
[INFO|configuration_utils.py:800] 2025-04-11 17:48:16,675 >> Model config LlamaConfig {
  "_name_or_path": "/data/username/grafting/saves/llama3-8b/full/sft_code",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.43.4",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3641] 2025-04-11 17:48:16,675 >> loading weights file /data/username/grafting/saves/llama3-8b/full/sft_code/model.safetensors.index.json
[INFO|modeling_utils.py:3786] 2025-04-11 17:48:16,675 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:1038] 2025-04-11 17:48:16,678 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "use_cache": false
}

[2025-04-11 17:48:17,598] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 582, num_elems = 16.06B
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  5.35it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  5.31it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  5.28it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  5.12it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.99it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.98it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.23it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:04,  1.43s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.12it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.11it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.10it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.10it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.10it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.09it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.09it/s]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.03s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.02s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:02<00:02,  1.42s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.38it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.32it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.38it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.32it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.37it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.37it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.37it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.37it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.37it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:04<00:01,  1.39s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.00it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.15s/it]
[INFO|modeling_utils.py:4473] 2025-04-11 17:48:22,214 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4481] 2025-04-11 17:48:22,214 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/username/grafting/saves/llama3-8b/full/sft_code.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2025-04-11 17:48:22,216 >> loading configuration file /data/username/grafting/saves/llama3-8b/full/sft_code/generation_config.json
[INFO|configuration_utils.py:1038] 2025-04-11 17:48:22,216 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:389: UserWarning: You passed `max_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:402: UserWarning: You passed `max_prompt_length` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:487: UserWarning: You passed `loss_type` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/home/username/.conda/envs/llm/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:504: UserWarning: You passed `beta` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   1%|          | 25/3968 [00:00<00:16, 242.59 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 242.87 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 252.68 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 253.57 examples/s]
Map:   4%|▎         | 142/3968 [00:00<00:15, 248.91 examples/s]
Map:   4%|▍         | 168/3968 [00:00<00:15, 251.41 examples/s]
Map:   5%|▌         | 205/3968 [00:00<00:15, 245.40 examples/s]
Map:   6%|▌         | 231/3968 [00:00<00:15, 245.35 examples/s]
Map:   7%|▋         | 260/3968 [00:01<00:14, 250.81 examples/s]
Map:   7%|▋         | 296/3968 [00:01<00:15, 243.81 examples/s]
Map:   8%|▊         | 324/3968 [00:01<00:14, 248.16 examples/s]
Map:   9%|▉         | 354/3968 [00:01<00:13, 260.67 examples/s]
Map:  10%|▉         | 396/3968 [00:01<00:13, 262.47 examples/s]
Map:  11%|█         | 435/3968 [00:01<00:13, 258.60 examples/s]
Map:  12%|█▏        | 462/3968 [00:01<00:13, 258.42 examples/s]
Map:  12%|█▏        | 489/3968 [00:01<00:13, 259.77 examples/s]
Map:  13%|█▎        | 527/3968 [00:02<00:13, 254.37 examples/s]
Map:  14%|█▍        | 566/3968 [00:02<00:13, 252.37 examples/s]
Map:  15%|█▍        | 592/3968 [00:02<00:13, 253.37 examples/s]
Map:  16%|█▌        | 618/3968 [00:02<00:13, 252.55 examples/s]
Map:  17%|█▋        | 655/3968 [00:02<00:13, 248.84 examples/s]
Map:  17%|█▋        | 680/3968 [00:02<00:13, 242.76 examples/s]
Map:  18%|█▊        | 708/3968 [00:02<00:12, 250.92 examples/s]
Map:  18%|█▊        | 734/3968 [00:02<00:12, 249.32 examples/s]
Map:  19%|█▉        | 762/3968 [00:03<00:12, 255.68 examples/s]
Map:  20%|██        | 800/3968 [00:03<00:12, 247.85 examples/s]
Map:  21%|██        | 838/3968 [00:03<00:12, 248.57 examples/s]
Map:  22%|██▏       | 876/3968 [00:03<00:12, 248.88 examples/s]
Map:  23%|██▎       | 903/3968 [00:03<00:12, 252.36 examples/s]
Map:  23%|██▎       | 930/3968 [00:03<00:12, 252.43 examples/s]
Map:  24%|██▍       | 959/3968 [00:03<00:11, 257.29 examples/s]
Map:  25%|██▍       | 985/3968 [00:03<00:11, 253.53 examples/s]
Map:  26%|██▌       | 1022/3968 [00:04<00:11, 247.17 examples/s]
Map:  26%|██▋       | 1048/3968 [00:04<00:11, 248.75 examples/s]
Map:  27%|██▋       | 1074/3968 [00:04<00:11, 243.60 examples/s]
Map:  28%|██▊       | 1102/3968 [00:04<00:11, 249.36 examples/s]
Map:  28%|██▊       | 1129/3968 [00:04<00:11, 250.62 examples/s]
Map:  29%|██▉       | 1158/3968 [00:04<00:10, 259.23 examples/s]
Map:  30%|██▉       | 1186/3968 [00:04<00:10, 260.89 examples/s]
Map:  31%|███       | 1213/3968 [00:04<00:10, 256.88 examples/s]
Map:  31%|███       | 1239/3968 [00:04<00:10, 255.83 examples/s]
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:27,619 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
Map:  32%|███▏      | 1277/3968 [00:05<00:10, 253.87 examples/s]
Map:  33%|███▎      | 1311/3968 [00:05<00:10, 241.88 examples/s]
Map:  34%|███▎      | 1339/3968 [00:05<00:10, 248.23 examples/s]
Map:  34%|███▍      | 1366/3968 [00:05<00:10, 252.00 examples/s]
Map:  35%|███▌      | 1402/3968 [00:05<00:10, 242.95 examples/s]
Map:  36%|███▌      | 1427/3968 [00:05<00:10, 242.88 examples/s]
Map:  37%|███▋      | 1454/3968 [00:05<00:10, 241.36 examples/s]
Map:  37%|███▋      | 1480/3968 [00:05<00:10, 243.64 examples/s]
Map:  38%|███▊      | 1505/3968 [00:06<00:10, 243.17 examples/s]
Map:  39%|███▉      | 1540/3968 [00:06<00:10, 235.40 examples/s]
Map:  40%|███▉      | 1572/3968 [00:06<00:10, 225.91 examples/s]
Map:  40%|████      | 1597/3968 [00:06<00:10, 227.87 examples/s]
Map:  41%|████      | 1631/3968 [00:06<00:10, 225.51 examples/s]
Map:  42%|████▏     | 1655/3968 [00:06<00:10, 223.73 examples/s]
Map:  42%|████▏     | 1679/3968 [00:06<00:10, 226.21 examples/s]
Map:  43%|████▎     | 1704/3968 [00:06<00:09, 230.43 examples/s]
Map:  44%|████▎     | 1730/3968 [00:07<00:09, 232.80 examples/s]
Map:  44%|████▍     | 1757/3968 [00:07<00:09, 239.29 examples/s]
Map:  45%|████▍     | 1784/3968 [00:07<00:08, 244.99 examples/s]
Map:  46%|████▌     | 1810/3968 [00:07<00:08, 246.04 examples/s]
Map:  47%|████▋     | 1846/3968 [00:07<00:08, 239.59 examples/s]
Map:  47%|████▋     | 1881/3968 [00:07<00:09, 229.77 examples/s]
Map:  48%|████▊     | 1918/3968 [00:07<00:08, 232.11 examples/s]
Map:  49%|████▉     | 1943/3968 [00:07<00:08, 227.83 examples/s]
Map:  50%|████▉     | 1970/3968 [00:08<00:08, 235.79 examples/s]
Map:  51%|█████     | 2005/3968 [00:08<00:08, 232.30 examples/s]
Map:  51%|█████     | 2030/3968 [00:08<00:08, 232.71 examples/s]
Map:  52%|█████▏    | 2059/3968 [00:08<00:07, 243.28 examples/s]
Map:  53%|█████▎    | 2085/3968 [00:08<00:07, 246.19 examples/s]
Map:  53%|█████▎    | 2121/3968 [00:08<00:07, 241.20 examples/s]
Map:  54%|█████▍    | 2158/3968 [00:08<00:07, 241.60 examples/s]
Map:  55%|█████▌    | 2195/3968 [00:08<00:07, 241.15 examples/s]
Map:  56%|█████▌    | 2221/3968 [00:09<00:07, 243.91 examples/s]
Map:  57%|█████▋    | 2248/3968 [00:09<00:06, 247.64 examples/s]
Map:  58%|█████▊    | 2283/3968 [00:09<00:07, 238.06 examples/s]
Map:  58%|█████▊    | 2312/3968 [00:09<00:06, 247.23 examples/s]
Map:  59%|█████▉    | 2340/3968 [00:09<00:06, 249.08 examples/s]
Map:  60%|█████▉    | 2378/3968 [00:09<00:06, 247.21 examples/s]
Map:  61%|██████    | 2415/3968 [00:09<00:06, 245.46 examples/s]
Map:  61%|██████▏   | 2440/3968 [00:09<00:06, 244.21 examples/s]
Map:  62%|██████▏   | 2466/3968 [00:10<00:06, 245.27 examples/s]
Map:  63%|██████▎   | 2502/3968 [00:10<00:06, 241.38 examples/s]
Map:  64%|██████▎   | 2528/3968 [00:10<00:05, 243.89 examples/s]
Map:  64%|██████▍   | 2555/3968 [00:10<00:05, 248.40 examples/s]
Map:  65%|██████▌   | 2583/3968 [00:10<00:05, 254.37 examples/s]
Map:  66%|██████▌   | 2610/3968 [00:10<00:05, 251.25 examples/s]
Map:  66%|██████▋   | 2638/3968 [00:10<00:05, 253.94 examples/s]
Map:  67%|██████▋   | 2675/3968 [00:10<00:05, 249.65 examples/s]
Map:  68%|██████▊   | 2710/3968 [00:11<00:05, 237.08 examples/s]
Map:  69%|██████▉   | 2736/3968 [00:11<00:05, 241.94 examples/s]
Map:  70%|██████▉   | 2764/3968 [00:11<00:04, 246.66 examples/s]
Map:  71%|███████   | 2800/3968 [00:11<00:04, 237.79 examples/s]
Map:  71%|███████▏  | 2836/3968 [00:11<00:04, 235.61 examples/s]
Map:  72%|███████▏  | 2867/3968 [00:11<00:04, 249.78 examples/s]
Map:  73%|███████▎  | 2893/3968 [00:11<00:04, 248.02 examples/s]
Map:  74%|███████▎  | 2921/3968 [00:11<00:04, 253.74 examples/s]
Map:  74%|███████▍  | 2949/3968 [00:11<00:03, 257.58 examples/s]
Map:  75%|███████▌  | 2981/3968 [00:12<00:04, 237.53 examples/s]
Map:  76%|███████▌  | 3020/3968 [00:12<00:03, 240.75 examples/s]
Map:  77%|███████▋  | 3058/3968 [00:12<00:03, 242.29 examples/s]
Map:  78%|███████▊  | 3086/3968 [00:12<00:03, 249.92 examples/s]
Map:  79%|███████▊  | 3120/3968 [00:12<00:03, 238.79 examples/s]
Map:  79%|███████▉  | 3148/3968 [00:12<00:03, 245.76 examples/s]
Map:  80%|████████  | 3184/3968 [00:12<00:03, 240.77 examples/s]
Map:  81%|████████  | 3219/3968 [00:13<00:03, 236.30 examples/s]
Map:  82%|████████▏ | 3244/3968 [00:13<00:03, 236.29 examples/s]
Map:  82%|████████▏ | 3270/3968 [00:13<00:02, 239.34 examples/s]
Map:  83%|████████▎ | 3300/3968 [00:13<00:02, 248.19 examples/s]
Map:  84%|████████▍ | 3337/3968 [00:13<00:02, 244.62 examples/s]
Map:  85%|████████▌ | 3376/3968 [00:13<00:02, 247.41 examples/s]
Map:  86%|████████▌ | 3401/3968 [00:13<00:02, 244.54 examples/s]
Map:  87%|████████▋ | 3436/3968 [00:14<00:02, 237.63 examples/s]
Map:  87%|████████▋ | 3465/3968 [00:14<00:02, 245.34 examples/s]
Map:  88%|████████▊ | 3493/3968 [00:14<00:01, 249.98 examples/s]
Map:  89%|████████▊ | 3519/3968 [00:14<00:01, 252.30 examples/s]
Map:  89%|████████▉ | 3550/3968 [00:14<00:01, 263.00 examples/s]
Map:  90%|█████████ | 3579/3968 [00:14<00:01, 267.73 examples/s]
Map:  91%|█████████ | 3620/3968 [00:14<00:01, 264.73 examples/s]
Map:  92%|█████████▏| 3647/3968 [00:14<00:01, 264.59 examples/s]
Map:  93%|█████████▎| 3688/3968 [00:14<00:01, 258.68 examples/s]
Map:  94%|█████████▎| 3715/3968 [00:15<00:00, 259.36 examples/s]
Map:  95%|█████████▍| 3751/3968 [00:15<00:00, 249.10 examples/s]
Map:  95%|█████████▌| 3777/3968 [00:15<00:00, 249.32 examples/s]
Map:  96%|█████████▌| 3803/3968 [00:15<00:00, 249.88 examples/s]
Map:  97%|█████████▋| 3839/3968 [00:15<00:00, 245.28 examples/s]
Map:  98%|█████████▊| 3872/3968 [00:15<00:00, 231.84 examples/s]
Map:  98%|█████████▊| 3907/3968 [00:15<00:00, 230.36 examples/s]
Map:  99%|█████████▉| 3940/3968 [00:16<00:00, 222.58 examples/s]
Map: 100%|█████████▉| 3965/3968 [00:16<00:00, 228.39 examples/s]
Map: 100%|██████████| 3968/3968 [00:16<00:00, 233.46 examples/s]
[INFO|trainer.py:648] 2025-04-11 17:48:39,697 >> Using auto half precision backend
[2025-04-11 17:48:39,697] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2025-04-11 17:48:39,706] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-11 17:48:39,707] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2025-04-11 17:48:39,863] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-11 17:48:39,863] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 6.55 GB         CA 6.98 GB         Max_CA 7 GB 
[2025-04-11 17:48:39,863] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.18 GB, percent = 2.5%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2025-04-11 17:48:40,006] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-11 17:48:40,007] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 6.98 GB         Max_CA 7 GB 
[2025-04-11 17:48:40,007] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.2 GB, percent = 2.5%
[2025-04-11 17:48:40,008] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   amp_params ................... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7feb424c9fd0>
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   dump_state ................... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2025-04-11 17:48:40,008] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   fp16_enabled ................. False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   global_rank .................. 0
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 16
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   pld_params ................... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   steps_per_print .............. inf
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   train_batch_size ............. 128
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  1
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   world_size ................... 8
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  False
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-11 17:48:40,009] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2025-04-11 17:48:40,010] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 128, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 16, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "zero_optimization.reduce_bucket_size": 1.677722e+07, 
    "zero_optimization.stage3_param_persistence_threshold": 4.096000e+04, 
    "zero_optimization.stage3_prefetch_bucket_size": 1.509949e+07
}
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   0%|          | 0/3968 [00:00<?, ? examples/s]
Map:   1%|          | 25/3968 [00:00<00:16, 240.42 examples/s]
[2025-04-11 17:48:40,183] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
Map:   1%|          | 25/3968 [00:00<00:16, 241.48 examples/s]
[2025-04-11 17:48:40,189] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Map:   1%|          | 25/3968 [00:00<00:16, 241.99 examples/s]
[2025-04-11 17:48:40,190] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-11 17:48:40,190] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Map:   1%|          | 25/3968 [00:00<00:16, 238.52 examples/s]
Map:   1%|          | 25/3968 [00:00<00:16, 239.64 examples/s]
Map:   1%|          | 25/3968 [00:00<00:16, 240.29 examples/s]
[2025-04-11 17:48:40,197] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2025-04-11 17:48:40,197] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2025-04-11 17:48:40,197] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-04-11 17:48:40,197] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
Map:   1%|          | 25/3968 [00:00<00:16, 241.09 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 242.62 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 242.69 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 242.65 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 240.21 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 240.03 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 240.66 examples/s]
Map:   1%|▏         | 50/3968 [00:00<00:16, 240.91 examples/s]
[2025-04-11 17:48:40,320] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-04-11 17:48:40,321] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.02 GB         Max_CA 7 GB 
[2025-04-11 17:48:40,321] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.32 GB, percent = 2.6%
[2025-04-11 17:48:40,322] [INFO] [stage3.py:130:__init__] Reduce bucket size 500,000,000
[2025-04-11 17:48:40,322] [INFO] [stage3.py:131:__init__] Prefetch bucket size 50,000,000
Map:   2%|▏         | 77/3968 [00:00<00:15, 253.18 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 252.62 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 252.15 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 250.87 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 250.75 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 249.89 examples/s]
Map:   2%|▏         | 77/3968 [00:00<00:15, 250.73 examples/s]
[2025-04-11 17:48:40,443] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-11 17:48:40,444] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.02 GB         Max_CA 5 GB 
[2025-04-11 17:48:40,444] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.32 GB, percent = 2.6%
Parameter Offload: Total persistent parameters: 266240 in 65 params
Map:   3%|▎         | 102/3968 [00:00<00:15, 246.80 examples/s]
Map:   3%|▎         | 117/3968 [00:00<00:14, 257.27 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 253.81 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 253.34 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 252.69 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 251.54 examples/s]
Map:   3%|▎         | 116/3968 [00:00<00:15, 252.19 examples/s]
[2025-04-11 17:48:40,583] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-11 17:48:40,583] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.02 GB         Max_CA 5 GB 
[2025-04-11 17:48:40,583] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.32 GB, percent = 2.6%
Map:   3%|▎         | 128/3968 [00:00<00:15, 245.86 examples/s]
Map:   4%|▎         | 142/3968 [00:00<00:15, 249.38 examples/s]
Map:   4%|▎         | 142/3968 [00:00<00:15, 248.52 examples/s]
Map:   4%|▎         | 142/3968 [00:00<00:15, 248.25 examples/s]
Map:   4%|▍         | 157/3968 [00:00<00:14, 256.82 examples/s]
[2025-04-11 17:48:40,708] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-11 17:48:40,709] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.02 GB         Max_CA 5 GB 
[2025-04-11 17:48:40,709] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 19.32 GB, percent = 2.6%
Map:   4%|▍         | 155/3968 [00:00<00:15, 249.59 examples/s]
Map:   4%|▍         | 155/3968 [00:00<00:15, 251.35 examples/s]
Map:   4%|▍         | 155/3968 [00:00<00:15, 249.98 examples/s]
Map:   4%|▍         | 169/3968 [00:00<00:15, 252.27 examples/s]
Map:   4%|▍         | 168/3968 [00:00<00:15, 251.43 examples/s]
Map:   4%|▍         | 168/3968 [00:00<00:15, 250.53 examples/s]
Map:   5%|▍         | 194/3968 [00:00<00:15, 248.61 examples/s]
Map:   5%|▍         | 191/3968 [00:00<00:15, 244.15 examples/s]
Map:   5%|▍         | 191/3968 [00:00<00:15, 244.71 examples/s]
Map:   5%|▍         | 191/3968 [00:00<00:15, 244.67 examples/s]
Map:   5%|▌         | 206/3968 [00:00<00:15, 247.53 examples/s]
Map:   5%|▌         | 207/3968 [00:00<00:15, 247.64 examples/s]
Map:   5%|▌         | 205/3968 [00:00<00:15, 244.44 examples/s]
Map:   6%|▌         | 220/3968 [00:00<00:15, 245.61 examples/s]
Map:   5%|▌         | 216/3968 [00:00<00:15, 243.50 examples/s]
Map:   5%|▌         | 216/3968 [00:00<00:15, 243.12 examples/s]
Map:   5%|▌         | 216/3968 [00:00<00:15, 243.46 examples/s]
Map:   6%|▌         | 232/3968 [00:00<00:14, 249.08 examples/s]
Map:   6%|▌         | 231/3968 [00:00<00:15, 244.25 examples/s]
Map:   6%|▌         | 236/3968 [00:00<00:14, 249.85 examples/s]
Map:   6%|▋         | 248/3968 [00:00<00:14, 253.48 examples/s]
Map:   6%|▌         | 243/3968 [00:00<00:14, 248.43 examples/s]
Map:   6%|▌         | 243/3968 [00:00<00:15, 247.81 examples/s]
Map:   6%|▌         | 243/3968 [00:00<00:15, 248.13 examples/s]
Map:   7%|▋         | 260/3968 [00:01<00:14, 253.35 examples/s]
Map:   7%|▋         | 260/3968 [00:01<00:14, 249.49 examples/s]
Map:   7%|▋         | 264/3968 [00:01<00:14, 252.36 examples/s]
Map:   7%|▋         | 275/3968 [00:01<00:14, 253.34 examples/s]
Map:   7%|▋         | 270/3968 [00:01<00:14, 250.37 examples/s]
Map:   7%|▋         | 270/3968 [00:01<00:14, 249.86 examples/s]
Map:   7%|▋         | 270/3968 [00:01<00:14, 250.17 examples/s]
Map:   7%|▋         | 290/3968 [00:01<00:14, 249.88 examples/s]
Map:   7%|▋         | 297/3968 [00:01<00:14, 247.69 examples/s]
Map:   7%|▋         | 296/3968 [00:01<00:15, 242.47 examples/s]
Map:   8%|▊         | 313/3968 [00:01<00:14, 251.96 examples/s]
Map:   8%|▊         | 310/3968 [00:01<00:14, 249.97 examples/s]
Map:   8%|▊         | 310/3968 [00:01<00:14, 250.13 examples/s]
Map:   8%|▊         | 310/3968 [00:01<00:14, 250.60 examples/s]
Map:   8%|▊         | 317/3968 [00:01<00:14, 253.05 examples/s]
Map:   8%|▊         | 324/3968 [00:01<00:14, 251.23 examples/s]
Map:   8%|▊         | 322/3968 [00:01<00:14, 246.51 examples/s]
Map:   9%|▊         | 340/3968 [00:01<00:14, 252.16 examples/s]
Map:   9%|▊         | 338/3968 [00:01<00:14, 255.37 examples/s]
Map:   9%|▊         | 338/3968 [00:01<00:14, 255.37 examples/s]
Map:   9%|▊         | 338/3968 [00:01<00:14, 256.08 examples/s]
Map:   9%|▊         | 346/3968 [00:01<00:13, 260.07 examples/s]
Map:   9%|▉         | 357/3968 [00:01<00:13, 264.76 examples/s]
Map:   9%|▉         | 352/3968 [00:01<00:14, 257.59 examples/s]
Map:   9%|▉         | 369/3968 [00:01<00:13, 257.60 examples/s]
Map:   9%|▉         | 366/3968 [00:01<00:13, 258.74 examples/s]
Map:   9%|▉         | 366/3968 [00:01<00:13, 258.78 examples/s]
Map:   9%|▉         | 366/3968 [00:01<00:13, 259.86 examples/s]
Map:  10%|▉         | 387/3968 [00:01<00:13, 263.08 examples/s]
Map:  10%|▉         | 380/3968 [00:01<00:14, 255.42 examples/s]
Map:  10%|█         | 400/3968 [00:01<00:13, 265.21 examples/s]
Map:  10%|█         | 400/3968 [00:01<00:13, 266.22 examples/s]
Map:  10%|▉         | 396/3968 [00:01<00:13, 264.16 examples/s]
Map:  10%|▉         | 396/3968 [00:01<00:13, 264.10 examples/s]
Map:  10%|▉         | 396/3968 [00:01<00:13, 265.14 examples/s]
Map:  10%|█         | 416/3968 [00:01<00:13, 267.22 examples/s]
Map:  10%|█         | 408/3968 [00:01<00:13, 261.55 examples/s]
Map:  11%|█         | 427/3968 [00:01<00:13, 263.36 examples/s]
Map:  11%|█         | 427/3968 [00:01<00:13, 264.75 examples/s]
Map:  11%|█         | 435/3968 [00:01<00:13, 260.21 examples/s]
Map:  11%|█         | 435/3968 [00:01<00:13, 261.08 examples/s]
Map:  11%|█         | 435/3968 [00:01<00:13, 259.59 examples/s]
Map:  11%|█         | 445/3968 [00:01<00:13, 266.55 examples/s]
Map:  11%|█▏        | 455/3968 [00:01<00:13, 259.14 examples/s]
Map:  11%|█▏        | 455/3968 [00:01<00:13, 260.49 examples/s]
Map:  11%|█▏        | 449/3968 [00:01<00:13, 259.35 examples/s]
Map:  12%|█▏        | 462/3968 [00:01<00:13, 259.57 examples/s]
Map:  12%|█▏        | 462/3968 [00:01<00:13, 260.58 examples/s]
Map:  12%|█▏        | 462/3968 [00:01<00:13, 258.85 examples/s]
Map:  12%|█▏        | 472/3968 [00:01<00:13, 265.96 examples/s]
Map:  12%|█▏        | 484/3968 [00:01<00:13, 263.92 examples/s]
Map:  12%|█▏        | 484/3968 [00:01<00:13, 264.71 examples/s]
Map:  12%|█▏        | 476/3968 [00:01<00:13, 259.51 examples/s]
Map:  12%|█▏        | 489/3968 [00:01<00:13, 261.31 examples/s]
Map:  12%|█▏        | 489/3968 [00:01<00:13, 260.31 examples/s]
Map:  12%|█▏        | 490/3968 [00:01<00:13, 260.35 examples/s]
Map:  13%|█▎        | 512/3968 [00:01<00:13, 262.97 examples/s]
Map:  13%|█▎        | 522/3968 [00:02<00:13, 253.09 examples/s]
Map:  13%|█▎        | 514/3968 [00:02<00:13, 253.29 examples/s]
Map:  13%|█▎        | 522/3968 [00:02<00:13, 255.51 examples/s]
Map:  13%|█▎        | 527/3968 [00:02<00:13, 255.85 examples/s]
Map:  13%|█▎        | 527/3968 [00:02<00:13, 254.83 examples/s]
Map:  13%|█▎        | 528/3968 [00:02<00:13, 254.72 examples/s]
Map:  14%|█▍        | 548/3968 [00:02<00:13, 253.63 examples/s]
Map:  14%|█▍        | 548/3968 [00:02<00:13, 254.62 examples/s]
Map:  14%|█▍        | 561/3968 [00:02<00:13, 250.88 examples/s]
Map:  14%|█▍        | 550/3968 [00:02<00:13, 244.25 examples/s]
Map:  14%|█▍        | 567/3968 [00:02<00:13, 254.77 examples/s]
Map:  14%|█▍        | 566/3968 [00:02<00:13, 252.43 examples/s]
Map:  14%|█▍        | 569/3968 [00:02<00:13, 258.69 examples/s]
Map:  15%|█▍        | 576/3968 [00:02<00:13, 256.70 examples/s]
Map:  14%|█▍        | 575/3968 [00:02<00:13, 255.19 examples/s]
Map:  15%|█▍        | 590/3968 [00:02<00:13, 254.49 examples/s]
Map:  15%|█▍        | 579/3968 [00:02<00:13, 252.62 examples/s]
Map:  15%|█▍        | 592/3968 [00:02<00:13, 253.87 examples/s]
Map:  15%|█▍        | 594/3968 [00:02<00:13, 254.69 examples/s]
Map:  15%|█▌        | 604/3968 [00:02<00:12, 260.96 examples/s]
Map:  15%|█▍        | 595/3968 [00:02<00:13, 256.11 examples/s]
Map:  15%|█▌        | 604/3968 [00:02<00:12, 259.88 examples/s]
Map:  15%|█▌        | 606/3968 [00:02<00:13, 254.93 examples/s]
Map:  16%|█▌        | 618/3968 [00:02<00:13, 253.08 examples/s]
Map:  16%|█▌        | 620/3968 [00:02<00:13, 253.08 examples/s]
Map:  16%|█▌        | 621/3968 [00:02<00:13, 254.30 examples/s]
Map:  16%|█▌        | 630/3968 [00:02<00:13, 253.06 examples/s]
Map:  16%|█▌        | 643/3968 [00:02<00:13, 255.67 examples/s]
Map:  16%|█▌        | 643/3968 [00:02<00:13, 253.84 examples/s]
Map:  16%|█▋        | 649/3968 [00:02<00:13, 253.47 examples/s]
Map:  16%|█▌        | 643/3968 [00:02<00:13, 248.96 examples/s]
Map:  17%|█▋        | 656/3968 [00:02<00:13, 251.19 examples/s]
Map:  17%|█▋        | 655/3968 [00:02<00:13, 249.31 examples/s]
Map:  17%|█▋        | 659/3968 [00:02<00:13, 250.61 examples/s]
Map:  17%|█▋        | 680/3968 [00:02<00:13, 248.54 examples/s]
Map:  17%|█▋        | 680/3968 [00:02<00:13, 246.49 examples/s]
Map:  17%|█▋        | 686/3968 [00:02<00:13, 246.99 examples/s]
Map:  17%|█▋        | 694/3968 [00:02<00:13, 247.59 examples/s]
Map:  17%|█▋        | 680/3968 [00:02<00:13, 242.09 examples/s]
Map:  17%|█▋        | 694/3968 [00:02<00:13, 246.02 examples/s]
Map:  18%|█▊        | 709/3968 [00:02<00:12, 257.88 examples/s]
Map:  18%|█▊        | 698/3968 [00:02<00:13, 249.39 examples/s]
Map:  18%|█▊        | 709/3968 [00:02<00:12, 255.88 examples/s]
Map:  18%|█▊        | 713/3968 [00:02<00:12, 251.50 examples/s]
Map:  18%|█▊        | 720/3968 [00:02<00:13, 249.12 examples/s]
Map:  18%|█▊        | 709/3968 [00:02<00:12, 251.22 examples/s]
Map:  18%|█▊        | 720/3968 [00:02<00:13, 247.64 examples/s]
Map:  18%|█▊        | 726/3968 [00:02<00:12, 251.78 examples/s]
Map:  19%|█▉        | 750/3968 [00:02<00:12, 257.53 examples/s]
Map:  19%|█▊        | 741/3968 [00:02<00:12, 256.43 examples/s]
Map:  19%|█▉        | 749/3968 [00:02<00:12, 257.50 examples/s]
Map:  19%|█▉        | 749/3968 [00:02<00:12, 257.68 examples/s]
Map:  19%|█▉        | 749/3968 [00:02<00:12, 256.02 examples/s]
Map:  19%|█▉        | 755/3968 [00:02<00:12, 257.89 examples/s]
Map:  19%|█▉        | 748/3968 [00:02<00:12, 251.70 examples/s]
Map:  20%|█▉        | 778/3968 [00:03<00:12, 259.78 examples/s]
Map:  19%|█▉        | 771/3968 [00:03<00:12, 263.13 examples/s]
Map:  20%|█▉        | 777/3968 [00:03<00:12, 258.83 examples/s]
Map:  20%|█▉        | 777/3968 [00:03<00:12, 259.52 examples/s]
Map:  20%|█▉        | 777/3968 [00:03<00:12, 258.04 examples/s]
Map:  20%|█▉        | 775/3968 [00:03<00:12, 255.01 examples/s]
Map:  20%|█▉        | 781/3968 [00:03<00:12, 254.11 examples/s]
Map:  20%|██        | 807/3968 [00:03<00:12, 251.35 examples/s]
Map:  20%|██        | 811/3968 [00:03<00:12, 244.23 examples/s]
Map:  21%|██        | 816/3968 [00:03<00:12, 250.97 examples/s]
Map:  20%|██        | 811/3968 [00:03<00:12, 246.60 examples/s]
Map:  20%|██        | 811/3968 [00:03<00:12, 243.69 examples/s]
Map:  20%|██        | 809/3968 [00:03<00:12, 244.00 examples/s]
Map:  21%|██        | 818/3968 [00:03<00:12, 249.15 examples/s]
Map:  21%|██        | 839/3968 [00:03<00:12, 250.73 examples/s]
Map:  21%|██        | 839/3968 [00:03<00:12, 253.02 examples/s]
Map:  21%|██▏       | 844/3968 [00:03<00:12, 254.03 examples/s]
Map:  21%|██        | 834/3968 [00:03<00:12, 251.01 examples/s]
Map:  21%|██        | 839/3968 [00:03<00:12, 250.18 examples/s]
Map:  21%|██        | 835/3968 [00:03<00:12, 245.23 examples/s]
Map:  21%|██▏       | 844/3968 [00:03<00:12, 250.58 examples/s]
Map:  22%|██▏       | 865/3968 [00:03<00:12, 252.08 examples/s]
Map:  22%|██▏       | 865/3968 [00:03<00:12, 249.38 examples/s]
Map:  22%|██▏       | 870/3968 [00:03<00:12, 253.61 examples/s]
Map:  22%|██▏       | 860/3968 [00:03<00:12, 250.61 examples/s]
Map:  22%|██▏       | 865/3968 [00:03<00:12, 248.73 examples/s]
Map:  22%|██▏       | 860/3968 [00:03<00:12, 243.90 examples/s]
Map:  22%|██▏       | 870/3968 [00:03<00:12, 250.10 examples/s]
Map:  22%|██▏       | 888/3968 [00:03<00:11, 257.49 examples/s]
Map:  23%|██▎       | 895/3968 [00:03<00:11, 259.82 examples/s]
Map:  23%|██▎       | 900/3968 [00:03<00:11, 261.09 examples/s]
Map:  23%|██▎       | 895/3968 [00:03<00:11, 257.10 examples/s]
Map:  22%|██▏       | 888/3968 [00:03<00:12, 251.65 examples/s]
Map:  23%|██▎       | 895/3968 [00:03<00:11, 256.59 examples/s]
Map:  23%|██▎       | 900/3968 [00:03<00:11, 257.80 examples/s]
Map:  23%|██▎       | 915/3968 [00:03<00:11, 257.23 examples/s]
Map:  23%|██▎       | 927/3968 [00:03<00:11, 260.15 examples/s]
Map:  23%|██▎       | 923/3968 [00:03<00:11, 259.92 examples/s]
Map:  23%|██▎       | 923/3968 [00:03<00:11, 257.37 examples/s]
Map:  23%|██▎       | 915/3968 [00:03<00:12, 252.18 examples/s]
Map:  23%|██▎       | 923/3968 [00:03<00:11, 256.61 examples/s]
Map:  23%|██▎       | 926/3968 [00:03<00:11, 255.46 examples/s]
Map:  24%|██▎       | 941/3968 [00:03<00:11, 255.19 examples/s]
Map:  24%|██▍       | 954/3968 [00:03<00:11, 259.39 examples/s]
Map:  24%|██▍       | 950/3968 [00:03<00:11, 258.32 examples/s]
Map:  24%|██▍       | 950/3968 [00:03<00:11, 255.69 examples/s]
Map:  24%|██▍       | 950/3968 [00:03<00:11, 255.18 examples/s]
Map:  24%|██▍       | 954/3968 [00:03<00:11, 254.85 examples/s]
Map:  24%|██▍       | 954/3968 [00:03<00:12, 249.61 examples/s]
Map:  24%|██▍       | 969/3968 [00:03<00:11, 258.71 examples/s]
Map:  25%|██▍       | 981/3968 [00:03<00:11, 260.45 examples/s]
Map:  25%|██▍       | 978/3968 [00:03<00:11, 259.57 examples/s]
Map:  25%|██▍       | 978/3968 [00:03<00:11, 257.00 examples/s]
Map:  25%|██▍       | 978/3968 [00:03<00:11, 256.81 examples/s]
Map:  25%|██▍       | 981/3968 [00:03<00:11, 255.74 examples/s]
Map:  25%|██▍       | 980/3968 [00:03<00:11, 251.30 examples/s]
Map:  25%|██▌       | 995/3968 [00:03<00:11, 255.85 examples/s]
Map:  25%|██▌       | 1008/3968 [00:03<00:11, 259.11 examples/s]
Map:  25%|██▌       | 1005/3968 [00:03<00:11, 258.04 examples/s]
Map:  25%|██▌       | 1005/3968 [00:03<00:11, 255.08 examples/s]
Map:  25%|██▌       | 1007/3968 [00:03<00:11, 253.69 examples/s]
Map:  25%|██▌       | 1005/3968 [00:03<00:11, 255.16 examples/s]
Map:  25%|██▌       | 1006/3968 [00:04<00:11, 250.11 examples/s]
Map:  26%|██▌       | 1034/3968 [00:04<00:11, 253.20 examples/s]
Map:  26%|██▌       | 1031/3968 [00:04<00:11, 246.21 examples/s]
Map:  26%|██▋       | 1043/3968 [00:04<00:11, 249.78 examples/s]
Map:  26%|██▌       | 1040/3968 [00:04<00:11, 244.62 examples/s]
Map:  26%|██▌       | 1040/3968 [00:04<00:11, 245.14 examples/s]
Map:  26%|██▋       | 1045/3968 [00:04<00:11, 248.35 examples/s]
Map:  27%|██▋       | 1058/3968 [00:04<00:11, 249.25 examples/s]
Map:  26%|██▋       | 1043/3968 [00:04<00:12, 242.29 examples/s]
Map:  27%|██▋       | 1072/3968 [00:04<00:11, 251.04 examples/s]
Map:  27%|██▋       | 1067/3968 [00:04<00:11, 248.39 examples/s]
Map:  27%|██▋       | 1070/3968 [00:04<00:11, 248.83 examples/s]
Map:  27%|██▋       | 1067/3968 [00:04<00:11, 248.50 examples/s]
Map:  27%|██▋       | 1070/3968 [00:04<00:11, 246.17 examples/s]
Map:  27%|██▋       | 1084/3968 [00:04<00:11, 250.59 examples/s]
Map:  27%|██▋       | 1068/3968 [00:04<00:11, 242.94 examples/s]
Map:  28%|██▊       | 1100/3968 [00:04<00:11, 254.20 examples/s]
Map:  28%|██▊       | 1094/3968 [00:04<00:11, 251.55 examples/s]
Map:  28%|██▊       | 1098/3968 [00:04<00:11, 255.09 examples/s]
Map:  28%|██▊       | 1094/3968 [00:04<00:11, 250.99 examples/s]
Map:  28%|██▊       | 1097/3968 [00:04<00:11, 251.83 examples/s]
Map:  28%|██▊       | 1110/3968 [00:04<00:11, 250.79 examples/s]
Map:  28%|██▊       | 1095/3968 [00:04<00:11, 246.88 examples/s]
Map:  28%|██▊       | 1128/3968 [00:04<00:10, 258.92 examples/s]
Map:  28%|██▊       | 1120/3968 [00:04<00:11, 249.25 examples/s]
Map:  28%|██▊       | 1124/3968 [00:04<00:11, 253.93 examples/s]
Map:  28%|██▊       | 1120/3968 [00:04<00:11, 248.83 examples/s]
Map:  28%|██▊       | 1124/3968 [00:04<00:11, 251.58 examples/s]
Map:  29%|██▊       | 1139/3968 [00:04<00:10, 259.21 examples/s]
Map:  28%|██▊       | 1120/3968 [00:04<00:11, 243.96 examples/s]
Map:  29%|██▉       | 1158/3968 [00:04<00:10, 267.67 examples/s]
Map:  29%|██▉       | 1154/3968 [00:04<00:10, 263.68 examples/s]
Map:  29%|██▉       | 1150/3968 [00:04<00:10, 259.32 examples/s]
Map:  29%|██▉       | 1153/3968 [00:04<00:10, 260.86 examples/s]
Map:  29%|██▉       | 1150/3968 [00:04<00:10, 258.71 examples/s]
Map:  29%|██▉       | 1166/3968 [00:04<00:10, 261.56 examples/s]
Map:  29%|██▉       | 1147/3968 [00:04<00:11, 249.08 examples/s]
Map:  30%|██▉       | 1186/3968 [00:04<00:10, 269.64 examples/s]
Map:  30%|██▉       | 1182/3968 [00:04<00:10, 266.15 examples/s]
Map:  30%|██▉       | 1179/3968 [00:04<00:10, 265.79 examples/s]
Map:  30%|██▉       | 1180/3968 [00:04<00:10, 262.18 examples/s]
Map:  30%|██▉       | 1179/3968 [00:04<00:10, 265.21 examples/s]
Map:  30%|███       | 1196/3968 [00:04<00:10, 270.36 examples/s]
Map:  30%|██▉       | 1175/3968 [00:04<00:11, 252.85 examples/s]
Map:  31%|███       | 1214/3968 [00:04<00:10, 266.80 examples/s]
Map:  30%|███       | 1207/3968 [00:04<00:10, 265.74 examples/s]
Map:  30%|███       | 1210/3968 [00:04<00:10, 264.95 examples/s]
Map:  30%|███       | 1207/3968 [00:04<00:10, 266.37 examples/s]
Map:  30%|███       | 1210/3968 [00:04<00:10, 262.37 examples/s]
Map:  31%|███       | 1225/3968 [00:04<00:10, 267.46 examples/s]
Map:  30%|███       | 1204/3968 [00:04<00:10, 260.83 examples/s]
Map:  31%|███       | 1238/3968 [00:04<00:10, 265.32 examples/s]
Map:  32%|███▏      | 1254/3968 [00:04<00:10, 263.61 examples/s]
Map:  31%|███       | 1238/3968 [00:04<00:10, 262.91 examples/s]
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:44,966 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
Map:  31%|███▏      | 1247/3968 [00:04<00:10, 260.64 examples/s]
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:44,990 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:45,008 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
Map:  31%|███▏      | 1247/3968 [00:04<00:10, 261.10 examples/s]
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:45,024 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:45,038 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:45,050 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
Map:  32%|███▏      | 1264/3968 [00:04<00:10, 258.61 examples/s]
Map:  31%|███▏      | 1244/3968 [00:04<00:10, 258.14 examples/s]
Map:  32%|███▏      | 1281/3968 [00:04<00:10, 261.82 examples/s]
Map:  32%|███▏      | 1279/3968 [00:04<00:10, 264.52 examples/s]
[WARNING|tokenization_utils_base.py:4119] 2025-04-11 17:48:45,109 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2078 > 2048). Running this sequence through the model will result in indexing errors
Map:  32%|███▏      | 1277/3968 [00:05<00:10, 260.11 examples/s]
Map:  32%|███▏      | 1289/3968 [00:05<00:10, 263.36 examples/s]
Map:  33%|███▎      | 1292/3968 [00:05<00:10, 261.63 examples/s]
Map:  32%|███▏      | 1270/3968 [00:05<00:10, 250.13 examples/s]
Map:  32%|███▏      | 1289/3968 [00:05<00:10, 262.58 examples/s]
Map:  33%|███▎      | 1317/3968 [00:05<00:10, 248.95 examples/s]
Map:  33%|███▎      | 1314/3968 [00:05<00:10, 249.29 examples/s]
Map:  33%|███▎      | 1311/3968 [00:05<00:10, 246.87 examples/s]
Map:  33%|███▎      | 1298/3968 [00:05<00:10, 255.08 examples/s]
Map:  33%|███▎      | 1326/3968 [00:05<00:10, 249.24 examples/s]
Map:  33%|███▎      | 1329/3968 [00:05<00:10, 251.38 examples/s]
Map:  34%|███▍      | 1343/3968 [00:05<00:10, 249.30 examples/s]
Map:  33%|███▎      | 1326/3968 [00:05<00:10, 249.42 examples/s]
Map:  34%|███▍      | 1341/3968 [00:05<00:10, 251.33 examples/s]
Map:  34%|███▎      | 1339/3968 [00:05<00:10, 252.50 examples/s]
Map:  34%|███▍      | 1355/3968 [00:05<00:10, 249.50 examples/s]
Map:  35%|███▍      | 1372/3968 [00:05<00:10, 255.26 examples/s]
Map:  34%|███▎      | 1334/3968 [00:05<00:10, 245.42 examples/s]
Map:  35%|███▍      | 1370/3968 [00:05<00:10, 258.34 examples/s]
Map:  34%|███▍      | 1367/3968 [00:05<00:10, 253.16 examples/s]
Map:  34%|███▍      | 1367/3968 [00:05<00:10, 256.76 examples/s]
Map:  34%|███▍      | 1367/3968 [00:05<00:10, 252.81 examples/s]
Map:  35%|███▍      | 1381/3968 [00:05<00:10, 247.66 examples/s]
Map:  35%|███▌      | 1398/3968 [00:05<00:10, 252.02 examples/s]
Map:  34%|███▍      | 1360/3968 [00:05<00:10, 246.15 examples/s]
Map:  35%|███▌      | 1407/3968 [00:05<00:10, 249.25 examples/s]
Map:  35%|███▌      | 1402/3968 [00:05<00:10, 244.95 examples/s]
Map:  35%|███▌      | 1402/3968 [00:05<00:10, 246.03 examples/s]
Map:  35%|███▍      | 1385/3968 [00:05<00:10, 243.90 examples/s]
Map:  35%|███▌      | 1402/3968 [00:05<00:10, 244.78 examples/s]
Map:  36%|███▌      | 1417/3968 [00:05<00:10, 244.14 examples/s]
Map:  36%|███▌      | 1437/3968 [00:05<00:10, 250.93 examples/s]
Map:  36%|███▌      | 1427/3968 [00:05<00:10, 245.10 examples/s]
Map:  36%|███▌      | 1434/3968 [00:05<00:10, 249.07 examples/s]
Map:  36%|███▌      | 1427/3968 [00:05<00:10, 246.19 examples/s]
Map:  36%|███▌      | 1427/3968 [00:05<00:10, 245.10 examples/s]
Map:  36%|███▋      | 1442/3968 [00:05<00:10, 243.59 examples/s]
Map:  36%|███▌      | 1422/3968 [00:05<00:10, 241.63 examples/s]
Map:  37%|███▋      | 1460/3968 [00:05<00:10, 247.86 examples/s]
Map:  37%|███▋      | 1454/3968 [00:05<00:10, 243.70 examples/s]
Map:  37%|███▋      | 1477/3968 [00:05<00:09, 250.33 examples/s]
Map:  37%|███▋      | 1454/3968 [00:05<00:10, 243.71 examples/s]
Map:  37%|███▋      | 1463/3968 [00:05<00:10, 242.43 examples/s]
Map:  37%|███▋      | 1470/3968 [00:05<00:10, 247.68 examples/s]
Map:  36%|███▋      | 1447/3968 [00:05<00:10, 241.69 examples/s]
Map:  37%|███▋      | 1487/3968 [00:05<00:09, 252.05 examples/s]
Map:  37%|███▋      | 1480/3968 [00:05<00:10, 246.40 examples/s]
Map:  37%|███▋      | 1480/3968 [00:05<00:10, 245.85 examples/s]
Map:  38%|███▊      | 1490/3968 [00:05<00:10, 244.51 examples/s]
Map:  38%|███▊      | 1495/3968 [00:05<00:10, 246.75 examples/s]
Map:  37%|███▋      | 1474/3968 [00:05<00:10, 244.49 examples/s]
Map:  38%|███▊      | 1515/3968 [00:05<00:09, 245.68 examples/s]
Map:  38%|███▊      | 1505/3968 [00:05<00:10, 246.03 examples/s]
Map:  38%|███▊      | 1513/3968 [00:05<00:09, 245.52 examples/s]
Map:  38%|███▊      | 1505/3968 [00:05<00:10, 245.47 examples/s]
Map:  38%|███▊      | 1520/3968 [00:06<00:10, 242.16 examples/s]
Map:  38%|███▊      | 1499/3968 [00:06<00:10, 243.68 examples/s]
Map:  39%|███▊      | 1530/3968 [00:06<00:10, 239.70 examples/s]
Map:  38%|███▊      | 1526/3968 [00:06<00:10, 240.88 examples/s]
Map:  39%|███▉      | 1551/3968 [00:06<00:10, 241.52 examples/s]
Map:  39%|███▊      | 1530/3968 [00:06<00:10, 239.16 examples/s]
Map:  39%|███▉      | 1550/3968 [00:06<00:10, 239.47 examples/s]
Map:  39%|███▉      | 1545/3968 [00:06<00:10, 238.95 examples/s]
Map:  39%|███▉      | 1555/3968 [00:06<00:10, 237.54 examples/s]
Map:  39%|███▊      | 1534/3968 [00:06<00:10, 234.99 examples/s]
Map:  39%|███▉      | 1555/3968 [00:06<00:10, 236.90 examples/s]
Map:  39%|███▉      | 1561/3968 [00:06<00:10, 233.19 examples/s]
Map:  40%|███▉      | 1569/3968 [00:06<00:10, 234.99 examples/s]
Map:  40%|███▉      | 1585/3968 [00:06<00:10, 235.31 examples/s]
Map:  40%|███▉      | 1584/3968 [00:06<00:10, 233.11 examples/s]
Map:  39%|███▉      | 1558/3968 [00:06<00:10, 233.59 examples/s]
Map:  40%|████      | 1590/3968 [00:06<00:10, 230.40 examples/s]
Map:  41%|████      | 1609/3968 [00:06<00:10, 234.50 examples/s]
Map:  40%|████      | 1593/3968 [00:06<00:10, 229.57 examples/s]
Map:  40%|████      | 1590/3968 [00:06<00:10, 229.53 examples/s]
Map:  41%|████      | 1609/3968 [00:06<00:10, 233.08 examples/s]
Map:  40%|████      | 1596/3968 [00:06<00:10, 230.81 examples/s]
Map:  41%|████      | 1633/3968 [00:06<00:10, 231.89 examples/s]
Map:  41%|████      | 1616/3968 [00:06<00:10, 227.17 examples/s]
Map:  40%|████      | 1593/3968 [00:06<00:10, 225.56 examples/s]
Map:  41%|████      | 1626/3968 [00:06<00:10, 229.70 examples/s]
Map:  41%|████      | 1633/3968 [00:06<00:10, 230.23 examples/s]
Map:  41%|████      | 1625/3968 [00:06<00:10, 227.48 examples/s]
Map:  41%|████      | 1630/3968 [00:06<00:10, 225.45 examples/s]
Map:  42%|████▏     | 1658/3968 [00:06<00:09, 232.29 examples/s]
Map:  41%|████▏     | 1643/3968 [00:06<00:10, 228.10 examples/s]
Map:  42%|████▏     | 1658/3968 [00:06<00:10, 230.58 examples/s]
Map:  41%|████      | 1629/3968 [00:06<00:10, 228.23 examples/s]
Map:  42%|████▏     | 1662/3968 [00:06<00:10, 229.30 examples/s]
Map:  42%|████▏     | 1669/3968 [00:06<00:09, 235.12 examples/s]
Map:  42%|████▏     | 1660/3968 [00:06<00:10, 225.88 examples/s]
Map:  42%|████▏     | 1666/3968 [00:06<00:10, 229.16 examples/s]
Map:  43%|████▎     | 1694/3968 [00:06<00:09, 232.58 examples/s]
Map:  43%|████▎     | 1694/3968 [00:06<00:09, 231.59 examples/s]
Map:  42%|████▏     | 1665/3968 [00:06<00:10, 226.98 examples/s]
Map:  42%|████▏     | 1684/3968 [00:06<00:10, 225.38 examples/s]
Map:  43%|████▎     | 1700/3968 [00:06<00:09, 231.17 examples/s]
Map:  43%|████▎     | 1720/3968 [00:06<00:09, 236.96 examples/s]
Map:  43%|████▎     | 1704/3968 [00:06<00:09, 232.79 examples/s]
Map:  43%|████▎     | 1701/3968 [00:06<00:09, 229.53 examples/s]
Map:  43%|████▎     | 1720/3968 [00:06<00:09, 236.15 examples/s]
Map:  43%|████▎     | 1710/3968 [00:06<00:09, 229.20 examples/s]
Map:  43%|████▎     | 1724/3968 [00:06<00:09, 231.79 examples/s]
Map:  44%|████▍     | 1747/3968 [00:06<00:09, 243.59 examples/s]
Map:  43%|████▎     | 1701/3968 [00:06<00:09, 228.36 examples/s]
Map:  44%|████▎     | 1730/3968 [00:06<00:09, 235.20 examples/s]
Map:  44%|████▍     | 1747/3968 [00:06<00:09, 243.21 examples/s]
Map:  43%|████▎     | 1725/3968 [00:06<00:09, 229.52 examples/s]
Map:  44%|████▍     | 1738/3968 [00:06<00:09, 239.57 examples/s]
Map:  44%|████▍     | 1750/3968 [00:07<00:09, 237.57 examples/s]
Map:  43%|████▎     | 1725/3968 [00:07<00:09, 228.14 examples/s]
Map:  44%|████▍     | 1757/3968 [00:07<00:09, 241.57 examples/s]
Map:  44%|████▍     | 1753/3968 [00:07<00:09, 237.62 examples/s]
Map:  45%|████▌     | 1787/3968 [00:07<00:08, 248.88 examples/s]
Map:  45%|████▌     | 1787/3968 [00:07<00:08, 248.85 examples/s]
Map:  45%|████▍     | 1775/3968 [00:07<00:09, 239.13 examples/s]
Map:  44%|████▍     | 1753/3968 [00:07<00:09, 235.88 examples/s]
Map:  45%|████▍     | 1784/3968 [00:07<00:08, 247.14 examples/s]
Map:  45%|████▍     | 1775/3968 [00:07<00:09, 238.63 examples/s]
Map:  45%|████▍     | 1780/3968 [00:07<00:09, 241.46 examples/s]
Map:  46%|████▌     | 1814/3968 [00:07<00:08, 246.02 examples/s]
Map:  45%|████▌     | 1802/3968 [00:07<00:08, 245.55 examples/s]
Map:  46%|████▌     | 1814/3968 [00:07<00:08, 246.13 examples/s]
Map:  45%|████▍     | 1779/3968 [00:07<00:09, 240.78 examples/s]
Map:  46%|████▌     | 1810/3968 [00:07<00:08, 248.13 examples/s]
Map:  45%|████▌     | 1802/3968 [00:07<00:08, 244.69 examples/s]
Map:  46%|████▌     | 1807/3968 [00:07<00:08, 246.14 examples/s]
Map:  46%|████▋     | 1840/3968 [00:07<00:08, 245.93 examples/s]
Map:  46%|████▋     | 1840/3968 [00:07<00:08, 246.17 examples/s]
Map:  45%|████▌     | 1804/3968 [00:07<00:08, 242.53 examples/s]
Map:  46%|████▌     | 1827/3968 [00:07<00:08, 243.46 examples/s]
Map:  46%|████▋     | 1840/3968 [00:07<00:08, 244.78 examples/s]
Map:  46%|████▌     | 1832/3968 [00:07<00:08, 238.80 examples/s]
Map:  47%|████▋     | 1846/3968 [00:07<00:08, 242.79 examples/s]
Map:  47%|████▋     | 1877/3968 [00:07<00:08, 237.66 examples/s]
Map:  47%|████▋     | 1877/3968 [00:07<00:08, 237.72 examples/s]
Map:  46%|████▋     | 1840/3968 [00:07<00:08, 239.89 examples/s]
Map:  47%|████▋     | 1863/3968 [00:07<00:08, 234.76 examples/s]
Map:  47%|████▋     | 1876/3968 [00:07<00:08, 235.37 examples/s]
Map:  47%|████▋     | 1868/3968 [00:07<00:08, 235.60 examples/s]
Map:  47%|████▋     | 1881/3968 [00:07<00:08, 233.11 examples/s]
Map:  48%|████▊     | 1910/3968 [00:07<00:08, 228.75 examples/s]
Map:  48%|████▊     | 1910/3968 [00:07<00:08, 228.85 examples/s]
Map:  47%|████▋     | 1876/3968 [00:07<00:09, 231.61 examples/s]
Map:  48%|████▊     | 1906/3968 [00:07<00:08, 231.30 examples/s]
Map:  48%|████▊     | 1898/3968 [00:07<00:08, 231.17 examples/s]
Map:  48%|████▊     | 1910/3968 [00:07<00:09, 227.40 examples/s]
Map:  48%|████▊     | 1903/3968 [00:07<00:09, 227.97 examples/s]
Map:  49%|████▉     | 1936/3968 [00:07<00:08, 234.69 examples/s]
Map:  49%|████▉     | 1936/3968 [00:07<00:08, 234.83 examples/s]
Map:  49%|████▊     | 1932/3968 [00:07<00:08, 232.33 examples/s]
Map:  48%|████▊     | 1924/3968 [00:07<00:08, 233.07 examples/s]
Map:  49%|████▉     | 1936/3968 [00:07<00:08, 232.99 examples/s]
Map:  49%|████▊     | 1929/3968 [00:07<00:08, 232.23 examples/s]
Map:  49%|████▉     | 1964/3968 [00:07<00:08, 242.64 examples/s]
Map:  48%|████▊     | 1910/3968 [00:07<00:09, 224.58 examples/s]
Map:  49%|████▉     | 1964/3968 [00:07<00:08, 243.01 examples/s]
Map:  49%|████▉     | 1960/3968 [00:07<00:08, 241.89 examples/s]
Map:  49%|████▉     | 1948/3968 [00:07<00:08, 232.97 examples/s]
Map:  49%|████▉     | 1964/3968 [00:07<00:08, 240.65 examples/s]
Map:  49%|████▉     | 1953/3968 [00:07<00:08, 230.88 examples/s]
Map:  49%|████▉     | 1936/3968 [00:07<00:08, 230.39 examples/s]
Map:  50%|█████     | 1998/3968 [00:07<00:08, 233.89 examples/s]
Map:  50%|████▉     | 1975/3968 [00:07<00:08, 239.03 examples/s]
Map:  50%|█████     | 1998/3968 [00:07<00:08, 234.41 examples/s]
Map:  50%|████▉     | 1980/3968 [00:08<00:08, 233.71 examples/s]
Map:  49%|████▉     | 1964/3968 [00:08<00:08, 237.73 examples/s]
Map:  50%|█████     | 1995/3968 [00:08<00:08, 234.65 examples/s]
Map:  50%|█████     | 1997/3968 [00:08<00:08, 231.39 examples/s]
Map:  51%|█████     | 2023/3968 [00:08<00:08, 236.65 examples/s]
Map:  51%|█████     | 2023/3968 [00:08<00:08, 237.33 examples/s]
Map:  51%|█████     | 2020/3968 [00:08<00:08, 237.76 examples/s]
Map:  51%|█████     | 2021/3968 [00:08<00:08, 232.98 examples/s]
Map:  51%|█████     | 2010/3968 [00:08<00:08, 232.25 examples/s]
Map:  52%|█████▏    | 2049/3968 [00:08<00:07, 241.25 examples/s]
Map:  51%|█████     | 2016/3968 [00:08<00:08, 229.42 examples/s]
Map:  50%|█████     | 1997/3968 [00:08<00:08, 229.67 examples/s]
Map:  52%|█████▏    | 2049/3968 [00:08<00:07, 241.74 examples/s]
Map:  52%|█████▏    | 2045/3968 [00:08<00:08, 239.40 examples/s]
Map:  52%|█████▏    | 2046/3968 [00:08<00:08, 235.49 examples/s]
Map:  51%|█████▏    | 2036/3968 [00:08<00:08, 238.01 examples/s]
Map:  52%|█████▏    | 2075/3968 [00:08<00:07, 244.87 examples/s]
Map:  51%|█████▏    | 2041/3968 [00:08<00:08, 233.00 examples/s]
Map:  51%|█████     | 2021/3968 [00:08<00:08, 231.68 examples/s]
Map:  52%|█████▏    | 2075/3968 [00:08<00:07, 245.28 examples/s]
Map:  52%|█████▏    | 2073/3968 [00:08<00:07, 244.85 examples/s]
Map:  52%|█████▏    | 2073/3968 [00:08<00:07, 241.32 examples/s]
Map:  52%|█████▏    | 2062/3968 [00:08<00:07, 240.87 examples/s]
Map:  52%|█████▏    | 2069/3968 [00:08<00:07, 241.75 examples/s]
Map:  53%|█████▎    | 2102/3968 [00:08<00:07, 245.01 examples/s]
Map:  53%|█████▎    | 2102/3968 [00:08<00:07, 245.24 examples/s]
Map:  52%|█████▏    | 2058/3968 [00:08<00:08, 234.82 examples/s]
Map:  53%|█████▎    | 2100/3968 [00:08<00:07, 248.49 examples/s]
Map:  53%|█████▎    | 2100/3968 [00:08<00:07, 245.43 examples/s]
Map:  53%|█████▎    | 2090/3968 [00:08<00:07, 243.97 examples/s]
Map:  53%|█████▎    | 2094/3968 [00:08<00:07, 241.52 examples/s]
Map:  54%|█████▎    | 2130/3968 [00:08<00:07, 251.24 examples/s]
Map:  54%|█████▎    | 2130/3968 [00:08<00:07, 251.30 examples/s]
Map:  53%|█████▎    | 2085/3968 [00:08<00:07, 240.54 examples/s]
Map:  54%|█████▎    | 2127/3968 [00:08<00:07, 251.66 examples/s]
Map:  54%|█████▎    | 2127/3968 [00:08<00:07, 248.74 examples/s]
Map:  53%|█████▎    | 2115/3968 [00:08<00:07, 244.59 examples/s]
Map:  53%|█████▎    | 2120/3968 [00:08<00:07, 242.57 examples/s]
Map:  55%|█████▍    | 2167/3968 [00:08<00:07, 247.22 examples/s]
Map:  55%|█████▍    | 2167/3968 [00:08<00:07, 247.05 examples/s]
Map:  54%|█████▍    | 2141/3968 [00:08<00:07, 241.29 examples/s]
Map:  53%|█████▎    | 2121/3968 [00:08<00:07, 236.96 examples/s]
Map:  54%|█████▍    | 2145/3968 [00:08<00:07, 242.39 examples/s]
Map:  55%|█████▍    | 2163/3968 [00:08<00:07, 243.40 examples/s]
Map:  54%|█████▍    | 2162/3968 [00:08<00:07, 241.08 examples/s]
Map:  55%|█████▍    | 2167/3968 [00:08<00:07, 243.14 examples/s]
Map:  56%|█████▌    | 2204/3968 [00:08<00:07, 245.44 examples/s]
Map:  54%|█████▍    | 2145/3968 [00:08<00:07, 236.65 examples/s]
Map:  55%|█████▍    | 2170/3968 [00:08<00:07, 243.38 examples/s]
Map:  55%|█████▌    | 2189/3968 [00:08<00:07, 246.04 examples/s]
Map:  55%|█████▌    | 2188/3968 [00:08<00:07, 244.33 examples/s]
Map:  56%|█████▌    | 2205/3968 [00:08<00:07, 247.65 examples/s]
Map:  56%|█████▌    | 2230/3968 [00:08<00:07, 247.47 examples/s]
Map:  55%|█████▍    | 2170/3968 [00:08<00:07, 237.98 examples/s]
Map:  55%|█████▌    | 2195/3968 [00:08<00:07, 243.37 examples/s]
Map:  56%|█████▌    | 2215/3968 [00:08<00:07, 247.68 examples/s]
Map:  56%|█████▌    | 2216/3968 [00:08<00:07, 247.81 examples/s]
Map:  56%|█████▋    | 2233/3968 [00:08<00:06, 248.72 examples/s]
Map:  56%|█████▌    | 2205/3968 [00:08<00:07, 243.42 examples/s]
Map:  55%|█████▌    | 2194/3968 [00:09<00:07, 237.81 examples/s]
Map:  56%|█████▌    | 2221/3968 [00:09<00:07, 246.92 examples/s]
Map:  57%|█████▋    | 2257/3968 [00:09<00:06, 247.70 examples/s]
Map:  56%|█████▋    | 2241/3968 [00:09<00:06, 247.16 examples/s]
Map:  57%|█████▋    | 2244/3968 [00:09<00:06, 249.71 examples/s]
Map:  57%|█████▋    | 2259/3968 [00:09<00:06, 247.43 examples/s]
Map:  56%|█████▌    | 2231/3968 [00:09<00:07, 245.54 examples/s]
Map:  56%|█████▌    | 2220/3968 [00:09<00:07, 242.13 examples/s]
Map:  57%|█████▋    | 2248/3968 [00:09<00:06, 251.03 examples/s]
Map:  57%|█████▋    | 2257/3968 [00:09<00:06, 244.44 examples/s]
Map:  58%|█████▊    | 2292/3968 [00:09<00:06, 241.81 examples/s]
Map:  57%|█████▋    | 2278/3968 [00:09<00:06, 244.00 examples/s]
Map:  58%|█████▊    | 2297/3968 [00:09<00:06, 246.89 examples/s]
Map:  57%|█████▋    | 2280/3968 [00:09<00:07, 240.61 examples/s]
Map:  57%|█████▋    | 2248/3968 [00:09<00:07, 244.67 examples/s]
Map:  59%|█████▊    | 2322/3968 [00:09<00:06, 254.63 examples/s]
Map:  58%|█████▊    | 2283/3968 [00:09<00:07, 240.03 examples/s]
Map:  58%|█████▊    | 2306/3968 [00:09<00:06, 248.19 examples/s]
Map:  58%|█████▊    | 2308/3968 [00:09<00:06, 249.79 examples/s]
Map:  59%|█████▊    | 2327/3968 [00:09<00:06, 256.61 examples/s]
Map:  58%|█████▊    | 2292/3968 [00:09<00:07, 238.42 examples/s]
Map:  58%|█████▊    | 2312/3968 [00:09<00:06, 249.80 examples/s]
Map:  59%|█████▉    | 2350/3968 [00:09<00:06, 255.79 examples/s]
Map:  58%|█████▊    | 2283/3968 [00:09<00:07, 234.90 examples/s]
Map:  59%|█████▉    | 2332/3968 [00:09<00:06, 249.27 examples/s]
Map:  59%|█████▉    | 2335/3968 [00:09<00:06, 252.06 examples/s]
Map:  59%|█████▊    | 2322/3968 [00:09<00:06, 251.50 examples/s]
Map:  59%|█████▉    | 2354/3968 [00:09<00:06, 255.55 examples/s]
Map:  58%|█████▊    | 2311/3968 [00:09<00:06, 244.90 examples/s]
Map:  59%|█████▉    | 2340/3968 [00:09<00:06, 251.22 examples/s]
Map:  59%|█████▉    | 2360/3968 [00:09<00:06, 249.88 examples/s]
Map:  59%|█████▉    | 2349/3968 [00:09<00:06, 254.36 examples/s]
Map:  60%|██████    | 2385/3968 [00:09<00:06, 245.92 examples/s]
Map:  60%|█████▉    | 2371/3968 [00:09<00:06, 244.39 examples/s]
Map:  60%|██████    | 2389/3968 [00:09<00:06, 244.01 examples/s]
Map:  59%|█████▉    | 2337/3968 [00:09<00:06, 247.90 examples/s]
Map:  60%|█████▉    | 2366/3968 [00:09<00:06, 250.85 examples/s]
Map:  61%|██████    | 2413/3968 [00:09<00:06, 250.72 examples/s]
Map:  60%|██████    | 2397/3968 [00:09<00:06, 246.06 examples/s]
Map:  60%|██████    | 2397/3968 [00:09<00:06, 246.86 examples/s]
Map:  61%|██████    | 2418/3968 [00:09<00:06, 254.08 examples/s]
Map:  60%|██████    | 2383/3968 [00:09<00:06, 239.86 examples/s]
Map:  61%|██████    | 2402/3968 [00:09<00:06, 246.27 examples/s]
Map:  60%|█████▉    | 2373/3968 [00:09<00:06, 241.43 examples/s]
Map:  61%|██████▏   | 2440/3968 [00:09<00:06, 249.51 examples/s]
Map:  61%|██████    | 2425/3968 [00:09<00:06, 251.24 examples/s]
Map:  61%|██████    | 2425/3968 [00:09<00:06, 252.06 examples/s]
Map:  61%|██████    | 2410/3968 [00:09<00:06, 246.77 examples/s]
Map:  62%|██████▏   | 2455/3968 [00:09<00:06, 250.39 examples/s]
Map:  61%|██████    | 2429/3968 [00:09<00:06, 251.83 examples/s]
Map: 100%|██████████| 3968/3968 [00:21<00:00, 182.51 examples/s]
Map: 100%|██████████| 3968/3968 [00:21<00:00, 182.49 examples/s]
Map: 100%|██████████| 3968/3968 [00:21<00:00, 180.69 examples/s]
Map: 100%|██████████| 3968/3968 [00:21<00:00, 180.64 examples/s]
Map: 100%|██████████| 3968/3968 [00:22<00:00, 179.03 examples/s]
Map: 100%|██████████| 3968/3968 [00:22<00:00, 14.42 examples/s] 
Map: 100%|██████████| 3968/3968 [00:22<00:00, 173.87 examples/s]
Map: 100%|██████████| 3968/3968 [00:22<00:00, 173.02 examples/s]
[2025-04-11 17:49:05,161] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-04-11 17:49:05,161] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.85 GB         Max_CA 6 GB 
[2025-04-11 17:49:05,162] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.94 GB, percent = 2.5%
[2025-04-11 17:49:05,272] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-11 17:49:05,272] [INFO] [utils.py:782:see_memory_usage] MA 3.74 GB         Max_MA 3.74 GB         CA 5.85 GB         Max_CA 6 GB 
[2025-04-11 17:49:05,272] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.94 GB, percent = 2.5%
[2025-04-11 17:49:05,386] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-11 17:49:05,387] [INFO] [utils.py:782:see_memory_usage] MA 7.48 GB         Max_MA 9.35 GB         CA 11.46 GB         Max_CA 11 GB 
[2025-04-11 17:49:05,387] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.94 GB, percent = 2.5%
[2025-04-11 17:49:05,496] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-11 17:49:05,497] [INFO] [utils.py:782:see_memory_usage] MA 7.48 GB         Max_MA 7.48 GB         CA 11.46 GB         Max_CA 11 GB 
[2025-04-11 17:49:05,497] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.94 GB, percent = 2.5%
[2025-04-11 17:49:05,609] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-11 17:49:05,610] [INFO] [utils.py:782:see_memory_usage] MA 7.48 GB         Max_MA 11.22 GB         CA 15.2 GB         Max_CA 15 GB 
[2025-04-11 17:49:05,610] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.98 GB, percent = 2.5%
[2025-04-11 17:49:05,610] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2025-04-11 17:49:06,568] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-11 17:49:06,569] [INFO] [utils.py:782:see_memory_usage] MA 10.28 GB         Max_MA 12.24 GB         CA 15.2 GB         Max_CA 15 GB 
[2025-04-11 17:49:06,569] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.95 GB, percent = 2.5%
[2025-04-11 17:49:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-04-11 17:49:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2025-04-11 17:49:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-11 17:49:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-11 17:49:06,570] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   amp_params ................... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7feb42558150>
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   dump_state ................... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2025-04-11 17:49:06,570] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   fp16_enabled ................. False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   global_rank .................. 0
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 16
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   pld_params ................... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   steps_per_print .............. inf
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   train_batch_size ............. 128
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  1
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   world_size ................... 8
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-11 17:49:06,571] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2025-04-11 17:49:06,572] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 128, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 16, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
[INFO|trainer.py:2134] 2025-04-11 17:49:06,573 >> ***** Running training *****
[INFO|trainer.py:2135] 2025-04-11 17:49:06,573 >>   Num examples = 3,968
[INFO|trainer.py:2136] 2025-04-11 17:49:06,573 >>   Num Epochs = 1
[INFO|trainer.py:2137] 2025-04-11 17:49:06,573 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2025-04-11 17:49:06,573 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2141] 2025-04-11 17:49:06,573 >>   Gradient Accumulation steps = 16
[INFO|trainer.py:2142] 2025-04-11 17:49:06,573 >>   Total optimization steps = 31
[INFO|trainer.py:2143] 2025-04-11 17:49:06,574 >>   Number of trainable parameters = 8,030,261,248
  0%|          | 0/31 [00:00<?, ?it/s]
[WARNING|modeling_utils.py:1239] 2025-04-11 17:49:13,044 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
[2025-04-11 17:50:30,496] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.6931, 'grad_norm': 11.571947441853833, 'learning_rate': 2e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -0.4396740198135376, 'logps/chosen': -0.445696085691452, 'logits/rejected': -0.10548911243677139, 'logits/chosen': -0.11830310523509979, 'epoch': 0.03}
  3%|▎         | 1/31 [01:23<41:53, 83.78s/it]
{'loss': 0.6931, 'grad_norm': 11.814248784272145, 'learning_rate': 4e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -0.5113828182220459, 'logps/chosen': -0.46749478578567505, 'logits/rejected': -0.0003231978043913841, 'logits/chosen': -0.033661894500255585, 'epoch': 0.06}
  6%|▋         | 2/31 [02:45<39:46, 82.31s/it]
 10%|▉         | 3/31 [04:10<38:58, 83.52s/it]
{'loss': 0.693, 'grad_norm': 12.108289494586579, 'learning_rate': 8e-07, 'rewards/chosen': -0.0021748943254351616, 'rewards/rejected': -0.0017772708088159561, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.0003976235166192055, 'logps/rejected': -0.4374706447124481, 'logps/chosen': -0.4366103410720825, 'logits/rejected': -0.10036322474479675, 'logits/chosen': -0.07559803128242493, 'epoch': 0.13}
 13%|█▎        | 4/31 [05:29<36:54, 82.03s/it]
 16%|█▌        | 5/31 [06:50<35:23, 81.68s/it]
{'loss': 0.692, 'grad_norm': 11.774541068568574, 'learning_rate': 7.892179482319295e-07, 'rewards/chosen': 0.0007398845627903938, 'rewards/rejected': 8.26176255941391e-05, 'rewards/accuracies': 0.5, 'rewards/margins': 0.0006572669371962547, 'logps/rejected': -0.4731212258338928, 'logps/chosen': -0.45992788672447205, 'logits/rejected': -0.07020770758390427, 'logits/chosen': -0.06707202643156052, 'epoch': 0.19}
 19%|█▉        | 6/31 [08:09<33:37, 80.72s/it]
 23%|██▎       | 7/31 [09:30<32:17, 80.72s/it]
{'loss': 0.6903, 'grad_norm': 12.159729262568094, 'learning_rate': 7.574530561293649e-07, 'rewards/chosen': 0.0006669433787465096, 'rewards/rejected': -0.004133470356464386, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.0048004137352108955, 'logps/rejected': -0.46462082862854004, 'logps/chosen': -0.42579203844070435, 'logits/rejected': -0.10205815732479095, 'logits/chosen': -0.08375360816717148, 'epoch': 0.26}
 26%|██▌       | 8/31 [10:49<30:46, 80.27s/it]
[2025-04-11 18:01:18,607] [WARNING] [stage3.py:2069:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 29%|██▉       | 9/31 [12:11<29:39, 80.87s/it]
{'loss': 0.6908, 'grad_norm': 12.07115180380848, 'learning_rate': 7.064177772475911e-07, 'rewards/chosen': -0.010402845218777657, 'rewards/rejected': -0.01703275367617607, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.006629908457398415, 'logps/rejected': -0.48226481676101685, 'logps/chosen': -0.4442792534828186, 'logits/rejected': -0.03188939392566681, 'logits/chosen': -0.09350089728832245, 'epoch': 0.32}
 32%|███▏      | 10/31 [13:32<28:14, 80.68s/it]
 35%|███▌      | 11/31 [14:54<27:01, 81.10s/it]
{'loss': 0.6873, 'grad_norm': 13.609149750534845, 'learning_rate': 6.388634366811144e-07, 'rewards/chosen': -0.007573310285806656, 'rewards/rejected': -0.02207978628575802, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.014506475999951363, 'logps/rejected': -0.45899519324302673, 'logps/chosen': -0.4435557723045349, 'logits/rejected': -0.10330042243003845, 'logits/chosen': -0.10750165581703186, 'epoch': 0.39}
 39%|███▊      | 12/31 [16:12<25:26, 80.37s/it]
 42%|████▏     | 13/31 [17:32<24:00, 80.05s/it]
{'loss': 0.6871, 'grad_norm': 11.904747388977652, 'learning_rate': 5.584319064156627e-07, 'rewards/chosen': -0.016268189996480942, 'rewards/rejected': -0.03192996233701706, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.015661772340536118, 'logps/rejected': -0.4714500308036804, 'logps/chosen': -0.4400345981121063, 'logits/rejected': -0.11986065655946732, 'logits/chosen': -0.12328911572694778, 'epoch': 0.45}
 45%|████▌     | 14/31 [18:53<22:46, 80.36s/it]
 48%|████▊     | 15/31 [20:15<21:34, 80.91s/it]
{'loss': 0.6794, 'grad_norm': 12.06039221763718, 'learning_rate': 4.694592710667722e-07, 'rewards/chosen': -0.0007799873128533363, 'rewards/rejected': -0.0452222116291523, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.04444222152233124, 'logps/rejected': -0.45941871404647827, 'logps/chosen': -0.41899749636650085, 'logits/rejected': -0.11979464441537857, 'logits/chosen': -0.13901682198047638, 'epoch': 0.52}
 52%|█████▏    | 16/31 [21:37<20:16, 81.13s/it]
 55%|█████▍    | 17/31 [22:57<18:51, 80.79s/it]
{'loss': 0.6809, 'grad_norm': 12.97110877588226, 'learning_rate': 3.767420684358097e-07, 'rewards/chosen': -0.04808980971574783, 'rewards/rejected': -0.05888858437538147, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.010798778384923935, 'logps/rejected': -0.47322365641593933, 'logps/chosen': -0.4313308894634247, 'logits/rejected': -0.1446417272090912, 'logits/chosen': -0.15913553535938263, 'epoch': 0.58}
 58%|█████▊    | 18/31 [24:19<17:37, 81.36s/it]
 61%|██████▏   | 19/31 [25:43<16:23, 81.93s/it]
{'loss': 0.6777, 'grad_norm': 12.429577743964057, 'learning_rate': 2.85278706915564e-07, 'rewards/chosen': -0.02877308987081051, 'rewards/rejected': -0.06035636365413666, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.031583271920681, 'logps/rejected': -0.4564496874809265, 'logps/chosen': -0.4108501076698303, 'logits/rejected': -0.07377966493368149, 'logits/chosen': -0.09046588093042374, 'epoch': 0.65}
 65%|██████▍   | 20/31 [27:04<15:00, 81.89s/it]
 68%|██████▊   | 21/31 [28:25<13:36, 81.63s/it]
{'loss': 0.6757, 'grad_norm': 12.910213641651245, 'learning_rate': 2.0000000000000007e-07, 'rewards/chosen': -0.03493615239858627, 'rewards/rejected': -0.08849434554576874, 'rewards/accuracies': 0.75, 'rewards/margins': 0.05355818569660187, 'logps/rejected': -0.4917394518852234, 'logps/chosen': -0.4436219334602356, 'logits/rejected': -0.16127079725265503, 'logits/chosen': -0.1341308206319809, 'epoch': 0.71}
 71%|███████   | 22/31 [29:48<12:18, 82.04s/it]
[2025-04-11 18:20:19,575] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 74%|███████▍  | 23/31 [31:12<11:01, 82.63s/it]
{'loss': 0.6756, 'grad_norm': 12.02554948532379, 'learning_rate': 1.255033448525066e-07, 'rewards/chosen': -0.04145585000514984, 'rewards/rejected': -0.06335633248090744, 'rewards/accuracies': 0.625, 'rewards/margins': 0.0219004824757576, 'logps/rejected': -0.44665923714637756, 'logps/chosen': -0.4466415047645569, 'logits/rejected': -0.16949886083602905, 'logits/chosen': -0.18938323855400085, 'epoch': 0.77}
 77%|███████▋  | 24/31 [32:32<09:32, 81.84s/it]
 81%|████████  | 25/31 [33:54<08:11, 81.88s/it]
{'loss': 0.6725, 'grad_norm': 12.86587373513768, 'learning_rate': 6.580487543482548e-08, 'rewards/chosen': -0.050307586789131165, 'rewards/rejected': -0.109750896692276, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.059443309903144836, 'logps/rejected': -0.4697876572608948, 'logps/chosen': -0.4494212865829468, 'logits/rejected': -0.15990367531776428, 'logits/chosen': -0.13489475846290588, 'epoch': 0.84}
 84%|████████▍ | 26/31 [35:14<06:46, 81.23s/it]
 87%|████████▋ | 27/31 [36:36<05:25, 81.33s/it]
{'loss': 0.6756, 'grad_norm': 12.758840836685154, 'learning_rate': 2.4122951685636674e-08, 'rewards/chosen': -0.052454784512519836, 'rewards/rejected': -0.07791034877300262, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.02545556053519249, 'logps/rejected': -0.4758801758289337, 'logps/chosen': -0.45228028297424316, 'logits/rejected': -0.13548368215560913, 'logits/chosen': -0.19097942113876343, 'epoch': 0.9}
 90%|█████████ | 28/31 [37:57<04:04, 81.37s/it]
 94%|█████████▎| 29/31 [39:17<02:41, 80.99s/it]
{'loss': 0.6677, 'grad_norm': 12.512910410056403, 'learning_rate': 2.7046569032227907e-09, 'rewards/chosen': -0.04884684085845947, 'rewards/rejected': -0.0953206941485405, 'rewards/accuracies': 0.75, 'rewards/margins': 0.04647385701537132, 'logps/rejected': -0.46256959438323975, 'logps/chosen': -0.4378519654273987, 'logits/rejected': -0.15362714231014252, 'logits/chosen': -0.16309575736522675, 'epoch': 0.97}
 97%|█████████▋| 30/31 [40:39<01:21, 81.31s/it]
100%|██████████| 31/31 [42:00<00:00, 81.04s/it]
[INFO|trainer.py:3503] 2025-04-11 18:31:16,471 >> Saving model checkpoint to /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31
[INFO|configuration_utils.py:472] 2025-04-11 18:31:16,473 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/config.json
[INFO|configuration_utils.py:807] 2025-04-11 18:31:16,473 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/generation_config.json
[INFO|modeling_utils.py:2773] 2025-04-11 18:31:32,297 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2025-04-11 18:31:32,300 >> tokenizer config file saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2025-04-11 18:31:32,300 >> Special tokens file saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/special_tokens_map.json
[2025-04-11 18:31:32,850] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step31 is about to be saved!
[2025-04-11 18:31:32,857] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-04-11 18:31:32,857] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-04-11 18:31:32,867] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-04-11 18:31:32,869] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-04-11 18:31:52,667] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-04-11 18:31:52,668] [INFO] [engine.py:3478:_save_zero_checkpoint] zero checkpoint saved /data/username/grafting/saves/llama3-8b/full/dpo_code/checkpoint-31/global_step31/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-04-11 18:31:52,706] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step31 is ready now!
[INFO|trainer.py:2394] 2025-04-11 18:31:52,710 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 2566.1369, 'train_samples_per_second': 1.546, 'train_steps_per_second': 0.012, 'train_loss': 0.6820766233628796, 'epoch': 1.0}
100%|██████████| 31/31 [42:45<00:00, 82.77s/it]
***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =     0.6821
  train_runtime            = 0:42:46.13
  train_samples            =       3968
  train_samples_per_second =      1.546
  train_steps_per_second   =      0.012
2025-04-11 18:31:52 - INFO - __main__ - *** Training complete ***
2025-04-11 18:31:52 - INFO - __main__ - *** Save model ***
[INFO|trainer.py:3503] 2025-04-11 18:32:02,376 >> Saving model checkpoint to /data/username/grafting/saves/llama3-8b/full/dpo_code
[INFO|configuration_utils.py:472] 2025-04-11 18:32:02,379 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/config.json
[INFO|configuration_utils.py:807] 2025-04-11 18:32:02,379 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/generation_config.json
[INFO|modeling_utils.py:2773] 2025-04-11 18:32:21,282 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /data/username/grafting/saves/llama3-8b/full/dpo_code/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2025-04-11 18:32:21,284 >> tokenizer config file saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2025-04-11 18:32:21,284 >> Special tokens file saved in /data/username/grafting/saves/llama3-8b/full/dpo_code/special_tokens_map.json
2025-04-11 18:32:21 - INFO - __main__ - Model saved to /data/username/grafting/saves/llama3-8b/full/dpo_code
2025-04-11 18:32:21 - INFO - __main__ - *** Training complete! ***
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
