[2025-04-16 10:56:47,383] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[INFO|2025-04-16 10:56:49] llamafactory.cli:143 >> Initializing 8 distributed tasks at: 127.0.0.1:21073
W0416 10:56:50.816000 140232307417728 torch/distributed/run.py:757] 
W0416 10:56:50.816000 140232307417728 torch/distributed/run.py:757] *****************************************
W0416 10:56:50.816000 140232307417728 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0416 10:56:50.816000 140232307417728 torch/distributed/run.py:757] *****************************************
[2025-04-16 10:56:54,576] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,577] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,578] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,601] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,631] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,655] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-16 10:56:54,660] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[2025-04-16 10:56:54,708] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2025-04-16 10:56:55,527] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-16 10:56:55,527] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-16 10:56:55,595] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-16 10:56:55,601] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-16 10:56:55,601] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-04-16 10:56:55,610] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-16 10:56:55,610] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 6, world size: 8, device: cuda:6, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 7, world size: 8, device: cuda:7, distributed training: True, compute dtype: torch.bfloat16
[2025-04-16 10:56:55,664] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 1, world size: 8, device: cuda:1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 4, world size: 8, device: cuda:4, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 3, world size: 8, device: cuda:3, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 0, world size: 8, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:55,734 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:55,734 >> loading file added_tokens.json
[2025-04-16 10:56:55,734] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:55,735 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:55,735 >> loading file tokenizer_config.json
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-16 10:56:55] llamafactory.hparams.parser:380 >> Process rank: 5, world size: 8, device: cuda:5, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2533] 2025-04-16 10:56:56,031 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:731] 2025-04-16 10:56:56,032 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2025-04-16 10:56:56,033 >> Model config LlamaConfig {
  "_name_or_path": "/data/public/Llama-3.1-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:56,034 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:56,034 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:56,034 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2025-04-16 10:56:56,034 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2025-04-16 10:56:56,304 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-04-16 10:56:56] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[INFO|2025-04-16 10:56:56] llamafactory.data.template:143 >> Add <|eot_id|>,<|eom_id|> to stop words.
[INFO|2025-04-16 10:56:56] llamafactory.data.loader:143 >> Loading dataset /data/public/grafting/sft_split_data/Mathematics/Mathematics.json...
Converting format of dataset (num_proc=16): 100%|██████████| 40188/40188 [00:00<00:00, 82769.59 examples/s] 
[WARNING|tokenization_utils_base.py:4119] 2025-04-16 10:56:59,687 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2643 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-16 10:57:01,150 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2404 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-16 10:57:01,168 >> Token indices sequence length is longer than the specified maximum sequence length for this model (3100 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:4119] 2025-04-16 10:57:03,011 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2139 > 2048). Running this sequence through the model will result in indexing errors
Running tokenizer on dataset (num_proc=16): 100%|██████████| 40188/40188 [00:04<00:00, 8241.42 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 3923, 374, 279, 19463, 430, 279, 10973, 6811, 1990, 279, 5219, 12146, 555, 33919, 323, 14596, 8385, 374, 2753, 1109, 477, 6273, 311, 220, 717, 30, 17855, 701, 4320, 439, 264, 4279, 19983, 13, 128009, 128006, 78191, 128007, 271, 791, 5219, 12146, 555, 33919, 323, 14596, 8385, 649, 387, 4205, 505, 400, 16, 3, 311, 400, 1041, 3, 320, 99461, 705, 779, 584, 617, 264, 2860, 315, 400, 1041, 1144, 15487, 220, 1041, 3, 3284, 13840, 315, 5219, 382, 2520, 279, 10973, 6811, 1990, 279, 5219, 12146, 555, 33919, 323, 14596, 8385, 311, 387, 2753, 1109, 477, 6273, 311, 220, 717, 11, 279, 6811, 649, 387, 400, 15, 11, 220, 16, 11, 220, 17, 11, 1144, 509, 2469, 11, 220, 717, 3, 382, 2746, 279, 6811, 374, 400, 15, 55976, 1243, 279, 5219, 527, 6273, 13, 1442, 279, 6811, 374, 400, 16, 55976, 1243, 279, 5219, 527, 400, 16, 3, 10980, 13, 763, 4689, 11, 422, 279, 6811, 374, 400, 74, 55976, 1243, 279, 5219, 527, 400, 74, 3, 10980, 382, 10267, 596, 1797, 279, 1396, 315, 13840, 369, 1855, 6811, 1473, 9, 1442, 279, 6811, 374, 400, 15, 55976, 1243, 584, 617, 400, 1041, 3, 13840, 320, 11536, 279, 5219, 527, 6273, 4390, 9, 1442, 279, 6811, 374, 400, 16, 55976, 1243, 584, 617, 400, 1484, 3, 13840, 320, 11536, 832, 1396, 374, 400, 16, 3, 810, 1109, 279, 1023, 4390, 9, 1442, 279, 6811, 374, 400, 17, 55976, 1243, 584, 617, 400, 3264, 3, 13840, 320, 11536, 832, 1396, 374, 400, 17, 3, 810, 1109, 279, 1023, 4390, 9, 9522, 9, 1442, 279, 6811, 374, 400, 717, 55976, 1243, 584, 617, 400, 2421, 3, 13840, 320, 11536, 832, 1396, 374, 400, 717, 3, 810, 1109, 279, 1023, 3677, 791, 2860, 1396, 315, 13840, 449, 264, 6811, 2753, 1109, 477, 6273, 311, 220, 717, 374, 512, 79145, 220, 1041, 489, 220, 17, 7, 1484, 489, 220, 3264, 489, 1144, 509, 2469, 489, 220, 2421, 8, 284, 220, 1041, 489, 220, 17, 1144, 51953, 1144, 38118, 97165, 1484, 489, 220, 2421, 8, 1144, 51953, 220, 717, 15523, 17, 92, 284, 220, 1041, 489, 220, 17, 1144, 51953, 220, 25612, 284, 220, 4468, 17, 1144, 2595, 4516, 
11, 279, 19463, 430, 279, 10973, 6811, 1990, 279, 5219, 12146, 555, 33919, 323, 14596, 8385, 374, 2753, 1109, 477, 6273, 311, 220, 717, 374, 512, 79145, 1144, 38118, 90, 4468, 17, 15523, 1041, 1144, 15487, 220, 1041, 92, 284, 1144, 80175, 36802, 38118, 90, 22741, 15523, 5154, 15, 3500, 1144, 60, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is the probability that the absolute difference between the numbers chosen by Billy and Bobbi is less than or equal to 12? Express your answer as a common fraction.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The numbers chosen by Billy and Bobbi can be anything from $1$ to $100$ (inclusive), so we have a total of $100 \times 100$ possible pairs of numbers.

For the absolute difference between the numbers chosen by Billy and Bobbi to be less than or equal to 12, the difference can be $0, 1, 2, \ldots, 12$.

If the difference is $0$, then the numbers are equal. If the difference is $1$, then the numbers are $1$ apart. In general, if the difference is $k$, then the numbers are $k$ apart.

Let's count the number of pairs for each difference:

* If the difference is $0$, then we have $100$ pairs (since the numbers are equal).
* If the difference is $1$, then we have $99$ pairs (since one number is $1$ more than the other).
* If the difference is $2$, then we have $98$ pairs (since one number is $2$ more than the other).
*...
* If the difference is $12$, then we have $88$ pairs (since one number is $12$ more than the other).

The total number of pairs with a difference less than or equal to 12 is:
\[ 100 + 2(99 + 98 + \ldots + 88) = 100 + 2 \cdot \frac{(99 + 88) \cdot 12}{2} = 100 + 2 \cdot 936 = 1972 \]

So, the probability that the absolute difference between the numbers chosen by Billy and Bobbi is less than or equal to 12 is:
\[ \frac{1972}{100 \times 100} = \boxed{\frac{493}{2500}} \]<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 791, 5219, 12146, 555, 33919, 323, 14596, 8385, 649, 387, 4205, 505, 400, 16, 3, 311, 400, 1041, 3, 320, 99461, 705, 779, 584, 617, 264, 2860, 315, 400, 1041, 1144, 15487, 220, 1041, 3, 3284, 13840, 315, 5219, 382, 2520, 279, 10973, 6811, 1990, 279, 5219, 12146, 555, 33919, 323, 14596, 8385, 311, 387, 2753, 1109, 477, 6273, 311, 220, 717, 11, 279, 6811, 649, 387, 400, 15, 11, 220, 16, 11, 220, 17, 11, 1144, 509, 2469, 11, 220, 717, 3, 382, 2746, 279, 6811, 374, 400, 15, 55976, 1243, 279, 5219, 527, 6273, 13, 1442, 279, 6811, 374, 400, 16, 55976, 1243, 279, 5219, 527, 400, 16, 3, 10980, 13, 763, 4689, 11, 422, 279, 6811, 374, 400, 74, 55976, 1243, 279, 5219, 527, 400, 74, 3, 10980, 382, 10267, 596, 1797, 279, 1396, 315, 13840, 369, 1855, 6811, 1473, 9, 1442, 279, 6811, 374, 400, 15, 55976, 1243, 584, 617, 400, 1041, 3, 13840, 320, 11536, 279, 5219, 527, 6273, 4390, 9, 1442, 279, 6811, 374, 400, 16, 55976, 1243, 584, 617, 400, 1484, 3, 13840, 320, 11536, 832, 1396, 374, 400, 16, 3, 810, 1109, 279, 1023, 4390, 9, 1442, 279, 6811, 374, 400, 17, 55976, 1243, 584, 617, 400, 3264, 3, 13840, 320, 11536, 832, 1396, 374, 400, 17, 3, 810, 1109, 279, 1023, 4390, 9, 9522, 9, 1442, 279, 6811, 374, 400, 717, 55976, 1243, 584, 617, 400, 2421, 3, 13840, 320, 11536, 832, 1396, 374, 400, 717, 3, 810, 1109, 279, 1023, 3677, 791, 2860, 1396, 315, 13840, 449, 264, 6811, 2753, 1109, 477, 6273, 311, 220, 717, 374, 512, 79145, 220, 1041, 489, 220, 17, 7, 1484, 489, 220, 3264, 489, 1144, 509, 2469, 489, 220, 2421, 8, 284, 220, 1041, 489, 220, 17, 1144, 51953, 1144, 38118, 97165, 1484, 489, 220, 2421, 8, 1144, 51953, 220, 717, 15523, 17, 92, 284, 220, 1041, 489, 220, 17, 1144, 51953, 220, 25612, 284, 220, 4468, 17, 1144, 2595, 
4516, 11, 279, 19463, 430, 279, 10973, 6811, 1990, 279, 5219, 12146, 555, 33919, 323, 14596, 8385, 374, 2753, 1109, 477, 6273, 311, 220, 717, 374, 512, 79145, 1144, 38118, 90, 4468, 17, 15523, 1041, 1144, 15487, 220, 1041, 92, 284, 1144, 80175, 36802, 38118, 90, 22741, 15523, 5154, 15, 3500, 1144, 60, 128009]
labels:
The numbers chosen by Billy and Bobbi can be anything from $1$ to $100$ (inclusive), so we have a total of $100 \times 100$ possible pairs of numbers.

For the absolute difference between the numbers chosen by Billy and Bobbi to be less than or equal to 12, the difference can be $0, 1, 2, \ldots, 12$.

If the difference is $0$, then the numbers are equal. If the difference is $1$, then the numbers are $1$ apart. In general, if the difference is $k$, then the numbers are $k$ apart.

Let's count the number of pairs for each difference:

* If the difference is $0$, then we have $100$ pairs (since the numbers are equal).
* If the difference is $1$, then we have $99$ pairs (since one number is $1$ more than the other).
* If the difference is $2$, then we have $98$ pairs (since one number is $2$ more than the other).
*...
* If the difference is $12$, then we have $88$ pairs (since one number is $12$ more than the other).

The total number of pairs with a difference less than or equal to 12 is:
\[ 100 + 2(99 + 98 + \ldots + 88) = 100 + 2 \cdot \frac{(99 + 88) \cdot 12}{2} = 100 + 2 \cdot 936 = 1972 \]

So, the probability that the absolute difference between the numbers chosen by Billy and Bobbi is less than or equal to 12 is:
\[ \frac{1972}{100 \times 100} = \boxed{\frac{493}{2500}} \]<|eot_id|>
[INFO|configuration_utils.py:731] 2025-04-16 10:57:03,548 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/config.json
[INFO|modeling_utils.py:3641] 2025-04-16 10:57:03,588 >> loading weights file /data/public/Llama-3.1-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:3786] 2025-04-16 10:57:03,588 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[WARNING|logging.py:328] 2025-04-16 10:57:03,591 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|logging.py:328] 2025-04-16 10:57:03,591 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:328] 2025-04-16 10:57:03,599 >> Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
[INFO|configuration_utils.py:1038] 2025-04-16 10:57:03,600 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "use_cache": false
}

[WARNING|logging.py:328] 2025-04-16 10:57:03,601 >> Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
[2025-04-16 10:57:05,162] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.57it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.55it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.21it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.12it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.07it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  3.99it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  3.93it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:04,  1.53s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.03it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.02it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.02it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.02it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01,  1.01it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:02,  1.00s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:02,  1.01s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.10s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.10s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:03<00:01,  1.11s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:03<00:01,  1.11s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:03<00:03,  1.54s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.26it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.21it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.25it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.21it/s]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.24it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.20it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.25it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.24it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.20it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.20it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.24it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.19it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.19it/s]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:04<00:01,  1.54s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.10s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.26s/it]
[INFO|modeling_utils.py:4473] 2025-04-16 10:57:10,220 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4481] 2025-04-16 10:57:10,220 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/public/Llama-3.1-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2025-04-16 10:57:10,223 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2025-04-16 10:57:10,223 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|2025-04-16 10:57:10] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-04-16 10:57:10] llamafactory.model.model_utils.attention:143 >> Using FlashAttention-2 for faster training and inference.
[INFO|2025-04-16 10:57:10] llamafactory.model.adapter:143 >> ZeRO3 / FSDP detected, remaining trainable params in float32.
[INFO|2025-04-16 10:57:10] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[INFO|2025-04-16 10:57:10] llamafactory.model.loader:143 >> trainable params: 8,030,261,248 || all params: 8,030,261,248 || trainable%: 100.0000
[INFO|trainer.py:648] 2025-04-16 10:57:10,261 >> Using auto half precision backend
[2025-04-16 10:57:10,410] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2025-04-16 10:57:10,417] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-16 10:57:10,418] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-16 10:57:10,418] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-16 10:57:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2025-04-16 10:57:10,425] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2025-04-16 10:57:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-04-16 10:57:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-04-16 10:57:10,560] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-04-16 10:57:10,561] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 4.68 GB         CA 3.0 GB         Max_CA 5 GB 
[2025-04-16 10:57:10,561] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.05 GB, percent = 2.4%
[2025-04-16 10:57:10,562] [INFO] [stage3.py:130:__init__] Reduce bucket size 16777216
[2025-04-16 10:57:10,562] [INFO] [stage3.py:131:__init__] Prefetch bucket size 15099494
[2025-04-16 10:57:10,696] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-16 10:57:10,697] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-16 10:57:10,697] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2025-04-16 10:57:10,848] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-16 10:57:10,848] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-16 10:57:10,849] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:10,986] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-16 10:57:10,987] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-16 10:57:10,987] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:12,875] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-04-16 10:57:12,876] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 1.87 GB         Max_CA 3 GB 
[2025-04-16 10:57:12,876] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:13,019] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-16 10:57:13,019] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 1.87 GB         Max_CA 2 GB 
[2025-04-16 10:57:13,020] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:13,163] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-16 10:57:13,164] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 7.48 GB         CA 7.48 GB         Max_CA 7 GB 
[2025-04-16 10:57:13,164] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:13,303] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-16 10:57:13,303] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 5.61 GB         CA 7.48 GB         Max_CA 7 GB 
[2025-04-16 10:57:13,303] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:13,444] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-16 10:57:13,445] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 9.35 GB         CA 11.22 GB         Max_CA 11 GB 
[2025-04-16 10:57:13,445] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.07 GB, percent = 2.4%
[2025-04-16 10:57:13,445] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2025-04-16 10:57:14,453] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-16 10:57:14,453] [INFO] [utils.py:782:see_memory_usage] MA 7.51 GB         Max_MA 9.47 GB         CA 11.22 GB         Max_CA 11 GB 
[2025-04-16 10:57:14,453] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 18.05 GB, percent = 2.4%
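Note: the MA (allocated memory) figures reported above appear consistent with simple ZeRO stage 3 partition arithmetic. The sketch below is only a sanity check under stated assumptions: the logged "GB" are taken to be binary gigabytes (2**30 bytes), the 8,030,261,248 parameters are assumed to be split evenly across the 8 ranks, and padding and persistent parameters are ignored. The helper name `partition_gib` is illustrative and is not a DeepSpeed API.

```python
NUM_PARAMS = 8_030_261_248  # "trainable params" reported by llamafactory.model.loader
WORLD_SIZE = 8              # "world_size" in the DeepSpeedEngine configuration
GIB = 2**30                 # assumption: the log's "GB" are binary gigabytes


def partition_gib(bytes_per_param, num_params=NUM_PARAMS, world_size=WORLD_SIZE):
    """Per-rank size of an evenly sharded copy of the parameters, in GiB."""
    return num_params * bytes_per_param / world_size / GIB


bf16_part = partition_gib(2)  # 16-bit parameter shard held by each rank
fp32_part = partition_gib(4)  # fp32 shard created for the optimizer

print(f"16-bit shard per rank:      {bf16_part:.2f} GiB")
print(f"16-bit + fp32 shard:        {bf16_part + fp32_part:.2f} GiB")
```

Under these assumptions the first value matches the 1.87 GB reported before the fp32 partitions are created, and the second matches the 5.61 GB reported afterwards. The later figures (for example 7.51 GB after initializing the ZeRO optimizer) include buffers this sketch does not model.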
[2025-04-16 10:57:14,453] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-04-16 10:57:14,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2025-04-16 10:57:14,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-16 10:57:14,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-16 10:57:14,454] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   amp_params ................... False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0f00391f90>
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2025-04-16 10:57:14,455] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   dump_state ................... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   fp16_enabled ................. False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   global_rank .................. 0
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 4
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   pld_params ................... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   steps_per_print .............. inf
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   train_batch_size ............. 128
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  4
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   world_size ................... 8
[2025-04-16 10:57:14,456] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2025-04-16 10:57:14,457] [INFO] [config.py:1001:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-04-16 10:57:14,457] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2025-04-16 10:57:14,457] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-16 10:57:14,457] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2025-04-16 10:57:14,457] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 128, 
    "train_micro_batch_size_per_gpu": 4, 
    "gradient_accumulation_steps": 4, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "overlap_comm": false, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "steps_per_print": inf
}
[INFO|trainer.py:2134] 2025-04-16 10:57:14,458 >> ***** Running training *****
[INFO|trainer.py:2135] 2025-04-16 10:57:14,458 >>   Num examples = 40,188
[INFO|trainer.py:2136] 2025-04-16 10:57:14,458 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2025-04-16 10:57:14,458 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:2140] 2025-04-16 10:57:14,458 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2141] 2025-04-16 10:57:14,458 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2025-04-16 10:57:14,458 >>   Total optimization steps = 942
[INFO|trainer.py:2143] 2025-04-16 10:57:14,459 >>   Number of trainable parameters = 8,030,261,248
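Note: the totals in the "Running training" banner can be reproduced from the other logged values. This is a minimal sketch, assuming the step count is derived the way the Hugging Face Trainer usually does it (batches per epoch rounded up because `dataloader_drop_last` is False, then integer-divided by the accumulation steps). The variable names are illustrative.

```python
import math

num_examples = 40_188   # "Num examples"
per_device_batch = 4    # "Instantaneous batch size per device"
world_size = 8          # "world_size" in the DeepSpeed configuration
grad_accum = 4          # "Gradient Accumulation steps"
num_epochs = 3          # "Num Epochs"

# Samples consumed per optimizer update across all ranks.
total_batch = per_device_batch * world_size * grad_accum

# Micro-batches each rank sees per epoch; the last partial batch is kept.
batches_per_epoch = math.ceil(num_examples / (per_device_batch * world_size))

# Optimizer updates per epoch (assumed floor division, at least one update).
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)

total_steps = updates_per_epoch * num_epochs

print(total_batch, batches_per_epoch, updates_per_epoch, total_steps)
```

With the logged inputs this gives a total batch size of 128 and 942 optimization steps, which agrees with the banner.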
  0%|          | 0/942 [00:00<?, ?it/s]
  0%|          | 1/942 [00:22<6:00:10, 22.96s/it]
  0%|          | 2/942 [00:43<5:41:50, 21.82s/it]
  0%|          | 3/942 [01:05<5:37:53, 21.59s/it]
  0%|          | 4/942 [01:25<5:26:12, 20.87s/it]
  1%|          | 5/942 [01:46<5:27:20, 20.96s/it]
  1%|          | 6/942 [02:08<5:32:18, 21.30s/it]
  1%|          | 7/942 [02:29<5:30:12, 21.19s/it]
  1%|          | 8/942 [02:51<5:34:56, 21.52s/it]
  1%|          | 9/942 [03:13<5:37:50, 21.73s/it]
  1%|          | 10/942 [03:33<5:29:47, 21.23s/it]
{'loss': 0.4103, 'grad_norm': 3.7497346342507707, 'learning_rate': 5.263157894736843e-07, 'epoch': 0.03}
  1%|          | 11/942 [03:55<5:31:28, 21.36s/it]
  1%|▏         | 12/942 [04:16<5:31:37, 21.40s/it]
  1%|▏         | 13/942 [04:39<5:35:38, 21.68s/it]
  1%|▏         | 14/942 [04:59<5:27:57, 21.20s/it]
  2%|▏         | 15/942 [05:20<5:26:58, 21.16s/it]
[2025-04-16 11:02:58,344] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  2%|▏         | 16/942 [05:43<5:37:32, 21.87s/it]
  2%|▏         | 17/942 [06:04<5:32:13, 21.55s/it]
  2%|▏         | 18/942 [06:24<5:23:05, 20.98s/it]
  2%|▏         | 19/942 [06:47<5:31:23, 21.54s/it]
  2%|▏         | 20/942 [07:08<5:28:19, 21.37s/it]
{'loss': 0.3725, 'grad_norm': 1.9045185090393433, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.06}
  2%|▏         | 21/942 [07:28<5:24:54, 21.17s/it]
  2%|▏         | 22/942 [07:48<5:17:46, 20.72s/it]
  2%|▏         | 23/942 [08:11<5:27:39, 21.39s/it]
  3%|▎         | 24/942 [08:32<5:25:48, 21.30s/it]
  3%|▎         | 25/942 [08:55<5:32:17, 21.74s/it]
  3%|▎         | 26/942 [09:16<5:27:24, 21.45s/it]
  3%|▎         | 27/942 [09:36<5:24:47, 21.30s/it]
  3%|▎         | 28/942 [09:59<5:28:16, 21.55s/it]
  3%|▎         | 29/942 [10:20<5:26:03, 21.43s/it]
  3%|▎         | 30/942 [10:40<5:20:31, 21.09s/it]
{'loss': 0.3339, 'grad_norm': 1.3377916099830132, 'learning_rate': 1.5789473684210526e-06, 'epoch': 0.1}
  3%|▎         | 31/942 [11:01<5:21:10, 21.15s/it]
  3%|▎         | 32/942 [11:22<5:19:20, 21.06s/it]
  4%|▎         | 33/942 [11:43<5:19:49, 21.11s/it]
  4%|▎         | 34/942 [12:05<5:20:44, 21.19s/it]
  4%|▎         | 35/942 [12:26<5:19:23, 21.13s/it]
  4%|▍         | 36/942 [12:48<5:23:53, 21.45s/it]
  4%|▍         | 37/942 [13:08<5:17:11, 21.03s/it]
  4%|▍         | 38/942 [13:28<5:13:59, 20.84s/it]
  4%|▍         | 39/942 [13:49<5:14:17, 20.88s/it]
  4%|▍         | 40/942 [14:11<5:15:56, 21.02s/it]
                                                  
{'loss': 0.3037, 'grad_norm': 0.9975296834093378, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.13}
  4%|▍         | 40/942 [14:11<5:15:56, 21.02s/it]
  4%|▍         | 41/942 [14:31<5:14:13, 20.93s/it]
  4%|▍         | 42/942 [14:53<5:18:03, 21.20s/it]
  5%|▍         | 43/942 [15:15<5:18:58, 21.29s/it]
  5%|▍         | 44/942 [15:35<5:14:41, 21.03s/it]
  5%|▍         | 45/942 [15:56<5:13:44, 20.99s/it]
  5%|▍         | 46/942 [16:17<5:13:56, 21.02s/it]
  5%|▍         | 47/942 [16:38<5:13:09, 20.99s/it]
  5%|▌         | 48/942 [16:58<5:05:43, 20.52s/it]
  5%|▌         | 49/942 [17:18<5:05:12, 20.51s/it]
  5%|▌         | 50/942 [17:39<5:08:28, 20.75s/it]
                                                  
{'loss': 0.2955, 'grad_norm': 1.04910056887481, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.16}
  5%|▌         | 50/942 [17:39<5:08:28, 20.75s/it]
  5%|▌         | 51/942 [18:03<5:21:02, 21.62s/it]
  6%|▌         | 52/942 [18:22<5:10:16, 20.92s/it]
  6%|▌         | 53/942 [18:42<5:04:08, 20.53s/it]
  6%|▌         | 54/942 [19:01<4:59:15, 20.22s/it]
  6%|▌         | 55/942 [19:23<5:04:42, 20.61s/it]
  6%|▌         | 56/942 [19:43<5:01:43, 20.43s/it]
  6%|▌         | 57/942 [20:05<5:10:50, 21.07s/it]
  6%|▌         | 58/942 [20:27<5:11:09, 21.12s/it]
  6%|▋         | 59/942 [20:48<5:10:40, 21.11s/it]
  6%|▋         | 60/942 [21:10<5:13:15, 21.31s/it]
                                                  
{'loss': 0.2869, 'grad_norm': 1.0286924286806316, 'learning_rate': 3.157894736842105e-06, 'epoch': 0.19}
  6%|▋         | 60/942 [21:10<5:13:15, 21.31s/it]
  6%|▋         | 61/942 [21:30<5:09:32, 21.08s/it]
  7%|▋         | 62/942 [21:51<5:06:10, 20.88s/it]
  7%|▋         | 63/942 [22:11<5:02:06, 20.62s/it]
  7%|▋         | 64/942 [22:32<5:03:57, 20.77s/it]
  7%|▋         | 65/942 [22:56<5:17:08, 21.70s/it]
  7%|▋         | 66/942 [23:18<5:19:56, 21.91s/it]
  7%|▋         | 67/942 [23:41<5:23:52, 22.21s/it]
  7%|▋         | 68/942 [24:03<5:21:49, 22.09s/it]
  7%|▋         | 69/942 [24:25<5:21:57, 22.13s/it]
  7%|▋         | 70/942 [24:46<5:16:08, 21.75s/it]
                                                  
{'loss': 0.2818, 'grad_norm': 0.9841868204904943, 'learning_rate': 3.6842105263157896e-06, 'epoch': 0.22}
  7%|▋         | 70/942 [24:46<5:16:08, 21.75s/it]
  8%|▊         | 71/942 [25:07<5:12:37, 21.54s/it]
  8%|▊         | 72/942 [25:27<5:07:33, 21.21s/it]
  8%|▊         | 73/942 [25:47<5:01:16, 20.80s/it]
  8%|▊         | 74/942 [26:09<5:05:23, 21.11s/it]
  8%|▊         | 75/942 [26:29<4:59:45, 20.74s/it]
  8%|▊         | 76/942 [26:50<5:01:04, 20.86s/it]
  8%|▊         | 77/942 [27:10<4:58:17, 20.69s/it]
  8%|▊         | 78/942 [27:31<4:57:11, 20.64s/it]
  8%|▊         | 79/942 [27:51<4:55:01, 20.51s/it]
  8%|▊         | 80/942 [28:12<4:55:38, 20.58s/it]
                                                  
{'loss': 0.2863, 'grad_norm': 1.0111515659972832, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.25}
  8%|▊         | 80/942 [28:12<4:55:38, 20.58s/it]
  9%|▊         | 81/942 [28:31<4:48:54, 20.13s/it]
  9%|▊         | 82/942 [28:52<4:54:45, 20.56s/it]
  9%|▉         | 83/942 [29:14<4:59:02, 20.89s/it]
  9%|▉         | 84/942 [29:35<4:58:30, 20.87s/it]
  9%|▉         | 85/942 [29:56<4:59:03, 20.94s/it]
  9%|▉         | 86/942 [30:16<4:56:56, 20.81s/it]
  9%|▉         | 87/942 [30:37<4:55:56, 20.77s/it]
  9%|▉         | 88/942 [31:01<5:06:54, 21.56s/it]
  9%|▉         | 89/942 [31:22<5:04:06, 21.39s/it]
 10%|▉         | 90/942 [31:42<5:01:01, 21.20s/it]
                                                  
{'loss': 0.2782, 'grad_norm': 0.9783316725047976, 'learning_rate': 4.736842105263158e-06, 'epoch': 0.29}
 10%|▉         | 90/942 [31:42<5:01:01, 21.20s/it]
 10%|▉         | 91/942 [32:03<5:00:17, 21.17s/it]
 10%|▉         | 92/942 [32:27<5:10:50, 21.94s/it]
 10%|▉         | 93/942 [32:48<5:05:52, 21.62s/it]
 10%|▉         | 94/942 [33:08<5:00:19, 21.25s/it]
 10%|█         | 95/942 [33:29<4:55:58, 20.97s/it]
 10%|█         | 96/942 [33:50<4:55:35, 20.96s/it]
 10%|█         | 97/942 [34:10<4:51:51, 20.72s/it]
 10%|█         | 98/942 [34:31<4:54:00, 20.90s/it]
 11%|█         | 99/942 [34:51<4:50:55, 20.71s/it]
 11%|█         | 100/942 [35:13<4:55:13, 21.04s/it]
                                                   
{'loss': 0.2818, 'grad_norm': 0.9345083703627998, 'learning_rate': 4.999570096976961e-06, 'epoch': 0.32}
 11%|█         | 100/942 [35:13<4:55:13, 21.04s/it]
 11%|█         | 101/942 [35:34<4:53:06, 20.91s/it]
 11%|█         | 102/942 [35:55<4:53:28, 20.96s/it]
 11%|█         | 103/942 [36:15<4:50:38, 20.78s/it]
 11%|█         | 104/942 [36:36<4:52:00, 20.91s/it]
 11%|█         | 105/942 [36:57<4:48:52, 20.71s/it]
 11%|█▏        | 106/942 [37:19<4:55:27, 21.21s/it]
 11%|█▏        | 107/942 [37:39<4:51:52, 20.97s/it]
 11%|█▏        | 108/942 [38:01<4:53:19, 21.10s/it]
 12%|█▏        | 109/942 [38:21<4:49:13, 20.83s/it]
 12%|█▏        | 110/942 [38:42<4:50:53, 20.98s/it]
                                                   
{'loss': 0.2769, 'grad_norm': 0.9697704585268458, 'learning_rate': 4.996131759861523e-06, 'epoch': 0.35}
 12%|█▏        | 110/942 [38:42<4:50:53, 20.98s/it]
 12%|█▏        | 111/942 [39:02<4:46:21, 20.68s/it]
 12%|█▏        | 112/942 [39:23<4:45:56, 20.67s/it]
 12%|█▏        | 113/942 [39:46<4:55:26, 21.38s/it]
 12%|█▏        | 114/942 [40:07<4:53:07, 21.24s/it]
 12%|█▏        | 115/942 [40:31<5:05:45, 22.18s/it]
 12%|█▏        | 116/942 [40:52<4:58:11, 21.66s/it]
 12%|█▏        | 117/942 [41:12<4:52:07, 21.25s/it]
 13%|█▎        | 118/942 [41:33<4:48:55, 21.04s/it]
 13%|█▎        | 119/942 [41:55<4:54:02, 21.44s/it]
 13%|█▎        | 120/942 [42:18<4:59:19, 21.85s/it]
                                                   
{'loss': 0.2737, 'grad_norm': 0.9872268294862245, 'learning_rate': 4.989259815308816e-06, 'epoch': 0.38}
 13%|█▎        | 120/942 [42:18<4:59:19, 21.85s/it]
 13%|█▎        | 121/942 [42:39<4:55:37, 21.61s/it]
 13%|█▎        | 122/942 [43:03<5:06:03, 22.39s/it]
 13%|█▎        | 123/942 [43:23<4:55:45, 21.67s/it]
 13%|█▎        | 124/942 [43:45<4:55:22, 21.67s/it]
 13%|█▎        | 125/942 [44:05<4:51:18, 21.39s/it]
 13%|█▎        | 126/942 [44:27<4:53:31, 21.58s/it]
 13%|█▎        | 127/942 [44:49<4:51:25, 21.46s/it]
 14%|█▎        | 128/942 [45:09<4:44:41, 20.98s/it]
 14%|█▎        | 129/942 [45:29<4:43:29, 20.92s/it]
 14%|█▍        | 130/942 [45:49<4:38:16, 20.56s/it]
                                                   
{'loss': 0.2768, 'grad_norm': 0.9901232222401508, 'learning_rate': 4.978963716169166e-06, 'epoch': 0.41}
 14%|█▍        | 130/942 [45:49<4:38:16, 20.56s/it]
 14%|█▍        | 131/942 [46:11<4:41:39, 20.84s/it]
 14%|█▍        | 132/942 [46:30<4:37:22, 20.55s/it]
 14%|█▍        | 133/942 [46:50<4:35:00, 20.40s/it]
 14%|█▍        | 134/942 [47:11<4:33:27, 20.31s/it]
 14%|█▍        | 135/942 [47:32<4:39:39, 20.79s/it]
 14%|█▍        | 136/942 [47:54<4:42:06, 21.00s/it]
 15%|█▍        | 137/942 [48:16<4:44:35, 21.21s/it]
 15%|█▍        | 138/942 [48:40<4:55:37, 22.06s/it]
 15%|█▍        | 139/942 [49:02<4:56:35, 22.16s/it]
 15%|█▍        | 140/942 [49:23<4:50:14, 21.71s/it]
                                                   
{'loss': 0.2766, 'grad_norm': 1.0029219354099483, 'learning_rate': 4.9652576254619926e-06, 'epoch': 0.45}
 15%|█▍        | 140/942 [49:23<4:50:14, 21.71s/it]
 15%|█▍        | 141/942 [49:45<4:51:12, 21.81s/it]
 15%|█▌        | 142/942 [50:07<4:51:01, 21.83s/it]
 15%|█▌        | 143/942 [50:28<4:47:03, 21.56s/it]
 15%|█▌        | 144/942 [50:48<4:42:45, 21.26s/it]
 15%|█▌        | 145/942 [51:10<4:43:30, 21.34s/it]
 15%|█▌        | 146/942 [51:31<4:43:22, 21.36s/it]
 16%|█▌        | 147/942 [51:53<4:43:14, 21.38s/it]
 16%|█▌        | 148/942 [52:12<4:36:45, 20.91s/it]
 16%|█▌        | 149/942 [52:34<4:39:39, 21.16s/it]
 16%|█▌        | 150/942 [52:54<4:33:54, 20.75s/it]
                                                   
{'loss': 0.2809, 'grad_norm': 1.0259938153820212, 'learning_rate': 4.948160396893553e-06, 'epoch': 0.48}
 16%|█▌        | 150/942 [52:54<4:33:54, 20.75s/it]
 16%|█▌        | 151/942 [53:15<4:33:23, 20.74s/it]
 16%|█▌        | 152/942 [53:35<4:31:42, 20.64s/it]
 16%|█▌        | 153/942 [53:56<4:33:29, 20.80s/it]
 16%|█▋        | 154/942 [54:17<4:31:31, 20.67s/it]
 16%|█▋        | 155/942 [54:37<4:30:42, 20.64s/it]
 17%|█▋        | 156/942 [54:58<4:32:40, 20.81s/it]
 17%|█▋        | 157/942 [55:20<4:35:17, 21.04s/it]
 17%|█▋        | 158/942 [55:42<4:38:56, 21.35s/it]
 17%|█▋        | 159/942 [56:02<4:31:49, 20.83s/it]
 17%|█▋        | 160/942 [56:24<4:38:05, 21.34s/it]
                                                   
{'loss': 0.2741, 'grad_norm': 0.9474103632337532, 'learning_rate': 4.927695548922336e-06, 'epoch': 0.51}
 17%|█▋        | 160/942 [56:24<4:38:05, 21.34s/it]
 17%|█▋        | 161/942 [56:45<4:36:35, 21.25s/it]
 17%|█▋        | 162/942 [57:07<4:37:17, 21.33s/it]
 17%|█▋        | 163/942 [57:27<4:33:37, 21.07s/it]
 17%|█▋        | 164/942 [57:49<4:34:44, 21.19s/it]
 18%|█▊        | 165/942 [58:10<4:35:27, 21.27s/it]
 18%|█▊        | 166/942 [58:32<4:37:12, 21.43s/it]
 18%|█▊        | 167/942 [58:55<4:42:51, 21.90s/it]
 18%|█▊        | 168/942 [59:15<4:35:44, 21.38s/it]
 18%|█▊        | 169/942 [59:36<4:32:59, 21.19s/it]
 18%|█▊        | 170/942 [59:57<4:31:38, 21.11s/it]
                                                   
{'loss': 0.2647, 'grad_norm': 0.9240960503512892, 'learning_rate': 4.903891232407731e-06, 'epoch': 0.54}
 18%|█▊        | 170/942 [59:57<4:31:38, 21.11s/it]
 18%|█▊        | 171/942 [1:00:17<4:27:36, 20.83s/it]
 18%|█▊        | 172/942 [1:00:36<4:21:52, 20.41s/it]
 18%|█▊        | 173/942 [1:00:57<4:24:25, 20.63s/it]
 18%|█▊        | 174/942 [1:01:17<4:19:49, 20.30s/it]
 19%|█▊        | 175/942 [1:01:37<4:20:08, 20.35s/it]
 19%|█▊        | 176/942 [1:02:00<4:29:21, 21.10s/it]
 19%|█▉        | 177/942 [1:02:22<4:32:37, 21.38s/it]
 19%|█▉        | 178/942 [1:02:43<4:30:57, 21.28s/it]
 19%|█▉        | 179/942 [1:03:04<4:29:41, 21.21s/it]
 19%|█▉        | 180/942 [1:03:27<4:33:25, 21.53s/it]
                                                     
{'loss': 0.2673, 'grad_norm': 1.0055429859225373, 'learning_rate': 4.876780191886523e-06, 'epoch': 0.57}
 19%|█▉        | 180/942 [1:03:27<4:33:25, 21.53s/it]
 19%|█▉        | 181/942 [1:03:48<4:33:32, 21.57s/it]
 19%|█▉        | 182/942 [1:04:10<4:32:52, 21.54s/it]
 19%|█▉        | 183/942 [1:04:31<4:32:54, 21.57s/it]
 20%|█▉        | 184/942 [1:04:51<4:26:22, 21.08s/it]
 20%|█▉        | 185/942 [1:05:11<4:22:07, 20.78s/it]
 20%|█▉        | 186/942 [1:05:31<4:17:22, 20.43s/it]
 20%|█▉        | 187/942 [1:05:53<4:21:00, 20.74s/it]
 20%|█▉        | 188/942 [1:06:12<4:15:24, 20.32s/it]
 20%|██        | 189/942 [1:06:32<4:15:15, 20.34s/it]
 20%|██        | 190/942 [1:06:52<4:13:44, 20.25s/it]
                                                     
{'loss': 0.2722, 'grad_norm': 0.9815856893433419, 'learning_rate': 4.846399720530434e-06, 'epoch': 0.61}
 20%|██        | 190/942 [1:06:52<4:13:44, 20.25s/it]
 20%|██        | 191/942 [1:07:14<4:18:09, 20.63s/it]
 20%|██        | 192/942 [1:07:36<4:24:01, 21.12s/it]
 20%|██        | 193/942 [1:07:57<4:21:05, 20.91s/it]
 21%|██        | 194/942 [1:08:18<4:21:49, 21.00s/it]
 21%|██        | 195/942 [1:08:39<4:22:09, 21.06s/it]
 21%|██        | 196/942 [1:08:57<4:11:30, 20.23s/it]
 21%|██        | 197/942 [1:09:18<4:12:40, 20.35s/it]
 21%|██        | 198/942 [1:09:39<4:14:58, 20.56s/it]
 21%|██        | 199/942 [1:10:01<4:19:40, 20.97s/it]
 21%|██        | 200/942 [1:10:20<4:14:03, 20.54s/it]
                                                     
{'loss': 0.2728, 'grad_norm': 1.0867590046867948, 'learning_rate': 4.812791608846709e-06, 'epoch': 0.64}
 21%|██        | 200/942 [1:10:20<4:14:03, 20.54s/it]
 21%|██▏       | 201/942 [1:10:41<4:13:04, 20.49s/it]
 21%|██▏       | 202/942 [1:11:04<4:21:53, 21.23s/it]
 22%|██▏       | 203/942 [1:11:27<4:28:07, 21.77s/it]
 22%|██▏       | 204/942 [1:11:47<4:23:08, 21.39s/it]
 22%|██▏       | 205/942 [1:12:07<4:17:24, 20.96s/it]
 22%|██▏       | 206/942 [1:12:30<4:25:28, 21.64s/it]
 22%|██▏       | 207/942 [1:12:53<4:28:07, 21.89s/it]
 22%|██▏       | 208/942 [1:13:14<4:23:54, 21.57s/it]
 22%|██▏       | 209/942 [1:13:33<4:16:47, 21.02s/it]
 22%|██▏       | 210/942 [1:13:54<4:15:25, 20.94s/it]
                                                     
{'loss': 0.2663, 'grad_norm': 0.952445932507859, 'learning_rate': 4.776002087192291e-06, 'epoch': 0.67}
 22%|██▏       | 210/942 [1:13:54<4:15:25, 20.94s/it]
 22%|██▏       | 211/942 [1:14:15<4:15:39, 20.98s/it]
 23%|██▎       | 212/942 [1:14:36<4:14:23, 20.91s/it]
 23%|██▎       | 213/942 [1:15:00<4:23:34, 21.69s/it]
 23%|██▎       | 214/942 [1:15:22<4:24:15, 21.78s/it]
 23%|██▎       | 215/942 [1:15:42<4:18:25, 21.33s/it]
 23%|██▎       | 216/942 [1:16:04<4:20:04, 21.49s/it]
 23%|██▎       | 217/942 [1:16:24<4:16:06, 21.20s/it]
 23%|██▎       | 218/942 [1:16:44<4:10:04, 20.72s/it]
 23%|██▎       | 219/942 [1:17:05<4:12:10, 20.93s/it]
 23%|██▎       | 220/942 [1:17:25<4:07:15, 20.55s/it]
                                                     
{'loss': 0.2651, 'grad_norm': 1.0583015939039986, 'learning_rate': 4.7360817621806585e-06, 'epoch': 0.7}
 23%|██▎       | 220/942 [1:17:25<4:07:15, 20.55s/it]
 23%|██▎       | 221/942 [1:17:45<4:06:48, 20.54s/it]
 24%|██▎       | 222/942 [1:18:06<4:07:20, 20.61s/it]
 24%|██▎       | 223/942 [1:18:27<4:06:38, 20.58s/it]
 24%|██▍       | 224/942 [1:18:46<4:03:15, 20.33s/it]
 24%|██▍       | 225/942 [1:19:08<4:05:59, 20.59s/it]
 24%|██▍       | 226/942 [1:19:27<4:00:10, 20.13s/it]
 24%|██▍       | 227/942 [1:19:50<4:11:40, 21.12s/it]
 24%|██▍       | 228/942 [1:20:12<4:13:10, 21.28s/it]
 24%|██▍       | 229/942 [1:20:33<4:11:43, 21.18s/it]
 24%|██▍       | 230/942 [1:20:54<4:12:16, 21.26s/it]
                                                     
{'loss': 0.2643, 'grad_norm': 0.9123405710358544, 'learning_rate': 4.69308554706882e-06, 'epoch': 0.73}
 24%|██▍       | 230/942 [1:20:54<4:12:16, 21.26s/it]
 25%|██▍       | 231/942 [1:21:15<4:10:16, 21.12s/it]
 25%|██▍       | 232/942 [1:21:35<4:07:37, 20.93s/it]
 25%|██▍       | 233/942 [1:21:57<4:09:28, 21.11s/it]
 25%|██▍       | 234/942 [1:22:18<4:09:14, 21.12s/it]
 25%|██▍       | 235/942 [1:22:38<4:05:17, 20.82s/it]
 25%|██▌       | 236/942 [1:22:58<4:02:36, 20.62s/it]
 25%|██▌       | 237/942 [1:23:19<4:01:51, 20.58s/it]
 25%|██▌       | 238/942 [1:23:40<4:03:08, 20.72s/it]
 25%|██▌       | 239/942 [1:24:01<4:02:56, 20.74s/it]
 25%|██▌       | 240/942 [1:24:23<4:06:44, 21.09s/it]
                                                     
{'loss': 0.2652, 'grad_norm': 0.9471848373479481, 'learning_rate': 4.64707258622021e-06, 'epoch': 0.76}
 25%|██▌       | 240/942 [1:24:23<4:06:44, 21.09s/it]
 26%|██▌       | 241/942 [1:24:43<4:04:00, 20.89s/it]
 26%|██▌       | 242/942 [1:25:05<4:06:46, 21.15s/it]
 26%|██▌       | 243/942 [1:25:27<4:11:13, 21.56s/it]
 26%|██▌       | 244/942 [1:25:47<4:04:18, 21.00s/it]
 26%|██▌       | 245/942 [1:26:11<4:13:41, 21.84s/it]
 26%|██▌       | 246/942 [1:26:32<4:09:57, 21.55s/it]
 26%|██▌       | 247/942 [1:26:51<4:03:40, 21.04s/it]
 26%|██▋       | 248/942 [1:27:11<3:57:40, 20.55s/it]
 26%|██▋       | 249/942 [1:27:32<4:00:21, 20.81s/it]
 27%|██▋       | 250/942 [1:27:54<4:02:27, 21.02s/it]
                                                     
{'loss': 0.2696, 'grad_norm': 0.9305948812222854, 'learning_rate': 4.59810617374739e-06, 'epoch': 0.8}
 27%|██▋       | 250/942 [1:27:54<4:02:27, 21.02s/it]
 27%|██▋       | 251/942 [1:28:14<3:57:44, 20.64s/it]
 27%|██▋       | 252/942 [1:28:34<3:55:29, 20.48s/it]
 27%|██▋       | 253/942 [1:28:56<4:00:26, 20.94s/it]
 27%|██▋       | 254/942 [1:29:16<3:58:40, 20.82s/it]
 27%|██▋       | 255/942 [1:29:37<3:58:50, 20.86s/it]
 27%|██▋       | 256/942 [1:29:58<3:59:42, 20.97s/it]
 27%|██▋       | 257/942 [1:30:19<3:56:59, 20.76s/it]
 27%|██▋       | 258/942 [1:30:39<3:54:19, 20.55s/it]
 27%|██▋       | 259/942 [1:31:03<4:08:15, 21.81s/it]
 28%|██▊       | 260/942 [1:31:25<4:06:46, 21.71s/it]
                                                     
{'loss': 0.2684, 'grad_norm': 0.969274149349572, 'learning_rate': 4.546253666446484e-06, 'epoch': 0.83}
 28%|██▊       | 260/942 [1:31:25<4:06:46, 21.71s/it]
 28%|██▊       | 261/942 [1:31:45<4:01:53, 21.31s/it]
 28%|██▊       | 262/942 [1:32:07<4:01:59, 21.35s/it]
 28%|██▊       | 263/942 [1:32:27<3:57:47, 21.01s/it]
 28%|██▊       | 264/942 [1:32:48<3:56:06, 20.89s/it]
 28%|██▊       | 265/942 [1:33:08<3:55:33, 20.88s/it]
 28%|██▊       | 266/942 [1:33:30<3:58:14, 21.15s/it]
 28%|██▊       | 267/942 [1:33:51<3:57:08, 21.08s/it]
 28%|██▊       | 268/942 [1:34:13<3:57:59, 21.19s/it]
 29%|██▊       | 269/942 [1:34:35<4:01:44, 21.55s/it]
 29%|██▊       | 270/942 [1:34:56<3:58:46, 21.32s/it]
                                                     
{'loss': 0.2648, 'grad_norm': 0.9608479481755663, 'learning_rate': 4.49158639114309e-06, 'epoch': 0.86}
 29%|██▊       | 270/942 [1:34:56<3:58:46, 21.32s/it]
 29%|██▉       | 271/942 [1:35:16<3:54:54, 21.00s/it]
 29%|██▉       | 272/942 [1:35:37<3:54:34, 21.01s/it]
 29%|██▉       | 273/942 [1:35:59<3:55:56, 21.16s/it]
 29%|██▉       | 274/942 [1:36:21<4:00:40, 21.62s/it]
 29%|██▉       | 275/942 [1:36:41<3:55:34, 21.19s/it]
 29%|██▉       | 276/942 [1:37:02<3:53:26, 21.03s/it]
 29%|██▉       | 277/942 [1:37:23<3:54:04, 21.12s/it]
 30%|██▉       | 278/942 [1:37:44<3:51:59, 20.96s/it]
 30%|██▉       | 279/942 [1:38:04<3:47:18, 20.57s/it]
 30%|██▉       | 280/942 [1:38:24<3:44:58, 20.39s/it]
                                                     
{'loss': 0.2751, 'grad_norm': 0.9783203897648601, 'learning_rate': 4.434179546577146e-06, 'epoch': 0.89}
 30%|██▉       | 280/942 [1:38:24<3:44:58, 20.39s/it]
 30%|██▉       | 281/942 [1:38:44<3:44:40, 20.39s/it]
 30%|██▉       | 282/942 [1:39:04<3:44:12, 20.38s/it]
 30%|███       | 283/942 [1:39:26<3:47:59, 20.76s/it]
 30%|███       | 284/942 [1:39:47<3:46:48, 20.68s/it]
 30%|███       | 285/942 [1:40:10<3:55:36, 21.52s/it]
 30%|███       | 286/942 [1:40:30<3:50:11, 21.05s/it]
 30%|███       | 287/942 [1:40:53<3:55:47, 21.60s/it]
 31%|███       | 288/942 [1:41:14<3:54:17, 21.49s/it]
 31%|███       | 289/942 [1:41:38<4:02:55, 22.32s/it]
 31%|███       | 290/942 [1:42:00<3:59:34, 22.05s/it]
                                                     
{'loss': 0.2635, 'grad_norm': 0.9477882653219104, 'learning_rate': 4.374112099961689e-06, 'epoch': 0.92}
 31%|███       | 290/942 [1:42:00<3:59:34, 22.05s/it]
 31%|███       | 291/942 [1:42:21<3:55:12, 21.68s/it]
 31%|███       | 292/942 [1:42:42<3:53:07, 21.52s/it]
 31%|███       | 293/942 [1:43:02<3:47:42, 21.05s/it]
 31%|███       | 294/942 [1:43:22<3:45:07, 20.84s/it]
 31%|███▏      | 295/942 [1:43:46<3:55:19, 21.82s/it]
 31%|███▏      | 296/942 [1:44:07<3:51:10, 21.47s/it]
 32%|███▏      | 297/942 [1:44:29<3:54:34, 21.82s/it]
 32%|███▏      | 298/942 [1:44:52<3:55:06, 21.90s/it]
 32%|███▏      | 299/942 [1:45:12<3:50:02, 21.47s/it]
 32%|███▏      | 300/942 [1:45:33<3:47:53, 21.30s/it]
                                                     
{'loss': 0.2614, 'grad_norm': 0.8914884237698243, 'learning_rate': 4.3114666783578195e-06, 'epoch': 0.96}
 32%|███▏      | 300/942 [1:45:33<3:47:53, 21.30s/it]
 32%|███▏      | 301/942 [1:45:56<3:52:42, 21.78s/it]
 32%|███▏      | 302/942 [1:46:16<3:48:19, 21.41s/it]
 32%|███▏      | 303/942 [1:46:37<3:44:56, 21.12s/it]
 32%|███▏      | 304/942 [1:46:58<3:43:41, 21.04s/it]
 32%|███▏      | 305/942 [1:47:19<3:43:09, 21.02s/it]
 32%|███▏      | 306/942 [1:47:40<3:42:29, 20.99s/it]
 33%|███▎      | 307/942 [1:48:01<3:42:45, 21.05s/it]
 33%|███▎      | 308/942 [1:48:23<3:45:12, 21.31s/it]
 33%|███▎      | 309/942 [1:48:43<3:42:01, 21.04s/it]
 33%|███▎      | 310/942 [1:49:02<3:36:23, 20.54s/it]
                                                     
{'loss': 0.2636, 'grad_norm': 0.9539626417073356, 'learning_rate': 4.246329455015279e-06, 'epoch': 0.99}
 33%|███▎      | 310/942 [1:49:02<3:36:23, 20.54s/it]
 33%|███▎      | 311/942 [1:49:22<3:31:32, 20.11s/it]
 33%|███▎      | 312/942 [1:49:45<3:40:07, 20.96s/it]
 33%|███▎      | 313/942 [1:50:06<3:40:30, 21.03s/it]
 33%|███▎      | 314/942 [1:50:27<3:42:12, 21.23s/it]
 33%|███▎      | 315/942 [1:50:48<3:39:25, 21.00s/it]
 34%|███▎      | 316/942 [1:51:10<3:42:55, 21.37s/it]
 34%|███▎      | 317/942 [1:51:31<3:40:23, 21.16s/it]
 34%|███▍      | 318/942 [1:51:52<3:39:09, 21.07s/it]
 34%|███▍      | 319/942 [1:52:13<3:39:21, 21.13s/it]
 34%|███▍      | 320/942 [1:52:34<3:38:04, 21.04s/it]
                                                     
{'loss': 0.2168, 'grad_norm': 1.144967708211912, 'learning_rate': 4.1787900308349925e-06, 'epoch': 1.02}
 34%|███▍      | 320/942 [1:52:34<3:38:04, 21.04s/it]
 34%|███▍      | 321/942 [1:52:54<3:35:04, 20.78s/it]
 34%|███▍      | 322/942 [1:53:17<3:41:11, 21.40s/it]
 34%|███▍      | 323/942 [1:53:36<3:34:13, 20.76s/it]
 34%|███▍      | 324/942 [1:53:56<3:32:13, 20.61s/it]
 35%|███▍      | 325/942 [1:54:18<3:34:14, 20.83s/it]
 35%|███▍      | 326/942 [1:54:37<3:30:05, 20.46s/it]
 35%|███▍      | 327/942 [1:54:59<3:33:32, 20.83s/it]
 35%|███▍      | 328/942 [1:55:20<3:34:22, 20.95s/it]
 35%|███▍      | 329/942 [1:55:40<3:30:54, 20.64s/it]
 35%|███▌      | 330/942 [1:56:01<3:31:54, 20.77s/it]
                                                     
{'loss': 0.2035, 'grad_norm': 1.0236026242610237, 'learning_rate': 4.108941311116634e-06, 'epoch': 1.05}
 35%|███▌      | 330/942 [1:56:01<3:31:54, 20.77s/it]
 35%|███▌      | 331/942 [1:56:23<3:34:40, 21.08s/it]
 35%|███▌      | 332/942 [1:56:43<3:32:19, 20.89s/it]
 35%|███▌      | 333/942 [1:57:05<3:33:11, 21.00s/it]
 35%|███▌      | 334/942 [1:57:25<3:31:46, 20.90s/it]
 36%|███▌      | 335/942 [1:57:46<3:30:37, 20.82s/it]
 36%|███▌      | 336/942 [1:58:07<3:30:30, 20.84s/it]
 36%|███▌      | 337/942 [1:58:28<3:31:41, 20.99s/it]
 36%|███▌      | 338/942 [1:58:50<3:33:00, 21.16s/it]
 36%|███▌      | 339/942 [1:59:11<3:34:22, 21.33s/it]
 36%|███▌      | 340/942 [1:59:35<3:39:38, 21.89s/it]
                                                     
{'loss': 0.2002, 'grad_norm': 0.9838049804735927, 'learning_rate': 4.036879377760753e-06, 'epoch': 1.08}
 36%|███▌      | 340/942 [1:59:35<3:39:38, 21.89s/it]
 36%|███▌      | 341/942 [1:59:56<3:37:21, 21.70s/it]
 36%|███▋      | 342/942 [2:00:18<3:36:57, 21.70s/it]
 36%|███▋      | 343/942 [2:00:39<3:34:26, 21.48s/it]
 37%|███▋      | 344/942 [2:00:59<3:31:47, 21.25s/it]
[2025-04-16 12:58:37,261] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋      | 345/942 [2:01:22<3:36:20, 21.74s/it]
 37%|███▋      | 346/942 [2:01:43<3:33:24, 21.48s/it]
 37%|███▋      | 347/942 [2:02:04<3:31:25, 21.32s/it]
 37%|███▋      | 348/942 [2:02:24<3:26:09, 20.82s/it]
 37%|███▋      | 349/942 [2:02:45<3:27:34, 21.00s/it]
 37%|███▋      | 350/942 [2:03:08<3:32:06, 21.50s/it]
                                                     
{'loss': 0.2003, 'grad_norm': 1.058407326699511, 'learning_rate': 3.962703357101259e-06, 'epoch': 1.11}
 37%|███▋      | 350/942 [2:03:08<3:32:06, 21.50s/it]
 37%|███▋      | 351/942 [2:03:29<3:32:08, 21.54s/it]
 37%|███▋      | 352/942 [2:03:50<3:29:42, 21.33s/it]
 37%|███▋      | 353/942 [2:04:10<3:24:42, 20.85s/it]
 38%|███▊      | 354/942 [2:04:31<3:25:38, 20.98s/it]
 38%|███▊      | 355/942 [2:04:52<3:25:52, 21.04s/it]
 38%|███▊      | 356/942 [2:05:15<3:28:30, 21.35s/it]
 38%|███▊      | 357/942 [2:05:34<3:23:47, 20.90s/it]
 38%|███▊      | 358/942 [2:05:56<3:24:28, 21.01s/it]
 38%|███▊      | 359/942 [2:06:16<3:21:44, 20.76s/it]
 38%|███▊      | 360/942 [2:06:38<3:24:50, 21.12s/it]
                                                     
{'loss': 0.1936, 'grad_norm': 1.0051849342160863, 'learning_rate': 3.886515283550079e-06, 'epoch': 1.15}
 38%|███▊      | 360/942 [2:06:38<3:24:50, 21.12s/it]
 38%|███▊      | 361/942 [2:06:58<3:21:08, 20.77s/it]
 38%|███▊      | 362/942 [2:07:18<3:20:13, 20.71s/it]
 39%|███▊      | 363/942 [2:07:38<3:17:15, 20.44s/it]
 39%|███▊      | 364/942 [2:08:03<3:28:57, 21.69s/it]
 39%|███▊      | 365/942 [2:08:24<3:28:31, 21.68s/it]
 39%|███▉      | 366/942 [2:08:45<3:24:50, 21.34s/it]
 39%|███▉      | 367/942 [2:09:09<3:31:46, 22.10s/it]
 39%|███▉      | 368/942 [2:09:29<3:26:56, 21.63s/it]
 39%|███▉      | 369/942 [2:09:50<3:25:12, 21.49s/it]
 39%|███▉      | 370/942 [2:10:12<3:25:42, 21.58s/it]
                                                     
{'loss': 0.1971, 'grad_norm': 0.9771113275422963, 'learning_rate': 3.8084199592415305e-06, 'epoch': 1.18}
 39%|███▉      | 370/942 [2:10:12<3:25:42, 21.58s/it]
 39%|███▉      | 371/942 [2:10:36<3:30:13, 22.09s/it]
 39%|███▉      | 372/942 [2:10:58<3:30:48, 22.19s/it]
 40%|███▉      | 373/942 [2:11:19<3:26:18, 21.75s/it]
 40%|███▉      | 374/942 [2:11:39<3:22:58, 21.44s/it]
 40%|███▉      | 375/942 [2:12:01<3:24:18, 21.62s/it]
 40%|███▉      | 376/942 [2:12:21<3:18:09, 21.01s/it]
 40%|████      | 377/942 [2:12:43<3:20:46, 21.32s/it]
 40%|████      | 378/942 [2:13:05<3:21:12, 21.41s/it]
 40%|████      | 379/942 [2:13:26<3:20:12, 21.34s/it]
 40%|████      | 380/942 [2:13:48<3:22:34, 21.63s/it]
                                                     
{'loss': 0.1998, 'grad_norm': 1.0155984345364886, 'learning_rate': 3.7285248098695116e-06, 'epoch': 1.21}
 40%|████      | 380/942 [2:13:48<3:22:34, 21.63s/it]
 40%|████      | 381/942 [2:14:07<3:15:36, 20.92s/it]
 41%|████      | 382/942 [2:14:30<3:18:35, 21.28s/it]
 41%|████      | 383/942 [2:14:50<3:16:30, 21.09s/it]
 41%|████      | 384/942 [2:15:11<3:15:21, 21.01s/it]
 41%|████      | 385/942 [2:15:32<3:13:37, 20.86s/it]
 41%|████      | 386/942 [2:15:53<3:14:28, 20.99s/it]
 41%|████      | 387/942 [2:16:13<3:12:54, 20.85s/it]
 41%|████      | 388/942 [2:16:37<3:20:00, 21.66s/it]
 41%|████▏     | 389/942 [2:16:59<3:21:41, 21.88s/it]
 41%|████▏     | 390/942 [2:17:20<3:19:20, 21.67s/it]
                                                     
{'loss': 0.2018, 'grad_norm': 1.0632639591973354, 'learning_rate': 3.6469397369157865e-06, 'epoch': 1.24}
 41%|████▏     | 390/942 [2:17:20<3:19:20, 21.67s/it]
 42%|████▏     | 391/942 [2:17:40<3:13:35, 21.08s/it]
 42%|████▏     | 392/942 [2:18:00<3:09:19, 20.65s/it]
 42%|████▏     | 393/942 [2:18:20<3:07:48, 20.53s/it]
 42%|████▏     | 394/942 [2:18:41<3:07:26, 20.52s/it]
 42%|████▏     | 395/942 [2:19:01<3:07:32, 20.57s/it]
 42%|████▏     | 396/942 [2:19:22<3:07:48, 20.64s/it]
 42%|████▏     | 397/942 [2:19:43<3:07:16, 20.62s/it]
 42%|████▏     | 398/942 [2:20:04<3:10:07, 20.97s/it]
 42%|████▏     | 399/942 [2:20:27<3:13:06, 21.34s/it]
 42%|████▏     | 400/942 [2:20:47<3:10:43, 21.11s/it]
                                                     
{'loss': 0.191, 'grad_norm': 1.0026058535570943, 'learning_rate': 3.5637769664726492e-06, 'epoch': 1.27}
 42%|████▏     | 400/942 [2:20:47<3:10:43, 21.11s/it]
 43%|████▎     | 401/942 [2:21:09<3:11:28, 21.24s/it]
 43%|████▎     | 402/942 [2:21:30<3:10:00, 21.11s/it]
 43%|████▎     | 403/942 [2:21:50<3:08:41, 21.00s/it]
 43%|████▎     | 404/942 [2:22:11<3:06:31, 20.80s/it]
 43%|████▎     | 405/942 [2:22:32<3:06:49, 20.87s/it]
 43%|████▎     | 406/942 [2:22:54<3:10:02, 21.27s/it]
 43%|████▎     | 407/942 [2:23:16<3:11:31, 21.48s/it]
 43%|████▎     | 408/942 [2:23:36<3:08:38, 21.20s/it]
 43%|████▎     | 409/942 [2:23:56<3:04:06, 20.73s/it]
 44%|████▎     | 410/942 [2:24:16<3:00:31, 20.36s/it]
                                                     
{'loss': 0.1993, 'grad_norm': 0.9813703469203191, 'learning_rate': 3.4791508948679263e-06, 'epoch': 1.31}
 44%|████▎     | 410/942 [2:24:16<3:00:31, 20.36s/it]
 44%|████▎     | 411/942 [2:24:36<3:01:00, 20.45s/it]
 44%|████▎     | 412/942 [2:24:56<2:57:43, 20.12s/it]
 44%|████▍     | 413/942 [2:25:17<3:01:07, 20.54s/it]
 44%|████▍     | 414/942 [2:25:37<2:59:45, 20.43s/it]
 44%|████▍     | 415/942 [2:26:00<3:04:21, 20.99s/it]
 44%|████▍     | 416/942 [2:26:20<3:02:51, 20.86s/it]
 44%|████▍     | 417/942 [2:26:42<3:04:44, 21.11s/it]
 44%|████▍     | 418/942 [2:27:04<3:06:28, 21.35s/it]
 44%|████▍     | 419/942 [2:27:23<3:01:57, 20.88s/it]
 45%|████▍     | 420/942 [2:27:45<3:02:47, 21.01s/it]
                                                     
{'loss': 0.2005, 'grad_norm': 0.9818031465742455, 'learning_rate': 3.3931779313046575e-06, 'epoch': 1.34}
 45%|████▍     | 420/942 [2:27:45<3:02:47, 21.01s/it]
 45%|████▍     | 421/942 [2:28:07<3:05:02, 21.31s/it]
 45%|████▍     | 422/942 [2:28:28<3:05:05, 21.36s/it]
 45%|████▍     | 423/942 [2:28:49<3:02:43, 21.12s/it]
 45%|████▌     | 424/942 [2:29:10<3:03:01, 21.20s/it]
 45%|████▌     | 425/942 [2:29:31<3:02:34, 21.19s/it]
 45%|████▌     | 426/942 [2:29:55<3:09:15, 22.01s/it]
 45%|████▌     | 427/942 [2:30:17<3:06:57, 21.78s/it]
 45%|████▌     | 428/942 [2:30:38<3:06:43, 21.80s/it]
 46%|████▌     | 429/942 [2:31:01<3:07:52, 21.97s/it]
 46%|████▌     | 430/942 [2:31:24<3:09:34, 22.22s/it]
                                                     
{'loss': 0.1956, 'grad_norm': 1.0495956264919946, 'learning_rate': 3.3059763377319294e-06, 'epoch': 1.37}
 46%|████▌     | 430/942 [2:31:24<3:09:34, 22.22s/it]
 46%|████▌     | 431/942 [2:31:45<3:06:16, 21.87s/it]
 46%|████▌     | 432/942 [2:32:05<3:03:22, 21.57s/it]
 46%|████▌     | 433/942 [2:32:27<3:02:30, 21.51s/it]
 46%|████▌     | 434/942 [2:32:47<2:57:48, 21.00s/it]
 46%|████▌     | 435/942 [2:33:09<2:59:51, 21.29s/it]
 46%|████▋     | 436/942 [2:33:30<2:59:57, 21.34s/it]
 46%|████▋     | 437/942 [2:33:51<2:58:26, 21.20s/it]
 46%|████▋     | 438/942 [2:34:12<2:58:53, 21.30s/it]
 47%|████▋     | 439/942 [2:34:36<3:03:12, 21.85s/it]
 47%|████▋     | 440/942 [2:34:55<2:57:40, 21.24s/it]
                                                     
{'loss': 0.1943, 'grad_norm': 1.0256642693973994, 'learning_rate': 3.2176660661671167e-06, 'epoch': 1.4}
 47%|████▋     | 440/942 [2:34:55<2:57:40, 21.24s/it]
 47%|████▋     | 441/942 [2:35:16<2:54:38, 20.92s/it]
 47%|████▋     | 442/942 [2:35:35<2:51:10, 20.54s/it]
 47%|████▋     | 443/942 [2:35:54<2:46:09, 19.98s/it]
 47%|████▋     | 444/942 [2:36:13<2:43:57, 19.75s/it]
 47%|████▋     | 445/942 [2:36:34<2:47:11, 20.18s/it]
 47%|████▋     | 446/942 [2:36:55<2:47:02, 20.21s/it]
 47%|████▋     | 447/942 [2:37:15<2:47:59, 20.36s/it]
 48%|████▊     | 448/942 [2:37:37<2:50:18, 20.68s/it]
 48%|████▊     | 449/942 [2:37:56<2:46:52, 20.31s/it]
 48%|████▊     | 450/942 [2:38:17<2:48:35, 20.56s/it]
                                                     
{'loss': 0.1948, 'grad_norm': 1.0251425456384748, 'learning_rate': 3.128368593693325e-06, 'epoch': 1.43}
 48%|████▊     | 450/942 [2:38:17<2:48:35, 20.56s/it]
 48%|████▊     | 451/942 [2:38:39<2:50:25, 20.83s/it]
 48%|████▊     | 452/942 [2:39:00<2:51:10, 20.96s/it]
 48%|████▊     | 453/942 [2:39:20<2:48:57, 20.73s/it]
 48%|████▊     | 454/942 [2:39:40<2:46:48, 20.51s/it]
 48%|████▊     | 455/942 [2:40:01<2:47:38, 20.65s/it]
 48%|████▊     | 456/942 [2:40:21<2:46:11, 20.52s/it]
 49%|████▊     | 457/942 [2:40:42<2:46:08, 20.55s/it]
 49%|████▊     | 458/942 [2:41:03<2:45:53, 20.56s/it]
 49%|████▊     | 459/942 [2:41:24<2:46:18, 20.66s/it]
 49%|████▉     | 460/942 [2:41:46<2:50:20, 21.20s/it]
                                                     
{'loss': 0.196, 'grad_norm': 0.9594187292777461, 'learning_rate': 3.0382067553589866e-06, 'epoch': 1.46}
 49%|████▉     | 460/942 [2:41:46<2:50:20, 21.20s/it]
 49%|████▉     | 461/942 [2:42:07<2:49:28, 21.14s/it]
 49%|████▉     | 462/942 [2:42:27<2:47:28, 20.93s/it]
 49%|████▉     | 463/942 [2:42:50<2:52:06, 21.56s/it]
 49%|████▉     | 464/942 [2:43:11<2:48:52, 21.20s/it]
 49%|████▉     | 465/942 [2:43:33<2:50:51, 21.49s/it]
 49%|████▉     | 466/942 [2:43:53<2:47:14, 21.08s/it]
 50%|████▉     | 467/942 [2:44:15<2:48:01, 21.22s/it]
 50%|████▉     | 468/942 [2:44:35<2:45:09, 20.91s/it]
 50%|████▉     | 469/942 [2:44:57<2:47:14, 21.22s/it]
 50%|████▉     | 470/942 [2:45:17<2:44:47, 20.95s/it]
                                                     
{'loss': 0.1973, 'grad_norm': 1.0043495123570285, 'learning_rate': 2.947304575209482e-06, 'epoch': 1.5}
 50%|█████     | 471/942 [2:45:38<2:44:39, 20.98s/it]
 50%|█████     | 472/942 [2:45:59<2:45:08, 21.08s/it]
 50%|█████     | 473/942 [2:46:21<2:46:40, 21.32s/it]
 50%|█████     | 474/942 [2:46:43<2:46:34, 21.36s/it]
 50%|█████     | 475/942 [2:47:03<2:42:41, 20.90s/it]
 51%|█████     | 476/942 [2:47:23<2:40:14, 20.63s/it]
 51%|█████     | 477/942 [2:47:43<2:38:07, 20.40s/it]
 51%|█████     | 478/942 [2:48:05<2:43:43, 21.17s/it]
 51%|█████     | 479/942 [2:48:26<2:42:52, 21.11s/it]
 51%|█████     | 480/942 [2:48:47<2:41:16, 20.95s/it]
                                                     
{'loss': 0.1962, 'grad_norm': 1.0191505736978375, 'learning_rate': 2.8557870956832135e-06, 'epoch': 1.53}
 51%|█████     | 481/942 [2:49:08<2:39:54, 20.81s/it]
 51%|█████     | 482/942 [2:49:29<2:40:56, 20.99s/it]
 51%|█████▏    | 483/942 [2:49:49<2:39:32, 20.85s/it]
 51%|█████▏    | 484/942 [2:50:12<2:42:00, 21.22s/it]
 51%|█████▏    | 485/942 [2:50:32<2:39:17, 20.91s/it]
[2025-04-16 13:48:09,919] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 52%|█████▏    | 486/942 [2:50:55<2:43:58, 21.58s/it]
 52%|█████▏    | 487/942 [2:51:17<2:44:18, 21.67s/it]
 52%|█████▏    | 488/942 [2:51:39<2:45:40, 21.90s/it]
 52%|█████▏    | 489/942 [2:52:01<2:45:30, 21.92s/it]
 52%|█████▏    | 490/942 [2:52:21<2:39:50, 21.22s/it]
                                                     
{'loss': 0.1942, 'grad_norm': 1.0271692250434223, 'learning_rate': 2.763780205606802e-06, 'epoch': 1.56}
 52%|█████▏    | 491/942 [2:52:41<2:36:44, 20.85s/it]
 52%|█████▏    | 492/942 [2:53:02<2:37:54, 21.06s/it]
 52%|█████▏    | 493/942 [2:53:24<2:38:02, 21.12s/it]
 52%|█████▏    | 494/942 [2:53:44<2:36:49, 21.00s/it]
 53%|█████▎    | 495/942 [2:54:05<2:36:02, 20.94s/it]
 53%|█████▎    | 496/942 [2:54:25<2:33:34, 20.66s/it]
 53%|█████▎    | 497/942 [2:54:46<2:33:38, 20.72s/it]
 53%|█████▎    | 498/942 [2:55:07<2:34:05, 20.82s/it]
 53%|█████▎    | 499/942 [2:55:26<2:29:41, 20.27s/it]
 53%|█████▎    | 500/942 [2:55:46<2:28:26, 20.15s/it]
                                                     
{'loss': 0.1949, 'grad_norm': 1.098802732306874, 'learning_rate': 2.671410467026021e-06, 'epoch': 1.59}
 53%|█████▎    | 501/942 [2:56:08<2:33:02, 20.82s/it]
 53%|█████▎    | 502/942 [2:56:32<2:38:29, 21.61s/it]
 53%|█████▎    | 503/942 [2:56:52<2:35:47, 21.29s/it]
 54%|█████▎    | 504/942 [2:57:13<2:34:40, 21.19s/it]
 54%|█████▎    | 505/942 [2:57:34<2:32:54, 20.99s/it]
 54%|█████▎    | 506/942 [2:57:57<2:38:11, 21.77s/it]
 54%|█████▍    | 507/942 [2:58:17<2:32:48, 21.08s/it]
 54%|█████▍    | 508/942 [2:58:39<2:36:04, 21.58s/it]
 54%|█████▍    | 509/942 [2:59:01<2:34:33, 21.42s/it]
 54%|█████▍    | 510/942 [2:59:22<2:34:18, 21.43s/it]
                                                     
{'loss': 0.1922, 'grad_norm': 0.9821189768702802, 'learning_rate': 2.5788049411106642e-06, 'epoch': 1.62}
 54%|█████▍    | 511/942 [2:59:44<2:34:44, 21.54s/it]
 54%|█████▍    | 512/942 [3:00:05<2:32:44, 21.31s/it]
 54%|█████▍    | 513/942 [3:00:28<2:36:20, 21.87s/it]
 55%|█████▍    | 514/942 [3:00:49<2:34:59, 21.73s/it]
 55%|█████▍    | 515/942 [3:01:11<2:35:09, 21.80s/it]
 55%|█████▍    | 516/942 [3:01:33<2:34:37, 21.78s/it]
 55%|█████▍    | 517/942 [3:01:55<2:34:47, 21.85s/it]
 55%|█████▍    | 518/942 [3:02:15<2:30:10, 21.25s/it]
 55%|█████▌    | 519/942 [3:02:38<2:33:10, 21.73s/it]
 55%|█████▌    | 520/942 [3:02:58<2:30:25, 21.39s/it]
                                                     
{'loss': 0.1933, 'grad_norm': 1.036747916273735, 'learning_rate': 2.486091013372839e-06, 'epoch': 1.66}
 55%|█████▌    | 521/942 [3:03:19<2:28:39, 21.19s/it]
 55%|█████▌    | 522/942 [3:03:43<2:35:20, 22.19s/it]
 56%|█████▌    | 523/942 [3:04:06<2:36:16, 22.38s/it]
 56%|█████▌    | 524/942 [3:04:26<2:31:12, 21.70s/it]
 56%|█████▌    | 525/942 [3:04:49<2:32:57, 22.01s/it]
 56%|█████▌    | 526/942 [3:05:14<2:38:31, 22.86s/it]
 56%|█████▌    | 527/942 [3:05:35<2:35:02, 22.42s/it]
 56%|█████▌    | 528/942 [3:05:56<2:30:11, 21.77s/it]
 56%|█████▌    | 529/942 [3:06:16<2:26:36, 21.30s/it]
 56%|█████▋    | 530/942 [3:06:39<2:29:30, 21.77s/it]
                                                     
{'loss': 0.1916, 'grad_norm': 0.9343845446431677, 'learning_rate': 2.3933962184390967e-06, 'epoch': 1.69}
 56%|█████▋    | 531/942 [3:06:59<2:25:32, 21.25s/it]
 56%|█████▋    | 532/942 [3:07:19<2:23:29, 21.00s/it]
 57%|█████▋    | 533/942 [3:07:39<2:20:54, 20.67s/it]
 57%|█████▋    | 534/942 [3:08:01<2:23:47, 21.14s/it]
 57%|█████▋    | 535/942 [3:08:22<2:22:12, 20.96s/it]
 57%|█████▋    | 536/942 [3:08:42<2:20:13, 20.72s/it]
 57%|█████▋    | 537/942 [3:09:03<2:20:08, 20.76s/it]
 57%|█████▋    | 538/942 [3:09:24<2:21:15, 20.98s/it]
 57%|█████▋    | 539/942 [3:09:45<2:20:52, 20.97s/it]
 57%|█████▋    | 540/942 [3:10:05<2:17:54, 20.58s/it]
                                                     
{'loss': 0.1978, 'grad_norm': 1.0748988284743868, 'learning_rate': 2.3008480646174535e-06, 'epoch': 1.72}
 57%|█████▋    | 541/942 [3:10:27<2:21:08, 21.12s/it]
 58%|█████▊    | 542/942 [3:10:48<2:19:34, 20.94s/it]
 58%|█████▊    | 543/942 [3:11:09<2:19:47, 21.02s/it]
 58%|█████▊    | 544/942 [3:11:30<2:19:54, 21.09s/it]
 58%|█████▊    | 545/942 [3:11:52<2:19:51, 21.14s/it]
 58%|█████▊    | 546/942 [3:12:12<2:18:17, 20.95s/it]
 58%|█████▊    | 547/942 [3:12:34<2:19:57, 21.26s/it]
 58%|█████▊    | 548/942 [3:12:55<2:18:09, 21.04s/it]
 58%|█████▊    | 549/942 [3:13:17<2:21:01, 21.53s/it]
 58%|█████▊    | 550/942 [3:13:39<2:21:17, 21.63s/it]
                                                     
{'loss': 0.1882, 'grad_norm': 0.9595731311112887, 'learning_rate': 2.2085738585006026e-06, 'epoch': 1.75}
 58%|█████▊    | 551/942 [3:13:59<2:17:45, 21.14s/it]
 59%|█████▊    | 552/942 [3:14:18<2:13:51, 20.59s/it]
 59%|█████▊    | 553/942 [3:14:41<2:17:50, 21.26s/it]
 59%|█████▉    | 554/942 [3:15:02<2:16:12, 21.06s/it]
 59%|█████▉    | 555/942 [3:15:23<2:16:17, 21.13s/it]
 59%|█████▉    | 556/942 [3:15:43<2:14:08, 20.85s/it]
 59%|█████▉    | 557/942 [3:16:04<2:13:37, 20.82s/it]
 59%|█████▉    | 558/942 [3:16:24<2:11:30, 20.55s/it]
 59%|█████▉    | 559/942 [3:16:46<2:13:14, 20.87s/it]
 59%|█████▉    | 560/942 [3:17:06<2:11:06, 20.59s/it]
                                                     
{'loss': 0.2011, 'grad_norm': 0.9729689174239391, 'learning_rate': 2.1167005298466155e-06, 'epoch': 1.78}
 60%|█████▉    | 561/942 [3:17:26<2:09:57, 20.47s/it]
 60%|█████▉    | 562/942 [3:17:48<2:13:17, 21.05s/it]
 60%|█████▉    | 563/942 [3:18:09<2:11:43, 20.85s/it]
 60%|█████▉    | 564/942 [3:18:29<2:10:56, 20.79s/it]
 60%|█████▉    | 565/942 [3:18:50<2:11:02, 20.85s/it]
 60%|██████    | 566/942 [3:19:11<2:10:36, 20.84s/it]
 60%|██████    | 567/942 [3:19:32<2:09:44, 20.76s/it]
[2025-04-16 14:17:10,286] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 60%|██████    | 568/942 [3:19:55<2:14:54, 21.64s/it]
 60%|██████    | 569/942 [3:20:16<2:13:47, 21.52s/it]
 61%|██████    | 570/942 [3:20:37<2:12:04, 21.30s/it]
                                                     
{'loss': 0.1923, 'grad_norm': 1.0280182893817564, 'learning_rate': 2.0253544569779936e-06, 'epoch': 1.82}
 61%|██████    | 571/942 [3:20:57<2:09:23, 20.93s/it]
 61%|██████    | 572/942 [3:21:20<2:12:36, 21.51s/it]
 61%|██████    | 573/942 [3:21:41<2:10:15, 21.18s/it]
 61%|██████    | 574/942 [3:22:02<2:09:35, 21.13s/it]
 61%|██████    | 575/942 [3:22:25<2:12:47, 21.71s/it]
 61%|██████    | 576/942 [3:22:45<2:10:39, 21.42s/it]
 61%|██████▏   | 577/942 [3:23:06<2:08:06, 21.06s/it]
 61%|██████▏   | 578/942 [3:23:28<2:09:38, 21.37s/it]
 61%|██████▏   | 579/942 [3:23:49<2:09:00, 21.32s/it]
 62%|██████▏   | 580/942 [3:24:09<2:06:24, 20.95s/it]
                                                     
{'loss': 0.1973, 'grad_norm': 1.0076791876687736, 'learning_rate': 1.9346612929392635e-06, 'epoch': 1.85}
 62%|██████▏   | 581/942 [3:24:30<2:05:28, 20.85s/it]
 62%|██████▏   | 582/942 [3:24:52<2:07:54, 21.32s/it]
 62%|██████▏   | 583/942 [3:25:15<2:09:50, 21.70s/it]
 62%|██████▏   | 584/942 [3:25:38<2:11:38, 22.06s/it]
 62%|██████▏   | 585/942 [3:25:59<2:09:21, 21.74s/it]
 62%|██████▏   | 586/942 [3:26:19<2:06:57, 21.40s/it]
[2025-04-16 14:23:56,722] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 62%|██████▏   | 587/942 [3:26:42<2:08:37, 21.74s/it]
 62%|██████▏   | 588/942 [3:27:05<2:10:14, 22.07s/it]
 63%|██████▎   | 589/942 [3:27:26<2:08:09, 21.78s/it]
 63%|██████▎   | 590/942 [3:27:47<2:07:51, 21.80s/it]
                                                     
{'loss': 0.1961, 'grad_norm': 0.947195588201668, 'learning_rate': 1.8447457926522454e-06, 'epoch': 1.88}
 63%|██████▎   | 591/942 [3:28:08<2:05:11, 21.40s/it]
 63%|██████▎   | 592/942 [3:28:29<2:04:33, 21.35s/it]
 63%|██████▎   | 593/942 [3:28:51<2:04:55, 21.48s/it]
 63%|██████▎   | 594/942 [3:29:11<2:02:50, 21.18s/it]
 63%|██████▎   | 595/942 [3:29:32<2:00:41, 20.87s/it]
 63%|██████▎   | 596/942 [3:29:54<2:03:24, 21.40s/it]
 63%|██████▎   | 597/942 [3:30:14<2:00:31, 20.96s/it]
 63%|██████▎   | 598/942 [3:30:36<2:01:38, 21.22s/it]
 64%|██████▎   | 599/942 [3:30:57<2:00:26, 21.07s/it]
 64%|██████▎   | 600/942 [3:31:17<1:58:13, 20.74s/it]
                                                     
{'loss': 0.1909, 'grad_norm': 0.976068232147744, 'learning_rate': 1.7557316413067488e-06, 'epoch': 1.91}
 64%|██████▍   | 601/942 [3:31:39<1:59:58, 21.11s/it]
 64%|██████▍   | 602/942 [3:32:00<1:59:37, 21.11s/it]
 64%|██████▍   | 603/942 [3:32:19<1:56:39, 20.65s/it]
 64%|██████▍   | 604/942 [3:32:41<1:57:21, 20.83s/it]
 64%|██████▍   | 605/942 [3:33:00<1:55:08, 20.50s/it]
 64%|██████▍   | 606/942 [3:33:21<1:54:20, 20.42s/it]
 64%|██████▍   | 607/942 [3:33:41<1:54:21, 20.48s/it]
 65%|██████▍   | 608/942 [3:34:02<1:54:34, 20.58s/it]
 65%|██████▍   | 609/942 [3:34:23<1:55:35, 20.83s/it]
 65%|██████▍   | 610/942 [3:34:44<1:55:11, 20.82s/it]
                                                     
{'loss': 0.1918, 'grad_norm': 0.9962388683153057, 'learning_rate': 1.667741284222768e-06, 'epoch': 1.94}
 65%|██████▍   | 611/942 [3:35:06<1:56:33, 21.13s/it]
 65%|██████▍   | 612/942 [3:35:26<1:54:09, 20.76s/it]
 65%|██████▌   | 613/942 [3:35:45<1:51:53, 20.41s/it]
 65%|██████▌   | 614/942 [3:36:07<1:52:49, 20.64s/it]
 65%|██████▌   | 615/942 [3:36:27<1:52:02, 20.56s/it]
 65%|██████▌   | 616/942 [3:36:49<1:54:39, 21.10s/it]
 65%|██████▌   | 617/942 [3:37:11<1:55:44, 21.37s/it]
 66%|██████▌   | 618/942 [3:37:31<1:52:08, 20.77s/it]
 66%|██████▌   | 619/942 [3:37:54<1:55:10, 21.40s/it]
 66%|██████▌   | 620/942 [3:38:14<1:52:55, 21.04s/it]
                                                     
{'loss': 0.1878, 'grad_norm': 0.9591329135329931, 'learning_rate': 1.5808957584181997e-06, 'epoch': 1.97}
 66%|██████▌   | 621/942 [3:38:34<1:50:46, 20.71s/it]
 66%|██████▌   | 622/942 [3:38:54<1:50:00, 20.63s/it]
 66%|██████▌   | 623/942 [3:39:15<1:49:29, 20.59s/it]
 66%|██████▌   | 624/942 [3:39:36<1:50:34, 20.86s/it]
 66%|██████▋   | 625/942 [3:39:56<1:48:22, 20.51s/it]
 66%|██████▋   | 626/942 [3:40:18<1:50:19, 20.95s/it]
 67%|██████▋   | 627/942 [3:40:39<1:50:42, 21.09s/it]
 67%|██████▋   | 628/942 [3:41:01<1:50:32, 21.12s/it]
 67%|██████▋   | 629/942 [3:41:22<1:51:02, 21.29s/it]
 67%|██████▋   | 630/942 [3:41:42<1:48:44, 20.91s/it]
                                                     
{'loss': 0.1782, 'grad_norm': 1.135346852234509, 'learning_rate': 1.4953145261137869e-06, 'epoch': 2.01}
 67%|██████▋   | 631/942 [3:42:03<1:48:18, 20.90s/it]
 67%|██████▋   | 632/942 [3:42:24<1:48:28, 20.99s/it]
 67%|██████▋   | 633/942 [3:42:45<1:48:20, 21.04s/it]
 67%|██████▋   | 634/942 [3:43:07<1:48:07, 21.06s/it]
 67%|██████▋   | 635/942 [3:43:28<1:47:53, 21.09s/it]
 68%|██████▊   | 636/942 [3:43:50<1:48:40, 21.31s/it]
 68%|██████▊   | 637/942 [3:44:10<1:46:25, 20.94s/it]
 68%|██████▊   | 638/942 [3:44:31<1:46:31, 21.03s/it]
 68%|██████▊   | 639/942 [3:44:52<1:45:46, 20.95s/it]
 68%|██████▊   | 640/942 [3:45:13<1:46:52, 21.23s/it]
                                                     
{'loss': 0.1304, 'grad_norm': 1.2077990470652227, 'learning_rate': 1.4111153104042994e-06, 'epoch': 2.04}
 68%|██████▊   | 641/942 [3:45:36<1:47:53, 21.51s/it]
 68%|██████▊   | 642/942 [3:45:58<1:49:20, 21.87s/it]
 68%|██████▊   | 643/942 [3:46:19<1:47:25, 21.56s/it]
 68%|██████▊   | 644/942 [3:46:39<1:45:04, 21.16s/it]
 68%|██████▊   | 645/942 [3:47:01<1:44:43, 21.16s/it]
 69%|██████▊   | 646/942 [3:47:23<1:46:30, 21.59s/it]
 69%|██████▊   | 647/942 [3:47:45<1:46:48, 21.72s/it]
 69%|██████▉   | 648/942 [3:48:07<1:46:43, 21.78s/it]
 69%|██████▉   | 649/942 [3:48:27<1:44:16, 21.35s/it]
 69%|██████▉   | 650/942 [3:48:50<1:45:59, 21.78s/it]
                                                     
{'loss': 0.1236, 'grad_norm': 1.060841396519769, 'learning_rate': 1.3284139333220209e-06, 'epoch': 2.07}
 69%|██████▉   | 651/942 [3:49:10<1:42:33, 21.15s/it]
 69%|██████▉   | 652/942 [3:49:31<1:41:40, 21.03s/it]
 69%|██████▉   | 653/942 [3:49:53<1:43:45, 21.54s/it]
 69%|██████▉   | 654/942 [3:50:16<1:44:58, 21.87s/it]
 70%|██████▉   | 655/942 [3:50:37<1:43:39, 21.67s/it]
 70%|██████▉   | 656/942 [3:50:57<1:39:53, 20.96s/it]
 70%|██████▉   | 657/942 [3:51:17<1:38:52, 20.82s/it]
 70%|██████▉   | 658/942 [3:51:38<1:39:20, 20.99s/it]
 70%|██████▉   | 659/942 [3:51:58<1:36:31, 20.46s/it]
 70%|███████   | 660/942 [3:52:20<1:38:11, 20.89s/it]
                                                     
{'loss': 0.1271, 'grad_norm': 1.0286782581096483, 'learning_rate': 1.247324156515271e-06, 'epoch': 2.1}
 70%|███████   | 661/942 [3:52:41<1:38:22, 21.01s/it]
 70%|███████   | 662/942 [3:53:01<1:36:28, 20.67s/it]
 70%|███████   | 663/942 [3:53:21<1:35:41, 20.58s/it]
 70%|███████   | 664/942 [3:53:41<1:34:11, 20.33s/it]
 71%|███████   | 665/942 [3:54:02<1:34:39, 20.50s/it]
 71%|███████   | 666/942 [3:54:23<1:35:54, 20.85s/it]
 71%|███████   | 667/942 [3:54:47<1:39:49, 21.78s/it]
 71%|███████   | 668/942 [3:55:08<1:38:23, 21.55s/it]
 71%|███████   | 669/942 [3:55:29<1:37:22, 21.40s/it]
 71%|███████   | 670/942 [3:55:51<1:36:54, 21.38s/it]
                                                     
{'loss': 0.1323, 'grad_norm': 1.0216963775320216, 'learning_rate': 1.167957524761134e-06, 'epoch': 2.13}
 71%|███████   | 671/942 [3:56:11<1:35:26, 21.13s/it]
 71%|███████▏  | 672/942 [3:56:32<1:34:05, 20.91s/it]
 71%|███████▏  | 673/942 [3:56:52<1:33:15, 20.80s/it]
 72%|███████▏  | 674/942 [3:57:13<1:32:34, 20.73s/it]
 72%|███████▏  | 675/942 [3:57:33<1:31:20, 20.53s/it]
 72%|███████▏  | 676/942 [3:57:53<1:30:04, 20.32s/it]
 72%|███████▏  | 677/942 [3:58:13<1:29:50, 20.34s/it]
 72%|███████▏  | 678/942 [3:58:32<1:28:01, 20.00s/it]
 72%|███████▏  | 679/942 [3:58:53<1:28:47, 20.26s/it]
 72%|███████▏  | 680/942 [3:59:13<1:28:26, 20.25s/it]
                                                     
{'loss': 0.1276, 'grad_norm': 1.0307734500256753, 'learning_rate': 1.090423212527661e-06, 'epoch': 2.17}
 72%|███████▏  | 681/942 [3:59:34<1:28:27, 20.34s/it]
 72%|███████▏  | 682/942 [3:59:54<1:27:38, 20.23s/it]
 73%|███████▎  | 683/942 [4:00:15<1:28:37, 20.53s/it]
 73%|███████▎  | 684/942 [4:00:37<1:30:10, 20.97s/it]
 73%|███████▎  | 685/942 [4:01:00<1:31:42, 21.41s/it]
 73%|███████▎  | 686/942 [4:01:21<1:31:19, 21.40s/it]
 73%|███████▎  | 687/942 [4:01:41<1:29:25, 21.04s/it]
 73%|███████▎  | 688/942 [4:02:02<1:28:29, 20.91s/it]
 73%|███████▎  | 689/942 [4:02:25<1:30:50, 21.54s/it]
 73%|███████▎  | 690/942 [4:02:47<1:31:57, 21.89s/it]
                                                     
{'loss': 0.1302, 'grad_norm': 1.007986640155223, 'learning_rate': 1.0148278737965845e-06, 'epoch': 2.2}
 73%|███████▎  | 691/942 [4:03:09<1:31:23, 21.84s/it]
 73%|███████▎  | 692/942 [4:03:29<1:28:25, 21.22s/it]
 74%|███████▎  | 693/942 [4:03:49<1:26:13, 20.78s/it]
 74%|███████▎  | 694/942 [4:04:11<1:27:44, 21.23s/it]
 74%|███████▍  | 695/942 [4:04:33<1:27:55, 21.36s/it]
 74%|███████▍  | 696/942 [4:04:54<1:27:02, 21.23s/it]
 74%|███████▍  | 697/942 [4:05:13<1:24:48, 20.77s/it]
 74%|███████▍  | 698/942 [4:05:35<1:25:46, 21.09s/it]
 74%|███████▍  | 699/942 [4:05:56<1:24:43, 20.92s/it]
 74%|███████▍  | 700/942 [4:06:19<1:27:06, 21.60s/it]
                                                     
{'loss': 0.1228, 'grad_norm': 1.046919754858388, 'learning_rate': 9.412754953531664e-07, 'epoch': 2.23}
 74%|███████▍  | 701/942 [4:06:41<1:26:59, 21.66s/it]
 75%|███████▍  | 702/942 [4:07:02<1:25:54, 21.48s/it]
 75%|███████▍  | 703/942 [4:07:22<1:23:36, 20.99s/it]
 75%|███████▍  | 704/942 [4:07:44<1:24:48, 21.38s/it]
 75%|███████▍  | 705/942 [4:08:06<1:25:33, 21.66s/it]
 75%|███████▍  | 706/942 [4:08:27<1:24:03, 21.37s/it]
 75%|███████▌  | 707/942 [4:08:49<1:24:08, 21.48s/it]
 75%|███████▌  | 708/942 [4:09:09<1:22:24, 21.13s/it]
 75%|███████▌  | 709/942 [4:09:30<1:22:09, 21.15s/it]
 75%|███████▌  | 710/942 [4:09:52<1:22:27, 21.32s/it]
                                                     
{'loss': 0.1279, 'grad_norm': 1.0233138325962319, 'learning_rate': 8.698672537449385e-07, 'epoch': 2.26}
 75%|███████▌  | 711/942 [4:10:12<1:21:21, 21.13s/it]
 76%|███████▌  | 712/942 [4:10:33<1:19:48, 20.82s/it]
 76%|███████▌  | 713/942 [4:10:54<1:20:06, 20.99s/it]
 76%|███████▌  | 714/942 [4:11:15<1:20:04, 21.07s/it]
 76%|███████▌  | 715/942 [4:11:37<1:20:08, 21.18s/it]
 76%|███████▌  | 716/942 [4:11:57<1:18:39, 20.88s/it]
 76%|███████▌  | 717/942 [4:12:20<1:20:19, 21.42s/it]
 76%|███████▌  | 718/942 [4:12:40<1:18:48, 21.11s/it]
 76%|███████▋  | 719/942 [4:13:02<1:19:02, 21.27s/it]
 76%|███████▋  | 720/942 [4:13:23<1:18:36, 21.25s/it]
                                                     
{'loss': 0.1263, 'grad_norm': 1.0362568234780745, 'learning_rate': 8.00701376106148e-07, 'epoch': 2.29}
 77%|███████▋  | 721/942 [4:13:45<1:18:59, 21.45s/it]
 77%|███████▋  | 722/942 [4:14:06<1:18:36, 21.44s/it]
 77%|███████▋  | 723/942 [4:14:29<1:19:41, 21.83s/it]
 77%|███████▋  | 724/942 [4:14:50<1:18:30, 21.61s/it]
 77%|███████▋  | 725/942 [4:15:12<1:18:08, 21.61s/it]
 77%|███████▋  | 726/942 [4:15:34<1:18:15, 21.74s/it]
 77%|███████▋  | 727/942 [4:15:54<1:16:09, 21.25s/it]
 77%|███████▋  | 728/942 [4:16:14<1:15:19, 21.12s/it]
 77%|███████▋  | 729/942 [4:16:34<1:13:08, 20.60s/it]
 77%|███████▋  | 730/942 [4:16:53<1:11:11, 20.15s/it]
                                                     
{'loss': 0.1285, 'grad_norm': 1.1087150570865707, 'learning_rate': 7.338730050393114e-07, 'epoch': 2.32}
 78%|███████▊  | 731/942 [4:17:14<1:11:54, 20.45s/it]
 78%|███████▊  | 732/942 [4:17:35<1:12:30, 20.72s/it]
 78%|███████▊  | 733/942 [4:17:56<1:12:29, 20.81s/it]
 78%|███████▊  | 734/942 [4:18:17<1:12:19, 20.86s/it]
 78%|███████▊  | 735/942 [4:18:39<1:12:31, 21.02s/it]
 78%|███████▊  | 736/942 [4:19:00<1:12:18, 21.06s/it]
 78%|███████▊  | 737/942 [4:19:21<1:11:23, 20.90s/it]
 78%|███████▊  | 738/942 [4:19:41<1:10:20, 20.69s/it]
 78%|███████▊  | 739/942 [4:20:01<1:09:44, 20.61s/it]
 79%|███████▊  | 740/942 [4:20:23<1:10:30, 20.94s/it]
                                                     
{'loss': 0.1259, 'grad_norm': 0.9761113684584649, 'learning_rate': 6.694740677397846e-07, 'epoch': 2.36}
 79%|███████▊  | 741/942 [4:20:46<1:12:40, 21.69s/it]
 79%|███████▉  | 742/942 [4:21:07<1:11:42, 21.51s/it]
 79%|███████▉  | 743/942 [4:21:27<1:09:34, 20.98s/it]
 79%|███████▉  | 744/942 [4:21:49<1:10:02, 21.22s/it]
 79%|███████▉  | 745/942 [4:22:09<1:08:16, 20.80s/it]
 79%|███████▉  | 746/942 [4:22:29<1:07:22, 20.63s/it]
 79%|███████▉  | 747/942 [4:22:49<1:06:51, 20.57s/it]
 79%|███████▉  | 748/942 [4:23:09<1:05:46, 20.34s/it]
 80%|███████▉  | 749/942 [4:23:30<1:06:04, 20.54s/it]
 80%|███████▉  | 750/942 [4:23:51<1:05:31, 20.48s/it]
                                                     
{'loss': 0.1233, 'grad_norm': 1.0259495940338315, 'learning_rate': 6.075931495433316e-07, 'epoch': 2.39}
 80%|███████▉  | 751/942 [4:24:12<1:06:05, 20.76s/it]
 80%|███████▉  | 752/942 [4:24:34<1:07:05, 21.19s/it]
 80%|███████▉  | 753/942 [4:24:55<1:06:22, 21.07s/it]
 80%|████████  | 754/942 [4:25:16<1:05:59, 21.06s/it]
 80%|████████  | 755/942 [4:25:37<1:05:45, 21.10s/it]
 80%|████████  | 756/942 [4:25:57<1:04:07, 20.69s/it]
 80%|████████  | 757/942 [4:26:19<1:04:38, 20.97s/it]
 80%|████████  | 758/942 [4:26:40<1:05:00, 21.20s/it]
 81%|████████  | 759/942 [4:27:00<1:03:26, 20.80s/it]
 81%|████████  | 760/942 [4:27:21<1:02:53, 20.73s/it]
                                                     
{'loss': 0.1234, 'grad_norm': 1.0323050185732705, 'learning_rate': 5.483153720706799e-07, 'epoch': 2.42}
 81%|████████  | 761/942 [4:27:41<1:02:32, 20.73s/it]
 81%|████████  | 762/942 [4:28:02<1:02:26, 20.81s/it]
 81%|████████  | 763/942 [4:28:23<1:01:59, 20.78s/it]
 81%|████████  | 764/942 [4:28:43<1:01:15, 20.65s/it]
 81%|████████  | 765/942 [4:29:05<1:01:40, 20.91s/it]
 81%|████████▏ | 766/942 [4:29:24<1:00:01, 20.46s/it]
 81%|████████▏ | 767/942 [4:29:45<59:59, 20.57s/it]  
 82%|████████▏ | 768/942 [4:30:07<1:00:23, 20.83s/it]
 82%|████████▏ | 769/942 [4:30:27<59:23, 20.60s/it]  
 82%|████████▏ | 770/942 [4:30:48<59:49, 20.87s/it]
                                                   
{'loss': 0.1288, 'grad_norm': 0.9669511482644255, 'learning_rate': 4.917222761366477e-07, 'epoch': 2.45}
 82%|████████▏ | 771/942 [4:31:09<59:04, 20.73s/it]
 82%|████████▏ | 772/942 [4:31:30<59:09, 20.88s/it]
 82%|████████▏ | 773/942 [4:31:53<1:00:21, 21.43s/it]
 82%|████████▏ | 774/942 [4:32:13<59:23, 21.21s/it]  
 82%|████████▏ | 775/942 [4:32:36<1:00:03, 21.58s/it]
 82%|████████▏ | 776/942 [4:32:56<58:28, 21.14s/it]  
 82%|████████▏ | 777/942 [4:33:17<57:59, 21.09s/it]
 83%|████████▎ | 778/942 [4:33:37<57:13, 20.94s/it]
 83%|████████▎ | 779/942 [4:33:59<57:35, 21.20s/it]
 83%|████████▎ | 780/942 [4:34:22<58:54, 21.82s/it]
                                                   
{'loss': 0.1279, 'grad_norm': 1.0343951388143495, 'learning_rate': 4.378917095849358e-07, 'epoch': 2.48}
 83%|████████▎ | 781/942 [4:34:44<58:25, 21.77s/it]
 83%|████████▎ | 782/942 [4:35:04<56:55, 21.35s/it]
 83%|████████▎ | 783/942 [4:35:26<56:50, 21.45s/it]
 83%|████████▎ | 784/942 [4:35:46<55:34, 21.10s/it]
 83%|████████▎ | 785/942 [4:36:08<55:44, 21.30s/it]
 83%|████████▎ | 786/942 [4:36:35<59:41, 22.96s/it]
 84%|████████▎ | 787/942 [4:36:58<59:12, 22.92s/it]
 84%|████████▎ | 788/942 [4:37:19<57:26, 22.38s/it]
 84%|████████▍ | 789/942 [4:37:38<54:47, 21.49s/it]
 84%|████████▍ | 790/942 [4:37:59<53:45, 21.22s/it]
                                                   
{'loss': 0.124, 'grad_norm': 0.94438796334163, 'learning_rate': 3.8689772020285814e-07, 'epoch': 2.52}
 84%|████████▍ | 791/942 [4:38:19<52:23, 20.82s/it]
 84%|████████▍ | 792/942 [4:38:40<52:31, 21.01s/it]
 84%|████████▍ | 793/942 [4:39:01<51:52, 20.89s/it]
 84%|████████▍ | 794/942 [4:39:23<52:05, 21.12s/it]
 84%|████████▍ | 795/942 [4:39:45<52:29, 21.43s/it]
 85%|████████▍ | 796/942 [4:40:05<51:27, 21.15s/it]
 85%|████████▍ | 797/942 [4:40:28<52:16, 21.63s/it]
 85%|████████▍ | 798/942 [4:40:48<50:51, 21.19s/it]
 85%|████████▍ | 799/942 [4:41:11<51:33, 21.63s/it]
 85%|████████▍ | 800/942 [4:41:31<50:19, 21.27s/it]
                                                   
{'loss': 0.1269, 'grad_norm': 1.090998171982561, 'learning_rate': 3.38810453863328e-07, 'epoch': 2.55}
 85%|████████▌ | 801/942 [4:41:52<49:45, 21.18s/it]
 85%|████████▌ | 802/942 [4:42:12<48:27, 20.77s/it]
 85%|████████▌ | 803/942 [4:42:35<49:33, 21.39s/it]
 85%|████████▌ | 804/942 [4:42:55<48:14, 20.98s/it]
 85%|████████▌ | 805/942 [4:43:17<48:41, 21.33s/it]
 86%|████████▌ | 806/942 [4:43:39<49:01, 21.63s/it]
 86%|████████▌ | 807/942 [4:44:01<48:41, 21.64s/it]
 86%|████████▌ | 808/942 [4:44:22<47:46, 21.39s/it]
 86%|████████▌ | 809/942 [4:44:45<48:21, 21.81s/it]
 86%|████████▌ | 810/942 [4:45:04<46:37, 21.19s/it]
                                                   
{'loss': 0.1245, 'grad_norm': 1.0590298648868628, 'learning_rate': 2.9369605803419714e-07, 'epoch': 2.58}
 86%|████████▌ | 811/942 [4:45:25<46:08, 21.13s/it]
 86%|████████▌ | 812/942 [4:45:46<45:33, 21.02s/it]
 86%|████████▋ | 813/942 [4:46:07<45:01, 20.94s/it]
 86%|████████▋ | 814/942 [4:46:28<44:50, 21.02s/it]
 87%|████████▋ | 815/942 [4:46:49<44:31, 21.03s/it]
 87%|████████▋ | 816/942 [4:47:10<44:09, 21.03s/it]
 87%|████████▋ | 817/942 [4:47:30<43:20, 20.80s/it]
 87%|████████▋ | 818/942 [4:47:51<42:36, 20.61s/it]
 87%|████████▋ | 819/942 [4:48:11<42:08, 20.55s/it]
 87%|████████▋ | 820/942 [4:48:31<41:17, 20.31s/it]
                                                   
{'loss': 0.1241, 'grad_norm': 1.1070518470647384, 'learning_rate': 2.516165907876947e-07, 'epoch': 2.61}
 87%|████████▋ | 821/942 [4:48:52<41:34, 20.61s/it]
 87%|████████▋ | 822/942 [4:49:14<41:46, 20.89s/it]
 87%|████████▋ | 823/942 [4:49:34<41:19, 20.83s/it]
 87%|████████▋ | 824/942 [4:49:57<42:01, 21.37s/it]
 88%|████████▊ | 825/942 [4:50:17<41:01, 21.04s/it]
 88%|████████▊ | 826/942 [4:50:38<40:24, 20.90s/it]
 88%|████████▊ | 827/942 [4:50:58<39:30, 20.62s/it]
 88%|████████▊ | 828/942 [4:51:21<40:54, 21.53s/it]
 88%|████████▊ | 829/942 [4:51:43<40:30, 21.51s/it]
 88%|████████▊ | 830/942 [4:52:03<39:06, 20.95s/it]
                                                   
{'loss': 0.1224, 'grad_norm': 1.0679570675603616, 'learning_rate': 2.1262993543511717e-07, 'epoch': 2.64}
 88%|████████▊ | 831/942 [4:52:23<38:15, 20.68s/it]
 88%|████████▊ | 832/942 [4:52:43<37:35, 20.50s/it]
 88%|████████▊ | 833/942 [4:53:05<38:01, 20.94s/it]
 89%|████████▊ | 834/942 [4:53:26<37:45, 20.98s/it]
 89%|████████▊ | 835/942 [4:53:45<36:28, 20.45s/it]
 89%|████████▊ | 836/942 [4:54:07<37:03, 20.98s/it]
 89%|████████▉ | 837/942 [4:54:28<36:39, 20.94s/it]
 89%|████████▉ | 838/942 [4:54:51<37:11, 21.45s/it]
 89%|████████▉ | 839/942 [4:55:12<36:32, 21.29s/it]
 89%|████████▉ | 840/942 [4:55:33<36:04, 21.22s/it]
                                                   
{'loss': 0.1308, 'grad_norm': 1.1118615962460174, 'learning_rate': 1.7678972090420272e-07, 'epoch': 2.68}
 89%|████████▉ | 841/942 [4:55:54<35:50, 21.29s/it]
 89%|████████▉ | 842/942 [4:56:15<35:11, 21.11s/it]
 89%|████████▉ | 843/942 [4:56:35<34:34, 20.95s/it]
 90%|████████▉ | 844/942 [4:56:56<34:03, 20.86s/it]
 90%|████████▉ | 845/942 [4:57:17<33:45, 20.88s/it]
 90%|████████▉ | 846/942 [4:57:37<33:02, 20.66s/it]
 90%|████████▉ | 847/942 [4:57:59<33:27, 21.13s/it]
 90%|█████████ | 848/942 [4:58:21<33:27, 21.36s/it]
 90%|█████████ | 849/942 [4:58:43<33:30, 21.62s/it]
 90%|█████████ | 850/942 [4:59:05<33:09, 21.63s/it]
                                                   
{'loss': 0.1234, 'grad_norm': 1.028792207489206, 'learning_rate': 1.4414524796871026e-07, 'epoch': 2.71}
 90%|█████████ | 851/942 [4:59:25<32:09, 21.20s/it]
 90%|█████████ | 852/942 [4:59:46<31:36, 21.07s/it]
 91%|█████████ | 853/942 [5:00:06<30:52, 20.82s/it]
 91%|█████████ | 854/942 [5:00:28<30:58, 21.12s/it]
 91%|█████████ | 855/942 [5:00:49<30:28, 21.02s/it]
 91%|█████████ | 856/942 [5:01:10<30:12, 21.07s/it]
 91%|█████████ | 857/942 [5:01:29<29:05, 20.53s/it]
 91%|█████████ | 858/942 [5:01:50<28:51, 20.62s/it]
 91%|█████████ | 859/942 [5:02:13<29:25, 21.27s/it]
 91%|█████████▏| 860/942 [5:02:33<28:28, 20.83s/it]
                                                   
{'loss': 0.1295, 'grad_norm': 1.1115999044423435, 'learning_rate': 1.1474142143168832e-07, 'epoch': 2.74}
 91%|█████████▏| 861/942 [5:02:55<28:36, 21.19s/it]
 92%|█████████▏| 862/942 [5:03:15<27:43, 20.79s/it]
 92%|█████████▏| 863/942 [5:03:34<26:57, 20.47s/it]
 92%|█████████▏| 864/942 [5:03:57<27:25, 21.10s/it]
 92%|█████████▏| 865/942 [5:04:18<27:16, 21.25s/it]
 92%|█████████▏| 866/942 [5:04:40<27:04, 21.38s/it]
 92%|█████████▏| 867/942 [5:05:01<26:34, 21.26s/it]
 92%|█████████▏| 868/942 [5:05:21<25:34, 20.74s/it]
 92%|█████████▏| 869/942 [5:05:41<24:54, 20.47s/it]
 92%|█████████▏| 870/942 [5:06:01<24:24, 20.35s/it]
                                                   
{'loss': 0.1246, 'grad_norm': 1.023215245405073, 'learning_rate': 8.861868835570831e-08, 'epoch': 2.77}
 92%|█████████▏| 871/942 [5:06:21<24:05, 20.35s/it]
 93%|█████████▎| 872/942 [5:06:42<24:07, 20.68s/it]
 93%|█████████▎| 873/942 [5:07:03<23:47, 20.68s/it]
 93%|█████████▎| 874/942 [5:07:24<23:23, 20.63s/it]
 93%|█████████▎| 875/942 [5:07:44<22:57, 20.56s/it]
 93%|█████████▎| 876/942 [5:08:06<23:08, 21.04s/it]
 93%|█████████▎| 877/942 [5:08:26<22:28, 20.74s/it]
 93%|█████████▎| 878/942 [5:08:47<22:00, 20.63s/it]
 93%|█████████▎| 879/942 [5:09:08<21:53, 20.84s/it]
 93%|█████████▎| 880/942 [5:09:28<21:23, 20.70s/it]
{'loss': 0.1287, 'grad_norm': 1.0068858602901523, 'learning_rate': 6.58129824250478e-08, 'epoch': 2.8}
 94%|█████████▎| 881/942 [5:09:48<20:51, 20.52s/it]
 94%|█████████▎| 882/942 [5:10:10<20:45, 20.77s/it]
 94%|█████████▎| 883/942 [5:10:31<20:33, 20.90s/it]
 94%|█████████▍| 884/942 [5:10:52<20:23, 21.10s/it]
 94%|█████████▍| 885/942 [5:11:14<20:07, 21.18s/it]
 94%|█████████▍| 886/942 [5:11:35<19:36, 21.02s/it]
 94%|█████████▍| 887/942 [5:11:58<19:52, 21.69s/it]
 94%|█████████▍| 888/942 [5:12:18<19:14, 21.38s/it]
 94%|█████████▍| 889/942 [5:12:40<18:53, 21.39s/it]
 94%|█████████▍| 890/942 [5:13:00<18:15, 21.07s/it]
{'loss': 0.1223, 'grad_norm': 1.0113435874090089, 'learning_rate': 4.635567451633821e-08, 'epoch': 2.83}
 95%|█████████▍| 891/942 [5:13:21<17:57, 21.14s/it]
 95%|█████████▍| 892/942 [5:13:42<17:21, 20.83s/it]
 95%|█████████▍| 893/942 [5:14:04<17:23, 21.29s/it]
 95%|█████████▍| 894/942 [5:14:26<17:16, 21.60s/it]
 95%|█████████▌| 895/942 [5:14:46<16:25, 20.97s/it]
 95%|█████████▌| 896/942 [5:15:08<16:17, 21.24s/it]
 95%|█████████▌| 897/942 [5:15:30<16:06, 21.48s/it]
 95%|█████████▌| 898/942 [5:15:51<15:39, 21.35s/it]
 95%|█████████▌| 899/942 [5:16:10<14:48, 20.66s/it]
 96%|█████████▌| 900/942 [5:16:30<14:25, 20.61s/it]
{'loss': 0.1266, 'grad_norm': 1.0425235124884835, 'learning_rate': 3.027352954568713e-08, 'epoch': 2.87}
 96%|█████████▌| 901/942 [5:16:51<14:02, 20.54s/it]
 96%|█████████▌| 902/942 [5:17:13<14:00, 21.01s/it]
 96%|█████████▌| 903/942 [5:17:33<13:34, 20.88s/it]
 96%|█████████▌| 904/942 [5:17:57<13:48, 21.81s/it]
 96%|█████████▌| 905/942 [5:18:18<13:20, 21.63s/it]
 96%|█████████▌| 906/942 [5:18:39<12:47, 21.31s/it]
 96%|█████████▋| 907/942 [5:18:59<12:10, 20.88s/it]
 96%|█████████▋| 908/942 [5:19:20<11:52, 20.95s/it]
 96%|█████████▋| 909/942 [5:19:39<11:08, 20.25s/it]
 97%|█████████▋| 910/942 [5:20:00<10:56, 20.50s/it]
{'loss': 0.1252, 'grad_norm': 1.09593587160412, 'learning_rate': 1.758866965162337e-08, 'epoch': 2.9}
 97%|█████████▋| 911/942 [5:20:21<10:40, 20.65s/it]
 97%|█████████▋| 912/942 [5:20:44<10:38, 21.29s/it]
 97%|█████████▋| 913/942 [5:21:04<10:12, 21.12s/it]
 97%|█████████▋| 914/942 [5:21:26<09:54, 21.24s/it]
 97%|█████████▋| 915/942 [5:21:46<09:26, 20.97s/it]
 97%|█████████▋| 916/942 [5:22:07<09:05, 20.97s/it]
 97%|█████████▋| 917/942 [5:22:28<08:43, 20.96s/it]
 97%|█████████▋| 918/942 [5:22:49<08:25, 21.06s/it]
 98%|█████████▊| 919/942 [5:23:11<08:08, 21.26s/it]
 98%|█████████▊| 920/942 [5:23:31<07:38, 20.85s/it]
{'loss': 0.1244, 'grad_norm': 1.061053573205979, 'learning_rate': 8.318543764516963e-09, 'epoch': 2.93}
 98%|█████████▊| 921/942 [5:23:51<07:12, 20.59s/it]
 98%|█████████▊| 922/942 [5:24:12<06:57, 20.85s/it]
 98%|█████████▊| 923/942 [5:24:33<06:36, 20.85s/it]
 98%|█████████▊| 924/942 [5:24:54<06:14, 20.81s/it]
 98%|█████████▊| 925/942 [5:25:15<05:56, 20.97s/it]
 98%|█████████▊| 926/942 [5:25:36<05:36, 21.02s/it]
 98%|█████████▊| 927/942 [5:25:58<05:17, 21.17s/it]
 99%|█████████▊| 928/942 [5:26:20<05:00, 21.44s/it]
 99%|█████████▊| 929/942 [5:26:40<04:32, 21.00s/it]
 99%|█████████▊| 930/942 [5:27:00<04:09, 20.82s/it]
{'loss': 0.1264, 'grad_norm': 1.0497111143417108, 'learning_rate': 2.475903604330088e-09, 'epoch': 2.96}
 99%|█████████▉| 931/942 [5:27:22<03:52, 21.10s/it]
 99%|█████████▉| 932/942 [5:27:42<03:27, 20.71s/it]
 99%|█████████▉| 933/942 [5:28:03<03:08, 20.90s/it]
 99%|█████████▉| 934/942 [5:28:26<02:52, 21.52s/it]
 99%|█████████▉| 935/942 [5:28:48<02:31, 21.59s/it]
 99%|█████████▉| 936/942 [5:29:09<02:08, 21.44s/it]
 99%|█████████▉| 937/942 [5:29:32<01:49, 21.89s/it]
100%|█████████▉| 938/942 [5:29:52<01:25, 21.45s/it]
100%|█████████▉| 939/942 [5:30:14<01:04, 21.38s/it]
100%|█████████▉| 940/942 [5:30:34<00:42, 21.02s/it]
{'loss': 0.1299, 'grad_norm': 1.053987428261104, 'learning_rate': 6.878613971583736e-11, 'epoch': 2.99}
100%|█████████▉| 941/942 [5:30:54<00:20, 20.69s/it]
100%|██████████| 942/942 [5:31:15<00:00, 20.93s/it]
[INFO|trainer.py:2394] 2025-04-16 16:28:30,363 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 19875.9045, 'train_samples_per_second': 6.066, 'train_steps_per_second': 0.047, 'train_loss': 0.20179758845747403, 'epoch': 3.0}
100%|██████████| 942/942 [5:31:15<00:00, 21.10s/it]
[INFO|trainer.py:3503] 2025-04-16 16:28:40,398 >> Saving model checkpoint to /data/username/grafting/saves/llama3-8b/full/sft_math
[INFO|configuration_utils.py:472] 2025-04-16 16:28:40,400 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/sft_math/config.json
[INFO|configuration_utils.py:807] 2025-04-16 16:28:40,401 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/sft_math/generation_config.json
[INFO|modeling_utils.py:2773] 2025-04-16 16:28:57,044 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /data/username/grafting/saves/llama3-8b/full/sft_math/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2025-04-16 16:28:57,047 >> tokenizer config file saved in /data/username/grafting/saves/llama3-8b/full/sft_math/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2025-04-16 16:28:57,047 >> Special tokens file saved in /data/username/grafting/saves/llama3-8b/full/sft_math/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  total_flos               =   142467GF
  train_loss               =     0.2018
  train_runtime            = 5:31:15.90
  train_samples_per_second =      6.066
  train_steps_per_second   =      0.047
Figure saved at: /data/username/grafting/saves/llama3-8b/full/sft_math/training_loss.png
[WARNING|2025-04-16 16:28:57] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-04-16 16:28:57] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-04-16 16:28:57,745 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
