[2025-04-17 14:37:54,020] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[INFO|2025-04-17 14:37:56] llamafactory.cli:143 >> Initializing 8 distributed tasks at: 127.0.0.1:23889
W0417 14:37:57.444000 139837939225216 torch/distributed/run.py:757] 
W0417 14:37:57.444000 139837939225216 torch/distributed/run.py:757] *****************************************
W0417 14:37:57.444000 139837939225216 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0417 14:37:57.444000 139837939225216 torch/distributed/run.py:757] *****************************************
[2025-04-17 14:38:01,226] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,283] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,283] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,284] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,294] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,336] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,412] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:01,414] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 14:38:02,242] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-17 14:38:02,242] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-17 14:38:02,246] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-17 14:38:02,254] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-17 14:38:02,254] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-04-17 14:38:02,284] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 1, world size: 8, device: cuda:1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 7, world size: 8, device: cuda:7, distributed training: True, compute dtype: torch.bfloat16
[2025-04-17 14:38:02,356] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 5, world size: 8, device: cuda:5, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 0, world size: 8, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,388 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,388 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,389 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,389 >> loading file tokenizer_config.json
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 3, world size: 8, device: cuda:3, distributed training: True, compute dtype: torch.bfloat16
[2025-04-17 14:38:02,448] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-04-17 14:38:02,453] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 4, world size: 8, device: cuda:4, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-04-17 14:38:02] llamafactory.hparams.parser:380 >> Process rank: 6, world size: 8, device: cuda:6, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2533] 2025-04-17 14:38:02,686 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:731] 2025-04-17 14:38:02,687 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2025-04-17 14:38:02,687 >> Model config LlamaConfig {
  "_name_or_path": "/data/public/Llama-3.1-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,688 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,688 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,688 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2025-04-17 14:38:02,688 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2025-04-17 14:38:02,955 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-04-17 14:38:02] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[INFO|2025-04-17 14:38:02] llamafactory.data.template:143 >> Add <|eot_id|>,<|eom_id|> to stop words.
[INFO|2025-04-17 14:38:02] llamafactory.data.loader:143 >> Loading dataset /data/public/grafting/sft_split_data/Chinese_Language/Chinese_Language.json...
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 9952 examples [00:00, 16568.33 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 9952/9952 [00:00<00:00, 35324.46 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|██████████| 9952/9952 [00:03<00:00, 2961.75 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 109425, 111689, 83747, 35894, 102146, 116028, 108208, 11571, 128009, 128006, 78191, 128007, 271, 111689, 83747, 35894, 102146, 116028, 108208, 122503, 43240, 108900, 9554, 114099, 102778, 3922, 86206, 105212, 5486, 105231, 34208, 108306, 119046, 121837, 1811, 88852, 107226, 98184, 89186, 9554, 123755, 49543, 16, 13, 3146, 115890, 126369, 60455, 96455, 334, 5232, 68438, 115890, 116788, 5486, 109060, 34208, 118006, 9554, 126369, 38129, 60455, 96455, 114831, 111689, 83747, 33208, 102465, 114887, 42506, 9554, 33420, 120225, 113294, 69636, 104989, 103167, 116028, 108208, 53953, 61056, 54322, 3490, 17, 13, 3146, 84851, 82042, 80866, 121730, 126369, 334, 5232, 113299, 101402, 83175, 27327, 5486, 103125, 27327, 50667, 31540, 88356, 21990, 126369, 9554, 38129, 124560, 3922, 111689, 83747, 33764, 123892, 163, 65620, 5486, 102465, 104754, 50667, 33208, 102465, 114887, 42506, 9554, 103963, 125823, 3490, 18, 13, 3146, 106246, 118522, 109060, 334, 5232, 117681, 119680, 38129, 118522, 109060, 114722, 3922, 111689, 83747, 86127, 46729, 41053, 9554, 38129, 113294, 69636, 111689, 83747, 110585, 107380, 102146, 61056, 54322, 3490, 19, 13, 3146, 118890, 34226, 109759, 35083, 9554, 61056, 54322, 110778, 334, 5232, 105212, 74770, 68438, 80195, 25333, 30735, 23187, 34226, 45736, 9554, 61056, 54322, 110778, 3922, 105318, 49792, 108174, 34208, 41053, 121342, 111689, 83747, 116028, 108208, 53953, 61056, 54322, 3490, 20, 13, 3146, 109178, 110712, 67178, 102138, 334, 5232, 110712, 76505, 113961, 107246, 51109, 41920, 120792, 33208, 81802, 111, 64026, 69962, 54322, 120792, 102146, 106258, 103129, 35304, 23226, 106594, 35894, 102146, 115165, 3490, 21, 13, 3146, 111689, 83747, 123641, 53953, 32016, 248, 120467, 334, 5232, 108663, 124741, 34208, 109572, 101037, 225, 12554, 122, 32016, 248, 120467, 115286, 108787, 108726, 38093, 116498, 122333, 9554, 19361, 104698, 102146, 33014, 3490, 22, 13, 3146, 115890, 35417, 107463, 126966, 334, 5232, 68438, 104378, 34208, 
107322, 42783, 108726, 115890, 35417, 107463, 33764, 35894, 102146, 116028, 108208, 87219, 9554, 127654, 3922, 117681, 119680, 105414, 18655, 87412, 33563, 116405, 3490, 23, 13, 3146, 109189, 114735, 39135, 104696, 334, 5232, 108320, 101772, 29391, 34208, 109589, 17039, 104696, 116051, 39312, 28833, 110585, 5486, 56602, 27327, 112123, 108321, 50667, 105610, 111689, 83747, 110593, 116028, 108208, 3490, 24, 13, 3146, 21601, 103229, 107604, 112355, 334, 5232, 35894, 102146, 116028, 108208, 21043, 117805, 34171, 87219, 3922, 86206, 102208, 29504, 119046, 112355, 3922, 55999, 72843, 114914, 112897, 34208, 104696, 3490, 74257, 108306, 72368, 74770, 19000, 9080, 40053, 104654, 16325, 102210, 20834, 13647, 94, 112500, 126900, 30624, 111689, 83747, 30867, 41053, 5486, 56602, 95337, 11883, 39312, 5486, 124080, 117481, 80866, 121730, 108726, 50667, 1811, 68438, 43167, 33014, 121837, 109806, 74770, 105593, 102700, 23226, 106594, 35894, 102146, 115165, 3922, 113543, 110593, 113614, 1811, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

如何减少空气污染？<|eot_id|><|start_header_id|>assistant<|end_header_id|>

减少空气污染是一个多方面的挑战，需要政府、企业和个人共同努力。以下是一些有效的措施：

1. **提高能源效率**：通过提高工业、交通和家庭的能源使用效率，可以减少化石燃料的消耗，从而降低污染物排放。

2. **推广清洁能源**：增加太阳能、风能等可再生能源的使用比例，减少对煤炭、石油等化石燃料的依赖。

3. **发展公共交通**：鼓励使用公共交通工具，减少私家车的使用，从而减少汽车尾气排放。

4. **实施更严格的排放标准**：政府可以通过立法设定更高的排放标准，要求工厂和车辆减少污染物排放。

5. **植树造林**：树木能够吸收二氧化碳并释放氧气，有助于改善空气质量。

6. **减少废物焚烧**：控制农业和城市垃圾焚烧，因为这些活动会产生大量的有害气体。

7. **提高公众意识**：通过教育和宣传活动提高公众对空气污染问题的认识，鼓励采取环保行为。

8. **支持绿色技术**：投资研发和应用新技术，如电动汽车、节能建筑材料等，以减少环境污染。

9. **加强国际合作**：空气污染是全球性问题，需要各国共同合作，共享解决方案和技术。

每个人都可以在日常生活中做出贡献，比如减少开车、节约用电、参与社区清洁活动等。通过集体努力，我们可以显著改善空气质量，保护环境健康。<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 111689, 83747, 35894, 102146, 116028, 108208, 122503, 43240, 108900, 9554, 114099, 102778, 3922, 86206, 105212, 5486, 105231, 34208, 108306, 119046, 121837, 1811, 88852, 107226, 98184, 89186, 9554, 123755, 49543, 16, 13, 3146, 115890, 126369, 60455, 96455, 334, 5232, 68438, 115890, 116788, 5486, 109060, 34208, 118006, 9554, 126369, 38129, 60455, 96455, 114831, 111689, 83747, 33208, 102465, 114887, 42506, 9554, 33420, 120225, 113294, 69636, 104989, 103167, 116028, 108208, 53953, 61056, 54322, 3490, 17, 13, 3146, 84851, 82042, 80866, 121730, 126369, 334, 5232, 113299, 101402, 83175, 27327, 5486, 103125, 27327, 50667, 31540, 88356, 21990, 126369, 9554, 38129, 124560, 3922, 111689, 83747, 33764, 123892, 163, 65620, 5486, 102465, 104754, 50667, 33208, 102465, 114887, 42506, 9554, 103963, 125823, 3490, 18, 13, 3146, 106246, 118522, 109060, 334, 5232, 117681, 119680, 38129, 118522, 109060, 114722, 3922, 111689, 83747, 86127, 46729, 41053, 9554, 38129, 113294, 69636, 111689, 83747, 110585, 107380, 102146, 61056, 54322, 3490, 19, 13, 3146, 118890, 34226, 109759, 35083, 9554, 61056, 54322, 110778, 334, 5232, 105212, 74770, 68438, 80195, 25333, 30735, 23187, 34226, 45736, 9554, 61056, 54322, 110778, 3922, 105318, 49792, 108174, 34208, 41053, 121342, 111689, 83747, 116028, 108208, 53953, 61056, 54322, 3490, 20, 13, 3146, 109178, 110712, 67178, 102138, 334, 5232, 110712, 76505, 113961, 107246, 51109, 41920, 120792, 33208, 81802, 111, 64026, 69962, 54322, 120792, 102146, 106258, 103129, 35304, 23226, 106594, 35894, 102146, 115165, 3490, 21, 13, 3146, 111689, 83747, 123641, 53953, 32016, 248, 120467, 334, 5232, 108663, 124741, 34208, 109572, 101037, 225, 12554, 122, 32016, 248, 120467, 115286, 108787, 108726, 38093, 116498, 122333, 9554, 19361, 104698, 102146, 33014, 3490, 22, 13, 3146, 115890, 35417, 107463, 126966, 334, 5232, 68438, 104378, 34208, 107322, 42783, 108726, 
115890, 35417, 107463, 33764, 35894, 102146, 116028, 108208, 87219, 9554, 127654, 3922, 117681, 119680, 105414, 18655, 87412, 33563, 116405, 3490, 23, 13, 3146, 109189, 114735, 39135, 104696, 334, 5232, 108320, 101772, 29391, 34208, 109589, 17039, 104696, 116051, 39312, 28833, 110585, 5486, 56602, 27327, 112123, 108321, 50667, 105610, 111689, 83747, 110593, 116028, 108208, 3490, 24, 13, 3146, 21601, 103229, 107604, 112355, 334, 5232, 35894, 102146, 116028, 108208, 21043, 117805, 34171, 87219, 3922, 86206, 102208, 29504, 119046, 112355, 3922, 55999, 72843, 114914, 112897, 34208, 104696, 3490, 74257, 108306, 72368, 74770, 19000, 9080, 40053, 104654, 16325, 102210, 20834, 13647, 94, 112500, 126900, 30624, 111689, 83747, 30867, 41053, 5486, 56602, 95337, 11883, 39312, 5486, 124080, 117481, 80866, 121730, 108726, 50667, 1811, 68438, 43167, 33014, 121837, 109806, 74770, 105593, 102700, 23226, 106594, 35894, 102146, 115165, 3922, 113543, 110593, 113614, 1811, 128009]
labels:
减少空气污染是一个多方面的挑战，需要政府、企业和个人共同努力。以下是一些有效的措施：

1. **提高能源效率**：通过提高工业、交通和家庭的能源使用效率，可以减少化石燃料的消耗，从而降低污染物排放。

2. **推广清洁能源**：增加太阳能、风能等可再生能源的使用比例，减少对煤炭、石油等化石燃料的依赖。

3. **发展公共交通**：鼓励使用公共交通工具，减少私家车的使用，从而减少汽车尾气排放。

4. **实施更严格的排放标准**：政府可以通过立法设定更高的排放标准，要求工厂和车辆减少污染物排放。

5. **植树造林**：树木能够吸收二氧化碳并释放氧气，有助于改善空气质量。

6. **减少废物焚烧**：控制农业和城市垃圾焚烧，因为这些活动会产生大量的有害气体。

7. **提高公众意识**：通过教育和宣传活动提高公众对空气污染问题的认识，鼓励采取环保行为。

8. **支持绿色技术**：投资研发和应用新技术，如电动汽车、节能建筑材料等，以减少环境污染。

9. **加强国际合作**：空气污染是全球性问题，需要各国共同合作，共享解决方案和技术。

每个人都可以在日常生活中做出贡献，比如减少开车、节约用电、参与社区清洁活动等。通过集体努力，我们可以显著改善空气质量，保护环境健康。<|eot_id|>
[INFO|configuration_utils.py:731] 2025-04-17 14:38:09,176 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2025-04-17 14:38:09,177 >> Model config LlamaConfig {
  "_name_or_path": "/data/public/Llama-3.1-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3641] 2025-04-17 14:38:09,215 >> loading weights file /data/public/Llama-3.1-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:3786] 2025-04-17 14:38:09,216 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[WARNING|logging.py:328] 2025-04-17 14:38:09,219 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|logging.py:328] 2025-04-17 14:38:09,220 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:328] 2025-04-17 14:38:09,228 >> Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
[INFO|configuration_utils.py:1038] 2025-04-17 14:38:09,229 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "use_cache": false
}

[WARNING|logging.py:328] 2025-04-17 14:38:09,230 >> Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
[2025-04-17 14:38:10,846] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.33it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.32it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.28it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.19it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.12it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.15it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:00,  4.10it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:11<00:34, 11.66s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:11<00:14,  7.00s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:11<00:14,  7.01s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:11<00:14,  7.02s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:12<00:14,  7.02s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:37<00:15, 15.43s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:37<00:15, 15.44s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:38<00:40, 20.33s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.98s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.77s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.99s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.77s/it]
/home/username/.conda/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00, 10.00s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.78s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:58<00:20, 20.48s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:06<00:00, 15.56s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:06<00:00, 16.69s/it]
[INFO|modeling_utils.py:4473] 2025-04-17 14:39:17,618 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4481] 2025-04-17 14:39:17,618 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/public/Llama-3.1-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2025-04-17 14:39:17,620 >> loading configuration file /data/public/Llama-3.1-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2025-04-17 14:39:17,620 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|2025-04-17 14:39:17] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-04-17 14:39:17] llamafactory.model.model_utils.attention:143 >> Using FlashAttention-2 for faster training and inference.
[INFO|2025-04-17 14:39:17] llamafactory.model.adapter:143 >> ZeRO3 / FSDP detected, remaining trainable params in float32.
[INFO|2025-04-17 14:39:17] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[INFO|2025-04-17 14:39:17] llamafactory.model.loader:143 >> trainable params: 8,030,261,248 || all params: 8,030,261,248 || trainable%: 100.0000
[INFO|trainer.py:648] 2025-04-17 14:39:17,658 >> Using auto half precision backend
[2025-04-17 14:39:17,807] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2025-04-17 14:39:17,814] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-17 14:39:17,814] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-17 14:39:17,815] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-17 14:39:17,821] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2025-04-17 14:39:17,821] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2025-04-17 14:39:17,822] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-04-17 14:39:17,822] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-04-17 14:39:17,956] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-04-17 14:39:17,957] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 4.68 GB         CA 3.0 GB         Max_CA 5 GB 
[2025-04-17 14:39:17,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.85 GB, percent = 2.4%
[2025-04-17 14:39:17,958] [INFO] [stage3.py:130:__init__] Reduce bucket size 16777216
[2025-04-17 14:39:17,958] [INFO] [stage3.py:131:__init__] Prefetch bucket size 15099494
[2025-04-17 14:39:18,092] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-17 14:39:18,092] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-17 14:39:18,093] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.85 GB, percent = 2.4%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2025-04-17 14:39:18,243] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-17 14:39:18,243] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-17 14:39:18,243] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.85 GB, percent = 2.4%
[2025-04-17 14:39:18,411] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-17 14:39:18,411] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 3.0 GB         Max_CA 3 GB 
[2025-04-17 14:39:18,411] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.83 GB, percent = 2.4%
[2025-04-17 14:39:20,201] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-04-17 14:39:20,201] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 1.87 GB         Max_CA 3 GB 
[2025-04-17 14:39:20,201] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.85 GB, percent = 2.4%
[2025-04-17 14:39:20,342] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-17 14:39:20,343] [INFO] [utils.py:782:see_memory_usage] MA 1.87 GB         Max_MA 1.87 GB         CA 1.87 GB         Max_CA 2 GB 
[2025-04-17 14:39:20,343] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.83 GB, percent = 2.4%
[2025-04-17 14:39:20,487] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-17 14:39:20,487] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 7.48 GB         CA 7.48 GB         Max_CA 7 GB 
[2025-04-17 14:39:20,487] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.83 GB, percent = 2.4%
[2025-04-17 14:39:20,625] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-17 14:39:20,626] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 5.61 GB         CA 7.48 GB         Max_CA 7 GB 
[2025-04-17 14:39:20,626] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.83 GB, percent = 2.4%
[2025-04-17 14:39:20,766] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-17 14:39:20,767] [INFO] [utils.py:782:see_memory_usage] MA 5.61 GB         Max_MA 9.35 GB         CA 11.22 GB         Max_CA 11 GB 
[2025-04-17 14:39:20,767] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.83 GB, percent = 2.4%
[2025-04-17 14:39:20,767] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2025-04-17 14:39:21,775] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-17 14:39:21,776] [INFO] [utils.py:782:see_memory_usage] MA 7.51 GB         Max_MA 9.47 GB         CA 11.22 GB         Max_CA 11 GB 
[2025-04-17 14:39:21,776] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 17.86 GB, percent = 2.4%
[2025-04-17 14:39:21,776] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-04-17 14:39:21,776] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2025-04-17 14:39:21,776] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-17 14:39:21,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-17 14:39:21,777] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2025-04-17 14:39:21,777] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-04-17 14:39:21,777] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2025-04-17 14:39:21,777] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2025-04-17 14:39:21,777] [INFO] [config.py:1001:print]   amp_params ................... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f2e1c292410>
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   dump_state ................... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   fp16_enabled ................. False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   global_rank .................. 0
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 4
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2025-04-17 14:39:21,778] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   pld_params ................... False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   steps_per_print .............. inf
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   train_batch_size ............. 128
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  4
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   world_size ................... 8
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-17 14:39:21,779] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2025-04-17 14:39:21,779] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 128, 
    "train_micro_batch_size_per_gpu": 4, 
    "gradient_accumulation_steps": 4, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "overlap_comm": false, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "steps_per_print": inf
}
[INFO|trainer.py:2134] 2025-04-17 14:39:21,780 >> ***** Running training *****
[INFO|trainer.py:2135] 2025-04-17 14:39:21,780 >>   Num examples = 9,952
[INFO|trainer.py:2136] 2025-04-17 14:39:21,780 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2025-04-17 14:39:21,780 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:2140] 2025-04-17 14:39:21,780 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2141] 2025-04-17 14:39:21,780 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2025-04-17 14:39:21,780 >>   Total optimization steps = 231
[INFO|trainer.py:2143] 2025-04-17 14:39:21,781 >>   Number of trainable parameters = 8,030,261,248
  0%|          | 0/231 [00:00<?, ?it/s]
  0%|          | 1/231 [00:21<1:22:21, 21.49s/it]
  1%|          | 2/231 [00:40<1:16:27, 20.03s/it]
  1%|▏         | 3/231 [00:59<1:13:33, 19.36s/it]
  2%|▏         | 4/231 [01:17<1:11:42, 18.95s/it]
  2%|▏         | 5/231 [01:36<1:11:44, 19.04s/it]
  3%|▎         | 6/231 [01:55<1:10:53, 18.90s/it]
  3%|▎         | 7/231 [02:16<1:13:29, 19.69s/it]
  3%|▎         | 8/231 [02:35<1:12:08, 19.41s/it]
  4%|▍         | 9/231 [02:54<1:11:00, 19.19s/it]
  4%|▍         | 10/231 [03:13<1:10:31, 19.15s/it]
{'loss': 1.0997, 'grad_norm': 2.7974208385467834, 'learning_rate': 2.0833333333333334e-06, 'epoch': 0.13}
  5%|▍         | 11/231 [03:31<1:09:40, 19.00s/it]
  5%|▌         | 12/231 [03:50<1:09:07, 18.94s/it]
  6%|▌         | 13/231 [04:09<1:08:51, 18.95s/it]
  6%|▌         | 14/231 [04:28<1:08:58, 19.07s/it]
  6%|▋         | 15/231 [04:47<1:08:28, 19.02s/it]
  7%|▋         | 16/231 [05:07<1:08:41, 19.17s/it]
  7%|▋         | 17/231 [05:26<1:08:42, 19.27s/it]
  8%|▊         | 18/231 [05:45<1:07:39, 19.06s/it]
  8%|▊         | 19/231 [06:04<1:07:18, 19.05s/it]
  9%|▊         | 20/231 [06:23<1:07:01, 19.06s/it]
{'loss': 1.0533, 'grad_norm': 1.9726936610843713, 'learning_rate': 4.166666666666667e-06, 'epoch': 0.26}
  9%|▉         | 21/231 [06:42<1:06:57, 19.13s/it]
 10%|▉         | 22/231 [07:01<1:06:19, 19.04s/it]
 10%|▉         | 23/231 [07:21<1:06:50, 19.28s/it]
 10%|█         | 24/231 [07:39<1:05:36, 19.02s/it]
 11%|█         | 25/231 [07:59<1:05:45, 19.15s/it]
 11%|█▏        | 26/231 [08:18<1:05:16, 19.10s/it]
 12%|█▏        | 27/231 [08:37<1:05:17, 19.20s/it]
 12%|█▏        | 28/231 [08:56<1:04:56, 19.19s/it]
 13%|█▎        | 29/231 [09:17<1:05:58, 19.60s/it]
 13%|█▎        | 30/231 [09:36<1:05:25, 19.53s/it]
{'loss': 0.9839, 'grad_norm': 1.8587614217974457, 'learning_rate': 4.989642106328829e-06, 'epoch': 0.39}
 13%|█▎        | 31/231 [09:55<1:04:42, 19.41s/it]
 14%|█▍        | 32/231 [10:14<1:03:28, 19.14s/it]
 14%|█▍        | 33/231 [10:32<1:02:20, 18.89s/it]
 15%|█▍        | 34/231 [10:51<1:02:15, 18.96s/it]
 15%|█▌        | 35/231 [11:10<1:01:39, 18.88s/it]
 16%|█▌        | 36/231 [11:29<1:01:35, 18.95s/it]
 16%|█▌        | 37/231 [11:49<1:01:51, 19.13s/it]
 16%|█▋        | 38/231 [12:08<1:01:08, 19.01s/it]
 17%|█▋        | 39/231 [12:28<1:01:52, 19.34s/it]
 17%|█▋        | 40/231 [12:47<1:01:40, 19.37s/it]
{'loss': 0.9399, 'grad_norm': 1.983708825892852, 'learning_rate': 4.926654420291555e-06, 'epoch': 0.51}
 18%|█▊        | 41/231 [13:07<1:02:03, 19.60s/it]
 18%|█▊        | 42/231 [13:26<1:01:14, 19.44s/it]
 19%|█▊        | 43/231 [13:46<1:01:26, 19.61s/it]
 19%|█▉        | 44/231 [14:06<1:01:26, 19.71s/it]
 19%|█▉        | 45/231 [14:24<59:46, 19.28s/it]  
 20%|█▉        | 46/231 [14:42<57:54, 18.78s/it]
 20%|██        | 47/231 [15:01<57:39, 18.80s/it]
 21%|██        | 48/231 [15:21<58:02, 19.03s/it]
 21%|██        | 49/231 [15:39<57:33, 18.98s/it]
 22%|██▏       | 50/231 [15:57<56:26, 18.71s/it]
{'loss': 0.9099, 'grad_norm': 1.7787968769772682, 'learning_rate': 4.8078797071003644e-06, 'epoch': 0.64}
 22%|██▏       | 51/231 [16:17<56:29, 18.83s/it]
 23%|██▎       | 52/231 [16:35<55:42, 18.67s/it]
 23%|██▎       | 53/231 [16:53<55:03, 18.56s/it]
 23%|██▎       | 54/231 [17:12<54:40, 18.53s/it]
 24%|██▍       | 55/231 [17:32<55:33, 18.94s/it]
 24%|██▍       | 56/231 [17:51<55:28, 19.02s/it]
 25%|██▍       | 57/231 [18:10<55:19, 19.08s/it]
 25%|██▌       | 58/231 [18:29<55:20, 19.19s/it]
 26%|██▌       | 59/231 [18:49<54:56, 19.17s/it]
 26%|██▌       | 60/231 [19:07<53:54, 18.91s/it]
{'loss': 0.8962, 'grad_norm': 1.7300575951564672, 'learning_rate': 4.636048511366222e-06, 'epoch': 0.77}
 26%|██▋       | 61/231 [19:26<53:41, 18.95s/it]
 27%|██▋       | 62/231 [19:45<53:38, 19.04s/it]
 27%|██▋       | 63/231 [20:06<54:29, 19.46s/it]
 28%|██▊       | 64/231 [20:26<54:41, 19.65s/it]
 28%|██▊       | 65/231 [20:46<55:17, 19.98s/it]
 29%|██▊       | 66/231 [21:06<54:18, 19.75s/it]
 29%|██▉       | 67/231 [21:24<53:11, 19.46s/it]
 29%|██▉       | 68/231 [21:43<52:30, 19.33s/it]
 30%|██▉       | 69/231 [22:03<51:58, 19.25s/it]
 30%|███       | 70/231 [22:21<51:05, 19.04s/it]
                                                
{'loss': 0.8738, 'grad_norm': 1.6100172566231483, 'learning_rate': 4.415111107797445e-06, 'epoch': 0.9}
 31%|███       | 71/231 [22:40<50:34, 18.96s/it]
 31%|███       | 72/231 [22:59<50:15, 18.97s/it]
 32%|███▏      | 73/231 [23:17<49:33, 18.82s/it]
 32%|███▏      | 74/231 [23:36<49:25, 18.89s/it]
 32%|███▏      | 75/231 [23:55<49:02, 18.86s/it]
 33%|███▎      | 76/231 [24:15<49:33, 19.18s/it]
 33%|███▎      | 77/231 [24:34<49:03, 19.11s/it]
 34%|███▍      | 78/231 [24:53<48:43, 19.11s/it]
 34%|███▍      | 79/231 [25:12<48:03, 18.97s/it]
 35%|███▍      | 80/231 [25:30<47:17, 18.79s/it]
                                                
{'loss': 0.8341, 'grad_norm': 2.354588801291743, 'learning_rate': 4.1501466872459105e-06, 'epoch': 1.03}
 35%|███▌      | 81/231 [25:49<47:06, 18.84s/it]
 35%|███▌      | 82/231 [26:08<46:29, 18.72s/it]
 36%|███▌      | 83/231 [26:27<46:23, 18.80s/it]
 36%|███▋      | 84/231 [26:46<46:18, 18.90s/it]
 37%|███▋      | 85/231 [27:05<46:12, 18.99s/it]
 37%|███▋      | 86/231 [27:24<45:56, 19.01s/it]
 38%|███▊      | 87/231 [27:43<45:35, 19.00s/it]
[2025-04-17 15:07:26,698] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 38%|███▊      | 88/231 [28:04<47:00, 19.72s/it]
 39%|███▊      | 89/231 [28:23<45:54, 19.40s/it]
 39%|███▉      | 90/231 [28:42<45:00, 19.15s/it]
                                                
{'loss': 0.6825, 'grad_norm': 1.7430732404331881, 'learning_rate': 3.84724658978894e-06, 'epoch': 1.16}
 39%|███▉      | 91/231 [29:00<44:25, 19.04s/it]
 40%|███▉      | 92/231 [29:19<43:52, 18.94s/it]
 40%|████      | 93/231 [29:38<43:49, 19.05s/it]
 41%|████      | 94/231 [29:58<44:08, 19.33s/it]
 41%|████      | 95/231 [30:17<43:32, 19.21s/it]
 42%|████▏     | 96/231 [30:35<42:34, 18.92s/it]
 42%|████▏     | 97/231 [30:57<43:50, 19.63s/it]
 42%|████▏     | 98/231 [31:17<43:53, 19.80s/it]
 43%|████▎     | 99/231 [31:37<43:26, 19.75s/it]
 43%|████▎     | 100/231 [31:55<42:33, 19.49s/it]
                                                 
{'loss': 0.6746, 'grad_norm': 1.825996878246782, 'learning_rate': 3.513374269233737e-06, 'epoch': 1.29}
 44%|████▎     | 101/231 [32:14<41:24, 19.11s/it]
 44%|████▍     | 102/231 [32:33<41:22, 19.24s/it]
 45%|████▍     | 103/231 [32:52<40:41, 19.07s/it]
 45%|████▌     | 104/231 [33:11<40:05, 18.94s/it]
 45%|████▌     | 105/231 [33:29<39:26, 18.78s/it]
 46%|████▌     | 106/231 [33:48<39:16, 18.85s/it]
 46%|████▋     | 107/231 [34:07<39:00, 18.88s/it]
 47%|████▋     | 108/231 [34:27<39:09, 19.10s/it]
 47%|████▋     | 109/231 [34:45<38:39, 19.01s/it]
 48%|████▊     | 110/231 [35:04<38:13, 18.95s/it]
                                                 
{'loss': 0.6734, 'grad_norm': 1.8175717187830196, 'learning_rate': 3.1562052083589846e-06, 'epoch': 1.41}
 48%|████▊     | 111/231 [35:23<37:58, 18.99s/it]
 48%|████▊     | 112/231 [35:42<37:32, 18.93s/it]
 49%|████▉     | 113/231 [36:01<37:12, 18.92s/it]
 49%|████▉     | 114/231 [36:19<36:39, 18.80s/it]
 50%|████▉     | 115/231 [36:38<36:21, 18.80s/it]
 50%|█████     | 116/231 [36:57<36:05, 18.83s/it]
 51%|█████     | 117/231 [37:16<35:53, 18.89s/it]
 51%|█████     | 118/231 [37:34<35:09, 18.66s/it]
 52%|█████▏    | 119/231 [37:54<35:21, 18.94s/it]
 52%|█████▏    | 120/231 [38:13<35:07, 18.98s/it]
                                                 
{'loss': 0.6791, 'grad_norm': 1.6803049359138955, 'learning_rate': 2.7839504651261873e-06, 'epoch': 1.54}
 52%|█████▏    | 121/231 [38:32<34:44, 18.95s/it]
 53%|█████▎    | 122/231 [38:52<34:55, 19.22s/it]
 53%|█████▎    | 123/231 [39:11<34:34, 19.21s/it]
 54%|█████▎    | 124/231 [39:31<34:49, 19.52s/it]
 54%|█████▍    | 125/231 [39:50<34:23, 19.46s/it]
 55%|█████▍    | 126/231 [40:09<33:36, 19.20s/it]
 55%|█████▍    | 127/231 [40:29<33:29, 19.32s/it]
 55%|█████▌    | 128/231 [40:48<33:10, 19.32s/it]
 56%|█████▌    | 129/231 [41:07<32:36, 19.19s/it]
 56%|█████▋    | 130/231 [41:27<32:45, 19.46s/it]
                                                 
{'loss': 0.6597, 'grad_norm': 1.6102561511262756, 'learning_rate': 2.4051679064055718e-06, 'epoch': 1.67}
 57%|█████▋    | 131/231 [41:47<32:55, 19.76s/it]
 57%|█████▋    | 132/231 [42:06<32:09, 19.49s/it]
 58%|█████▊    | 133/231 [42:25<31:35, 19.34s/it]
 58%|█████▊    | 134/231 [42:43<30:42, 19.00s/it]
 58%|█████▊    | 135/231 [43:02<29:57, 18.72s/it]
 59%|█████▉    | 136/231 [43:20<29:44, 18.78s/it]
 59%|█████▉    | 137/231 [43:39<29:28, 18.82s/it]
 60%|█████▉    | 138/231 [43:58<28:58, 18.69s/it]
 60%|██████    | 139/231 [44:16<28:30, 18.60s/it]
 61%|██████    | 140/231 [44:35<28:32, 18.81s/it]
                                                 
{'loss': 0.6697, 'grad_norm': 1.6292238674479553, 'learning_rate': 2.0285654688164106e-06, 'epoch': 1.8}
 61%|██████    | 141/231 [44:56<29:11, 19.46s/it]
 61%|██████▏   | 142/231 [45:15<28:40, 19.34s/it]
 62%|██████▏   | 143/231 [45:36<28:41, 19.56s/it]
 62%|██████▏   | 144/231 [45:55<28:30, 19.66s/it]
 63%|██████▎   | 145/231 [46:14<27:49, 19.41s/it]
 63%|██████▎   | 146/231 [46:34<27:31, 19.43s/it]
 64%|██████▎   | 147/231 [46:53<27:08, 19.38s/it]
 64%|██████▍   | 148/231 [47:12<26:31, 19.18s/it]
 65%|██████▍   | 149/231 [47:31<26:15, 19.21s/it]
 65%|██████▍   | 150/231 [47:51<26:10, 19.39s/it]
                                                 
{'loss': 0.6618, 'grad_norm': 1.735892570543218, 'learning_rate': 1.6628009695725348e-06, 'epoch': 1.93}
 65%|██████▌   | 151/231 [48:10<25:42, 19.28s/it]
 66%|██████▌   | 152/231 [48:29<25:15, 19.18s/it]
 66%|██████▌   | 153/231 [48:47<24:35, 18.92s/it]
 67%|██████▋   | 154/231 [49:06<24:25, 19.04s/it]
 67%|██████▋   | 155/231 [49:27<24:34, 19.39s/it]
 68%|██████▊   | 156/231 [49:46<24:05, 19.27s/it]
 68%|██████▊   | 157/231 [50:04<23:35, 19.12s/it]
 68%|██████▊   | 158/231 [50:24<23:15, 19.12s/it]
 69%|██████▉   | 159/231 [50:43<23:09, 19.29s/it]
 69%|██████▉   | 160/231 [51:02<22:47, 19.26s/it]
                                                 
{'loss': 0.6048, 'grad_norm': 1.8292693936713338, 'learning_rate': 1.3162830695366651e-06, 'epoch': 2.06}
 70%|██████▉   | 161/231 [51:21<22:19, 19.13s/it]
 70%|███████   | 162/231 [51:40<21:58, 19.11s/it]
 71%|███████   | 163/231 [52:02<22:23, 19.76s/it]
 71%|███████   | 164/231 [52:21<21:54, 19.62s/it]
 71%|███████▏  | 165/231 [52:39<21:05, 19.18s/it]
 72%|███████▏  | 166/231 [52:58<20:51, 19.26s/it]
 72%|███████▏  | 167/231 [53:18<20:43, 19.43s/it]
 73%|███████▎  | 168/231 [53:38<20:25, 19.46s/it]
 73%|███████▎  | 169/231 [53:58<20:12, 19.56s/it]
 74%|███████▎  | 170/231 [54:17<19:50, 19.52s/it]
                                                 
{'loss': 0.5256, 'grad_norm': 1.6244283815319986, 'learning_rate': 9.969779641987618e-07, 'epoch': 2.19}
 74%|███████▍  | 171/231 [54:37<19:37, 19.63s/it]
 74%|███████▍  | 172/231 [54:57<19:19, 19.65s/it]
 75%|███████▍  | 173/231 [55:15<18:36, 19.25s/it]
 75%|███████▌  | 174/231 [55:35<18:29, 19.46s/it]
 76%|███████▌  | 175/231 [55:55<18:17, 19.60s/it]
 76%|███████▌  | 176/231 [56:14<17:50, 19.47s/it]
 77%|███████▋  | 177/231 [56:33<17:21, 19.29s/it]
 77%|███████▋  | 178/231 [56:52<16:52, 19.10s/it]
 77%|███████▋  | 179/231 [57:11<16:31, 19.06s/it]
 78%|███████▊  | 180/231 [57:30<16:26, 19.34s/it]
                                                 
{'loss': 0.5218, 'grad_norm': 1.585805521165487, 'learning_rate': 7.122262466127513e-07, 'epoch': 2.32}
 78%|███████▊  | 181/231 [57:50<16:05, 19.30s/it]
 79%|███████▉  | 182/231 [58:10<16:01, 19.63s/it]
 79%|███████▉  | 183/231 [58:29<15:35, 19.49s/it]
 80%|███████▉  | 184/231 [58:48<15:04, 19.24s/it]
 80%|████████  | 185/231 [59:06<14:33, 18.98s/it]
 81%|████████  | 186/231 [59:28<14:44, 19.66s/it]
 81%|████████  | 187/231 [59:46<14:13, 19.39s/it]
 81%|████████▏ | 188/231 [1:00:06<13:52, 19.36s/it]
 82%|████████▏ | 189/231 [1:00:24<13:17, 18.99s/it]
 82%|████████▏ | 190/231 [1:00:42<12:54, 18.88s/it]
                                                   
{'loss': 0.5189, 'grad_norm': 1.6039325482771067, 'learning_rate': 4.6857415248004247e-07, 'epoch': 2.44}
 83%|████████▎ | 191/231 [1:01:01<12:37, 18.95s/it]
 83%|████████▎ | 192/231 [1:01:21<12:31, 19.26s/it]
 84%|████████▎ | 193/231 [1:01:42<12:22, 19.53s/it]
 84%|████████▍ | 194/231 [1:02:00<11:47, 19.12s/it]
 84%|████████▍ | 195/231 [1:02:19<11:25, 19.03s/it]
 85%|████████▍ | 196/231 [1:02:37<10:59, 18.85s/it]
 85%|████████▌ | 197/231 [1:02:55<10:36, 18.71s/it]
 86%|████████▌ | 198/231 [1:03:15<10:25, 18.95s/it]
 86%|████████▌ | 199/231 [1:03:33<10:01, 18.79s/it]
 87%|████████▋ | 200/231 [1:03:52<09:42, 18.79s/it]
                                                   
{'loss': 0.5211, 'grad_norm': 1.5614374456803468, 'learning_rate': 2.716230669331155e-07, 'epoch': 2.57}
 87%|████████▋ | 201/231 [1:04:11<09:20, 18.69s/it]
 87%|████████▋ | 202/231 [1:04:29<08:57, 18.54s/it]
 88%|████████▊ | 203/231 [1:04:48<08:47, 18.83s/it]
 88%|████████▊ | 204/231 [1:05:07<08:30, 18.92s/it]
 89%|████████▊ | 205/231 [1:05:26<08:06, 18.70s/it]
 89%|████████▉ | 206/231 [1:05:44<07:47, 18.72s/it]
 90%|████████▉ | 207/231 [1:06:03<07:30, 18.79s/it]
 90%|█████████ | 208/231 [1:06:22<07:13, 18.87s/it]
 90%|█████████ | 209/231 [1:06:41<06:55, 18.90s/it]
 91%|█████████ | 210/231 [1:07:00<06:33, 18.75s/it]
                                                   
{'loss': 0.5199, 'grad_norm': 1.6168478704851539, 'learning_rate': 1.2590075274920206e-07, 'epoch': 2.7}
 91%|█████████▏| 211/231 [1:07:18<06:14, 18.75s/it]
 92%|█████████▏| 212/231 [1:07:37<05:57, 18.83s/it]
 92%|█████████▏| 213/231 [1:07:56<05:37, 18.72s/it]
 93%|█████████▎| 214/231 [1:08:16<05:23, 19.05s/it]
 93%|█████████▎| 215/231 [1:08:34<05:00, 18.80s/it]
 94%|█████████▎| 216/231 [1:08:53<04:44, 18.97s/it]
 94%|█████████▍| 217/231 [1:09:14<04:31, 19.42s/it]
 94%|█████████▍| 218/231 [1:09:33<04:11, 19.38s/it]
 95%|█████████▍| 219/231 [1:09:53<03:52, 19.41s/it]
 95%|█████████▌| 220/231 [1:10:11<03:31, 19.20s/it]
                                                   
{'loss': 0.5233, 'grad_norm': 1.541866842411352, 'learning_rate': 3.4757260364132736e-08, 'epoch': 2.83}
 96%|█████████▌| 221/231 [1:10:31<03:12, 19.21s/it]
 96%|█████████▌| 222/231 [1:10:50<02:53, 19.24s/it]
 97%|█████████▋| 223/231 [1:11:09<02:34, 19.26s/it]
 97%|█████████▋| 224/231 [1:11:28<02:14, 19.18s/it]
 97%|█████████▋| 225/231 [1:11:47<01:54, 19.11s/it]
 98%|█████████▊| 226/231 [1:12:06<01:35, 19.02s/it]
 98%|█████████▊| 227/231 [1:12:25<01:16, 19.05s/it]
 99%|█████████▊| 228/231 [1:12:45<00:57, 19.22s/it]
 99%|█████████▉| 229/231 [1:13:03<00:38, 19.04s/it]
100%|█████████▉| 230/231 [1:13:23<00:19, 19.12s/it]
                                                   
{'loss': 0.5232, 'grad_norm': 1.5296508248503218, 'learning_rate': 2.879126397345444e-10, 'epoch': 2.96}
100%|██████████| 231/231 [1:13:42<00:00, 19.09s/it]
[INFO|trainer.py:2394] 2025-04-17 15:53:04,046 >> Training completed. Do not forget to share your model on huggingface.co/models =)
                                                   
{'train_runtime': 4422.2655, 'train_samples_per_second': 6.751, 'train_steps_per_second': 0.052, 'train_loss': 0.7189328071874973, 'epoch': 2.97}
100%|██████████| 231/231 [1:13:42<00:00, 19.14s/it]
[INFO|trainer.py:3503] 2025-04-17 15:53:13,476 >> Saving model checkpoint to /data/username/grafting/saves/llama3-8b/full/sft_chinese
[INFO|configuration_utils.py:472] 2025-04-17 15:53:13,478 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/sft_chinese/config.json
[INFO|configuration_utils.py:807] 2025-04-17 15:53:13,478 >> Configuration saved in /data/username/grafting/saves/llama3-8b/full/sft_chinese/generation_config.json
[INFO|modeling_utils.py:2773] 2025-04-17 15:53:29,751 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /data/username/grafting/saves/llama3-8b/full/sft_chinese/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2025-04-17 15:53:29,754 >> tokenizer config file saved in /data/username/grafting/saves/llama3-8b/full/sft_chinese/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2025-04-17 15:53:29,754 >> Special tokens file saved in /data/username/grafting/saves/llama3-8b/full/sft_chinese/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9711
  total_flos               =    31843GF
  train_loss               =     0.7189
  train_runtime            = 1:13:42.26
  train_samples_per_second =      6.751
  train_steps_per_second   =      0.052
Figure saved at: /data/username/grafting/saves/llama3-8b/full/sft_chinese/training_loss.png
[WARNING|2025-04-17 15:53:30] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-04-17 15:53:30] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-04-17 15:53:30,496 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
