2025-09-14 09:15:04,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_12
2025-09-14 09:15:04,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_12
2025-09-14 09:15:04,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'12': <latency_env.delayed_mdp.ConstantDelay object at 0x7ff57f10bcb0>}
2025-09-14 09:15:04,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 09:15:04,381 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 09:15:04,514 baseline-bpql-noisepromille100-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=89, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 09:15:04,514 baseline-bpql-noisepromille100-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 09:15:06,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 09:15:06,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 09:18:24,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:18:32,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -377.24741 ± 60.169
2025-09-14 09:18:32,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-281.7054), np.float32(-332.61835), np.float32(-375.9575), np.float32(-298.55078), np.float32(-361.25314), np.float32(-466.22055), np.float32(-471.09198), np.float32(-420.11932), np.float32(-379.57938), np.float32(-385.37756)]
2025-09-14 09:18:32,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:18:32,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-377.25) for latency 12
2025-09-14 09:18:32,082 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 5 hours, 39 minutes, 19 seconds)
2025-09-14 09:21:52,594 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:22:00,500 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -252.06734 ± 36.151
2025-09-14 09:22:00,500 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-266.9776), np.float32(-287.57172), np.float32(-318.90665), np.float32(-257.85495), np.float32(-241.14194), np.float32(-270.56357), np.float32(-230.54622), np.float32(-250.20116), np.float32(-184.16003), np.float32(-212.74939)]
2025-09-14 09:22:00,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:22:00,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-252.07) for latency 12
2025-09-14 09:22:00,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 5 hours, 38 minutes, 9 seconds)
2025-09-14 09:25:20,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:25:28,370 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 16.04265 ± 150.249
2025-09-14 09:25:28,370 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(80.425476), np.float32(5.074252), np.float32(-139.79709), np.float32(169.73703), np.float32(270.931), np.float32(119.06394), np.float32(-172.56807), np.float32(-219.44252), np.float32(-35.037907), np.float32(82.040405)]
2025-09-14 09:25:28,370 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:25:28,370 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (16.04) for latency 12
2025-09-14 09:25:28,372 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 5 hours, 35 minutes, 9 seconds)
2025-09-14 09:28:48,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:28:56,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 170.78195 ± 307.907
2025-09-14 09:28:56,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(132.44743), np.float32(-87.732376), np.float32(-88.44047), np.float32(-7.734771), np.float32(31.411098), np.float32(162.4941), np.float32(101.222626), np.float32(867.79004), np.float32(-46.847267), np.float32(643.2093)]
2025-09-14 09:28:56,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:28:56,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (170.78) for latency 12
2025-09-14 09:28:56,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 5 hours, 31 minutes, 58 seconds)
2025-09-14 09:32:13,189 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:32:20,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 124.96802 ± 226.759
2025-09-14 09:32:20,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(300.8999), np.float32(-1.8463991), np.float32(-25.846638), np.float32(-95.21021), np.float32(64.33329), np.float32(112.767166), np.float32(71.25673), np.float32(55.26143), np.float32(30.379927), np.float32(737.685)]
2025-09-14 09:32:20,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:32:20,276 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 5 hours, 27 minutes, 22 seconds)
2025-09-14 09:35:38,045 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:35:45,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 397.56427 ± 280.546
2025-09-14 09:35:45,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(76.99472), np.float32(448.3144), np.float32(344.3884), np.float32(311.44592), np.float32(316.4286), np.float32(266.40118), np.float32(289.66174), np.float32(1130.8944), np.float32(617.2172), np.float32(173.89629)]
2025-09-14 09:35:45,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:35:45,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (397.56) for latency 12
2025-09-14 09:35:45,679 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 5 hours, 23 minutes, 51 seconds)
2025-09-14 09:39:05,893 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:39:13,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 355.60645 ± 241.834
2025-09-14 09:39:13,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(768.493), np.float32(786.129), np.float32(444.01837), np.float32(172.49643), np.float32(169.08183), np.float32(281.16632), np.float32(262.39523), np.float32(92.09102), np.float32(464.19458), np.float32(115.998726)]
2025-09-14 09:39:13,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:39:13,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 5 hours, 20 minutes, 14 seconds)
2025-09-14 09:42:35,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:42:43,062 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 493.60767 ± 446.304
2025-09-14 09:42:43,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(342.84805), np.float32(733.00354), np.float32(344.25864), np.float32(399.0317), np.float32(701.84924), np.float32(750.68677), np.float32(895.5913), np.float32(299.57184), np.float32(1082.4786), np.float32(-613.24274)]
2025-09-14 09:42:43,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:42:43,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (493.61) for latency 12
2025-09-14 09:42:43,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 5 hours, 17 minutes, 18 seconds)
2025-09-14 09:46:03,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:46:11,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 950.16687 ± 131.208
2025-09-14 09:46:11,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(941.84094), np.float32(909.13696), np.float32(813.3512), np.float32(1016.7569), np.float32(1152.7227), np.float32(1203.595), np.float32(765.9175), np.float32(907.9184), np.float32(905.7428), np.float32(884.6864)]
2025-09-14 09:46:11,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:46:11,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (950.17) for latency 12
2025-09-14 09:46:11,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 5 hours, 13 minutes, 55 seconds)
2025-09-14 09:49:31,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:49:39,121 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 916.49463 ± 99.913
2025-09-14 09:49:39,122 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(956.6356), np.float32(901.13696), np.float32(909.9018), np.float32(888.4065), np.float32(922.70496), np.float32(1112.6115), np.float32(797.38116), np.float32(736.3613), np.float32(913.75446), np.float32(1026.0522)]
2025-09-14 09:49:39,122 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:49:39,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 5 hours, 11 minutes, 39 seconds)
2025-09-14 09:53:00,420 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:53:08,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1029.12549 ± 113.555
2025-09-14 09:53:08,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(909.7279), np.float32(1056.4431), np.float32(1081.5858), np.float32(929.8177), np.float32(1324.4476), np.float32(1011.4682), np.float32(995.22015), np.float32(935.61896), np.float32(1067.3231), np.float32(979.6014)]
2025-09-14 09:53:08,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:53:08,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1029.13) for latency 12
2025-09-14 09:53:08,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 5 hours, 9 minutes, 16 seconds)
2025-09-14 09:56:28,825 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:56:36,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 881.14709 ± 421.145
2025-09-14 09:56:36,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-328.08658), np.float32(992.95746), np.float32(909.4838), np.float32(918.9325), np.float32(1033.6846), np.float32(1244.4983), np.float32(848.0819), np.float32(931.2475), np.float32(1212.7084), np.float32(1047.9625)]
2025-09-14 09:56:36,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:56:36,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 5 hours, 5 minutes, 55 seconds)
2025-09-14 09:59:57,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:00:04,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1139.27002 ± 214.324
2025-09-14 10:00:04,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1153.8286), np.float32(1078.3381), np.float32(1197.0372), np.float32(924.8922), np.float32(1102.9257), np.float32(1313.5814), np.float32(972.5053), np.float32(1012.1651), np.float32(1683.4441), np.float32(953.98267)]
2025-09-14 10:00:04,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:00:04,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1139.27) for latency 12
2025-09-14 10:00:04,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 5 hours, 2 minutes, 1 second)
2025-09-14 10:03:25,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:03:32,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1264.33960 ± 195.389
2025-09-14 10:03:32,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1306.7322), np.float32(1159.2401), np.float32(1140.1683), np.float32(1168.0059), np.float32(1241.4075), np.float32(1824.3279), np.float32(1217.8909), np.float32(1171.8695), np.float32(1126.8683), np.float32(1286.8859)]
2025-09-14 10:03:32,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:03:32,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1264.34) for latency 12
2025-09-14 10:03:32,339 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 58 minutes, 26 seconds)
2025-09-14 10:06:55,137 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:07:03,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1243.27722 ± 232.068
2025-09-14 10:07:03,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1413.8036), np.float32(1462.5454), np.float32(1108.2936), np.float32(1209.9404), np.float32(1312.6287), np.float32(1166.7965), np.float32(880.17194), np.float32(1716.1868), np.float32(1189.3324), np.float32(973.07294)]
2025-09-14 10:07:03,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:07:03,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 55 minutes, 46 seconds)
2025-09-14 10:10:22,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:10:29,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1282.72754 ± 211.927
2025-09-14 10:10:29,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1220.0394), np.float32(1113.8041), np.float32(1478.2177), np.float32(1715.4547), np.float32(1355.1853), np.float32(1384.3574), np.float32(1366.457), np.float32(995.9389), np.float32(1002.0047), np.float32(1195.8167)]
2025-09-14 10:10:29,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:10:29,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1282.73) for latency 12
2025-09-14 10:10:29,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 51 minutes, 31 seconds)
2025-09-14 10:13:48,969 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:13:56,529 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1256.15808 ± 296.003
2025-09-14 10:13:56,529 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1136.663), np.float32(1346.918), np.float32(1028.9773), np.float32(1018.8062), np.float32(1811.5874), np.float32(1121.8354), np.float32(906.527), np.float32(1087.698), np.float32(1768.4803), np.float32(1334.0885)]
2025-09-14 10:13:56,529 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:13:56,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 47 minutes, 45 seconds)
2025-09-14 10:17:16,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:17:24,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1390.99561 ± 363.445
2025-09-14 10:17:24,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1559.2406), np.float32(973.8333), np.float32(1187.4653), np.float32(1718.6693), np.float32(1068.5184), np.float32(1146.5813), np.float32(1316.3208), np.float32(2246.3984), np.float32(1166.4445), np.float32(1526.4844)]
2025-09-14 10:17:24,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:24,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1391.00) for latency 12
2025-09-14 10:17:24,029 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 44 minutes, 7 seconds)
2025-09-14 10:20:46,307 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:20:54,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1311.31274 ± 198.704
2025-09-14 10:20:54,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1193.9015), np.float32(1628.0587), np.float32(1242.1018), np.float32(1155.9332), np.float32(1088.5089), np.float32(1103.1903), np.float32(1266.1215), np.float32(1685.4912), np.float32(1308.9968), np.float32(1440.823)]
2025-09-14 10:20:54,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:20:54,268 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 41 minutes, 19 seconds)
2025-09-14 10:24:16,095 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:24:23,858 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1563.44116 ± 361.480
2025-09-14 10:24:23,859 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1197.3602), np.float32(1411.6024), np.float32(2119.3394), np.float32(1152.7632), np.float32(1368.0464), np.float32(1451.822), np.float32(2149.424), np.float32(1166.9937), np.float32(1791.0844), np.float32(1825.9768)]
2025-09-14 10:24:23,859 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:24:23,859 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1563.44) for latency 12
2025-09-14 10:24:23,862 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 37 minutes, 33 seconds)
2025-09-14 10:27:46,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:27:54,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1246.56812 ± 434.881
2025-09-14 10:27:54,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1027.9857), np.float32(1522.9598), np.float32(1157.2772), np.float32(263.2089), np.float32(1029.842), np.float32(1463.3885), np.float32(1336.9017), np.float32(2059.1072), np.float32(1410.8035), np.float32(1194.2058)]
2025-09-14 10:27:54,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:27:54,283 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 35 minutes, 9 seconds)
2025-09-14 10:31:16,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:31:24,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1233.17578 ± 129.452
2025-09-14 10:31:24,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1413.0703), np.float32(1200.8485), np.float32(1153.9125), np.float32(1420.3646), np.float32(1119.3165), np.float32(1262.7546), np.float32(1419.9086), np.float32(1113.6812), np.float32(1140.7235), np.float32(1087.1771)]
2025-09-14 10:31:24,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:31:24,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 32 minutes, 25 seconds)
2025-09-14 10:34:45,111 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:34:52,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1586.99524 ± 379.876
2025-09-14 10:34:52,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1134.0741), np.float32(1826.3378), np.float32(1796.652), np.float32(1415.3264), np.float32(1315.2963), np.float32(1435.9352), np.float32(2524.3484), np.float32(1538.1985), np.float32(1232.7906), np.float32(1650.9918)]
2025-09-14 10:34:52,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:34:52,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1587.00) for latency 12
2025-09-14 10:34:52,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 29 minutes, 12 seconds)
2025-09-14 10:38:15,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:38:22,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1274.58972 ± 199.746
2025-09-14 10:38:22,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1125.9108), np.float32(1640.8834), np.float32(1521.4979), np.float32(1061.5474), np.float32(1350.6721), np.float32(1093.3312), np.float32(1185.2711), np.float32(1201.7045), np.float32(1487.8408), np.float32(1077.238)]
2025-09-14 10:38:22,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:38:22,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 25 minutes, 39 seconds)
2025-09-14 10:41:44,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:41:51,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1959.33728 ± 486.730
2025-09-14 10:41:51,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1160.4397), np.float32(2321.5535), np.float32(1422.0618), np.float32(2256.7583), np.float32(2364.0923), np.float32(1283.9321), np.float32(1742.3671), np.float32(2148.6533), np.float32(2595.7996), np.float32(2297.7148)]
2025-09-14 10:41:51,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:41:51,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1959.34) for latency 12
2025-09-14 10:41:51,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 21 minutes, 59 seconds)
2025-09-14 10:45:13,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:45:20,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1415.01562 ± 268.896
2025-09-14 10:45:20,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1347.2478), np.float32(1283.0038), np.float32(1336.8796), np.float32(1217.9532), np.float32(1451.2073), np.float32(1113.5762), np.float32(1899.805), np.float32(1154.9949), np.float32(1419.3286), np.float32(1926.1597)]
2025-09-14 10:45:20,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:45:20,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 18 minutes, 6 seconds)
2025-09-14 10:48:41,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:48:48,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1681.05762 ± 568.900
2025-09-14 10:48:48,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1374.3484), np.float32(1468.8549), np.float32(1468.9017), np.float32(1544.4055), np.float32(2996.1917), np.float32(1216.0586), np.float32(1186.9825), np.float32(2147.4595), np.float32(1144.5713), np.float32(2262.8032)]
2025-09-14 10:48:48,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:48:48,653 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 14 minutes, 7 seconds)
2025-09-14 10:52:07,428 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:52:15,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1487.06372 ± 345.526
2025-09-14 10:52:15,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1179.2281), np.float32(2137.3523), np.float32(1476.8046), np.float32(1295.105), np.float32(1636.8463), np.float32(2052.5984), np.float32(1225.0948), np.float32(1528.5262), np.float32(1057.6552), np.float32(1281.426)]
2025-09-14 10:52:15,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:52:15,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 10 minutes, 6 seconds)
2025-09-14 10:55:34,618 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:55:43,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1546.79395 ± 311.099
2025-09-14 10:55:43,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1901.2313), np.float32(1676.0868), np.float32(1581.8827), np.float32(1784.9508), np.float32(1859.5963), np.float32(1210.0696), np.float32(1478.1206), np.float32(983.64825), np.float32(1834.7987), np.float32(1157.5542)]
2025-09-14 10:55:43,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:55:43,137 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 6 minutes, 10 seconds)
2025-09-14 10:59:05,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:59:13,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1890.96130 ± 476.309
2025-09-14 10:59:13,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2656.4373), np.float32(1303.0886), np.float32(2518.8994), np.float32(1511.0375), np.float32(1836.0206), np.float32(1527.0879), np.float32(1845.1934), np.float32(1769.5338), np.float32(1408.4038), np.float32(2533.9094)]
2025-09-14 10:59:13,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:59:13,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 3 minutes, 2 seconds)
2025-09-14 11:02:36,238 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:02:44,276 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1969.81763 ± 523.903
2025-09-14 11:02:44,276 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2493.8474), np.float32(999.33685), np.float32(2481.2034), np.float32(1798.1118), np.float32(2351.9167), np.float32(1643.1086), np.float32(2294.846), np.float32(2471.817), np.float32(1178.198), np.float32(1985.7917)]
2025-09-14 11:02:44,276 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:02:44,277 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1969.82) for latency 12
2025-09-14 11:02:44,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 4 hours, 1 second)
2025-09-14 11:06:05,822 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:06:13,737 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1633.89380 ± 411.641
2025-09-14 11:06:13,737 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1428.0652), np.float32(1396.4944), np.float32(1893.1405), np.float32(1975.0345), np.float32(2557.6865), np.float32(1422.3802), np.float32(1726.9481), np.float32(1051.2234), np.float32(1660.4806), np.float32(1227.4851)]
2025-09-14 11:06:13,737 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:06:13,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 56 minutes, 53 seconds)
2025-09-14 11:09:35,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:09:43,616 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1514.55347 ± 482.062
2025-09-14 11:09:43,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1550.726), np.float32(1802.7064), np.float32(1554.8934), np.float32(660.56226), np.float32(1211.338), np.float32(1357.4799), np.float32(2328.906), np.float32(2253.3735), np.float32(1232.6526), np.float32(1192.8971)]
2025-09-14 11:09:43,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:09:43,620 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 54 minutes, 11 seconds)
2025-09-14 11:13:05,246 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:13:12,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1816.24121 ± 534.009
2025-09-14 11:13:12,901 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2542.534), np.float32(2141.3733), np.float32(1274.7673), np.float32(1540.87), np.float32(1284.7498), np.float32(1534.8258), np.float32(2850.1953), np.float32(2038.1133), np.float32(1212.957), np.float32(1742.0255)]
2025-09-14 11:13:12,901 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:13:12,905 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 50 minutes, 56 seconds)
2025-09-14 11:16:34,655 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:16:42,690 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1819.96619 ± 557.529
2025-09-14 11:16:42,690 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1170.6123), np.float32(2269.5474), np.float32(1252.2579), np.float32(2996.3337), np.float32(2085.908), np.float32(1932.4215), np.float32(1090.8315), np.float32(2101.7612), np.float32(1535.3809), np.float32(1764.608)]
2025-09-14 11:16:42,690 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:16:42,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 47 minutes, 20 seconds)
2025-09-14 11:20:03,644 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:20:11,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1944.13904 ± 590.712
2025-09-14 11:20:11,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1615.7603), np.float32(1345.4656), np.float32(1467.2959), np.float32(1521.1937), np.float32(1368.415), np.float32(2235.2874), np.float32(2444.4004), np.float32(1909.6635), np.float32(3312.4731), np.float32(2221.435)]
2025-09-14 11:20:11,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:20:11,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 43 minutes, 23 seconds)
2025-09-14 11:23:32,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:23:41,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2129.57983 ± 566.059
2025-09-14 11:23:41,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1156.3348), np.float32(2084.818), np.float32(2747.956), np.float32(3029.6584), np.float32(2599.4744), np.float32(1264.9302), np.float32(2126.2292), np.float32(2319.1318), np.float32(2071.7432), np.float32(1895.5212)]
2025-09-14 11:23:41,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:23:41,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2129.58) for latency 12
2025-09-14 11:23:41,011 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 39 minutes, 55 seconds)
2025-09-14 11:27:00,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:27:07,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2214.37622 ± 498.460
2025-09-14 11:27:07,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1574.2213), np.float32(2107.3218), np.float32(2723.8164), np.float32(1699.3453), np.float32(1758.4528), np.float32(3029.004), np.float32(2807.6694), np.float32(2328.718), np.float32(1702.0328), np.float32(2413.1794)]
2025-09-14 11:27:07,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:27:07,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2214.38) for latency 12
2025-09-14 11:27:07,327 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 35 minutes, 41 seconds)
2025-09-14 11:30:26,185 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:30:33,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2082.01538 ± 712.089
2025-09-14 11:30:33,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1170.2257), np.float32(2586.419), np.float32(3171.964), np.float32(1578.807), np.float32(1499.9448), np.float32(2719.8325), np.float32(3019.5056), np.float32(1191.6781), np.float32(2149.0615), np.float32(1732.7167)]
2025-09-14 11:30:33,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:30:33,776 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 31 minutes, 38 seconds)
2025-09-14 11:33:55,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:34:02,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1822.13208 ± 497.236
2025-09-14 11:34:02,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2493.7139), np.float32(2440.9934), np.float32(1428.8821), np.float32(2322.937), np.float32(1300.5576), np.float32(1338.5654), np.float32(1587.6528), np.float32(2330.5806), np.float32(1179.4685), np.float32(1797.9697)]
2025-09-14 11:34:02,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:34:02,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 28 minutes, 2 seconds)
2025-09-14 11:37:22,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:37:29,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2062.67139 ± 632.618
2025-09-14 11:37:29,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2050.8022), np.float32(2701.4187), np.float32(2000.4756), np.float32(3023.3284), np.float32(1032.9828), np.float32(1883.5302), np.float32(3033.9329), np.float32(1599.9695), np.float32(1453.8873), np.float32(1846.3875)]
2025-09-14 11:37:29,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:37:29,780 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 24 minutes, 12 seconds)
2025-09-14 11:40:40,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:40:47,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1814.42834 ± 585.706
2025-09-14 11:40:47,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2854.8816), np.float32(1414.8855), np.float32(1440.0929), np.float32(1226.1892), np.float32(1929.1512), np.float32(1667.2815), np.float32(2009.0276), np.float32(2905.7026), np.float32(1396.0018), np.float32(1301.0676)]
2025-09-14 11:40:47,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:40:47,709 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 18 minutes, 29 seconds)
2025-09-14 11:43:59,552 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:44:06,821 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1894.86523 ± 426.833
2025-09-14 11:44:06,822 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1869.6387), np.float32(1483.3181), np.float32(1531.8289), np.float32(2083.4553), np.float32(1695.3538), np.float32(1610.9882), np.float32(2846.7983), np.float32(2470.6055), np.float32(1550.5104), np.float32(1806.1545)]
2025-09-14 11:44:06,822 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:44:06,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 13 minutes, 42 seconds)
2025-09-14 11:47:16,851 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:47:24,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1785.06189 ± 782.329
2025-09-14 11:47:24,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2026.565), np.float32(1372.7145), np.float32(2056.7317), np.float32(1301.2056), np.float32(1774.322), np.float32(463.65396), np.float32(1233.0466), np.float32(2947.0476), np.float32(1451.0428), np.float32(3224.2888)]
2025-09-14 11:47:24,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:47:24,231 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 8 minutes, 37 seconds)
2025-09-14 11:50:34,603 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:50:41,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2405.06763 ± 625.106
2025-09-14 11:50:41,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3197.9485), np.float32(1266.4384), np.float32(3004.6118), np.float32(2391.248), np.float32(2264.1667), np.float32(1958.6228), np.float32(2912.1606), np.float32(3020.8643), np.float32(1534.3125), np.float32(2500.303)]
2025-09-14 11:50:41,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:50:41,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2405.07) for latency 12
2025-09-14 11:50:41,918 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 3 hours, 3 minutes, 8 seconds)
2025-09-14 11:53:52,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:53:59,303 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1849.53003 ± 426.550
2025-09-14 11:53:59,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1716.0973), np.float32(1476.905), np.float32(1934.4496), np.float32(1555.3967), np.float32(1924.6201), np.float32(2056.5852), np.float32(1180.1677), np.float32(2462.9395), np.float32(2631.0256), np.float32(1557.1138)]
2025-09-14 11:53:59,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:53:59,308 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 58 minutes, 6 seconds)
2025-09-14 11:57:09,352 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:57:16,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2176.40527 ± 517.659
2025-09-14 11:57:16,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2442.1365), np.float32(2413.9377), np.float32(2219.5596), np.float32(1318.3134), np.float32(1367.9159), np.float32(2376.0825), np.float32(1687.9377), np.float32(2982.7324), np.float32(2636.8142), np.float32(2318.6228)]
2025-09-14 11:57:16,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:57:16,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 54 minutes, 42 seconds)
2025-09-14 12:00:26,531 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:00:33,789 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2141.01733 ± 601.494
2025-09-14 12:00:33,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3194.941), np.float32(1788.9722), np.float32(2190.9355), np.float32(2746.3018), np.float32(2931.5566), np.float32(1451.0474), np.float32(2020.8872), np.float32(1442.4155), np.float32(2143.1204), np.float32(1499.9967)]
2025-09-14 12:00:33,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:00:33,794 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 51 minutes, 4 seconds)
2025-09-14 12:03:43,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:03:51,307 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2044.52026 ± 534.606
2025-09-14 12:03:51,308 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1677.0515), np.float32(2197.0693), np.float32(2614.312), np.float32(1324.0464), np.float32(2587.6794), np.float32(1162.6384), np.float32(1644.8969), np.float32(2633.2932), np.float32(2033.6025), np.float32(2570.6125)]
2025-09-14 12:03:51,308 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:03:51,312 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 47 minutes, 48 seconds)
2025-09-14 12:07:00,033 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:07:07,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1992.61157 ± 659.409
2025-09-14 12:07:07,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1617.2314), np.float32(1488.5428), np.float32(2317.806), np.float32(1683.9419), np.float32(2646.847), np.float32(1381.087), np.float32(3011.2415), np.float32(1286.8616), np.float32(3048.0906), np.float32(1444.4646)]
2025-09-14 12:07:07,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:07:07,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 44 minutes, 11 seconds)
2025-09-14 12:10:14,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:10:21,444 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1709.99158 ± 334.876
2025-09-14 12:10:21,445 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1674.0084), np.float32(1799.7776), np.float32(1716.7472), np.float32(1381.5411), np.float32(2618.5898), np.float32(1761.8983), np.float32(1346.0865), np.float32(1567.855), np.float32(1562.5957), np.float32(1670.8163)]
2025-09-14 12:10:21,445 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:10:21,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 40 minutes, 24 seconds)
2025-09-14 12:13:30,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:13:37,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1925.22925 ± 424.297
2025-09-14 12:13:37,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2231.2922), np.float32(1780.9337), np.float32(2095.0554), np.float32(1362.4117), np.float32(2266.7722), np.float32(1536.7878), np.float32(1650.2067), np.float32(1337.1768), np.float32(2499.777), np.float32(2491.8796)]
2025-09-14 12:13:37,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:13:37,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 36 minutes, 59 seconds)
2025-09-14 12:16:48,472 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:16:55,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2036.54333 ± 734.246
2025-09-14 12:16:55,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1759.5724), np.float32(1482.7977), np.float32(1711.9336), np.float32(2458.0764), np.float32(2830.2769), np.float32(1358.2751), np.float32(3106.1948), np.float32(1285.3312), np.float32(3156.3496), np.float32(1216.6251)]
2025-09-14 12:16:55,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:16:55,887 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 33 minutes, 51 seconds)
2025-09-14 12:20:06,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:20:13,560 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2011.19922 ± 641.880
2025-09-14 12:20:13,560 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1549.038), np.float32(2224.9292), np.float32(2756.3489), np.float32(2037.2991), np.float32(1270.975), np.float32(1253.273), np.float32(2128.3484), np.float32(1250.9921), np.float32(2445.3794), np.float32(3195.4111)]
2025-09-14 12:20:13,561 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:20:13,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 30 minutes, 36 seconds)
2025-09-14 12:23:22,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:23:30,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1959.55078 ± 616.012
2025-09-14 12:23:30,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1814.7383), np.float32(2740.6523), np.float32(2400.0325), np.float32(1439.7489), np.float32(2900.3574), np.float32(1472.6522), np.float32(2683.08), np.float32(1339.7704), np.float32(1557.8014), np.float32(1246.6748)]
2025-09-14 12:23:30,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:23:30,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 27 minutes, 27 seconds)
2025-09-14 12:26:40,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:26:47,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2976.05688 ± 856.423
2025-09-14 12:26:47,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1419.1324), np.float32(2448.0166), np.float32(3660.5298), np.float32(3603.443), np.float32(3676.1816), np.float32(3572.067), np.float32(2135.5522), np.float32(3606.9705), np.float32(1878.4279), np.float32(3760.2476)]
2025-09-14 12:26:47,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:26:47,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2976.06) for latency 12
2025-09-14 12:26:47,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 24 minutes, 37 seconds)
2025-09-14 12:29:58,112 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:30:05,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2093.92944 ± 685.885
2025-09-14 12:30:05,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1518.1222), np.float32(2183.7844), np.float32(1973.9014), np.float32(1211.2201), np.float32(1854.1259), np.float32(1521.6051), np.float32(2758.4082), np.float32(3665.0535), np.float32(2496.6829), np.float32(1756.3878)]
2025-09-14 12:30:05,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:30:05,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 21 minutes, 31 seconds)
2025-09-14 12:33:15,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:33:22,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2220.33398 ± 755.776
2025-09-14 12:33:22,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2748.015), np.float32(1678.8862), np.float32(3443.1619), np.float32(2985.7053), np.float32(2112.651), np.float32(1129.3523), np.float32(1648.2335), np.float32(1880.8096), np.float32(3126.6003), np.float32(1449.9254)]
2025-09-14 12:33:22,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:22,839 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 18 minutes, 10 seconds)
2025-09-14 12:36:32,965 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:36:40,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2313.75781 ± 631.530
2025-09-14 12:36:40,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1427.9098), np.float32(2611.0154), np.float32(1903.7913), np.float32(2650.2263), np.float32(2238.1426), np.float32(3513.0112), np.float32(2140.4731), np.float32(2685.6763), np.float32(1275.2411), np.float32(2692.092)]
2025-09-14 12:36:40,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 14 minutes, 51 seconds)
2025-09-14 12:39:50,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:39:57,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2580.76221 ± 520.400
2025-09-14 12:39:57,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2230.071), np.float32(1680.1045), np.float32(2847.2883), np.float32(3316.8945), np.float32(2041.7783), np.float32(2140.1875), np.float32(2957.331), np.float32(2625.099), np.float32(3305.983), np.float32(2662.8867)]
2025-09-14 12:39:57,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:57,350 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 11 minutes, 37 seconds)
2025-09-14 12:43:08,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:43:15,205 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2052.44092 ± 665.546
2025-09-14 12:43:15,205 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1280.4978), np.float32(3021.7793), np.float32(2016.8829), np.float32(1536.6051), np.float32(1745.1112), np.float32(2848.0813), np.float32(2457.0266), np.float32(1247.8479), np.float32(2910.8208), np.float32(1459.7566)]
2025-09-14 12:43:15,205 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:43:15,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 8 minutes, 24 seconds)
2025-09-14 12:46:23,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:46:30,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1863.41248 ± 681.070
2025-09-14 12:46:30,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1807.8628), np.float32(1910.5583), np.float32(2467.8665), np.float32(1448.9363), np.float32(3628.8198), np.float32(1283.2665), np.float32(1667.2799), np.float32(1770.4802), np.float32(1238.0333), np.float32(1411.0217)]
2025-09-14 12:46:30,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:46:30,363 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 2 hours, 4 minutes, 46 seconds)
2025-09-14 12:49:37,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:49:44,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2728.97705 ± 885.410
2025-09-14 12:49:44,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3377.0112), np.float32(1852.9622), np.float32(2077.7717), np.float32(3489.398), np.float32(1225.9874), np.float32(2128.4187), np.float32(3291.8706), np.float32(3718.2227), np.float32(2195.1028), np.float32(3933.0217)]
2025-09-14 12:49:44,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:49:44,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 2 hours, 1 minute, 3 seconds)
2025-09-14 12:52:52,855 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:53:00,058 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2825.65283 ± 1108.422
2025-09-14 12:53:00,058 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3605.9128), np.float32(2280.988), np.float32(1087.116), np.float32(3942.8494), np.float32(1277.4553), np.float32(3818.0994), np.float32(3844.5835), np.float32(1596.4918), np.float32(2893.711), np.float32(3909.322)]
2025-09-14 12:53:00,058 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:53:00,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 57 minutes, 33 seconds)
2025-09-14 12:56:12,552 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:56:19,794 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2797.86328 ± 1229.428
2025-09-14 12:56:19,795 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4237.734), np.float32(3482.421), np.float32(2105.4158), np.float32(4246.24), np.float32(1596.0139), np.float32(4102.0776), np.float32(1395.7972), np.float32(3895.9858), np.float32(1671.5266), np.float32(1245.419)]
2025-09-14 12:56:19,795 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:56:19,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 54 minutes, 37 seconds)
2025-09-14 12:59:31,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:59:38,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3322.67236 ± 803.417
2025-09-14 12:59:38,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3944.954), np.float32(3770.9875), np.float32(4002.0813), np.float32(3850.0217), np.float32(2327.0073), np.float32(3904.8503), np.float32(1992.0593), np.float32(3867.9102), np.float32(2026.0621), np.float32(3540.7888)]
2025-09-14 12:59:38,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:59:38,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3322.67) for latency 12
2025-09-14 12:59:38,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 51 minutes, 23 seconds)
2025-09-14 13:02:48,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:02:55,923 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3269.93359 ± 606.567
2025-09-14 13:02:55,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3755.317), np.float32(3350.7551), np.float32(3864.0642), np.float32(3481.4917), np.float32(1943.8645), np.float32(2552.2146), np.float32(2841.847), np.float32(3939.0676), np.float32(3340.5), np.float32(3630.2144)]
2025-09-14 13:02:55,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:55,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 48 minutes, 24 seconds)
2025-09-14 13:06:02,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:06:09,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2959.72974 ± 1054.851
2025-09-14 13:06:09,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3472.108), np.float32(1137.274), np.float32(2031.6388), np.float32(3889.575), np.float32(4076.2632), np.float32(1399.6017), np.float32(3779.9062), np.float32(3767.7744), np.float32(2387.8044), np.float32(3655.3523)]
2025-09-14 13:06:09,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:06:09,257 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 45 minutes, 3 seconds)
2025-09-14 13:09:05,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:09:12,021 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2582.93872 ± 714.992
2025-09-14 13:09:12,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3100.9539), np.float32(1692.7833), np.float32(1601.4836), np.float32(3356.7065), np.float32(3356.9783), np.float32(2142.8755), np.float32(2351.8958), np.float32(1778.7356), np.float32(2926.902), np.float32(3520.0715)]
2025-09-14 13:09:12,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:09:12,027 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 40 minutes, 26 seconds)
2025-09-14 13:12:08,554 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:12:15,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3315.62744 ± 901.414
2025-09-14 13:12:15,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2538.5925), np.float32(3948.6174), np.float32(3591.4473), np.float32(3612.1694), np.float32(3950.1057), np.float32(2552.7446), np.float32(1181.0986), np.float32(3484.3376), np.float32(4139.815), np.float32(4157.346)]
2025-09-14 13:12:15,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:12:15,306 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 35 minutes, 33 seconds)
2025-09-14 13:15:11,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:15:18,738 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2982.12280 ± 1040.656
2025-09-14 13:15:18,738 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3151.9482), np.float32(2224.8496), np.float32(3341.183), np.float32(4045.552), np.float32(1354.2917), np.float32(3771.107), np.float32(3721.2695), np.float32(1084.7107), np.float32(2902.2922), np.float32(4224.024)]
2025-09-14 13:15:18,739 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:15:18,744 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 30 minutes, 55 seconds)
2025-09-14 13:18:18,586 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:18:25,468 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3016.36401 ± 1047.419
2025-09-14 13:18:25,468 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3669.2634), np.float32(2148.5664), np.float32(2541.2217), np.float32(2943.1108), np.float32(1380.2998), np.float32(1376.4579), np.float32(3756.9067), np.float32(4159.4814), np.float32(4004.8389), np.float32(4183.491)]
2025-09-14 13:18:25,468 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:18:25,473 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 26 minutes, 45 seconds)
2025-09-14 13:21:39,338 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:21:46,684 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2890.71411 ± 1095.530
2025-09-14 13:21:46,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3483.0398), np.float32(1910.8752), np.float32(3872.2324), np.float32(1321.6816), np.float32(1727.0327), np.float32(3557.55), np.float32(3807.6355), np.float32(3942.494), np.float32(3945.4612), np.float32(1339.1365)]
2025-09-14 13:21:46,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:21:46,690 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 24 minutes, 22 seconds)
2025-09-14 13:24:51,749 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:24:58,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2795.71753 ± 1013.253
2025-09-14 13:24:58,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1074.0409), np.float32(1281.0232), np.float32(3900.8357), np.float32(2974.8342), np.float32(3702.5034), np.float32(2432.3914), np.float32(3631.811), np.float32(1858.3774), np.float32(3754.5376), np.float32(3346.8213)]
2025-09-14 13:24:58,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:24:58,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 21 minutes, 59 seconds)
2025-09-14 13:27:50,499 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:27:56,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3079.38940 ± 671.119
2025-09-14 13:27:56,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3131.3228), np.float32(3308.4158), np.float32(1613.1143), np.float32(3866.6726), np.float32(3569.6035), np.float32(2620.648), np.float32(2542.763), np.float32(3218.9207), np.float32(4022.1528), np.float32(2900.2817)]
2025-09-14 13:27:56,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:27:56,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 18 minutes, 27 seconds)
2025-09-14 13:30:48,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:30:55,622 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3421.07764 ± 929.310
2025-09-14 13:30:55,622 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3812.266), np.float32(4201.8096), np.float32(1382.1676), np.float32(1978.6177), np.float32(3805.091), np.float32(3980.7744), np.float32(4064.482), np.float32(4006.6045), np.float32(3037.3413), np.float32(3941.624)]
2025-09-14 13:30:55,622 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:30:55,622 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3421.08) for latency 12
2025-09-14 13:30:55,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 14 minutes, 57 seconds)
2025-09-14 13:33:53,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:33:59,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3340.21338 ± 1251.987
2025-09-14 13:33:59,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3735.998), np.float32(4231.7773), np.float32(4318.1895), np.float32(3076.6316), np.float32(1291.6953), np.float32(4225.4395), np.float32(3769.316), np.float32(4053.0938), np.float32(4091.2349), np.float32(608.7565)]
2025-09-14 13:33:59,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:33:59,940 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 11 minutes, 38 seconds)
2025-09-14 13:36:57,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:37:04,093 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3342.68481 ± 757.252
2025-09-14 13:37:04,094 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3543.9597), np.float32(4140.573), np.float32(2872.5046), np.float32(3503.2947), np.float32(3261.6323), np.float32(3931.2969), np.float32(3044.7827), np.float32(3909.6804), np.float32(3828.5967), np.float32(1390.526)]
2025-09-14 13:37:04,094 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:37:04,099 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 7 minutes, 16 seconds)
2025-09-14 13:40:00,362 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:40:07,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3732.56958 ± 244.580
2025-09-14 13:40:07,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3297.3708), np.float32(3812.2786), np.float32(3919.7512), np.float32(3797.9604), np.float32(3686.3064), np.float32(3380.325), np.float32(4062.8096), np.float32(3509.093), np.float32(3934.0667), np.float32(3925.734)]
2025-09-14 13:40:07,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:40:07,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3732.57) for latency 12
2025-09-14 13:40:07,115 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 3 minutes, 37 seconds)
2025-09-14 13:43:04,515 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:43:11,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3297.14722 ± 809.491
2025-09-14 13:43:11,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3700.5383), np.float32(3940.2395), np.float32(4173.94), np.float32(1719.4811), np.float32(4350.3647), np.float32(3777.426), np.float32(2721.6633), np.float32(3428.0852), np.float32(2606.581), np.float32(2553.1514)]
2025-09-14 13:43:11,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:43:11,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 57 seconds)
2025-09-14 13:46:08,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:46:15,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3530.92627 ± 753.899
2025-09-14 13:46:15,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4094.484), np.float32(3189.8542), np.float32(4085.792), np.float32(3650.3782), np.float32(2801.716), np.float32(4025.8481), np.float32(3823.353), np.float32(3896.2217), np.float32(1633.7959), np.float32(4107.8203)]
2025-09-14 13:46:15,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:46:15,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 58 minutes, 15 seconds)
2025-09-14 13:49:13,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:49:20,527 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3543.71411 ± 762.038
2025-09-14 13:49:20,527 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1567.3362), np.float32(3427.7646), np.float32(4201.4204), np.float32(3527.5063), np.float32(3112.354), np.float32(3682.5933), np.float32(4299.507), np.float32(4136.874), np.float32(4093.3745), np.float32(3388.4097)]
2025-09-14 13:49:20,527 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:49:20,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 55 minutes, 14 seconds)
2025-09-14 13:52:21,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:52:27,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3020.48975 ± 1270.788
2025-09-14 13:52:27,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4293.7354), np.float32(4013.1245), np.float32(2622.911), np.float32(4160.0073), np.float32(1168.6531), np.float32(4375.6357), np.float32(3672.137), np.float32(3394.147), np.float32(1289.7373), np.float32(1214.8062)]
2025-09-14 13:52:27,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:52:27,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 52 minutes, 20 seconds)
2025-09-14 13:55:26,958 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:55:33,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3338.83130 ± 605.535
2025-09-14 13:55:33,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2484.9023), np.float32(3369.039), np.float32(3847.2202), np.float32(1970.9397), np.float32(3694.9897), np.float32(3617.1328), np.float32(3518.67), np.float32(3825.9583), np.float32(3890.43), np.float32(3169.03)]
2025-09-14 13:55:33,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:55:33,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 49 minutes, 25 seconds)
2025-09-14 13:58:30,486 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:58:37,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3401.75537 ± 1022.614
2025-09-14 13:58:37,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4399.569), np.float32(4158.2295), np.float32(3934.3027), np.float32(4294.389), np.float32(2137.4734), np.float32(2217.8438), np.float32(2955.069), np.float32(1560.0062), np.float32(4129.7646), np.float32(4230.9067)]
2025-09-14 13:58:37,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:58:37,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 46 minutes, 17 seconds)
2025-09-14 14:01:32,825 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:01:39,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3613.55200 ± 739.732
2025-09-14 14:01:39,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3555.2954), np.float32(3806.3623), np.float32(3819.1382), np.float32(3214.5322), np.float32(1578.5433), np.float32(4176.6895), np.float32(3814.9092), np.float32(4346.662), np.float32(4000.6428), np.float32(3822.7444)]
2025-09-14 14:01:39,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:01:39,182 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 43 minutes, 6 seconds)
2025-09-14 14:04:33,592 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:04:39,894 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3658.63989 ± 632.117
2025-09-14 14:04:39,895 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3851.8591), np.float32(2188.0168), np.float32(3775.1528), np.float32(4096.726), np.float32(4252.4355), np.float32(3717.501), np.float32(3458.7605), np.float32(4180.4263), np.float32(2865.6035), np.float32(4199.916)]
2025-09-14 14:04:39,895 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:04:39,901 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 39 minutes, 50 seconds)
2025-09-14 14:07:33,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:07:40,154 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3649.16406 ± 759.975
2025-09-14 14:07:40,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4341.1055), np.float32(3925.1438), np.float32(3962.0276), np.float32(2538.844), np.float32(4141.8613), np.float32(3433.6487), np.float32(3756.4265), np.float32(1941.4808), np.float32(4291.3096), np.float32(4159.794)]
2025-09-14 14:07:40,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:07:40,160 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 36 minutes, 29 seconds)
2025-09-14 14:10:38,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:10:45,508 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3373.00708 ± 985.778
2025-09-14 14:10:45,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1305.6388), np.float32(3842.66), np.float32(3933.6926), np.float32(2456.168), np.float32(4070.2742), np.float32(3925.0825), np.float32(3993.1182), np.float32(2088.1038), np.float32(4447.5566), np.float32(3667.775)]
2025-09-14 14:10:45,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:10:45,514 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 33 minutes, 25 seconds)
2025-09-14 14:13:44,413 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:13:51,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3815.11450 ± 726.882
2025-09-14 14:13:51,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3844.631), np.float32(2527.562), np.float32(4358.7314), np.float32(4356.015), np.float32(2269.6538), np.float32(4165.0693), np.float32(4096.8096), np.float32(4328.751), np.float32(4006.673), np.float32(4197.2515)]
2025-09-14 14:13:51,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:13:51,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3815.11) for latency 12
2025-09-14 14:13:51,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 30 minutes, 27 seconds)
2025-09-14 14:16:47,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:16:54,531 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3940.91797 ± 432.801
2025-09-14 14:16:54,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4149.795), np.float32(3946.4944), np.float32(4115.6777), np.float32(4197.0127), np.float32(4355.9453), np.float32(2771.6226), np.float32(3963.2761), np.float32(3657.475), np.float32(4275.235), np.float32(3976.646)]
2025-09-14 14:16:54,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:16:54,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3940.92) for latency 12
2025-09-14 14:16:54,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 27 minutes, 27 seconds)
2025-09-14 14:19:51,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:19:57,830 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3722.55737 ± 809.506
2025-09-14 14:19:57,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4402.244), np.float32(1820.758), np.float32(4257.0947), np.float32(4305.702), np.float32(3762.0847), np.float32(4411.469), np.float32(4048.7656), np.float32(4036.6575), np.float32(3519.2974), np.float32(2661.4988)]
2025-09-14 14:19:57,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:19:57,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 24 minutes, 28 seconds)
2025-09-14 14:22:54,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:23:01,364 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3577.65771 ± 829.981
2025-09-14 14:23:01,364 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4054.802), np.float32(4226.7783), np.float32(1318.4674), np.float32(4158.8896), np.float32(4027.9062), np.float32(3917.723), np.float32(3291.8738), np.float32(3912.6033), np.float32(3084.3042), np.float32(3783.231)]
2025-09-14 14:23:01,364 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:23:01,370 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 21 minutes, 29 seconds)
2025-09-14 14:25:58,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:26:05,333 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2720.94360 ± 1076.277
2025-09-14 14:26:05,333 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2286.2522), np.float32(1453.2563), np.float32(1289.7477), np.float32(2683.9248), np.float32(3265.7615), np.float32(1844.0288), np.float32(4293.0566), np.float32(3942.0813), np.float32(1967.0913), np.float32(4184.235)]
2025-09-14 14:26:05,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:26:05,339 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 18 minutes, 23 seconds)
2025-09-14 14:29:01,702 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:29:08,545 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3521.88037 ± 890.843
2025-09-14 14:29:08,545 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1466.73), np.float32(4016.3318), np.float32(4020.815), np.float32(2215.9697), np.float32(3505.7856), np.float32(3495.8083), np.float32(3934.308), np.float32(4131.383), np.float32(4153.098), np.float32(4278.578)]
2025-09-14 14:29:08,545 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:29:08,551 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 15 minutes, 17 seconds)
2025-09-14 14:32:05,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:32:11,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3222.88037 ± 963.698
2025-09-14 14:32:11,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1822.564), np.float32(4336.003), np.float32(2638.5566), np.float32(4250.7803), np.float32(4161.3584), np.float32(3482.6865), np.float32(3104.794), np.float32(4249.115), np.float32(2329.8997), np.float32(1853.0453)]
2025-09-14 14:32:11,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:32:11,856 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 13 seconds)
2025-09-14 14:35:07,261 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:35:13,806 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3996.72803 ± 153.336
2025-09-14 14:35:13,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4240.955), np.float32(3874.686), np.float32(3862.0674), np.float32(4031.1511), np.float32(4107.3115), np.float32(3732.7634), np.float32(3844.6565), np.float32(4070.387), np.float32(4167.922), np.float32(4035.3813)]
2025-09-14 14:35:13,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:35:13,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3996.73) for latency 12
2025-09-14 14:35:13,813 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 9 seconds)
2025-09-14 14:38:07,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:38:13,979 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3373.63428 ± 1046.810
2025-09-14 14:38:13,979 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2708.398), np.float32(3695.8086), np.float32(3976.55), np.float32(3632.5054), np.float32(4348.106), np.float32(4318.5596), np.float32(1603.9783), np.float32(1388.8208), np.float32(4348.995), np.float32(3714.621)]
2025-09-14 14:38:13,979 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:38:13,985 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 5 seconds)
2025-09-14 14:41:07,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:41:13,634 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3075.94458 ± 1070.196
2025-09-14 14:41:13,634 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4406.6855), np.float32(2706.7747), np.float32(4225.525), np.float32(2033.9214), np.float32(3157.2747), np.float32(1371.2233), np.float32(3982.557), np.float32(1853.6112), np.float32(2635.8616), np.float32(4386.011)]
2025-09-14 14:41:13,635 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:41:13,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 3 minutes, 1 second)
2025-09-14 14:44:09,732 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 14:44:16,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3217.04102 ± 1171.924
2025-09-14 14:44:16,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4030.0266), np.float32(4055.7944), np.float32(3438.6165), np.float32(1398.955), np.float32(3570.1245), np.float32(4053.6921), np.float32(1383.5585), np.float32(4296.531), np.float32(1640.5914), np.float32(4302.5205)]
2025-09-14 14:44:16,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:44:16,680 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1251 [DEBUG]: Training session finished
