2025-09-14 13:34:47,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.150-delay_18
2025-09-14 13:34:47,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.150-delay_18
2025-09-14 13:34:47,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'18': <latency_env.delayed_mdp.ConstantDelay object at 0x7fcb01d27b00>}
2025-09-14 13:34:47,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 13:34:47,528 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 13:34:47,685 baseline-bpql-noisepromille150-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=125, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 13:34:47,685 baseline-bpql-noisepromille150-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 13:34:49,246 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 13:34:49,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 13:37:49,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:37:59,221 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: -365.20410 ± 37.844
2025-09-14 13:37:59,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-337.76144), np.float32(-347.31403), np.float32(-350.40613), np.float32(-389.87946), np.float32(-425.04504), np.float32(-389.7635), np.float32(-383.50272), np.float32(-277.6441), np.float32(-379.16702), np.float32(-371.55762)]
2025-09-14 13:37:59,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:37:59,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (-365.20) for latency 18
2025-09-14 13:37:59,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 5 hours, 13 minutes, 27 seconds)
2025-09-14 13:40:54,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:41:04,520 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: -221.07451 ± 73.012
2025-09-14 13:41:04,520 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-188.00572), np.float32(-254.6233), np.float32(-258.75644), np.float32(-271.14862), np.float32(-261.4957), np.float32(-283.57333), np.float32(-20.75996), np.float32(-195.35669), np.float32(-227.9068), np.float32(-249.11864)]
2025-09-14 13:41:04,520 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:41:04,520 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (-221.07) for latency 18
2025-09-14 13:41:04,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 5 hours, 6 minutes, 28 seconds)
2025-09-14 13:43:59,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:44:09,172 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: -148.61044 ± 128.152
2025-09-14 13:44:09,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-79.262085), np.float32(-410.61996), np.float32(-71.1742), np.float32(-150.08446), np.float32(-330.6307), np.float32(-82.54716), np.float32(18.694433), np.float32(-101.959435), np.float32(-231.27657), np.float32(-47.244133)]
2025-09-14 13:44:09,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:44:09,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (-148.61) for latency 18
2025-09-14 13:44:09,175 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 5 hours, 1 minute, 44 seconds)
2025-09-14 13:47:04,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:47:14,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: -140.77000 ± 121.890
2025-09-14 13:47:14,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-110.31731), np.float32(76.098114), np.float32(-317.11584), np.float32(28.991571), np.float32(-161.06712), np.float32(-208.19324), np.float32(-32.486313), np.float32(-248.3627), np.float32(-235.27684), np.float32(-199.97041)]
2025-09-14 13:47:14,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:47:14,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (-140.77) for latency 18
2025-09-14 13:47:14,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 58 minutes, 6 seconds)
2025-09-14 13:50:07,440 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:50:16,475 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 2.27573 ± 87.426
2025-09-14 13:50:16,475 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(35.83844), np.float32(-15.3678), np.float32(50.069958), np.float32(-15.049075), np.float32(144.30225), np.float32(-136.9666), np.float32(103.39943), np.float32(-129.95901), np.float32(43.127438), np.float32(-56.637726)]
2025-09-14 13:50:16,476 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:50:16,476 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (2.28) for latency 18
2025-09-14 13:50:16,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 53 minutes, 37 seconds)
2025-09-14 13:53:15,732 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:53:26,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 9.75651 ± 162.196
2025-09-14 13:53:26,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-207.56055), np.float32(325.00452), np.float32(63.681007), np.float32(-101.00016), np.float32(-57.279057), np.float32(-62.843193), np.float32(-144.25006), np.float32(29.82447), np.float32(268.09293), np.float32(-16.104773)]
2025-09-14 13:53:26,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:53:26,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (9.76) for latency 18
2025-09-14 13:53:27,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 50 minutes, 42 seconds)
2025-09-14 13:56:44,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:56:55,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 112.24315 ± 161.190
2025-09-14 13:56:55,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(313.0483), np.float32(-6.104296), np.float32(15.253398), np.float32(244.84392), np.float32(282.8737), np.float32(108.30406), np.float32(166.16393), np.float32(-220.83147), np.float32(-23.812912), np.float32(242.69292)]
2025-09-14 13:56:55,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:56:55,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (112.24) for latency 18
2025-09-14 13:56:55,612 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 54 minutes, 50 seconds)
2025-09-14 14:00:13,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:00:22,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 208.79678 ± 97.132
2025-09-14 14:00:22,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(28.21005), np.float32(296.78696), np.float32(136.554), np.float32(148.39021), np.float32(250.52661), np.float32(328.87943), np.float32(163.07889), np.float32(177.98465), np.float32(188.9781), np.float32(368.5788)]
2025-09-14 14:00:22,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:00:22,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (208.80) for latency 18
2025-09-14 14:00:22,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 58 minutes, 37 seconds)
2025-09-14 14:03:23,788 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:03:33,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 437.32764 ± 62.711
2025-09-14 14:03:33,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(370.28003), np.float32(389.41293), np.float32(578.04376), np.float32(433.22696), np.float32(406.52017), np.float32(402.21173), np.float32(520.81714), np.float32(452.5344), np.float32(379.6802), np.float32(440.54886)]
2025-09-14 14:03:33,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:03:33,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (437.33) for latency 18
2025-09-14 14:03:33,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 56 minutes, 59 seconds)
2025-09-14 14:06:28,899 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:06:38,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 537.28064 ± 309.404
2025-09-14 14:06:38,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(763.412), np.float32(517.9081), np.float32(-357.80746), np.float32(669.87756), np.float32(522.66046), np.float32(660.00885), np.float32(533.09265), np.float32(736.49634), np.float32(655.00507), np.float32(672.1535)]
2025-09-14 14:06:38,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:06:38,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (537.28) for latency 18
2025-09-14 14:06:38,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 54 minutes, 32 seconds)
2025-09-14 14:09:32,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:09:41,064 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 822.04218 ± 78.488
2025-09-14 14:09:41,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(674.3481), np.float32(843.82574), np.float32(926.4193), np.float32(863.1552), np.float32(867.20746), np.float32(710.25586), np.float32(768.67236), np.float32(820.9478), np.float32(827.00775), np.float32(918.5825)]
2025-09-14 14:09:41,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:09:41,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (822.04) for latency 18
2025-09-14 14:09:41,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 48 minutes, 58 seconds)
2025-09-14 14:12:35,441 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:12:45,295 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 856.02655 ± 110.114
2025-09-14 14:12:45,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(552.18304), np.float32(867.8796), np.float32(937.2434), np.float32(880.479), np.float32(864.2827), np.float32(928.9595), np.float32(884.5798), np.float32(852.5919), np.float32(974.072), np.float32(817.99396)]
2025-09-14 14:12:45,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:12:45,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (856.03) for latency 18
2025-09-14 14:12:45,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 38 minutes, 34 seconds)
2025-09-14 14:15:40,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:15:50,287 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 820.04395 ± 425.031
2025-09-14 14:15:50,288 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(887.83215), np.float32(994.88855), np.float32(-363.8716), np.float32(873.0942), np.float32(699.96063), np.float32(1312.8679), np.float32(1059.0626), np.float32(849.7166), np.float32(840.69586), np.float32(1046.1926)]
2025-09-14 14:15:50,288 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:15:50,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 28 minutes, 55 seconds)
2025-09-14 14:18:55,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:19:06,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 933.84070 ± 84.424
2025-09-14 14:19:06,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(828.1533), np.float32(1008.69855), np.float32(938.00586), np.float32(950.90546), np.float32(937.62384), np.float32(1116.0824), np.float32(941.1919), np.float32(800.97754), np.float32(939.7102), np.float32(877.0582)]
2025-09-14 14:19:06,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:19:06,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (933.84) for latency 18
2025-09-14 14:19:06,681 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 27 minutes, 28 seconds)
2025-09-14 14:22:24,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:22:35,980 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 867.24377 ± 189.003
2025-09-14 14:22:35,980 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(747.45294), np.float32(516.1349), np.float32(806.8101), np.float32(780.76013), np.float32(689.0692), np.float32(1038.2826), np.float32(1186.6942), np.float32(971.80853), np.float32(883.1799), np.float32(1052.2452)]
2025-09-14 14:22:35,980 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:22:35,983 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 31 minutes, 21 seconds)
2025-09-14 14:25:47,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:25:57,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 929.84766 ± 156.808
2025-09-14 14:25:57,531 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1030.1478), np.float32(1040.228), np.float32(488.91846), np.float32(1043.9086), np.float32(1024.726), np.float32(953.1325), np.float32(950.7074), np.float32(886.7597), np.float32(985.65405), np.float32(894.29346)]
2025-09-14 14:25:57,531 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:25:57,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 33 minutes, 24 seconds)
2025-09-14 14:28:57,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:29:07,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 822.96063 ± 336.449
2025-09-14 14:29:07,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(927.7932), np.float32(-154.55511), np.float32(1012.30865), np.float32(776.42566), np.float32(993.9225), np.float32(912.1889), np.float32(918.11786), np.float32(798.051), np.float32(1002.75946), np.float32(1042.5945)]
2025-09-14 14:29:07,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:29:07,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 31 minutes, 47 seconds)
2025-09-14 14:32:01,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:32:10,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 972.26416 ± 98.821
2025-09-14 14:32:10,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(958.60364), np.float32(1024.6853), np.float32(927.5188), np.float32(990.4968), np.float32(900.6719), np.float32(936.994), np.float32(867.9604), np.float32(927.50867), np.float32(1241.4905), np.float32(946.71173)]
2025-09-14 14:32:10,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:32:10,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (972.26) for latency 18
2025-09-14 14:32:10,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 27 minutes, 49 seconds)
2025-09-14 14:35:05,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:35:15,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 933.51630 ± 85.377
2025-09-14 14:35:15,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(934.4031), np.float32(992.04645), np.float32(1035.6926), np.float32(801.6824), np.float32(894.58203), np.float32(930.17035), np.float32(819.1031), np.float32(1085.55), np.float32(963.38385), np.float32(878.5487)]
2025-09-14 14:35:15,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:35:15,231 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 21 minutes, 30 seconds)
2025-09-14 14:38:11,424 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:38:21,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 950.32373 ± 115.539
2025-09-14 14:38:21,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1011.1757), np.float32(1037.798), np.float32(950.26526), np.float32(1122.4939), np.float32(703.41736), np.float32(966.4394), np.float32(980.8742), np.float32(829.7737), np.float32(861.601), np.float32(1039.3998)]
2025-09-14 14:38:21,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:38:21,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 12 minutes, 4 seconds)
2025-09-14 14:41:16,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:41:26,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1052.21509 ± 191.542
2025-09-14 14:41:26,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1079.4514), np.float32(814.98566), np.float32(1186.4082), np.float32(951.5555), np.float32(1009.50336), np.float32(880.84296), np.float32(1044.6588), np.float32(1542.6864), np.float32(1059.1674), np.float32(952.89185)]
2025-09-14 14:41:26,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:41:26,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1052.22) for latency 18
2025-09-14 14:41:26,398 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 4 minutes, 36 seconds)
2025-09-14 14:44:22,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:44:32,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1058.58179 ± 93.319
2025-09-14 14:44:32,042 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1127.7458), np.float32(1024.8386), np.float32(995.2674), np.float32(1149.661), np.float32(1202.3866), np.float32(1082.2538), np.float32(1138.647), np.float32(932.9786), np.float32(1030.424), np.float32(901.61584)]
2025-09-14 14:44:32,042 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:44:32,042 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1058.58) for latency 18
2025-09-14 14:44:32,045 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 20 seconds)
2025-09-14 14:47:45,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:47:56,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 991.28613 ± 101.927
2025-09-14 14:47:56,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1087.5594), np.float32(954.2504), np.float32(862.7795), np.float32(1076.0289), np.float32(1096.3706), np.float32(1065.925), np.float32(841.0648), np.float32(1092.219), np.float32(987.4104), np.float32(849.2531)]
2025-09-14 14:47:56,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:47:56,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 2 minutes, 57 seconds)
2025-09-14 14:51:15,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:51:26,167 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 993.38879 ± 104.877
2025-09-14 14:51:26,168 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(993.15424), np.float32(1134.4064), np.float32(1020.3576), np.float32(1014.99786), np.float32(938.34045), np.float32(1042.4702), np.float32(999.3533), np.float32(961.31494), np.float32(1101.8795), np.float32(727.61304)]
2025-09-14 14:51:26,168 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:51:26,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 5 minutes, 58 seconds)
2025-09-14 14:54:31,895 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:54:41,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1009.17401 ± 167.542
2025-09-14 14:54:41,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1038.4214), np.float32(923.6734), np.float32(1020.20636), np.float32(1322.087), np.float32(968.9935), np.float32(875.99536), np.float32(1221.7365), np.float32(966.3523), np.float32(685.73914), np.float32(1068.5354)]
2025-09-14 14:54:41,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:54:41,589 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 5 minutes, 4 seconds)
2025-09-14 14:57:39,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:57:49,066 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1127.24805 ± 167.221
2025-09-14 14:57:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1074.1454), np.float32(1153.9897), np.float32(948.23553), np.float32(1181.5123), np.float32(888.7209), np.float32(1401.1112), np.float32(1415.557), np.float32(1163.2758), np.float32(1061.9346), np.float32(983.9981)]
2025-09-14 14:57:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:57:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1127.25) for latency 18
2025-09-14 14:57:49,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 2 minutes, 23 seconds)
2025-09-14 15:00:44,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:00:53,899 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1135.45923 ± 141.732
2025-09-14 15:00:53,899 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(972.50885), np.float32(1195.6866), np.float32(1246.9851), np.float32(968.20294), np.float32(1144.2913), np.float32(1345.3099), np.float32(1346.9203), np.float32(1015.8124), np.float32(971.1698), np.float32(1147.7047)]
2025-09-14 15:00:53,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:00:53,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1135.46) for latency 18
2025-09-14 15:00:53,903 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 3 hours, 58 minutes, 55 seconds)
2025-09-14 15:03:49,283 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:03:58,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1112.76550 ± 142.487
2025-09-14 15:03:58,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(981.8632), np.float32(936.47186), np.float32(982.82556), np.float32(1069.8615), np.float32(1274.3105), np.float32(1384.7585), np.float32(1143.0968), np.float32(1036.9633), np.float32(1046.6964), np.float32(1270.8088)]
2025-09-14 15:03:58,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:03:58,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 3 hours, 50 minutes, 52 seconds)
2025-09-14 15:06:54,403 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:07:03,716 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1108.20154 ± 124.758
2025-09-14 15:07:03,716 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1408.5394), np.float32(907.40753), np.float32(1132.6512), np.float32(1202.8335), np.float32(1044.9084), np.float32(1106.7114), np.float32(1103.2908), np.float32(1037.5116), np.float32(1027.3644), np.float32(1110.7977)]
2025-09-14 15:07:03,716 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:07:03,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 3 hours, 41 minutes, 53 seconds)
2025-09-14 15:09:57,491 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:10:06,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1246.35425 ± 226.484
2025-09-14 15:10:06,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1578.0101), np.float32(1054.949), np.float32(1461.582), np.float32(1288.0732), np.float32(1519.3492), np.float32(879.7952), np.float32(1177.106), np.float32(1099.0751), np.float32(1003.95276), np.float32(1401.649)]
2025-09-14 15:10:06,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:10:06,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1246.35) for latency 18
2025-09-14 15:10:06,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 3 hours, 35 minutes, 45 seconds)
2025-09-14 15:13:06,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:13:17,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1161.26514 ± 161.435
2025-09-14 15:13:17,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1284.2374), np.float32(1163.6542), np.float32(1151.16), np.float32(1314.4293), np.float32(1284.2358), np.float32(1006.4969), np.float32(962.10693), np.float32(993.38556), np.float32(1457.2291), np.float32(995.716)]
2025-09-14 15:13:17,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:13:17,211 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 33 minutes, 28 seconds)
2025-09-14 15:16:35,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:16:46,178 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1317.46167 ± 318.613
2025-09-14 15:16:46,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1127.3247), np.float32(1083.533), np.float32(1092.1965), np.float32(1559.284), np.float32(1648.2194), np.float32(1000.09894), np.float32(2062.1702), np.float32(1121.5896), np.float32(1240.794), np.float32(1239.407)]
2025-09-14 15:16:46,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:16:46,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1317.46) for latency 18
2025-09-14 15:16:46,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 35 minutes, 51 seconds)
2025-09-14 15:20:03,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:20:12,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1173.33630 ± 182.578
2025-09-14 15:20:12,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(956.7966), np.float32(1305.8256), np.float32(1482.2622), np.float32(986.58044), np.float32(1072.8335), np.float32(1036.5891), np.float32(1165.9071), np.float32(1469.381), np.float32(1223.6995), np.float32(1033.4882)]
2025-09-14 15:20:12,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:20:12,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 37 minutes, 31 seconds)
2025-09-14 15:23:12,656 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:23:22,165 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1126.90320 ± 193.567
2025-09-14 15:23:22,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1049.6466), np.float32(931.2808), np.float32(1001.6898), np.float32(1477.7877), np.float32(994.1574), np.float32(1396.0789), np.float32(1367.5813), np.float32(1083.9374), np.float32(965.6325), np.float32(1001.2393)]
2025-09-14 15:23:22,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:23:22,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 35 minutes, 15 seconds)
2025-09-14 15:26:18,473 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:26:28,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1336.69312 ± 282.724
2025-09-14 15:26:28,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1835.7704), np.float32(1465.142), np.float32(1499.3762), np.float32(1694.7953), np.float32(1209.4437), np.float32(869.7005), np.float32(1308.8767), np.float32(1056.1791), np.float32(1081.315), np.float32(1346.3312)]
2025-09-14 15:26:28,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:26:28,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1336.69) for latency 18
2025-09-14 15:26:28,018 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 32 minutes, 43 seconds)
2025-09-14 15:29:23,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:29:33,019 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1306.87964 ± 237.473
2025-09-14 15:29:33,019 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1574.8439), np.float32(1657.3499), np.float32(1095.2172), np.float32(1200.13), np.float32(1300.3151), np.float32(1524.1176), np.float32(945.4117), np.float32(1536.1696), np.float32(1043.4598), np.float32(1191.7819)]
2025-09-14 15:29:33,019 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:29:33,023 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 28 minutes, 10 seconds)
2025-09-14 15:32:28,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:32:38,192 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1097.04724 ± 153.966
2025-09-14 15:32:38,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(883.8263), np.float32(1121.3435), np.float32(1275.1213), np.float32(1125.5684), np.float32(995.96185), np.float32(1192.4), np.float32(1409.2034), np.float32(995.48004), np.float32(928.11255), np.float32(1043.4557)]
2025-09-14 15:32:38,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:32:38,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 19 minutes, 55 seconds)
2025-09-14 15:35:33,526 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:35:42,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1253.29858 ± 231.006
2025-09-14 15:35:42,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1060.361), np.float32(1250.8615), np.float32(1087.8777), np.float32(1116.4354), np.float32(1314.5172), np.float32(1644.1205), np.float32(1097.0908), np.float32(1702.148), np.float32(992.9528), np.float32(1266.6211)]
2025-09-14 15:35:42,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:35:42,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 12 minutes, 3 seconds)
2025-09-14 15:38:35,997 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:38:45,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1310.18713 ± 338.400
2025-09-14 15:38:45,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1012.20667), np.float32(984.818), np.float32(1743.4675), np.float32(1674.0172), np.float32(1168.1235), np.float32(1062.2125), np.float32(1484.5803), np.float32(1120.3485), np.float32(1903.2346), np.float32(948.8626)]
2025-09-14 15:38:45,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:38:45,279 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 7 minutes, 41 seconds)
2025-09-14 15:41:45,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:41:56,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1254.03394 ± 268.747
2025-09-14 15:41:56,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1731.5148), np.float32(1047.1824), np.float32(1083.9279), np.float32(952.72534), np.float32(1483.292), np.float32(1245.2297), np.float32(1301.2107), np.float32(973.8748), np.float32(1657.0421), np.float32(1064.3391)]
2025-09-14 15:41:56,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:41:56,626 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 5 minutes, 43 seconds)
2025-09-14 15:45:14,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:45:25,277 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1322.34961 ± 291.927
2025-09-14 15:45:25,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1410.5237), np.float32(1421.2916), np.float32(1079.9375), np.float32(1229.4874), np.float32(1938.3157), np.float32(1244.8469), np.float32(986.5106), np.float32(998.49695), np.float32(1709.9172), np.float32(1204.1687)]
2025-09-14 15:45:25,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:45:25,282 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 7 minutes, 16 seconds)
2025-09-14 15:48:41,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:48:51,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1200.14807 ± 486.370
2025-09-14 15:48:51,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1395.2449), np.float32(-62.428852), np.float32(1824.8481), np.float32(1493.5756), np.float32(1480.7258), np.float32(1076.277), np.float32(1533.702), np.float32(1008.61365), np.float32(1081.4269), np.float32(1169.4955)]
2025-09-14 15:48:51,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:48:51,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 8 minutes, 6 seconds)
2025-09-14 15:51:50,477 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:51:59,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1329.23169 ± 329.565
2025-09-14 15:51:59,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1047.7219), np.float32(1895.3143), np.float32(1203.5198), np.float32(1782.6843), np.float32(1192.8342), np.float32(1057.7919), np.float32(1044.4489), np.float32(1718.3309), np.float32(959.71533), np.float32(1389.955)]
2025-09-14 15:51:59,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:52:00,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 5 minutes, 48 seconds)
2025-09-14 15:54:55,698 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:55:05,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1262.55078 ± 312.148
2025-09-14 15:55:05,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1279.7427), np.float32(908.2632), np.float32(947.0075), np.float32(1026.4249), np.float32(973.7659), np.float32(1980.584), np.float32(1276.2701), np.float32(1387.1447), np.float32(1307.056), np.float32(1539.2494)]
2025-09-14 15:55:05,386 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:55:05,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 2 minutes, 57 seconds)
2025-09-14 15:57:59,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:58:09,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1359.60217 ± 218.778
2025-09-14 15:58:09,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1339.0315), np.float32(1046.0596), np.float32(1649.6198), np.float32(1263.1023), np.float32(1246.0637), np.float32(1022.4748), np.float32(1620.478), np.float32(1469.6656), np.float32(1303.1812), np.float32(1636.3441)]
2025-09-14 15:58:09,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:58:09,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1359.60) for latency 18
2025-09-14 15:58:09,339 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 58 minutes, 19 seconds)
2025-09-14 16:01:05,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:01:15,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1270.85840 ± 169.942
2025-09-14 16:01:15,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1338.8953), np.float32(1244.3792), np.float32(1574.7028), np.float32(1232.3739), np.float32(1534.559), np.float32(1241.5999), np.float32(1169.9902), np.float32(1088.8678), np.float32(1287.5004), np.float32(995.71674)]
2025-09-14 16:01:15,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:01:15,238 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 50 minutes, 59 seconds)
2025-09-14 16:04:11,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:04:19,910 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1312.08960 ± 294.834
2025-09-14 16:04:19,910 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1680.805), np.float32(1133.9957), np.float32(1352.032), np.float32(1852.9752), np.float32(1011.40137), np.float32(1166.5769), np.float32(1668.1908), np.float32(1006.8202), np.float32(1073.3102), np.float32(1174.789)]
2025-09-14 16:04:19,910 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:04:19,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 44 minutes, 4 seconds)
2025-09-14 16:07:13,785 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:07:22,990 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1303.76257 ± 211.298
2025-09-14 16:07:22,990 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1132.1041), np.float32(1305.8016), np.float32(1306.4005), np.float32(1385.1012), np.float32(1090.7594), np.float32(1406.9404), np.float32(1117.2178), np.float32(1798.2369), np.float32(1062.1554), np.float32(1432.9088)]
2025-09-14 16:07:22,990 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:07:22,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 39 minutes, 59 seconds)
2025-09-14 16:10:24,462 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:10:34,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1313.75256 ± 369.575
2025-09-14 16:10:34,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1941.76), np.float32(1614.0146), np.float32(1194.1604), np.float32(1008.9952), np.float32(962.77155), np.float32(964.1565), np.float32(1014.35815), np.float32(1194.2482), np.float32(1964.5691), np.float32(1278.4917)]
2025-09-14 16:10:34,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:10:34,655 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 37 minutes, 58 seconds)
2025-09-14 16:13:52,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:14:03,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1534.10657 ± 366.012
2025-09-14 16:14:03,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1477.5958), np.float32(1158.3326), np.float32(1846.1094), np.float32(1031.8577), np.float32(1925.7322), np.float32(1365.6527), np.float32(1652.2579), np.float32(2261.6494), np.float32(1208.4357), np.float32(1413.4426)]
2025-09-14 16:14:03,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:14:03,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1534.11) for latency 18
2025-09-14 16:14:03,544 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 39 minutes, 2 seconds)
2025-09-14 16:17:19,669 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:17:29,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1297.58032 ± 255.155
2025-09-14 16:17:29,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1136.5399), np.float32(1021.9142), np.float32(1298.174), np.float32(1031.0447), np.float32(1812.7737), np.float32(1467.13), np.float32(1391.8602), np.float32(1256.5654), np.float32(1573.7598), np.float32(986.0408)]
2025-09-14 16:17:29,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:17:29,410 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 39 minutes, 6 seconds)
2025-09-14 16:20:29,349 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:20:39,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1228.67603 ± 117.233
2025-09-14 16:20:39,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1409.4431), np.float32(1122.2863), np.float32(1110.2034), np.float32(1360.0364), np.float32(1377.5917), np.float32(1329.7631), np.float32(1157.9362), np.float32(1165.5111), np.float32(1133.3904), np.float32(1120.599)]
2025-09-14 16:20:39,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:20:39,492 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 36 minutes, 43 seconds)
2025-09-14 16:23:34,999 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:23:44,644 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1550.64526 ± 399.506
2025-09-14 16:23:44,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(944.59), np.float32(1247.4285), np.float32(1664.3882), np.float32(1813.2662), np.float32(1688.7612), np.float32(1696.3068), np.float32(1086.2378), np.float32(1983.7671), np.float32(2225.6514), np.float32(1156.0547)]
2025-09-14 16:23:44,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:23:44,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1550.65) for latency 18
2025-09-14 16:23:44,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 33 minutes, 47 seconds)
2025-09-14 16:26:39,486 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:26:49,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1342.86987 ± 197.232
2025-09-14 16:26:49,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1176.6399), np.float32(1393.6866), np.float32(1192.5302), np.float32(1239.6146), np.float32(1461.4465), np.float32(1065.9265), np.float32(1429.7595), np.float32(1161.4015), np.float32(1632.9476), np.float32(1674.746)]
2025-09-14 16:26:49,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:26:49,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 29 minutes, 24 seconds)
2025-09-14 16:29:44,528 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:29:54,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1295.14087 ± 320.839
2025-09-14 16:29:54,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1631.8099), np.float32(1452.203), np.float32(2019.9574), np.float32(1331.4758), np.float32(1059.9719), np.float32(1103.1292), np.float32(1158.8456), np.float32(994.9876), np.float32(1308.4437), np.float32(890.5842)]
2025-09-14 16:29:54,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:29:54,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 22 minutes, 35 seconds)
2025-09-14 16:32:47,901 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:32:56,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1409.48511 ± 300.943
2025-09-14 16:32:56,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1499.6283), np.float32(1901.6412), np.float32(1647.7487), np.float32(1406.875), np.float32(1044.5498), np.float32(1637.9641), np.float32(1281.4105), np.float32(1668.1967), np.float32(1008.6344), np.float32(998.20264)]
2025-09-14 16:32:56,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:32:56,663 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 15 minutes, 59 seconds)
2025-09-14 16:35:51,802 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:36:01,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1374.87085 ± 298.484
2025-09-14 16:36:01,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1346.5195), np.float32(1672.6996), np.float32(965.28955), np.float32(1099.7504), np.float32(1764.9451), np.float32(1033.0588), np.float32(1657.2793), np.float32(1544.414), np.float32(1629.7565), np.float32(1034.9954)]
2025-09-14 16:36:01,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:36:01,601 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 12 minutes, 10 seconds)
2025-09-14 16:39:03,689 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:39:14,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1640.62085 ± 373.116
2025-09-14 16:39:14,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1850.6848), np.float32(1564.4479), np.float32(2076.8281), np.float32(1532.4188), np.float32(1359.2654), np.float32(2082.7166), np.float32(1112.8386), np.float32(1495.8479), np.float32(1134.132), np.float32(2197.0278)]
2025-09-14 16:39:14,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:39:14,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1640.62) for latency 18
2025-09-14 16:39:14,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 10 minutes, 13 seconds)
2025-09-14 16:42:32,550 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:42:43,758 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1680.57458 ± 415.785
2025-09-14 16:42:43,758 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(949.62604), np.float32(2027.1812), np.float32(1780.8783), np.float32(1822.7275), np.float32(1938.1556), np.float32(1133.9922), np.float32(2097.118), np.float32(2130.3838), np.float32(1140.0964), np.float32(1785.5878)]
2025-09-14 16:42:43,758 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:42:43,758 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1680.57) for latency 18
2025-09-14 16:42:43,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 10 minutes, 28 seconds)
2025-09-14 16:45:57,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:46:08,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1593.60229 ± 334.206
2025-09-14 16:46:08,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1379.0643), np.float32(1714.608), np.float32(1753.607), np.float32(2204.8992), np.float32(1125.3298), np.float32(1599.9767), np.float32(1016.3844), np.float32(1808.9875), np.float32(1820.5912), np.float32(1512.5753)]
2025-09-14 16:46:08,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:46:08,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 9 minutes, 52 seconds)
2025-09-14 16:49:07,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:49:17,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1543.17871 ± 418.597
2025-09-14 16:49:17,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2180.1753), np.float32(1065.139), np.float32(1874.1414), np.float32(1625.9166), np.float32(1867.6766), np.float32(1091.764), np.float32(1163.2545), np.float32(2102.7117), np.float32(1089.89), np.float32(1371.1183)]
2025-09-14 16:49:17,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:49:17,483 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 7 minutes, 30 seconds)
2025-09-14 16:52:13,116 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:52:23,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1467.62964 ± 370.089
2025-09-14 16:52:23,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1281.3806), np.float32(2163.5508), np.float32(1660.4625), np.float32(1244.4679), np.float32(1340.2972), np.float32(1100.2361), np.float32(2116.564), np.float32(1135.4163), np.float32(1450.3812), np.float32(1183.5409)]
2025-09-14 16:52:23,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:52:23,181 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 2 hours, 4 minutes, 20 seconds)
2025-09-14 16:55:17,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:55:27,515 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1341.66614 ± 222.865
2025-09-14 16:55:27,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1152.6427), np.float32(1053.8767), np.float32(1067.2451), np.float32(1764.896), np.float32(1228.4476), np.float32(1406.1995), np.float32(1287.061), np.float32(1337.8048), np.float32(1626.4867), np.float32(1492.0007)]
2025-09-14 16:55:27,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:55:27,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 59 minutes, 57 seconds)
2025-09-14 16:58:22,066 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:58:30,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1543.98584 ± 505.460
2025-09-14 16:58:30,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1679.6348), np.float32(1107.3015), np.float32(1014.24304), np.float32(2580.8076), np.float32(974.19635), np.float32(1800.4883), np.float32(1317.4248), np.float32(2026.9976), np.float32(1859.23), np.float32(1079.5342)]
2025-09-14 16:58:30,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:58:30,739 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 53 minutes, 38 seconds)
2025-09-14 17:01:24,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:01:34,108 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1303.45386 ± 383.747
2025-09-14 17:01:34,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1159.5194), np.float32(1177.5278), np.float32(1961.0488), np.float32(1078.5004), np.float32(1523.5638), np.float32(1007.0716), np.float32(1018.3741), np.float32(986.2769), np.float32(2062.2512), np.float32(1060.4053)]
2025-09-14 17:01:34,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:01:34,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 48 minutes, 1 second)
2025-09-14 17:04:29,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:04:39,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1488.95178 ± 370.889
2025-09-14 17:04:39,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1039.0905), np.float32(1027.298), np.float32(1767.4071), np.float32(2199.3538), np.float32(1834.0493), np.float32(1765.3561), np.float32(1195.1633), np.float32(1192.4589), np.float32(1468.3706), np.float32(1400.971)]
2025-09-14 17:04:39,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:04:39,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 44 minutes, 28 seconds)
2025-09-14 17:07:43,675 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:07:54,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1281.38098 ± 398.053
2025-09-14 17:07:54,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1142.9232), np.float32(2404.617), np.float32(1127.7639), np.float32(970.6754), np.float32(1399.0061), np.float32(1008.98755), np.float32(1131.3055), np.float32(1404.7084), np.float32(1099.9274), np.float32(1123.8944)]
2025-09-14 17:07:54,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:07:54,897 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 42 minutes, 29 seconds)
2025-09-14 17:11:13,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:11:24,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1496.67676 ± 496.920
2025-09-14 17:11:24,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1389.6826), np.float32(1050.4269), np.float32(945.18866), np.float32(1922.5027), np.float32(1219.6527), np.float32(1100.8358), np.float32(1501.2623), np.float32(2292.058), np.float32(2386.955), np.float32(1158.2026)]
2025-09-14 17:11:24,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:11:24,692 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 42 minutes, 5 seconds)
2025-09-14 17:14:37,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:14:47,178 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1572.96265 ± 415.943
2025-09-14 17:14:47,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1922.4666), np.float32(1543.7434), np.float32(1468.8779), np.float32(1386.801), np.float32(1086.4357), np.float32(1654.7976), np.float32(2517.3875), np.float32(995.6345), np.float32(1361.2065), np.float32(1792.2754)]
2025-09-14 17:14:47,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:14:47,185 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 40 minutes, 53 seconds)
2025-09-14 17:17:47,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:17:56,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1542.28442 ± 475.853
2025-09-14 17:17:56,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2440.8452), np.float32(997.1714), np.float32(2163.5852), np.float32(1058.5107), np.float32(1291.428), np.float32(1726.7087), np.float32(1334.4525), np.float32(1882.7211), np.float32(1010.2226), np.float32(1517.1989)]
2025-09-14 17:17:56,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:17:56,414 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 38 minutes, 13 seconds)
2025-09-14 17:20:49,773 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:20:58,479 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1346.21948 ± 300.835
2025-09-14 17:20:58,479 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1212.6934), np.float32(1760.2974), np.float32(1492.4536), np.float32(1030.6855), np.float32(944.9135), np.float32(1725.0077), np.float32(1701.0768), np.float32(1434.6222), np.float32(1166.8392), np.float32(993.60583)]
2025-09-14 17:20:58,479 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:20:58,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 34 minutes, 39 seconds)
2025-09-14 17:23:53,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:24:03,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1392.81018 ± 265.541
2025-09-14 17:24:03,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1085.7336), np.float32(1660.3851), np.float32(1786.1506), np.float32(1160.0831), np.float32(1140.8181), np.float32(1568.5309), np.float32(1643.0347), np.float32(985.2008), np.float32(1428.1245), np.float32(1470.0386)]
2025-09-14 17:24:03,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:24:03,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 30 minutes, 25 seconds)
2025-09-14 17:26:59,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:27:09,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1506.87219 ± 310.167
2025-09-14 17:27:09,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1452.5527), np.float32(1255.8073), np.float32(1848.13), np.float32(1376.262), np.float32(1127.8351), np.float32(1779.8754), np.float32(1154.5536), np.float32(1607.9027), np.float32(1340.5863), np.float32(2125.217)]
2025-09-14 17:27:09,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:27:09,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 25 minutes, 2 seconds)
2025-09-14 17:30:04,618 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:30:14,399 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1557.57593 ± 410.097
2025-09-14 17:30:14,399 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1692.6702), np.float32(2095.2874), np.float32(2339.6816), np.float32(1296.2615), np.float32(1427.3208), np.float32(1185.2213), np.float32(1900.7343), np.float32(1418.919), np.float32(1088.7104), np.float32(1130.9531)]
2025-09-14 17:30:14,399 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:30:14,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 20 minutes, 21 seconds)
2025-09-14 17:33:09,821 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:33:18,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1581.09961 ± 583.929
2025-09-14 17:33:18,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1357.2944), np.float32(1141.4324), np.float32(2252.567), np.float32(2607.7383), np.float32(1069.7406), np.float32(1577.4185), np.float32(2415.074), np.float32(1379.7585), np.float32(997.43085), np.float32(1012.54175)]
2025-09-14 17:33:18,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:33:18,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 16 minutes, 50 seconds)
2025-09-14 17:36:33,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:36:45,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1452.67004 ± 388.994
2025-09-14 17:36:45,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1320.3955), np.float32(1796.8857), np.float32(990.4572), np.float32(1879.1995), np.float32(1640.1995), np.float32(2102.857), np.float32(1632.11), np.float32(1160.3469), np.float32(1007.9277), np.float32(996.3209)]
2025-09-14 17:36:45,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:36:45,126 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 15 minutes, 43 seconds)
2025-09-14 17:40:03,613 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:40:14,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1382.98816 ± 561.164
2025-09-14 17:40:14,577 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1910.8402), np.float32(1438.8018), np.float32(2210.5957), np.float32(1176.126), np.float32(1068.2395), np.float32(391.4555), np.float32(1107.7915), np.float32(2288.0608), np.float32(1226.4519), np.float32(1011.51825)]
2025-09-14 17:40:14,577 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:40:14,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 14 minutes, 25 seconds)
2025-09-14 17:43:18,251 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:43:28,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1625.06067 ± 600.685
2025-09-14 17:43:28,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1153.7676), np.float32(1594.2194), np.float32(1150.0973), np.float32(2814.2468), np.float32(1637.5454), np.float32(1194.6025), np.float32(1056.2634), np.float32(1173.9242), np.float32(1862.7504), np.float32(2613.1907)]
2025-09-14 17:43:28,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:43:28,277 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 11 minutes, 46 seconds)
2025-09-14 17:46:24,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:46:34,423 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1569.03455 ± 631.109
2025-09-14 17:46:34,424 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1524.7603), np.float32(1100.9683), np.float32(3062.2676), np.float32(1744.0487), np.float32(2388.6365), np.float32(1152.9547), np.float32(1239.4576), np.float32(1197.4519), np.float32(1284.2847), np.float32(995.51605)]
2025-09-14 17:46:34,424 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:46:34,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 8 minutes, 36 seconds)
2025-09-14 17:49:29,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:49:39,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1377.71704 ± 346.932
2025-09-14 17:49:39,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1191.7148), np.float32(1194.1643), np.float32(958.2617), np.float32(1231.0103), np.float32(1229.5427), np.float32(2058.21), np.float32(1925.1703), np.float32(1538.981), np.float32(1024.4121), np.float32(1425.7029)]
2025-09-14 17:49:39,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:49:39,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 5 minutes, 24 seconds)
2025-09-14 17:52:35,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:52:45,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1392.99792 ± 410.168
2025-09-14 17:52:45,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1114.3351), np.float32(1051.6533), np.float32(1158.4093), np.float32(1454.2921), np.float32(1221.7738), np.float32(1049.9283), np.float32(1562.1292), np.float32(1284.2124), np.float32(2498.3638), np.float32(1534.8818)]
2025-09-14 17:52:45,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:52:45,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 1 hour, 48 seconds)
2025-09-14 17:55:38,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:55:47,895 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1733.34961 ± 586.397
2025-09-14 17:55:47,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2539.9753), np.float32(1178.3262), np.float32(1056.1011), np.float32(1869.3103), np.float32(1323.4962), np.float32(1005.605), np.float32(2195.1624), np.float32(1899.1755), np.float32(2743.374), np.float32(1522.9701)]
2025-09-14 17:55:47,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:55:47,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1733.35) for latency 18
2025-09-14 17:55:47,902 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 55 minutes, 59 seconds)
2025-09-14 17:58:43,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:58:53,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1516.44812 ± 555.402
2025-09-14 17:58:53,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1034.013), np.float32(1149.5779), np.float32(1381.8823), np.float32(1248.3756), np.float32(1238.1174), np.float32(1910.2009), np.float32(1280.4199), np.float32(1260.429), np.float32(1642.4288), np.float32(3019.0374)]
2025-09-14 17:58:53,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:58:53,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 52 minutes, 24 seconds)
2025-09-14 18:01:56,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:02:07,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1418.37354 ± 262.190
2025-09-14 18:02:07,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1692.8868), np.float32(1373.0775), np.float32(1520.5415), np.float32(1041.5104), np.float32(1782.6145), np.float32(1450.7833), np.float32(1731.999), np.float32(1423.763), np.float32(1104.1781), np.float32(1062.3822)]
2025-09-14 18:02:07,650 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:02:07,657 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 49 minutes, 46 seconds)
2025-09-14 18:05:25,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:05:36,665 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1719.52832 ± 641.996
2025-09-14 18:05:36,665 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1864.481), np.float32(2349.6765), np.float32(2559.841), np.float32(1058.4547), np.float32(1533.3046), np.float32(1456.0258), np.float32(2918.797), np.float32(1264.0298), np.float32(970.27423), np.float32(1220.3986)]
2025-09-14 18:05:36,665 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:05:36,672 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 47 minutes, 51 seconds)
2025-09-14 18:08:49,386 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:08:59,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1686.89062 ± 461.552
2025-09-14 18:08:59,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1229.055), np.float32(2001.455), np.float32(2197.8618), np.float32(1163.455), np.float32(2455.574), np.float32(1615.6709), np.float32(2160.9675), np.float32(1241.6772), np.float32(1630.6849), np.float32(1172.5056)]
2025-09-14 18:08:59,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:08:59,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 45 minutes, 27 seconds)
2025-09-14 18:11:59,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:12:09,356 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1581.45227 ± 422.732
2025-09-14 18:12:09,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2004.1891), np.float32(1860.6903), np.float32(1768.8447), np.float32(1257.266), np.float32(1799.0662), np.float32(1547.6713), np.float32(1307.6444), np.float32(975.9731), np.float32(978.91754), np.float32(2314.2598)]
2025-09-14 18:12:09,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:12:09,364 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 42 minutes, 31 seconds)
2025-09-14 18:15:04,819 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:15:13,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1849.04358 ± 748.297
2025-09-14 18:15:13,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3221.8423), np.float32(2679.902), np.float32(1798.4231), np.float32(1149.6305), np.float32(2933.6416), np.float32(1390.8934), np.float32(1503.9427), np.float32(1206.9865), np.float32(1397.511), np.float32(1207.6622)]
2025-09-14 18:15:13,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:15:13,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (1849.04) for latency 18
2025-09-14 18:15:13,338 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 39 minutes, 12 seconds)
2025-09-14 18:18:07,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:18:17,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1620.84338 ± 647.404
2025-09-14 18:18:17,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1266.1857), np.float32(962.4446), np.float32(1491.5886), np.float32(1188.1411), np.float32(1998.7202), np.float32(1237.4642), np.float32(3188.8013), np.float32(2112.441), np.float32(979.1781), np.float32(1783.4686)]
2025-09-14 18:18:17,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:18:17,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 35 minutes, 32 seconds)
2025-09-14 18:21:11,830 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:21:21,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1594.85522 ± 360.339
2025-09-14 18:21:21,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(995.9872), np.float32(1686.6212), np.float32(2039.5374), np.float32(1919.0912), np.float32(1416.0585), np.float32(2088.3362), np.float32(1778.9537), np.float32(1137.5876), np.float32(1264.2689), np.float32(1622.1104)]
2025-09-14 18:21:21,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:21:21,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 31 minutes, 30 seconds)
2025-09-14 18:24:17,317 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:24:27,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1247.70386 ± 473.044
2025-09-14 18:24:27,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2042.4966), np.float32(1962.031), np.float32(1351.3517), np.float32(1110.509), np.float32(1246.7579), np.float32(1072.4758), np.float32(357.31323), np.float32(951.79016), np.float32(918.468), np.float32(1463.8434)]
2025-09-14 18:24:27,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:24:27,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 27 minutes, 50 seconds)
2025-09-14 18:27:22,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:27:32,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1460.83154 ± 466.649
2025-09-14 18:27:32,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1272.5802), np.float32(2517.5369), np.float32(1045.5127), np.float32(934.78125), np.float32(1308.2195), np.float32(1357.7828), np.float32(2124.8713), np.float32(1242.8254), np.float32(1555.4789), np.float32(1248.7263)]
2025-09-14 18:27:32,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:27:32,785 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 24 minutes, 37 seconds)
2025-09-14 18:30:45,589 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:30:56,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1391.47083 ± 189.583
2025-09-14 18:30:56,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1579.7916), np.float32(1318.471), np.float32(1171.8473), np.float32(1494.7291), np.float32(1798.2784), np.float32(1477.1229), np.float32(1272.8171), np.float32(1399.9432), np.float32(1194.6571), np.float32(1207.0507)]
2025-09-14 18:30:56,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:30:56,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 22 minutes)
2025-09-14 18:34:09,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:34:20,241 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1705.35376 ± 583.658
2025-09-14 18:34:20,241 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2214.6562), np.float32(1919.6722), np.float32(2199.226), np.float32(1048.9055), np.float32(1173.4263), np.float32(1361.0452), np.float32(1602.1609), np.float32(1686.455), np.float32(2903.3567), np.float32(944.6333)]
2025-09-14 18:34:20,241 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:34:20,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 19 minutes, 15 seconds)
2025-09-14 18:37:19,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:37:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 2122.89429 ± 842.335
2025-09-14 18:37:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2227.805), np.float32(1799.897), np.float32(3012.0754), np.float32(376.43423), np.float32(3230.5564), np.float32(1016.60565), np.float32(2003.4375), np.float32(2191.8699), np.float32(2548.8572), np.float32(2821.4045)]
2025-09-14 18:37:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:37:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1226 [INFO]: New best (2122.89) for latency 18
2025-09-14 18:37:29,080 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 16 minutes, 7 seconds)
2025-09-14 18:40:22,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:40:30,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1602.50745 ± 516.237
2025-09-14 18:40:30,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1003.32324), np.float32(1160.1945), np.float32(1191.6608), np.float32(2437.0496), np.float32(1185.6743), np.float32(1494.5586), np.float32(2609.4028), np.float32(1488.1068), np.float32(1754.8419), np.float32(1700.2614)]
2025-09-14 18:40:30,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:40:30,589 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 50 seconds)
2025-09-14 18:43:19,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:43:28,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 2043.37573 ± 553.838
2025-09-14 18:43:28,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2240.5405), np.float32(2128.8284), np.float32(3358.2253), np.float32(2269.546), np.float32(1667.027), np.float32(1961.9656), np.float32(1276.4352), np.float32(2306.253), np.float32(1433.1913), np.float32(1791.7452)]
2025-09-14 18:43:28,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:43:28,781 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 33 seconds)
2025-09-14 18:46:17,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:46:26,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1777.48242 ± 680.917
2025-09-14 18:46:26,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1055.2594), np.float32(1586.2358), np.float32(2145.755), np.float32(3003.7322), np.float32(1705.1355), np.float32(1178.9768), np.float32(3014.4255), np.float32(1460.7372), np.float32(1233.6493), np.float32(1390.9167)]
2025-09-14 18:46:26,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:46:26,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 11 seconds)
2025-09-14 18:49:10,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:49:19,173 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1745.54297 ± 477.083
2025-09-14 18:49:19,174 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1224.464), np.float32(1973.1437), np.float32(1292.8866), np.float32(2467.1855), np.float32(1494.5615), np.float32(1489.5707), np.float32(2590.1135), np.float32(1168.6747), np.float32(1783.2766), np.float32(1971.5543)]
2025-09-14 18:49:19,174 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:49:19,180 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 59 seconds)
2025-09-14 18:52:03,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:52:12,326 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1221 [DEBUG]: Total Reward: 1847.97302 ± 409.266
2025-09-14 18:52:12,326 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1715.9336), np.float32(1553.2755), np.float32(1993.5609), np.float32(1745.2848), np.float32(2208.7593), np.float32(1869.3236), np.float32(2034.5121), np.float32(1334.3907), np.float32(2738.8916), np.float32(1285.7968)]
2025-09-14 18:52:12,327 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:52:12,333 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille150-halfcheetah):1251 [DEBUG]: Training session finished
