2025-09-10 21:47:14,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval/halfcheetah/bpql-noise_0.200-delay_3
2025-09-10 21:47:14,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval/halfcheetah/bpql-noise_0.200-delay_3
2025-09-10 21:47:14,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'3': <latency_env.delayed_mdp.ConstantDelay object at 0x7e1df535b830>}
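The `ConstantDelay` object in the eval-latency map is project-specific; a minimal sketch of what a constant-latency buffer with delay 3 might look like (class interface and `step` method are assumptions, not the actual `latency_env.delayed_mdp` implementation):

```python
from collections import deque

class ConstantDelay:
    """Hypothetical constant-latency buffer: an item pushed at step t
    becomes visible exactly `delay` steps later (None until then)."""

    def __init__(self, delay: int):
        self.delay = delay
        self._buf = deque(maxlen=delay + 1)

    def step(self, item):
        self._buf.append(item)
        # The oldest buffered item surfaces once `delay` newer pushes exist.
        if len(self._buf) > self.delay:
            return self._buf[0]
        return None
```

With delay 3, pushing 0, 1, 2, 3, 4 yields None, None, None, 0, 1.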
2025-09-10 21:47:14,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-10 21:47:14,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-10 21:47:14,403 baseline-bpql-noisepromille200-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=35, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
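The policy dump above maps onto standard `torch.nn` layers. The input width 35 is consistent with HalfCheetah's 17-dim observation plus a 3 x 6 buffer of pending actions (delay 3, action dim 6), though that split is an inference. `NNTanhRefit` is custom; one plausible reading of `scale: 2, shift: -1` is a rescale of tanh's (-1, 1) output via `(tanh(x) + 1) / 2 * scale + shift`, which lands exactly on HalfCheetah's [-1, 1] action bounds. A sketch under those assumptions:

```python
import torch
import torch.nn as nn

class NNTanhRefit(nn.Module):
    """Hypothetical stand-in for the custom tanh-refit layer: squash to
    (-1, 1), then rescale to (shift, shift + scale). With scale=2 and
    shift=-1 the output range is again (-1, 1)."""

    def __init__(self, scale, shift):
        super().__init__()
        self.register_buffer("scale", torch.as_tensor(scale))
        self.register_buffer("shift", torch.as_tensor(shift))

    def forward(self, x):
        return (torch.tanh(x) + 1.0) / 2.0 * self.scale + self.shift

# Mirrors the dumped architecture: 35 -> 256 -> 256 trunk, two 6-dim heads.
common_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(35, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
mu_head = nn.Linear(256, 6)
log_std_head = nn.Linear(256, 6)
```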
2025-09-10 21:47:14,403 baseline-bpql-noisepromille200-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
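The critic dump likewise uses standard layers plus two custom wrappers. `NNLayerConcat2` evidently flattens two inputs and concatenates them along `dim: -1`; `in_features=23` is consistent with a 17-dim raw HalfCheetah state plus a 6-dim action, i.e. the critic scoring the undelayed state as in BPQL, though again that split is inferred. A functional sketch with the wrapper semantics assumed:

```python
import torch
import torch.nn as nn

class ConcatQ(nn.Module):
    """Assumed behaviour of NNLayerConcat2 + trunk: flatten both inputs,
    concatenate on the last dim, run the MLP, and squeeze the value dim
    (the NNLayerSqueeze(dim: -1) step in the dump)."""

    def __init__(self, state_dim=17, action_dim=6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        x = torch.cat([state.flatten(1), action.flatten(1)], dim=-1)
        return self.trunk(x).squeeze(-1)
```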
2025-09-10 21:47:15,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-10 21:47:15,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-10 21:49:53,011 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 21:50:09,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: -315.95303 ± 19.152
2025-09-10 21:50:09,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-310.22318), np.float32(-285.4525), np.float32(-293.5242), np.float32(-324.3484), np.float32(-318.36765), np.float32(-301.47305), np.float32(-311.69016), np.float32(-338.66113), np.float32(-352.36963), np.float32(-323.42044)]
2025-09-10 21:50:09,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
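The "Total Reward: mean ± std" figures can be reproduced from the per-episode reward lists with a population (ddof=0) standard deviation; for the ten episodes above:

```python
import numpy as np

rewards = np.array([
    -310.22318, -285.4525, -293.5242, -324.3484, -318.36765,
    -301.47305, -311.69016, -338.66113, -352.36963, -323.42044,
], dtype=np.float32)

mean = rewards.mean()      # about -315.953, matching the logged value
std = rewards.std(ddof=0)  # about 19.152: population std matches the log
print(f"Total Reward: {mean:.5f} \u00b1 {std:.3f}")
```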
2025-09-10 21:50:09,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (-315.95) for latency 3
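The "New best" lines appear only when the mean eval reward for a latency exceeds the running maximum (compare iteration 8 below, where 1597.04 falls short of the earlier 1800.53 and no such line is emitted). A plausible reconstruction of that bookkeeping, with names hypothetical:

```python
best = {}  # latency -> best mean eval reward seen so far

def update_best(latency, mean_reward, log=print):
    """Hypothetical sketch of the best-checkpoint condition."""
    if latency not in best or mean_reward > best[latency]:
        best[latency] = mean_reward
        log(f"New best ({mean_reward:.2f}) for latency {latency}")
        return True  # the trainer would presumably checkpoint here
    return False
```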
2025-09-10 21:50:09,766 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 48 minutes, 5 seconds)
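The remaining-time estimates are consistent with simple extrapolation from mean iteration time: iteration 1 took about 174.605 s (21:47:15.161 to 21:50:09.766), and 99 remaining iterations times 174.605 s is 17285 s = 4 hours, 48 minutes, 5 seconds, exactly as logged. A sketch of such an estimator (function names are illustrative, not the actual trainer code):

```python
import time

def eta_seconds(start_time, iters_done, iters_total, now=None):
    """Estimate remaining wall time from the mean per-iteration duration."""
    now = time.monotonic() if now is None else now
    per_iter = (now - start_time) / iters_done
    return per_iter * (iters_total - iters_done)

def fmt_hms(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h} hours, {m} minutes, {s} seconds"
```

Feeding in the logged timings reproduces the logged estimate: `fmt_hms(eta_seconds(0.0, 1, 100, now=174.605))` gives "4 hours, 48 minutes, 5 seconds".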
2025-09-10 21:52:58,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 21:53:14,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: -201.75162 ± 40.041
2025-09-10 21:53:14,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-218.80399), np.float32(-120.40189), np.float32(-135.99696), np.float32(-211.02768), np.float32(-193.9009), np.float32(-207.69215), np.float32(-247.21133), np.float32(-214.17737), np.float32(-246.33742), np.float32(-221.96654)]
2025-09-10 21:53:14,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 21:53:14,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (-201.75) for latency 3
2025-09-10 21:53:14,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 53 minutes, 36 seconds)
2025-09-10 21:56:02,344 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 21:56:18,771 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 27.85437 ± 106.293
2025-09-10 21:56:18,771 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(61.716854), np.float32(217.38521), np.float32(-27.404762), np.float32(-22.03593), np.float32(18.077742), np.float32(-116.646126), np.float32(4.9173937), np.float32(178.13249), np.float32(87.35881), np.float32(-122.95794)]
2025-09-10 21:56:18,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 21:56:18,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (27.85) for latency 3
2025-09-10 21:56:18,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 52 minutes, 56 seconds)
2025-09-10 21:59:06,906 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 21:59:23,211 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 764.57733 ± 137.812
2025-09-10 21:59:23,211 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(872.194), np.float32(807.4462), np.float32(996.0706), np.float32(848.35254), np.float32(745.3445), np.float32(778.188), np.float32(705.1792), np.float32(821.9377), np.float32(493.13235), np.float32(577.92834)]
2025-09-10 21:59:23,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 21:59:23,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (764.58) for latency 3
2025-09-10 21:59:23,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 51 minutes, 13 seconds)
2025-09-10 22:02:10,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:02:27,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 1077.25366 ± 301.045
2025-09-10 22:02:27,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(481.81168), np.float32(1176.117), np.float32(993.281), np.float32(1275.6803), np.float32(1280.787), np.float32(1277.7128), np.float32(1339.0431), np.float32(523.7997), np.float32(1248.7705), np.float32(1175.5343)]
2025-09-10 22:02:27,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:02:27,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (1077.25) for latency 3
2025-09-10 22:02:27,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 48 minutes, 47 seconds)
2025-09-10 22:05:14,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:05:31,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 1525.09290 ± 171.461
2025-09-10 22:05:31,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1780.0522), np.float32(1435.0234), np.float32(1295.8876), np.float32(1610.803), np.float32(1745.8466), np.float32(1640.4354), np.float32(1300.2579), np.float32(1313.6667), np.float32(1547.1389), np.float32(1581.817)]
2025-09-10 22:05:31,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:05:31,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (1525.09) for latency 3
2025-09-10 22:05:31,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 48 minutes, 44 seconds)
2025-09-10 22:08:19,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:08:35,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 1800.52539 ± 221.712
2025-09-10 22:08:35,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1749.8252), np.float32(1906.2344), np.float32(1601.3036), np.float32(2062.7144), np.float32(1516.68), np.float32(1439.5137), np.float32(1902.3351), np.float32(2142.8396), np.float32(1959.2042), np.float32(1724.6038)]
2025-09-10 22:08:35,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:08:35,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (1800.53) for latency 3
2025-09-10 22:08:35,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 45 minutes, 28 seconds)
2025-09-10 22:11:23,902 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:11:40,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 1597.03735 ± 715.278
2025-09-10 22:11:40,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1857.227), np.float32(2274.904), np.float32(2174.3823), np.float32(1892.6547), np.float32(453.27313), np.float32(2027.3711), np.float32(1738.3832), np.float32(2336.7844), np.float32(881.6741), np.float32(333.7211)]
2025-09-10 22:11:40,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:11:40,268 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 42 minutes, 35 seconds)
2025-09-10 22:14:28,473 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:14:44,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 2024.88574 ± 546.260
2025-09-10 22:14:44,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2206.2988), np.float32(2365.0479), np.float32(1468.6168), np.float32(2404.2883), np.float32(2352.342), np.float32(2031.807), np.float32(2208.7139), np.float32(584.6679), np.float32(2307.9116), np.float32(2319.162)]
2025-09-10 22:14:44,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:14:44,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (2024.89) for latency 3
2025-09-10 22:14:44,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 39 minutes, 31 seconds)
2025-09-10 22:17:32,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:17:49,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 1625.27771 ± 913.145
2025-09-10 22:17:49,018 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2319.9583), np.float32(822.8354), np.float32(2557.7373), np.float32(906.7115), np.float32(421.76996), np.float32(2316.4304), np.float32(2558.8982), np.float32(1113.0625), np.float32(441.7242), np.float32(2793.6492)]
2025-09-10 22:17:49,018 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:17:49,019 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 36 minutes, 33 seconds)
2025-09-10 22:20:36,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:20:53,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 2153.80298 ± 960.899
2025-09-10 22:20:53,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1847.4565), np.float32(2768.1658), np.float32(2901.2642), np.float32(2763.3691), np.float32(-87.42471), np.float32(2219.1404), np.float32(2958.936), np.float32(882.6518), np.float32(2819.909), np.float32(2464.56)]
2025-09-10 22:20:53,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:20:53,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (2153.80) for latency 3
2025-09-10 22:20:53,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 33 minutes, 30 seconds)
2025-09-10 22:23:41,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:23:57,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 2405.80005 ± 870.020
2025-09-10 22:23:57,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3171.9473), np.float32(2741.071), np.float32(2894.546), np.float32(1724.028), np.float32(2898.802), np.float32(148.31403), np.float32(1969.6847), np.float32(2826.2034), np.float32(3041.7385), np.float32(2641.6663)]
2025-09-10 22:23:57,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:23:57,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (2405.80) for latency 3
2025-09-10 22:23:57,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 30 minutes, 24 seconds)
2025-09-10 22:26:45,821 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:27:02,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3241.17505 ± 145.483
2025-09-10 22:27:02,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3183.0918), np.float32(3350.5251), np.float32(3000.5815), np.float32(3258.314), np.float32(3168.572), np.float32(3109.2554), np.float32(3176.6394), np.float32(3339.4026), np.float32(3559.0156), np.float32(3266.352)]
2025-09-10 22:27:02,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:27:02,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3241.18) for latency 3
2025-09-10 22:27:02,121 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 27 minutes, 20 seconds)
2025-09-10 22:29:50,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:30:06,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3167.92627 ± 151.762
2025-09-10 22:30:06,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3246.513), np.float32(3407.5195), np.float32(3211.1404), np.float32(3152.0134), np.float32(2845.1057), np.float32(3066.7834), np.float32(3240.7302), np.float32(3153.7722), np.float32(3326.9663), np.float32(3028.7192)]
2025-09-10 22:30:06,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:30:06,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 24 minutes, 22 seconds)
2025-09-10 22:32:55,444 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:33:11,757 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3391.99927 ± 179.958
2025-09-10 22:33:11,757 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3309.0352), np.float32(3337.496), np.float32(3450.9321), np.float32(3500.844), np.float32(2957.0464), np.float32(3310.238), np.float32(3422.6638), np.float32(3505.4863), np.float32(3439.796), np.float32(3686.4539)]
2025-09-10 22:33:11,757 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:33:11,757 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3392.00) for latency 3
2025-09-10 22:33:11,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 21 minutes, 26 seconds)
2025-09-10 22:35:59,958 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:36:16,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3381.30273 ± 309.156
2025-09-10 22:36:16,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3529.9265), np.float32(3489.1582), np.float32(3556.553), np.float32(3401.7725), np.float32(3633.6655), np.float32(2488.9114), np.float32(3398.2395), np.float32(3452.0876), np.float32(3329.9514), np.float32(3532.761)]
2025-09-10 22:36:16,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:36:16,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 18 minutes, 26 seconds)
2025-09-10 22:39:04,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:39:20,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3472.55591 ± 132.381
2025-09-10 22:39:20,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3226.7615), np.float32(3680.17), np.float32(3433.4268), np.float32(3425.454), np.float32(3472.201), np.float32(3312.7266), np.float32(3574.2373), np.float32(3479.777), np.float32(3651.5303), np.float32(3469.273)]
2025-09-10 22:39:20,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:39:20,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3472.56) for latency 3
2025-09-10 22:39:20,524 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 15 minutes, 23 seconds)
2025-09-10 22:42:08,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:42:25,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3720.31445 ± 121.311
2025-09-10 22:42:25,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3688.9006), np.float32(3665.0127), np.float32(3631.5107), np.float32(3499.8704), np.float32(3838.7468), np.float32(3848.133), np.float32(3646.692), np.float32(3711.192), np.float32(3945.516), np.float32(3727.572)]
2025-09-10 22:42:25,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:42:25,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3720.31) for latency 3
2025-09-10 22:42:25,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 12 minutes, 19 seconds)
2025-09-10 22:45:13,577 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:45:29,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 2637.67310 ± 957.977
2025-09-10 22:45:29,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2634.5564), np.float32(3465.1804), np.float32(2065.5679), np.float32(2054.9197), np.float32(3648.0789), np.float32(2797.226), np.float32(3134.3308), np.float32(3575.618), np.float32(262.2857), np.float32(2738.9666)]
2025-09-10 22:45:29,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:45:29,922 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 9 minutes, 12 seconds)
2025-09-10 22:48:18,667 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:48:34,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3495.27612 ± 555.360
2025-09-10 22:48:34,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2638.4976), np.float32(3542.268), np.float32(3942.7383), np.float32(3918.9558), np.float32(3539.4656), np.float32(2244.74), np.float32(3884.3774), np.float32(3546.768), np.float32(3859.9507), np.float32(3834.9998)]
2025-09-10 22:48:34,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:48:34,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 6 minutes, 11 seconds)
2025-09-10 22:51:23,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:51:39,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3794.40234 ± 84.895
2025-09-10 22:51:39,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3688.3708), np.float32(3962.7976), np.float32(3808.7698), np.float32(3744.256), np.float32(3858.2595), np.float32(3706.4897), np.float32(3708.7852), np.float32(3838.7942), np.float32(3750.924), np.float32(3876.576)]
2025-09-10 22:51:39,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:51:39,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3794.40) for latency 3
2025-09-10 22:51:39,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 3 minutes, 5 seconds)
2025-09-10 22:54:27,432 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:54:43,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3520.60034 ± 834.974
2025-09-10 22:54:43,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3474.14), np.float32(4089.3545), np.float32(3752.2388), np.float32(3677.7566), np.float32(3937.1067), np.float32(3679.3652), np.float32(1069.3873), np.float32(3687.2256), np.float32(4004.4), np.float32(3835.0308)]
2025-09-10 22:54:43,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:54:43,730 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 2 seconds)
2025-09-10 22:57:31,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 22:57:48,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3814.72656 ± 344.591
2025-09-10 22:57:48,256 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4007.541), np.float32(3718.2402), np.float32(4143.5454), np.float32(4082.7542), np.float32(4011.6643), np.float32(3812.5784), np.float32(3725.043), np.float32(2874.143), np.float32(3787.5686), np.float32(3984.1873)]
2025-09-10 22:57:48,256 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 22:57:48,256 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3814.73) for latency 3
2025-09-10 22:57:48,257 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 3 hours, 56 minutes, 54 seconds)
2025-09-10 23:00:36,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:00:52,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3781.03052 ± 87.613
2025-09-10 23:00:52,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3863.4368), np.float32(3795.4705), np.float32(3871.8997), np.float32(3665.047), np.float32(3913.347), np.float32(3796.488), np.float32(3770.238), np.float32(3744.1567), np.float32(3778.9685), np.float32(3611.2522)]
2025-09-10 23:00:52,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:00:52,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 3 hours, 53 minutes, 50 seconds)
2025-09-10 23:03:40,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:03:57,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3788.47070 ± 665.383
2025-09-10 23:03:57,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4089.2356), np.float32(2208.7397), np.float32(3871.6047), np.float32(2837.6677), np.float32(4288.678), np.float32(4199.0254), np.float32(3849.2473), np.float32(4329.8926), np.float32(4021.5388), np.float32(4189.077)]
2025-09-10 23:03:57,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:03:57,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 3 hours, 50 minutes, 34 seconds)
2025-09-10 23:06:45,513 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:07:01,855 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3925.64209 ± 121.737
2025-09-10 23:07:01,856 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3846.7117), np.float32(3878.9343), np.float32(4214.6274), np.float32(3809.042), np.float32(3766.2795), np.float32(4001.6658), np.float32(3982.114), np.float32(3987.9827), np.float32(3907.4275), np.float32(3861.6372)]
2025-09-10 23:07:01,856 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:07:01,856 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3925.64) for latency 3
2025-09-10 23:07:01,858 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 3 hours, 47 minutes, 32 seconds)
2025-09-10 23:09:50,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:10:06,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3937.32031 ± 146.965
2025-09-10 23:10:06,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4007.0325), np.float32(4042.5322), np.float32(3856.6128), np.float32(3760.3203), np.float32(3788.6655), np.float32(3916.8748), np.float32(4033.6682), np.float32(4260.8545), np.float32(3929.5747), np.float32(3777.066)]
2025-09-10 23:10:06,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:10:06,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (3937.32) for latency 3
2025-09-10 23:10:06,386 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 3 hours, 44 minutes, 30 seconds)
2025-09-10 23:12:54,951 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:13:11,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4071.65161 ± 145.970
2025-09-10 23:13:11,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3905.408), np.float32(3876.903), np.float32(4263.562), np.float32(4085.9375), np.float32(4007.8257), np.float32(4008.444), np.float32(4159.261), np.float32(3916.19), np.float32(4315.1396), np.float32(4177.8447)]
2025-09-10 23:13:11,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:13:11,215 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4071.65) for latency 3
2025-09-10 23:13:11,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 3 hours, 41 minutes, 30 seconds)
2025-09-10 23:15:59,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:16:15,684 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3974.42236 ± 89.344
2025-09-10 23:16:15,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3798.802), np.float32(4046.7075), np.float32(3956.6848), np.float32(4018.779), np.float32(4058.8801), np.float32(4015.4956), np.float32(3828.4211), np.float32(4077.0605), np.float32(3991.537), np.float32(3951.8557)]
2025-09-10 23:16:15,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:16:15,687 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 3 hours, 38 minutes, 22 seconds)
2025-09-10 23:19:03,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:19:20,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3684.60669 ± 935.809
2025-09-10 23:19:20,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3887.5925), np.float32(4017.1375), np.float32(4157.1416), np.float32(3976.8882), np.float32(4271.7666), np.float32(1509.5828), np.float32(2223.3752), np.float32(4440.4087), np.float32(4076.2324), np.float32(4285.9395)]
2025-09-10 23:19:20,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:19:20,116 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 3 hours, 35 minutes, 19 seconds)
2025-09-10 23:22:08,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:22:24,492 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4052.11597 ± 479.536
2025-09-10 23:22:24,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4367.1147), np.float32(4282.565), np.float32(4311.597), np.float32(4280.857), np.float32(3925.9004), np.float32(4170.038), np.float32(2662.7063), np.float32(4281.6357), np.float32(4171.995), np.float32(4066.7485)]
2025-09-10 23:22:24,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:22:24,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 32 minutes, 12 seconds)
2025-09-10 23:25:12,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:25:29,062 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4152.09863 ± 201.584
2025-09-10 23:25:29,062 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3887.858), np.float32(4430.264), np.float32(4220.545), np.float32(3880.8997), np.float32(4041.7456), np.float32(4285.874), np.float32(4182.259), np.float32(3931.548), np.float32(4192.139), np.float32(4467.85)]
2025-09-10 23:25:29,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:25:29,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4152.10) for latency 3
2025-09-10 23:25:29,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 29 minutes, 8 seconds)
2025-09-10 23:28:17,340 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:28:33,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4300.90332 ± 137.400
2025-09-10 23:28:33,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4332.6367), np.float32(4619.7554), np.float32(4155.813), np.float32(4118.217), np.float32(4301.398), np.float32(4319.182), np.float32(4229.768), np.float32(4182.4585), np.float32(4400.5273), np.float32(4349.2803)]
2025-09-10 23:28:33,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:28:33,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4300.90) for latency 3
2025-09-10 23:28:33,687 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 26 minutes, 1 second)
2025-09-10 23:31:22,172 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:31:38,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4152.99121 ± 263.197
2025-09-10 23:31:38,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4141.3335), np.float32(3925.9258), np.float32(4155.374), np.float32(4378.8135), np.float32(4327.249), np.float32(4437.7837), np.float32(4203.857), np.float32(4215.4243), np.float32(4268.4604), np.float32(3475.6885)]
2025-09-10 23:31:38,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:31:38,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 23 minutes)
2025-09-10 23:34:26,581 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:34:42,984 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4108.53223 ± 191.968
2025-09-10 23:34:42,984 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4335.779), np.float32(3907.226), np.float32(4067.2644), np.float32(4396.396), np.float32(3786.296), np.float32(4235.74), np.float32(3882.338), np.float32(4088.161), np.float32(4138.771), np.float32(4247.3516)]
2025-09-10 23:34:42,984 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:34:42,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 19 minutes, 57 seconds)
2025-09-10 23:37:31,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:37:47,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4163.11426 ± 611.830
2025-09-10 23:37:47,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4480.341), np.float32(4512.3115), np.float32(4357.397), np.float32(4417.679), np.float32(4227.7437), np.float32(4327.3384), np.float32(2346.4646), np.float32(4357.054), np.float32(4374.716), np.float32(4230.1025)]
2025-09-10 23:37:47,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:37:47,550 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 16 minutes, 55 seconds)
2025-09-10 23:40:35,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:40:52,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4166.37549 ± 165.243
2025-09-10 23:40:52,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4110.1553), np.float32(3978.1692), np.float32(3882.1084), np.float32(4379.5156), np.float32(4312.074), np.float32(4194.14), np.float32(4109.737), np.float32(4017.48), np.float32(4354.9414), np.float32(4325.434)]
2025-09-10 23:40:52,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:40:52,242 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 13 minutes, 52 seconds)
2025-09-10 23:43:40,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:43:56,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4315.56934 ± 171.582
2025-09-10 23:43:56,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4206.5186), np.float32(4284.615), np.float32(4533.109), np.float32(3967.5725), np.float32(4216.2725), np.float32(4253.342), np.float32(4424.1597), np.float32(4293.203), np.float32(4605.3467), np.float32(4371.5527)]
2025-09-10 23:43:56,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:43:56,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4315.57) for latency 3
2025-09-10 23:43:56,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 10 minutes, 43 seconds)
2025-09-10 23:46:44,241 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:47:00,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4365.13721 ± 117.491
2025-09-10 23:47:00,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4453.7305), np.float32(4164.475), np.float32(4168.4536), np.float32(4354.6606), np.float32(4481.9253), np.float32(4290.487), np.float32(4400.3975), np.float32(4364.9116), np.float32(4459.887), np.float32(4512.447)]
2025-09-10 23:47:00,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:47:00,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4365.14) for latency 3
2025-09-10 23:47:00,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 7 minutes, 28 seconds)
2025-09-10 23:49:49,174 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:50:05,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4343.09863 ± 109.138
2025-09-10 23:50:05,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4374.118), np.float32(4320.802), np.float32(4416.5396), np.float32(4472.5522), np.float32(4342.014), np.float32(4341.638), np.float32(4072.3572), np.float32(4257.2114), np.float32(4372.531), np.float32(4461.222)]
2025-09-10 23:50:05,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:50:05,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 4 minutes, 30 seconds)
2025-09-10 23:52:53,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:53:09,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4569.71924 ± 93.178
2025-09-10 23:53:09,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4580.7725), np.float32(4474.4385), np.float32(4518.576), np.float32(4609.6465), np.float32(4491.658), np.float32(4458.886), np.float32(4764.7876), np.float32(4552.222), np.float32(4697.184), np.float32(4549.0205)]
2025-09-10 23:53:09,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:53:09,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4569.72) for latency 3
2025-09-10 23:53:09,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 1 minute, 23 seconds)
2025-09-10 23:55:58,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:56:14,481 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4293.45605 ± 259.891
2025-09-10 23:56:14,481 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4241.4805), np.float32(4393.331), np.float32(4301.8584), np.float32(4379.91), np.float32(4315.358), np.float32(4234.676), np.float32(4592.074), np.float32(3581.2856), np.float32(4508.885), np.float32(4385.702)]
2025-09-10 23:56:14,481 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:56:14,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 58 minutes, 18 seconds)
2025-09-10 23:59:02,921 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-10 23:59:19,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4392.55859 ± 377.347
2025-09-10 23:59:19,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4377.9233), np.float32(4276.647), np.float32(4632.058), np.float32(4546.9346), np.float32(4423.0283), np.float32(4472.2383), np.float32(4609.09), np.float32(3320.889), np.float32(4561.18), np.float32(4705.5957)]
2025-09-10 23:59:19,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-10 23:59:19,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 55 minutes, 18 seconds)
2025-09-11 00:02:07,697 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:02:23,981 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 3799.78271 ± 1271.020
2025-09-11 00:02:23,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1088.6559), np.float32(1471.9624), np.float32(4543.627), np.float32(4346.4194), np.float32(4564.3433), np.float32(4673.114), np.float32(4294.9204), np.float32(4151.42), np.float32(4333.7314), np.float32(4529.6357)]
2025-09-11 00:02:23,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:02:23,984 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 2 hours, 52 minutes, 23 seconds)
2025-09-11 00:05:12,303 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:05:28,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4385.77441 ± 24.501
2025-09-11 00:05:28,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4397.617), np.float32(4406.104), np.float32(4388.2993), np.float32(4343.31), np.float32(4398.195), np.float32(4387.9956), np.float32(4406.244), np.float32(4418.7285), np.float32(4364.498), np.float32(4346.7495)]
2025-09-11 00:05:28,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:05:28,640 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 49 minutes, 14 seconds)
2025-09-11 00:08:16,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:08:33,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4222.58057 ± 767.875
2025-09-11 00:08:33,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4623.538), np.float32(4594.584), np.float32(4589.384), np.float32(4208.524), np.float32(4292.116), np.float32(4387.5864), np.float32(4529.623), np.float32(2000.3568), np.float32(4150.8833), np.float32(4849.209)]
2025-09-11 00:08:33,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:08:33,060 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 46 minutes, 9 seconds)
2025-09-11 00:11:21,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:11:37,508 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4068.37964 ± 1123.125
2025-09-11 00:11:37,508 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4521.187), np.float32(4742.53), np.float32(4563.199), np.float32(4239.2407), np.float32(4544.697), np.float32(4120.275), np.float32(4494.4688), np.float32(4331.7354), np.float32(737.1524), np.float32(4389.313)]
2025-09-11 00:11:37,508 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:11:37,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 43 minutes, 4 seconds)
2025-09-11 00:14:25,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:14:42,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4557.90967 ± 99.920
2025-09-11 00:14:42,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4515.312), np.float32(4511.05), np.float32(4606.201), np.float32(4761.9604), np.float32(4447.5576), np.float32(4583.2173), np.float32(4454.9897), np.float32(4656.945), np.float32(4431.751), np.float32(4610.1147)]
2025-09-11 00:14:42,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:14:42,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 39 minutes, 59 seconds)
2025-09-11 00:17:30,750 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:17:46,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4503.62402 ± 86.515
2025-09-11 00:17:46,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4608.92), np.float32(4509.147), np.float32(4571.61), np.float32(4442.2817), np.float32(4509.219), np.float32(4430.595), np.float32(4406.8545), np.float32(4641.236), np.float32(4551.8657), np.float32(4364.5107)]
2025-09-11 00:17:46,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:17:47,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 36 minutes, 54 seconds)
2025-09-11 00:20:35,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:20:51,361 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4549.14160 ± 119.766
2025-09-11 00:20:51,362 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4603.3457), np.float32(4839.6274), np.float32(4404.2925), np.float32(4543.0415), np.float32(4518.547), np.float32(4526.885), np.float32(4496.612), np.float32(4571.328), np.float32(4604.108), np.float32(4383.6245)]
2025-09-11 00:20:51,362 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:20:51,365 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 33 minutes, 47 seconds)
2025-09-11 00:23:39,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:23:56,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4229.50635 ± 812.118
2025-09-11 00:23:56,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4301.7485), np.float32(4605.361), np.float32(4520.4443), np.float32(4593.543), np.float32(4554.2495), np.float32(4480.005), np.float32(1813.5774), np.float32(4640.76), np.float32(4439.585), np.float32(4345.787)]
2025-09-11 00:23:56,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:23:56,075 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 30 minutes, 45 seconds)
2025-09-11 00:26:44,086 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:27:00,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4394.47363 ± 710.710
2025-09-11 00:27:00,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4700.4434), np.float32(4369.8047), np.float32(4746.654), np.float32(2289.9788), np.float32(4611.742), np.float32(4690.4253), np.float32(4768.894), np.float32(4631.683), np.float32(4487.659), np.float32(4647.4507)]
2025-09-11 00:27:00,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:27:00,361 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 27 minutes, 39 seconds)
2025-09-11 00:29:48,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:30:04,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4329.04004 ± 1088.821
2025-09-11 00:30:04,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4748.746), np.float32(4848.758), np.float32(1105.1687), np.float32(4696.079), np.float32(4479.86), np.float32(5055.394), np.float32(4444.897), np.float32(4622.7427), np.float32(4771.042), np.float32(4517.7065)]
2025-09-11 00:30:04,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:30:04,888 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 24 minutes, 32 seconds)
2025-09-11 00:32:52,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:33:09,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4575.47949 ± 100.700
2025-09-11 00:33:09,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4533.63), np.float32(4505.774), np.float32(4555.7715), np.float32(4638.713), np.float32(4460.6636), np.float32(4777.522), np.float32(4475.7266), np.float32(4725.2656), np.float32(4569.333), np.float32(4512.396)]
2025-09-11 00:33:09,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:33:09,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4575.48) for latency 3
2025-09-11 00:33:09,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 21 minutes, 24 seconds)
2025-09-11 00:35:58,611 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:36:14,860 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4605.87988 ± 120.919
2025-09-11 00:36:14,860 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4760.3696), np.float32(4702.351), np.float32(4572.876), np.float32(4561.842), np.float32(4565.5933), np.float32(4527.9644), np.float32(4627.121), np.float32(4325.7407), np.float32(4744.5347), np.float32(4670.409)]
2025-09-11 00:36:14,861 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:36:14,861 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4605.88) for latency 3
2025-09-11 00:36:14,864 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 18 minutes, 31 seconds)
2025-09-11 00:39:03,514 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:39:19,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4103.22559 ± 1041.696
2025-09-11 00:39:19,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4611.1577), np.float32(4794.801), np.float32(4605.6055), np.float32(4589.602), np.float32(4671.5786), np.float32(1595.3569), np.float32(2592.7239), np.float32(4605.334), np.float32(4792.4043), np.float32(4173.69)]
2025-09-11 00:39:19,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:39:19,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 15 minutes, 29 seconds)
2025-09-11 00:42:08,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:42:24,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4450.26904 ± 456.557
2025-09-11 00:42:24,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4615.524), np.float32(4717.9165), np.float32(4622.326), np.float32(4614.6777), np.float32(3163.3289), np.float32(4632.7417), np.float32(4887.1777), np.float32(4371.946), np.float32(4583.5303), np.float32(4293.5205)]
2025-09-11 00:42:24,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:42:24,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 12 minutes, 27 seconds)
2025-09-11 00:45:12,907 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:45:29,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4776.81543 ± 236.639
2025-09-11 00:45:29,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4571.5317), np.float32(4825.9023), np.float32(4570.445), np.float32(4820.451), np.float32(4969.551), np.float32(5071.9126), np.float32(4922.183), np.float32(4419.8115), np.float32(4484.168), np.float32(5112.1943)]
2025-09-11 00:45:29,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:45:29,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4776.82) for latency 3
2025-09-11 00:45:29,251 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 9 minutes, 24 seconds)
2025-09-11 00:48:17,526 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:48:33,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4672.20703 ± 125.423
2025-09-11 00:48:33,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4700.482), np.float32(4484.2554), np.float32(4635.769), np.float32(4937.6104), np.float32(4688.365), np.float32(4759.1494), np.float32(4539.4907), np.float32(4658.8325), np.float32(4550.7563), np.float32(4767.361)]
2025-09-11 00:48:33,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:48:33,821 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 6 minutes, 21 seconds)
2025-09-11 00:51:22,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:51:38,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4634.05078 ± 134.217
2025-09-11 00:51:38,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4624.0317), np.float32(4806.507), np.float32(4572.0957), np.float32(4791.6465), np.float32(4711.1904), np.float32(4578.9814), np.float32(4693.003), np.float32(4513.11), np.float32(4713.041), np.float32(4336.898)]
2025-09-11 00:51:38,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:51:38,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 3 minutes, 7 seconds)
2025-09-11 00:54:26,847 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:54:43,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4618.77441 ± 251.109
2025-09-11 00:54:43,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4297.417), np.float32(4233.021), np.float32(4645.2256), np.float32(4489.7637), np.float32(4576.0054), np.float32(5122.8857), np.float32(4512.549), np.float32(4794.724), np.float32(4648.2393), np.float32(4867.913)]
2025-09-11 00:54:43,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:54:43,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 1 second)
2025-09-11 00:57:31,368 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 00:57:47,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4729.76074 ± 93.033
2025-09-11 00:57:47,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4818.222), np.float32(4581.6104), np.float32(4710.4585), np.float32(4689.4497), np.float32(4824.0957), np.float32(4831.419), np.float32(4633.61), np.float32(4723.326), np.float32(4856.561), np.float32(4628.8555)]
2025-09-11 00:57:47,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 00:57:47,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 56 minutes, 56 seconds)
2025-09-11 01:00:36,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:00:52,593 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4570.43018 ± 177.615
2025-09-11 01:00:52,593 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4415.824), np.float32(4449.3955), np.float32(4504.107), np.float32(4907.9824), np.float32(4400.745), np.float32(4626.7812), np.float32(4335.4346), np.float32(4593.431), np.float32(4826.116), np.float32(4644.4844)]
2025-09-11 01:00:52,593 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:00:52,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 53 minutes, 52 seconds)
2025-09-11 01:03:40,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:03:57,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4537.08350 ± 502.210
2025-09-11 01:03:57,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3062.7888), np.float32(4808.3877), np.float32(4662.033), np.float32(4667.158), np.float32(4505.8643), np.float32(4677.8013), np.float32(4659.1016), np.float32(4926.4595), np.float32(4729.804), np.float32(4671.439)]
2025-09-11 01:03:57,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:03:57,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 50 minutes, 48 seconds)
2025-09-11 01:06:45,869 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:07:02,195 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4687.51855 ± 198.465
2025-09-11 01:07:02,195 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4916.822), np.float32(4514.0767), np.float32(4654.7065), np.float32(4578.4697), np.float32(4876.8257), np.float32(4916.249), np.float32(4947.868), np.float32(4386.5312), np.float32(4609.8906), np.float32(4473.751)]
2025-09-11 01:07:02,195 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:07:02,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 47 minutes, 47 seconds)
2025-09-11 01:09:50,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:10:06,841 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4738.32324 ± 127.095
2025-09-11 01:10:06,841 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4660.1504), np.float32(4836.182), np.float32(4879.8247), np.float32(4808.3696), np.float32(4603.0913), np.float32(4619.697), np.float32(4538.282), np.float32(4890.785), np.float32(4880.264), np.float32(4666.5806)]
2025-09-11 01:10:06,841 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:10:06,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 44 minutes, 41 seconds)
2025-09-11 01:12:54,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:13:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4649.40527 ± 155.625
2025-09-11 01:13:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4894.587), np.float32(4886.5947), np.float32(4542.161), np.float32(4434.6997), np.float32(4710.696), np.float32(4553.0044), np.float32(4651.214), np.float32(4446.3037), np.float32(4613.136), np.float32(4761.654)]
2025-09-11 01:13:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:13:11,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 41 minutes, 35 seconds)
2025-09-11 01:15:59,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:16:15,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4510.36816 ± 653.183
2025-09-11 01:16:15,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4994.741), np.float32(4602.0024), np.float32(4747.702), np.float32(4725.81), np.float32(4402.3306), np.float32(5189.7427), np.float32(4461.638), np.float32(4756.9907), np.float32(2671.1797), np.float32(4551.54)]
2025-09-11 01:16:15,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:16:15,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 38 minutes, 26 seconds)
2025-09-11 01:19:03,702 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:19:20,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4747.02588 ± 63.780
2025-09-11 01:19:20,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4813.531), np.float32(4690.0493), np.float32(4777.842), np.float32(4846.0674), np.float32(4810.987), np.float32(4674.7646), np.float32(4751.4785), np.float32(4701.0957), np.float32(4643.9395), np.float32(4760.51)]
2025-09-11 01:19:20,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
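The "Total Reward" summary lines above appear to be the mean and the population standard deviation of the ten per-episode rewards listed on the following line. A minimal sketch reproducing that computation from the iteration-69 rewards (all names here are illustrative, not from the codebase; small differences versus the logged 4747.02588 ± 63.780 come from float32 accumulation):

```python
import statistics

# Per-episode rewards from the iteration-69 evaluation logged above
rewards = [4813.531, 4690.0493, 4777.842, 4846.0674, 4810.987,
           4674.7646, 4751.4785, 4701.0957, 4643.9395, 4760.51]

mean = statistics.fmean(rewards)
# pstdev is the population standard deviation (ddof=0), which matches
# the logged spread of ±63.780; stdev (ddof=1) would give ~67.2 instead.
std = statistics.pstdev(rewards)

print(f"Total Reward: {mean:.5f} ± {std:.3f}")
```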
2025-09-11 01:19:20,016 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 35 minutes, 21 seconds)
2025-09-11 01:22:08,034 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:22:24,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4777.90283 ± 152.997
2025-09-11 01:22:24,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4688.373), np.float32(4701.452), np.float32(4955.3013), np.float32(4927.2314), np.float32(4839.8726), np.float32(4989.848), np.float32(4835.4106), np.float32(4665.773), np.float32(4709.525), np.float32(4466.2393)]
2025-09-11 01:22:24,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:22:24,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4777.90) for latency 3
2025-09-11 01:22:24,323 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 32 minutes, 12 seconds)
2025-09-11 01:25:12,701 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:25:29,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4694.46777 ± 126.460
2025-09-11 01:25:29,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4699.7305), np.float32(4635.6284), np.float32(4665.2363), np.float32(4715.6553), np.float32(4794.1436), np.float32(4674.055), np.float32(4824.89), np.float32(4926.3623), np.float32(4534.53), np.float32(4474.445)]
2025-09-11 01:25:29,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:25:29,058 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 29 minutes, 8 seconds)
2025-09-11 01:28:17,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:28:34,123 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4423.16553 ± 841.377
2025-09-11 01:28:34,123 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4569.4346), np.float32(5037.931), np.float32(4444.2617), np.float32(4768.17), np.float32(4180.8726), np.float32(4700.69), np.float32(4921.788), np.float32(1998.848), np.float32(4865.4653), np.float32(4744.1973)]
2025-09-11 01:28:34,123 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:28:34,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 26 minutes, 8 seconds)
2025-09-11 01:31:22,351 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:31:38,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4743.86377 ± 104.735
2025-09-11 01:31:38,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4633.526), np.float32(4622.428), np.float32(4612.8228), np.float32(4859.6978), np.float32(4780.264), np.float32(4835.065), np.float32(4696.0186), np.float32(4675.06), np.float32(4923.78), np.float32(4799.9717)]
2025-09-11 01:31:38,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:31:38,666 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 23 minutes, 5 seconds)
2025-09-11 01:34:26,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:34:43,155 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4793.22266 ± 203.770
2025-09-11 01:34:43,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5035.3276), np.float32(5051.13), np.float32(4646.763), np.float32(4805.176), np.float32(4313.7144), np.float32(4800.507), np.float32(4932.592), np.float32(4894.0923), np.float32(4728.184), np.float32(4724.743)]
2025-09-11 01:34:43,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:34:43,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4793.22) for latency 3
2025-09-11 01:34:43,160 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 20 minutes)
2025-09-11 01:37:31,643 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:37:47,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4320.00293 ± 928.113
2025-09-11 01:37:47,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4537.9326), np.float32(4867.679), np.float32(2996.9312), np.float32(4734.487), np.float32(4664.169), np.float32(4817.55), np.float32(2052.9995), np.float32(4730.3584), np.float32(4891.869), np.float32(4906.051)]
2025-09-11 01:37:47,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
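The "estimated time remaining" figures above are roughly consistent with extrapolating the average wall time per completed iteration over the iterations left. A hypothetical sketch of such an estimator (this is an assumption about the mechanism, not the project's actual code):

```python
import datetime

def eta(elapsed_s: float, done: int, total: int) -> str:
    """Estimate remaining wall time from average seconds per iteration."""
    per_iter = elapsed_s / done              # mean wall time per finished iteration
    remaining = per_iter * (total - done)    # extrapolate to the iterations left
    return str(datetime.timedelta(seconds=round(remaining)))

# e.g. 75 of 100 iterations done at ~185 s each:
print(eta(75 * 185.0, 75, 100))  # → 1:17:05
```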
2025-09-11 01:37:47,933 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 16 minutes, 58 seconds)
2025-09-11 01:40:36,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:40:52,568 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4773.95020 ± 104.394
2025-09-11 01:40:52,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4808.4854), np.float32(4830.9155), np.float32(4560.2764), np.float32(4773.934), np.float32(4710.3916), np.float32(4745.5566), np.float32(4885.1714), np.float32(4780.643), np.float32(4958.0728), np.float32(4686.053)]
2025-09-11 01:40:52,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:40:52,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 13 minutes, 52 seconds)
2025-09-11 01:43:40,550 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:43:56,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4836.63232 ± 180.028
2025-09-11 01:43:56,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4454.8867), np.float32(4768.6587), np.float32(4574.0864), np.float32(4960.5635), np.float32(4955.958), np.float32(4965.311), np.float32(4975.8994), np.float32(5016.0444), np.float32(4905.9697), np.float32(4788.949)]
2025-09-11 01:43:56,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:43:56,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4836.63) for latency 3
2025-09-11 01:43:56,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 10 minutes, 44 seconds)
2025-09-11 01:46:44,794 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:47:01,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4718.00781 ± 146.472
2025-09-11 01:47:01,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4634.7124), np.float32(4846.8657), np.float32(4843.1577), np.float32(4640.278), np.float32(4581.054), np.float32(4972.1396), np.float32(4538.6963), np.float32(4808.856), np.float32(4795.2617), np.float32(4519.0576)]
2025-09-11 01:47:01,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:47:01,093 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 7 minutes, 38 seconds)
2025-09-11 01:49:49,880 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:50:06,180 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4764.00439 ± 115.696
2025-09-11 01:50:06,180 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4720.4463), np.float32(4921.2974), np.float32(4817.6733), np.float32(4758.2876), np.float32(4733.26), np.float32(4938.5903), np.float32(4560.7104), np.float32(4771.8647), np.float32(4821.709), np.float32(4596.1978)]
2025-09-11 01:50:06,180 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:50:06,184 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 4 minutes, 36 seconds)
2025-09-11 01:52:55,104 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:53:11,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4513.24951 ± 780.220
2025-09-11 01:53:11,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4727.4307), np.float32(4430.334), np.float32(4904.7837), np.float32(4794.4844), np.float32(4741.6157), np.float32(4849.418), np.float32(4935.2974), np.float32(4809.055), np.float32(2206.1292), np.float32(4733.9507)]
2025-09-11 01:53:11,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:53:11,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 1 minute, 33 seconds)
2025-09-11 01:55:59,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:56:15,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4753.55566 ± 164.903
2025-09-11 01:56:15,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4438.261), np.float32(4811.219), np.float32(4668.534), np.float32(4742.947), np.float32(4857.1626), np.float32(4804.5264), np.float32(4871.27), np.float32(4873.588), np.float32(4489.5947), np.float32(4978.4556)]
2025-09-11 01:56:15,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:56:15,723 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 58 minutes, 27 seconds)
2025-09-11 01:59:04,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 01:59:20,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4718.22510 ± 184.517
2025-09-11 01:59:20,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4649.6763), np.float32(4887.2573), np.float32(5004.8345), np.float32(4436.8076), np.float32(4662.946), np.float32(4663.294), np.float32(4827.734), np.float32(4812.6797), np.float32(4396.0474), np.float32(4840.9736)]
2025-09-11 01:59:20,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 01:59:20,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 55 minutes, 24 seconds)
2025-09-11 02:02:08,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:02:24,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4845.20264 ± 173.333
2025-09-11 02:02:24,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4633.501), np.float32(4837.9443), np.float32(4891.1626), np.float32(4577.5215), np.float32(5048.689), np.float32(5117.2124), np.float32(4964.6274), np.float32(4767.1206), np.float32(4954.011), np.float32(4660.238)]
2025-09-11 02:02:24,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:02:24,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4845.20) for latency 3
2025-09-11 02:02:24,757 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 52 minutes, 20 seconds)
2025-09-11 02:05:13,143 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:05:29,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4695.20312 ± 145.295
2025-09-11 02:05:29,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4633.0938), np.float32(4676.59), np.float32(4849.6245), np.float32(4743.9053), np.float32(4742.4985), np.float32(4851.7554), np.float32(4799.031), np.float32(4783.704), np.float32(4456.6904), np.float32(4415.135)]
2025-09-11 02:05:29,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:05:29,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 49 minutes, 14 seconds)
2025-09-11 02:08:17,593 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:08:33,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4657.40234 ± 86.546
2025-09-11 02:08:33,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4659.0586), np.float32(4830.2783), np.float32(4556.2847), np.float32(4670.567), np.float32(4753.611), np.float32(4582.728), np.float32(4555.6694), np.float32(4590.4), np.float32(4727.129), np.float32(4648.3022)]
2025-09-11 02:08:33,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:08:33,821 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 46 minutes, 7 seconds)
2025-09-11 02:11:22,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:11:38,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4542.22705 ± 643.581
2025-09-11 02:11:38,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4750.7524), np.float32(4784.563), np.float32(4612.7563), np.float32(4794.237), np.float32(4890.2695), np.float32(4645.3823), np.float32(4969.3335), np.float32(4624.3735), np.float32(2639.1506), np.float32(4711.4478)]
2025-09-11 02:11:38,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:11:38,320 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 43 minutes, 3 seconds)
2025-09-11 02:14:26,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:14:42,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4905.01172 ± 156.289
2025-09-11 02:14:42,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5034.3193), np.float32(5003.674), np.float32(4877.8403), np.float32(4846.4854), np.float32(4680.872), np.float32(4880.751), np.float32(4681.6665), np.float32(5161.5054), np.float32(4792.3657), np.float32(5090.6343)]
2025-09-11 02:14:42,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:14:42,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4905.01) for latency 3
2025-09-11 02:14:42,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 39 minutes, 59 seconds)
2025-09-11 02:17:31,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:17:47,754 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4830.63086 ± 172.677
2025-09-11 02:17:47,755 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5044.835), np.float32(4629.862), np.float32(4534.8423), np.float32(4760.7637), np.float32(4938.8003), np.float32(4913.9297), np.float32(4933.555), np.float32(5027.3745), np.float32(4616.228), np.float32(4906.121)]
2025-09-11 02:17:47,755 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:17:47,760 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 36 minutes, 55 seconds)
2025-09-11 02:20:36,112 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:20:52,387 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4824.79004 ± 131.333
2025-09-11 02:20:52,387 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4818.882), np.float32(4869.479), np.float32(4769.8145), np.float32(4971.118), np.float32(4709.0137), np.float32(4627.2383), np.float32(4802.426), np.float32(5118.827), np.float32(4743.7104), np.float32(4817.39)]
2025-09-11 02:20:52,387 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:20:52,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 33 minutes, 50 seconds)
2025-09-11 02:23:40,995 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:23:57,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4863.63965 ± 196.687
2025-09-11 02:23:57,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4957.2056), np.float32(4646.1206), np.float32(5157.311), np.float32(4909.497), np.float32(4817.282), np.float32(5148.463), np.float32(4541.497), np.float32(4963.1196), np.float32(4844.7114), np.float32(4651.1836)]
2025-09-11 02:23:57,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:23:57,268 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 30 minutes, 46 seconds)
2025-09-11 02:26:45,699 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:27:02,055 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4845.72803 ± 155.978
2025-09-11 02:27:02,055 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4906.361), np.float32(4727.0996), np.float32(4576.8965), np.float32(4830.4624), np.float32(4872.451), np.float32(4755.1016), np.float32(4758.314), np.float32(5125.143), np.float32(5085.9565), np.float32(4819.496)]
2025-09-11 02:27:02,055 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:27:02,060 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 27 minutes, 42 seconds)
2025-09-11 02:29:50,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:30:06,787 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4753.38818 ± 247.935
2025-09-11 02:30:06,788 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4900.9424), np.float32(4956.4165), np.float32(4802.8384), np.float32(4808.4727), np.float32(4085.6072), np.float32(4922.0737), np.float32(4597.393), np.float32(4843.9287), np.float32(4937.2188), np.float32(4678.9927)]
2025-09-11 02:30:06,788 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:30:06,793 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 24 minutes, 38 seconds)
2025-09-11 02:32:55,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:33:11,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4929.20459 ± 90.985
2025-09-11 02:33:11,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4963.5713), np.float32(5054.8716), np.float32(4879.4917), np.float32(4981.852), np.float32(4731.249), np.float32(5049.7563), np.float32(4952.209), np.float32(4873.4136), np.float32(4932.917), np.float32(4872.713)]
2025-09-11 02:33:11,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:33:11,847 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1226 [INFO]: New best (4929.20) for latency 3
2025-09-11 02:33:11,852 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 21 minutes, 33 seconds)
2025-09-11 02:36:00,212 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:36:16,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4589.56152 ± 807.193
2025-09-11 02:36:16,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4799.7847), np.float32(4618.9233), np.float32(4766.8574), np.float32(4597.0835), np.float32(4982.502), np.float32(4692.9326), np.float32(5039.782), np.float32(2230.2078), np.float32(4993.667), np.float32(5173.8726)]
2025-09-11 02:36:16,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:36:16,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 18 minutes, 28 seconds)
2025-09-11 02:39:04,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:39:20,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4928.04590 ± 176.126
2025-09-11 02:39:20,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4946.129), np.float32(5100.183), np.float32(5126.66), np.float32(4553.2896), np.float32(5074.7114), np.float32(4982.369), np.float32(5002.2344), np.float32(4696.0933), np.float32(4812.9507), np.float32(4985.836)]
2025-09-11 02:39:20,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:39:20,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 15 minutes, 23 seconds)
2025-09-11 02:42:09,141 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:42:25,429 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4826.62891 ± 197.181
2025-09-11 02:42:25,429 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4595.3276), np.float32(4881.6777), np.float32(4856.387), np.float32(4688.649), np.float32(5037.5317), np.float32(4928.7993), np.float32(4973.9214), np.float32(4382.09), np.float32(4992.863), np.float32(4929.0405)]
2025-09-11 02:42:25,429 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:42:25,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 18 seconds)
2025-09-11 02:45:13,558 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:45:29,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4916.97559 ± 111.768
2025-09-11 02:45:29,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4718.8896), np.float32(5106.135), np.float32(4922.9316), np.float32(5017.63), np.float32(4888.3857), np.float32(5031.9067), np.float32(4878.5767), np.float32(4810.2837), np.float32(4975.2695), np.float32(4819.7495)]
2025-09-11 02:45:29,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:45:29,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 13 seconds)
2025-09-11 02:48:18,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:48:34,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4871.91943 ± 92.965
2025-09-11 02:48:34,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4887.1685), np.float32(4902.2505), np.float32(4841.79), np.float32(4838.657), np.float32(4887.186), np.float32(4724.039), np.float32(4980.6875), np.float32(4894.811), np.float32(5037.2285), np.float32(4725.377)]
2025-09-11 02:48:34,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:48:34,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 8 seconds)
2025-09-11 02:51:22,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:51:38,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4915.34131 ± 133.093
2025-09-11 02:51:38,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4749.2876), np.float32(5032.9727), np.float32(4900.5703), np.float32(5127.396), np.float32(4894.3276), np.float32(4740.3022), np.float32(4955.3906), np.float32(4764.9727), np.float32(5100.9624), np.float32(4887.234)]
2025-09-11 02:51:38,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:51:38,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 3 minutes, 4 seconds)
2025-09-11 02:54:26,679 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-11 02:54:42,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1221 [DEBUG]: Total Reward: 4893.23779 ± 129.920
2025-09-11 02:54:42,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4957.4785), np.float32(4804.119), np.float32(4944.09), np.float32(4884.626), np.float32(4876.2817), np.float32(4951.5083), np.float32(4676.9062), np.float32(4724.675), np.float32(5161.125), np.float32(4951.571)]
2025-09-11 02:54:42,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 02:54:42,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille200-halfcheetah):1251 [DEBUG]: Training session finished
