2025-09-14 08:43:01,616 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.050-delay_6
2025-09-14 08:43:01,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.050-delay_6
2025-09-14 08:43:01,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'6': <latency_env.delayed_mdp.ConstantDelay object at 0x7f304e937b60>}
2025-09-14 08:43:01,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,767 baseline-bpql-noisepromille50-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=53, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,767 baseline-bpql-noisepromille50-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:46:37,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:46:45,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: -332.09372 ± 48.321
2025-09-14 08:46:45,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-206.4097), np.float32(-345.5868), np.float32(-327.70004), np.float32(-373.8621), np.float32(-375.21158), np.float32(-342.33002), np.float32(-335.11093), np.float32(-381.21942), np.float32(-336.1196), np.float32(-297.38705)]
2025-09-14 08:46:45,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:46:45,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (-332.09) for latency 6
2025-09-14 08:46:45,196 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 6 hours, 5 minutes, 43 seconds)
2025-09-14 08:50:19,079 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:50:26,215 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: -219.11594 ± 89.838
2025-09-14 08:50:26,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-174.4904), np.float32(-334.285), np.float32(-132.98949), np.float32(-298.0442), np.float32(-141.18124), np.float32(-409.07178), np.float32(-180.47067), np.float32(-174.55206), np.float32(-204.04495), np.float32(-142.02945)]
2025-09-14 08:50:26,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:50:26,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (-219.12) for latency 6
2025-09-14 08:50:26,218 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 6 hours, 1 minute, 30 seconds)
2025-09-14 08:53:56,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:54:04,827 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 90.93345 ± 62.358
2025-09-14 08:54:04,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(26.747), np.float32(129.73946), np.float32(143.17712), np.float32(47.943226), np.float32(126.77355), np.float32(61.06656), np.float32(209.11061), np.float32(53.327328), np.float32(121.47623), np.float32(-10.026598)]
2025-09-14 08:54:04,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:54:04,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (90.93) for latency 6
2025-09-14 08:54:04,830 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 5 hours, 56 minutes, 21 seconds)
2025-09-14 08:57:42,492 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:57:50,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 522.19592 ± 182.116
2025-09-14 08:57:50,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(676.0101), np.float32(462.81506), np.float32(508.28848), np.float32(347.15594), np.float32(562.4114), np.float32(153.87756), np.float32(446.7728), np.float32(620.2118), np.float32(865.4997), np.float32(578.9161)]
2025-09-14 08:57:50,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:57:50,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (522.20) for latency 6
2025-09-14 08:57:50,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 5 hours, 54 minutes, 50 seconds)
2025-09-14 09:01:23,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:01:31,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 790.09827 ± 500.218
2025-09-14 09:01:31,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1516.7693), np.float32(379.07242), np.float32(1463.5955), np.float32(382.74936), np.float32(1178.5093), np.float32(1407.5131), np.float32(288.81427), np.float32(353.85657), np.float32(446.45822), np.float32(483.6445)]
2025-09-14 09:01:31,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:01:31,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (790.10) for latency 6
2025-09-14 09:01:31,587 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 5 hours, 50 minutes, 52 seconds)
2025-09-14 09:04:55,557 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:05:03,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1051.16833 ± 564.371
2025-09-14 09:05:03,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(861.4137), np.float32(414.9615), np.float32(1361.8777), np.float32(1649.0923), np.float32(1833.8993), np.float32(555.7724), np.float32(707.8534), np.float32(392.80743), np.float32(1950.7137), np.float32(783.2919)]
2025-09-14 09:05:03,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:05:03,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1051.17) for latency 6
2025-09-14 09:05:03,608 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 5 hours, 44 minutes, 10 seconds)
2025-09-14 09:08:24,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:08:31,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1483.29688 ± 488.680
2025-09-14 09:08:31,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(488.66388), np.float32(1132.1885), np.float32(1555.177), np.float32(1921.0693), np.float32(894.70917), np.float32(1868.971), np.float32(1749.7798), np.float32(2142.5325), np.float32(1350.8633), np.float32(1729.0146)]
2025-09-14 09:08:31,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:08:31,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1483.30) for latency 6
2025-09-14 09:08:31,997 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 5 hours, 36 minutes, 35 seconds)
2025-09-14 09:11:30,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:11:37,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1355.53638 ± 637.086
2025-09-14 09:11:37,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(968.28577), np.float32(2203.0164), np.float32(899.6849), np.float32(857.17926), np.float32(2361.2942), np.float32(558.91376), np.float32(1368.0282), np.float32(2096.6428), np.float32(1584.3131), np.float32(658.0044)]
2025-09-14 09:11:37,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:11:37,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 5 hours, 22 minutes, 44 seconds)
2025-09-14 09:14:27,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:14:33,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1923.17065 ± 661.699
2025-09-14 09:14:33,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1239.3695), np.float32(2336.5764), np.float32(1206.9434), np.float32(928.03754), np.float32(2136.5676), np.float32(1323.6254), np.float32(2663.3691), np.float32(2875.5183), np.float32(1988.0948), np.float32(2533.6052)]
2025-09-14 09:14:33,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:14:33,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1923.17) for latency 6
2025-09-14 09:14:33,937 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 5 hours, 4 minutes, 19 seconds)
2025-09-14 09:17:24,092 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:17:30,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2168.36597 ± 767.626
2025-09-14 09:17:30,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(817.6265), np.float32(1865.3063), np.float32(2236.9023), np.float32(2588.0972), np.float32(1814.6305), np.float32(3098.222), np.float32(2767.32), np.float32(2843.0706), np.float32(888.83704), np.float32(2763.6475)]
2025-09-14 09:17:30,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:17:30,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2168.37) for latency 6
2025-09-14 09:17:30,653 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 47 minutes, 43 seconds)
2025-09-14 09:20:06,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:20:12,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2592.81177 ± 714.520
2025-09-14 09:20:12,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3057.402), np.float32(2811.8281), np.float32(2796.8943), np.float32(3151.6443), np.float32(1208.5187), np.float32(2777.7163), np.float32(1230.5286), np.float32(2948.328), np.float32(2607.6292), np.float32(3337.6272)]
2025-09-14 09:20:12,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:20:12,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2592.81) for latency 6
2025-09-14 09:20:12,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 29 minutes, 39 seconds)
2025-09-14 09:23:32,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:23:40,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2704.26025 ± 580.860
2025-09-14 09:23:40,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2915.4746), np.float32(1773.3508), np.float32(3162.2698), np.float32(2686.56), np.float32(3114.7405), np.float32(3481.7488), np.float32(3022.7449), np.float32(1789.7032), np.float32(2051.5012), np.float32(3044.505)]
2025-09-14 09:23:40,764 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:40,764 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2704.26) for latency 6
2025-09-14 09:23:40,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 26 minutes, 34 seconds)
2025-09-14 09:27:06,066 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:27:13,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1977.77930 ± 791.366
2025-09-14 09:27:13,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1654.7339), np.float32(1140.702), np.float32(2621.3547), np.float32(1391.2289), np.float32(3077.294), np.float32(1310.7521), np.float32(1069.8857), np.float32(2397.0564), np.float32(3386.0945), np.float32(1728.6914)]
2025-09-14 09:27:13,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:27:13,956 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 31 minutes, 39 seconds)
2025-09-14 09:30:38,375 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:30:46,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3811.49414 ± 166.578
2025-09-14 09:30:46,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4019.8828), np.float32(4046.521), np.float32(3792.26), np.float32(3997.3281), np.float32(3746.5125), np.float32(3495.4458), np.float32(3830.994), np.float32(3617.2808), np.float32(3795.1558), np.float32(3773.561)]
2025-09-14 09:30:46,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:46,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3811.49) for latency 6
2025-09-14 09:30:46,433 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 38 minutes, 46 seconds)
2025-09-14 09:34:12,427 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:34:20,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3197.72363 ± 865.272
2025-09-14 09:34:20,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3314.8418), np.float32(1804.5703), np.float32(4711.806), np.float32(3462.5151), np.float32(2593.6), np.float32(4208.539), np.float32(3480.1023), np.float32(2745.9668), np.float32(3610.387), np.float32(2044.9081)]
2025-09-14 09:34:20,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:34:20,587 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 46 minutes, 8 seconds)
2025-09-14 09:37:45,604 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:37:53,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3530.84961 ± 847.802
2025-09-14 09:37:53,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3900.9324), np.float32(4058.4753), np.float32(3764.5842), np.float32(3703.2573), np.float32(4077.0518), np.float32(2268.7795), np.float32(3812.8655), np.float32(4225.895), np.float32(3975.0142), np.float32(1521.6422)]
2025-09-14 09:37:53,539 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:37:53,543 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 57 minutes, 4 seconds)
2025-09-14 09:41:18,096 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:41:26,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3518.49731 ± 744.876
2025-09-14 09:41:26,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3578.4688), np.float32(3383.507), np.float32(3355.423), np.float32(3930.9573), np.float32(3825.0352), np.float32(1433.0736), np.float32(4047.7935), np.float32(3647.9902), np.float32(3724.1787), np.float32(4258.5483)]
2025-09-14 09:41:26,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:41:26,104 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 54 minutes, 44 seconds)
2025-09-14 09:44:51,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:44:59,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2736.55908 ± 795.324
2025-09-14 09:44:59,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2124.7422), np.float32(1277.8732), np.float32(2848.2664), np.float32(2613.0635), np.float32(4064.8435), np.float32(2470.8872), np.float32(2562.2683), np.float32(3988.3467), np.float32(2282.4187), np.float32(3132.8801)]
2025-09-14 09:44:59,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:44:59,490 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 51 minutes, 14 seconds)
2025-09-14 09:48:24,760 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:48:32,970 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3147.41650 ± 1100.871
2025-09-14 09:48:32,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1638.0239), np.float32(4439.668), np.float32(2221.126), np.float32(2750.4094), np.float32(4031.391), np.float32(4347.477), np.float32(1972.6697), np.float32(4128.21), np.float32(1847.7183), np.float32(4097.473)]
2025-09-14 09:48:32,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:48:32,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 47 minutes, 57 seconds)
2025-09-14 09:51:58,707 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:52:06,912 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3489.49756 ± 919.704
2025-09-14 09:52:06,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4317.753), np.float32(3707.1687), np.float32(4333.573), np.float32(3906.1472), np.float32(4111.2227), np.float32(3627.2083), np.float32(1739.5426), np.float32(3576.5398), np.float32(3878.6072), np.float32(1697.2148)]
2025-09-14 09:52:06,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:52:06,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 44 minutes, 21 seconds)
2025-09-14 09:55:20,760 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:55:28,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3808.12451 ± 951.755
2025-09-14 09:55:28,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4360.2964), np.float32(4400.7827), np.float32(2616.4968), np.float32(4365.776), np.float32(4366.224), np.float32(4332.5986), np.float32(2895.1519), np.float32(4554.9297), np.float32(1738.9586), np.float32(4450.031)]
2025-09-14 09:55:28,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:55:28,116 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 37 minutes, 42 seconds)
2025-09-14 09:58:18,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:58:24,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3306.44922 ± 1078.798
2025-09-14 09:58:24,597 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3788.6108), np.float32(2105.3606), np.float32(4177.329), np.float32(2350.1597), np.float32(4157.797), np.float32(3952.6536), np.float32(2092.635), np.float32(4031.0999), np.float32(4805.734), np.float32(1603.1122)]
2025-09-14 09:58:24,597 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:58:24,600 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 24 minutes, 48 seconds)
2025-09-14 10:00:53,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:00:58,836 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4110.42334 ± 805.058
2025-09-14 10:00:58,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4523.923), np.float32(3572.1077), np.float32(1958.7969), np.float32(5032.848), np.float32(3986.4226), np.float32(4551.502), np.float32(4416.3496), np.float32(4549.253), np.float32(4231.175), np.float32(4281.854)]
2025-09-14 10:00:58,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:00:58,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4110.42) for latency 6
2025-09-14 10:00:58,839 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 6 minutes, 13 seconds)
2025-09-14 10:03:17,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:03:22,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4007.29834 ± 1163.302
2025-09-14 10:03:22,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4541.993), np.float32(4546.37), np.float32(4616.6787), np.float32(4709.4155), np.float32(1534.7767), np.float32(4590.0005), np.float32(4452.6567), np.float32(4678.005), np.float32(1842.8102), np.float32(4560.2793)]
2025-09-14 10:03:22,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:03:22,995 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 3 hours, 45 minutes, 28 seconds)
2025-09-14 10:05:31,836 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:05:37,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4298.39355 ± 825.248
2025-09-14 10:05:37,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4584.939), np.float32(4271.1064), np.float32(4289.477), np.float32(4261.272), np.float32(4506.1006), np.float32(1928.9021), np.float32(4632.8906), np.float32(4661.73), np.float32(4778.265), np.float32(5069.2554)]
2025-09-14 10:05:37,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:05:37,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4298.39) for latency 6
2025-09-14 10:05:37,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 3 hours, 22 minutes, 35 seconds)
2025-09-14 10:07:45,743 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:07:51,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4173.18506 ± 771.120
2025-09-14 10:07:51,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4593.194), np.float32(4649.8774), np.float32(2961.6094), np.float32(4689.1973), np.float32(3868.5176), np.float32(4391.9575), np.float32(4767.73), np.float32(4866.1543), np.float32(2509.987), np.float32(4433.623)]
2025-09-14 10:07:51,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:07:51,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 3 hours, 3 minutes, 17 seconds)
2025-09-14 10:09:59,560 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:10:05,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4348.39844 ± 182.432
2025-09-14 10:10:05,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4377.232), np.float32(4603.943), np.float32(4204.478), np.float32(4289.0146), np.float32(4535.3037), np.float32(4015.4617), np.float32(4591.332), np.float32(4393.552), np.float32(4310.0137), np.float32(4163.6543)]
2025-09-14 10:10:05,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:10:05,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4348.40) for latency 6
2025-09-14 10:10:05,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 2 hours, 50 minutes, 25 seconds)
2025-09-14 10:12:13,425 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:12:18,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4720.94629 ± 192.255
2025-09-14 10:12:18,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4434.2617), np.float32(4978.687), np.float32(4898.127), np.float32(4609.6514), np.float32(4753.7783), np.float32(4812.2275), np.float32(4360.657), np.float32(4729.3306), np.float32(4915.674), np.float32(4717.075)]
2025-09-14 10:12:18,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:12:18,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4720.95) for latency 6
2025-09-14 10:12:18,870 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 2 hours, 43 minutes, 12 seconds)
2025-09-14 10:14:27,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:14:32,667 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4324.69238 ± 502.077
2025-09-14 10:14:32,667 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4297.385), np.float32(4294.17), np.float32(4716.9517), np.float32(4390.182), np.float32(4934.797), np.float32(2952.963), np.float32(4612.948), np.float32(4227.7075), np.float32(4416.651), np.float32(4403.1694)]
2025-09-14 10:14:32,667 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:14:32,671 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 2 hours, 38 minutes, 29 seconds)
2025-09-14 10:16:41,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:16:46,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4810.67383 ± 129.387
2025-09-14 10:16:46,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4820.874), np.float32(4563.3364), np.float32(4677.371), np.float32(4770.436), np.float32(5036.1206), np.float32(4956.0435), np.float32(4889.186), np.float32(4866.176), np.float32(4787.116), np.float32(4740.0776)]
2025-09-14 10:16:46,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:16:46,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4810.67) for latency 6
2025-09-14 10:16:46,507 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 2 hours, 36 minutes, 9 seconds)
2025-09-14 10:18:54,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:19:00,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4459.90186 ± 589.549
2025-09-14 10:19:00,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4726.5605), np.float32(4619.6367), np.float32(4440.5654), np.float32(4752.204), np.float32(4457.5117), np.float32(4556.1274), np.float32(4840.4897), np.float32(2738.0232), np.float32(4624.2817), np.float32(4843.6196)]
2025-09-14 10:19:00,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:19:00,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 2 hours, 33 minutes, 53 seconds)
2025-09-14 10:21:08,667 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:21:14,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4941.82178 ± 145.560
2025-09-14 10:21:14,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5074.1104), np.float32(5225.007), np.float32(4970.1626), np.float32(4991.8296), np.float32(4692.565), np.float32(4760.884), np.float32(4864.4824), np.float32(4891.275), np.float32(4921.935), np.float32(5025.9707)]
2025-09-14 10:21:14,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:21:14,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4941.82) for latency 6
2025-09-14 10:21:14,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 2 hours, 31 minutes, 40 seconds)
2025-09-14 10:23:22,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:23:28,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4833.79053 ± 186.289
2025-09-14 10:23:28,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5022.593), np.float32(4581.062), np.float32(4863.3716), np.float32(5105.566), np.float32(4950.7017), np.float32(4764.7583), np.float32(4758.858), np.float32(4622.5854), np.float32(5062.3784), np.float32(4606.0366)]
2025-09-14 10:23:28,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:23:28,029 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 2 hours, 29 minutes, 26 seconds)
2025-09-14 10:25:36,552 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:25:41,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4598.56494 ± 621.943
2025-09-14 10:25:41,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4881.106), np.float32(3688.507), np.float32(5141.6855), np.float32(4026.1929), np.float32(3570.016), np.float32(4340.675), np.float32(5008.8145), np.float32(4822.8257), np.float32(4990.0938), np.float32(5515.731)]
2025-09-14 10:25:41,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:25:42,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 2 hours, 27 minutes, 15 seconds)
2025-09-14 10:27:51,033 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:27:56,454 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4098.24707 ± 1222.401
2025-09-14 10:27:56,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4781.189), np.float32(5016.5747), np.float32(4107.835), np.float32(4889.838), np.float32(3989.8389), np.float32(1505.3671), np.float32(4796.231), np.float32(4668.3564), np.float32(2034.1421), np.float32(5193.0967)]
2025-09-14 10:27:56,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:27:56,459 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 2 hours, 25 minutes, 9 seconds)
2025-09-14 10:30:04,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:30:10,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4607.63184 ± 1040.772
2025-09-14 10:30:10,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5177.0864), np.float32(1566.4735), np.float32(5120.316), np.float32(4793.6343), np.float32(4671.4014), np.float32(5051.847), np.float32(4450.607), np.float32(5067.8794), np.float32(4913.4834), np.float32(5263.5933)]
2025-09-14 10:30:10,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:30:10,440 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 2 hours, 22 minutes, 58 seconds)
2025-09-14 10:32:19,105 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:32:24,554 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4941.72900 ± 216.118
2025-09-14 10:32:24,554 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5103.8306), np.float32(4691.0513), np.float32(5001.7153), np.float32(5285.6196), np.float32(4582.312), np.float32(4651.1655), np.float32(4981.4775), np.float32(4950.9204), np.float32(5084.0337), np.float32(5085.158)]
2025-09-14 10:32:24,554 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:32:24,558 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 2 hours, 20 minutes, 47 seconds)
2025-09-14 10:34:33,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:34:38,689 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4748.30371 ± 798.453
2025-09-14 10:34:38,689 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2528.4849), np.float32(4897.129), np.float32(5118.669), np.float32(5250.11), np.float32(5042.453), np.float32(5161.528), np.float32(5226.5137), np.float32(5191.0728), np.float32(4895.678), np.float32(4171.3975)]
2025-09-14 10:34:38,689 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:34:38,693 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 2 hours, 18 minutes, 36 seconds)
2025-09-14 10:36:47,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:36:52,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4979.50732 ± 189.486
2025-09-14 10:36:52,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4724.8022), np.float32(4757.498), np.float32(5157.088), np.float32(4733.8564), np.float32(4999.153), np.float32(5196.7085), np.float32(4818.231), np.float32(5076.9224), np.float32(5185.7637), np.float32(5145.0493)]
2025-09-14 10:36:52,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:36:52,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4979.51) for latency 6
2025-09-14 10:36:52,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 2 hours, 16 minutes, 23 seconds)
2025-09-14 10:39:01,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:39:06,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4590.25195 ± 879.462
2025-09-14 10:39:06,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5050.322), np.float32(4468.3286), np.float32(5183.6274), np.float32(4768.704), np.float32(5044.3867), np.float32(4627.7017), np.float32(5315.4146), np.float32(4947.188), np.float32(4402.1494), np.float32(2094.6997)]
2025-09-14 10:39:06,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:39:06,943 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 2 hours, 14 minutes, 5 seconds)
2025-09-14 10:41:15,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:41:21,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4577.10498 ± 1045.798
2025-09-14 10:41:21,177 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5106.649), np.float32(5144.502), np.float32(1687.0636), np.float32(5182.162), np.float32(5324.5356), np.float32(5082.162), np.float32(4735.459), np.float32(3791.9756), np.float32(4820.863), np.float32(4895.6807)]
2025-09-14 10:41:21,177 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:41:21,181 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 2 hours, 11 minutes, 54 seconds)
2025-09-14 10:43:29,897 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:43:35,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5029.68652 ± 164.821
2025-09-14 10:43:35,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5079.114), np.float32(4752.978), np.float32(5172.272), np.float32(5006.8735), np.float32(5159.052), np.float32(5175.625), np.float32(5226.6885), np.float32(4765.413), np.float32(5079.5825), np.float32(4879.271)]
2025-09-14 10:43:35,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:43:35,334 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5029.69) for latency 6
2025-09-14 10:43:35,338 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 9 minutes, 41 seconds)
2025-09-14 10:45:44,031 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:45:49,481 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4040.02808 ± 1213.117
2025-09-14 10:45:49,482 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1263.1886), np.float32(4099.48), np.float32(4581.1484), np.float32(4730.726), np.float32(2109.7957), np.float32(4872.5366), np.float32(4542.899), np.float32(4874.1724), np.float32(4873.0767), np.float32(4453.2573)]
2025-09-14 10:45:49,482 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:45:49,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 7 minutes, 27 seconds)
2025-09-14 10:47:58,300 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:48:03,738 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4661.32324 ± 609.583
2025-09-14 10:48:03,738 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4689.92), np.float32(4986.11), np.float32(5013.2544), np.float32(4721.3926), np.float32(4870.401), np.float32(4881.727), np.float32(2857.5884), np.float32(4929.1826), np.float32(4888.85), np.float32(4774.8105)]
2025-09-14 10:48:03,739 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:48:03,742 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 2 hours, 5 minutes, 14 seconds)
2025-09-14 10:50:12,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:50:17,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4849.82373 ± 717.426
2025-09-14 10:50:17,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5191.185), np.float32(2810.994), np.float32(4905.5454), np.float32(5273.5312), np.float32(4904.0083), np.float32(4776.3135), np.float32(5112.2983), np.float32(5415.6606), np.float32(5378.351), np.float32(4730.3516)]
2025-09-14 10:50:17,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:50:17,832 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 2 minutes, 59 seconds)
2025-09-14 10:52:26,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:52:31,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5099.54688 ± 132.923
2025-09-14 10:52:31,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5101.0044), np.float32(5272.3643), np.float32(4788.8623), np.float32(5098.0786), np.float32(5138.216), np.float32(5005.638), np.float32(5171.4385), np.float32(5013.2505), np.float32(5149.337), np.float32(5257.282)]
2025-09-14 10:52:31,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:52:31,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5099.55) for latency 6
2025-09-14 10:52:31,586 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 40 seconds)
2025-09-14 10:54:40,277 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:54:45,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5114.93701 ± 206.920
2025-09-14 10:54:45,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4829.012), np.float32(5389.4487), np.float32(4936.223), np.float32(5316.797), np.float32(5335.425), np.float32(5058.551), np.float32(5060.8286), np.float32(4971.812), np.float32(5373.556), np.float32(4877.7188)]
2025-09-14 10:54:45,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:54:45,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5114.94) for latency 6
2025-09-14 10:54:45,716 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 1 hour, 58 minutes, 26 seconds)
2025-09-14 10:56:54,526 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:56:59,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4686.62695 ± 495.353
2025-09-14 10:56:59,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3615.1414), np.float32(4902.51), np.float32(4996.2495), np.float32(5183.5054), np.float32(3984.923), np.float32(4785.748), np.float32(4394.855), np.float32(4939.6235), np.float32(4916.5757), np.float32(5147.141)]
2025-09-14 10:56:59,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:56:59,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 1 hour, 56 minutes, 13 seconds)
2025-09-14 10:59:08,953 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:59:14,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4977.50879 ± 947.032
2025-09-14 10:59:14,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5437.6357), np.float32(5368.3096), np.float32(5103.243), np.float32(5326.1084), np.float32(5221.9067), np.float32(5273.6133), np.float32(5131.7124), np.float32(5335.668), np.float32(2154.4768), np.float32(5422.4155)]
2025-09-14 10:59:14,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:59:14,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 1 hour, 54 minutes)
2025-09-14 11:01:22,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:01:28,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5015.10303 ± 241.844
2025-09-14 11:01:28,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4746.2114), np.float32(5056.948), np.float32(5136.261), np.float32(5085.3716), np.float32(4654.5645), np.float32(5146.972), np.float32(4812.3936), np.float32(4781.6826), np.float32(5403.8145), np.float32(5326.8135)]
2025-09-14 11:01:28,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:01:28,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 1 hour, 51 minutes, 44 seconds)
2025-09-14 11:03:36,699 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:03:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5162.57227 ± 181.139
2025-09-14 11:03:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5405.9043), np.float32(4989.208), np.float32(5236.2173), np.float32(5011.034), np.float32(5180.407), np.float32(5397.564), np.float32(4896.263), np.float32(5186.954), np.float32(5365.294), np.float32(4956.8804)]
2025-09-14 11:03:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:03:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5162.57) for latency 6
2025-09-14 11:03:42,133 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 1 hour, 49 minutes, 31 seconds)
2025-09-14 11:05:50,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:05:56,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5022.34619 ± 402.315
2025-09-14 11:05:56,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4199.7407), np.float32(4908.2207), np.float32(5350.5425), np.float32(5202.0156), np.float32(4829.737), np.float32(5446.8833), np.float32(4504.9233), np.float32(5430.2812), np.float32(5388.229), np.float32(4962.8867)]
2025-09-14 11:05:56,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:05:56,412 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 1 hour, 47 minutes, 18 seconds)
2025-09-14 11:08:05,362 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:08:10,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5268.16602 ± 120.608
2025-09-14 11:08:10,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5317.307), np.float32(5341.834), np.float32(5354.5693), np.float32(5382.723), np.float32(5215.8667), np.float32(5410.9604), np.float32(5171.4194), np.float32(5019.0093), np.float32(5133.6562), np.float32(5334.3164)]
2025-09-14 11:08:10,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:08:10,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5268.17) for latency 6
2025-09-14 11:08:10,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 1 hour, 45 minutes, 5 seconds)
2025-09-14 11:10:19,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:10:25,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4807.53418 ± 1133.845
2025-09-14 11:10:25,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4974.858), np.float32(5008.505), np.float32(5507.95), np.float32(5167.0146), np.float32(5243.8823), np.float32(5215.5107), np.float32(4720.885), np.float32(1470.247), np.float32(5422.7397), np.float32(5343.756)]
2025-09-14 11:10:25,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:10:25,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 1 hour, 42 minutes, 51 seconds)
2025-09-14 11:12:34,230 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:12:39,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4891.20215 ± 1013.785
2025-09-14 11:12:39,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5016.6924), np.float32(5281.0425), np.float32(5025.077), np.float32(5433.8457), np.float32(5274.688), np.float32(5311.668), np.float32(5201.7656), np.float32(1874.252), np.float32(5139.8677), np.float32(5353.1265)]
2025-09-14 11:12:39,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:12:39,681 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 40 minutes, 42 seconds)
2025-09-14 11:14:48,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:14:54,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5125.03125 ± 222.838
2025-09-14 11:14:54,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5255.6963), np.float32(5343.5747), np.float32(5074.2393), np.float32(5028.841), np.float32(5216.329), np.float32(5328.416), np.float32(5301.421), np.float32(5102.148), np.float32(4547.7905), np.float32(5051.8574)]
2025-09-14 11:14:54,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:14:54,028 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 38 minutes, 32 seconds)
2025-09-14 11:17:02,921 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:17:08,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5134.38574 ± 159.836
2025-09-14 11:17:08,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5222.4287), np.float32(5191.4507), np.float32(5335.375), np.float32(5108.923), np.float32(5362.1123), np.float32(5095.733), np.float32(4964.4307), np.float32(5114.785), np.float32(5160.795), np.float32(4787.826)]
2025-09-14 11:17:08,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:17:08,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 36 minutes, 18 seconds)
2025-09-14 11:19:16,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:19:22,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5077.43164 ± 649.681
2025-09-14 11:19:22,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5620.726), np.float32(5200.8506), np.float32(5179.5176), np.float32(3207.5632), np.float32(4873.8257), np.float32(5427.306), np.float32(5310.1626), np.float32(5352.223), np.float32(5371.98), np.float32(5230.1636)]
2025-09-14 11:19:22,381 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:19:22,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 34 minutes, 1 second)
2025-09-14 11:21:30,806 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:21:36,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5125.35254 ± 700.215
2025-09-14 11:21:36,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3050.4062), np.float32(5492.818), np.float32(5476.568), np.float32(5348.669), np.float32(5329.004), np.float32(5399.1304), np.float32(5277.328), np.float32(5230.938), np.float32(5496.8164), np.float32(5151.8506)]
2025-09-14 11:21:36,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:21:36,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 31 minutes, 42 seconds)
2025-09-14 11:23:44,836 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:23:50,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4858.78076 ± 946.978
2025-09-14 11:23:50,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5327.2446), np.float32(3145.8264), np.float32(5338.527), np.float32(5358.1367), np.float32(5149.8633), np.float32(5220.204), np.float32(5298.144), np.float32(2818.2822), np.float32(5578.793), np.float32(5352.789)]
2025-09-14 11:23:50,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:23:50,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 29 minutes, 24 seconds)
2025-09-14 11:25:58,687 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:26:04,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5230.10547 ± 149.447
2025-09-14 11:26:04,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4955.8633), np.float32(5157.6), np.float32(5361.3447), np.float32(5466.6035), np.float32(5058.22), np.float32(5158.8994), np.float32(5314.1313), np.float32(5164.098), np.float32(5370.0083), np.float32(5294.2905)]
2025-09-14 11:26:04,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:26:04,118 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 27 minutes, 6 seconds)
2025-09-14 11:28:12,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:28:17,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5235.46338 ± 292.534
2025-09-14 11:28:17,983 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4889.3936), np.float32(5091.1675), np.float32(5460.6973), np.float32(5313.4814), np.float32(5290.2095), np.float32(4553.1665), np.float32(5554.724), np.float32(5396.178), np.float32(5457.8936), np.float32(5347.724)]
2025-09-14 11:28:17,983 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:28:17,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 24 minutes, 49 seconds)
2025-09-14 11:30:26,496 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:30:31,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4994.05322 ± 861.494
2025-09-14 11:30:31,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5455.6406), np.float32(5480.234), np.float32(2484.6086), np.float32(5241.825), np.float32(5121.22), np.float32(5390.893), np.float32(5379.8457), np.float32(5231.178), np.float32(5409.9077), np.float32(4745.1816)]
2025-09-14 11:30:31,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:30:31,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 22 minutes, 34 seconds)
2025-09-14 11:32:40,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:32:45,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5225.94043 ± 211.977
2025-09-14 11:32:45,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5310.947), np.float32(5468.604), np.float32(5306.847), np.float32(5153.516), np.float32(5299.0605), np.float32(5125.7856), np.float32(5250.2124), np.float32(5421.0586), np.float32(5259.7954), np.float32(4663.574)]
2025-09-14 11:32:45,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:32:45,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 20 minutes, 21 seconds)
2025-09-14 11:34:54,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:35:00,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5327.51270 ± 77.708
2025-09-14 11:35:00,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5251.3994), np.float32(5423.027), np.float32(5376.006), np.float32(5465.943), np.float32(5350.484), np.float32(5323.982), np.float32(5201.8174), np.float32(5327.1626), np.float32(5241.3027), np.float32(5314.0015)]
2025-09-14 11:35:00,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:35:00,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5327.51) for latency 6
2025-09-14 11:35:00,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 18 minutes, 8 seconds)
2025-09-14 11:37:08,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:37:14,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5152.50977 ± 160.010
2025-09-14 11:37:14,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5353.909), np.float32(5346.697), np.float32(4957.351), np.float32(4967.883), np.float32(5277.6904), np.float32(5089.805), np.float32(5197.3394), np.float32(4882.092), np.float32(5225.9634), np.float32(5226.3687)]
2025-09-14 11:37:14,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:37:14,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 15 minutes, 56 seconds)
2025-09-14 11:39:22,824 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:39:28,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5024.48828 ± 548.445
2025-09-14 11:39:28,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5176.753), np.float32(5359.004), np.float32(5114.449), np.float32(5185.3794), np.float32(5013.919), np.float32(5339.954), np.float32(5174.596), np.float32(5139.3887), np.float32(3409.2983), np.float32(5332.1416)]
2025-09-14 11:39:28,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:39:28,268 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 13 minutes, 43 seconds)
2025-09-14 11:41:36,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:41:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5146.00195 ± 219.570
2025-09-14 11:41:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5153.4873), np.float32(5223.55), np.float32(5164.7573), np.float32(4596.3916), np.float32(5223.4824), np.float32(5366.562), np.float32(5175.3447), np.float32(5382.903), np.float32(5252.5503), np.float32(4920.9927)]
2025-09-14 11:41:42,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:41:42,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 11 minutes, 29 seconds)
2025-09-14 11:43:50,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:43:56,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4538.39307 ± 1373.172
2025-09-14 11:43:56,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5088.321), np.float32(5437.5986), np.float32(5410.9863), np.float32(5607.5796), np.float32(5398.6895), np.float32(5432.443), np.float32(3865.8074), np.float32(1423.9307), np.float32(2589.1667), np.float32(5129.401)]
2025-09-14 11:43:56,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:43:56,040 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 9 minutes, 15 seconds)
2025-09-14 11:46:04,598 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:46:10,048 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4717.70410 ± 1189.187
2025-09-14 11:46:10,049 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5249.477), np.float32(5030.032), np.float32(4827.4375), np.float32(5201.904), np.float32(5146.7695), np.float32(5166.5376), np.float32(5124.239), np.float32(1166.3994), np.float32(5198.3716), np.float32(5065.869)]
2025-09-14 11:46:10,049 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:46:10,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 6 minutes, 59 seconds)
2025-09-14 11:48:18,592 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:48:24,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5318.88623 ± 141.594
2025-09-14 11:48:24,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5452.144), np.float32(5368.374), np.float32(5291.9194), np.float32(5368.0684), np.float32(5139.22), np.float32(5084.9585), np.float32(5228.7124), np.float32(5527.052), np.float32(5495.6636), np.float32(5232.753)]
2025-09-14 11:48:24,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:48:24,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 4 minutes, 45 seconds)
2025-09-14 11:50:32,568 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:50:38,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5247.98096 ± 124.333
2025-09-14 11:50:38,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5285.9595), np.float32(5328.631), np.float32(5065.837), np.float32(5193.966), np.float32(5325.9526), np.float32(5026.0703), np.float32(5402.3877), np.float32(5207.4297), np.float32(5224.7246), np.float32(5418.846)]
2025-09-14 11:50:38,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:50:38,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 2 minutes, 30 seconds)
2025-09-14 11:52:46,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:52:51,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5246.21045 ± 69.900
2025-09-14 11:52:51,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5106.237), np.float32(5133.408), np.float32(5243.9062), np.float32(5263.541), np.float32(5279.5874), np.float32(5240.3853), np.float32(5336.711), np.float32(5308.2363), np.float32(5301.399), np.float32(5248.696)]
2025-09-14 11:52:51,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:52:51,855 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 16 seconds)
2025-09-14 11:55:00,387 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:55:05,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5355.35059 ± 149.710
2025-09-14 11:55:05,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5457.4614), np.float32(5015.653), np.float32(5283.168), np.float32(5422.598), np.float32(5514.1714), np.float32(5172.5146), np.float32(5334.26), np.float32(5468.926), np.float32(5469.138), np.float32(5415.623)]
2025-09-14 11:55:05,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:55:05,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5355.35) for latency 6
2025-09-14 11:55:05,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 58 minutes, 3 seconds)
2025-09-14 11:57:14,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:57:19,793 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5285.86035 ± 157.929
2025-09-14 11:57:19,793 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5334.2085), np.float32(4977.473), np.float32(5374.605), np.float32(5456.8286), np.float32(5440.8354), np.float32(5395.144), np.float32(5210.7373), np.float32(5049.051), np.float32(5399.401), np.float32(5220.3174)]
2025-09-14 11:57:19,793 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:57:19,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 55 minutes, 48 seconds)
2025-09-14 11:59:28,102 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:59:33,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5089.33350 ± 742.173
2025-09-14 11:59:33,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5460.0386), np.float32(5281.5405), np.float32(2879.9868), np.float32(5242.63), np.float32(5436.423), np.float32(5346.9663), np.float32(5450.2446), np.float32(5287.5107), np.float32(5157.7656), np.float32(5350.2305)]
2025-09-14 11:59:33,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:59:33,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 53 minutes, 33 seconds)
2025-09-14 12:01:41,897 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:01:47,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5224.58691 ± 217.924
2025-09-14 12:01:47,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5396.825), np.float32(4841.3364), np.float32(5117.6675), np.float32(5229.5645), np.float32(5404.491), np.float32(5437.5996), np.float32(5206.236), np.float32(5501.446), np.float32(5248.8647), np.float32(4861.8345)]
2025-09-14 12:01:47,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:01:47,340 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 51 minutes, 18 seconds)
2025-09-14 12:03:56,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:04:01,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5382.38477 ± 110.413
2025-09-14 12:04:01,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5494.536), np.float32(5347.2964), np.float32(5515.695), np.float32(5401.3945), np.float32(5397.691), np.float32(5478.9907), np.float32(5365.648), np.float32(5295.1577), np.float32(5412.6274), np.float32(5114.8086)]
2025-09-14 12:04:01,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:04:01,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5382.38) for latency 6
2025-09-14 12:04:01,782 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 49 minutes, 7 seconds)
2025-09-14 12:06:10,526 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:06:15,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5290.71924 ± 147.707
2025-09-14 12:06:15,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5294.636), np.float32(5028.3203), np.float32(5483.0386), np.float32(5467.812), np.float32(5398.5933), np.float32(5395.498), np.float32(5279.324), np.float32(5174.1426), np.float32(5305.801), np.float32(5080.0234)]
2025-09-14 12:06:15,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:06:15,981 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 46 minutes, 54 seconds)
2025-09-14 12:08:24,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:08:30,258 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5351.79980 ± 150.935
2025-09-14 12:08:30,258 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5357.54), np.float32(5413.5024), np.float32(5391.797), np.float32(5488.355), np.float32(5347.7036), np.float32(5339.5645), np.float32(5338.621), np.float32(5495.956), np.float32(5415.6846), np.float32(4929.2783)]
2025-09-14 12:08:30,258 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:08:30,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 44 minutes, 41 seconds)
2025-09-14 12:10:38,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:10:44,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5229.58984 ± 190.305
2025-09-14 12:10:44,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5428.391), np.float32(5337.771), np.float32(5100.1304), np.float32(5008.624), np.float32(5370.324), np.float32(5553.9214), np.float32(5249.1206), np.float32(5188.7466), np.float32(5167.595), np.float32(4891.2725)]
2025-09-14 12:10:44,434 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:10:44,440 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 42 minutes, 29 seconds)
2025-09-14 12:12:53,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:12:58,474 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4880.51611 ± 1029.716
2025-09-14 12:12:58,474 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5449.897), np.float32(5397.498), np.float32(5582.324), np.float32(5084.4736), np.float32(2314.4526), np.float32(5230.7896), np.float32(5478.3486), np.float32(3501.4785), np.float32(5327.509), np.float32(5438.392)]
2025-09-14 12:12:58,474 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:12:58,479 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 40 minutes, 16 seconds)
2025-09-14 12:15:07,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:15:12,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5286.68262 ± 175.913
2025-09-14 12:15:12,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5535.227), np.float32(5353.6426), np.float32(5324.5728), np.float32(5098.768), np.float32(5256.243), np.float32(5313.4297), np.float32(5376.143), np.float32(5070.388), np.float32(5542.526), np.float32(4995.882)]
2025-09-14 12:15:12,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:15:12,964 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 38 minutes, 2 seconds)
2025-09-14 12:17:21,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:17:27,188 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5363.11816 ± 134.508
2025-09-14 12:17:27,189 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5360.0503), np.float32(5185.5713), np.float32(5474.689), np.float32(5538.2637), np.float32(5519.917), np.float32(5175.665), np.float32(5288.2456), np.float32(5493.18), np.float32(5385.8), np.float32(5209.8013)]
2025-09-14 12:17:27,189 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:17:27,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 35 minutes, 47 seconds)
2025-09-14 12:19:36,143 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:19:41,597 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5207.55225 ± 94.772
2025-09-14 12:19:41,597 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5102.996), np.float32(5081.6978), np.float32(5269.486), np.float32(5171.0264), np.float32(5382.235), np.float32(5301.14), np.float32(5141.2227), np.float32(5304.1616), np.float32(5176.5), np.float32(5145.0576)]
2025-09-14 12:19:41,597 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:19:41,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 33 minutes, 34 seconds)
2025-09-14 12:21:50,420 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:21:55,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5162.49316 ± 288.398
2025-09-14 12:21:55,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4425.449), np.float32(5261.848), np.float32(5104.884), np.float32(5239.0547), np.float32(5257.8726), np.float32(5126.1094), np.float32(5445.107), np.float32(5380.9546), np.float32(4937.163), np.float32(5446.4937)]
2025-09-14 12:21:55,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:21:55,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 31 minutes, 20 seconds)
2025-09-14 12:24:04,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:24:09,832 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5275.22021 ± 264.973
2025-09-14 12:24:09,833 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5455.1294), np.float32(5345.4824), np.float32(5170.6724), np.float32(4597.3213), np.float32(5294.9277), np.float32(5616.814), np.float32(5125.738), np.float32(5496.4565), np.float32(5302.062), np.float32(5347.5957)]
2025-09-14 12:24:09,833 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:24:09,838 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 29 minutes, 5 seconds)
2025-09-14 12:26:18,165 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:26:23,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5055.65771 ± 646.101
2025-09-14 12:26:23,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5364.169), np.float32(5282.2876), np.float32(5392.561), np.float32(5391.993), np.float32(5305.35), np.float32(3173.3909), np.float32(4860.842), np.float32(5403.0054), np.float32(5207.642), np.float32(5175.3354)]
2025-09-14 12:26:23,633 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:26:23,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 26 minutes, 49 seconds)
2025-09-14 12:28:31,839 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:28:37,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5318.94775 ± 181.029
2025-09-14 12:28:37,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5328.1797), np.float32(5473.05), np.float32(5554.619), np.float32(5273.476), np.float32(5476.4272), np.float32(5430.078), np.float32(4903.783), np.float32(5318.3926), np.float32(5127.8213), np.float32(5303.6484)]
2025-09-14 12:28:37,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:28:37,316 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 24 minutes, 34 seconds)
2025-09-14 12:30:45,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:30:50,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5163.74072 ± 204.057
2025-09-14 12:30:50,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5238.9185), np.float32(4915.652), np.float32(5360.6855), np.float32(4811.1816), np.float32(5497.9663), np.float32(5060.2227), np.float32(5306.392), np.float32(5311.9907), np.float32(5030.3286), np.float32(5104.07)]
2025-09-14 12:30:50,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:30:50,858 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 22 minutes, 18 seconds)
2025-09-14 12:32:59,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:33:04,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4894.93652 ± 1111.574
2025-09-14 12:33:04,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5536.4775), np.float32(5436.6914), np.float32(1612.1705), np.float32(4992.3735), np.float32(5104.601), np.float32(5119.4854), np.float32(5009.6226), np.float32(5232.272), np.float32(5350.304), np.float32(5555.362)]
2025-09-14 12:33:04,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:04,552 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 20 minutes, 3 seconds)
2025-09-14 12:35:12,958 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:35:18,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5154.33301 ± 389.966
2025-09-14 12:35:18,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5240.552), np.float32(5294.9487), np.float32(5275.0034), np.float32(4037.1746), np.float32(4998.912), np.float32(5336.585), np.float32(5310.493), np.float32(5198.3677), np.float32(5427.9736), np.float32(5423.3145)]
2025-09-14 12:35:18,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:35:18,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 17 minutes, 49 seconds)
2025-09-14 12:37:26,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:37:32,102 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4985.31299 ± 1109.595
2025-09-14 12:37:32,102 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5414.313), np.float32(5127.9683), np.float32(4986.797), np.float32(1689.3894), np.float32(5467.643), np.float32(5362.85), np.float32(5473.0615), np.float32(5470.406), np.float32(5430.571), np.float32(5430.1284)]
2025-09-14 12:37:32,102 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:37:32,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 15 minutes, 35 seconds)
2025-09-14 12:39:40,962 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:39:46,424 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5258.78369 ± 166.515
2025-09-14 12:39:46,424 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4983.7354), np.float32(5371.3853), np.float32(5293.1494), np.float32(5365.4307), np.float32(4979.393), np.float32(5284.4106), np.float32(5395.8276), np.float32(5236.7393), np.float32(5519.9507), np.float32(5157.816)]
2025-09-14 12:39:46,425 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:46,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 13 minutes, 22 seconds)
2025-09-14 12:41:55,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:42:00,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5399.96191 ± 122.379
2025-09-14 12:42:00,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5504.185), np.float32(5277.417), np.float32(5489.2305), np.float32(5518.3047), np.float32(5181.1514), np.float32(5513.13), np.float32(5284.298), np.float32(5545.8457), np.float32(5349.4805), np.float32(5336.576)]
2025-09-14 12:42:00,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:42:00,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5399.96) for latency 6
2025-09-14 12:42:00,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 11 minutes, 9 seconds)
2025-09-14 12:44:09,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:44:14,587 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5386.24658 ± 111.218
2025-09-14 12:44:14,587 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5162.9756), np.float32(5252.125), np.float32(5313.8545), np.float32(5552.5923), np.float32(5452.347), np.float32(5475.452), np.float32(5363.7876), np.float32(5466.1147), np.float32(5376.6255), np.float32(5446.589)]
2025-09-14 12:44:14,587 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:44:14,593 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 8 minutes, 56 seconds)
2025-09-14 12:46:23,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:46:28,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4871.05322 ± 1170.078
2025-09-14 12:46:28,655 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5235.567), np.float32(1391.2659), np.float32(5022.4277), np.float32(5460.298), np.float32(5306.6284), np.float32(5388.1426), np.float32(4953.3994), np.float32(5372.0503), np.float32(5368.9907), np.float32(5211.759)]
2025-09-14 12:46:28,655 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:46:28,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 6 minutes, 42 seconds)
2025-09-14 12:48:37,294 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:48:42,744 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5347.31592 ± 125.937
2025-09-14 12:48:42,744 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5374.2676), np.float32(5392.5747), np.float32(5589.5317), np.float32(5142.6567), np.float32(5129.569), np.float32(5310.0967), np.float32(5403.0454), np.float32(5398.7036), np.float32(5367.0083), np.float32(5365.7085)]
2025-09-14 12:48:42,744 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:48:42,750 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 4 minutes, 28 seconds)
2025-09-14 12:50:51,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:50:57,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5335.88428 ± 191.736
2025-09-14 12:50:57,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5418.1196), np.float32(5455.694), np.float32(5469.621), np.float32(5324.3154), np.float32(5445.7305), np.float32(5118.51), np.float32(5636.9355), np.float32(5346.2656), np.float32(4932.2227), np.float32(5211.4287)]
2025-09-14 12:50:57,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:50:57,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 14 seconds)
2025-09-14 12:53:03,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:53:08,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4949.47754 ± 958.336
2025-09-14 12:53:08,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5104.5874), np.float32(2108.044), np.float32(5319.1284), np.float32(5324.581), np.float32(5528.168), np.float32(5002.9043), np.float32(5441.832), np.float32(5187.0454), np.float32(5192.24), np.float32(5286.241)]
2025-09-14 12:53:08,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:53:08,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1251 [DEBUG]: Training session finished
