2025-09-14 12:55:07,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.025-delay_18
2025-09-14 12:55:07,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.025-delay_18
2025-09-14 12:55:07,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'18': <latency_env.delayed_mdp.ConstantDelay object at 0x7f9a734966f0>}
2025-09-14 12:55:07,194 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 12:55:07,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 12:55:07,313 baseline-bpql-noisepromille25-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=125, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 12:55:07,313 baseline-bpql-noisepromille25-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 12:55:09,313 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 12:55:09,314 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 13:37:59,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:38:07,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -510.20892 ± 99.975
2025-09-14 13:38:07,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-443.95053), np.float32(-468.11838), np.float32(-380.9082), np.float32(-691.06067), np.float32(-521.2556), np.float32(-647.71436), np.float32(-386.32846), np.float32(-600.03925), np.float32(-492.58432), np.float32(-470.1295)]
2025-09-14 13:38:07,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:38:07,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-510.21) for latency 18
2025-09-14 13:38:07,133 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 70 hours, 53 minutes, 24 seconds)
2025-09-14 13:40:32,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:40:40,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -278.99762 ± 46.161
2025-09-14 13:40:40,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-254.18233), np.float32(-242.81296), np.float32(-379.63156), np.float32(-272.51562), np.float32(-260.69858), np.float32(-224.70282), np.float32(-305.97183), np.float32(-300.3133), np.float32(-226.04369), np.float32(-323.10358)]
2025-09-14 13:40:40,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:40:40,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-279.00) for latency 18
2025-09-14 13:40:40,011 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 37 hours, 10 minutes, 4 seconds)
2025-09-14 13:43:44,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:43:51,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -269.45233 ± 48.778
2025-09-14 13:43:51,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-300.35226), np.float32(-269.54916), np.float32(-185.70302), np.float32(-216.13484), np.float32(-274.86542), np.float32(-253.00713), np.float32(-380.1809), np.float32(-284.1075), np.float32(-256.0511), np.float32(-274.5718)]
2025-09-14 13:43:51,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:43:51,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-269.45) for latency 18
2025-09-14 13:43:51,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 26 hours, 14 minutes, 46 seconds)
2025-09-14 13:46:16,316 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:46:23,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -17.83836 ± 121.238
2025-09-14 13:46:23,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(77.09561), np.float32(-114.68931), np.float32(61.593212), np.float32(187.25713), np.float32(110.28425), np.float32(-207.52812), np.float32(-46.215332), np.float32(-175.5999), np.float32(-62.85541), np.float32(-7.7257075)]
2025-09-14 13:46:23,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:46:23,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-17.84) for latency 18
2025-09-14 13:46:23,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 20 hours, 29 minutes, 46 seconds)
2025-09-14 13:48:38,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:48:46,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -30.65126 ± 113.647
2025-09-14 13:48:46,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(53.14798), np.float32(-202.46407), np.float32(-5.284212), np.float32(180.45128), np.float32(-34.16013), np.float32(-78.72858), np.float32(-132.8416), np.float32(-146.25), np.float32(113.3725), np.float32(-53.755806)]
2025-09-14 13:48:46,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:48:46,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 16 hours, 58 minutes, 42 seconds)
2025-09-14 13:52:14,058 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:52:21,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 196.81934 ± 131.099
2025-09-14 13:52:21,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(323.4742), np.float32(162.43854), np.float32(131.1213), np.float32(-9.758358), np.float32(271.06552), np.float32(61.322502), np.float32(193.02357), np.float32(477.34976), np.float32(131.951), np.float32(226.2052)]
2025-09-14 13:52:21,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:52:21,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (196.82) for latency 18
2025-09-14 13:52:21,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 27 minutes, 41 seconds)
2025-09-14 13:54:51,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:54:59,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 547.92413 ± 64.368
2025-09-14 13:54:59,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(531.0417), np.float32(669.94214), np.float32(528.9707), np.float32(475.97922), np.float32(494.44037), np.float32(577.48584), np.float32(597.2491), np.float32(450.52643), np.float32(535.5596), np.float32(618.0461)]
2025-09-14 13:54:59,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:54:59,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (547.92) for latency 18
2025-09-14 13:54:59,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 26 minutes, 18 seconds)
2025-09-14 13:57:26,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:57:33,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 700.75610 ± 129.552
2025-09-14 13:57:33,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(715.93646), np.float32(754.2927), np.float32(693.17053), np.float32(455.57718), np.float32(525.6693), np.float32(583.1427), np.float32(815.54645), np.float32(867.4841), np.float32(793.4044), np.float32(803.3376)]
2025-09-14 13:57:33,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:57:33,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (700.76) for latency 18
2025-09-14 13:57:33,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 12 minutes, 3 seconds)
2025-09-14 14:00:20,687 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:00:28,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 870.62488 ± 68.944
2025-09-14 14:00:28,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(744.49414), np.float32(778.6387), np.float32(942.5978), np.float32(838.24304), np.float32(922.1523), np.float32(895.498), np.float32(840.566), np.float32(888.58685), np.float32(873.37775), np.float32(982.09467)]
2025-09-14 14:00:28,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:00:28,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (870.62) for latency 18
2025-09-14 14:00:28,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 16 minutes, 6 seconds)
2025-09-14 14:02:47,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:02:54,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 946.00763 ± 63.486
2025-09-14 14:02:54,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(912.6408), np.float32(933.6801), np.float32(846.403), np.float32(1012.98254), np.float32(844.35785), np.float32(920.7552), np.float32(1044.7909), np.float32(979.4509), np.float32(974.72595), np.float32(990.2891)]
2025-09-14 14:02:54,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:02:54,395 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (946.01) for latency 18
2025-09-14 14:02:54,398 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 14 minutes, 26 seconds)
2025-09-14 14:05:14,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:05:21,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1019.72949 ± 87.050
2025-09-14 14:05:21,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(839.6285), np.float32(1124.5204), np.float32(994.3654), np.float32(1137.6188), np.float32(1063.8408), np.float32(1080.1296), np.float32(953.1352), np.float32(937.5812), np.float32(1051.637), np.float32(1014.8384)]
2025-09-14 14:05:21,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:05:21,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1019.73) for latency 18
2025-09-14 14:05:21,751 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 3 hours, 51 minutes, 29 seconds)
2025-09-14 14:07:59,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:08:07,290 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 979.32483 ± 88.429
2025-09-14 14:08:07,290 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(891.7724), np.float32(1021.3131), np.float32(968.5064), np.float32(964.4254), np.float32(870.5708), np.float32(974.3943), np.float32(1028.4744), np.float32(1058.0112), np.float32(1160.1058), np.float32(855.67474)]
2025-09-14 14:08:07,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:08:07,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 3 hours, 51 minutes, 12 seconds)
2025-09-14 14:10:32,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:10:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1056.43469 ± 84.573
2025-09-14 14:10:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1216.1921), np.float32(991.3336), np.float32(1038.9642), np.float32(1011.465), np.float32(1085.1455), np.float32(959.20764), np.float32(1002.8741), np.float32(1157.1486), np.float32(1140.7025), np.float32(961.31366)]
2025-09-14 14:10:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:10:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1056.43) for latency 18
2025-09-14 14:10:40,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 3 hours, 48 minutes, 10 seconds)
2025-09-14 14:13:24,131 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:13:31,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1110.09949 ± 89.012
2025-09-14 14:13:31,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(999.09735), np.float32(1073.8839), np.float32(1165.9548), np.float32(1156.6892), np.float32(1108.9589), np.float32(1204.4133), np.float32(1286.7615), np.float32(1043.6204), np.float32(1070.8557), np.float32(990.7607)]
2025-09-14 14:13:31,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:13:31,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1110.10) for latency 18
2025-09-14 14:13:31,505 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 3 hours, 44 minutes, 35 seconds)
2025-09-14 14:15:52,744 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:16:00,093 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1090.84045 ± 72.167
2025-09-14 14:16:00,093 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(973.82367), np.float32(1084.1333), np.float32(1106.4742), np.float32(1023.0884), np.float32(1030.5225), np.float32(1167.45), np.float32(1199.5992), np.float32(1155.6372), np.float32(1146.5897), np.float32(1021.0864)]
2025-09-14 14:16:00,093 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:16:00,096 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 3 hours, 42 minutes, 36 seconds)
2025-09-14 14:18:13,823 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:18:21,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1147.31616 ± 76.053
2025-09-14 14:18:21,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1176.0272), np.float32(991.37317), np.float32(1064.0834), np.float32(1148.4391), np.float32(1124.8651), np.float32(1221.4883), np.float32(1230.1353), np.float32(1086.6064), np.float32(1219.3927), np.float32(1210.7524)]
2025-09-14 14:18:21,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:18:21,215 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1147.32) for latency 18
2025-09-14 14:18:21,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 3 hours, 38 minutes, 15 seconds)
2025-09-14 14:20:37,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:20:44,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1229.71204 ± 80.627
2025-09-14 14:20:44,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1199.058), np.float32(1180.9888), np.float32(1424.3776), np.float32(1185.6393), np.float32(1126.5659), np.float32(1230.3059), np.float32(1250.378), np.float32(1170.4778), np.float32(1215.481), np.float32(1313.8483)]
2025-09-14 14:20:44,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:20:44,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1229.71) for latency 18
2025-09-14 14:20:44,419 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 3 hours, 29 minutes, 28 seconds)
2025-09-14 14:23:02,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:23:10,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1124.90112 ± 161.091
2025-09-14 14:23:10,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1133.936), np.float32(1310.3164), np.float32(1034.1168), np.float32(1266.6515), np.float32(973.0964), np.float32(1100.6753), np.float32(1179.9742), np.float32(1120.5354), np.float32(1350.1722), np.float32(779.5375)]
2025-09-14 14:23:10,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:23:10,331 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 3 hours, 24 minutes, 59 seconds)
2025-09-14 14:25:30,728 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:25:38,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1207.87915 ± 54.097
2025-09-14 14:25:38,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1266.9043), np.float32(1131.3005), np.float32(1271.9675), np.float32(1182.3682), np.float32(1232.7117), np.float32(1188.4569), np.float32(1185.3978), np.float32(1236.1648), np.float32(1113.2123), np.float32(1270.3085)]
2025-09-14 14:25:38,088 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:25:38,091 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 3 hours, 16 minutes, 10 seconds)
2025-09-14 14:27:54,938 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:28:02,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1206.78552 ± 75.706
2025-09-14 14:28:02,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1258.4149), np.float32(1269.3756), np.float32(1130.6484), np.float32(1346.7078), np.float32(1118.9387), np.float32(1167.6001), np.float32(1234.316), np.float32(1122.7412), np.float32(1273.8445), np.float32(1145.2684)]
2025-09-14 14:28:02,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:28:02,313 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 12 minutes, 35 seconds)
2025-09-14 14:31:19,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:31:26,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1286.90552 ± 193.242
2025-09-14 14:31:26,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1163.2781), np.float32(867.4493), np.float32(1272.1077), np.float32(1453.2283), np.float32(1370.3013), np.float32(1251.3606), np.float32(1175.5374), np.float32(1385.6521), np.float32(1291.2024), np.float32(1638.9371)]
2025-09-14 14:31:26,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:31:26,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1286.91) for latency 18
2025-09-14 14:31:26,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 26 minutes, 49 seconds)
2025-09-14 14:38:18,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:38:25,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1205.99719 ± 177.730
2025-09-14 14:38:25,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1100.3755), np.float32(1307.6549), np.float32(1334.4464), np.float32(949.7834), np.float32(1345.5165), np.float32(817.47437), np.float32(1294.0072), np.float32(1245.6074), np.float32(1348.0758), np.float32(1317.0304)]
2025-09-14 14:38:25,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:38:25,412 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 35 minutes, 51 seconds)
2025-09-14 14:40:55,126 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:41:02,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1295.50757 ± 133.695
2025-09-14 14:41:02,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1233.8), np.float32(1665.1488), np.float32(1329.6683), np.float32(1282.043), np.float32(1182.9026), np.float32(1337.2457), np.float32(1249.7058), np.float32(1216.8678), np.float32(1280.6221), np.float32(1177.07)]
2025-09-14 14:41:02,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:41:02,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1295.51) for latency 18
2025-09-14 14:41:02,547 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 35 minutes, 12 seconds)
2025-09-14 14:46:44,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:46:51,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1121.22290 ± 461.905
2025-09-14 14:46:51,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1160.8623), np.float32(1162.4404), np.float32(1065.6129), np.float32(-184.74129), np.float32(1177.964), np.float32(1532.8551), np.float32(1325.4188), np.float32(1101.2932), np.float32(1365.7614), np.float32(1504.7627)]
2025-09-14 14:46:51,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:46:51,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 5 hours, 22 minutes, 35 seconds)
2025-09-14 14:51:36,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:51:44,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1170.80054 ± 104.126
2025-09-14 14:51:44,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1181.6896), np.float32(1188.3385), np.float32(956.52496), np.float32(1292.536), np.float32(1164.1829), np.float32(1160.6666), np.float32(1198.3267), np.float32(1205.348), np.float32(1030.2137), np.float32(1330.1776)]
2025-09-14 14:51:44,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:51:44,167 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 5 hours, 55 minutes, 27 seconds)
2025-09-14 14:54:08,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:54:16,268 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1318.63062 ± 163.364
2025-09-14 14:54:16,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1290.3907), np.float32(1264.0763), np.float32(1283.8508), np.float32(1423.1388), np.float32(1712.0314), np.float32(1254.6863), np.float32(1062.1372), np.float32(1427.3453), np.float32(1222.1609), np.float32(1246.4877)]
2025-09-14 14:54:16,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:54:16,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1318.63) for latency 18
2025-09-14 14:54:16,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 5 hours, 37 minutes, 50 seconds)
2025-09-14 15:07:11,444 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:07:18,827 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1342.65625 ± 146.991
2025-09-14 15:07:18,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1162.346), np.float32(1244.0947), np.float32(1395.4062), np.float32(1609.7223), np.float32(1242.2554), np.float32(1542.9038), np.float32(1214.746), np.float32(1486.8413), np.float32(1258.3652), np.float32(1269.8812)]
2025-09-14 15:07:18,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:07:18,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1342.66) for latency 18
2025-09-14 15:07:18,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 7 hours, 1 minute, 47 seconds)
2025-09-14 15:10:02,141 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:10:09,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1301.79297 ± 121.976
2025-09-14 15:10:09,513 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1127.1713), np.float32(1177.0834), np.float32(1500.014), np.float32(1377.4847), np.float32(1184.2104), np.float32(1450.5479), np.float32(1305.6213), np.float32(1304.0383), np.float32(1191.7844), np.float32(1399.9736)]
2025-09-14 15:10:09,513 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:10:09,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 6 hours, 59 minutes, 16 seconds)
2025-09-14 15:15:19,685 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:15:27,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1338.00977 ± 210.199
2025-09-14 15:15:27,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1743.7489), np.float32(1660.0414), np.float32(1363.0021), np.float32(1115.9531), np.float32(1333.3015), np.float32(1415.3428), np.float32(1231.5634), np.float32(1268.0415), np.float32(1053.0857), np.float32(1196.0181)]
2025-09-14 15:15:27,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:15:27,050 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 6 hours, 46 minutes, 1 second)
2025-09-14 15:18:24,604 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:18:31,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1431.42200 ± 239.960
2025-09-14 15:18:31,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1547.8896), np.float32(1248.7848), np.float32(1842.6327), np.float32(1813.0499), np.float32(1380.9277), np.float32(1182.6342), np.float32(1248.5787), np.float32(1493.9542), np.float32(1458.4668), np.float32(1097.3015)]
2025-09-14 15:18:31,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:18:31,971 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1431.42) for latency 18
2025-09-14 15:18:31,974 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 6 hours, 15 minutes, 9 seconds)
2025-09-14 15:22:19,552 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:22:26,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1416.66345 ± 162.084
2025-09-14 15:22:26,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1313.3492), np.float32(1327.7385), np.float32(1428.0472), np.float32(1766.7098), np.float32(1338.0099), np.float32(1338.392), np.float32(1401.918), np.float32(1347.7003), np.float32(1677.8059), np.float32(1226.9644)]
2025-09-14 15:22:26,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:22:26,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 6 hours, 28 minutes, 51 seconds)
2025-09-14 15:25:11,887 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:25:19,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1420.16760 ± 116.088
2025-09-14 15:25:19,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1458.8226), np.float32(1419.9048), np.float32(1466.654), np.float32(1306.37), np.float32(1425.6129), np.float32(1280.1912), np.float32(1666.2814), np.float32(1546.4445), np.float32(1305.9424), np.float32(1325.4525)]
2025-09-14 15:25:19,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:25:19,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 4 hours, 4 minutes, 53 seconds)
2025-09-14 15:29:55,964 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:30:03,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1431.65051 ± 195.609
2025-09-14 15:30:03,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1300.5059), np.float32(1483.9406), np.float32(1366.8696), np.float32(1758.424), np.float32(1171.8124), np.float32(1788.7524), np.float32(1330.3351), np.float32(1467.2676), np.float32(1219.1858), np.float32(1429.4117)]
2025-09-14 15:30:03,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:30:03,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1431.65) for latency 18
2025-09-14 15:30:03,373 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 4 hours, 26 minutes, 37 seconds)
2025-09-14 15:32:40,437 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:32:47,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1374.84326 ± 150.128
2025-09-14 15:32:47,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1448.2366), np.float32(1392.9858), np.float32(1256.9209), np.float32(1509.1511), np.float32(1188.2158), np.float32(1438.1245), np.float32(1302.7422), np.float32(1285.9215), np.float32(1708.2375), np.float32(1217.8961)]
2025-09-14 15:32:47,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:32:47,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 48 minutes, 57 seconds)
2025-09-14 15:35:24,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:35:31,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1292.25977 ± 60.058
2025-09-14 15:35:31,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1337.1925), np.float32(1178.1858), np.float32(1222.473), np.float32(1284.7877), np.float32(1304.4504), np.float32(1352.651), np.float32(1226.7295), np.float32(1344.2289), np.float32(1308.5352), np.float32(1363.3635)]
2025-09-14 15:35:31,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:35:31,875 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 40 minutes, 58 seconds)
2025-09-14 15:39:28,373 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:39:35,732 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1444.57739 ± 552.683
2025-09-14 15:39:35,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1368.6641), np.float32(1387.1993), np.float32(-95.015114), np.float32(1677.1049), np.float32(1791.5203), np.float32(1648.9493), np.float32(1933.9109), np.float32(1461.2827), np.float32(1906.1794), np.float32(1365.9778)]
2025-09-14 15:39:35,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:39:35,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1444.58) for latency 18
2025-09-14 15:39:35,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 39 minutes, 28 seconds)
2025-09-14 15:43:00,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:43:07,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1459.02258 ± 259.844
2025-09-14 15:43:07,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1301.9508), np.float32(1334.5945), np.float32(1579.3331), np.float32(1298.0874), np.float32(1404.6337), np.float32(1786.5648), np.float32(1197.5718), np.float32(2064.6814), np.float32(1364.583), np.float32(1258.2249)]
2025-09-14 15:43:07,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:43:07,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1459.02) for latency 18
2025-09-14 15:43:07,921 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 44 minutes, 25 seconds)
2025-09-14 15:46:20,242 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:46:27,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1334.79639 ± 92.036
2025-09-14 15:46:27,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1443.1947), np.float32(1377.4696), np.float32(1340.3269), np.float32(1470.1097), np.float32(1200.6785), np.float32(1253.2944), np.float32(1398.1748), np.float32(1192.6561), np.float32(1380.4203), np.float32(1291.6396)]
2025-09-14 15:46:27,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:46:27,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 23 minutes, 24 seconds)
2025-09-14 15:49:04,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:49:11,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1479.57581 ± 206.737
2025-09-14 15:49:11,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1268.8701), np.float32(1501.735), np.float32(1253.1426), np.float32(1500.8782), np.float32(1233.0035), np.float32(1322.8147), np.float32(1580.333), np.float32(1910.1377), np.float32(1697.8223), np.float32(1527.0205)]
2025-09-14 15:49:11,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:49:11,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1479.58) for latency 18
2025-09-14 15:49:11,520 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 20 minutes, 1 second)
2025-09-14 15:51:48,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:51:56,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1530.42639 ± 364.572
2025-09-14 15:51:56,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1434.1405), np.float32(1265.12), np.float32(1444.7616), np.float32(1386.5084), np.float32(1488.9086), np.float32(2575.9658), np.float32(1634.0945), np.float32(1278.709), np.float32(1491.24), np.float32(1304.815)]
2025-09-14 15:51:56,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:51:56,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1530.43) for latency 18
2025-09-14 15:51:56,163 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 16 minutes, 51 seconds)
2025-09-14 15:54:34,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:54:42,317 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1429.20056 ± 249.506
2025-09-14 15:54:42,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1913.7373), np.float32(1867.9436), np.float32(1266.635), np.float32(1223.0597), np.float32(1274.5145), np.float32(1210.2004), np.float32(1274.3962), np.float32(1501.4866), np.float32(1278.9642), np.float32(1481.0679)]
2025-09-14 15:54:42,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:54:42,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 2 hours, 58 minutes, 17 seconds)
2025-09-14 15:57:23,117 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:57:30,477 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1520.22998 ± 170.235
2025-09-14 15:57:30,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1493.2064), np.float32(1508.8148), np.float32(1396.1721), np.float32(1405.337), np.float32(1706.3237), np.float32(1812.9417), np.float32(1705.1532), np.float32(1305.8707), np.float32(1584.4116), np.float32(1284.0695)]
2025-09-14 15:57:30,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:57:30,481 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 46 minutes, 45 seconds)
2025-09-14 16:00:18,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:00:25,661 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1392.84534 ± 555.196
2025-09-14 16:00:25,661 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1397.0342), np.float32(1481.4563), np.float32(1524.0664), np.float32(2291.2395), np.float32(1313.8536), np.float32(1520.7212), np.float32(1572.5956), np.float32(-82.54178), np.float32(1381.4407), np.float32(1528.5878)]
2025-09-14 16:00:25,662 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:00:25,666 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 39 minutes, 13 seconds)
2025-09-14 16:06:24,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:06:31,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1386.28784 ± 486.416
2025-09-14 16:06:31,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1303.5308), np.float32(1618.2174), np.float32(1754.1301), np.float32(2084.5269), np.float32(1901.185), np.float32(479.74603), np.float32(1246.111), np.float32(1509.8828), np.float32(1325.5662), np.float32(639.9824)]
2025-09-14 16:06:31,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:06:31,724 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 14 minutes, 10 seconds)
2025-09-14 16:10:31,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:10:38,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1516.63635 ± 242.398
2025-09-14 16:10:38,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1687.667), np.float32(1467.8375), np.float32(1578.1947), np.float32(2029.9911), np.float32(1342.0509), np.float32(1297.0018), np.float32(1281.7494), np.float32(1808.9603), np.float32(1383.0142), np.float32(1289.8964)]
2025-09-14 16:10:38,393 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:10:38,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 3 hours, 25 minutes, 44 seconds)
2025-09-14 16:27:49,911 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:27:57,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1546.09790 ± 249.333
2025-09-14 16:27:57,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1539.6173), np.float32(1426.8252), np.float32(1746.0223), np.float32(1280.9961), np.float32(1317.6163), np.float32(1275.2632), np.float32(1331.0574), np.float32(2008.3707), np.float32(1685.3613), np.float32(1849.8481)]
2025-09-14 16:27:57,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:27:57,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1546.10) for latency 18
2025-09-14 16:27:57,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 5 hours, 59 minutes, 5 seconds)
2025-09-14 16:45:01,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:45:08,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1411.93359 ± 660.631
2025-09-14 16:45:08,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1865.3972), np.float32(1354.9617), np.float32(1468.9731), np.float32(2032.5647), np.float32(1342.3341), np.float32(2012.8478), np.float32(-416.56732), np.float32(1429.9784), np.float32(1649.1967), np.float32(1379.65)]
2025-09-14 16:45:08,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:45:08,887 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 8 hours, 24 minutes, 59 seconds)
2025-09-14 16:47:42,770 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:47:50,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1515.05151 ± 221.838
2025-09-14 16:47:50,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1304.2185), np.float32(1443.9716), np.float32(1591.2285), np.float32(1975.9478), np.float32(1424.4193), np.float32(1300.792), np.float32(1302.601), np.float32(1359.7322), np.float32(1807.0372), np.float32(1640.5682)]
2025-09-14 16:47:50,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:47:50,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 8 hours, 13 minutes, 2 seconds)
2025-09-14 16:50:10,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:50:17,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1591.94043 ± 251.583
2025-09-14 16:50:17,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1383.8313), np.float32(1679.6357), np.float32(1708.0526), np.float32(1306.81), np.float32(1500.304), np.float32(1262.4885), np.float32(1974.4209), np.float32(1461.402), np.float32(1595.6837), np.float32(2046.7755)]
2025-09-14 16:50:17,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:50:17,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1591.94) for latency 18
2025-09-14 16:50:17,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 7 hours, 26 minutes, 25 seconds)
2025-09-14 16:52:45,261 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:52:52,594 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2054.26074 ± 596.521
2025-09-14 16:52:52,595 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2548.5103), np.float32(1196.3124), np.float32(1899.6033), np.float32(2185.1343), np.float32(2975.0544), np.float32(1358.2948), np.float32(1483.5923), np.float32(2589.4998), np.float32(2688.5479), np.float32(1618.056)]
2025-09-14 16:52:52,595 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:52:52,595 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2054.26) for latency 18
2025-09-14 16:52:52,599 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 7 hours, 2 minutes, 22 seconds)
2025-09-14 16:55:21,672 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:55:29,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1795.53247 ± 483.799
2025-09-14 16:55:29,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2122.5461), np.float32(2749.1672), np.float32(1342.999), np.float32(2034.5142), np.float32(1624.558), np.float32(1322.06), np.float32(1614.6058), np.float32(1376.9991), np.float32(1342.9153), np.float32(2424.96)]
2025-09-14 16:55:29,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:55:29,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 4 hours, 29 minutes, 47 seconds)
2025-09-14 16:57:46,709 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:57:54,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1983.12268 ± 557.681
2025-09-14 16:57:54,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2122.427), np.float32(1416.9473), np.float32(2855.958), np.float32(1580.9791), np.float32(1601.8145), np.float32(2223.8591), np.float32(2335.9077), np.float32(1391.6047), np.float32(2896.7134), np.float32(1405.0165)]
2025-09-14 16:57:54,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:57:54,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 2 minutes, 25 seconds)
2025-09-14 17:00:16,448 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:00:23,786 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2004.01343 ± 1017.996
2025-09-14 17:00:23,786 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1419.7999), np.float32(-91.60086), np.float32(1827.5145), np.float32(3043.2952), np.float32(1759.3243), np.float32(1313.1383), np.float32(2798.1833), np.float32(3151.0386), np.float32(1486.072), np.float32(3333.3696)]
2025-09-14 17:00:23,786 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:00:23,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 1 hour, 58 minutes, 4 seconds)
2025-09-14 17:02:39,545 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:02:46,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1494.07886 ± 185.985
2025-09-14 17:02:46,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1286.5736), np.float32(1476.4752), np.float32(1431.9363), np.float32(1355.6486), np.float32(1309.879), np.float32(1843.3765), np.float32(1591.7374), np.float32(1810.3256), np.float32(1383.7162), np.float32(1451.1198)]
2025-09-14 17:02:46,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:02:46,905 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 1 hour, 54 minutes, 51 seconds)
2025-09-14 17:05:06,922 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:05:14,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1816.82495 ± 681.299
2025-09-14 17:05:14,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1820.6517), np.float32(1515.8833), np.float32(1789.1344), np.float32(1642.6029), np.float32(536.7742), np.float32(3054.5444), np.float32(2786.9878), np.float32(2144.3662), np.float32(1525.1697), np.float32(1352.1357)]
2025-09-14 17:05:14,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:05:14,283 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 51 minutes, 15 seconds)
2025-09-14 17:07:43,551 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:07:50,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1816.77307 ± 344.225
2025-09-14 17:07:50,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2204.4236), np.float32(2020.7668), np.float32(1842.2423), np.float32(1851.2015), np.float32(2130.0852), np.float32(1677.0366), np.float32(1498.771), np.float32(2322.7625), np.float32(1270.4987), np.float32(1349.9438)]
2025-09-14 17:07:50,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:07:50,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 48 minutes, 48 seconds)
2025-09-14 17:10:13,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:10:20,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2063.48169 ± 433.019
2025-09-14 17:10:20,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1860.7618), np.float32(2290.2112), np.float32(1858.0164), np.float32(1861.7966), np.float32(2893.5413), np.float32(2203.8442), np.float32(2472.4666), np.float32(1487.0006), np.float32(1404.5613), np.float32(2302.6155)]
2025-09-14 17:10:20,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:10:20,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2063.48) for latency 18
2025-09-14 17:10:20,542 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 46 minutes, 59 seconds)
2025-09-14 17:12:42,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:12:49,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2038.71069 ± 491.777
2025-09-14 17:12:49,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2245.2366), np.float32(2516.4453), np.float32(1365.6681), np.float32(2400.1238), np.float32(1401.5123), np.float32(1497.3542), np.float32(1546.6107), np.float32(2380.6567), np.float32(2360.6423), np.float32(2672.8577)]
2025-09-14 17:12:49,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:12:49,833 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 44 minutes, 26 seconds)
2025-09-14 17:15:10,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:15:17,695 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1992.91211 ± 653.929
2025-09-14 17:15:17,695 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2021.1068), np.float32(2084.0283), np.float32(2252.946), np.float32(3211.7239), np.float32(1421.5797), np.float32(1427.3917), np.float32(1540.6227), np.float32(1435.8188), np.float32(1429.4956), np.float32(3104.408)]
2025-09-14 17:15:17,696 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:15:17,700 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 42 minutes, 36 seconds)
2025-09-14 17:17:35,469 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:17:42,811 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1808.83765 ± 365.400
2025-09-14 17:17:42,811 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1738.7637), np.float32(1934.4376), np.float32(1554.9357), np.float32(2517.118), np.float32(1912.5463), np.float32(1398.292), np.float32(2203.1277), np.float32(2050.1116), np.float32(1346.3662), np.float32(1432.6774)]
2025-09-14 17:17:42,811 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:17:42,816 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 39 minutes, 48 seconds)
2025-09-14 17:19:58,708 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:20:06,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1882.73315 ± 346.824
2025-09-14 17:20:06,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1888.1821), np.float32(2321.434), np.float32(1524.5868), np.float32(2147.4412), np.float32(1559.512), np.float32(1391.1694), np.float32(2487.1174), np.float32(1950.2517), np.float32(1971.8185), np.float32(1585.8174)]
2025-09-14 17:20:06,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:20:06,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 35 minutes, 34 seconds)
2025-09-14 17:22:24,878 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:22:32,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2423.27368 ± 532.922
2025-09-14 17:22:32,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1439.5073), np.float32(2696.682), np.float32(2050.281), np.float32(2515.3176), np.float32(1883.9025), np.float32(1915.906), np.float32(3116.4763), np.float32(2974.4104), np.float32(2786.1301), np.float32(2854.1216)]
2025-09-14 17:22:32,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:22:32,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2423.27) for latency 18
2025-09-14 17:22:32,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 32 minutes, 40 seconds)
2025-09-14 17:24:59,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:25:06,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1947.22876 ± 519.076
2025-09-14 17:25:06,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2290.2048), np.float32(1656.5347), np.float32(2992.4932), np.float32(1635.7202), np.float32(1712.0424), np.float32(1290.0383), np.float32(2054.6821), np.float32(2597.812), np.float32(1911.4425), np.float32(1331.3156)]
2025-09-14 17:25:06,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:25:06,835 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 30 minutes, 53 seconds)
2025-09-14 17:27:32,571 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:27:39,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2036.54004 ± 531.947
2025-09-14 17:27:39,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1357.3226), np.float32(2253.7983), np.float32(1823.3477), np.float32(1722.9498), np.float32(2919.9346), np.float32(2936.0737), np.float32(1580.3181), np.float32(1505.735), np.float32(2332.1033), np.float32(1933.817)]
2025-09-14 17:27:39,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:27:39,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 29 minutes, 4 seconds)
2025-09-14 17:30:07,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:30:15,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2673.32227 ± 776.856
2025-09-14 17:30:15,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3271.3442), np.float32(3647.2522), np.float32(1930.1556), np.float32(3650.224), np.float32(1323.247), np.float32(2945.266), np.float32(1771.8816), np.float32(2767.6157), np.float32(3213.5833), np.float32(2212.6528)]
2025-09-14 17:30:15,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:30:15,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2673.32) for latency 18
2025-09-14 17:30:15,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 27 minutes, 46 seconds)
2025-09-14 17:32:33,518 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:32:40,863 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1932.53101 ± 515.103
2025-09-14 17:32:40,863 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1334.0128), np.float32(1745.1327), np.float32(2148.9492), np.float32(2694.1162), np.float32(1502.1426), np.float32(1483.899), np.float32(2702.5532), np.float32(2382.637), np.float32(2058.9563), np.float32(1272.9094)]
2025-09-14 17:32:40,863 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:32:40,868 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 25 minutes, 32 seconds)
2025-09-14 17:35:05,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:35:13,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2717.81689 ± 1346.375
2025-09-14 17:35:13,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1767.7235), np.float32(-185.35786), np.float32(4083.335), np.float32(4118.321), np.float32(2199.2075), np.float32(2115.1458), np.float32(2624.3904), np.float32(4073.3845), np.float32(4201.9062), np.float32(2180.112)]
2025-09-14 17:35:13,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:35:13,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2717.82) for latency 18
2025-09-14 17:35:13,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 23 minutes, 43 seconds)
2025-09-14 17:37:34,672 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:37:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2348.46436 ± 929.604
2025-09-14 17:37:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3436.2566), np.float32(1700.1295), np.float32(1677.7798), np.float32(1529.272), np.float32(1448.8474), np.float32(2313.5476), np.float32(1923.7856), np.float32(1740.1776), np.float32(3891.3022), np.float32(3823.547)]
2025-09-14 17:37:42,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:37:42,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 20 minutes, 33 seconds)
2025-09-14 17:40:12,157 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:40:19,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2736.67310 ± 1142.914
2025-09-14 17:40:19,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2491.4824), np.float32(2263.322), np.float32(4124.2827), np.float32(4128.8677), np.float32(1857.9648), np.float32(1464.0159), np.float32(1337.0023), np.float32(4016.1252), np.float32(4046.9702), np.float32(1636.6968)]
2025-09-14 17:40:19,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:40:19,501 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2736.67) for latency 18
2025-09-14 17:40:19,505 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 18 minutes, 29 seconds)
2025-09-14 17:42:37,096 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:42:44,407 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3914.39380 ± 826.491
2025-09-14 17:42:44,407 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4359.1714), np.float32(4353.3706), np.float32(4272.1694), np.float32(4090.894), np.float32(4454.0986), np.float32(3886.9717), np.float32(4482.455), np.float32(1617.5938), np.float32(4244.2153), np.float32(3382.9966)]
2025-09-14 17:42:44,407 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:42:44,407 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (3914.39) for latency 18
2025-09-14 17:42:44,412 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 14 minutes, 55 seconds)
2025-09-14 17:45:05,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:45:12,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3186.43311 ± 726.223
2025-09-14 17:45:12,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3306.0474), np.float32(3423.258), np.float32(2295.6272), np.float32(3812.706), np.float32(1585.081), np.float32(3708.8242), np.float32(3947.3433), np.float32(3823.1711), np.float32(3217.695), np.float32(2744.5762)]
2025-09-14 17:45:12,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:45:12,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 12 minutes, 39 seconds)
2025-09-14 17:47:31,985 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:47:39,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2596.12427 ± 982.488
2025-09-14 17:47:39,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2211.1228), np.float32(2446.7598), np.float32(2819.0488), np.float32(1734.4724), np.float32(1966.5413), np.float32(4175.6934), np.float32(3049.5554), np.float32(1438.4851), np.float32(1677.2041), np.float32(4442.3604)]
2025-09-14 17:47:39,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:47:39,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 9 minutes, 37 seconds)
2025-09-14 17:50:02,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:50:09,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3849.28052 ± 1035.205
2025-09-14 17:50:09,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4645.1445), np.float32(4613.385), np.float32(4647.901), np.float32(4644.98), np.float32(2708.632), np.float32(4534.4365), np.float32(3577.7253), np.float32(3274.0525), np.float32(1450.4901), np.float32(4396.059)]
2025-09-14 17:50:09,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:50:09,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 7 minutes, 17 seconds)
2025-09-14 17:52:22,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:52:30,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3212.36206 ± 1489.186
2025-09-14 17:52:30,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4687.0415), np.float32(4620.9536), np.float32(139.70633), np.float32(4122.5483), np.float32(2164.1538), np.float32(4509.82), np.float32(4688.8003), np.float32(3052.4412), np.float32(2339.7317), np.float32(1798.4244)]
2025-09-14 17:52:30,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:52:30,340 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 3 minutes, 20 seconds)
2025-09-14 17:54:46,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:54:53,868 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3554.52612 ± 1245.628
2025-09-14 17:54:53,868 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4455.52), np.float32(4544.929), np.float32(4530.967), np.float32(1550.5175), np.float32(3272.0417), np.float32(1432.9474), np.float32(4642.4995), np.float32(4283.244), np.float32(2316.3242), np.float32(4516.271)]
2025-09-14 17:54:53,868 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:54:53,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 47 seconds)
2025-09-14 18:04:14,090 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:04:21,442 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3394.07031 ± 1074.477
2025-09-14 18:04:21,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4687.0728), np.float32(3426.3953), np.float32(3595.6384), np.float32(4672.067), np.float32(3628.562), np.float32(3506.1191), np.float32(4634.638), np.float32(2241.222), np.float32(1683.5433), np.float32(1865.4443)]
2025-09-14 18:04:21,443 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:04:21,448 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 31 minutes, 54 seconds)
2025-09-14 18:10:12,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:10:19,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4588.99756 ± 667.467
2025-09-14 18:10:19,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4844.959), np.float32(2952.9033), np.float32(4918.248), np.float32(3636.8948), np.float32(5021.126), np.float32(4967.9067), np.float32(4956.263), np.float32(4803.6616), np.float32(4877.1846), np.float32(4910.8325)]
2025-09-14 18:10:19,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:10:19,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4589.00) for latency 18
2025-09-14 18:10:19,947 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 44 minutes, 18 seconds)
2025-09-14 18:12:45,394 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:12:52,747 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3743.86279 ± 1260.297
2025-09-14 18:12:52,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1495.8771), np.float32(3793.607), np.float32(4484.691), np.float32(2820.2256), np.float32(5075.566), np.float32(1949.1539), np.float32(4887.273), np.float32(3112.5242), np.float32(4965.3794), np.float32(4854.3335)]
2025-09-14 18:12:52,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:12:52,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 39 minutes, 57 seconds)
2025-09-14 18:15:14,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:15:22,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4638.95703 ± 674.359
2025-09-14 18:15:22,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2819.0945), np.float32(5074.9443), np.float32(5049.76), np.float32(4815.4004), np.float32(4920.3965), np.float32(4870.179), np.float32(4967.15), np.float32(4061.0571), np.float32(4651.848), np.float32(5159.739)]
2025-09-14 18:15:22,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:15:22,156 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4638.96) for latency 18
2025-09-14 18:15:22,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 36 minutes, 1 second)
2025-09-14 18:21:07,962 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:21:15,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4269.58398 ± 947.728
2025-09-14 18:21:15,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4885.6235), np.float32(4494.2485), np.float32(2058.6892), np.float32(4648.5), np.float32(5035.862), np.float32(4405.885), np.float32(4746.346), np.float32(4551.1294), np.float32(5026.5083), np.float32(2843.0457)]
2025-09-14 18:21:15,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:21:15,307 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 45 minutes, 25 seconds)
2025-09-14 18:23:38,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:23:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4638.20068 ± 983.351
2025-09-14 18:23:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4886.869), np.float32(5062.7), np.float32(4993.2983), np.float32(5050.3247), np.float32(5144.4526), np.float32(5054.7827), np.float32(1722.3787), np.float32(4824.4053), np.float32(4599.8413), np.float32(5042.958)]
2025-09-14 18:23:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:23:45,653 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 1 hour, 13 minutes, 43 seconds)
2025-09-14 18:27:52,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:28:00,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4223.62402 ± 1186.915
2025-09-14 18:28:00,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2909.1216), np.float32(4815.262), np.float32(5163.175), np.float32(5221.234), np.float32(5249.3657), np.float32(4933.496), np.float32(4659.286), np.float32(4837.1357), np.float32(2143.12), np.float32(2305.0376)]
2025-09-14 18:28:00,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:28:00,078 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 1 hour, 3 minutes, 36 seconds)
2025-09-14 18:49:44,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:49:51,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3939.53198 ± 836.249
2025-09-14 18:49:51,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4760.6353), np.float32(4510.63), np.float32(3720.0063), np.float32(5077.139), np.float32(4487.757), np.float32(4399.4014), np.float32(2276.5542), np.float32(2942.6052), np.float32(3895.1123), np.float32(3325.481)]
2025-09-14 18:49:51,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:49:51,652 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 2 hours, 5 minutes, 44 seconds)
2025-09-14 18:53:06,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:53:13,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4357.94629 ± 1346.326
2025-09-14 18:53:13,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5079.3877), np.float32(4800.9727), np.float32(5187.735), np.float32(2035.4038), np.float32(5252.3604), np.float32(4984.912), np.float32(4931.4336), np.float32(1359.5822), np.float32(4807.0293), np.float32(5140.6396)]
2025-09-14 18:53:13,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:53:13,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 2 hours, 1 minute, 9 seconds)
2025-09-14 18:56:13,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:56:20,670 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3765.71606 ± 1154.607
2025-09-14 18:56:20,670 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2850.382), np.float32(2441.6218), np.float32(1875.1075), np.float32(4548.2334), np.float32(4548.9443), np.float32(2493.9453), np.float32(4511.5615), np.float32(5289.122), np.float32(4218.9463), np.float32(4879.2983)]
2025-09-14 18:56:20,670 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:56:20,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 1 hour, 45 minutes, 16 seconds)
2025-09-14 18:59:51,740 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:59:59,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4263.94678 ± 1472.840
2025-09-14 18:59:59,086 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5209.886), np.float32(2685.4224), np.float32(559.8429), np.float32(5064.58), np.float32(5130.281), np.float32(5253.4497), np.float32(5254.789), np.float32(4950.8267), np.float32(3648.1848), np.float32(4882.208)]
2025-09-14 18:59:59,086 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 18:59:59,091 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 1 hour, 41 minutes, 25 seconds)
2025-09-14 19:03:13,572 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:03:20,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4659.17041 ± 798.143
2025-09-14 19:03:20,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5278.4204), np.float32(3775.5056), np.float32(5067.3677), np.float32(4779.56), np.float32(5070.88), np.float32(4829.8013), np.float32(4861.4497), np.float32(5005.1357), np.float32(5320.6436), np.float32(2602.9385)]
2025-09-14 19:03:20,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:03:20,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4659.17) for latency 18
2025-09-14 19:03:20,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 1 hour, 31 minutes, 54 seconds)
2025-09-14 19:06:33,640 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:06:40,985 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4080.90503 ± 1237.298
2025-09-14 19:06:40,985 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4836.9004), np.float32(4920.354), np.float32(4247.2437), np.float32(4567.987), np.float32(4488.8086), np.float32(1933.0933), np.float32(4303.753), np.float32(5126.892), np.float32(4961.819), np.float32(1422.201)]
2025-09-14 19:06:40,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:06:40,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 40 minutes, 22 seconds)
2025-09-14 19:10:01,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:10:08,870 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4735.10840 ± 1176.146
2025-09-14 19:10:08,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5071.031), np.float32(5206.1045), np.float32(5030.119), np.float32(5227.574), np.float32(5180.948), np.float32(5385.8364), np.float32(4410.6733), np.float32(5302.8774), np.float32(1291.339), np.float32(5244.583)]
2025-09-14 19:10:08,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:10:08,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4735.11) for latency 18
2025-09-14 19:10:08,876 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 37 minutes, 13 seconds)
2025-09-14 19:13:06,852 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:13:14,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5140.95361 ± 158.683
2025-09-14 19:13:14,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5259.828), np.float32(5364.091), np.float32(5213.3286), np.float32(5081.735), np.float32(4881.8647), np.float32(4958.3228), np.float32(5314.6104), np.float32(5175.8457), np.float32(4938.8174), np.float32(5221.0947)]
2025-09-14 19:13:14,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:13:14,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5140.95) for latency 18
2025-09-14 19:13:14,192 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 33 minutes, 47 seconds)
2025-09-14 19:16:02,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:16:10,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5271.32178 ± 130.144
2025-09-14 19:16:10,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5299.548), np.float32(5049.5737), np.float32(5234.1274), np.float32(5288.247), np.float32(5381.518), np.float32(5394.3984), np.float32(5250.921), np.float32(5395.9043), np.float32(5026.65), np.float32(5392.332)]
2025-09-14 19:16:10,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:16:10,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5271.32) for latency 18
2025-09-14 19:16:10,014 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 29 minutes, 7 seconds)
2025-09-14 19:19:06,912 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:19:14,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5143.84082 ± 186.361
2025-09-14 19:19:14,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5283.5054), np.float32(5146.856), np.float32(5292.6846), np.float32(4981.609), np.float32(4731.6206), np.float32(4977.2036), np.float32(5343.598), np.float32(5186.299), np.float32(5151.5464), np.float32(5343.4854)]
2025-09-14 19:19:14,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:19:14,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 25 minutes, 25 seconds)
2025-09-14 19:22:25,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:22:32,611 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4158.76123 ± 1568.831
2025-09-14 19:22:32,612 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5297.6416), np.float32(5063.48), np.float32(1184.2585), np.float32(5168.801), np.float32(5283.2085), np.float32(4880.9985), np.float32(3176.7864), np.float32(1318.1195), np.float32(5179.453), np.float32(5034.868)]
2025-09-14 19:22:32,612 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:22:32,617 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 22 minutes, 12 seconds)
2025-09-14 19:25:30,005 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:25:37,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4910.39600 ± 872.214
2025-09-14 19:25:37,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5216.0376), np.float32(5243.853), np.float32(5225.811), np.float32(2344.0662), np.float32(5211.9453), np.float32(5427.4434), np.float32(4740.2363), np.float32(5180.9204), np.float32(5349.0015), np.float32(5164.6475)]
2025-09-14 19:25:37,354 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:25:37,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 18 minutes, 34 seconds)
2025-09-14 19:29:58,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:30:05,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5234.89307 ± 86.478
2025-09-14 19:30:05,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5341.3774), np.float32(5141.4272), np.float32(5217.1235), np.float32(5295.009), np.float32(5240.5815), np.float32(5302.874), np.float32(5335.5977), np.float32(5055.6846), np.float32(5245.6255), np.float32(5173.6304)]
2025-09-14 19:30:05,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:30:05,758 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 16 minutes, 51 seconds)
2025-09-14 19:33:08,544 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:33:15,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4810.96777 ± 895.506
2025-09-14 19:33:15,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5298.707), np.float32(3547.847), np.float32(2599.7092), np.float32(5215.6294), np.float32(5314.088), np.float32(5312.2407), np.float32(5217.8906), np.float32(5262.7), np.float32(5175.433), np.float32(5165.429)]
2025-09-14 19:33:15,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:33:15,890 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 13 minutes, 40 seconds)
2025-09-14 19:36:21,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:36:28,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5284.89014 ± 103.655
2025-09-14 19:36:28,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5320.1216), np.float32(5210.43), np.float32(5270.185), np.float32(5192.498), np.float32(5291.369), np.float32(5446.7246), np.float32(5411.8994), np.float32(5300.6396), np.float32(5068.87), np.float32(5336.163)]
2025-09-14 19:36:28,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:36:28,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5284.89) for latency 18
2025-09-14 19:36:28,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 10 minutes, 20 seconds)
2025-09-14 19:39:27,192 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:39:34,563 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5202.88232 ± 100.870
2025-09-14 19:39:34,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5289.768), np.float32(4965.0103), np.float32(5209.734), np.float32(5243.7954), np.float32(5066.25), np.float32(5200.109), np.float32(5262.176), np.float32(5291.485), np.float32(5224.1187), np.float32(5276.3794)]
2025-09-14 19:39:34,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:39:34,571 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 48 seconds)
2025-09-14 19:42:28,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:42:35,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4965.47949 ± 1211.652
2025-09-14 19:42:35,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5466.6104), np.float32(5124.159), np.float32(5258.216), np.float32(5400.585), np.float32(5442.104), np.float32(5424.0703), np.float32(5396.691), np.float32(5418.1675), np.float32(5381.9526), np.float32(1342.2373)]
2025-09-14 19:42:35,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:42:35,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 3 minutes, 23 seconds)
2025-09-14 19:48:51,809 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 19:48:59,172 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5273.52148 ± 73.476
2025-09-14 19:48:59,172 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5357.8223), np.float32(5200.5674), np.float32(5325.741), np.float32(5257.8955), np.float32(5138.632), np.float32(5304.3525), np.float32(5191.6875), np.float32(5377.657), np.float32(5316.2026), np.float32(5264.6562)]
2025-09-14 19:48:59,172 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 19:48:59,178 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1251 [DEBUG]: Training session finished
