2025-09-14 13:33:53,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_18
2025-09-14 13:33:53,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_18
2025-09-14 13:33:53,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'18': <latency_env.delayed_mdp.ConstantDelay object at 0x7f7dc1db7e60>}
2025-09-14 13:33:53,406 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 13:33:53,411 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 13:33:53,560 baseline-bpql-noisepromille100-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=125, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 13:33:53,560 baseline-bpql-noisepromille100-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 13:33:55,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 13:33:55,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 13:37:04,094 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:37:14,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -330.04733 ± 35.167
2025-09-14 13:37:14,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [-260.59683, -365.05646, -316.35013, -310.08536, -393.80106, -366.6132, -314.89658, -325.1408, -322.73175, -325.2012]
2025-09-14 13:37:14,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:37:14,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-330.05) for latency 18
2025-09-14 13:37:14,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 5 hours, 28 minutes, 28 seconds)
2025-09-14 13:40:15,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:40:25,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -287.85434 ± 80.173
2025-09-14 13:40:25,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [-231.2205, -279.4226, -457.1883, -149.98091, -351.3863, -335.3643, -314.0884, -223.8136, -294.94266, -241.13599]
2025-09-14 13:40:25,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:40:25,237 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-287.85) for latency 18
2025-09-14 13:40:25,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 5 hours, 18 minutes, 39 seconds)
2025-09-14 13:43:25,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:43:35,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -170.54250 ± 32.631
2025-09-14 13:43:35,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [-190.77559, -145.59837, -214.5555, -97.038895, -190.09595, -181.48569, -171.68666, -139.77448, -198.84125, -175.57268]
2025-09-14 13:43:35,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:43:35,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-170.54) for latency 18
2025-09-14 13:43:35,348 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 5 hours, 12 minutes, 43 seconds)
2025-09-14 13:46:35,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:46:45,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -149.14621 ± 54.584
2025-09-14 13:46:45,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [-223.88924, -249.23586, -86.26394, -143.33769, -56.881516, -153.18129, -114.48537, -166.5505, -157.98059, -139.65619]
2025-09-14 13:46:45,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:46:45,159 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-149.15) for latency 18
2025-09-14 13:46:45,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 5 hours, 8 minutes, 2 seconds)
2025-09-14 13:49:45,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:49:54,435 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -33.83801 ± 67.674
2025-09-14 13:49:54,435 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [-80.71276, -179.6337, 79.114105, -44.8483, -10.705865, -40.729702, 29.337608, -86.33936, 5.0822263, -8.944363]
2025-09-14 13:49:54,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:49:54,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-33.84) for latency 18
2025-09-14 13:49:54,438 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 5 hours, 3 minutes, 48 seconds)
2025-09-14 13:52:53,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:53:05,374 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 312.16739 ± 129.917
2025-09-14 13:53:05,374 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [273.41858, 374.7479, 396.20212, 22.86835, 113.03374, 399.56415, 354.28735, 398.68982, 429.9583, 358.90347]
2025-09-14 13:53:05,374 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:53:05,374 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (312.17) for latency 18
2025-09-14 13:53:05,377 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 58 minutes, 3 seconds)
2025-09-14 13:56:28,646 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 13:56:39,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 214.94897 ± 187.945
2025-09-14 13:56:39,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [688.9293, 211.1281, 162.49013, 77.43169, 431.42633, 141.12819, 95.26373, 37.119003, 132.49272, 172.08058]
2025-09-14 13:56:39,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:56:39,663 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 5 hours, 2 minutes, 4 seconds)
2025-09-14 14:00:03,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:00:13,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 687.79669 ± 184.689
2025-09-14 14:00:13,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [572.35065, 702.526, 603.31946, 860.323, 807.3635, 804.50745, 876.80676, 793.69965, 226.83928, 630.23083]
2025-09-14 14:00:13,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:00:13,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (687.80) for latency 18
2025-09-14 14:00:13,411 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 5 hours, 6 minutes, 4 seconds)
2025-09-14 14:03:18,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:03:28,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 750.31482 ± 229.957
2025-09-14 14:03:28,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [375.54822, 887.5875, 469.9949, 755.6258, 771.00903, 798.2983, 1049.9824, 982.25494, 971.6772, 441.17017]
2025-09-14 14:03:28,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:03:28,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (750.31) for latency 18
2025-09-14 14:03:28,498 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 5 hours, 4 minutes, 20 seconds)
2025-09-14 14:06:28,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:06:38,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 867.17804 ± 150.940
2025-09-14 14:06:38,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1027.2333, 768.1979, 887.7788, 730.3377, 1001.45715, 982.16833, 984.26514, 990.19147, 747.2915, 552.85846]
2025-09-14 14:06:38,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:06:38,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (867.18) for latency 18
2025-09-14 14:06:38,332 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 5 hours, 1 minute, 10 seconds)
2025-09-14 14:09:37,192 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:09:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 898.69238 ± 140.405
2025-09-14 14:09:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [994.6801, 979.5146, 973.71924, 860.9102, 1006.1154, 675.78455, 682.0973, 741.6246, 993.32745, 1079.1503]
2025-09-14 14:09:45,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:09:45,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (898.69) for latency 18
2025-09-14 14:09:45,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 56 minutes, 44 seconds)
2025-09-14 14:12:45,456 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:12:54,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 985.87921 ± 63.758
2025-09-14 14:12:54,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1004.17303, 874.01526, 995.0918, 1047.2288, 1076.8804, 1024.1727, 1036.9247, 886.2531, 954.7065, 959.34595]
2025-09-14 14:12:54,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:12:54,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (985.88) for latency 18
2025-09-14 14:12:54,997 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 46 minutes, 5 seconds)
2025-09-14 14:15:54,696 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:16:04,573 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 961.76941 ± 97.844
2025-09-14 14:16:04,573 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [994.65515, 1080.3536, 968.1148, 1025.7478, 875.47296, 972.0178, 991.61237, 707.51514, 989.3778, 1012.8274]
2025-09-14 14:16:04,573 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:16:04,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 35 minutes, 50 seconds)
2025-09-14 14:19:16,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:19:28,061 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1088.10144 ± 65.832
2025-09-14 14:19:28,064 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1166.1549, 1074.8748, 1137.3994, 1196.8021, 1104.2002, 1075.3491, 1017.3418, 1054.0995, 1093.9934, 960.79926]
2025-09-14 14:19:28,064 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:19:28,064 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1088.10) for latency 18
2025-09-14 14:19:28,067 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 35 minutes, 4 seconds)
2025-09-14 14:22:51,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:23:02,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1019.61652 ± 120.982
2025-09-14 14:23:02,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [958.23456, 1132.7909, 1215.2776, 1113.8818, 987.6843, 871.47754, 913.3448, 1114.9203, 1062.4441, 826.1097]
2025-09-14 14:23:02,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:23:02,756 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 38 minutes, 55 seconds)
2025-09-14 14:26:16,445 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:26:26,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1066.68909 ± 132.774
2025-09-14 14:26:26,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1292.9045, 994.09357, 953.6832, 831.45917, 1126.0024, 1143.9604, 1226.8615, 1006.6852, 967.8043, 1123.4359]
2025-09-14 14:26:26,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:26:26,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 40 minutes, 10 seconds)
2025-09-14 14:29:29,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:29:37,698 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1180.32629 ± 145.712
2025-09-14 14:29:37,699 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1200.4545, 1038.5369, 1392.8722, 1395.8958, 981.21765, 1209.2535, 1036.882, 1331.6704, 1162.1318, 1054.3477]
2025-09-14 14:29:37,699 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:29:37,699 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1180.33) for latency 18
2025-09-14 14:29:37,702 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 37 minutes, 24 seconds)
2025-09-14 14:32:37,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:32:47,452 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1144.82593 ± 96.400
2025-09-14 14:32:47,452 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1116.7828, 1058.0562, 1198.365, 1181.2682, 1066.8451, 1090.7955, 1021.4197, 1165.7405, 1379.6742, 1169.313]
2025-09-14 14:32:47,452 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:32:47,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 34 minutes, 7 seconds)
2025-09-14 14:35:47,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:35:57,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1259.89624 ± 128.043
2025-09-14 14:35:57,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1476.0654, 1169.4817, 1273.5352, 1257.5227, 1457.2396, 1263.3833, 1334.1393, 1077.2155, 1193.6621, 1096.7191]
2025-09-14 14:35:57,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:35:57,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1259.90) for latency 18
2025-09-14 14:35:57,830 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 27 minutes, 14 seconds)
2025-09-14 14:38:58,266 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:39:08,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1047.70374 ± 495.330
2025-09-14 14:39:08,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1148.9565, 1028.8987, 1308.4456, 1284.4365, 1054.256, 1110.6499, -391.22263, 1194.8318, 1459.4354, 1278.3496]
2025-09-14 14:39:08,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:39:08,290 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 17 minutes, 28 seconds)
2025-09-14 14:42:07,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:42:15,832 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1207.04846 ± 117.690
2025-09-14 14:42:15,832 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1471.5114, 1100.6312, 1187.7256, 1062.7991, 1324.5613, 1230.9208, 1254.1472, 1158.046, 1203.678, 1076.4651]
2025-09-14 14:42:15,832 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:42:15,836 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 10 minutes, 3 seconds)
2025-09-14 14:45:18,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:45:29,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1465.38538 ± 321.210
2025-09-14 14:45:29,834 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1118.002, 1399.1467, 2042.6307, 1249.556, 1469.2007, 1479.454, 1237.2451, 1564.6492, 2022.0997, 1071.8687]
2025-09-14 14:45:29,835 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:45:29,835 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1465.39) for latency 18
2025-09-14 14:45:29,838 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 7 minutes, 33 seconds)
2025-09-14 14:48:53,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:49:04,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1346.97754 ± 248.659
2025-09-14 14:49:04,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1851.2081, 1138.5964, 1746.3654, 1259.3977, 1101.2416, 1188.0216, 1327.1802, 1123.0251, 1458.0051, 1276.7338]
2025-09-14 14:49:04,405 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:49:04,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 10 minutes, 45 seconds)
2025-09-14 14:52:26,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:52:36,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1483.20947 ± 339.850
2025-09-14 14:52:36,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1253.7753), np.float32(1455.134), np.float32(1296.7876), np.float32(1190.2178), np.float32(1418.0012), np.float32(1939.3281), np.float32(1716.511), np.float32(2208.0999), np.float32(1203.3815), np.float32(1150.8574)]
2025-09-14 14:52:36,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:52:36,798 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1483.21) for latency 18
2025-09-14 14:52:36,802 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 13 minutes, 4 seconds)
2025-09-14 14:55:42,887 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:55:52,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1422.27808 ± 257.271
2025-09-14 14:55:52,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1354.1439), np.float32(1432.2639), np.float32(1000.9804), np.float32(1582.985), np.float32(1535.0868), np.float32(1256.4999), np.float32(1258.6877), np.float32(1207.6006), np.float32(1624.7343), np.float32(1969.799)]
2025-09-14 14:55:52,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:55:52,931 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 11 minutes, 9 seconds)
2025-09-14 14:58:54,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 14:59:03,974 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1397.56396 ± 355.796
2025-09-14 14:59:03,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1180.557), np.float32(2370.0322), np.float32(1432.9023), np.float32(1126.928), np.float32(1435.6454), np.float32(1329.271), np.float32(1210.4696), np.float32(1032.5599), np.float32(1322.93), np.float32(1534.344)]
2025-09-14 14:59:03,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:59:03,978 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 8 minutes, 40 seconds)
2025-09-14 15:02:04,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:02:12,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1330.41223 ± 223.875
2025-09-14 15:02:12,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1120.1857), np.float32(1097.994), np.float32(1667.0266), np.float32(1366.9941), np.float32(1213.0444), np.float32(1353.7616), np.float32(1039.8097), np.float32(1289.2322), np.float32(1758.6407), np.float32(1397.4327)]
2025-09-14 15:02:12,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:02:12,943 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 4 minutes, 5 seconds)
2025-09-14 15:05:11,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:05:21,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1618.74744 ± 397.703
2025-09-14 15:05:21,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1187.1527), np.float32(1396.2987), np.float32(2289.909), np.float32(1806.641), np.float32(2107.2756), np.float32(1065.3384), np.float32(1317.1528), np.float32(1526.5559), np.float32(1442.5872), np.float32(2048.564)]
2025-09-14 15:05:21,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:05:21,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1618.75) for latency 18
2025-09-14 15:05:21,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 3 hours, 54 minutes, 23 seconds)
2025-09-14 15:08:21,722 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:08:31,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1388.43372 ± 187.340
2025-09-14 15:08:31,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1580.7559), np.float32(1367.8079), np.float32(1142.6406), np.float32(1240.3837), np.float32(1168.9747), np.float32(1523.6475), np.float32(1435.0283), np.float32(1672.2616), np.float32(1174.2372), np.float32(1578.5994)]
2025-09-14 15:08:31,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:08:31,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 3 hours, 45 minutes, 56 seconds)
2025-09-14 15:11:32,307 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:11:42,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1492.59875 ± 384.390
2025-09-14 15:11:42,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1118.1072), np.float32(2306.714), np.float32(1788.182), np.float32(1182.4996), np.float32(1725.554), np.float32(1683.164), np.float32(1143.8607), np.float32(1188.337), np.float32(1105.7477), np.float32(1683.821)]
2025-09-14 15:11:42,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:11:42,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 3 hours, 41 minutes, 30 seconds)
2025-09-14 15:14:57,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:15:08,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1498.39319 ± 358.037
2025-09-14 15:15:08,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2138.4917), np.float32(1485.0713), np.float32(1424.5554), np.float32(2170.7786), np.float32(1175.3889), np.float32(1107.3983), np.float32(1594.5864), np.float32(1387.2814), np.float32(1329.0333), np.float32(1171.3468)]
2025-09-14 15:15:08,571 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:15:08,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 41 minutes, 51 seconds)
2025-09-14 15:18:31,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:18:42,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1497.87146 ± 437.458
2025-09-14 15:18:42,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1775.203), np.float32(2047.9572), np.float32(1317.0074), np.float32(1173.4482), np.float32(1036.946), np.float32(2253.2256), np.float32(1941.5338), np.float32(1038.3298), np.float32(1086.2039), np.float32(1308.8606)]
2025-09-14 15:18:42,797 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:18:42,801 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 44 minutes, 22 seconds)
2025-09-14 15:21:54,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:22:03,940 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1299.92212 ± 219.966
2025-09-14 15:22:03,940 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1795.2551), np.float32(1230.6954), np.float32(1236.2256), np.float32(1242.7389), np.float32(1127.597), np.float32(1222.43), np.float32(1603.0057), np.float32(1009.11066), np.float32(1188.7753), np.float32(1343.3867)]
2025-09-14 15:22:03,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:22:03,944 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 43 minutes, 58 seconds)
2025-09-14 15:25:09,103 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:25:18,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1305.00220 ± 146.767
2025-09-14 15:25:18,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1362.1306), np.float32(1162.9578), np.float32(1476.3999), np.float32(1277.9333), np.float32(1200.0365), np.float32(1541.6227), np.float32(1104.7628), np.float32(1450.7849), np.float32(1126.771), np.float32(1346.622)]
2025-09-14 15:25:18,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:25:18,716 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 41 minutes, 35 seconds)
2025-09-14 15:28:16,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:28:25,457 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1324.18079 ± 179.890
2025-09-14 15:28:25,457 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1137.0118), np.float32(1423.4286), np.float32(1594.8883), np.float32(1121.0143), np.float32(1560.5454), np.float32(1363.901), np.float32(1183.9429), np.float32(1494.6995), np.float32(1273.5951), np.float32(1088.7812)]
2025-09-14 15:28:25,457 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:28:25,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 37 minutes, 22 seconds)
2025-09-14 15:31:26,098 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:31:35,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1441.30518 ± 247.140
2025-09-14 15:31:35,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1231.2201), np.float32(1438.549), np.float32(1705.5125), np.float32(1056.5707), np.float32(1180.1163), np.float32(1900.5865), np.float32(1341.896), np.float32(1539.4991), np.float32(1652.4032), np.float32(1366.6993)]
2025-09-14 15:31:35,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:31:35,770 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 30 minutes, 36 seconds)
2025-09-14 15:34:36,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:34:45,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1453.62378 ± 285.730
2025-09-14 15:34:45,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1578.5823), np.float32(1528.0468), np.float32(1059.7341), np.float32(1427.8575), np.float32(2077.838), np.float32(1168.2716), np.float32(1315.272), np.float32(1539.3228), np.float32(1684.9685), np.float32(1156.3442)]
2025-09-14 15:34:45,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:34:45,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 22 minutes, 15 seconds)
2025-09-14 15:37:47,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:37:57,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1374.84546 ± 254.262
2025-09-14 15:37:57,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1393.9945), np.float32(1254.1249), np.float32(1103.8628), np.float32(1117.2864), np.float32(1115.0502), np.float32(1129.8667), np.float32(1713.8888), np.float32(1482.1703), np.float32(1699.3801), np.float32(1738.8308)]
2025-09-14 15:37:57,281 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:37:57,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 17 minutes, 1 second)
2025-09-14 15:40:55,862 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:41:04,457 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1409.71313 ± 226.226
2025-09-14 15:41:04,458 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1935.35), np.float32(1423.5426), np.float32(1183.6101), np.float32(1639.9216), np.float32(1180.9111), np.float32(1253.6041), np.float32(1483.1384), np.float32(1475.1434), np.float32(1244.6854), np.float32(1277.2251)]
2025-09-14 15:41:04,458 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:41:04,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 12 minutes, 18 seconds)
2025-09-14 15:44:28,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:44:39,295 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1704.13416 ± 317.065
2025-09-14 15:44:39,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1258.5441), np.float32(1788.7731), np.float32(2169.8303), np.float32(1822.7048), np.float32(1747.0382), np.float32(2218.6624), np.float32(1473.3226), np.float32(1233.7152), np.float32(1554.0469), np.float32(1774.7041)]
2025-09-14 15:44:39,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:44:39,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1704.13) for latency 18
2025-09-14 15:44:39,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 14 minutes, 46 seconds)
2025-09-14 15:48:02,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:48:13,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1775.92651 ± 390.680
2025-09-14 15:48:13,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1360.4465), np.float32(2094.848), np.float32(1400.8633), np.float32(1596.8724), np.float32(2493.0176), np.float32(1249.0071), np.float32(1731.4907), np.float32(1682.5621), np.float32(2296.1008), np.float32(1854.0571)]
2025-09-14 15:48:13,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:48:13,480 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1775.93) for latency 18
2025-09-14 15:48:13,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 16 minutes, 13 seconds)
2025-09-14 15:51:19,246 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:51:29,486 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1435.07483 ± 373.372
2025-09-14 15:51:29,486 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1115.6555), np.float32(1203.8085), np.float32(1381.4417), np.float32(1282.0825), np.float32(1664.3536), np.float32(1201.1351), np.float32(2443.3716), np.float32(1553.17), np.float32(1189.7642), np.float32(1315.9653)]
2025-09-14 15:51:29,486 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:51:29,491 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 14 minutes, 1 second)
2025-09-14 15:54:31,123 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:54:40,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1255.28638 ± 134.473
2025-09-14 15:54:40,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1096.1748), np.float32(1116.455), np.float32(1315.4576), np.float32(1426.4183), np.float32(1289.487), np.float32(1112.2811), np.float32(1318.4247), np.float32(1323.874), np.float32(1466.9689), np.float32(1087.3221)]
2025-09-14 15:54:40,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:54:40,938 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 10 minutes, 41 seconds)
2025-09-14 15:57:41,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 15:57:51,044 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1566.34497 ± 300.692
2025-09-14 15:57:51,044 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1226.9703), np.float32(2168.0015), np.float32(1250.238), np.float32(1250.735), np.float32(1648.3992), np.float32(1357.9092), np.float32(1511.7428), np.float32(1767.5417), np.float32(1931.7998), np.float32(1550.1118)]
2025-09-14 15:57:51,044 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:57:51,048 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 7 minutes, 53 seconds)
2025-09-14 16:00:51,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:01:01,047 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1284.02881 ± 210.065
2025-09-14 16:01:01,047 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1661.5048), np.float32(1170.6754), np.float32(1184.1211), np.float32(1191.4287), np.float32(1721.5104), np.float32(1166.3759), np.float32(1309.2233), np.float32(1125.8538), np.float32(1194.1093), np.float32(1115.485)]
2025-09-14 16:01:01,047 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:01:01,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 59 minutes, 59 seconds)
2025-09-14 16:04:02,356 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:04:11,543 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1387.37183 ± 336.226
2025-09-14 16:04:11,544 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1152.3253), np.float32(1235.9845), np.float32(1097.5319), np.float32(914.4176), np.float32(1400.4705), np.float32(1686.1766), np.float32(1651.0497), np.float32(1322.1052), np.float32(1272.5066), np.float32(2141.1511)]
2025-09-14 16:04:11,544 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:04:11,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 52 minutes, 27 seconds)
2025-09-14 16:07:10,074 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:07:19,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1669.98596 ± 420.055
2025-09-14 16:07:19,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1641.6072), np.float32(1880.1094), np.float32(1873.7888), np.float32(1289.1871), np.float32(1393.9037), np.float32(2730.5264), np.float32(1773.2863), np.float32(1565.1123), np.float32(1267.7769), np.float32(1284.5612)]
2025-09-14 16:07:19,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:07:19,508 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 47 minutes, 50 seconds)
2025-09-14 16:10:25,703 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:10:35,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1576.69067 ± 480.322
2025-09-14 16:10:35,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1524.6841), np.float32(945.32477), np.float32(1382.4185), np.float32(1138.46), np.float32(1238.6462), np.float32(1926.5574), np.float32(1233.213), np.float32(1672.7056), np.float32(2585.3674), np.float32(2119.53)]
2025-09-14 16:10:35,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:10:35,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 45 minutes, 32 seconds)
2025-09-14 16:13:59,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:14:10,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1492.83081 ± 312.297
2025-09-14 16:14:10,250 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1753.3993), np.float32(1487.5997), np.float32(2164.5703), np.float32(1202.7673), np.float32(1333.5631), np.float32(1771.789), np.float32(1194.9838), np.float32(1289.9176), np.float32(1134.4495), np.float32(1595.2689)]
2025-09-14 16:14:10,250 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:14:10,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 46 minutes, 27 seconds)
2025-09-14 16:17:29,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:17:39,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1565.70581 ± 402.059
2025-09-14 16:17:39,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2008.8928), np.float32(1654.971), np.float32(2524.5942), np.float32(1232.7739), np.float32(1628.2383), np.float32(1148.2317), np.float32(1471.428), np.float32(1462.9674), np.float32(1274.6041), np.float32(1250.3552)]
2025-09-14 16:17:39,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:17:39,996 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 46 minutes, 29 seconds)
2025-09-14 16:20:44,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:20:54,881 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1547.50232 ± 481.564
2025-09-14 16:20:54,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1731.4254), np.float32(1183.8528), np.float32(1214.9061), np.float32(1980.3018), np.float32(1292.998), np.float32(1205.8246), np.float32(2720.571), np.float32(1240.052), np.float32(1756.5726), np.float32(1148.5192)]
2025-09-14 16:20:54,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:20:54,886 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 43 minutes, 52 seconds)
2025-09-14 16:23:55,555 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:24:05,248 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1516.13513 ± 352.209
2025-09-14 16:24:05,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2102.262), np.float32(1434.2134), np.float32(1706.4182), np.float32(1159.4346), np.float32(2196.5603), np.float32(1141.7782), np.float32(1373.9521), np.float32(1417.0991), np.float32(1235.4252), np.float32(1394.2091)]
2025-09-14 16:24:05,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:24:05,253 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 40 minutes, 55 seconds)
2025-09-14 16:27:05,044 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:27:14,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1559.22791 ± 518.912
2025-09-14 16:27:14,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1556.3647), np.float32(2939.0994), np.float32(1672.2864), np.float32(1325.9095), np.float32(1185.7123), np.float32(1301.2742), np.float32(1950.9343), np.float32(1156.4985), np.float32(1177.6368), np.float32(1326.5641)]
2025-09-14 16:27:14,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:27:14,977 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 36 minutes, 30 seconds)
2025-09-14 16:30:13,946 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:30:22,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1420.55103 ± 295.234
2025-09-14 16:30:22,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1658.1007), np.float32(1262.1473), np.float32(1317.8368), np.float32(1145.3844), np.float32(1220.8132), np.float32(1983.5461), np.float32(1098.8148), np.float32(1387.6959), np.float32(1882.0682), np.float32(1249.1018)]
2025-09-14 16:30:22,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:30:22,656 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 29 minutes, 6 seconds)
2025-09-14 16:33:23,040 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:33:33,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2026.84180 ± 706.947
2025-09-14 16:33:33,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1062.3976), np.float32(1138.3185), np.float32(1438.4291), np.float32(3057.3445), np.float32(2734.9111), np.float32(1405.2455), np.float32(2158.1284), np.float32(2773.7751), np.float32(2642.0554), np.float32(1857.8132)]
2025-09-14 16:33:33,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:33:33,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2026.84) for latency 18
2025-09-14 16:33:33,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 22 minutes, 57 seconds)
2025-09-14 16:36:33,306 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:36:43,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1530.56482 ± 522.658
2025-09-14 16:36:43,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1190.3713), np.float32(1442.4111), np.float32(1174.7578), np.float32(1127.255), np.float32(1393.9044), np.float32(1257.128), np.float32(1554.3232), np.float32(1173.346), np.float32(2837.3342), np.float32(2154.818)]
2025-09-14 16:36:43,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:36:43,076 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 19 minutes, 4 seconds)
2025-09-14 16:39:55,604 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:40:06,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1633.28442 ± 454.743
2025-09-14 16:40:06,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1255.1749), np.float32(1794.6132), np.float32(1336.753), np.float32(1378.1592), np.float32(1714.4154), np.float32(2715.5469), np.float32(1981.7579), np.float32(1183.7277), np.float32(1169.0896), np.float32(1803.605)]
2025-09-14 16:40:06,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:40:06,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 17 minutes, 49 seconds)
2025-09-14 16:43:30,320 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:43:41,529 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1544.98657 ± 399.222
2025-09-14 16:43:41,529 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2368.3228), np.float32(1732.4729), np.float32(1530.6622), np.float32(1278.7894), np.float32(1135.3562), np.float32(1279.5549), np.float32(1218.3525), np.float32(1294.4739), np.float32(1441.3923), np.float32(2170.4895)]
2025-09-14 16:43:41,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:43:41,535 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 18 minutes, 7 seconds)
2025-09-14 16:46:55,571 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:47:05,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1718.21643 ± 375.525
2025-09-14 16:47:05,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1836.3357), np.float32(1771.9662), np.float32(1371.3804), np.float32(1496.3167), np.float32(1224.3116), np.float32(2552.3057), np.float32(1633.8383), np.float32(1435.2814), np.float32(2161.636), np.float32(1698.7911)]
2025-09-14 16:47:05,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:47:05,747 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 17 minutes, 5 seconds)
2025-09-14 16:50:10,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:50:19,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2133.93579 ± 757.088
2025-09-14 16:50:19,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1589.4856), np.float32(1826.1478), np.float32(2765.3164), np.float32(1319.6133), np.float32(1681.6311), np.float32(2103.7478), np.float32(3280.7432), np.float32(1238.6818), np.float32(2006.984), np.float32(3527.006)]
2025-09-14 16:50:19,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:50:19,930 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2133.94) for latency 18
2025-09-14 16:50:19,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 14 minutes, 15 seconds)
2025-09-14 16:53:19,316 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:53:27,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1589.06958 ± 434.528
2025-09-14 16:53:27,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1534.3943), np.float32(1887.7869), np.float32(1198.463), np.float32(2438.5688), np.float32(2129.3306), np.float32(1162.2953), np.float32(1197.1146), np.float32(1078.1921), np.float32(1753.3638), np.float32(1511.1859)]
2025-09-14 16:53:27,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:53:27,957 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 10 minutes, 38 seconds)
2025-09-14 16:56:26,192 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:56:35,893 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1874.98511 ± 683.157
2025-09-14 16:56:35,893 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3132.8337), np.float32(1350.0919), np.float32(2854.542), np.float32(1538.9727), np.float32(1913.485), np.float32(1969.6401), np.float32(2423.8525), np.float32(1211.1196), np.float32(1100.3307), np.float32(1254.9835)]
2025-09-14 16:56:35,893 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:56:35,898 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 2 hours, 5 minutes, 16 seconds)
2025-09-14 16:59:36,612 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 16:59:46,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1814.37134 ± 497.345
2025-09-14 16:59:46,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1413.0035), np.float32(1295.214), np.float32(2251.8906), np.float32(1397.6228), np.float32(2733.9707), np.float32(2104.9531), np.float32(1652.2573), np.float32(1889.0853), np.float32(1115.0801), np.float32(2290.6355)]
2025-09-14 16:59:46,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:59:46,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 59 minutes, 1 second)
2025-09-14 17:02:47,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:02:56,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2023.26428 ± 654.036
2025-09-14 17:02:56,878 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3137.758), np.float32(2646.2146), np.float32(2395.402), np.float32(1276.5197), np.float32(1938.0178), np.float32(1262.4648), np.float32(1761.0968), np.float32(1740.981), np.float32(2816.988), np.float32(1257.1986)]
2025-09-14 17:02:56,878 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:02:56,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 54 minutes, 8 seconds)
2025-09-14 17:05:57,796 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:06:06,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1816.29614 ± 588.301
2025-09-14 17:06:06,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2481.641), np.float32(3131.1204), np.float32(1261.4045), np.float32(1295.3624), np.float32(1403.984), np.float32(1459.168), np.float32(1909.138), np.float32(2022.3312), np.float32(1233.6473), np.float32(1965.1621)]
2025-09-14 17:06:06,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:06:06,454 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 50 minutes, 25 seconds)
2025-09-14 17:09:27,283 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:09:38,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1800.03296 ± 363.460
2025-09-14 17:09:38,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2060.2505), np.float32(1769.1216), np.float32(1906.5774), np.float32(2048.6514), np.float32(1256.7644), np.float32(2361.5935), np.float32(2105.0527), np.float32(1153.7017), np.float32(1778.2825), np.float32(1560.3333)]
2025-09-14 17:09:38,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:09:38,543 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 49 minutes, 59 seconds)
2025-09-14 17:13:03,230 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:13:14,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1919.14905 ± 626.175
2025-09-14 17:13:14,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1291.8215), np.float32(2734.7087), np.float32(3092.6394), np.float32(1656.5276), np.float32(1301.7035), np.float32(2267.9194), np.float32(2216.0723), np.float32(1128.1483), np.float32(1483.6465), np.float32(2018.3031)]
2025-09-14 17:13:14,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:13:14,491 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 49 minutes, 50 seconds)
2025-09-14 17:16:21,951 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:16:31,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2105.61084 ± 766.637
2025-09-14 17:16:31,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1134.7562), np.float32(2997.0935), np.float32(1900.604), np.float32(1241.4198), np.float32(1286.9941), np.float32(2744.777), np.float32(2809.416), np.float32(2785.6733), np.float32(2887.9685), np.float32(1267.4039)]
2025-09-14 17:16:31,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:16:31,935 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 47 minutes, 13 seconds)
2025-09-14 17:19:33,204 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:19:43,068 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1993.25354 ± 676.163
2025-09-14 17:19:43,068 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2597.3362), np.float32(1602.7031), np.float32(1503.1436), np.float32(1917.5483), np.float32(1118.0039), np.float32(3240.9631), np.float32(2981.7605), np.float32(1942.261), np.float32(1360.4481), np.float32(1668.3699)]
2025-09-14 17:19:43,069 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:19:43,075 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 43 minutes, 58 seconds)
2025-09-14 17:22:43,087 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:22:53,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2147.77979 ± 512.377
2025-09-14 17:22:53,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2641.9644), np.float32(1569.2069), np.float32(1987.1191), np.float32(1508.969), np.float32(2564.9548), np.float32(1398.4269), np.float32(2548.562), np.float32(2234.6108), np.float32(2041.1375), np.float32(2982.8481)]
2025-09-14 17:22:53,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:22:53,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2147.78) for latency 18
2025-09-14 17:22:53,019 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 40 minutes, 39 seconds)
2025-09-14 17:25:51,035 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:25:59,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1427.90161 ± 342.340
2025-09-14 17:25:59,677 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1148.7699), np.float32(1566.3999), np.float32(1333.7983), np.float32(1162.2819), np.float32(1294.3638), np.float32(2383.4116), np.float32(1324.4098), np.float32(1476.5612), np.float32(1205.1631), np.float32(1383.8563)]
2025-09-14 17:25:59,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:25:59,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 34 minutes, 50 seconds)
2025-09-14 17:28:59,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:29:09,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1790.24634 ± 492.685
2025-09-14 17:29:09,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1273.1194), np.float32(2009.3734), np.float32(1679.1552), np.float32(1302.2675), np.float32(1200.3381), np.float32(1399.6125), np.float32(1899.1539), np.float32(2008.1034), np.float32(2788.1719), np.float32(2343.1687)]
2025-09-14 17:29:09,733 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:29:09,739 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 29 minutes, 9 seconds)
2025-09-14 17:32:09,769 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:32:19,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1936.02368 ± 610.456
2025-09-14 17:32:19,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1462.8898), np.float32(1253.3107), np.float32(2812.431), np.float32(2687.552), np.float32(1688.7876), np.float32(2307.032), np.float32(2471.4285), np.float32(1200.7745), np.float32(2285.4143), np.float32(1190.6147)]
2025-09-14 17:32:19,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:32:19,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 25 minutes, 16 seconds)
2025-09-14 17:35:32,496 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:35:43,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1861.78149 ± 527.762
2025-09-14 17:35:43,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1578.1002), np.float32(1838.0934), np.float32(2711.1565), np.float32(2126.2246), np.float32(1891.2716), np.float32(1167.2152), np.float32(1390.9547), np.float32(2827.7275), np.float32(1733.7177), np.float32(1353.3528)]
2025-09-14 17:35:43,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:35:43,295 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 23 minutes, 13 seconds)
2025-09-14 17:39:07,614 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:39:18,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1901.69531 ± 541.958
2025-09-14 17:39:18,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1987.2133), np.float32(1186.9618), np.float32(1648.1528), np.float32(2132.2517), np.float32(1518.4532), np.float32(2298.337), np.float32(1538.7595), np.float32(3103.6047), np.float32(1353.8104), np.float32(2249.4077)]
2025-09-14 17:39:18,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:39:18,243 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 22 minutes, 6 seconds)
2025-09-14 17:42:32,371 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:42:42,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1836.79297 ± 609.008
2025-09-14 17:42:42,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2745.317, 2443.6055, 1565.264, 2978.5393, 1345.411, 1595.3019, 1646.6334, 1508.9025, 1428.8763, 1110.0773]
2025-09-14 17:42:42,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:42:42,152 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 20 minutes, 11 seconds)
2025-09-14 17:45:44,050 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:45:53,907 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1749.00562 ± 322.992
2025-09-14 17:45:53,907 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2034.2764, 1894.6959, 1243.198, 1201.1475, 1834.159, 1793.2982, 1507.1077, 1709.8405, 2224.464, 2047.8704]
2025-09-14 17:45:53,907 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:45:53,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 16 minutes, 59 seconds)
2025-09-14 17:48:54,300 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:49:04,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1752.23511 ± 420.710
2025-09-14 17:49:04,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2151.4734, 1158.695, 1705.8064, 1271.3536, 2397.9358, 1734.5641, 2100.661, 1245.9087, 2192.8237, 1563.1283]
2025-09-14 17:49:04,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:49:04,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 13 minutes, 40 seconds)
2025-09-14 17:52:04,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:52:14,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1741.44556 ± 660.523
2025-09-14 17:52:14,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1139.6486, 1756.7126, 2472.6184, 1007.80133, 2684.289, 1377.6149, 2788.8464, 1073.2142, 1173.0537, 1940.6558]
2025-09-14 17:52:14,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:52:14,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 9 minutes, 23 seconds)
2025-09-14 17:55:15,218 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:55:23,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2369.73608 ± 727.073
2025-09-14 17:55:23,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [3372.096, 1652.2423, 2730.0781, 1909.6682, 2175.0337, 1762.2186, 2905.2039, 2720.978, 1100.9264, 3368.9172]
2025-09-14 17:55:23,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:55:23,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2369.74) for latency 18
2025-09-14 17:55:23,894 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 4 minutes, 22 seconds)
2025-09-14 17:58:22,702 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 17:58:32,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2015.23608 ± 681.258
2025-09-14 17:58:32,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1536.3553, 1708.1342, 3405.114, 1636.642, 1589.4108, 1938.5985, 1585.7827, 2472.991, 3042.5132, 1236.8207]
2025-09-14 17:58:32,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 17:58:32,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 1 hour, 9 seconds)
2025-09-14 18:01:39,478 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:01:50,366 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2223.46826 ± 539.733
2025-09-14 18:01:50,367 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2505.098, 1943.6726, 2857.8625, 3089.2534, 2115.459, 1465.7239, 1368.7657, 1944.2494, 2678.8447, 2265.7546]
2025-09-14 18:01:50,367 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:01:50,374 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 57 minutes, 23 seconds)
2025-09-14 18:05:14,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:05:25,608 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2153.28345 ± 612.219
2025-09-14 18:05:25,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1775.0709, 2425.986, 1658.2542, 1813.5974, 3326.3398, 1800.6619, 1495.5333, 2251.2434, 1787.1241, 3199.0215]
2025-09-14 18:05:25,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:05:25,615 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 55 minutes, 36 seconds)
2025-09-14 18:08:44,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:08:54,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1865.77869 ± 620.377
2025-09-14 18:08:54,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1311.9292, 1373.2767, 962.70764, 2364.7778, 3132.7698, 1516.4578, 1908.612, 2550.9836, 1675.5847, 1860.6881]
2025-09-14 18:08:54,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:08:54,783 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 53 minutes, 20 seconds)
2025-09-14 18:12:00,945 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:12:10,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1881.02271 ± 539.399
2025-09-14 18:12:10,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1161.9989, 1853.8363, 1221.0538, 2148.7346, 1800.0448, 1996.9366, 1525.3052, 1655.5935, 2357.2131, 3089.5107]
2025-09-14 18:12:10,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:12:10,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 50 minutes, 19 seconds)
2025-09-14 18:15:10,139 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:15:18,693 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2549.01147 ± 833.323
2025-09-14 18:15:18,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2622.295, 1657.3231, 2708.1975, 2734.5027, 1281.8381, 1265.1055, 3331.882, 2904.8948, 3097.2188, 3886.857]
2025-09-14 18:15:18,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:15:18,694 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2549.01) for latency 18
2025-09-14 18:15:18,700 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 46 minutes, 58 seconds)
2025-09-14 18:18:18,691 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:18:28,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2054.94385 ± 646.639
2025-09-14 18:18:28,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2991.2021, 2101.691, 2446.2456, 1838.1709, 1439.6183, 3087.777, 2245.2798, 1125.532, 1185.4518, 2088.4705]
2025-09-14 18:18:28,830 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:18:28,836 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 43 minutes, 16 seconds)
2025-09-14 18:21:29,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:21:39,643 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2094.91675 ± 568.234
2025-09-14 18:21:39,643 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1716.4885, 2660.2327, 2446.1135, 2649.667, 1798.1824, 1347.0967, 2793.9053, 2650.7603, 1586.7853, 1299.9366]
2025-09-14 18:21:39,643 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:21:39,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 38 minutes, 57 seconds)
2025-09-14 18:24:40,643 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:24:50,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2198.24561 ± 696.003
2025-09-14 18:24:50,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2437.4688, 2858.011, 2365.0894, 1834.2583, 1229.5464, 1667.9044, 1355.2849, 1870.9865, 2851.289, 3512.6187]
2025-09-14 18:24:50,357 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:24:50,365 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 35 minutes, 2 seconds)
2025-09-14 18:27:49,282 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:27:58,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1844.30115 ± 661.265
2025-09-14 18:27:58,084 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1813.6943, 1994.734, 1249.8213, 2718.3557, 1864.8884, 1497.2672, 1400.2277, 1249.1722, 1304.5103, 3350.34]
2025-09-14 18:27:58,084 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:27:58,092 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 31 minutes, 35 seconds)
2025-09-14 18:31:22,031 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:31:32,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1998.28223 ± 402.320
2025-09-14 18:31:32,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2479.0706, 2241.4568, 1528.8618, 1887.7045, 2619.6875, 1998.544, 1778.0729, 2269.4014, 1946.5391, 1233.4838]
2025-09-14 18:31:32,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:31:32,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 29 minutes, 12 seconds)
2025-09-14 18:34:50,984 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:35:01,704 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2497.75610 ± 539.290
2025-09-14 18:35:01,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [3083.1501, 2614.8835, 2093.3242, 2994.953, 2886.9768, 2700.164, 2673.958, 2559.8015, 1140.5542, 2229.795]
2025-09-14 18:35:01,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:35:01,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 26 minutes, 28 seconds)
2025-09-14 18:38:01,703 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:38:10,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2524.14087 ± 1044.598
2025-09-14 18:38:10,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1572.6426, 1249.0327, 3262.9631, 3720.2383, 1473.8851, 4059.4866, 1936.4023, 2512.3936, 1605.6305, 3848.7317]
2025-09-14 18:38:10,995 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:38:11,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 23 minutes, 7 seconds)
2025-09-14 18:41:06,666 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:41:16,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1773.77368 ± 675.949
2025-09-14 18:41:16,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2362.8652, 2158.2673, 1297.641, 3251.1614, 1152.881, 2268.6096, 1584.0411, 1300.9835, 1369.4808, 991.80524]
2025-09-14 18:41:16,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:41:16,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 19 minutes, 42 seconds)
2025-09-14 18:44:10,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:44:19,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1902.51929 ± 755.744
2025-09-14 18:44:19,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2349.6038, 1386.7108, 2429.0488, 1790.5177, 1697.2903, 1154.591, 1108.5337, 1528.0813, 3792.1409, 1788.6754]
2025-09-14 18:44:19,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:44:19,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 16 minutes, 21 seconds)
2025-09-14 18:47:11,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:47:19,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2073.33447 ± 736.006
2025-09-14 18:47:19,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [3216.574, 2120.4946, 1738.2056, 1644.8971, 2932.1018, 1429.1409, 3197.255, 1153.1156, 1960.9298, 1340.6301]
2025-09-14 18:47:19,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:47:19,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 37 seconds)
2025-09-14 18:50:08,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:50:16,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1848.90002 ± 730.934
2025-09-14 18:50:16,652 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1728.2405, 1458.3348, 3454.355, 1185.1125, 1215.0911, 2369.8967, 2759.3418, 1749.7164, 1219.8251, 1349.0854]
2025-09-14 18:50:16,652 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:50:16,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 8 seconds)
2025-09-14 18:53:01,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:53:09,692 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2151.41040 ± 749.885
2025-09-14 18:53:09,692 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [1720.6985, 2394.101, 1593.2551, 1093.9528, 2705.0425, 3717.8196, 1971.7408, 2846.0378, 2141.056, 1330.3984]
2025-09-14 18:53:09,692 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:53:09,698 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 59 seconds)
2025-09-14 18:55:51,875 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:55:59,317 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2610.57275 ± 777.646
2025-09-14 18:55:59,317 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [3449.2722, 3538.2576, 2100.985, 2147.9429, 3334.8257, 1250.7078, 3329.2925, 2161.3435, 1753.9576, 3039.1448]
2025-09-14 18:55:59,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:55:59,318 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2610.57) for latency 18
2025-09-14 18:55:59,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 56 seconds)
2025-09-14 18:58:41,081 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 18...
2025-09-14 18:58:48,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2259.28223 ± 805.652
2025-09-14 18:58:48,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [2895.8188, 1183.6504, 2872.1816, 1227.0161, 1915.5867, 1733.3871, 2988.046, 1672.5303, 2359.649, 3744.9563]
2025-09-14 18:58:48,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 18:58:48,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1251 [DEBUG]: Training session finished
