2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.050-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.050-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'12': <latency_env.delayed_mdp.ConstantDelay object at 0x7f48696e2d20>}
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,515 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,606 baseline-bpql-noisepromille50-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=89, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,607 baseline-bpql-noisepromille50-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,524 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:45:34,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:45:40,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: -268.99332 ± 189.747
2025-09-14 08:45:40,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-309.59995), np.float32(-262.75766), np.float32(-89.53744), np.float32(-54.20141), np.float32(-466.6617), np.float32(-280.35352), np.float32(-207.827), np.float32(-599.19403), np.float32(36.118546), np.float32(-455.91895)]
2025-09-14 08:45:40,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:45:40,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (-268.99) for latency 12
2025-09-14 08:45:40,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 19 minutes, 46 seconds)
2025-09-14 08:48:13,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:48:20,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: -250.88676 ± 93.471
2025-09-14 08:48:20,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-183.80406), np.float32(-388.64508), np.float32(-429.40063), np.float32(-273.27963), np.float32(-126.97335), np.float32(-240.6449), np.float32(-292.42148), np.float32(-240.32767), np.float32(-171.49979), np.float32(-161.87111)]
2025-09-14 08:48:20,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:48:20,752 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (-250.89) for latency 12
2025-09-14 08:48:20,754 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 19 minutes, 4 seconds)
2025-09-14 08:51:02,840 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:51:10,023 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3.50490 ± 65.545
2025-09-14 08:51:10,023 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-95.2187), np.float32(-83.25494), np.float32(67.50227), np.float32(-21.887157), np.float32(52.783318), np.float32(89.59794), np.float32(66.11311), np.float32(-46.091705), np.float32(51.364468), np.float32(-45.85964)]
2025-09-14 08:51:10,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:51:10,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3.50) for latency 12
2025-09-14 08:51:10,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 22 minutes, 10 seconds)
2025-09-14 08:53:50,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:53:58,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 363.87640 ± 367.246
2025-09-14 08:53:58,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(83.41115), np.float32(75.169266), np.float32(116.28574), np.float32(17.478224), np.float32(269.56644), np.float32(779.1923), np.float32(1049.8546), np.float32(893.2925), np.float32(135.38123), np.float32(219.13274)]
2025-09-14 08:53:58,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:53:58,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (363.88) for latency 12
2025-09-14 08:53:58,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 21 minutes, 58 seconds)
2025-09-14 08:56:46,608 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:56:55,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 511.98700 ± 319.266
2025-09-14 08:56:55,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(129.36102), np.float32(919.07355), np.float32(472.04846), np.float32(344.50043), np.float32(109.204094), np.float32(503.47736), np.float32(968.8486), np.float32(112.61651), np.float32(741.94745), np.float32(818.79297)]
2025-09-14 08:56:55,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:56:55,713 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (511.99) for latency 12
2025-09-14 08:56:55,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 23 minutes, 31 seconds)
2025-09-14 09:00:07,806 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:00:17,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 887.11292 ± 190.771
2025-09-14 09:00:17,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(854.62384), np.float32(687.9562), np.float32(1104.503), np.float32(931.43774), np.float32(1179.9594), np.float32(583.5946), np.float32(1019.02765), np.float32(1052.8937), np.float32(706.7459), np.float32(750.38715)]
2025-09-14 09:00:17,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:00:17,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (887.11) for latency 12
2025-09-14 09:00:17,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 34 minutes, 40 seconds)
2025-09-14 09:03:32,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:03:42,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1014.88849 ± 252.950
2025-09-14 09:03:42,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1472.0203), np.float32(1194.5599), np.float32(1061.6477), np.float32(1092.1884), np.float32(871.27203), np.float32(1303.8458), np.float32(558.5025), np.float32(789.9885), np.float32(935.1358), np.float32(869.7241)]
2025-09-14 09:03:42,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:03:42,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1014.89) for latency 12
2025-09-14 09:03:42,671 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 45 minutes, 47 seconds)
2025-09-14 09:06:53,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:07:03,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1091.64185 ± 229.318
2025-09-14 09:07:03,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(940.0838), np.float32(1176.4982), np.float32(983.3495), np.float32(1264.6512), np.float32(954.44183), np.float32(939.2761), np.float32(1062.8928), np.float32(931.3661), np.float32(1699.3579), np.float32(964.4998)]
2025-09-14 09:07:03,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:07:03,114 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1091.64) for latency 12
2025-09-14 09:07:03,116 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 52 minutes, 16 seconds)
2025-09-14 09:10:11,780 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:10:21,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1280.08325 ± 252.482
2025-09-14 09:10:21,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1812.4728), np.float32(1335.5134), np.float32(1391.9146), np.float32(1266.9148), np.float32(1074.5698), np.float32(1033.8711), np.float32(1249.127), np.float32(1595.5092), np.float32(1029.7489), np.float32(1011.1903)]
2025-09-14 09:10:21,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:10:21,329 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1280.08) for latency 12
2025-09-14 09:10:21,331 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 58 minutes, 8 seconds)
2025-09-14 09:13:30,375 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:13:40,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1309.15222 ± 374.375
2025-09-14 09:13:40,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2132.8018), np.float32(1725.6218), np.float32(1478.8575), np.float32(953.6981), np.float32(990.4542), np.float32(928.3269), np.float32(1118.1537), np.float32(1058.7404), np.float32(1502.5055), np.float32(1202.3619)]
2025-09-14 09:13:40,007 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:13:40,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1309.15) for latency 12
2025-09-14 09:13:40,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 5 hours, 1 minute, 17 seconds)
2025-09-14 09:16:49,442 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:16:59,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1434.44800 ± 293.762
2025-09-14 09:16:59,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1548.355), np.float32(1322.0938), np.float32(1213.6274), np.float32(1455.902), np.float32(1358.4482), np.float32(1277.449), np.float32(1133.3752), np.float32(1142.1716), np.float32(1777.4259), np.float32(2115.6328)]
2025-09-14 09:16:59,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:16:59,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1434.45) for latency 12
2025-09-14 09:16:59,040 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 57 minutes, 6 seconds)
2025-09-14 09:20:09,348 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:20:19,664 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1393.71033 ± 354.070
2025-09-14 09:20:19,664 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1310.5403), np.float32(1274.8795), np.float32(1243.158), np.float32(1137.6141), np.float32(1257.3157), np.float32(2011.7509), np.float32(2083.6213), np.float32(1020.1536), np.float32(1533.9147), np.float32(1064.1549)]
2025-09-14 09:20:19,665 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:20:19,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 52 minutes, 27 seconds)
2025-09-14 09:23:40,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:23:51,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1535.70068 ± 281.515
2025-09-14 09:23:51,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1820.3871), np.float32(1333.0446), np.float32(1159.49), np.float32(1267.0698), np.float32(1423.4071), np.float32(1659.9094), np.float32(1998.601), np.float32(1485.1272), np.float32(1922.9637), np.float32(1287.0078)]
2025-09-14 09:23:51,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:51,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1535.70) for latency 12
2025-09-14 09:23:51,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 52 minutes, 19 seconds)
2025-09-14 09:27:12,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:27:22,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1423.55688 ± 249.702
2025-09-14 09:27:22,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1684.4846), np.float32(1115.3944), np.float32(977.3389), np.float32(1287.496), np.float32(1620.9666), np.float32(1243.3823), np.float32(1739.5585), np.float32(1681.3331), np.float32(1489.5072), np.float32(1396.1082)]
2025-09-14 09:27:22,235 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:27:22,238 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 52 minutes, 39 seconds)
2025-09-14 09:30:43,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:30:53,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1623.98718 ± 273.181
2025-09-14 09:30:53,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1888.0007), np.float32(1184.6814), np.float32(1467.5211), np.float32(1901.3019), np.float32(2071.8137), np.float32(1358.9921), np.float32(1671.3248), np.float32(1813.8247), np.float32(1497.4813), np.float32(1384.9307)]
2025-09-14 09:30:53,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:53,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (1623.99) for latency 12
2025-09-14 09:30:53,845 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 52 minutes, 55 seconds)
2025-09-14 09:34:01,613 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:34:10,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1520.43494 ± 456.400
2025-09-14 09:34:10,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1292.2948), np.float32(1275.8701), np.float32(1538.4558), np.float32(1193.7001), np.float32(1284.0627), np.float32(1381.2532), np.float32(1180.5668), np.float32(1905.0139), np.float32(1403.199), np.float32(2749.9336)]
2025-09-14 09:34:10,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:34:10,137 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 48 minutes, 42 seconds)
2025-09-14 09:36:55,714 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:37:03,050 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2053.19580 ± 571.094
2025-09-14 09:37:03,051 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1700.2079), np.float32(2859.82), np.float32(1312.8163), np.float32(2007.5507), np.float32(2999.0957), np.float32(2298.4023), np.float32(1618.7765), np.float32(1343.0652), np.float32(2547.7373), np.float32(1844.4839)]
2025-09-14 09:37:03,051 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:37:03,051 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2053.20) for latency 12
2025-09-14 09:37:03,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 37 minutes, 36 seconds)
2025-09-14 09:39:36,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:39:43,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2059.17773 ± 491.880
2025-09-14 09:39:43,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2370.856), np.float32(1756.9246), np.float32(1617.4698), np.float32(1529.8834), np.float32(1896.9404), np.float32(2664.9692), np.float32(2028.3545), np.float32(1601.0583), np.float32(3130.3672), np.float32(1994.9539)]
2025-09-14 09:39:43,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:39:43,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2059.18) for latency 12
2025-09-14 09:39:43,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 20 minutes, 20 seconds)
2025-09-14 09:42:16,431 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:42:23,864 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1794.37537 ± 548.663
2025-09-14 09:42:23,864 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2797.39), np.float32(1459.347), np.float32(1449.1761), np.float32(1273.7183), np.float32(1506.4452), np.float32(2240.5044), np.float32(1270.9995), np.float32(1295.6653), np.float32(2066.844), np.float32(2583.6636)]
2025-09-14 09:42:23,864 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:42:23,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 3 minutes, 26 seconds)
2025-09-14 09:45:20,898 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:45:31,427 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2068.21924 ± 662.278
2025-09-14 09:45:31,427 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2453.5063), np.float32(3298.546), np.float32(1461.3629), np.float32(1525.0319), np.float32(1474.4601), np.float32(1498.5537), np.float32(2101.9297), np.float32(1508.8745), np.float32(3080.6562), np.float32(2279.2715)]
2025-09-14 09:45:31,427 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:45:31,427 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2068.22) for latency 12
2025-09-14 09:45:31,431 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 54 minutes, 1 second)
2025-09-14 09:48:55,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:49:06,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2712.24609 ± 718.578
2025-09-14 09:49:06,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3041.3037), np.float32(1930.0754), np.float32(1621.5974), np.float32(3233.6995), np.float32(3383.369), np.float32(3612.953), np.float32(3242.9), np.float32(2003.2196), np.float32(3185.1475), np.float32(1868.1948)]
2025-09-14 09:49:06,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:49:06,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2712.25) for latency 12
2025-09-14 09:49:06,006 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 55 minutes, 54 seconds)
2025-09-14 09:52:31,372 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:52:41,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 1943.25659 ± 589.729
2025-09-14 09:52:41,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1578.0999), np.float32(1364.7616), np.float32(1788.3093), np.float32(2260.2231), np.float32(2863.853), np.float32(1477.0168), np.float32(1299.1198), np.float32(3106.968), np.float32(1975.2208), np.float32(1718.9934)]
2025-09-14 09:52:41,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:52:41,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 4 minutes, 4 seconds)
2025-09-14 09:56:07,201 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:56:17,625 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2331.88135 ± 659.562
2025-09-14 09:56:17,626 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1306.2157), np.float32(1901.9558), np.float32(2032.878), np.float32(2364.1865), np.float32(2607.3022), np.float32(2533.394), np.float32(3566.4656), np.float32(3285.9846), np.float32(1743.9404), np.float32(1976.4908)]
2025-09-14 09:56:17,626 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:56:17,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 15 minutes, 8 seconds)
2025-09-14 09:59:42,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:59:52,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2538.50635 ± 660.707
2025-09-14 09:59:52,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1866.9924), np.float32(1985.9763), np.float32(1785.4291), np.float32(2878.6685), np.float32(3547.357), np.float32(3383.6245), np.float32(1924.884), np.float32(3288.5234), np.float32(2051.3464), np.float32(2672.2644)]
2025-09-14 09:59:52,436 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:59:52,440 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 25 minutes, 38 seconds)
2025-09-14 10:03:16,945 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:03:27,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2806.86865 ± 634.956
2025-09-14 10:03:27,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2767.1235), np.float32(3376.0354), np.float32(3333.2112), np.float32(2656.217), np.float32(3258.6257), np.float32(1753.8978), np.float32(2633.216), np.float32(3113.927), np.float32(3556.0852), np.float32(1620.3481)]
2025-09-14 10:03:27,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:03:27,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (2806.87) for latency 12
2025-09-14 10:03:27,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 29 minutes)
2025-09-14 10:06:51,964 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:07:02,466 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2237.03394 ± 923.498
2025-09-14 10:07:02,466 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3418.6194), np.float32(1386.4749), np.float32(1308.5404), np.float32(2350.2256), np.float32(3435.6287), np.float32(1062.259), np.float32(3469.472), np.float32(1184.7076), np.float32(2537.9224), np.float32(2216.4907)]
2025-09-14 10:07:02,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:07:02,470 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 25 minutes, 31 seconds)
2025-09-14 10:10:26,928 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:10:37,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2275.16113 ± 700.564
2025-09-14 10:10:37,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1187.6633), np.float32(3265.3938), np.float32(1496.2968), np.float32(2002.5636), np.float32(1629.8252), np.float32(2176.9683), np.float32(2227.9749), np.float32(3362.6755), np.float32(2913.0715), np.float32(2489.1785)]
2025-09-14 10:10:37,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:10:37,308 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 21 minutes, 42 seconds)
2025-09-14 10:14:01,328 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:14:11,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2049.17334 ± 714.222
2025-09-14 10:14:11,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2938.6733), np.float32(3267.5127), np.float32(1306.6084), np.float32(1792.4801), np.float32(1485.3712), np.float32(1596.7249), np.float32(3086.6995), np.float32(1423.8097), np.float32(1583.7491), np.float32(2010.1044)]
2025-09-14 10:14:11,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:14:11,796 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 17 minutes, 48 seconds)
2025-09-14 10:17:37,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:17:47,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3276.80005 ± 841.132
2025-09-14 10:17:47,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2070.9746), np.float32(3370.643), np.float32(3006.6746), np.float32(1798.2537), np.float32(4161.337), np.float32(4022.6912), np.float32(2530.6619), np.float32(3884.4468), np.float32(3685.7202), np.float32(4236.5947)]
2025-09-14 10:17:47,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:47,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3276.80) for latency 12
2025-09-14 10:17:47,589 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 14 minutes, 27 seconds)
2025-09-14 10:21:11,929 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:21:22,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3369.72217 ± 1149.110
2025-09-14 10:21:22,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3505.0295), np.float32(4296.084), np.float32(4044.8425), np.float32(4316.6577), np.float32(1162.687), np.float32(3944.168), np.float32(3219.9084), np.float32(1160.8109), np.float32(4118.366), np.float32(3928.669)]
2025-09-14 10:21:22,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:21:22,468 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3369.72) for latency 12
2025-09-14 10:21:22,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 10 minutes, 50 seconds)
2025-09-14 10:24:47,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:24:58,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 2756.24731 ± 1153.880
2025-09-14 10:24:58,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1863.2925), np.float32(2758.6226), np.float32(3326.631), np.float32(2042.223), np.float32(1209.3033), np.float32(4917.405), np.float32(3018.852), np.float32(2370.277), np.float32(4486.126), np.float32(1569.742)]
2025-09-14 10:24:58,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:24:58,238 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 4 hours, 7 minutes, 25 seconds)
2025-09-14 10:28:22,062 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:28:32,488 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3379.84375 ± 1014.809
2025-09-14 10:28:32,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1147.0039), np.float32(4039.7246), np.float32(3855.155), np.float32(3311.3257), np.float32(4427.361), np.float32(3559.8062), np.float32(2221.148), np.float32(2828.9937), np.float32(3697.6367), np.float32(4710.282)]
2025-09-14 10:28:32,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:28:32,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3379.84) for latency 12
2025-09-14 10:28:32,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 4 hours, 3 minutes, 42 seconds)
2025-09-14 10:31:56,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:32:06,574 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3439.40161 ± 1169.459
2025-09-14 10:32:06,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1183.8583), np.float32(4014.4187), np.float32(3809.9817), np.float32(4004.6882), np.float32(1158.0432), np.float32(4304.5415), np.float32(4203.386), np.float32(4183.353), np.float32(4264.856), np.float32(3266.8877)]
2025-09-14 10:32:06,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:32:06,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3439.40) for latency 12
2025-09-14 10:32:06,579 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 4 hours, 2 seconds)
2025-09-14 10:35:31,410 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:35:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3884.44531 ± 895.965
2025-09-14 10:35:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4704.147), np.float32(4351.668), np.float32(2060.581), np.float32(2472.3574), np.float32(4861.7856), np.float32(3504.8645), np.float32(3953.437), np.float32(4078.91), np.float32(4632.4717), np.float32(4224.232)]
2025-09-14 10:35:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:35:42,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (3884.45) for latency 12
2025-09-14 10:35:42,013 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 56 minutes, 22 seconds)
2025-09-14 10:39:06,368 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:39:16,859 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3465.46240 ± 1087.192
2025-09-14 10:39:16,860 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4231.721), np.float32(1540.2549), np.float32(4340.6143), np.float32(3024.3552), np.float32(1414.0487), np.float32(3613.9065), np.float32(4434.11), np.float32(3453.1917), np.float32(4294.2964), np.float32(4308.127)]
2025-09-14 10:39:16,860 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:39:16,865 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 52 minutes, 47 seconds)
2025-09-14 10:42:40,426 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:42:50,956 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3377.43164 ± 1154.108
2025-09-14 10:42:50,956 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2181.2275), np.float32(4167.0664), np.float32(4446.4253), np.float32(4196.5693), np.float32(1741.9119), np.float32(4449.445), np.float32(4503.6436), np.float32(1883.8514), np.float32(2107.9553), np.float32(4096.219)]
2025-09-14 10:42:50,956 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:42:50,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 48 minutes, 50 seconds)
2025-09-14 10:46:13,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:46:24,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4034.53955 ± 1055.743
2025-09-14 10:46:24,121 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4489.2715), np.float32(4266.759), np.float32(1349.5079), np.float32(4840.3384), np.float32(4042.9172), np.float32(4404.014), np.float32(4688.5513), np.float32(4730.3115), np.float32(4724.671), np.float32(2809.0498)]
2025-09-14 10:46:24,121 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:46:24,121 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4034.54) for latency 12
2025-09-14 10:46:24,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 45 minutes, 2 seconds)
2025-09-14 10:49:46,580 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:49:56,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4363.83057 ± 666.683
2025-09-14 10:49:56,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4883.319), np.float32(4589.6636), np.float32(4744.2417), np.float32(4206.1885), np.float32(4700.015), np.float32(3958.1392), np.float32(4835.843), np.float32(2534.2952), np.float32(4618.9233), np.float32(4567.678)]
2025-09-14 10:49:56,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:49:56,884 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4363.83) for latency 12
2025-09-14 10:49:56,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 41 minutes, 11 seconds)
2025-09-14 10:53:17,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:53:27,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4544.03125 ± 327.000
2025-09-14 10:53:27,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4565.262), np.float32(4802.1636), np.float32(4668.9243), np.float32(4775.5386), np.float32(4788.8203), np.float32(4873.973), np.float32(3897.4153), np.float32(4753.5586), np.float32(4196.9526), np.float32(4117.7036)]
2025-09-14 10:53:27,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:53:27,917 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4544.03) for latency 12
2025-09-14 10:53:27,922 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 36 minutes, 44 seconds)
2025-09-14 10:56:45,087 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:56:55,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3600.92383 ± 1114.451
2025-09-14 10:56:55,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2270.6755), np.float32(4220.811), np.float32(2721.9187), np.float32(4689.0493), np.float32(4601.228), np.float32(3010.6226), np.float32(4818.5137), np.float32(4926.979), np.float32(2971.5332), np.float32(1777.9119)]
2025-09-14 10:56:55,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:56:55,202 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 31 minutes, 40 seconds)
2025-09-14 11:00:10,433 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:00:19,743 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4341.94434 ± 742.174
2025-09-14 11:00:19,743 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4737.8984), np.float32(4901.4536), np.float32(4657.9697), np.float32(4102.28), np.float32(4480.246), np.float32(4247.646), np.float32(4753.7905), np.float32(2224.7654), np.float32(4618.9033), np.float32(4694.49)]
2025-09-14 11:00:19,743 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:00:19,748 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 26 minutes, 15 seconds)
2025-09-14 11:03:22,837 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:03:32,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3867.37183 ± 1301.553
2025-09-14 11:03:32,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4957.2188), np.float32(4907.7144), np.float32(4733.244), np.float32(2613.4668), np.float32(1587.8358), np.float32(4260.9404), np.float32(4810.402), np.float32(4208.419), np.float32(1663.011), np.float32(4931.467)]
2025-09-14 11:03:32,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:03:32,042 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 18 minutes, 43 seconds)
2025-09-14 11:06:35,236 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:06:43,562 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3205.25732 ± 1066.291
2025-09-14 11:06:43,563 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3428.8728), np.float32(2807.639), np.float32(2001.269), np.float32(2686.1116), np.float32(4396.2354), np.float32(2355.0833), np.float32(4761.156), np.float32(3453.4749), np.float32(4628.656), np.float32(1534.0763)]
2025-09-14 11:06:43,563 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:06:43,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 11 minutes, 16 seconds)
2025-09-14 11:09:32,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:09:39,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4428.99023 ± 622.355
2025-09-14 11:09:39,910 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4701.713), np.float32(4660.202), np.float32(4935.8076), np.float32(4517.2188), np.float32(4997.2026), np.float32(4882.358), np.float32(4640.9688), np.float32(4154.621), np.float32(2802.3606), np.float32(3997.4495)]
2025-09-14 11:09:39,910 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:09:39,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 1 minute, 26 seconds)
2025-09-14 11:12:09,937 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:12:17,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4119.79395 ± 1053.157
2025-09-14 11:12:17,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4687.4907), np.float32(2896.9353), np.float32(4346.419), np.float32(4879.929), np.float32(5136.8755), np.float32(5008.902), np.float32(4352.124), np.float32(3036.248), np.float32(4948.0664), np.float32(1904.9534)]
2025-09-14 11:12:17,249 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:12:17,253 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 49 minutes, 2 seconds)
2025-09-14 11:14:47,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:14:55,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4408.87646 ± 970.166
2025-09-14 11:14:55,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3415.3982), np.float32(4225.2563), np.float32(5145.78), np.float32(1917.925), np.float32(4875.877), np.float32(5096.1562), np.float32(4521.913), np.float32(4986.685), np.float32(5043.634), np.float32(4860.1396)]
2025-09-14 11:14:55,052 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:14:55,056 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 37 minutes, 33 seconds)
2025-09-14 11:17:25,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:17:32,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3878.39136 ± 1281.137
2025-09-14 11:17:32,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1384.8153), np.float32(4507.6587), np.float32(4851.9346), np.float32(5139.8184), np.float32(3289.2837), np.float32(3511.3538), np.float32(4506.559), np.float32(4795.6562), np.float32(4992.4316), np.float32(1804.4045)]
2025-09-14 11:17:32,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:17:32,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 28 minutes, 33 seconds)
2025-09-14 11:20:03,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:20:10,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4215.80322 ± 717.207
2025-09-14 11:20:10,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5003.265), np.float32(4777.0405), np.float32(3953.4043), np.float32(3524.559), np.float32(4337.2266), np.float32(2809.5774), np.float32(5025.444), np.float32(4801.37), np.float32(3441.665), np.float32(4484.4814)]
2025-09-14 11:20:10,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:20:10,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 19 minutes, 56 seconds)
2025-09-14 11:22:41,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:22:48,400 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3873.51440 ± 1256.783
2025-09-14 11:22:48,400 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3366.1106), np.float32(4984.772), np.float32(4948.587), np.float32(4779.792), np.float32(4820.66), np.float32(4152.045), np.float32(1613.8109), np.float32(3532.1704), np.float32(1625.9968), np.float32(4911.1997)]
2025-09-14 11:22:48,400 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:22:48,404 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 14 minutes, 2 seconds)
2025-09-14 11:25:18,470 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:25:25,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3735.33203 ± 1329.276
2025-09-14 11:25:25,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3869.3154), np.float32(4345.7666), np.float32(1448.4866), np.float32(4505.6367), np.float32(5070.003), np.float32(4665.242), np.float32(4604.9585), np.float32(1986.3744), np.float32(1878.4498), np.float32(4979.0894)]
2025-09-14 11:25:25,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:25:25,854 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 11 minutes, 26 seconds)
2025-09-14 11:27:55,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:28:03,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3901.05151 ± 1446.904
2025-09-14 11:28:03,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3176.6504), np.float32(4732.355), np.float32(1129.5881), np.float32(4972.981), np.float32(4918.343), np.float32(1333.1495), np.float32(3943.6362), np.float32(5088.492), np.float32(4953.603), np.float32(4761.717)]
2025-09-14 11:28:03,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:28:03,381 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 8 minutes, 45 seconds)
2025-09-14 11:30:33,714 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:30:41,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3958.65161 ± 1242.960
2025-09-14 11:30:41,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4703.8755), np.float32(5058.7095), np.float32(4893.95), np.float32(4771.442), np.float32(4828.4067), np.float32(2576.5623), np.float32(4298.89), np.float32(1911.659), np.float32(4726.1064), np.float32(1816.9176)]
2025-09-14 11:30:41,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:30:41,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 6 minutes, 6 seconds)
2025-09-14 11:33:11,011 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:33:18,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3895.10669 ± 1489.093
2025-09-14 11:33:18,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5031.851), np.float32(2773.5598), np.float32(4828.284), np.float32(4421.6016), np.float32(4975.9556), np.float32(4943.545), np.float32(4880.1685), np.float32(4705.788), np.float32(1222.1229), np.float32(1168.193)]
2025-09-14 11:33:18,538 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:33:18,543 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 3 minutes, 23 seconds)
2025-09-14 11:35:48,697 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:35:56,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3803.00073 ± 1438.885
2025-09-14 11:35:56,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5208.4272), np.float32(5077.4976), np.float32(1493.5826), np.float32(1186.3398), np.float32(4995.0537), np.float32(3563.4375), np.float32(3292.9268), np.float32(5084.94), np.float32(4902.3384), np.float32(3225.462)]
2025-09-14 11:35:56,230 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:35:56,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 48 seconds)
2025-09-14 11:38:26,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:38:34,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4274.25879 ± 903.608
2025-09-14 11:38:34,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4954.3433), np.float32(4561.0894), np.float32(5058.9473), np.float32(4935.5093), np.float32(3446.2058), np.float32(2115.5999), np.float32(3586.58), np.float32(5076.35), np.float32(4507.026), np.float32(4500.9346)]
2025-09-14 11:38:34,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:38:34,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 58 minutes, 15 seconds)
2025-09-14 11:41:04,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:41:12,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4849.76807 ± 342.194
2025-09-14 11:41:12,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5005.7456), np.float32(5074.152), np.float32(5139.0635), np.float32(4744.0137), np.float32(4359.343), np.float32(4065.5488), np.float32(4907.882), np.float32(5048.2183), np.float32(5077.337), np.float32(5076.375)]
2025-09-14 11:41:12,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:41:12,016 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4849.77) for latency 12
2025-09-14 11:41:12,020 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 55 minutes, 40 seconds)
2025-09-14 11:43:42,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:43:49,724 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4549.01514 ± 788.592
2025-09-14 11:43:49,724 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5050.395), np.float32(3046.011), np.float32(4706.077), np.float32(5203.846), np.float32(4575.2734), np.float32(2980.4648), np.float32(5135.3774), np.float32(4930.74), np.float32(5010.1294), np.float32(4851.8354)]
2025-09-14 11:43:49,724 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:43:49,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 53 minutes, 1 second)
2025-09-14 11:46:20,074 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:46:27,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4877.75049 ± 313.807
2025-09-14 11:46:27,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5182.974), np.float32(5000.679), np.float32(4202.1025), np.float32(5036.583), np.float32(4965.3887), np.float32(4365.442), np.float32(5055.7485), np.float32(4850.915), np.float32(4943.464), np.float32(5174.204)]
2025-09-14 11:46:27,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:46:27,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4877.75) for latency 12
2025-09-14 11:46:27,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 50 minutes, 27 seconds)
2025-09-14 11:48:57,623 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:49:05,045 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4742.06494 ± 177.323
2025-09-14 11:49:05,045 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4544.69), np.float32(4467.0645), np.float32(4777.4565), np.float32(4475.9995), np.float32(4714.022), np.float32(4886.573), np.float32(4906.5874), np.float32(4919.442), np.float32(4958.418), np.float32(4770.395)]
2025-09-14 11:49:05,045 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:49:05,050 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 47 minutes, 48 seconds)
2025-09-14 11:51:35,269 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:51:42,696 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3869.23291 ± 1271.099
2025-09-14 11:51:42,696 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5124.069), np.float32(3034.411), np.float32(3846.7024), np.float32(2355.3765), np.float32(4948.022), np.float32(1462.244), np.float32(5145.623), np.float32(2955.1187), np.float32(4949.0303), np.float32(4871.7285)]
2025-09-14 11:51:42,696 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:51:42,701 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 45 minutes, 7 seconds)
2025-09-14 11:54:12,583 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:54:20,031 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4243.84717 ± 849.825
2025-09-14 11:54:20,031 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4880.7485), np.float32(4746.7993), np.float32(4961.4497), np.float32(4296.887), np.float32(4232.9766), np.float32(4694.901), np.float32(4964.1104), np.float32(2733.3755), np.float32(2508.5588), np.float32(4418.665)]
2025-09-14 11:54:20,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:54:20,040 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 42 minutes, 26 seconds)
2025-09-14 11:56:49,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:56:57,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4918.49121 ± 154.473
2025-09-14 11:56:57,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5044.2676), np.float32(4852.1763), np.float32(4742.0854), np.float32(5063.1235), np.float32(4682.98), np.float32(4759.942), np.float32(5052.694), np.float32(4814.491), np.float32(5071.897), np.float32(5101.2544)]
2025-09-14 11:56:57,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:56:57,162 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4918.49) for latency 12
2025-09-14 11:56:57,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 39 minutes, 44 seconds)
2025-09-14 11:59:26,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:59:34,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4743.03760 ± 556.719
2025-09-14 11:59:34,273 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5192.351), np.float32(3377.7249), np.float32(4173.537), np.float32(4588.3657), np.float32(5058.9644), np.float32(5210.911), np.float32(4633.964), np.float32(5160.88), np.float32(4894.3433), np.float32(5139.335)]
2025-09-14 11:59:34,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:59:34,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 37 minutes, 2 seconds)
2025-09-14 12:02:04,350 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:02:11,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4365.45605 ± 690.389
2025-09-14 12:02:11,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3419.5989), np.float32(4399.6465), np.float32(5104.3853), np.float32(4986.0635), np.float32(4921.217), np.float32(4713.9443), np.float32(3865.4307), np.float32(2927.917), np.float32(4761.013), np.float32(4555.3486)]
2025-09-14 12:02:11,812 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:02:11,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 34 minutes, 24 seconds)
2025-09-14 12:04:42,111 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:04:49,565 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4172.00000 ± 1032.559
2025-09-14 12:04:49,566 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4822.1997), np.float32(2859.316), np.float32(5105.4893), np.float32(3583.4263), np.float32(2228.6726), np.float32(5071.6885), np.float32(5127.0547), np.float32(5179.492), np.float32(4420.933), np.float32(3321.7268)]
2025-09-14 12:04:49,566 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:04:49,571 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 31 minutes, 48 seconds)
2025-09-14 12:07:19,122 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:07:26,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4533.96191 ± 1047.998
2025-09-14 12:07:26,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4990.9995), np.float32(5083.0376), np.float32(5020.5107), np.float32(4421.9126), np.float32(5181.762), np.float32(3748.3513), np.float32(5163.9624), np.float32(5073.688), np.float32(1656.1454), np.float32(4999.2446)]
2025-09-14 12:07:26,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:07:26,611 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 29 minutes, 8 seconds)
2025-09-14 12:09:56,823 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:10:04,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4447.00977 ± 1042.719
2025-09-14 12:10:04,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5204.216), np.float32(1417.4), np.float32(4475.071), np.float32(4824.238), np.float32(4860.693), np.float32(4366.7295), np.float32(5049.58), np.float32(4494.2153), np.float32(4740.521), np.float32(5037.438)]
2025-09-14 12:10:04,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:10:04,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 26 minutes, 35 seconds)
2025-09-14 12:12:34,543 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:12:42,018 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4814.01611 ± 515.480
2025-09-14 12:12:42,021 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3348.8308), np.float32(5005.9785), np.float32(5138.611), np.float32(5003.635), np.float32(5002.205), np.float32(5116.603), np.float32(4710.319), np.float32(4634.8013), np.float32(5162.182), np.float32(5016.9917)]
2025-09-14 12:12:42,021 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:12:42,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 24 minutes, 1 second)
2025-09-14 12:15:12,051 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:15:19,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4202.08691 ± 1154.938
2025-09-14 12:15:19,549 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5025.798), np.float32(4351.76), np.float32(5073.556), np.float32(3231.5034), np.float32(4419.4487), np.float32(4386.2334), np.float32(5066.5776), np.float32(5177.6113), np.float32(1177.529), np.float32(4110.852)]
2025-09-14 12:15:19,550 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:15:19,555 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 21 minutes, 23 seconds)
2025-09-14 12:17:49,350 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:17:56,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4630.19336 ± 907.362
2025-09-14 12:17:56,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3822.7185), np.float32(4950.014), np.float32(5202.5977), np.float32(5169.025), np.float32(5159.1245), np.float32(4961.8716), np.float32(4968.451), np.float32(5019.348), np.float32(4895.438), np.float32(2153.3462)]
2025-09-14 12:17:56,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:17:56,773 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 18 minutes, 43 seconds)
2025-09-14 12:20:26,957 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:20:34,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4085.16333 ± 1154.748
2025-09-14 12:20:34,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2196.648), np.float32(3508.5486), np.float32(5059.4517), np.float32(1975.2415), np.float32(5075.5474), np.float32(5140.9146), np.float32(4509.8647), np.float32(5018.3013), np.float32(4842.533), np.float32(3524.5786)]
2025-09-14 12:20:34,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:20:34,348 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 16 minutes, 8 seconds)
2025-09-14 12:23:04,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:23:11,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4928.32812 ± 171.484
2025-09-14 12:23:11,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5096.8613), np.float32(5106.686), np.float32(4747.601), np.float32(4871.6865), np.float32(4797.26), np.float32(4971.898), np.float32(4969.776), np.float32(5089.646), np.float32(5066.8306), np.float32(4565.036)]
2025-09-14 12:23:11,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:23:11,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4928.33) for latency 12
2025-09-14 12:23:11,925 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 13 minutes, 30 seconds)
2025-09-14 12:25:42,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:25:49,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4979.71387 ± 212.460
2025-09-14 12:25:49,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5063.9854), np.float32(5031.5083), np.float32(4752.6035), np.float32(5113.255), np.float32(5123.9434), np.float32(5106.7), np.float32(4546.878), np.float32(4726.1245), np.float32(5081.9507), np.float32(5250.194)]
2025-09-14 12:25:49,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:25:49,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (4979.71) for latency 12
2025-09-14 12:25:49,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 10 minutes, 52 seconds)
2025-09-14 12:28:19,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:28:27,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4659.61914 ± 894.145
2025-09-14 12:28:27,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4658.642), np.float32(4827.8896), np.float32(2015.5281), np.float32(4869.5835), np.float32(4951.001), np.float32(4992.665), np.float32(5028.643), np.float32(4910.0796), np.float32(5253.0024), np.float32(5089.1577)]
2025-09-14 12:28:27,002 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:28:27,008 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 8 minutes, 14 seconds)
2025-09-14 12:30:56,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:31:04,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4644.49072 ± 386.426
2025-09-14 12:31:04,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5059.9937), np.float32(5108.4614), np.float32(4836.6724), np.float32(4706.6494), np.float32(4375.509), np.float32(4257.9937), np.float32(4208.622), np.float32(4860.8877), np.float32(5050.083), np.float32(3980.0366)]
2025-09-14 12:31:04,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:31:04,330 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 5 minutes, 37 seconds)
2025-09-14 12:33:34,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:33:41,776 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4773.18408 ± 1056.617
2025-09-14 12:33:41,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1634.3293), np.float32(5117.738), np.float32(5073.5527), np.float32(5139.7275), np.float32(5332.659), np.float32(5120.459), np.float32(4737.18), np.float32(5210.4595), np.float32(5241.5723), np.float32(5124.1616)]
2025-09-14 12:33:41,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:41,782 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 2 minutes, 59 seconds)
2025-09-14 12:36:11,720 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:36:19,216 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4963.74902 ± 367.339
2025-09-14 12:36:19,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5138.344), np.float32(5217.1606), np.float32(5268.229), np.float32(5155.3867), np.float32(4000.196), np.float32(5182.601), np.float32(4717.4023), np.float32(5142.275), np.float32(4749.625), np.float32(5066.2695)]
2025-09-14 12:36:19,217 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:19,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 21 seconds)
2025-09-14 12:38:49,241 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:38:56,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4894.23340 ± 755.480
2025-09-14 12:38:56,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5118.139), np.float32(2641.215), np.float32(5147.644), np.float32(5098.7285), np.float32(5252.135), np.float32(5200.516), np.float32(5204.486), np.float32(4945.1167), np.float32(5219.5967), np.float32(5114.7544)]
2025-09-14 12:38:56,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:38:56,856 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 57 minutes, 44 seconds)
2025-09-14 12:41:26,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:41:34,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4362.69629 ± 1123.279
2025-09-14 12:41:34,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5113.741), np.float32(5114.024), np.float32(2012.1289), np.float32(5070.906), np.float32(4299.2764), np.float32(2415.8435), np.float32(4410.062), np.float32(4709.9644), np.float32(5285.636), np.float32(5195.377)]
2025-09-14 12:41:34,141 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:41:34,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 55 minutes, 5 seconds)
2025-09-14 12:44:04,136 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:44:11,615 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4414.77930 ± 1106.562
2025-09-14 12:44:11,616 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3001.9512), np.float32(4352.5522), np.float32(5098.678), np.float32(5029.6367), np.float32(5123.392), np.float32(1711.2876), np.float32(5133.4595), np.float32(4406.726), np.float32(5194.5576), np.float32(5095.551)]
2025-09-14 12:44:11,616 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:44:11,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 52 minutes, 29 seconds)
2025-09-14 12:46:41,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:46:48,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4804.68506 ± 996.657
2025-09-14 12:46:48,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5189.733), np.float32(5187.611), np.float32(1835.6184), np.float32(4909.1675), np.float32(4940.6245), np.float32(5263.7563), np.float32(5127.342), np.float32(5299.093), np.float32(5143.9663), np.float32(5149.9424)]
2025-09-14 12:46:48,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:46:48,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 49 minutes, 51 seconds)
2025-09-14 12:49:18,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:49:26,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4473.36572 ± 1262.888
2025-09-14 12:49:26,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5193.186), np.float32(1708.4685), np.float32(5260.165), np.float32(5106.4624), np.float32(5215.848), np.float32(5074.941), np.float32(4789.4736), np.float32(5083.589), np.float32(2231.7185), np.float32(5069.8047)]
2025-09-14 12:49:26,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:49:26,456 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 47 minutes, 14 seconds)
2025-09-14 12:51:56,414 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:52:03,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5025.70557 ± 134.006
2025-09-14 12:52:03,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5149.2), np.float32(5109.9795), np.float32(4908.25), np.float32(5048.548), np.float32(4750.4478), np.float32(5137.667), np.float32(4891.04), np.float32(4981.6553), np.float32(5204.716), np.float32(5075.55)]
2025-09-14 12:52:03,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:52:03,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5025.71) for latency 12
2025-09-14 12:52:03,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 44 minutes, 36 seconds)
2025-09-14 12:54:33,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:54:41,303 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4428.34863 ± 993.619
2025-09-14 12:54:41,303 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3287.7266), np.float32(5224.285), np.float32(5229.4365), np.float32(5363.698), np.float32(2530.7913), np.float32(4799.6533), np.float32(5259.5244), np.float32(3612.4797), np.float32(3675.7158), np.float32(5300.1772)]
2025-09-14 12:54:41,303 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:54:41,312 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 41 minutes, 58 seconds)
2025-09-14 12:57:11,452 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:57:18,890 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4678.82617 ± 1012.734
2025-09-14 12:57:18,890 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5116.9453), np.float32(5254.7954), np.float32(5391.828), np.float32(4843.4014), np.float32(4679.2417), np.float32(5265.3994), np.float32(1732.7252), np.float32(4678.1157), np.float32(4734.026), np.float32(5091.7847)]
2025-09-14 12:57:18,890 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:57:18,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 39 minutes, 21 seconds)
2025-09-14 12:59:48,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:59:56,218 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4064.70972 ± 1506.508
2025-09-14 12:59:56,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1732.5803), np.float32(5335.403), np.float32(5136.2056), np.float32(2010.3085), np.float32(4857.64), np.float32(4693.6694), np.float32(1603.3079), np.float32(5145.5034), np.float32(4975.5767), np.float32(5156.904)]
2025-09-14 12:59:56,219 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:59:56,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 36 minutes, 44 seconds)
2025-09-14 13:02:26,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:02:33,468 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4567.40576 ± 1176.819
2025-09-14 13:02:33,469 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5286.654), np.float32(1665.7996), np.float32(5198.206), np.float32(5216.0356), np.float32(4677.325), np.float32(5341.327), np.float32(2960.9714), np.float32(4990.055), np.float32(5166.5674), np.float32(5171.1167)]
2025-09-14 13:02:33,469 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:33,475 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 34 minutes, 6 seconds)
2025-09-14 13:05:03,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:05:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4908.01465 ± 253.773
2025-09-14 13:05:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5244.196), np.float32(4942.1646), np.float32(5105.8774), np.float32(5081.8687), np.float32(4571.516), np.float32(5060.4863), np.float32(4849.204), np.float32(4791.2354), np.float32(5058.668), np.float32(4374.933)]
2025-09-14 13:05:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:05:10,636 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 31 minutes, 28 seconds)
2025-09-14 13:07:40,493 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:07:47,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 3574.19800 ± 1601.823
2025-09-14 13:07:47,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1799.4349), np.float32(2140.1914), np.float32(2372.2075), np.float32(5349.108), np.float32(5140.9595), np.float32(2297.927), np.float32(5319.1074), np.float32(5265.547), np.float32(4652.4946), np.float32(1405.0038)]
2025-09-14 13:07:47,883 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:07:47,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 28 minutes, 50 seconds)
2025-09-14 13:10:17,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:10:25,148 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4535.48096 ± 1305.923
2025-09-14 13:10:25,148 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1626.5151), np.float32(5082.3374), np.float32(5245.573), np.float32(5292.8633), np.float32(5178.8003), np.float32(5278.029), np.float32(4861.8765), np.float32(5299.605), np.float32(2276.827), np.float32(5212.378)]
2025-09-14 13:10:25,148 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:10:25,154 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 26 minutes, 12 seconds)
2025-09-14 13:12:54,838 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:13:02,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4758.58691 ± 884.904
2025-09-14 13:13:02,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4867.9106), np.float32(5232.3726), np.float32(5300.457), np.float32(5321.7266), np.float32(4831.9253), np.float32(2500.799), np.float32(5368.487), np.float32(5153.676), np.float32(3722.289), np.float32(5286.229)]
2025-09-14 13:13:02,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:13:02,287 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 23 minutes, 34 seconds)
2025-09-14 13:15:32,244 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:15:39,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4999.51562 ± 614.886
2025-09-14 13:15:39,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5334.001), np.float32(5331.937), np.float32(3218.3245), np.float32(5147.0615), np.float32(5403.696), np.float32(5203.1143), np.float32(5244.243), np.float32(5068.819), np.float32(5238.244), np.float32(4805.718)]
2025-09-14 13:15:39,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:15:39,743 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 20 minutes, 58 seconds)
2025-09-14 13:18:09,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:18:17,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 5201.89746 ± 197.058
2025-09-14 13:18:17,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5368.063), np.float32(5058.3335), np.float32(5460.534), np.float32(5119.6646), np.float32(5229.608), np.float32(5102.946), np.float32(4735.975), np.float32(5334.605), np.float32(5298.253), np.float32(5310.9907)]
2025-09-14 13:18:17,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:18:17,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1226 [INFO]: New best (5201.90) for latency 12
2025-09-14 13:18:17,285 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 18 minutes, 21 seconds)
2025-09-14 13:20:47,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:20:54,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4813.25098 ± 674.237
2025-09-14 13:20:54,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5149.0176), np.float32(5057.328), np.float32(5344.052), np.float32(5003.737), np.float32(5205.076), np.float32(5074.326), np.float32(4805.9375), np.float32(5009.256), np.float32(4607.865), np.float32(2875.9175)]
2025-09-14 13:20:54,763 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:20:54,769 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 15 minutes, 44 seconds)
2025-09-14 13:23:24,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:23:32,314 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4627.31152 ± 740.305
2025-09-14 13:23:32,314 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2881.902), np.float32(4863.016), np.float32(3920.138), np.float32(5390.1675), np.float32(4977.839), np.float32(4410.248), np.float32(5280.431), np.float32(4926.45), np.float32(5332.326), np.float32(4290.5977)]
2025-09-14 13:23:32,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:23:32,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 13 minutes, 7 seconds)
2025-09-14 13:26:02,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:26:09,708 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4656.41113 ± 1119.158
2025-09-14 13:26:09,708 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5188.49), np.float32(1414.0219), np.float32(4311.951), np.float32(5323.006), np.float32(5207.8096), np.float32(5099.756), np.float32(5131.269), np.float32(5112.466), np.float32(4637.764), np.float32(5137.579)]
2025-09-14 13:26:09,708 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:26:09,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 10 minutes, 29 seconds)
2025-09-14 13:28:39,220 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:28:46,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4713.34814 ± 752.976
2025-09-14 13:28:46,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4782.0986), np.float32(4982.4976), np.float32(5228.961), np.float32(5292.497), np.float32(5232.966), np.float32(4703.1113), np.float32(3398.9998), np.float32(5154.7974), np.float32(5234.0635), np.float32(3123.488)]
2025-09-14 13:28:46,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:28:46,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 7 minutes, 52 seconds)
2025-09-14 13:31:16,242 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:31:23,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4885.62891 ± 949.951
2025-09-14 13:31:23,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5266.7163), np.float32(5158.383), np.float32(5127.408), np.float32(2043.4066), np.float32(5234.935), np.float32(5057.3364), np.float32(5183.807), np.float32(5287.6807), np.float32(5212.0435), np.float32(5284.576)]
2025-09-14 13:31:23,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:31:23,734 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 14 seconds)
2025-09-14 13:33:53,111 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:34:00,402 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4925.18311 ± 225.455
2025-09-14 13:34:00,402 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4938.976), np.float32(4648.7954), np.float32(5027.4736), np.float32(4586.4956), np.float32(4843.378), np.float32(5125.8955), np.float32(5150.1997), np.float32(5037.126), np.float32(4634.1133), np.float32(5259.3774)]
2025-09-14 13:34:00,402 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:34:00,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 37 seconds)
2025-09-14 13:36:16,585 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:36:22,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1221 [DEBUG]: Total Reward: 4866.18848 ± 736.135
2025-09-14 13:36:22,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5113.2515), np.float32(5305.5176), np.float32(5209.02), np.float32(2698.7703), np.float32(4866.007), np.float32(5175.7036), np.float32(4850.3325), np.float32(5138.353), np.float32(5240.8643), np.float32(5064.0693)]
2025-09-14 13:36:22,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:36:22,998 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille50-halfcheetah):1251 [DEBUG]: Training session finished
