2025-09-14 08:43:01,518 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_6
2025-09-14 08:43:01,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_6
2025-09-14 08:43:01,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'6': <latency_env.delayed_mdp.ConstantDelay object at 0x7f8b3bf3ba10>}
2025-09-14 08:43:01,519 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,645 baseline-bpql-noisepromille100-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=53, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,645 baseline-bpql-noisepromille100-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,426 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,426 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:46:18,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:46:25,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -447.58365 ± 19.161
2025-09-14 08:46:25,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-460.67764), np.float32(-431.45474), np.float32(-455.84827), np.float32(-424.13437), np.float32(-443.33823), np.float32(-436.56046), np.float32(-463.78864), np.float32(-487.7503), np.float32(-423.02344), np.float32(-449.26016)]
2025-09-14 08:46:25,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:46:25,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-447.58) for latency 6
2025-09-14 08:46:25,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 5 hours, 33 minutes, 28 seconds)
2025-09-14 08:49:43,068 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:49:49,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -224.53922 ± 33.178
2025-09-14 08:49:49,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-263.26273), np.float32(-260.29724), np.float32(-193.18005), np.float32(-170.06688), np.float32(-270.37613), np.float32(-246.15147), np.float32(-208.52193), np.float32(-206.4305), np.float32(-234.59221), np.float32(-192.51297)]
2025-09-14 08:49:49,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:49:49,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-224.54) for latency 6
2025-09-14 08:49:49,781 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 5 hours, 31 minutes, 51 seconds)
2025-09-14 08:53:07,625 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:53:14,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 203.13568 ± 103.789
2025-09-14 08:53:14,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(68.86841), np.float32(0.31041262), np.float32(159.57312), np.float32(209.17381), np.float32(216.88272), np.float32(355.7976), np.float32(303.27515), np.float32(175.16527), np.float32(232.02208), np.float32(310.2882)]
2025-09-14 08:53:14,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:53:14,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (203.14) for latency 6
2025-09-14 08:53:14,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 5 hours, 29 minutes, 10 seconds)
2025-09-14 08:56:33,990 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 08:56:40,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 896.39227 ± 595.607
2025-09-14 08:56:40,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(211.63895), np.float32(1354.4672), np.float32(-85.637184), np.float32(1491.6348), np.float32(577.39435), np.float32(1520.8016), np.float32(130.13722), np.float32(1449.4427), np.float32(1278.564), np.float32(1035.4799)]
2025-09-14 08:56:40,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:56:40,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (896.39) for latency 6
2025-09-14 08:56:40,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 5 hours, 27 minutes)
2025-09-14 08:59:58,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:00:05,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 475.86533 ± 378.378
2025-09-14 09:00:05,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(610.29663), np.float32(-62.2173), np.float32(413.85898), np.float32(745.5235), np.float32(1001.95514), np.float32(908.4557), np.float32(635.6825), np.float32(5.8404565), np.float32(-93.00692), np.float32(592.26483)]
2025-09-14 09:00:05,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:00:05,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 5 hours, 23 minutes, 39 seconds)
2025-09-14 09:03:25,228 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:03:32,149 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 622.23505 ± 631.118
2025-09-14 09:03:32,149 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1515.7654), np.float32(1073.2273), np.float32(39.2423), np.float32(-121.19224), np.float32(19.795073), np.float32(143.64096), np.float32(702.6263), np.float32(766.22205), np.float32(293.85492), np.float32(1789.1682)]
2025-09-14 09:03:32,149 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:03:32,152 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 5 hours, 21 minutes, 40 seconds)
2025-09-14 09:06:47,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:06:54,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1351.73901 ± 572.515
2025-09-14 09:06:54,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1176.5996), np.float32(702.7865), np.float32(1456.3292), np.float32(1336.4541), np.float32(1336.8942), np.float32(403.80295), np.float32(2525.2595), np.float32(1047.3134), np.float32(1992.8387), np.float32(1539.1125)]
2025-09-14 09:06:54,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:06:54,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1351.74) for latency 6
2025-09-14 09:06:54,671 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 5 hours, 17 minutes, 42 seconds)
2025-09-14 09:10:11,429 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:10:18,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2743.78467 ± 187.377
2025-09-14 09:10:18,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2664.14), np.float32(3059.6465), np.float32(2429.8213), np.float32(2731.2156), np.float32(2951.5989), np.float32(2692.828), np.float32(2849.0332), np.float32(2727.9075), np.float32(2469.6262), np.float32(2862.0303)]
2025-09-14 09:10:18,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:10:18,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2743.78) for latency 6
2025-09-14 09:10:18,958 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 5 hours, 14 minutes, 14 seconds)
2025-09-14 09:13:30,046 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:13:36,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2366.92627 ± 1135.748
2025-09-14 09:13:36,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(981.8221), np.float32(3307.9297), np.float32(3006.0942), np.float32(3306.9512), np.float32(1212.8997), np.float32(192.0248), np.float32(3253.6836), np.float32(1848.5529), np.float32(3179.9373), np.float32(3379.3674)]
2025-09-14 09:13:36,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:13:36,324 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 5 hours, 8 minutes)
2025-09-14 09:16:49,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:16:57,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2324.40894 ± 1134.187
2025-09-14 09:16:57,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(861.53864), np.float32(3208.7336), np.float32(1925.3727), np.float32(3210.8245), np.float32(221.03696), np.float32(3017.3743), np.float32(3317.8296), np.float32(3155.632), np.float32(3249.7131), np.float32(1076.0325)]
2025-09-14 09:16:57,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:16:57,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 5 hours, 3 minutes, 30 seconds)
2025-09-14 09:20:10,388 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:20:17,787 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2192.34375 ± 1281.698
2025-09-14 09:20:17,788 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(274.18988), np.float32(3599.4907), np.float32(3058.9578), np.float32(3376.4539), np.float32(268.95224), np.float32(950.81506), np.float32(1298.9758), np.float32(3486.7095), np.float32(3079.519), np.float32(2529.3723)]
2025-09-14 09:20:17,788 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:20:17,791 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 58 minutes, 20 seconds)
2025-09-14 09:23:32,105 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:23:40,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2641.31494 ± 1040.666
2025-09-14 09:23:40,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2793.5488), np.float32(3162.4067), np.float32(3465.5178), np.float32(3458.486), np.float32(211.74269), np.float32(3000.106), np.float32(3280.5796), np.float32(1060.3544), np.float32(2940.3943), np.float32(3040.0115)]
2025-09-14 09:23:40,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:40,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 54 minutes, 57 seconds)
2025-09-14 09:26:51,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:26:58,533 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2214.01172 ± 984.675
2025-09-14 09:26:58,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1671.8875), np.float32(2221.98), np.float32(2678.5361), np.float32(344.0464), np.float32(887.93555), np.float32(3283.144), np.float32(1694.047), np.float32(2998.253), np.float32(3046.7415), np.float32(3313.5461)]
2025-09-14 09:26:58,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:26:58,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 49 minutes, 52 seconds)
2025-09-14 09:30:10,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:30:18,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2295.09009 ± 1293.803
2025-09-14 09:30:18,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3353.6348), np.float32(2154.4636), np.float32(522.69275), np.float32(3444.4424), np.float32(2678.024), np.float32(713.1681), np.float32(2915.9587), np.float32(70.47804), np.float32(3484.2441), np.float32(3613.7932)]
2025-09-14 09:30:18,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:18,148 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 47 minutes, 11 seconds)
2025-09-14 09:33:32,666 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:33:40,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3328.34619 ± 330.412
2025-09-14 09:33:40,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3611.1204), np.float32(3109.759), np.float32(3650.1714), np.float32(2490.3625), np.float32(3324.5667), np.float32(3512.984), np.float32(3222.9028), np.float32(3251.4502), np.float32(3509.004), np.float32(3601.1396)]
2025-09-14 09:33:40,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:33:40,072 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3328.35) for latency 6
2025-09-14 09:33:40,075 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 44 minutes, 9 seconds)
2025-09-14 09:36:54,010 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:37:02,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2520.47803 ± 951.358
2025-09-14 09:37:02,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3449.4583), np.float32(3263.8975), np.float32(3295.7961), np.float32(1430.671), np.float32(2050.359), np.float32(2241.0457), np.float32(391.10388), np.float32(3140.6792), np.float32(2637.2476), np.float32(3304.5244)]
2025-09-14 09:37:02,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:37:02,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 41 minutes, 17 seconds)
2025-09-14 09:40:13,726 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:40:21,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3239.84180 ± 567.276
2025-09-14 09:40:21,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2840.1826), np.float32(1977.7816), np.float32(2530.286), np.float32(3453.9219), np.float32(3537.6436), np.float32(3614.8843), np.float32(3905.3394), np.float32(3456.4895), np.float32(3452.3196), np.float32(3629.5684)]
2025-09-14 09:40:21,924 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:40:21,927 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 37 minutes, 8 seconds)
2025-09-14 09:43:34,897 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:43:42,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3299.03320 ± 905.174
2025-09-14 09:43:42,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3824.5872), np.float32(1707.9777), np.float32(3873.835), np.float32(3744.727), np.float32(3693.1511), np.float32(3356.3628), np.float32(4133.647), np.float32(3691.827), np.float32(1361.287), np.float32(3602.9307)]
2025-09-14 09:43:42,210 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:43:42,213 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 34 minutes, 20 seconds)
2025-09-14 09:46:55,590 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:47:02,967 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2834.80054 ± 1096.194
2025-09-14 09:47:02,967 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3234.7463), np.float32(1446.644), np.float32(1009.29706), np.float32(3350.6084), np.float32(2784.5447), np.float32(3776.3188), np.float32(3950.0728), np.float32(3799.1738), np.float32(1265.0776), np.float32(3731.521)]
2025-09-14 09:47:02,967 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:47:02,970 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 31 minutes, 18 seconds)
2025-09-14 09:50:15,838 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:50:23,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3454.91138 ± 672.322
2025-09-14 09:50:23,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1586.6621), np.float32(3748.4304), np.float32(3756.9878), np.float32(3719.4014), np.float32(3908.469), np.float32(3485.1199), np.float32(3099.7998), np.float32(3391.3816), np.float32(3949.147), np.float32(3903.7144)]
2025-09-14 09:50:23,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:50:23,171 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3454.91) for latency 6
2025-09-14 09:50:23,174 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 27 minutes, 29 seconds)
2025-09-14 09:53:40,773 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:53:49,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3566.87842 ± 386.895
2025-09-14 09:53:49,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2631.4656), np.float32(4003.353), np.float32(3148.3313), np.float32(3472.1982), np.float32(3641.0754), np.float32(3923.969), np.float32(3633.6045), np.float32(3640.8127), np.float32(3726.206), np.float32(3847.7688)]
2025-09-14 09:53:49,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:53:49,337 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3566.88) for latency 6
2025-09-14 09:53:49,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 4 hours, 25 minutes, 10 seconds)
2025-09-14 09:57:08,556 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 09:57:16,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3410.24268 ± 800.074
2025-09-14 09:57:16,162 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1024.2487), np.float32(3726.724), np.float32(3650.2344), np.float32(3651.7073), np.float32(3843.938), np.float32(3639.5083), np.float32(3655.0283), np.float32(3518.325), np.float32(3786.2798), np.float32(3606.431)]
2025-09-14 09:57:16,162 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:57:16,165 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 23 minutes, 42 seconds)
2025-09-14 10:00:33,952 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:00:41,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3637.90283 ± 180.692
2025-09-14 10:00:41,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3921.6267), np.float32(3611.8833), np.float32(3835.5295), np.float32(3560.458), np.float32(3844.0415), np.float32(3633.5735), np.float32(3438.571), np.float32(3448.9407), np.float32(3721.0803), np.float32(3363.321)]
2025-09-14 10:00:41,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:00:41,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3637.90) for latency 6
2025-09-14 10:00:41,580 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 21 minutes, 38 seconds)
2025-09-14 10:04:00,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:04:09,201 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3429.18896 ± 641.866
2025-09-14 10:04:09,201 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3748.4158), np.float32(3833.5881), np.float32(3855.279), np.float32(3694.1936), np.float32(3530.0806), np.float32(3739.638), np.float32(2590.2163), np.float32(3827.4316), np.float32(3651.4248), np.float32(1821.6213)]
2025-09-14 10:04:09,201 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:04:09,205 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 19 minutes, 58 seconds)
2025-09-14 10:07:28,605 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:07:36,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2938.52515 ± 981.364
2025-09-14 10:07:36,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2295.6174), np.float32(2532.2192), np.float32(3825.5679), np.float32(1571.5618), np.float32(3296.3467), np.float32(983.7369), np.float32(3756.03), np.float32(3710.5144), np.float32(3669.409), np.float32(3744.2495)]
2025-09-14 10:07:36,383 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:07:36,387 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 18 minutes, 18 seconds)
2025-09-14 10:10:56,548 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:11:04,659 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3308.26831 ± 1029.820
2025-09-14 10:11:04,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3758.0244), np.float32(1242.467), np.float32(3618.4106), np.float32(3897.643), np.float32(3861.5808), np.float32(3546.8247), np.float32(3986.8284), np.float32(1291.6484), np.float32(3986.67), np.float32(3892.5862)]
2025-09-14 10:11:04,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:11:04,664 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 15 minutes, 22 seconds)
2025-09-14 10:14:24,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:14:32,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3341.22656 ± 995.666
2025-09-14 10:14:32,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4007.11), np.float32(1392.1337), np.float32(4039.8032), np.float32(3794.7617), np.float32(1352.8575), np.float32(3449.376), np.float32(3841.6145), np.float32(3847.3818), np.float32(3863.087), np.float32(3824.1392)]
2025-09-14 10:14:32,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:14:32,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 12 minutes, 12 seconds)
2025-09-14 10:17:51,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:17:58,714 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3370.93213 ± 826.184
2025-09-14 10:17:58,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3759.172), np.float32(1858.5344), np.float32(3695.7893), np.float32(3930.4504), np.float32(3667.6133), np.float32(1626.8029), np.float32(3616.0132), np.float32(3628.8005), np.float32(4038.26), np.float32(3887.883)]
2025-09-14 10:17:58,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:58,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 8 minutes, 54 seconds)
2025-09-14 10:21:17,221 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:21:25,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3174.92578 ± 1152.173
2025-09-14 10:21:25,706 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1294.2255), np.float32(4027.024), np.float32(1146.904), np.float32(4109.391), np.float32(4000.7798), np.float32(3831.4001), np.float32(3986.3037), np.float32(3530.6033), np.float32(3908.563), np.float32(1914.0631)]
2025-09-14 10:21:25,706 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:21:25,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 5 minutes, 18 seconds)
2025-09-14 10:24:46,885 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:24:54,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3487.04736 ± 843.793
2025-09-14 10:24:54,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3729.9958), np.float32(3902.8213), np.float32(3632.9526), np.float32(3857.3494), np.float32(1217.773), np.float32(4133.1914), np.float32(3943.795), np.float32(3735.592), np.float32(2703.1316), np.float32(4013.8694)]
2025-09-14 10:24:54,660 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:24:54,665 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 2 minutes, 15 seconds)
2025-09-14 10:28:14,965 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:28:22,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3588.61475 ± 369.608
2025-09-14 10:28:22,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2780.0708), np.float32(3882.6702), np.float32(3856.4507), np.float32(3765.5356), np.float32(3067.2405), np.float32(3540.7578), np.float32(4007.2454), np.float32(3766.1724), np.float32(3466.764), np.float32(3753.2402)]
2025-09-14 10:28:22,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:28:22,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 58 minutes, 43 seconds)
2025-09-14 10:31:42,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:31:51,208 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3582.46924 ± 655.987
2025-09-14 10:31:51,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4017.8462), np.float32(4080.776), np.float32(3951.9736), np.float32(4086.1035), np.float32(3563.03), np.float32(2718.0393), np.float32(3763.1511), np.float32(1991.1919), np.float32(3974.3152), np.float32(3678.264)]
2025-09-14 10:31:51,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:31:51,214 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 55 minutes, 24 seconds)
2025-09-14 10:35:09,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:35:16,890 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3691.74341 ± 809.414
2025-09-14 10:35:16,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4107.101), np.float32(4125.111), np.float32(3509.9934), np.float32(1349.0466), np.float32(4114.764), np.float32(4251.483), np.float32(3974.1138), np.float32(3858.9653), np.float32(3671.3723), np.float32(3955.4856)]
2025-09-14 10:35:16,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:35:16,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3691.74) for latency 6
2025-09-14 10:35:16,895 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 51 minutes, 51 seconds)
2025-09-14 10:38:37,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:38:45,946 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3646.26953 ± 316.668
2025-09-14 10:38:45,947 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3747.4958), np.float32(2746.3884), np.float32(3834.9717), np.float32(3804.887), np.float32(3640.445), np.float32(3827.812), np.float32(3515.9841), np.float32(3731.7405), np.float32(3888.2732), np.float32(3724.7002)]
2025-09-14 10:38:45,947 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:38:45,951 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 48 minutes, 51 seconds)
2025-09-14 10:42:07,423 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:42:15,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3920.78271 ± 113.508
2025-09-14 10:42:15,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3819.079), np.float32(3861.8154), np.float32(4039.6182), np.float32(3868.8442), np.float32(4092.153), np.float32(3784.8396), np.float32(4002.1406), np.float32(3746.1226), np.float32(3967.2102), np.float32(4026.004)]
2025-09-14 10:42:15,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:42:15,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3920.78) for latency 6
2025-09-14 10:42:15,149 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 45 minutes, 26 seconds)
2025-09-14 10:45:35,081 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:45:42,681 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3817.74341 ± 87.155
2025-09-14 10:45:42,681 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3937.0337), np.float32(3773.0105), np.float32(3858.2507), np.float32(3802.2317), np.float32(3788.0498), np.float32(3658.0454), np.float32(3733.1946), np.float32(3779.0422), np.float32(3916.8022), np.float32(3931.7727)]
2025-09-14 10:45:42,682 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:45:42,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 41 minutes, 53 seconds)
2025-09-14 10:48:59,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:49:08,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3795.60400 ± 581.576
2025-09-14 10:49:08,316 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4035.505), np.float32(3621.3733), np.float32(2169.5422), np.float32(4278.328), np.float32(3926.431), np.float32(3728.4602), np.float32(4216.6187), np.float32(4153.026), np.float32(3735.9895), np.float32(4090.764)]
2025-09-14 10:49:08,316 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:49:08,321 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 37 minutes, 47 seconds)
2025-09-14 10:52:27,826 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:52:35,490 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3688.44531 ± 577.325
2025-09-14 10:52:35,490 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4163.7036), np.float32(4062.734), np.float32(3892.3948), np.float32(3099.964), np.float32(3758.109), np.float32(3473.6035), np.float32(4120.6465), np.float32(2242.651), np.float32(4107.402), np.float32(3963.2478)]
2025-09-14 10:52:35,490 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:52:35,495 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 34 minutes, 38 seconds)
2025-09-14 10:55:55,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:56:03,723 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3955.97656 ± 160.711
2025-09-14 10:56:03,723 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3838.3254), np.float32(3687.0256), np.float32(3804.7866), np.float32(3976.0647), np.float32(4150.172), np.float32(4215.796), np.float32(3961.8376), np.float32(3916.9526), np.float32(4141.11), np.float32(3867.6938)]
2025-09-14 10:56:03,723 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:56:03,724 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3955.98) for latency 6
2025-09-14 10:56:03,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 31 minutes)
2025-09-14 10:59:26,092 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 10:59:34,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3711.81323 ± 604.077
2025-09-14 10:59:34,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4175.358), np.float32(3912.333), np.float32(3286.0305), np.float32(4271.8438), np.float32(3908.9856), np.float32(4078.9854), np.float32(3613.9443), np.float32(2088.2712), np.float32(3802.9592), np.float32(3979.4219)]
2025-09-14 10:59:34,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:59:34,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 27 minutes, 52 seconds)
2025-09-14 11:02:53,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:03:00,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3928.42041 ± 158.936
2025-09-14 11:03:00,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3897.7532), np.float32(3661.241), np.float32(3801.7385), np.float32(4091.9878), np.float32(3868.4446), np.float32(4043.2642), np.float32(4050.9841), np.float32(3788.0122), np.float32(3865.9187), np.float32(4214.8613)]
2025-09-14 11:03:00,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:03:00,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 24 minutes, 10 seconds)
2025-09-14 11:06:10,517 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:06:18,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3687.92236 ± 707.575
2025-09-14 11:06:18,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4161.5303), np.float32(3956.206), np.float32(3931.903), np.float32(3962.386), np.float32(2180.959), np.float32(4057.784), np.float32(4164.271), np.float32(3836.3123), np.float32(2408.6494), np.float32(4219.2236)]
2025-09-14 11:06:18,170 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:06:18,175 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 19 minutes, 6 seconds)
2025-09-14 11:09:25,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:09:33,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3749.02490 ± 438.525
2025-09-14 11:09:33,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3649.8442), np.float32(3905.8972), np.float32(2523.1672), np.float32(3902.952), np.float32(3613.6875), np.float32(4029.9124), np.float32(3978.612), np.float32(4078.3064), np.float32(3736.8564), np.float32(4071.016)]
2025-09-14 11:09:33,001 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:09:33,005 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 13 minutes, 19 seconds)
2025-09-14 11:12:41,944 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:12:49,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3698.10400 ± 801.327
2025-09-14 11:12:49,510 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4223.77), np.float32(4459.6587), np.float32(4042.8472), np.float32(2010.8911), np.float32(3905.7468), np.float32(3918.7197), np.float32(4062.4482), np.float32(2243.004), np.float32(4056.8564), np.float32(4057.0981)]
2025-09-14 11:12:49,510 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:12:49,514 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 7 minutes, 44 seconds)
2025-09-14 11:15:58,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:16:05,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3744.94678 ± 700.920
2025-09-14 11:16:05,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4232.4688), np.float32(4161.752), np.float32(4100.362), np.float32(1950.596), np.float32(4071.585), np.float32(3170.3452), np.float32(3866.9133), np.float32(3355.596), np.float32(4291.5796), np.float32(4248.2725)]
2025-09-14 11:16:05,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:16:05,931 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 3 hours, 1 minute, 45 seconds)
2025-09-14 11:19:13,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:19:21,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4026.56177 ± 297.903
2025-09-14 11:19:21,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4264.539), np.float32(4093.2063), np.float32(4451.973), np.float32(4264.4653), np.float32(4128.756), np.float32(3914.1904), np.float32(3727.1978), np.float32(3886.6355), np.float32(4168.661), np.float32(3365.9922)]
2025-09-14 11:19:21,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:19:21,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4026.56) for latency 6
2025-09-14 11:19:21,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 56 minutes, 31 seconds)
2025-09-14 11:22:29,561 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:22:37,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3968.55933 ± 179.064
2025-09-14 11:22:37,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4201.305), np.float32(4034.65), np.float32(4116.858), np.float32(3936.6638), np.float32(4301.807), np.float32(3715.4023), np.float32(3876.329), np.float32(3807.677), np.float32(3857.999), np.float32(3836.9011)]
2025-09-14 11:22:37,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:22:37,112 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 52 minutes, 56 seconds)
2025-09-14 11:25:44,139 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:25:51,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4061.18408 ± 257.640
2025-09-14 11:25:51,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4060.6426), np.float32(4322.282), np.float32(4403.73), np.float32(3412.0366), np.float32(3917.248), np.float32(4241.5063), np.float32(4063.4766), np.float32(4014.8418), np.float32(4093.1729), np.float32(4082.903)]
2025-09-14 11:25:51,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:25:51,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4061.18) for latency 6
2025-09-14 11:25:51,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 49 minutes, 38 seconds)
2025-09-14 11:29:00,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:29:08,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3849.55273 ± 769.712
2025-09-14 11:29:08,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4482.117), np.float32(3645.0166), np.float32(1621.3628), np.float32(4173.9136), np.float32(4016.1216), np.float32(3977.3232), np.float32(4223.7783), np.float32(4057.2983), np.float32(4202.11), np.float32(4096.485)]
2025-09-14 11:29:08,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:29:08,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 46 minutes, 22 seconds)
2025-09-14 11:32:14,905 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:32:22,421 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3485.31958 ± 1097.619
2025-09-14 11:32:22,421 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4084.573), np.float32(3810.6936), np.float32(1305.6265), np.float32(1323.14), np.float32(3937.0105), np.float32(3695.8135), np.float32(4187.646), np.float32(4151.705), np.float32(4107.896), np.float32(4249.092)]
2025-09-14 11:32:22,422 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:32:22,426 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 42 minutes, 44 seconds)
2025-09-14 11:35:31,817 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:35:39,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4005.75586 ± 157.954
2025-09-14 11:35:39,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4153.866), np.float32(4091.5344), np.float32(3839.967), np.float32(3748.545), np.float32(3954.9968), np.float32(3906.8284), np.float32(3941.4387), np.float32(4230.659), np.float32(3947.22), np.float32(4242.503)]
2025-09-14 11:35:39,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:35:39,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 39 minutes, 42 seconds)
2025-09-14 11:38:45,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:38:53,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3878.89185 ± 688.098
2025-09-14 11:38:53,363 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4218.0146), np.float32(4218.376), np.float32(4347.4883), np.float32(4030.2214), np.float32(1847.4524), np.float32(3984.574), np.float32(3968.1797), np.float32(3996.2449), np.float32(4172.71), np.float32(4005.6562)]
2025-09-14 11:38:53,363 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:38:53,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 36 minutes, 12 seconds)
2025-09-14 11:42:00,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:42:08,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3869.71802 ± 501.304
2025-09-14 11:42:08,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4170.8027), np.float32(4050.1367), np.float32(4145.322), np.float32(4366.265), np.float32(3974.5042), np.float32(3577.5928), np.float32(4171.9136), np.float32(3909.1914), np.float32(3831.6523), np.float32(2499.7952)]
2025-09-14 11:42:08,222 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:42:08,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 32 minutes, 59 seconds)
2025-09-14 11:45:17,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:45:24,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4098.27441 ± 240.853
2025-09-14 11:45:24,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4238.621), np.float32(4271.4917), np.float32(4203.4863), np.float32(3589.7993), np.float32(4268.7603), np.float32(4368.688), np.float32(3898.2717), np.float32(4298.939), np.float32(4012.9697), np.float32(3831.7202)]
2025-09-14 11:45:24,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:45:24,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4098.27) for latency 6
2025-09-14 11:45:24,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 29 minutes, 43 seconds)
2025-09-14 11:48:32,104 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:48:39,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4209.08691 ± 194.347
2025-09-14 11:48:39,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4434.91), np.float32(4330.9443), np.float32(4112.355), np.float32(4529.4326), np.float32(4175.3926), np.float32(3897.4014), np.float32(3919.7131), np.float32(4335.1587), np.float32(4179.767), np.float32(4175.795)]
2025-09-14 11:48:39,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:48:39,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4209.09) for latency 6
2025-09-14 11:48:39,764 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 26 minutes, 36 seconds)
2025-09-14 11:51:48,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:51:55,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4201.98682 ± 133.489
2025-09-14 11:51:55,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4166.8936), np.float32(4272.533), np.float32(4124.632), np.float32(4478.8813), np.float32(4035.8416), np.float32(4192.8516), np.float32(4109.321), np.float32(4209.335), np.float32(4052.5012), np.float32(4377.0767)]
2025-09-14 11:51:55,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:51:55,809 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 23 minutes, 12 seconds)
2025-09-14 11:55:02,684 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:55:10,025 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3828.10596 ± 441.381
2025-09-14 11:55:10,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4104.5986), np.float32(3993.2583), np.float32(4253.6177), np.float32(3416.2783), np.float32(3847.4626), np.float32(3924.2302), np.float32(2677.1558), np.float32(3981.7715), np.float32(4191.225), np.float32(3891.462)]
2025-09-14 11:55:10,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:55:10,031 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 19 minutes, 59 seconds)
2025-09-14 11:58:19,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 11:58:26,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3850.20630 ± 668.732
2025-09-14 11:58:26,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1976.5999), np.float32(3875.5984), np.float32(4010.255), np.float32(3937.1357), np.float32(4147.836), np.float32(4234.9043), np.float32(4294.607), np.float32(4320.719), np.float32(3483.1873), np.float32(4221.217)]
2025-09-14 11:58:26,909 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:58:26,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 17 minutes)
2025-09-14 12:01:33,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:01:40,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4160.20557 ± 170.123
2025-09-14 12:01:40,717 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4313.3984), np.float32(4235.217), np.float32(4254.51), np.float32(4227.62), np.float32(4008.3025), np.float32(3932.037), np.float32(4445.142), np.float32(4148.8203), np.float32(4177.0513), np.float32(3859.9573)]
2025-09-14 12:01:40,718 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:01:40,723 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 13 minutes, 23 seconds)
2025-09-14 12:04:50,408 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:04:58,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3749.52930 ± 816.883
2025-09-14 12:04:58,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4402.722), np.float32(3515.73), np.float32(4168.1987), np.float32(3547.8804), np.float32(4050.5552), np.float32(4239.3677), np.float32(4356.451), np.float32(3587.914), np.float32(1490.8534), np.float32(4135.622)]
2025-09-14 12:04:58,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:04:58,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 10 minutes, 25 seconds)
2025-09-14 12:08:06,536 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:08:13,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4082.77661 ± 158.605
2025-09-14 12:08:13,976 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4307.4873), np.float32(4102.016), np.float32(4292.3604), np.float32(4042.055), np.float32(4062.0977), np.float32(4028.9912), np.float32(3823.9688), np.float32(4193.763), np.float32(4152.345), np.float32(3822.6855)]
2025-09-14 12:08:13,977 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:08:13,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 7 minutes, 9 seconds)
2025-09-14 12:11:21,514 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:11:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3337.18555 ± 1179.751
2025-09-14 12:11:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4255.9155), np.float32(4339.9736), np.float32(1481.1864), np.float32(3639.0159), np.float32(1996.7643), np.float32(4125.2134), np.float32(1250.6238), np.float32(3974.0671), np.float32(4253.063), np.float32(4056.0361)]
2025-09-14 12:11:29,073 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:11:29,079 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 2 hours, 4 minutes)
2025-09-14 12:14:35,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:14:43,016 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4241.71973 ± 120.379
2025-09-14 12:14:43,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4147.8413), np.float32(4480.9517), np.float32(4374.2524), np.float32(4182.3), np.float32(4322.5493), np.float32(4022.9758), np.float32(4246.9683), np.float32(4227.2354), np.float32(4188.641), np.float32(4223.486)]
2025-09-14 12:14:43,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:14:43,017 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4241.72) for latency 6
2025-09-14 12:14:43,022 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 2 hours, 23 seconds)
2025-09-14 12:17:50,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:17:57,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4103.93848 ± 634.090
2025-09-14 12:17:57,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4482.449), np.float32(2280.2434), np.float32(4227.5854), np.float32(4627.7476), np.float32(4418.598), np.float32(3910.5195), np.float32(4344.342), np.float32(4195.8643), np.float32(4278.3564), np.float32(4273.6797)]
2025-09-14 12:17:57,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:17:57,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 57 minutes, 13 seconds)
2025-09-14 12:21:06,582 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:21:14,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4088.98633 ± 210.244
2025-09-14 12:21:14,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4224.9556), np.float32(3954.2852), np.float32(3879.814), np.float32(4180.2427), np.float32(4446.9004), np.float32(4233.7427), np.float32(3702.4097), np.float32(4270.0674), np.float32(3945.1995), np.float32(4052.2483)]
2025-09-14 12:21:14,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:21:14,229 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 53 minutes, 53 seconds)
2025-09-14 12:24:17,224 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:24:24,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3450.25244 ± 789.506
2025-09-14 12:24:24,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3686.8076), np.float32(3757.118), np.float32(2459.519), np.float32(2991.6868), np.float32(4201.8066), np.float32(1635.8108), np.float32(4145.7446), np.float32(3831.6077), np.float32(3956.6714), np.float32(3835.7502)]
2025-09-14 12:24:24,391 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:24:24,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 49 minutes, 58 seconds)
2025-09-14 12:27:24,442 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:27:31,628 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3575.22217 ± 1044.834
2025-09-14 12:27:31,628 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4123.7754), np.float32(4346.987), np.float32(4200.276), np.float32(4282.44), np.float32(3930.368), np.float32(4097.932), np.float32(1187.713), np.float32(3519.4863), np.float32(1928.0527), np.float32(4135.195)]
2025-09-14 12:27:31,628 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:27:31,633 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 45 minutes, 52 seconds)
2025-09-14 12:30:30,825 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:30:38,048 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3622.93433 ± 879.905
2025-09-14 12:30:38,048 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4136.8696), np.float32(4400.0493), np.float32(4287.706), np.float32(3959.36), np.float32(1413.6305), np.float32(3570.006), np.float32(4026.5806), np.float32(2586.7107), np.float32(4016.767), np.float32(3831.6653)]
2025-09-14 12:30:38,048 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:30:38,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 41 minutes, 52 seconds)
2025-09-14 12:33:38,969 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:33:46,189 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4084.91870 ± 252.912
2025-09-14 12:33:46,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3844.3784), np.float32(4088.155), np.float32(4298.389), np.float32(4071.756), np.float32(3935.1384), np.float32(4417.6406), np.float32(3535.7515), np.float32(4368.4463), np.float32(4061.8271), np.float32(4227.7026)]
2025-09-14 12:33:46,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:46,195 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 38 minutes, 1 second)
2025-09-14 12:36:48,168 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:36:55,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3991.42334 ± 762.409
2025-09-14 12:36:55,449 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4374.782), np.float32(4166.8013), np.float32(4058.6135), np.float32(4202.9756), np.float32(3983.9937), np.float32(4534.429), np.float32(4144.81), np.float32(1756.4719), np.float32(4242.6157), np.float32(4448.742)]
2025-09-14 12:36:55,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:55,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 34 minutes, 7 seconds)
2025-09-14 12:39:57,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:40:04,566 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4186.72803 ± 96.631
2025-09-14 12:40:04,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4134.421), np.float32(4262.4834), np.float32(3932.4065), np.float32(4209.574), np.float32(4182.8804), np.float32(4259.9785), np.float32(4233.954), np.float32(4138.7437), np.float32(4262.4673), np.float32(4250.37)]
2025-09-14 12:40:04,567 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:40:04,573 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 30 minutes, 53 seconds)
2025-09-14 12:43:03,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:43:11,103 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3962.26050 ± 714.265
2025-09-14 12:43:11,103 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3955.6248), np.float32(4266.7974), np.float32(4289.9897), np.float32(4271.7646), np.float32(4328.913), np.float32(4239.4604), np.float32(4061.999), np.float32(1849.3865), np.float32(4065.182), np.float32(4293.4897)]
2025-09-14 12:43:11,103 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:43:11,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 27 minutes, 41 seconds)
2025-09-14 12:46:11,477 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:46:18,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4322.89014 ± 166.602
2025-09-14 12:46:18,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4175.636), np.float32(4065.4841), np.float32(4399.243), np.float32(4128.8906), np.float32(4456.587), np.float32(4511.7534), np.float32(4450.1357), np.float32(4517.181), np.float32(4387.0024), np.float32(4136.987)]
2025-09-14 12:46:18,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:46:18,579 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4322.89) for latency 6
2025-09-14 12:46:18,584 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 24 minutes, 38 seconds)
2025-09-14 12:49:20,026 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:49:27,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4164.55176 ± 62.118
2025-09-14 12:49:27,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4046.3535), np.float32(4208.641), np.float32(4158.953), np.float32(4131.3203), np.float32(4240.081), np.float32(4110.3525), np.float32(4184.046), np.float32(4158.3716), np.float32(4135.107), np.float32(4272.293)]
2025-09-14 12:49:27,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:49:27,283 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 21 minutes, 33 seconds)
2025-09-14 12:52:29,388 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:52:36,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3751.59961 ± 984.237
2025-09-14 12:52:36,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4230.3574), np.float32(4324.5786), np.float32(4176.1157), np.float32(3993.0522), np.float32(2039.3208), np.float32(4284.3716), np.float32(1569.4957), np.float32(4207.9517), np.float32(4329.8154), np.float32(4360.936)]
2025-09-14 12:52:36,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:52:36,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 18 minutes, 25 seconds)
2025-09-14 12:55:37,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:55:44,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4256.71289 ± 151.309
2025-09-14 12:55:44,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4376.981), np.float32(4128.8877), np.float32(4283.3564), np.float32(4293.451), np.float32(4073.7495), np.float32(4263.946), np.float32(4229.1616), np.float32(4621.9707), np.float32(4196.087), np.float32(4099.5376)]
2025-09-14 12:55:44,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:55:44,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 15 minutes, 13 seconds)
2025-09-14 12:58:45,310 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 12:58:52,422 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4081.51636 ± 505.354
2025-09-14 12:58:52,422 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2878.6487), np.float32(3929.8223), np.float32(3699.643), np.float32(3807.109), np.float32(4417.107), np.float32(4432.0933), np.float32(4416.171), np.float32(4610.348), np.float32(4566.341), np.float32(4057.8845)]
2025-09-14 12:58:52,422 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:58:52,428 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 12 minutes, 10 seconds)
2025-09-14 13:01:52,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:02:00,176 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3951.16333 ± 777.363
2025-09-14 13:02:00,177 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4251.322), np.float32(3643.317), np.float32(4515.333), np.float32(4074.432), np.float32(4495.672), np.float32(4169.1597), np.float32(4513.697), np.float32(4337.8823), np.float32(1792.3632), np.float32(3718.4575)]
2025-09-14 13:02:00,177 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:00,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 9 minutes, 3 seconds)
2025-09-14 13:05:02,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:05:09,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4104.91162 ± 344.005
2025-09-14 13:05:09,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4091.707), np.float32(4304.111), np.float32(4283.187), np.float32(4295.9067), np.float32(4268.062), np.float32(4263.4424), np.float32(4260.1787), np.float32(4339.5625), np.float32(3205.1282), np.float32(3737.8308)]
2025-09-14 13:05:09,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:05:09,470 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 5 minutes, 57 seconds)
2025-09-14 13:08:09,690 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:08:17,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4160.34619 ± 232.529
2025-09-14 13:08:17,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4604.0386), np.float32(3975.2598), np.float32(3863.9512), np.float32(4371.189), np.float32(4363.218), np.float32(4106.5073), np.float32(4073.0168), np.float32(4092.8008), np.float32(3845.3875), np.float32(4308.0903)]
2025-09-14 13:08:17,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:08:17,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 2 minutes, 42 seconds)
2025-09-14 13:11:18,416 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:11:25,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4217.55957 ± 161.530
2025-09-14 13:11:25,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4025.8467), np.float32(4201.2993), np.float32(4458.513), np.float32(4186.183), np.float32(3889.8708), np.float32(4252.7505), np.float32(4400.4526), np.float32(4176.5977), np.float32(4362.81), np.float32(4221.272)]
2025-09-14 13:11:25,828 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:11:25,835 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 59 minutes, 35 seconds)
2025-09-14 13:14:26,938 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:14:34,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4296.67236 ± 115.928
2025-09-14 13:14:34,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4235.199), np.float32(4241.5073), np.float32(4143.8936), np.float32(4337.433), np.float32(4285.34), np.float32(4583.693), np.float32(4328.4995), np.float32(4374.518), np.float32(4199.502), np.float32(4237.138)]
2025-09-14 13:14:34,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:14:34,265 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 56 minutes, 30 seconds)
2025-09-14 13:17:33,995 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:17:41,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4344.28174 ± 157.505
2025-09-14 13:17:41,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4480.3955), np.float32(4324.538), np.float32(4343.1567), np.float32(4599.752), np.float32(4051.207), np.float32(4257.0947), np.float32(4362.189), np.float32(4540.2715), np.float32(4315.5493), np.float32(4168.6646)]
2025-09-14 13:17:41,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:17:41,132 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4344.28) for latency 6
2025-09-14 13:17:41,138 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 53 minutes, 19 seconds)
2025-09-14 13:20:41,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:20:48,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4068.77881 ± 737.838
2025-09-14 13:20:48,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4303.255), np.float32(1901.1638), np.float32(4109.619), np.float32(4545.321), np.float32(4344.3013), np.float32(4264.725), np.float32(4100.0913), np.float32(4185.783), np.float32(4377.3496), np.float32(4556.1807)]
2025-09-14 13:20:48,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:20:48,125 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 50 minutes, 3 seconds)
2025-09-14 13:23:49,469 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:23:56,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4137.03271 ± 428.264
2025-09-14 13:23:56,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4258.379), np.float32(4561.1777), np.float32(4359.6772), np.float32(3073.4907), np.float32(4355.488), np.float32(4172.186), np.float32(4647.8906), np.float32(4010.1355), np.float32(4158.583), np.float32(3773.3257)]
2025-09-14 13:23:56,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:23:56,807 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 46 minutes, 59 seconds)
2025-09-14 13:26:58,647 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:27:05,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4227.53516 ± 236.541
2025-09-14 13:27:05,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4517.6084), np.float32(4087.7542), np.float32(4115.1094), np.float32(4114.832), np.float32(4526.6924), np.float32(4266.748), np.float32(4406.873), np.float32(4311.2065), np.float32(3679.9458), np.float32(4248.5825)]
2025-09-14 13:27:05,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:27:05,989 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 43 minutes, 52 seconds)
2025-09-14 13:30:08,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:30:15,947 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4040.89893 ± 639.279
2025-09-14 13:30:15,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4289.2515), np.float32(2181.376), np.float32(3847.8223), np.float32(4236.0317), np.float32(4144.9204), np.float32(4222.153), np.float32(4404.487), np.float32(4312.8667), np.float32(4436.0835), np.float32(4334.0005)]
2025-09-14 13:30:15,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:30:15,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 40 minutes, 48 seconds)
2025-09-14 13:33:15,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:33:22,675 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4376.52002 ± 114.975
2025-09-14 13:33:22,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4398.7183), np.float32(4258.2993), np.float32(4524.4893), np.float32(4389.1772), np.float32(4249.0557), np.float32(4272.439), np.float32(4275.934), np.float32(4610.5405), np.float32(4353.625), np.float32(4432.921)]
2025-09-14 13:33:22,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:33:22,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (4376.52) for latency 6
2025-09-14 13:33:22,682 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 37 minutes, 39 seconds)
2025-09-14 13:36:24,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:36:31,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3969.44849 ± 597.746
2025-09-14 13:36:31,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2876.3843), np.float32(4428.218), np.float32(4398.8735), np.float32(3816.9548), np.float32(4203.88), np.float32(4151.162), np.float32(4351.0254), np.float32(4256.6), np.float32(2775.2065), np.float32(4436.18)]
2025-09-14 13:36:31,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:36:31,528 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 34 minutes, 35 seconds)
2025-09-14 13:39:35,309 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:39:42,688 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4214.23535 ± 147.365
2025-09-14 13:39:42,688 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3889.7173), np.float32(4112.785), np.float32(4176.224), np.float32(4374.315), np.float32(4376.66), np.float32(4362.543), np.float32(4319.194), np.float32(4250.4443), np.float32(4177.867), np.float32(4102.6)]
2025-09-14 13:39:42,688 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:39:42,695 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 31 minutes, 31 seconds)
2025-09-14 13:42:46,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:42:54,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4242.91895 ± 210.953
2025-09-14 13:42:54,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4394.916), np.float32(4427.2695), np.float32(4341.858), np.float32(4057.0242), np.float32(4141.098), np.float32(3708.5217), np.float32(4413.065), np.float32(4312.081), np.float32(4355.517), np.float32(4277.8423)]
2025-09-14 13:42:54,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:42:54,280 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 28 minutes, 26 seconds)
2025-09-14 13:45:56,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:46:04,190 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4302.58398 ± 168.280
2025-09-14 13:46:04,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4308.7964), np.float32(4074.9446), np.float32(4556.6587), np.float32(4222.0977), np.float32(4260.4404), np.float32(4149.7793), np.float32(4631.6064), np.float32(4150.3), np.float32(4363.337), np.float32(4307.884)]
2025-09-14 13:46:04,191 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:46:04,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 25 minutes, 17 seconds)
2025-09-14 13:49:07,221 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:49:14,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4254.57568 ± 175.845
2025-09-14 13:49:14,669 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4059.4836), np.float32(4124.693), np.float32(4006.7388), np.float32(4291.2393), np.float32(4155.3975), np.float32(4482.1816), np.float32(4142.3374), np.float32(4295.3633), np.float32(4514.337), np.float32(4473.986)]
2025-09-14 13:49:14,669 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:49:14,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 22 minutes, 12 seconds)
2025-09-14 13:52:16,580 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:52:23,865 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4191.65479 ± 509.876
2025-09-14 13:52:23,865 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4214.085), np.float32(4436.5576), np.float32(4782.9893), np.float32(4432.4004), np.float32(4361.8906), np.float32(2757.7593), np.float32(4192.7407), np.float32(4367.9106), np.float32(4272.955), np.float32(4097.26)]
2025-09-14 13:52:23,866 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:52:23,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 19 minutes, 2 seconds)
2025-09-14 13:55:26,811 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:55:34,077 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4310.13672 ± 124.323
2025-09-14 13:55:34,077 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4238.055), np.float32(4295.2197), np.float32(4561.7354), np.float32(4082.1729), np.float32(4323.0913), np.float32(4335.504), np.float32(4419.626), np.float32(4170.4443), np.float32(4342.004), np.float32(4333.51)]
2025-09-14 13:55:34,077 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:55:34,084 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 15 minutes, 51 seconds)
2025-09-14 13:58:35,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 13:58:42,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4016.34619 ± 807.495
2025-09-14 13:58:42,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4501.1323), np.float32(4212.5093), np.float32(4508.5444), np.float32(4531.118), np.float32(4195.4595), np.float32(1980.8428), np.float32(4335.4614), np.float32(2989.4578), np.float32(4564.3486), np.float32(4344.5854)]
2025-09-14 13:58:42,676 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:58:42,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 38 seconds)
2025-09-14 14:01:44,409 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 14:01:51,931 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4358.58105 ± 201.045
2025-09-14 14:01:51,932 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4256.6685), np.float32(4609.86), np.float32(4607.044), np.float32(4218.515), np.float32(4542.7275), np.float32(3950.371), np.float32(4434.3896), np.float32(4173.6475), np.float32(4450.9453), np.float32(4341.6426)]
2025-09-14 14:01:51,932 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:01:51,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 28 seconds)
2025-09-14 14:04:55,600 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 14:05:03,028 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4116.61621 ± 415.779
2025-09-14 14:05:03,029 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3874.186), np.float32(4100.4434), np.float32(4136.254), np.float32(3805.5251), np.float32(3141.436), np.float32(4527.329), np.float32(4539.675), np.float32(4554.8223), np.float32(4401.032), np.float32(4085.4568)]
2025-09-14 14:05:03,029 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:05:03,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 19 seconds)
2025-09-14 14:08:05,669 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 14:08:12,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4324.89453 ± 165.787
2025-09-14 14:08:12,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4031.9739), np.float32(4198.7925), np.float32(4331.4277), np.float32(4240.4214), np.float32(4120.2793), np.float32(4420.1763), np.float32(4484.174), np.float32(4465.916), np.float32(4585.0186), np.float32(4370.762)]
2025-09-14 14:08:12,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:08:12,995 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 3 minutes, 9 seconds)
2025-09-14 14:11:13,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-14 14:11:19,295 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 4363.94287 ± 102.070
2025-09-14 14:11:19,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4461.6934), np.float32(4242.6143), np.float32(4371.5586), np.float32(4556.257), np.float32(4234.9355), np.float32(4386.6094), np.float32(4419.206), np.float32(4214.5205), np.float32(4371.888), np.float32(4380.15)]
2025-09-14 14:11:19,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:11:19,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1251 [DEBUG]: Training session finished
