2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.025-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.025-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'12': <latency_env.delayed_mdp.ConstantDelay object at 0x7fd9a6fb66f0>}
2025-09-14 08:43:01,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,515 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,628 baseline-bpql-noisepromille25-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=89, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,628 baseline-bpql-noisepromille25-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,527 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,527 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:45:34,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:45:40,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -530.59967 ± 109.658
2025-09-14 08:45:40,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-465.32654), np.float32(-474.73785), np.float32(-608.67346), np.float32(-530.6343), np.float32(-450.64395), np.float32(-712.4242), np.float32(-731.436), np.float32(-434.06995), np.float32(-495.6954), np.float32(-402.35504)]
2025-09-14 08:45:40,785 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:45:40,785 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-530.60) for latency 12
2025-09-14 08:45:40,786 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 19 minutes, 28 seconds)
2025-09-14 08:48:13,810 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:48:20,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -251.19746 ± 46.235
2025-09-14 08:48:20,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-193.50836), np.float32(-255.11049), np.float32(-312.56134), np.float32(-254.67834), np.float32(-356.50192), np.float32(-220.54869), np.float32(-223.47322), np.float32(-226.88066), np.float32(-246.60364), np.float32(-222.10794)]
2025-09-14 08:48:20,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:48:20,804 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-251.20) for latency 12
2025-09-14 08:48:20,806 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 19 minutes, 6 seconds)
2025-09-14 08:51:02,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:51:09,841 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: -35.04467 ± 134.299
2025-09-14 08:51:09,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(10.691624), np.float32(-130.39122), np.float32(-79.05652), np.float32(-174.62468), np.float32(-152.09074), np.float32(50.98324), np.float32(281.34155), np.float32(76.16467), np.float32(-137.28036), np.float32(-96.184235)]
2025-09-14 08:51:09,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:51:09,842 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (-35.04) for latency 12
2025-09-14 08:51:09,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 22 minutes, 4 seconds)
2025-09-14 08:53:50,831 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:53:58,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 71.74571 ± 274.610
2025-09-14 08:53:58,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(338.90585), np.float32(-311.69855), np.float32(-169.02164), np.float32(401.8357), np.float32(59.083508), np.float32(107.31626), np.float32(581.1173), np.float32(-190.57866), np.float32(19.357883), np.float32(-118.860504)]
2025-09-14 08:53:58,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:53:58,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (71.75) for latency 12
2025-09-14 08:53:58,327 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 21 minutes, 55 seconds)
2025-09-14 08:56:46,253 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:56:55,259 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 98.26827 ± 404.181
2025-09-14 08:56:55,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(131.74959), np.float32(-580.9441), np.float32(-21.993525), np.float32(68.11259), np.float32(554.20483), np.float32(930.76324), np.float32(73.230934), np.float32(171.86436), np.float32(54.317062), np.float32(-398.62222)]
2025-09-14 08:56:55,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:56:55,260 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (98.27) for latency 12
2025-09-14 08:56:55,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 23 minutes, 22 seconds)
2025-09-14 09:00:07,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:00:17,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 152.88664 ± 283.236
2025-09-14 09:00:17,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(507.92566), np.float32(174.74414), np.float32(255.56586), np.float32(-591.0436), np.float32(279.75397), np.float32(311.8619), np.float32(165.1095), np.float32(41.10708), np.float32(26.602102), np.float32(357.24005)]
2025-09-14 09:00:17,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:00:17,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (152.89) for latency 12
2025-09-14 09:00:17,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 34 minutes, 34 seconds)
2025-09-14 09:03:32,433 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:03:42,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 402.96954 ± 406.032
2025-09-14 09:03:42,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(686.84283), np.float32(584.2523), np.float32(1249.1129), np.float32(251.31462), np.float32(277.87906), np.float32(360.90558), np.float32(419.0634), np.float32(-466.95285), np.float32(411.60757), np.float32(255.66982)]
2025-09-14 09:03:42,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:03:42,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (402.97) for latency 12
2025-09-14 09:03:42,133 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 45 minutes, 36 seconds)
2025-09-14 09:06:52,425 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:07:01,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 604.07019 ± 320.853
2025-09-14 09:07:01,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(587.5986), np.float32(530.4313), np.float32(876.4152), np.float32(676.5002), np.float32(-266.60223), np.float32(653.2963), np.float32(633.3707), np.float32(1016.50104), np.float32(616.1893), np.float32(717.00165)]
2025-09-14 09:07:01,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:07:01,768 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (604.07) for latency 12
2025-09-14 09:07:01,771 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 51 minutes, 55 seconds)
2025-09-14 09:10:10,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:10:19,802 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 906.24902 ± 129.092
2025-09-14 09:10:19,802 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(810.6701), np.float32(783.8948), np.float32(853.0794), np.float32(1030.8584), np.float32(980.92), np.float32(1200.6932), np.float32(820.4138), np.float32(794.5012), np.float32(970.3608), np.float32(817.0987)]
2025-09-14 09:10:19,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:10:19,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (906.25) for latency 12
2025-09-14 09:10:19,805 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 57 minutes, 42 seconds)
2025-09-14 09:13:28,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:13:37,918 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1313.36084 ± 365.168
2025-09-14 09:13:37,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1502.6569), np.float32(1484.3741), np.float32(1018.7306), np.float32(486.24817), np.float32(1444.074), np.float32(1551.0752), np.float32(1150.5123), np.float32(1892.9856), np.float32(1476.2786), np.float32(1126.6725)]
2025-09-14 09:13:37,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:13:37,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1313.36) for latency 12
2025-09-14 09:13:37,922 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 5 hours, 47 seconds)
2025-09-14 09:16:46,181 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:16:55,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1247.55139 ± 257.956
2025-09-14 09:16:55,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1002.97614), np.float32(1414.8961), np.float32(1937.7411), np.float32(1080.9431), np.float32(1142.325), np.float32(1084.5243), np.float32(1067.7498), np.float32(1295.8044), np.float32(1218.0133), np.float32(1230.5399)]
2025-09-14 09:16:55,485 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:16:55,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 56 minutes, 11 seconds)
2025-09-14 09:20:05,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:20:15,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1173.52197 ± 277.800
2025-09-14 09:20:15,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(692.0115), np.float32(1617.4886), np.float32(1325.7635), np.float32(1536.4703), np.float32(1188.4252), np.float32(757.52795), np.float32(1076.7363), np.float32(1156.2107), np.float32(1146.3866), np.float32(1238.1979)]
2025-09-14 09:20:15,346 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:20:15,349 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 51 minutes, 20 seconds)
2025-09-14 09:23:37,668 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:23:47,820 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1355.84900 ± 403.630
2025-09-14 09:23:47,820 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1530.9264), np.float32(1235.8533), np.float32(2148.2297), np.float32(1031.4885), np.float32(1173.6301), np.float32(1506.9193), np.float32(1436.215), np.float32(1723.0955), np.float32(1211.8691), np.float32(560.264)]
2025-09-14 09:23:47,820 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:47,820 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1355.85) for latency 12
2025-09-14 09:23:47,823 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 51 minutes, 45 seconds)
2025-09-14 09:27:09,900 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:27:19,940 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1575.60913 ± 293.574
2025-09-14 09:27:19,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1566.7109), np.float32(1467.4257), np.float32(1688.4823), np.float32(1910.7858), np.float32(1701.657), np.float32(1504.5792), np.float32(2050.6267), np.float32(1673.9801), np.float32(1161.2495), np.float32(1030.5934)]
2025-09-14 09:27:19,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:27:19,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1575.61) for latency 12
2025-09-14 09:27:19,944 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 52 minutes, 26 seconds)
2025-09-14 09:30:41,827 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:30:51,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1518.28699 ± 331.413
2025-09-14 09:30:51,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2019.8923), np.float32(1808.3242), np.float32(1161.2437), np.float32(1950.7877), np.float32(1109.7642), np.float32(1687.6022), np.float32(1557.5232), np.float32(1042.452), np.float32(1478.6066), np.float32(1366.6741)]
2025-09-14 09:30:51,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:51,990 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 52 minutes, 59 seconds)
2025-09-14 09:34:00,670 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:34:09,053 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1493.10242 ± 540.612
2025-09-14 09:34:09,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1333.5479), np.float32(474.41302), np.float32(2326.3313), np.float32(1222.7609), np.float32(1841.5723), np.float32(2391.4229), np.float32(1386.4781), np.float32(1220.8079), np.float32(1214.7474), np.float32(1518.943)]
2025-09-14 09:34:09,054 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:34:09,057 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 49 minutes, 23 seconds)
2025-09-14 09:36:55,040 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:37:02,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1320.98206 ± 414.561
2025-09-14 09:37:02,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(422.132), np.float32(1129.4645), np.float32(1169.1661), np.float32(1732.7595), np.float32(2012.8728), np.float32(1385.0728), np.float32(1643.6881), np.float32(1090.4402), np.float32(1455.4998), np.float32(1168.7234)]
2025-09-14 09:37:02,289 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:37:02,292 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 38 minutes, 35 seconds)
2025-09-14 09:39:35,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:39:42,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1558.13843 ± 336.218
2025-09-14 09:39:42,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1447.2538), np.float32(1795.8151), np.float32(1275.6622), np.float32(1474.8813), np.float32(2313.9448), np.float32(1104.7788), np.float32(1658.2642), np.float32(1162.0023), np.float32(1720.0026), np.float32(1628.7798)]
2025-09-14 09:39:42,322 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:39:42,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 20 minutes, 53 seconds)
2025-09-14 09:42:14,719 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:42:22,011 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1362.03296 ± 404.899
2025-09-14 09:42:22,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1369.2286), np.float32(1113.5752), np.float32(734.9677), np.float32(1714.24), np.float32(1114.7997), np.float32(1482.8265), np.float32(1117.6278), np.float32(1322.999), np.float32(1327.6995), np.float32(2322.3655)]
2025-09-14 09:42:22,012 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:42:22,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 3 minutes, 33 seconds)
2025-09-14 09:45:18,282 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:45:28,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1530.58020 ± 385.541
2025-09-14 09:45:28,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1834.5383), np.float32(1431.5219), np.float32(1278.3207), np.float32(2329.5398), np.float32(1134.3984), np.float32(1960.0087), np.float32(1450.5505), np.float32(1233.4401), np.float32(1612.2913), np.float32(1041.1923)]
2025-09-14 09:45:28,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:45:28,609 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 53 minutes, 45 seconds)
2025-09-14 09:48:53,275 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:49:03,624 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1310.46582 ± 493.643
2025-09-14 09:49:03,625 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1504.7596), np.float32(1517.0895), np.float32(1786.8156), np.float32(1269.8782), np.float32(1178.6489), np.float32(-33.458973), np.float32(1478.8086), np.float32(1421.8907), np.float32(1174.6559), np.float32(1805.5701)]
2025-09-14 09:49:03,625 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:49:03,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 55 minutes, 34 seconds)
2025-09-14 09:52:28,656 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:52:38,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1511.23804 ± 386.861
2025-09-14 09:52:38,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2051.4307), np.float32(722.1899), np.float32(1446.098), np.float32(1855.2803), np.float32(1813.7908), np.float32(1185.6759), np.float32(1378.2192), np.float32(1805.4443), np.float32(1695.1492), np.float32(1159.103)]
2025-09-14 09:52:38,961 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:52:38,964 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 4 hours, 3 minutes, 32 seconds)
2025-09-14 09:56:04,407 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:56:14,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1493.77710 ± 550.328
2025-09-14 09:56:14,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1859.8672), np.float32(1389.056), np.float32(1406.9592), np.float32(1293.3055), np.float32(1565.3088), np.float32(1930.6735), np.float32(221.3779), np.float32(1484.2216), np.float32(1301.3652), np.float32(2485.6365)]
2025-09-14 09:56:14,731 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:56:14,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 14 minutes, 43 seconds)
2025-09-14 09:59:40,507 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:59:50,783 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1827.89099 ± 504.481
2025-09-14 09:59:50,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1313.7886), np.float32(1604.7094), np.float32(1673.7844), np.float32(1372.6661), np.float32(1577.1721), np.float32(2816.0264), np.float32(1917.2322), np.float32(2146.1897), np.float32(2573.2808), np.float32(1284.061)]
2025-09-14 09:59:50,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:59:50,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (1827.89) for latency 12
2025-09-14 09:59:50,787 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 25 minutes, 41 seconds)
2025-09-14 10:03:16,223 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:03:26,598 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2038.00720 ± 531.986
2025-09-14 10:03:26,598 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1127.7972), np.float32(1461.842), np.float32(2462.8152), np.float32(1754.4916), np.float32(2375.9602), np.float32(1634.2721), np.float32(2639.9292), np.float32(2316.29), np.float32(2828.2368), np.float32(1778.4403)]
2025-09-14 10:03:26,598 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:03:26,599 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2038.01) for latency 12
2025-09-14 10:03:26,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 29 minutes, 29 seconds)
2025-09-14 10:06:51,878 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:07:02,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1907.01440 ± 342.625
2025-09-14 10:07:02,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1537.2666), np.float32(1624.8972), np.float32(2197.5186), np.float32(1763.4928), np.float32(2654.0024), np.float32(1899.9102), np.float32(1856.9985), np.float32(2273.681), np.float32(1683.5151), np.float32(1578.8639)]
2025-09-14 10:07:02,272 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:07:02,276 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 26 minutes, 3 seconds)
2025-09-14 10:10:27,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:10:37,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1603.85876 ± 356.347
2025-09-14 10:10:37,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1612.6415), np.float32(2410.8267), np.float32(1795.277), np.float32(1386.0006), np.float32(1227.2328), np.float32(1386.7522), np.float32(1633.7034), np.float32(1208.3619), np.float32(1983.602), np.float32(1394.1892)]
2025-09-14 10:10:37,606 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:10:37,610 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 22 minutes, 28 seconds)
2025-09-14 10:14:02,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:14:13,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1616.15015 ± 440.329
2025-09-14 10:14:13,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2500.6494), np.float32(1437.9222), np.float32(1383.9841), np.float32(1344.773), np.float32(1368.0115), np.float32(2432.97), np.float32(1656.0752), np.float32(1520.4043), np.float32(1229.3213), np.float32(1287.3905)]
2025-09-14 10:14:13,119 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:14:13,124 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 18 minutes, 48 seconds)
2025-09-14 10:17:37,926 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:17:48,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1773.47424 ± 465.966
2025-09-14 10:17:48,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1539.6501), np.float32(1538.6138), np.float32(1295.3107), np.float32(1872.2389), np.float32(2442.5874), np.float32(1822.8169), np.float32(2781.1719), np.float32(1234.2401), np.float32(1536.0939), np.float32(1672.018)]
2025-09-14 10:17:48,311 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:48,315 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 15 minutes)
2025-09-14 10:21:12,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:21:23,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1980.23560 ± 480.562
2025-09-14 10:21:23,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1888.1505), np.float32(2205.3333), np.float32(1907.2504), np.float32(1221.1666), np.float32(1196.0992), np.float32(2557.9353), np.float32(2297.2893), np.float32(2405.1445), np.float32(1600.6406), np.float32(2523.3457)]
2025-09-14 10:21:23,274 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:21:23,278 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 11 minutes, 13 seconds)
2025-09-14 10:24:48,115 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:24:58,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2106.85815 ± 374.906
2025-09-14 10:24:58,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2176.3647), np.float32(2290.0625), np.float32(1663.95), np.float32(2423.1125), np.float32(1528.9016), np.float32(2440.8706), np.float32(1697.9373), np.float32(2574.5186), np.float32(2481.1753), np.float32(1791.6874)]
2025-09-14 10:24:58,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:24:58,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2106.86) for latency 12
2025-09-14 10:24:58,400 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 4 hours, 7 minutes, 30 seconds)
2025-09-14 10:28:22,777 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:28:33,070 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1674.72461 ± 445.085
2025-09-14 10:28:33,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2607.8086), np.float32(1565.8458), np.float32(1409.2545), np.float32(1489.6406), np.float32(1225.594), np.float32(2105.9438), np.float32(2237.6829), np.float32(1306.4521), np.float32(1431.7118), np.float32(1367.3109)]
2025-09-14 10:28:33,071 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:28:33,075 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 4 hours, 3 minutes, 46 seconds)
2025-09-14 10:31:56,496 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:32:06,843 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1804.94995 ± 329.281
2025-09-14 10:32:06,843 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1452.6669), np.float32(1482.1001), np.float32(1788.3461), np.float32(2431.6753), np.float32(1601.6874), np.float32(1793.269), np.float32(2335.53), np.float32(1963.4852), np.float32(1475.7517), np.float32(1724.9889)]
2025-09-14 10:32:06,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:32:06,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 59 minutes, 47 seconds)
2025-09-14 10:35:31,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:35:41,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1771.31152 ± 453.360
2025-09-14 10:35:41,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1711.896), np.float32(1940.2222), np.float32(2079.096), np.float32(1472.9177), np.float32(1374.9939), np.float32(1276.9473), np.float32(2080.3535), np.float32(2797.6165), np.float32(1744.9324), np.float32(1234.1401)]
2025-09-14 10:35:41,983 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:35:41,987 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 56 minutes, 12 seconds)
2025-09-14 10:39:07,226 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:39:17,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2043.90955 ± 557.847
2025-09-14 10:39:17,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1644.9518), np.float32(2965.6636), np.float32(1598.1283), np.float32(1241.8574), np.float32(1901.1538), np.float32(2316.685), np.float32(1743.0558), np.float32(2645.3977), np.float32(2770.1965), np.float32(1612.0034)]
2025-09-14 10:39:17,576 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:39:17,580 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 52 minutes, 45 seconds)
2025-09-14 10:42:41,746 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:42:52,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1804.88416 ± 336.513
2025-09-14 10:42:52,165 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1728.9429), np.float32(1786.2738), np.float32(1802.0364), np.float32(1307.2836), np.float32(2038.5735), np.float32(2100.0151), np.float32(1594.3701), np.float32(1549.7091), np.float32(1577.8181), np.float32(2563.8186)]
2025-09-14 10:42:52,165 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:42:52,169 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 49 minutes, 4 seconds)
2025-09-14 10:46:15,811 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:46:26,108 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1996.01147 ± 599.094
2025-09-14 10:46:26,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2964.8506), np.float32(1679.269), np.float32(1446.1425), np.float32(3164.7224), np.float32(1216.1472), np.float32(1825.2543), np.float32(2211.788), np.float32(1681.2622), np.float32(1692.4056), np.float32(2078.2727)]
2025-09-14 10:46:26,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:46:26,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 45 minutes, 20 seconds)
2025-09-14 10:49:49,257 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:49:59,534 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1810.80310 ± 349.021
2025-09-14 10:49:59,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1250.896), np.float32(1505.3265), np.float32(2348.976), np.float32(2212.5789), np.float32(1963.2975), np.float32(2112.5918), np.float32(1479.2646), np.float32(1896.8232), np.float32(1879.5037), np.float32(1458.7736)]
2025-09-14 10:49:59,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:49:59,541 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 41 minutes, 41 seconds)
2025-09-14 10:53:20,484 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:53:30,556 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1580.69897 ± 760.722
2025-09-14 10:53:30,556 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2542.8755), np.float32(1857.9126), np.float32(1152.3055), np.float32(1355.6948), np.float32(1425.388), np.float32(212.75371), np.float32(1272.1647), np.float32(1621.3068), np.float32(3140.2979), np.float32(1226.2887)]
2025-09-14 10:53:30,556 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:53:30,561 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 37 minutes, 16 seconds)
2025-09-14 10:56:48,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:56:58,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1837.13159 ± 417.867
2025-09-14 10:56:58,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1629.8625), np.float32(1649.0706), np.float32(2731.5981), np.float32(1910.9617), np.float32(1473.856), np.float32(1229.7913), np.float32(1587.0034), np.float32(1757.4266), np.float32(2319.0186), np.float32(2082.7275)]
2025-09-14 10:56:58,502 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:56:58,507 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 32 minutes, 11 seconds)
2025-09-14 11:00:13,864 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:00:23,139 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1969.29041 ± 551.518
2025-09-14 11:00:23,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1309.6143), np.float32(1706.6844), np.float32(2231.3562), np.float32(3097.0027), np.float32(2440.101), np.float32(1445.7897), np.float32(1808.4874), np.float32(2438.0793), np.float32(1295.9479), np.float32(1919.8412)]
2025-09-14 11:00:23,140 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:00:23,144 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 26 minutes, 41 seconds)
2025-09-14 11:03:26,938 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:03:36,150 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1725.39099 ± 656.818
2025-09-14 11:03:36,150 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(546.6929), np.float32(3170.112), np.float32(1367.2059), np.float32(1266.9929), np.float32(2058.9888), np.float32(1545.5985), np.float32(1582.3291), np.float32(2303.54), np.float32(1711.1863), np.float32(1701.2617)]
2025-09-14 11:03:36,150 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:03:36,154 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 19 minutes, 8 seconds)
2025-09-14 11:06:39,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:06:47,789 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2049.53809 ± 558.675
2025-09-14 11:06:47,789 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1544.7673), np.float32(3142.9429), np.float32(2026.274), np.float32(2624.306), np.float32(1968.7141), np.float32(1664.8286), np.float32(1534.9998), np.float32(1342.2336), np.float32(1955.2938), np.float32(2691.0215)]
2025-09-14 11:06:47,789 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:06:47,793 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 11 minutes, 34 seconds)
2025-09-14 11:09:36,225 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:09:43,518 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1908.62427 ± 597.382
2025-09-14 11:09:43,518 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1607.9329), np.float32(1313.2845), np.float32(1516.5293), np.float32(2707.6328), np.float32(2492.7024), np.float32(1306.3774), np.float32(1524.8264), np.float32(1695.0997), np.float32(1818.734), np.float32(3103.1233)]
2025-09-14 11:09:43,518 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:09:43,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 1 minute, 37 seconds)
2025-09-14 11:12:13,703 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:12:20,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2295.85229 ± 769.118
2025-09-14 11:12:20,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2248.6582), np.float32(1672.9115), np.float32(2032.9528), np.float32(1387.6068), np.float32(1750.5303), np.float32(2331.1787), np.float32(3273.312), np.float32(3430.5725), np.float32(1400.2909), np.float32(3430.5098)]
2025-09-14 11:12:20,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:12:20,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2295.85) for latency 12
2025-09-14 11:12:20,970 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 49 minutes, 7 seconds)
2025-09-14 11:14:51,298 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:14:58,574 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2114.75903 ± 823.585
2025-09-14 11:14:58,574 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3636.3743), np.float32(1442.0856), np.float32(1470.4889), np.float32(1359.0864), np.float32(1315.0156), np.float32(3203.3718), np.float32(1822.6606), np.float32(2561.0522), np.float32(2844.747), np.float32(1492.7072)]
2025-09-14 11:14:58,574 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:14:58,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 37 minutes, 34 seconds)
2025-09-14 11:17:29,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:17:36,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2052.24438 ± 635.566
2025-09-14 11:17:36,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2304.771), np.float32(1270.428), np.float32(1347.0687), np.float32(1494.9473), np.float32(1581.7234), np.float32(2600.2605), np.float32(3021.95), np.float32(2529.5537), np.float32(1541.2758), np.float32(2830.4656)]
2025-09-14 11:17:36,792 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:17:36,796 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 28 minutes, 30 seconds)
2025-09-14 11:20:07,560 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:20:14,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 1935.23083 ± 521.029
2025-09-14 11:20:14,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1545.9891), np.float32(1645.3027), np.float32(1454.3671), np.float32(2478.57), np.float32(2290.2214), np.float32(1671.0793), np.float32(2678.1), np.float32(2750.3376), np.float32(1355.8733), np.float32(1482.4692)]
2025-09-14 11:20:14,875 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:20:14,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 19 minutes, 53 seconds)
2025-09-14 11:22:45,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:22:52,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2501.98120 ± 770.657
2025-09-14 11:22:52,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3356.974), np.float32(1398.4187), np.float32(3002.5718), np.float32(1173.2938), np.float32(1682.9724), np.float32(2735.5474), np.float32(3335.5942), np.float32(2316.6018), np.float32(3062.2312), np.float32(2955.6091)]
2025-09-14 11:22:52,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:22:52,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2501.98) for latency 12
2025-09-14 11:22:52,641 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 14 minutes, 9 seconds)
2025-09-14 11:25:23,257 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:25:30,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2579.96509 ± 684.238
2025-09-14 11:25:30,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2466.8357), np.float32(3276.6125), np.float32(1615.817), np.float32(2337.6865), np.float32(3571.316), np.float32(2228.762), np.float32(2686.2961), np.float32(1403.4521), np.float32(3370.5034), np.float32(2842.369)]
2025-09-14 11:25:30,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:25:30,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (2579.97) for latency 12
2025-09-14 11:25:30,709 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 11 minutes, 37 seconds)
2025-09-14 11:28:01,320 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:28:08,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3080.45630 ± 666.704
2025-09-14 11:28:08,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3521.217), np.float32(3381.2625), np.float32(3194.822), np.float32(1506.0162), np.float32(3069.8486), np.float32(3644.1248), np.float32(3474.961), np.float32(3730.3132), np.float32(3080.436), np.float32(2201.5603)]
2025-09-14 11:28:08,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:28:08,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (3080.46) for latency 12
2025-09-14 11:28:08,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 9 minutes, 3 seconds)
2025-09-14 11:30:39,366 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:30:46,707 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2457.34814 ± 628.078
2025-09-14 11:30:46,707 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1546.5237), np.float32(1823.426), np.float32(1608.3099), np.float32(2332.038), np.float32(2856.4482), np.float32(3379.43), np.float32(2306.949), np.float32(3373.881), np.float32(2597.691), np.float32(2748.7827)]
2025-09-14 11:30:46,707 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:30:46,712 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 6 minutes, 23 seconds)
2025-09-14 11:33:16,923 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:33:24,371 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2463.81030 ± 639.962
2025-09-14 11:33:24,371 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1819.8052), np.float32(2381.1077), np.float32(3190.8655), np.float32(2951.4539), np.float32(2018.7616), np.float32(1490.3074), np.float32(2921.6228), np.float32(3579.4185), np.float32(2367.0376), np.float32(1917.724)]
2025-09-14 11:33:24,371 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:33:24,376 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 3 minutes, 41 seconds)
2025-09-14 11:35:54,327 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:36:01,651 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3510.51318 ± 542.965
2025-09-14 11:36:01,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3415.5735), np.float32(3993.2522), np.float32(3990.0776), np.float32(3888.2407), np.float32(3241.6006), np.float32(3596.4187), np.float32(3152.1323), np.float32(3937.2075), np.float32(3752.873), np.float32(2137.7578)]
2025-09-14 11:36:01,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:36:01,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (3510.51) for latency 12
2025-09-14 11:36:01,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 58 seconds)
2025-09-14 11:38:31,808 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:38:39,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3179.68115 ± 732.986
2025-09-14 11:38:39,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3831.6582), np.float32(3556.0293), np.float32(3119.9504), np.float32(4078.2297), np.float32(3456.391), np.float32(3501.8564), np.float32(2951.953), np.float32(2142.336), np.float32(3570.5537), np.float32(1587.8529)]
2025-09-14 11:38:39,142 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:38:39,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 58 minutes, 15 seconds)
2025-09-14 11:41:09,570 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:41:16,914 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3479.35938 ± 658.462
2025-09-14 11:41:16,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3652.0342), np.float32(3716.5708), np.float32(3912.2356), np.float32(3462.2986), np.float32(3735.3232), np.float32(3747.2195), np.float32(3268.483), np.float32(4013.2065), np.float32(3691.112), np.float32(1595.1091)]
2025-09-14 11:41:16,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:41:16,919 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 55 minutes, 35 seconds)
2025-09-14 11:43:46,970 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:43:54,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 2981.82275 ± 785.518
2025-09-14 11:43:54,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3753.3848), np.float32(3618.1428), np.float32(3486.5693), np.float32(2878.3547), np.float32(3699.0176), np.float32(3024.4724), np.float32(1677.6494), np.float32(3007.014), np.float32(3299.861), np.float32(1373.7604)]
2025-09-14 11:43:54,286 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:43:54,291 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 52 minutes, 53 seconds)
2025-09-14 11:46:24,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:46:31,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3746.34961 ± 260.093
2025-09-14 11:46:31,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4004.596), np.float32(3593.8523), np.float32(3980.1914), np.float32(4073.47), np.float32(3289.2407), np.float32(3465.3384), np.float32(3528.856), np.float32(3770.6147), np.float32(3714.3123), np.float32(4043.022)]
2025-09-14 11:46:31,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:46:31,968 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (3746.35) for latency 12
2025-09-14 11:46:31,973 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 50 minutes, 15 seconds)
2025-09-14 11:49:01,981 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:49:09,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3323.49536 ± 770.798
2025-09-14 11:49:09,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2546.119), np.float32(1866.1934), np.float32(2638.1982), np.float32(3993.3552), np.float32(2906.9888), np.float32(4421.162), np.float32(3525.1233), np.float32(3375.811), np.float32(4046.3552), np.float32(3915.643)]
2025-09-14 11:49:09,301 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:49:09,306 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 47 minutes, 38 seconds)
2025-09-14 11:51:39,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:51:47,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3674.15479 ± 1158.840
2025-09-14 11:51:47,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4657.49), np.float32(1877.0134), np.float32(4143.3), np.float32(4751.324), np.float32(4024.3098), np.float32(4218.007), np.float32(4655.754), np.float32(2260.0222), np.float32(4447.8477), np.float32(1706.4757)]
2025-09-14 11:51:47,262 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:51:47,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 45 minutes, 4 seconds)
2025-09-14 11:54:17,790 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:54:25,136 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3692.16602 ± 818.383
2025-09-14 11:54:25,136 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2333.2), np.float32(3805.8955), np.float32(4238.7427), np.float32(4025.166), np.float32(4636.6157), np.float32(4191.0312), np.float32(4434.8433), np.float32(2048.4307), np.float32(3699.44), np.float32(3508.2957)]
2025-09-14 11:54:25,136 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:54:25,141 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 42 minutes, 28 seconds)
2025-09-14 11:56:55,604 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:57:02,949 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3306.12354 ± 986.780
2025-09-14 11:57:02,969 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2110.7727), np.float32(4255.8354), np.float32(2975.076), np.float32(4586.3623), np.float32(3949.3167), np.float32(2584.7534), np.float32(1902.5148), np.float32(2264.2773), np.float32(4138.847), np.float32(4293.477)]
2025-09-14 11:57:02,969 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:57:02,974 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 39 minutes, 53 seconds)
2025-09-14 11:59:33,577 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:59:40,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3556.70850 ± 844.245
2025-09-14 11:59:40,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1738.9491), np.float32(4523.811), np.float32(3015.6843), np.float32(3740.114), np.float32(4486.755), np.float32(4191.0933), np.float32(3320.3838), np.float32(4106.9507), np.float32(2625.4072), np.float32(3817.9375)]
2025-09-14 11:59:40,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:59:40,920 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 37 minutes, 18 seconds)
2025-09-14 12:02:11,531 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:02:18,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 3083.64185 ± 1263.357
2025-09-14 12:02:18,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3540.3845), np.float32(4540.4805), np.float32(4578.267), np.float32(1460.4412), np.float32(1371.1055), np.float32(2224.7996), np.float32(1808.4818), np.float32(4405.0635), np.float32(2603.8213), np.float32(4303.5737)]
2025-09-14 12:02:18,891 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:02:18,896 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 34 minutes, 45 seconds)
2025-09-14 12:04:49,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:04:56,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4006.14209 ± 482.160
2025-09-14 12:04:56,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4688.5757), np.float32(3433.2925), np.float32(4053.9583), np.float32(4552.177), np.float32(3614.699), np.float32(3289.222), np.float32(3776.753), np.float32(4578.343), np.float32(4338.2905), np.float32(3736.1125)]
2025-09-14 12:04:56,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:04:56,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4006.14) for latency 12
2025-09-14 12:04:56,991 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 32 minutes, 8 seconds)
2025-09-14 12:07:27,487 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:07:34,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4406.62500 ± 1123.058
2025-09-14 12:07:34,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4916.5513), np.float32(3626.404), np.float32(4519.943), np.float32(4834.497), np.float32(1267.0272), np.float32(5138.934), np.float32(4849.1035), np.float32(4958.2188), np.float32(5009.981), np.float32(4945.592)]
2025-09-14 12:07:34,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:07:34,955 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4406.62) for latency 12
2025-09-14 12:07:34,960 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 29 minutes, 30 seconds)
2025-09-14 12:10:05,445 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:10:12,808 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4997.64697 ± 149.172
2025-09-14 12:10:12,808 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4640.242), np.float32(4997.046), np.float32(5112.7354), np.float32(5064.887), np.float32(5085.0337), np.float32(5021.109), np.float32(5157.7383), np.float32(5086.4346), np.float32(4809.7534), np.float32(5001.489)]
2025-09-14 12:10:12,808 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:10:12,808 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (4997.65) for latency 12
2025-09-14 12:10:12,813 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 26 minutes, 52 seconds)
2025-09-14 12:12:42,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:12:50,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4829.74561 ± 158.497
2025-09-14 12:12:50,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4977.306), np.float32(4846.4946), np.float32(4999.7905), np.float32(4566.718), np.float32(4705.8887), np.float32(4999.0894), np.float32(4803.4785), np.float32(4957.5693), np.float32(4872.5015), np.float32(4568.6226)]
2025-09-14 12:12:50,234 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:12:50,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 24 minutes, 11 seconds)
2025-09-14 12:15:20,784 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:15:28,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4962.80615 ± 194.833
2025-09-14 12:15:28,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5028.9727), np.float32(4522.501), np.float32(5202.38), np.float32(4893.5405), np.float32(5105.391), np.float32(5093.294), np.float32(4954.743), np.float32(4715.5093), np.float32(5096.248), np.float32(5015.4844)]
2025-09-14 12:15:28,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:15:28,257 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 21 minutes, 34 seconds)
2025-09-14 12:17:58,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:18:05,618 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4795.71191 ± 417.802
2025-09-14 12:18:05,619 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5064.9746), np.float32(3623.219), np.float32(5015.44), np.float32(4882.743), np.float32(4746.3125), np.float32(5157.035), np.float32(4991.077), np.float32(4749.2437), np.float32(4690.385), np.float32(5036.688)]
2025-09-14 12:18:05,619 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:18:05,624 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 18 minutes, 51 seconds)
2025-09-14 12:20:35,592 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:20:42,958 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4404.06543 ± 284.920
2025-09-14 12:20:42,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4542.7964), np.float32(4592.5737), np.float32(3875.333), np.float32(4621.1074), np.float32(4559.9907), np.float32(4322.405), np.float32(4543.8335), np.float32(3836.234), np.float32(4563.4233), np.float32(4582.957)]
2025-09-14 12:20:42,959 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:20:42,964 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 16 minutes, 10 seconds)
2025-09-14 12:23:13,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:23:20,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4936.88281 ± 224.984
2025-09-14 12:23:20,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4364.4863), np.float32(5148.801), np.float32(4932.129), np.float32(5104.1533), np.float32(4893.0747), np.float32(4748.8584), np.float32(5134.9946), np.float32(5050.884), np.float32(4927.9395), np.float32(5063.51)]
2025-09-14 12:23:20,765 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:23:20,771 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 13 minutes, 32 seconds)
2025-09-14 12:25:50,969 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:25:58,438 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5002.32227 ± 126.541
2025-09-14 12:25:58,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4980.4897), np.float32(5220.518), np.float32(4878.1367), np.float32(4949.422), np.float32(4840.1587), np.float32(4985.515), np.float32(4844.498), np.float32(5171.0806), np.float32(5114.042), np.float32(5039.358)]
2025-09-14 12:25:58,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:25:58,439 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5002.32) for latency 12
2025-09-14 12:25:58,444 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 10 minutes, 56 seconds)
2025-09-14 12:28:29,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:28:36,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5223.26953 ± 102.360
2025-09-14 12:28:36,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5349.6685), np.float32(5326.942), np.float32(5222.006), np.float32(5270.4136), np.float32(4983.6147), np.float32(5302.931), np.float32(5247.128), np.float32(5119.162), np.float32(5206.3555), np.float32(5204.472)]
2025-09-14 12:28:36,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:28:36,523 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5223.27) for latency 12
2025-09-14 12:28:36,528 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 8 minutes, 19 seconds)
2025-09-14 12:31:07,355 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:31:14,639 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4613.87695 ± 960.873
2025-09-14 12:31:14,639 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3972.0159), np.float32(4194.1377), np.float32(5257.436), np.float32(5170.73), np.float32(5233.372), np.float32(2041.9227), np.float32(5299.643), np.float32(4885.897), np.float32(5116.358), np.float32(4967.2524)]
2025-09-14 12:31:14,639 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:31:14,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 5 minutes, 45 seconds)
2025-09-14 12:33:45,489 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:33:52,740 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4828.81885 ± 541.903
2025-09-14 12:33:52,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3273.9746), np.float32(5179.6343), np.float32(5089.174), np.float32(4837.236), np.float32(4971.0283), np.float32(4897.598), np.float32(5082.661), np.float32(5241.0776), np.float32(4674.76), np.float32(5041.0474)]
2025-09-14 12:33:52,741 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:52,746 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 3 minutes, 10 seconds)
2025-09-14 12:36:23,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:36:30,459 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5235.79395 ± 145.116
2025-09-14 12:36:30,459 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5392.5234), np.float32(4893.6934), np.float32(5221.8027), np.float32(5372.238), np.float32(5192.961), np.float32(5362.3384), np.float32(5206.7095), np.float32(5113.7295), np.float32(5364.554), np.float32(5237.3877)]
2025-09-14 12:36:30,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:30,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5235.79) for latency 12
2025-09-14 12:36:30,465 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 32 seconds)
2025-09-14 12:39:00,800 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:39:08,081 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5080.51074 ± 301.796
2025-09-14 12:39:08,081 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5267.1855), np.float32(5393.7827), np.float32(5194.845), np.float32(5206.934), np.float32(5120.8843), np.float32(4730.231), np.float32(5313.6245), np.float32(4414.129), np.float32(5332.7715), np.float32(4830.7153)]
2025-09-14 12:39:08,081 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:08,087 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 57 minutes, 54 seconds)
2025-09-14 12:41:38,710 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:41:45,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5198.71777 ± 162.369
2025-09-14 12:41:45,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4896.0024), np.float32(5125.909), np.float32(5207.6104), np.float32(5330.26), np.float32(5251.255), np.float32(4905.0356), np.float32(5287.6265), np.float32(5360.525), np.float32(5291.5566), np.float32(5331.395)]
2025-09-14 12:41:45,966 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:41:45,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 55 minutes, 15 seconds)
2025-09-14 12:44:16,886 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:44:24,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5105.05957 ± 229.733
2025-09-14 12:44:24,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5030.1816), np.float32(5296.615), np.float32(5149.566), np.float32(5234.8696), np.float32(5142.9097), np.float32(5268.565), np.float32(5248.9688), np.float32(4570.2393), np.float32(4802.7583), np.float32(5305.9214)]
2025-09-14 12:44:24,107 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:44:24,113 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 52 minutes, 37 seconds)
2025-09-14 12:46:55,003 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:47:02,246 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4814.65088 ± 935.245
2025-09-14 12:47:02,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4889.3735), np.float32(5245.9985), np.float32(5173.836), np.float32(5294.2065), np.float32(2043.162), np.float32(5025.3877), np.float32(5271.6914), np.float32(4853.364), np.float32(5196.153), np.float32(5153.338)]
2025-09-14 12:47:02,247 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:47:02,252 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 50 minutes)
2025-09-14 12:49:33,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:49:40,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4714.05762 ± 1161.119
2025-09-14 12:49:40,319 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5238.311), np.float32(1307.9856), np.float32(5237.174), np.float32(5230.7803), np.float32(4925.0156), np.float32(5303.09), np.float32(5216.604), np.float32(4893.243), np.float32(4491.6074), np.float32(5296.7637)]
2025-09-14 12:49:40,320 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:49:40,325 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 47 minutes, 23 seconds)
2025-09-14 12:52:10,490 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:52:17,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4638.59717 ± 862.877
2025-09-14 12:52:17,848 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4519.2573), np.float32(3182.8982), np.float32(5081.276), np.float32(5298.9946), np.float32(5271.2314), np.float32(5000.1846), np.float32(5191.219), np.float32(5235.511), np.float32(4824.9707), np.float32(2780.4343)]
2025-09-14 12:52:17,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:52:17,854 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 44 minutes, 45 seconds)
2025-09-14 12:54:48,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:54:55,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5073.00879 ± 60.107
2025-09-14 12:54:55,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5040.432), np.float32(5122.407), np.float32(5096.9146), np.float32(5116.437), np.float32(5019.3657), np.float32(4953.4263), np.float32(5074.0195), np.float32(5120.149), np.float32(5163.9336), np.float32(5022.999)]
2025-09-14 12:54:55,729 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:54:55,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 42 minutes, 7 seconds)
2025-09-14 12:57:26,018 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:57:33,359 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4694.36182 ± 916.285
2025-09-14 12:57:33,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5136.7773), np.float32(5184.537), np.float32(5131.4614), np.float32(4937.246), np.float32(3549.3867), np.float32(5176.589), np.float32(5127.197), np.float32(5133.3774), np.float32(5219.9927), np.float32(2347.0535)]
2025-09-14 12:57:33,360 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:57:33,365 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 39 minutes, 27 seconds)
2025-09-14 13:00:03,714 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:00:11,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4802.51758 ± 1117.906
2025-09-14 13:00:11,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2590.319), np.float32(5426.603), np.float32(5459.3843), np.float32(2550.7458), np.float32(5339.678), np.float32(5328.71), np.float32(5207.154), np.float32(5387.4634), np.float32(5399.3525), np.float32(5335.766)]
2025-09-14 13:00:11,158 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:00:11,164 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 36 minutes, 48 seconds)
2025-09-14 13:02:41,578 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:02:49,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4715.62012 ± 1029.762
2025-09-14 13:02:49,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5111.382), np.float32(5244.885), np.float32(5127.4995), np.float32(3406.2786), np.float32(5275.0625), np.float32(5304.714), np.float32(2097.5989), np.float32(4978.2), np.float32(5255.759), np.float32(5354.8286)]
2025-09-14 13:02:49,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:49,065 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 34 minutes, 10 seconds)
2025-09-14 13:05:19,450 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:05:26,843 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4908.06543 ± 891.499
2025-09-14 13:05:26,843 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5308.2783), np.float32(4622.099), np.float32(2301.8481), np.float32(5291.239), np.float32(5377.205), np.float32(5279.3506), np.float32(5191.87), np.float32(5182.903), np.float32(5244.389), np.float32(5281.4697)]
2025-09-14 13:05:26,844 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:05:26,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 31 minutes, 33 seconds)
2025-09-14 13:07:57,494 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:08:04,981 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5210.02393 ± 139.690
2025-09-14 13:08:04,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5293.0303), np.float32(5241.57), np.float32(5315.234), np.float32(5364.959), np.float32(5112.8765), np.float32(5266.0967), np.float32(4903.8037), np.float32(5032.6426), np.float32(5245.587), np.float32(5324.4424)]
2025-09-14 13:08:04,982 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:08:04,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 28 minutes, 56 seconds)
2025-09-14 13:10:35,882 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:10:43,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5198.35498 ± 93.224
2025-09-14 13:10:43,263 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4989.9194), np.float32(5124.8013), np.float32(5159.8174), np.float32(5271.476), np.float32(5270.457), np.float32(5160.3994), np.float32(5159.64), np.float32(5246.599), np.float32(5299.399), np.float32(5301.044)]
2025-09-14 13:10:43,264 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:10:43,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 26 minutes, 19 seconds)
2025-09-14 13:13:13,850 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:13:21,250 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5196.84082 ± 154.548
2025-09-14 13:13:21,250 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5384.6147), np.float32(5371.059), np.float32(5347.3096), np.float32(5150.42), np.float32(4946.408), np.float32(5102.1973), np.float32(5362.4346), np.float32(4994.74), np.float32(5104.9214), np.float32(5204.3)]
2025-09-14 13:13:21,250 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:13:21,256 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 23 minutes, 42 seconds)
2025-09-14 13:15:51,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:15:58,993 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4788.67676 ± 1084.942
2025-09-14 13:15:58,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5148.235), np.float32(5359.762), np.float32(5415.225), np.float32(5343.2817), np.float32(1669.3251), np.float32(5401.419), np.float32(4378.659), np.float32(5342.7544), np.float32(4953.179), np.float32(4874.9253)]
2025-09-14 13:15:58,994 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:15:59,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 21 minutes, 3 seconds)
2025-09-14 13:18:29,256 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:18:36,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5017.53760 ± 799.106
2025-09-14 13:18:36,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5281.395), np.float32(5198.5024), np.float32(5218.8745), np.float32(5439.559), np.float32(2649.195), np.float32(5311.984), np.float32(5360.2812), np.float32(5329.5293), np.float32(4977.9185), np.float32(5408.1387)]
2025-09-14 13:18:36,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:18:36,760 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 18 minutes, 25 seconds)
2025-09-14 13:21:07,446 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:21:14,822 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5035.23730 ± 711.419
2025-09-14 13:21:14,822 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5129.1245), np.float32(2928.3418), np.float32(5357.4326), np.float32(5137.0586), np.float32(5064.693), np.float32(5332.522), np.float32(5360.7), np.float32(5323.656), np.float32(5286.5366), np.float32(5432.31)]
2025-09-14 13:21:14,823 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:21:14,829 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 15 minutes, 47 seconds)
2025-09-14 13:23:45,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:23:53,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5325.64453 ± 62.940
2025-09-14 13:23:53,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5330.8813), np.float32(5223.18), np.float32(5452.509), np.float32(5276.384), np.float32(5267.546), np.float32(5340.5034), np.float32(5294.429), np.float32(5311.962), np.float32(5378.5405), np.float32(5380.517)]
2025-09-14 13:23:53,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:23:53,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1226 [INFO]: New best (5325.64) for latency 12
2025-09-14 13:23:53,185 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 13 minutes, 9 seconds)
2025-09-14 13:26:23,867 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:26:31,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4956.01904 ± 824.278
2025-09-14 13:26:31,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2546.1443), np.float32(5275.1313), np.float32(5253.409), np.float32(4711.0386), np.float32(5307.305), np.float32(5306.091), np.float32(5297.3813), np.float32(5157.272), np.float32(5448.2), np.float32(5258.22)]
2025-09-14 13:26:31,232 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:26:31,239 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 10 minutes, 31 seconds)
2025-09-14 13:29:01,351 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:29:08,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5315.67285 ± 97.758
2025-09-14 13:29:08,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5234.4756), np.float32(5286.5063), np.float32(5416.618), np.float32(5397.0234), np.float32(5104.387), np.float32(5441.3584), np.float32(5371.046), np.float32(5352.377), np.float32(5317.634), np.float32(5235.3022)]
2025-09-14 13:29:08,673 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:29:08,679 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 7 minutes, 53 seconds)
2025-09-14 13:31:38,382 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:31:45,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4648.78027 ± 1117.635
2025-09-14 13:31:45,686 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5037.865), np.float32(5130.6616), np.float32(5365.0757), np.float32(5396.3413), np.float32(4907.308), np.float32(5264.005), np.float32(3575.5627), np.float32(5387.528), np.float32(1668.6067), np.float32(4754.852)]
2025-09-14 13:31:45,687 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:31:45,693 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 15 seconds)
2025-09-14 13:34:15,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:34:22,196 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 5151.92676 ± 238.433
2025-09-14 13:34:22,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5037.2007), np.float32(5407.428), np.float32(4547.175), np.float32(5103.406), np.float32(5251.834), np.float32(5160.4), np.float32(5348.3853), np.float32(5349.2466), np.float32(5026.0586), np.float32(5288.1304)]
2025-09-14 13:34:22,197 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:34:22,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 37 seconds)
2025-09-14 13:36:35,949 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:36:42,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1221 [DEBUG]: Total Reward: 4972.96875 ± 999.220
2025-09-14 13:36:42,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5420.9844), np.float32(5353.268), np.float32(4986.132), np.float32(1997.4104), np.float32(5252.38), np.float32(5256.6025), np.float32(5437.755), np.float32(5314.2764), np.float32(5329.5596), np.float32(5381.3237)]
2025-09-14 13:36:42,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:36:42,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille25-halfcheetah):1251 [DEBUG]: Training session finished
