2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_9
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.100-delay_9
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'9': <latency_env.delayed_mdp.ConstantDelay object at 0x7f2d67c97e60>}
2025-09-14 08:43:01,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,515 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,628 baseline-bpql-noisepromille100-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=71, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,628 baseline-bpql-noisepromille100-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:45:34,345 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 08:45:40,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -312.55939 ± 16.701
2025-09-14 08:45:40,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-271.12775), np.float32(-305.88208), np.float32(-330.83438), np.float32(-306.51038), np.float32(-330.16736), np.float32(-317.35507), np.float32(-314.23666), np.float32(-307.19675), np.float32(-329.66858), np.float32(-312.6149)]
2025-09-14 08:45:40,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:45:40,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-312.56) for latency 9
2025-09-14 08:45:40,577 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 19 minutes, 10 seconds)
2025-09-14 08:48:12,664 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 08:48:19,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -247.92773 ± 25.679
2025-09-14 08:48:19,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-266.95023), np.float32(-262.88068), np.float32(-235.86125), np.float32(-229.7327), np.float32(-277.23148), np.float32(-192.09767), np.float32(-239.28226), np.float32(-231.73663), np.float32(-278.24878), np.float32(-265.25558)]
2025-09-14 08:48:19,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:48:19,083 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-247.93) for latency 9
2025-09-14 08:48:19,085 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 17 minutes, 43 seconds)
2025-09-14 08:51:00,590 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 08:51:07,184 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: -165.74084 ± 47.821
2025-09-14 08:51:07,184 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-131.49672), np.float32(-189.40178), np.float32(-209.77826), np.float32(-142.68135), np.float32(-117.74731), np.float32(-179.77708), np.float32(-74.72329), np.float32(-193.87575), np.float32(-249.9075), np.float32(-168.01949)]
2025-09-14 08:51:07,184 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:51:07,184 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (-165.74) for latency 9
2025-09-14 08:51:07,186 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 20 minutes, 39 seconds)
2025-09-14 08:53:47,419 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 08:53:54,117 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 231.08020 ± 184.044
2025-09-14 08:53:54,117 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(451.27646), np.float32(233.6034), np.float32(-136.51419), np.float32(211.22641), np.float32(465.9377), np.float32(252.71759), np.float32(-60.411358), np.float32(274.95016), np.float32(313.71988), np.float32(304.29602)]
2025-09-14 08:53:54,118 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:53:54,118 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (231.08) for latency 9
2025-09-14 08:53:54,120 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 20 minutes, 14 seconds)
2025-09-14 08:56:41,181 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 08:56:49,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 515.02661 ± 305.906
2025-09-14 08:56:49,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(874.99023), np.float32(716.4014), np.float32(446.8908), np.float32(743.93085), np.float32(177.16501), np.float32(84.20293), np.float32(699.7434), np.float32(100.88017), np.float32(365.6874), np.float32(940.3739)]
2025-09-14 08:56:49,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:56:49,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (515.03) for latency 9
2025-09-14 08:56:49,211 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 21 minutes, 28 seconds)
2025-09-14 09:00:01,154 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:00:09,750 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1327.18188 ± 586.163
2025-09-14 09:00:09,751 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1269.4158), np.float32(453.08334), np.float32(1761.5968), np.float32(1866.0797), np.float32(2014.6887), np.float32(234.63602), np.float32(822.06104), np.float32(1626.2561), np.float32(1522.0155), np.float32(1701.9867)]
2025-09-14 09:00:09,751 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:00:09,751 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1327.18) for latency 9
2025-09-14 09:00:09,753 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 32 minutes, 20 seconds)
2025-09-14 09:03:25,130 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:03:33,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 953.19354 ± 273.169
2025-09-14 09:03:33,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(596.9043), np.float32(1144.8591), np.float32(544.46466), np.float32(1109.0814), np.float32(846.5452), np.float32(938.76324), np.float32(1001.6475), np.float32(1451.2268), np.float32(1203.3649), np.float32(695.0783)]
2025-09-14 09:03:33,778 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:03:33,780 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 43 minutes, 33 seconds)
2025-09-14 09:06:45,044 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:06:53,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1336.15662 ± 398.023
2025-09-14 09:06:53,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1156.3315), np.float32(996.70264), np.float32(1837.8835), np.float32(1184.5151), np.float32(1241.4464), np.float32(1008.0564), np.float32(1025.3339), np.float32(1849.9904), np.float32(978.82855), np.float32(2082.4788)]
2025-09-14 09:06:53,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:06:53,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1336.16) for latency 9
2025-09-14 09:06:53,307 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 50 minutes, 8 seconds)
2025-09-14 09:10:02,166 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:10:10,629 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1993.85095 ± 592.124
2025-09-14 09:10:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2270.1733), np.float32(832.6778), np.float32(1704.2075), np.float32(2200.6028), np.float32(2582.8794), np.float32(2039.127), np.float32(2501.1387), np.float32(2527.5703), np.float32(2277.744), np.float32(1002.39136)]
2025-09-14 09:10:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:10:10,630 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (1993.85) for latency 9
2025-09-14 09:10:10,632 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 56 minutes, 12 seconds)
2025-09-14 09:13:18,410 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:13:26,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1976.32593 ± 614.857
2025-09-14 09:13:26,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2647.859), np.float32(2056.3557), np.float32(2480.9146), np.float32(2019.1343), np.float32(2918.5544), np.float32(1367.6278), np.float32(2254.9146), np.float32(1089.1508), np.float32(1006.3884), np.float32(1922.36)]
2025-09-14 09:13:26,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:13:26,988 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 59 minutes, 19 seconds)
2025-09-14 09:16:34,697 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:16:43,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1408.78589 ± 365.555
2025-09-14 09:16:43,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1104.3992), np.float32(1091.5775), np.float32(1172.2938), np.float32(1516.4333), np.float32(1505.9404), np.float32(2368.258), np.float32(1104.303), np.float32(1221.1425), np.float32(1489.8901), np.float32(1513.6207)]
2025-09-14 09:16:43,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:16:43,258 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 54 minutes, 44 seconds)
2025-09-14 09:19:51,621 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:20:00,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1846.66895 ± 681.086
2025-09-14 09:20:00,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2521.2256), np.float32(1164.8688), np.float32(2664.6694), np.float32(2327.0715), np.float32(2728.454), np.float32(1228.8427), np.float32(925.0905), np.float32(2249.3608), np.float32(1092.5603), np.float32(1564.5448)]
2025-09-14 09:20:00,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:20:00,875 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 49 minutes, 32 seconds)
2025-09-14 09:23:21,255 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:23:30,473 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1903.53052 ± 834.228
2025-09-14 09:23:30,474 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1247.3336), np.float32(737.1423), np.float32(1085.6836), np.float32(1452.1305), np.float32(3017.9075), np.float32(2631.2007), np.float32(2504.682), np.float32(971.64923), np.float32(2542.914), np.float32(2844.6616)]
2025-09-14 09:23:30,474 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:30,477 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 49 minutes, 10 seconds)
2025-09-14 09:26:50,759 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:27:00,096 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2005.11975 ± 745.860
2025-09-14 09:27:00,097 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1086.3263), np.float32(730.71826), np.float32(1155.4028), np.float32(2168.3704), np.float32(3067.7107), np.float32(2014.4238), np.float32(2996.6177), np.float32(2271.7993), np.float32(2203.359), np.float32(2356.4692)]
2025-09-14 09:27:00,097 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:27:00,097 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2005.12) for latency 9
2025-09-14 09:27:00,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 49 minutes, 22 seconds)
2025-09-14 09:30:20,666 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:30:29,941 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2241.07422 ± 757.117
2025-09-14 09:30:29,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2756.8447), np.float32(2970.543), np.float32(2542.4927), np.float32(2028.9172), np.float32(1210.4146), np.float32(3243.9326), np.float32(1122.5643), np.float32(3090.2407), np.float32(2054.113), np.float32(1390.6787)]
2025-09-14 09:30:29,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:29,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2241.07) for latency 9
2025-09-14 09:30:29,945 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 49 minutes, 50 seconds)
2025-09-14 09:33:40,932 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:33:48,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2300.36865 ± 725.805
2025-09-14 09:33:48,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2769.3228), np.float32(1032.3928), np.float32(1801.133), np.float32(1341.4171), np.float32(3154.6812), np.float32(2668.4182), np.float32(1680.4331), np.float32(2960.3645), np.float32(2583.9888), np.float32(3011.5344)]
2025-09-14 09:33:48,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:33:48,525 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2300.37) for latency 9
2025-09-14 09:33:48,528 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 47 minutes, 4 seconds)
2025-09-14 09:36:35,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:36:42,635 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2570.04150 ± 403.616
2025-09-14 09:36:42,635 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2933.217), np.float32(1861.7446), np.float32(2595.38), np.float32(2742.1226), np.float32(2999.8198), np.float32(2661.712), np.float32(2626.7507), np.float32(2693.2969), np.float32(2841.0935), np.float32(1745.2772)]
2025-09-14 09:36:42,635 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:36:42,635 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2570.04) for latency 9
2025-09-14 09:36:42,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 37 minutes, 9 seconds)
2025-09-14 09:39:14,233 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:39:20,915 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2020.25354 ± 646.228
2025-09-14 09:39:20,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1486.2206), np.float32(2111.989), np.float32(1063.7848), np.float32(1093.345), np.float32(1949.3795), np.float32(2951.3918), np.float32(2780.74), np.float32(1704.3673), np.float32(2394.0674), np.float32(2667.2493)]
2025-09-14 09:39:20,916 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:39:20,918 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 19 minutes, 47 seconds)
2025-09-14 09:41:52,426 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:41:59,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2058.60620 ± 709.693
2025-09-14 09:41:59,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1167.6907), np.float32(1425.7809), np.float32(1260.2638), np.float32(2950.7444), np.float32(1836.6821), np.float32(2716.666), np.float32(2103.6165), np.float32(3037.5273), np.float32(2738.5684), np.float32(1348.5208)]
2025-09-14 09:41:59,106 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:41:59,109 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 2 minutes, 43 seconds)
2025-09-14 09:44:47,369 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:44:56,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1953.00513 ± 747.356
2025-09-14 09:44:56,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2859.7815), np.float32(1124.3901), np.float32(994.663), np.float32(1174.758), np.float32(1174.7705), np.float32(2134.28), np.float32(2573.8936), np.float32(3141.3557), np.float32(2084.5244), np.float32(2267.6335)]
2025-09-14 09:44:56,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:44:56,475 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 51 minutes, 4 seconds)
2025-09-14 09:48:19,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:48:28,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1918.49902 ± 626.371
2025-09-14 09:48:28,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2072.9995), np.float32(1094.6633), np.float32(1617.53), np.float32(1254.0979), np.float32(1539.652), np.float32(2796.586), np.float32(2533.862), np.float32(2751.103), np.float32(1190.6155), np.float32(2333.8804)]
2025-09-14 09:48:28,889 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:48:28,892 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 51 minutes, 49 seconds)
2025-09-14 09:51:53,111 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:52:02,626 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2168.94434 ± 680.335
2025-09-14 09:52:02,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2166.4553), np.float32(2677.9846), np.float32(1178.2217), np.float32(3107.5361), np.float32(1839.7588), np.float32(2536.526), np.float32(1632.6783), np.float32(3150.0283), np.float32(2235.7488), np.float32(1164.5076)]
2025-09-14 09:52:02,627 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:52:02,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 3 hours, 59 minutes, 11 seconds)
2025-09-14 09:55:27,506 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:55:36,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2125.36768 ± 469.319
2025-09-14 09:55:36,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1668.5375), np.float32(2472.8071), np.float32(2029.2858), np.float32(2387.573), np.float32(2898.9866), np.float32(1232.1808), np.float32(2297.444), np.float32(2501.38), np.float32(1647.9972), np.float32(2117.4834)]
2025-09-14 09:55:36,975 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:55:36,979 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 10 minutes, 31 seconds)
2025-09-14 09:59:01,639 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 09:59:11,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2829.40747 ± 400.203
2025-09-14 09:59:11,146 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3110.6882), np.float32(3248.4854), np.float32(2771.1575), np.float32(2707.3262), np.float32(2652.6523), np.float32(2504.7825), np.float32(3117.2222), np.float32(3063.2576), np.float32(1885.2839), np.float32(3233.2217)]
2025-09-14 09:59:11,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:59:11,147 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2829.41) for latency 9
2025-09-14 09:59:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 21 minutes, 27 seconds)
2025-09-14 10:02:36,036 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:02:45,505 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1804.87659 ± 493.266
2025-09-14 10:02:45,505 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2545.0334), np.float32(2086.738), np.float32(1794.1117), np.float32(1349.8038), np.float32(1594.4908), np.float32(1432.3042), np.float32(2059.926), np.float32(2664.616), np.float32(1162.218), np.float32(1359.5228)]
2025-09-14 10:02:45,505 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:02:45,509 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 27 minutes, 15 seconds)
2025-09-14 10:06:09,227 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:06:18,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1943.94568 ± 702.224
2025-09-14 10:06:18,775 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3125.7122), np.float32(1578.1862), np.float32(1696.8615), np.float32(2806.6436), np.float32(2918.797), np.float32(1483.9839), np.float32(1111.3513), np.float32(2037.065), np.float32(1380.958), np.float32(1299.8978)]
2025-09-14 10:06:18,776 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:06:18,780 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 23 minutes, 54 seconds)
2025-09-14 10:09:42,762 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:09:52,347 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2181.30469 ± 719.411
2025-09-14 10:09:52,347 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3014.2603), np.float32(1512.8716), np.float32(2149.9753), np.float32(2002.3434), np.float32(1327.2745), np.float32(1820.6044), np.float32(1193.4226), np.float32(2367.053), np.float32(3261.5337), np.float32(3163.707)]
2025-09-14 10:09:52,347 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:09:52,351 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 20 minutes, 17 seconds)
2025-09-14 10:13:16,270 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:13:25,795 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1887.17029 ± 584.100
2025-09-14 10:13:25,795 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3126.87), np.float32(1198.9048), np.float32(1195.148), np.float32(2251.9514), np.float32(1565.7393), np.float32(2390.0378), np.float32(2209.5378), np.float32(1387.1494), np.float32(1610.2742), np.float32(1936.0922)]
2025-09-14 10:13:25,795 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:13:25,799 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 16 minutes, 31 seconds)
2025-09-14 10:16:50,894 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:17:00,375 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2420.61011 ± 941.680
2025-09-14 10:17:00,375 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1207.6628), np.float32(3403.011), np.float32(2996.071), np.float32(3382.1567), np.float32(1091.8754), np.float32(1820.8446), np.float32(2570.717), np.float32(1174.835), np.float32(3295.607), np.float32(3263.321)]
2025-09-14 10:17:00,375 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:00,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 13 minutes, 3 seconds)
2025-09-14 10:20:24,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:20:34,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2168.65161 ± 858.580
2025-09-14 10:20:34,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2214.126), np.float32(2612.998), np.float32(1768.0372), np.float32(3136.7366), np.float32(1250.3077), np.float32(1579.4552), np.float32(3600.2253), np.float32(1223.9401), np.float32(3135.5696), np.float32(1165.1168)]
2025-09-14 10:20:34,532 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:20:34,537 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 9 minutes, 26 seconds)
2025-09-14 10:23:58,205 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:24:07,766 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 1727.16638 ± 418.169
2025-09-14 10:24:07,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1441.72), np.float32(2206.773), np.float32(1903.8876), np.float32(2596.0432), np.float32(1281.0654), np.float32(1813.0425), np.float32(1766.3235), np.float32(1315.3389), np.float32(1204.4712), np.float32(1742.9973)]
2025-09-14 10:24:07,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:24:07,771 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 4 hours, 5 minutes, 52 seconds)
2025-09-14 10:27:31,185 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:27:40,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2345.85010 ± 787.216
2025-09-14 10:27:40,634 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2080.8813), np.float32(2715.418), np.float32(2867.1777), np.float32(2368.3027), np.float32(1344.9485), np.float32(3249.114), np.float32(1175.5017), np.float32(1592.3927), np.float32(3746.6135), np.float32(2318.1497)]
2025-09-14 10:27:40,634 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:27:40,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 4 hours, 2 minutes, 8 seconds)
2025-09-14 10:31:03,512 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:31:13,023 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2261.84644 ± 780.013
2025-09-14 10:31:13,023 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1161.4781), np.float32(2979.557), np.float32(2732.3428), np.float32(3106.1685), np.float32(1413.3499), np.float32(1337.099), np.float32(3032.1902), np.float32(2728.8162), np.float32(1371.2925), np.float32(2756.171)]
2025-09-14 10:31:13,024 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:31:13,028 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 58 minutes, 20 seconds)
2025-09-14 10:34:36,516 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:34:46,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2619.71875 ± 549.055
2025-09-14 10:34:46,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2586.2092), np.float32(3208.8018), np.float32(2151.759), np.float32(2302.606), np.float32(1637.109), np.float32(2770.4998), np.float32(2320.7075), np.float32(3131.7617), np.float32(3612.7544), np.float32(2474.9812)]
2025-09-14 10:34:46,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:34:46,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 54 minutes, 27 seconds)
2025-09-14 10:38:09,507 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:38:19,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2421.05005 ± 565.446
2025-09-14 10:38:19,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3178.3037), np.float32(2408.497), np.float32(1583.675), np.float32(3368.577), np.float32(2078.8447), np.float32(3008.1091), np.float32(1841.3531), np.float32(1960.795), np.float32(2309.658), np.float32(2472.6855)]
2025-09-14 10:38:19,032 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:38:19,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 50 minutes, 38 seconds)
2025-09-14 10:41:41,950 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:41:51,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2859.17920 ± 679.784
2025-09-14 10:41:51,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2951.9106), np.float32(3235.8696), np.float32(1437.6858), np.float32(3011.8403), np.float32(3343.736), np.float32(2894.6782), np.float32(3590.7878), np.float32(3069.1152), np.float32(3362.3455), np.float32(1693.8212)]
2025-09-14 10:41:51,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:41:51,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (2859.18) for latency 9
2025-09-14 10:41:51,471 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 46 minutes, 55 seconds)
2025-09-14 10:45:14,746 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:45:24,178 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2570.37451 ± 662.729
2025-09-14 10:45:24,178 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2991.5142), np.float32(2979.9243), np.float32(2709.3115), np.float32(3201.5564), np.float32(1449.488), np.float32(3269.3674), np.float32(2134.8635), np.float32(1916.7837), np.float32(3320.2793), np.float32(1730.6555)]
2025-09-14 10:45:24,179 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:45:24,183 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 43 minutes, 20 seconds)
2025-09-14 10:48:46,992 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:48:56,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2325.05273 ± 655.634
2025-09-14 10:48:56,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3129.88), np.float32(3419.001), np.float32(2351.4082), np.float32(1614.9391), np.float32(2643.7458), np.float32(2494.7205), np.float32(1783.4968), np.float32(1706.4607), np.float32(1347.997), np.float32(2758.8784)]
2025-09-14 10:48:56,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:48:56,423 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 39 minutes, 46 seconds)
2025-09-14 10:52:18,187 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:52:27,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2602.66064 ± 647.727
2025-09-14 10:52:27,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3450.7944), np.float32(2741.668), np.float32(1534.258), np.float32(2953.035), np.float32(2708.0742), np.float32(3119.121), np.float32(3278.5256), np.float32(2667.3447), np.float32(1539.0712), np.float32(2034.7142)]
2025-09-14 10:52:27,530 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:52:27,535 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 35 minutes, 49 seconds)
2025-09-14 10:55:44,546 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:55:53,871 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2412.29956 ± 1181.137
2025-09-14 10:55:53,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3069.9004), np.float32(1322.6855), np.float32(306.76715), np.float32(3737.5745), np.float32(1250.6873), np.float32(3024.301), np.float32(3833.5227), np.float32(1381.8542), np.float32(3542.811), np.float32(2652.893)]
2025-09-14 10:55:53,872 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:55:53,876 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 30 minutes, 58 seconds)
2025-09-14 10:59:11,305 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 10:59:20,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3165.92676 ± 536.155
2025-09-14 10:59:20,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3243.0383), np.float32(3374.434), np.float32(3402.8667), np.float32(3633.0625), np.float32(3162.3132), np.float32(3327.5835), np.float32(2612.526), np.float32(1784.4019), np.float32(3559.4912), np.float32(3559.5496)]
2025-09-14 10:59:20,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:59:20,559 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3165.93) for latency 9
2025-09-14 10:59:20,564 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 26 minutes, 19 seconds)
2025-09-14 11:02:25,420 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:02:33,907 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2824.05444 ± 610.639
2025-09-14 11:02:33,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1435.0282), np.float32(2920.7441), np.float32(2660.424), np.float32(3091.2026), np.float32(3476.5212), np.float32(3138.4382), np.float32(3387.984), np.float32(3389.5586), np.float32(2127.6282), np.float32(2613.0134)]
2025-09-14 11:02:33,908 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:02:33,912 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 19 minutes, 4 seconds)
2025-09-14 11:05:37,618 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:05:46,099 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2998.09229 ± 679.111
2025-09-14 11:05:46,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3124.466), np.float32(3465.639), np.float32(3116.773), np.float32(1569.4326), np.float32(3028.9614), np.float32(1925.7035), np.float32(3565.2314), np.float32(3128.7205), np.float32(3159.9558), np.float32(3896.0417)]
2025-09-14 11:05:46,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:05:46,104 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 11 minutes, 50 seconds)
2025-09-14 11:08:38,718 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:08:46,458 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2796.63013 ± 931.494
2025-09-14 11:08:46,459 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2422.8445), np.float32(3318.4868), np.float32(3248.3499), np.float32(3371.72), np.float32(3120.226), np.float32(3749.668), np.float32(3738.9556), np.float32(1099.7275), np.float32(2816.9907), np.float32(1079.3337)]
2025-09-14 11:08:46,459 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:08:46,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 2 minutes, 43 seconds)
2025-09-14 11:11:20,835 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:11:27,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2805.28467 ± 647.889
2025-09-14 11:11:27,648 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2863.9165), np.float32(3326.4917), np.float32(3120.605), np.float32(3458.7056), np.float32(2050.7866), np.float32(2133.942), np.float32(3451.0286), np.float32(1526.6554), np.float32(2748.7922), np.float32(3371.9214)]
2025-09-14 11:11:27,649 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:11:27,653 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 51 minutes, 11 seconds)
2025-09-14 11:13:57,138 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:14:03,906 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2164.84033 ± 675.225
2025-09-14 11:14:03,906 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3156.5454), np.float32(1896.5822), np.float32(1543.9302), np.float32(2240.091), np.float32(1660.6799), np.float32(2973.5654), np.float32(1920.526), np.float32(1360.753), np.float32(3265.7869), np.float32(1629.9429)]
2025-09-14 11:14:03,906 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:14:03,913 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 39 minutes)
2025-09-14 11:16:33,300 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:16:40,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2960.27808 ± 728.383
2025-09-14 11:16:40,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3102.9795), np.float32(3567.8838), np.float32(3245.6763), np.float32(1479.3436), np.float32(3427.5134), np.float32(3461.098), np.float32(2369.6907), np.float32(1898.3158), np.float32(3694.575), np.float32(3355.7075)]
2025-09-14 11:16:40,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:16:40,063 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 29 minutes, 29 seconds)
2025-09-14 11:19:09,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:19:16,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2854.83618 ± 869.229
2025-09-14 11:19:16,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1677.113), np.float32(3643.232), np.float32(3629.391), np.float32(2908.6453), np.float32(3272.128), np.float32(1439.0718), np.float32(3231.324), np.float32(3663.8962), np.float32(3483.404), np.float32(1600.157)]
2025-09-14 11:19:16,417 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:19:16,422 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 20 minutes, 27 seconds)
2025-09-14 11:21:46,267 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:21:53,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2737.43604 ± 832.036
2025-09-14 11:21:53,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3196.8047), np.float32(2117.1792), np.float32(1703.6296), np.float32(2355.2993), np.float32(3189.3677), np.float32(3101.5916), np.float32(3784.8079), np.float32(3867.8809), np.float32(2864.5188), np.float32(1193.2806)]
2025-09-14 11:21:53,000 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:21:53,004 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 13 minutes, 42 seconds)
2025-09-14 11:24:22,403 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:24:29,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3211.50391 ± 559.133
2025-09-14 11:24:29,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3297.9036), np.float32(3019.3118), np.float32(3518.7307), np.float32(3542.7212), np.float32(3532.9563), np.float32(3475.0498), np.float32(3775.5479), np.float32(3632.0479), np.float32(2172.7517), np.float32(2148.0173)]
2025-09-14 11:24:29,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:24:29,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3211.50) for latency 9
2025-09-14 11:24:29,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 10 minutes, 15 seconds)
2025-09-14 11:26:58,843 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:27:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2905.87354 ± 654.327
2025-09-14 11:27:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3575.905), np.float32(3472.858), np.float32(2423.808), np.float32(3207.0715), np.float32(1352.7651), np.float32(3252.408), np.float32(2250.4358), np.float32(3223.3801), np.float32(3095.1135), np.float32(3204.9897)]
2025-09-14 11:27:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:27:05,658 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 7 minutes, 41 seconds)
2025-09-14 11:29:35,037 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:29:41,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3093.25269 ± 496.405
2025-09-14 11:29:41,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3327.3076), np.float32(3394.9226), np.float32(3238.8916), np.float32(3701.4146), np.float32(3298.0276), np.float32(2590.515), np.float32(3536.719), np.float32(3321.2737), np.float32(2238.06), np.float32(2285.3955)]
2025-09-14 11:29:41,846 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:29:41,851 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 5 minutes, 5 seconds)
2025-09-14 11:32:11,238 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:32:18,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3121.84888 ± 394.943
2025-09-14 11:32:18,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2973.1018), np.float32(2776.6228), np.float32(3051.0432), np.float32(3830.246), np.float32(3494.0696), np.float32(2519.857), np.float32(3295.6536), np.float32(3348.163), np.float32(3339.0273), np.float32(2590.7026)]
2025-09-14 11:32:18,038 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:32:18,043 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 2 minutes, 27 seconds)
2025-09-14 11:34:47,087 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:34:53,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2862.83057 ± 701.887
2025-09-14 11:34:53,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1934.1632), np.float32(2009.4808), np.float32(3499.0574), np.float32(2157.173), np.float32(3550.8582), np.float32(3640.4854), np.float32(2867.794), np.float32(3586.096), np.float32(2069.833), np.float32(3313.3643)]
2025-09-14 11:34:53,934 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:34:53,939 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 1 hour, 59 minutes, 44 seconds)
2025-09-14 11:37:23,727 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:37:30,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2938.27490 ± 715.749
2025-09-14 11:37:30,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3843.5723), np.float32(3255.7346), np.float32(1326.7247), np.float32(3151.1157), np.float32(2976.029), np.float32(2028.3416), np.float32(3565.8052), np.float32(3188.465), np.float32(2662.984), np.float32(3383.9768)]
2025-09-14 11:37:30,553 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:37:30,558 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 57 minutes, 12 seconds)
2025-09-14 11:39:59,522 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:40:06,384 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2453.73462 ± 716.470
2025-09-14 11:40:06,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1620.4272), np.float32(3077.3298), np.float32(2788.0635), np.float32(1607.1328), np.float32(3264.8298), np.float32(2694.5), np.float32(1470.5542), np.float32(3105.8186), np.float32(1708.0829), np.float32(3200.6077)]
2025-09-14 11:40:06,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:40:06,389 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 54 minutes, 30 seconds)
2025-09-14 11:42:35,521 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:42:42,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2855.32373 ± 720.481
2025-09-14 11:42:42,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3443.3357), np.float32(1842.9205), np.float32(1811.425), np.float32(3236.676), np.float32(3490.5896), np.float32(2046.8322), np.float32(3621.6128), np.float32(3206.2349), np.float32(2291.481), np.float32(3562.13)]
2025-09-14 11:42:42,392 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:42:42,397 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 51 minutes, 52 seconds)
2025-09-14 11:45:11,806 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:45:18,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3073.03955 ± 710.000
2025-09-14 11:45:18,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3810.3672), np.float32(2428.561), np.float32(3584.7686), np.float32(1666.0427), np.float32(3428.422), np.float32(3675.7483), np.float32(3492.3704), np.float32(3562.6262), np.float32(2111.3699), np.float32(2970.1213)]
2025-09-14 11:45:18,678 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:45:18,683 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 49 minutes, 17 seconds)
2025-09-14 11:47:47,877 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:47:54,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2746.51978 ± 776.789
2025-09-14 11:47:54,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2944.4102), np.float32(2695.7444), np.float32(1529.2172), np.float32(3347.6548), np.float32(3072.458), np.float32(3401.6687), np.float32(3513.8552), np.float32(1733.4596), np.float32(1622.6384), np.float32(3604.089)]
2025-09-14 11:47:54,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:47:54,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 46 minutes, 42 seconds)
2025-09-14 11:50:23,706 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:50:30,594 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2208.19971 ± 809.516
2025-09-14 11:50:30,594 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2256.0537), np.float32(3224.7888), np.float32(3320.3276), np.float32(1602.2885), np.float32(1684.1051), np.float32(1193.7753), np.float32(3561.852), np.float32(1455.9048), np.float32(1964.7407), np.float32(1818.1599)]
2025-09-14 11:50:30,594 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:50:30,599 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 44 minutes)
2025-09-14 11:52:59,750 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:53:06,730 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2991.01562 ± 600.226
2025-09-14 11:53:06,730 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3212.9358), np.float32(3371.361), np.float32(2169.287), np.float32(3254.9458), np.float32(3483.126), np.float32(2876.6416), np.float32(1607.0728), np.float32(3270.1853), np.float32(3634.3494), np.float32(3030.2522)]
2025-09-14 11:53:06,730 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:53:06,735 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 41 minutes, 26 seconds)
2025-09-14 11:55:35,940 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:55:42,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2889.27051 ± 807.357
2025-09-14 11:55:42,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3596.456), np.float32(3327.0842), np.float32(3856.4238), np.float32(2997.8381), np.float32(1493.1586), np.float32(2528.6624), np.float32(3213.8494), np.float32(3717.3235), np.float32(1488.3578), np.float32(2673.5505)]
2025-09-14 11:55:42,849 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:55:42,854 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 38 minutes, 51 seconds)
2025-09-14 11:58:12,161 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 11:58:19,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2321.49780 ± 859.743
2025-09-14 11:58:19,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3288.2195), np.float32(1575.9601), np.float32(1752.1661), np.float32(1397.414), np.float32(3620.2021), np.float32(2210.8596), np.float32(2061.5916), np.float32(2322.8086), np.float32(3696.6165), np.float32(1289.1393)]
2025-09-14 11:58:19,059 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:58:19,064 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 36 minutes, 14 seconds)
2025-09-14 12:00:48,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:00:55,463 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2744.53223 ± 876.983
2025-09-14 12:00:55,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1592.5199), np.float32(3101.004), np.float32(3621.7046), np.float32(1561.9753), np.float32(3307.8533), np.float32(3478.6072), np.float32(3098.0898), np.float32(1199.6317), np.float32(3562.9866), np.float32(2920.9504)]
2025-09-14 12:00:55,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:00:55,469 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 33 minutes, 40 seconds)
2025-09-14 12:03:24,497 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:03:31,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2774.32812 ± 761.722
2025-09-14 12:03:31,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3324.8647), np.float32(3361.9163), np.float32(1363.2441), np.float32(3079.8958), np.float32(3168.6497), np.float32(3444.5955), np.float32(2999.746), np.float32(1534.9441), np.float32(2071.5664), np.float32(3393.8591)]
2025-09-14 12:03:31,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:03:31,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 31 minutes, 5 seconds)
2025-09-14 12:06:00,657 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:06:07,554 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3017.83081 ± 634.075
2025-09-14 12:06:07,555 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3379.8894), np.float32(3332.7053), np.float32(3255.0674), np.float32(3804.5378), np.float32(1585.4816), np.float32(3250.7068), np.float32(3503.9714), np.float32(2398.6604), np.float32(3234.692), np.float32(2432.5981)]
2025-09-14 12:06:07,555 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:06:07,560 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 28 minutes, 29 seconds)
2025-09-14 12:08:37,090 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:08:43,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3289.33521 ± 604.076
2025-09-14 12:08:43,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3852.4475), np.float32(2899.3157), np.float32(3613.6196), np.float32(3265.0903), np.float32(1755.6512), np.float32(3235.1355), np.float32(3810.7588), np.float32(3925.0427), np.float32(3413.2065), np.float32(3123.0857)]
2025-09-14 12:08:43,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:08:43,972 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3289.34) for latency 9
2025-09-14 12:08:43,977 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 25 minutes, 55 seconds)
2025-09-14 12:11:13,631 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:11:20,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2768.40283 ± 722.010
2025-09-14 12:11:20,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3202.2324), np.float32(3090.1528), np.float32(2422.4558), np.float32(3653.8718), np.float32(2003.8413), np.float32(1879.618), np.float32(1463.1876), np.float32(3416.3574), np.float32(3231.5818), np.float32(3320.7288)]
2025-09-14 12:11:20,540 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:11:20,545 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 23 minutes, 21 seconds)
2025-09-14 12:13:49,715 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:13:56,601 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2500.76562 ± 974.784
2025-09-14 12:13:56,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1874.2848), np.float32(1262.8209), np.float32(3639.166), np.float32(3419.4712), np.float32(1247.5226), np.float32(2465.8274), np.float32(3458.72), np.float32(1146.775), np.float32(3277.6484), np.float32(3215.4175)]
2025-09-14 12:13:56,602 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:13:56,607 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 20 minutes, 43 seconds)
2025-09-14 12:16:25,561 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:16:32,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3272.32178 ± 522.900
2025-09-14 12:16:32,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3439.1726), np.float32(3244.1558), np.float32(3273.6096), np.float32(3546.1965), np.float32(3729.036), np.float32(2627.5154), np.float32(2002.8818), np.float32(3646.0303), np.float32(3483.4639), np.float32(3731.1538)]
2025-09-14 12:16:32,569 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:16:32,575 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 18 minutes, 7 seconds)
2025-09-14 12:19:01,390 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:19:08,279 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2959.25098 ± 699.818
2025-09-14 12:19:08,279 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3215.0173), np.float32(3183.3423), np.float32(3200.7502), np.float32(3335.1333), np.float32(3554.4692), np.float32(2701.211), np.float32(3141.4065), np.float32(1091.7222), np.float32(2532.913), np.float32(3636.5452)]
2025-09-14 12:19:08,279 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:19:08,284 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 15 minutes, 28 seconds)
2025-09-14 12:21:37,503 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:21:44,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3218.33203 ± 591.992
2025-09-14 12:21:44,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3218.8877), np.float32(3890.608), np.float32(1623.3501), np.float32(3592.263), np.float32(3270.0142), np.float32(3585.5435), np.float32(3478.4783), np.float32(3297.969), np.float32(2853.1853), np.float32(3373.021)]
2025-09-14 12:21:44,396 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:21:44,401 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 12 minutes, 50 seconds)
2025-09-14 12:24:13,985 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:24:20,948 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2930.78955 ± 685.854
2025-09-14 12:24:20,949 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3672.5618), np.float32(3652.3938), np.float32(2440.2744), np.float32(2442.2158), np.float32(2046.0531), np.float32(3191.702), np.float32(1965.4213), np.float32(2532.5857), np.float32(3885.7017), np.float32(3478.9885)]
2025-09-14 12:24:20,949 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:24:20,954 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 10 minutes, 14 seconds)
2025-09-14 12:26:50,195 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:26:57,104 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3400.73779 ± 294.145
2025-09-14 12:26:57,105 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3502.0237), np.float32(3384.2327), np.float32(3628.7153), np.float32(2916.9482), np.float32(2826.695), np.float32(3824.7314), np.float32(3443.968), np.float32(3625.8413), np.float32(3375.01), np.float32(3479.2117)]
2025-09-14 12:26:57,105 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:26:57,105 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3400.74) for latency 9
2025-09-14 12:26:57,110 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 7 minutes, 38 seconds)
2025-09-14 12:29:26,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:29:33,188 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3187.28516 ± 532.996
2025-09-14 12:29:33,188 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3155.4602), np.float32(3612.1592), np.float32(3480.1501), np.float32(3328.4155), np.float32(1924.4048), np.float32(3478.0415), np.float32(3705.077), np.float32(3372.094), np.float32(3364.892), np.float32(2452.158)]
2025-09-14 12:29:33,188 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:29:33,193 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 5 minutes, 3 seconds)
2025-09-14 12:32:02,304 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:32:09,202 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3193.48291 ± 623.417
2025-09-14 12:32:09,202 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3829.6506), np.float32(3659.072), np.float32(3413.1794), np.float32(2294.939), np.float32(3140.8972), np.float32(2091.141), np.float32(3352.7703), np.float32(2663.8958), np.float32(4117.5005), np.float32(3371.7856)]
2025-09-14 12:32:09,202 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:32:09,207 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 2 minutes, 28 seconds)
2025-09-14 12:34:38,320 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:34:45,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2829.73120 ± 812.636
2025-09-14 12:34:45,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3791.0454), np.float32(3602.35), np.float32(3497.2354), np.float32(1348.0074), np.float32(2437.261), np.float32(3061.3416), np.float32(2035.5321), np.float32(1852.4489), np.float32(3596.2632), np.float32(3075.826)]
2025-09-14 12:34:45,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:34:45,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 59 minutes, 52 seconds)
2025-09-14 12:37:14,596 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:37:21,498 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3096.27197 ± 762.721
2025-09-14 12:37:21,498 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3604.8972), np.float32(3536.2812), np.float32(3802.5369), np.float32(1688.7174), np.float32(3232.517), np.float32(2052.1775), np.float32(2152.5132), np.float32(3634.3333), np.float32(3511.405), np.float32(3747.3403)]
2025-09-14 12:37:21,499 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:37:21,504 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 57 minutes, 14 seconds)
2025-09-14 12:39:51,041 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:39:57,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3136.73022 ± 821.095
2025-09-14 12:39:57,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3576.8567), np.float32(3625.0925), np.float32(1560.0859), np.float32(3892.176), np.float32(3742.988), np.float32(3293.5957), np.float32(1555.5945), np.float32(3282.8557), np.float32(3719.3357), np.float32(3118.7212)]
2025-09-14 12:39:57,936 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:57,942 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 54 minutes, 39 seconds)
2025-09-14 12:42:27,240 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:42:34,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3337.65967 ± 392.488
2025-09-14 12:42:34,128 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3595.9983), np.float32(3636.8945), np.float32(3398.9058), np.float32(3638.7158), np.float32(3867.0898), np.float32(2547.9167), np.float32(3041.5422), np.float32(2846.531), np.float32(3245.347), np.float32(3557.6562)]
2025-09-14 12:42:34,129 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:42:34,134 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 52 minutes, 3 seconds)
2025-09-14 12:45:03,455 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:45:10,352 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3317.47461 ± 567.669
2025-09-14 12:45:10,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3968.7263), np.float32(3205.5032), np.float32(4125.049), np.float32(2580.9863), np.float32(3071.931), np.float32(3722.786), np.float32(3326.1235), np.float32(3478.431), np.float32(3518.0996), np.float32(2177.1113)]
2025-09-14 12:45:10,353 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:45:10,358 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 49 minutes, 28 seconds)
2025-09-14 12:47:39,803 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:47:46,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3113.82324 ± 793.144
2025-09-14 12:47:46,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3628.3667), np.float32(3803.1008), np.float32(3035.667), np.float32(2083.142), np.float32(3520.6199), np.float32(3777.296), np.float32(1540.5343), np.float32(3736.6985), np.float32(3710.1575), np.float32(2302.648)]
2025-09-14 12:47:46,705 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:47:46,711 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 46 minutes, 53 seconds)
2025-09-14 12:50:16,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:50:23,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3251.93408 ± 942.761
2025-09-14 12:50:23,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1557.506), np.float32(1317.0028), np.float32(4036.5688), np.float32(3510.5525), np.float32(4013.1216), np.float32(3310.0015), np.float32(3955.0027), np.float32(3721.219), np.float32(3790.4094), np.float32(3307.956)]
2025-09-14 12:50:23,293 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:50:23,299 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 44 minutes, 18 seconds)
2025-09-14 12:52:52,418 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:52:59,296 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2682.05151 ± 777.916
2025-09-14 12:52:59,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3398.4504), np.float32(2520.0872), np.float32(3236.6587), np.float32(3507.388), np.float32(2836.1572), np.float32(1620.6882), np.float32(3699.0833), np.float32(1208.127), np.float32(2354.0715), np.float32(2439.803)]
2025-09-14 12:52:59,297 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:52:59,302 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 41 minutes, 40 seconds)
2025-09-14 12:55:28,756 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:55:35,637 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3323.81128 ± 679.391
2025-09-14 12:55:35,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3920.6487), np.float32(3361.5342), np.float32(3837.191), np.float32(3492.8445), np.float32(3424.7212), np.float32(3483.1619), np.float32(3393.5088), np.float32(3588.821), np.float32(3374.6997), np.float32(1360.9827)]
2025-09-14 12:55:35,638 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:55:35,644 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 39 minutes, 4 seconds)
2025-09-14 12:58:04,986 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 12:58:11,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3543.69531 ± 257.156
2025-09-14 12:58:11,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3184.322), np.float32(3631.515), np.float32(3449.298), np.float32(3145.583), np.float32(3969.6526), np.float32(3474.9138), np.float32(3557.5366), np.float32(3879.32), np.float32(3404.517), np.float32(3740.2935)]
2025-09-14 12:58:11,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:58:11,873 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1226 [INFO]: New best (3543.70) for latency 9
2025-09-14 12:58:11,879 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 36 minutes, 28 seconds)
2025-09-14 13:00:41,271 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:00:48,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2740.94312 ± 704.021
2025-09-14 13:00:48,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2911.5664), np.float32(3571.0303), np.float32(1775.9005), np.float32(3164.094), np.float32(3486.1948), np.float32(1564.0316), np.float32(2133.0088), np.float32(3617.4653), np.float32(2539.2112), np.float32(2646.9282)]
2025-09-14 13:00:48,145 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:00:48,151 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 33 minutes, 51 seconds)
2025-09-14 13:03:17,309 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:03:24,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2748.19067 ± 1036.708
2025-09-14 13:03:24,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1793.9176), np.float32(3862.209), np.float32(1643.6), np.float32(3849.2915), np.float32(3794.306), np.float32(1268.4828), np.float32(1499.8439), np.float32(2655.699), np.float32(3598.297), np.float32(3516.2605)]
2025-09-14 13:03:24,203 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:03:24,209 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 31 minutes, 14 seconds)
2025-09-14 13:05:53,483 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:06:00,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3430.93750 ± 377.525
2025-09-14 13:06:00,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2385.8699), np.float32(3485.6484), np.float32(3635.7844), np.float32(3281.6086), np.float32(3509.4055), np.float32(3675.8125), np.float32(3765.128), np.float32(3708.6265), np.float32(3504.616), np.float32(3356.8752)]
2025-09-14 13:06:00,380 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:06:00,386 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 28 minutes, 38 seconds)
2025-09-14 13:08:29,568 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:08:36,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2714.40869 ± 866.396
2025-09-14 13:08:36,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1727.0621), np.float32(3712.0806), np.float32(2301.2222), np.float32(3060.983), np.float32(1685.9266), np.float32(3488.5554), np.float32(1419.822), np.float32(3682.9111), np.float32(3653.9111), np.float32(2411.6147)]
2025-09-14 13:08:36,453 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:08:36,460 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 26 minutes, 1 second)
2025-09-14 13:11:05,586 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:11:12,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3309.29761 ± 701.094
2025-09-14 13:11:12,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1919.8805), np.float32(3732.3796), np.float32(3188.2642), np.float32(4247.2275), np.float32(3749.2473), np.float32(3794.9058), np.float32(2475.8162), np.float32(3896.8884), np.float32(3432.5054), np.float32(2655.8604)]
2025-09-14 13:11:12,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:11:12,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 23 minutes, 24 seconds)
2025-09-14 13:13:41,645 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:13:48,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3113.70166 ± 773.612
2025-09-14 13:13:48,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3691.6848), np.float32(1437.4701), np.float32(2106.193), np.float32(3114.1924), np.float32(3576.4429), np.float32(3694.7537), np.float32(2625.517), np.float32(3256.3262), np.float32(3704.1353), np.float32(3930.3025)]
2025-09-14 13:13:48,461 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:13:48,467 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 20 minutes, 48 seconds)
2025-09-14 13:16:18,035 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:16:24,898 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2837.72095 ± 883.448
2025-09-14 13:16:24,898 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3868.3152), np.float32(1816.7174), np.float32(3521.0867), np.float32(1377.1503), np.float32(2348.8655), np.float32(3603.6257), np.float32(3395.7134), np.float32(3061.4785), np.float32(3659.229), np.float32(1725.0276)]
2025-09-14 13:16:24,898 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:16:24,904 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 18 minutes, 12 seconds)
2025-09-14 13:18:54,100 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:19:01,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3004.63281 ± 796.547
2025-09-14 13:19:01,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1769.4248), np.float32(3890.0422), np.float32(3617.0703), np.float32(3722.254), np.float32(1944.6617), np.float32(3448.0417), np.float32(3792.308), np.float32(2097.7207), np.float32(3302.0254), np.float32(2462.7783)]
2025-09-14 13:19:01,009 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:19:01,015 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 15 minutes, 36 seconds)
2025-09-14 13:21:30,464 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:21:37,378 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3423.94019 ± 374.245
2025-09-14 13:21:37,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3621.603), np.float32(3555.1165), np.float32(2449.6729), np.float32(3284.209), np.float32(3781.614), np.float32(3714.5747), np.float32(3654.4722), np.float32(3204.1853), np.float32(3322.956), np.float32(3651.0007)]
2025-09-14 13:21:37,379 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:21:37,385 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 13 minutes)
2025-09-14 13:24:06,430 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:24:13,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 2408.01294 ± 1046.337
2025-09-14 13:24:13,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3581.8933), np.float32(1286.2064), np.float32(1316.2285), np.float32(2578.9094), np.float32(2133.0015), np.float32(3460.823), np.float32(1460.0663), np.float32(3663.0154), np.float32(994.31915), np.float32(3605.6658)]
2025-09-14 13:24:13,336 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:24:13,343 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 10 minutes, 24 seconds)
2025-09-14 13:26:42,874 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:26:49,782 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3464.98364 ± 499.972
2025-09-14 13:26:49,783 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3927.3015), np.float32(3570.1106), np.float32(3246.9705), np.float32(3937.0842), np.float32(3451.7559), np.float32(3588.9343), np.float32(2174.862), np.float32(3191.6855), np.float32(3939.7734), np.float32(3621.358)]
2025-09-14 13:26:49,783 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:26:49,789 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 7 minutes, 48 seconds)
2025-09-14 13:29:18,869 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:29:25,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3480.52734 ± 603.854
2025-09-14 13:29:25,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1814.796), np.float32(3744.326), np.float32(3647.9268), np.float32(3767.8386), np.float32(3072.5337), np.float32(3701.0698), np.float32(3528.9534), np.float32(3886.5835), np.float32(3623.2344), np.float32(4018.0127)]
2025-09-14 13:29:25,772 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:29:25,779 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 12 seconds)
2025-09-14 13:31:54,853 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:32:01,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3217.60620 ± 889.802
2025-09-14 13:32:01,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3930.3188), np.float32(3364.4373), np.float32(1393.0413), np.float32(3905.9233), np.float32(3586.7703), np.float32(3492.225), np.float32(1543.55), np.float32(3651.5308), np.float32(3682.1255), np.float32(3626.141)]
2025-09-14 13:32:01,767 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:32:01,774 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 36 seconds)
2025-09-14 13:34:29,736 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1214 [DEBUG]: Evaluating for latency 9...
2025-09-14 13:34:36,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1221 [DEBUG]: Total Reward: 3139.55151 ± 735.618
2025-09-14 13:34:36,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2171.7722), np.float32(3760.3867), np.float32(3592.5525), np.float32(3546.5737), np.float32(3599.5283), np.float32(3541.558), np.float32(3454.8113), np.float32(2626.4375), np.float32(1480.9818), np.float32(3620.9128)]
2025-09-14 13:34:36,335 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:34:36,342 latency_env.delayed_mdp:training_loop(baseline-bpql-noisepromille100-halfcheetah):1251 [DEBUG]: Training session finished
