2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_12
2025-09-14 08:43:01,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'12': <latency_env.delayed_mdp.ConstantDelay object at 0x7faefcdd7140>}
2025-09-14 08:43:01,512 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,606 baseline-bpql-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=89, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,606 baseline-bpql-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,521 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,521 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:45:33,295 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:45:39,392 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -425.10327 ± 45.768
2025-09-14 08:45:39,392 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-503.2043), np.float32(-392.28195), np.float32(-372.17044), np.float32(-469.33374), np.float32(-390.13034), np.float32(-453.8559), np.float32(-417.49365), np.float32(-475.7913), np.float32(-417.00787), np.float32(-359.7634)]
2025-09-14 08:45:39,392 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:45:39,392 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-425.10) for latency 12
2025-09-14 08:45:39,394 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 17 minutes, 11 seconds)
2025-09-14 08:48:11,020 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:48:17,498 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -290.93579 ± 56.896
2025-09-14 08:48:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-286.8204), np.float32(-283.27402), np.float32(-442.4223), np.float32(-302.3763), np.float32(-257.97034), np.float32(-261.5009), np.float32(-272.34116), np.float32(-254.63837), np.float32(-324.14114), np.float32(-223.87317)]
2025-09-14 08:48:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:48:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-290.94) for latency 12
2025-09-14 08:48:17,501 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 16 minutes, 24 seconds)
2025-09-14 08:50:57,716 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:51:04,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 42.25238 ± 150.741
2025-09-14 08:51:04,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(206.38632), np.float32(-273.38733), np.float32(91.80768), np.float32(196.39706), np.float32(88.58333), np.float32(120.99032), np.float32(83.26297), np.float32(100.76426), np.float32(-202.64305), np.float32(10.362319)]
2025-09-14 08:51:04,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:51:04,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (42.25) for latency 12
2025-09-14 08:51:04,557 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 19 minutes, 13 seconds)
2025-09-14 08:53:43,777 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:53:50,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 143.51694 ± 237.751
2025-09-14 08:53:50,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(27.386492), np.float32(152.85445), np.float32(-118.19052), np.float32(386.2622), np.float32(-41.54323), np.float32(-135.17499), np.float32(303.30347), np.float32(147.52525), np.float32(43.176228), np.float32(669.56995)]
2025-09-14 08:53:50,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:53:50,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (143.52) for latency 12
2025-09-14 08:53:50,827 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 18 minutes, 55 seconds)
2025-09-14 08:56:36,049 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 08:56:44,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 357.55255 ± 341.277
2025-09-14 08:56:44,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(513.2333), np.float32(-35.662216), np.float32(1009.0839), np.float32(838.28046), np.float32(104.01676), np.float32(556.31287), np.float32(67.978874), np.float32(39.199417), np.float32(156.83536), np.float32(326.2468)]
2025-09-14 08:56:44,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:56:44,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (357.55) for latency 12
2025-09-14 08:56:44,512 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 19 minutes, 58 seconds)
2025-09-14 08:59:54,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:00:03,691 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 609.52661 ± 508.139
2025-09-14 09:00:03,691 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1142.4467), np.float32(99.125946), np.float32(660.52545), np.float32(199.72179), np.float32(1270.9872), np.float32(1544.182), np.float32(137.61604), np.float32(60.84744), np.float32(531.86053), np.float32(447.95364)]
2025-09-14 09:00:03,691 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:00:03,692 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (609.53) for latency 12
2025-09-14 09:00:03,694 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 30 minutes, 48 seconds)
2025-09-14 09:03:17,696 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:03:26,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 736.89471 ± 217.399
2025-09-14 09:03:26,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(733.2405), np.float32(452.36984), np.float32(732.736), np.float32(716.3573), np.float32(460.73734), np.float32(1239.2611), np.float32(596.62195), np.float32(795.1739), np.float32(933.6652), np.float32(708.7844)]
2025-09-14 09:03:26,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:03:26,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (736.89) for latency 12
2025-09-14 09:03:26,978 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 41 minutes, 56 seconds)
2025-09-14 09:06:36,806 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:06:45,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 933.04474 ± 348.402
2025-09-14 09:06:45,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1171.713), np.float32(801.3313), np.float32(747.84607), np.float32(1021.43256), np.float32(713.07263), np.float32(569.1153), np.float32(1296.0374), np.float32(673.0275), np.float32(617.80365), np.float32(1719.068)]
2025-09-14 09:06:45,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:06:45,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (933.04) for latency 12
2025-09-14 09:06:45,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 48 minutes, 34 seconds)
2025-09-14 09:09:52,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:10:01,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1271.55884 ± 297.637
2025-09-14 09:10:01,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1734.1664), np.float32(1150.5044), np.float32(866.883), np.float32(1409.9208), np.float32(1396.3351), np.float32(1199.998), np.float32(866.5066), np.float32(1702.132), np.float32(979.7363), np.float32(1409.4059)]
2025-09-14 09:10:01,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:10:01,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1271.56) for latency 12
2025-09-14 09:10:01,726 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 54 minutes, 30 seconds)
2025-09-14 09:13:09,477 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:13:18,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1252.50854 ± 454.255
2025-09-14 09:13:18,357 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(916.4866), np.float32(943.15027), np.float32(1988.3954), np.float32(1613.1555), np.float32(920.21985), np.float32(1203.574), np.float32(916.9422), np.float32(963.9102), np.float32(929.2206), np.float32(2130.031)]
2025-09-14 09:13:18,357 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:13:18,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 58 minutes, 9 seconds)
2025-09-14 09:16:25,253 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:16:34,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1385.93518 ± 384.069
2025-09-14 09:16:34,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1259.9993), np.float32(1617.2354), np.float32(1066.6582), np.float32(1051.9886), np.float32(1008.9547), np.float32(1075.7748), np.float32(1635.1316), np.float32(1587.0125), np.float32(1256.987), np.float32(2299.6082)]
2025-09-14 09:16:34,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:16:34,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1385.94) for latency 12
2025-09-14 09:16:34,320 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 53 minutes, 53 seconds)
2025-09-14 09:19:41,360 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:19:50,287 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1687.67810 ± 465.079
2025-09-14 09:19:50,287 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1753.9829), np.float32(2371.5886), np.float32(1360.656), np.float32(1637.723), np.float32(1573.2087), np.float32(1446.1143), np.float32(983.7489), np.float32(1162.0155), np.float32(2254.0256), np.float32(2333.7188)]
2025-09-14 09:19:50,287 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:19:50,287 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1687.68) for latency 12
2025-09-14 09:19:50,290 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 48 minutes, 26 seconds)
2025-09-14 09:23:09,623 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:23:19,259 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1654.59570 ± 448.178
2025-09-14 09:23:19,259 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1972.6125), np.float32(1220.6257), np.float32(1071.8904), np.float32(1152.636), np.float32(2509.7366), np.float32(1800.9625), np.float32(1827.0916), np.float32(2142.1204), np.float32(1449.7719), np.float32(1398.509)]
2025-09-14 09:23:19,260 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:19,263 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 48 minutes, 10 seconds)
2025-09-14 09:26:38,383 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:26:47,905 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2443.13281 ± 510.358
2025-09-14 09:26:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2718.2861), np.float32(2634.6602), np.float32(1267.0121), np.float32(2796.416), np.float32(2835.4866), np.float32(3041.964), np.float32(2182.2705), np.float32(2647.021), np.float32(1841.8887), np.float32(2466.3215)]
2025-09-14 09:26:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:26:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2443.13) for latency 12
2025-09-14 09:26:47,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 48 minutes, 26 seconds)
2025-09-14 09:30:06,637 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:30:16,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2045.75171 ± 678.005
2025-09-14 09:30:16,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1501.7584), np.float32(1071.5187), np.float32(1427.1125), np.float32(2153.9841), np.float32(2988.8984), np.float32(3092.938), np.float32(2409.0366), np.float32(1283.0046), np.float32(2548.9841), np.float32(1980.282)]
2025-09-14 09:30:16,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:30:16,289 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 48 minutes, 24 seconds)
2025-09-14 09:33:28,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:33:36,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2530.27710 ± 912.416
2025-09-14 09:33:36,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3423.1333), np.float32(1429.8519), np.float32(3139.4856), np.float32(1494.3325), np.float32(3298.7078), np.float32(3087.345), np.float32(3264.2395), np.float32(1282.5718), np.float32(1477.0417), np.float32(3406.0627)]
2025-09-14 09:33:36,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:33:36,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2530.28) for latency 12
2025-09-14 09:33:36,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 46 minutes, 10 seconds)
2025-09-14 09:36:24,171 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:36:31,345 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1615.73999 ± 450.727
2025-09-14 09:36:31,346 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1457.569), np.float32(1775.4669), np.float32(1042.8492), np.float32(1587.6681), np.float32(2601.089), np.float32(1384.3727), np.float32(1340.9478), np.float32(1494.002), np.float32(2240.693), np.float32(1232.7417)]
2025-09-14 09:36:31,346 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:36:31,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 36 minutes, 57 seconds)
2025-09-14 09:39:02,351 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:39:09,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2203.26123 ± 799.825
2025-09-14 09:39:09,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3030.7253), np.float32(1359.0367), np.float32(2200.7488), np.float32(1147.9344), np.float32(2466.638), np.float32(2954.9426), np.float32(3062.4187), np.float32(1402.4833), np.float32(3171.3848), np.float32(1236.2996)]
2025-09-14 09:39:09,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:39:09,296 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 19 minutes, 40 seconds)
2025-09-14 09:41:40,089 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:41:47,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2710.75610 ± 363.504
2025-09-14 09:41:47,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2422.0518), np.float32(2918.5247), np.float32(2808.7095), np.float32(2546.9377), np.float32(2162.5967), np.float32(2151.7793), np.float32(3098.883), np.float32(3010.3933), np.float32(3262.8843), np.float32(2724.8)]
2025-09-14 09:41:47,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:41:47,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2710.76) for latency 12
2025-09-14 09:41:47,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 2 minutes, 46 seconds)
2025-09-14 09:44:31,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:44:40,933 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2627.92041 ± 818.846
2025-09-14 09:44:40,933 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3194.6104), np.float32(3310.7883), np.float32(1144.4027), np.float32(1217.3849), np.float32(3009.9492), np.float32(3293.2825), np.float32(3043.2625), np.float32(1896.6953), np.float32(3169.8132), np.float32(2999.018)]
2025-09-14 09:44:40,933 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:44:40,936 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 50 minutes, 34 seconds)
2025-09-14 09:48:01,847 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:48:11,718 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2894.32080 ± 943.760
2025-09-14 09:48:11,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3613.8071), np.float32(3497.3552), np.float32(1420.7789), np.float32(1572.1288), np.float32(3846.3804), np.float32(3851.9612), np.float32(3674.6438), np.float32(1637.6615), np.float32(3130.5913), np.float32(2697.9004)]
2025-09-14 09:48:11,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:48:11,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2894.32) for latency 12
2025-09-14 09:48:11,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 50 minutes, 30 seconds)
2025-09-14 09:51:35,280 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:51:44,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2935.83350 ± 1061.889
2025-09-14 09:51:44,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3831.1138), np.float32(3964.541), np.float32(2577.6108), np.float32(940.9142), np.float32(1499.4639), np.float32(3639.211), np.float32(3799.0376), np.float32(1892.7384), np.float32(3515.4807), np.float32(3698.2249)]
2025-09-14 09:51:44,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:51:44,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2935.83) for latency 12
2025-09-14 09:51:45,001 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 3 hours, 57 minutes, 32 seconds)
2025-09-14 09:55:07,721 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:55:17,498 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3247.32812 ± 875.638
2025-09-14 09:55:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3823.6626), np.float32(1221.2446), np.float32(3377.6548), np.float32(1876.9086), np.float32(3787.6638), np.float32(3798.1592), np.float32(3475.6858), np.float32(3761.1094), np.float32(3506.9548), np.float32(3844.2366)]
2025-09-14 09:55:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:55:17,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3247.33) for latency 12
2025-09-14 09:55:17,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 4 hours, 8 minutes, 30 seconds)
2025-09-14 09:58:40,753 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 09:58:50,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2600.35107 ± 752.689
2025-09-14 09:58:50,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2682.6963), np.float32(3059.7515), np.float32(1968.1085), np.float32(1396.752), np.float32(2084.9023), np.float32(3256.6704), np.float32(3451.762), np.float32(3586.5698), np.float32(2954.9355), np.float32(1561.3641)]
2025-09-14 09:58:50,504 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:58:50,507 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 4 hours, 19 minutes, 16 seconds)
2025-09-14 10:02:13,660 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:02:23,369 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3192.40894 ± 547.447
2025-09-14 10:02:23,369 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3525.3022), np.float32(3617.761), np.float32(3449.3123), np.float32(2513.5645), np.float32(3458.1226), np.float32(3494.0847), np.float32(3419.0447), np.float32(3771.2295), np.float32(2629.2178), np.float32(2046.45)]
2025-09-14 10:02:23,369 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:02:23,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 4 hours, 25 minutes, 36 seconds)
2025-09-14 10:05:46,022 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:05:55,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4119.17578 ± 307.557
2025-09-14 10:05:55,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3923.2542), np.float32(4207.0503), np.float32(4549.5913), np.float32(4025.1194), np.float32(4043.5803), np.float32(4457.058), np.float32(4425.851), np.float32(4054.603), np.float32(3421.7842), np.float32(4083.865)]
2025-09-14 10:05:55,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:05:55,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4119.18) for latency 12
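The `New best (...) for latency 12` messages indicate the loop tracks the best mean evaluation return seen so far per evaluation latency, logging (and presumably checkpointing) whenever it improves. A hypothetical sketch of that bookkeeping — the names (`best_by_latency`, `update_best`) are illustrative, not taken from `latency_env`:

```python
import logging

logger = logging.getLogger("latency_env.delayed_mdp")

# Hypothetical bookkeeping: best mean evaluation return observed per latency.
best_by_latency: dict[int, float] = {}

def update_best(latency: int, mean_reward: float) -> bool:
    """Record mean_reward if it beats the best so far; return True on improvement."""
    if mean_reward > best_by_latency.get(latency, float("-inf")):
        best_by_latency[latency] = mean_reward
        logger.info("New best (%.2f) for latency %d", mean_reward, latency)
        return True  # a real trainer would likely save a checkpoint here
    return False
```

Fed the means from the log (3247.33 at iteration 23, 2600.35 at 24, 4119.18 at 26, ...), this emits "New best" exactly where the log does: only the first and third calls improve on the running maximum.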
2025-09-14 10:05:55,763 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 4 hours, 22 minutes, 27 seconds)
2025-09-14 10:09:18,335 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:09:28,041 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3351.21948 ± 1144.411
2025-09-14 10:09:28,041 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4218.657), np.float32(4096.434), np.float32(4263.303), np.float32(3920.0708), np.float32(4327.7236), np.float32(1316.814), np.float32(2616.7476), np.float32(1200.9507), np.float32(3869.497), np.float32(3681.9949)]
2025-09-14 10:09:28,041 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:09:28,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 4 hours, 18 minutes, 40 seconds)
2025-09-14 10:12:50,935 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:13:00,718 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3532.39307 ± 949.194
2025-09-14 10:13:00,718 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4101.416), np.float32(3123.7996), np.float32(4004.0771), np.float32(4326.5366), np.float32(3999.3918), np.float32(1790.5171), np.float32(4150.358), np.float32(1752.2083), np.float32(4457.186), np.float32(3618.4377)]
2025-09-14 10:13:00,718 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:13:00,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 4 hours, 15 minutes, 10 seconds)
2025-09-14 10:16:23,943 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:16:33,804 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2373.68945 ± 777.819
2025-09-14 10:16:33,804 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1971.7924), np.float32(2191.0881), np.float32(3334.8828), np.float32(3562.1382), np.float32(1183.9314), np.float32(2216.915), np.float32(2119.583), np.float32(3083.0193), np.float32(2838.0227), np.float32(1235.5188)]
2025-09-14 10:16:33,804 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:16:33,809 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 4 hours, 11 minutes, 38 seconds)
2025-09-14 10:19:56,986 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:20:06,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1929.28125 ± 613.603
2025-09-14 10:20:06,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1459.7227), np.float32(1626.2976), np.float32(2547.741), np.float32(1536.7745), np.float32(2033.9197), np.float32(1311.4557), np.float32(1544.3503), np.float32(2505.5098), np.float32(1450.7642), np.float32(3276.2776)]
2025-09-14 10:20:06,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:20:06,749 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 4 hours, 8 minutes, 7 seconds)
2025-09-14 10:23:29,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:23:39,457 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3615.68823 ± 811.494
2025-09-14 10:23:39,457 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4222.31), np.float32(1869.1774), np.float32(3971.3684), np.float32(3983.8533), np.float32(3644.214), np.float32(4094.6526), np.float32(3839.216), np.float32(4184.7036), np.float32(4150.9834), np.float32(2196.4019)]
2025-09-14 10:23:39,457 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:23:39,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 4 hours, 4 minutes, 39 seconds)
2025-09-14 10:27:01,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:27:11,326 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3409.46289 ± 806.050
2025-09-14 10:27:11,326 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3739.6558), np.float32(1730.0204), np.float32(4170.48), np.float32(2375.6362), np.float32(3478.0493), np.float32(2877.527), np.float32(4180.358), np.float32(3405.1438), np.float32(3779.012), np.float32(4358.7466)]
2025-09-14 10:27:11,326 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:27:11,331 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 4 hours, 1 minute)
2025-09-14 10:30:32,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:30:42,747 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3749.93115 ± 743.816
2025-09-14 10:30:42,747 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4837.411), np.float32(4475.615), np.float32(2760.7656), np.float32(2375.1545), np.float32(4240.5454), np.float32(4030.0562), np.float32(3132.0645), np.float32(4066.7942), np.float32(4064.6067), np.float32(3516.3025)]
2025-09-14 10:30:42,747 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:30:42,752 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 57 minutes, 11 seconds)
2025-09-14 10:34:05,939 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:34:15,700 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2364.22241 ± 901.665
2025-09-14 10:34:15,701 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1662.7316), np.float32(2031.3446), np.float32(1054.4077), np.float32(3399.9058), np.float32(3687.0564), np.float32(1212.681), np.float32(2493.5898), np.float32(1769.7587), np.float32(3097.9407), np.float32(3232.806)]
2025-09-14 10:34:15,701 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:34:15,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 53 minutes, 37 seconds)
2025-09-14 10:37:38,319 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:37:48,010 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3099.83276 ± 1282.010
2025-09-14 10:37:48,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4258.3823), np.float32(1640.2552), np.float32(1241.1311), np.float32(1427.8411), np.float32(4352.4644), np.float32(2272.3499), np.float32(4069.387), np.float32(2899.713), np.float32(4391.786), np.float32(4445.02)]
2025-09-14 10:37:48,016 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:37:48,020 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 49 minutes, 56 seconds)
2025-09-14 10:41:10,089 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:41:19,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4199.92383 ± 343.552
2025-09-14 10:41:19,845 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4601.474), np.float32(4086.372), np.float32(4230.067), np.float32(4013.0266), np.float32(3919.2817), np.float32(4683.8784), np.float32(3466.7712), np.float32(4502.694), np.float32(4356.366), np.float32(4139.309)]
2025-09-14 10:41:19,845 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:41:19,845 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4199.92) for latency 12
2025-09-14 10:41:19,849 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 46 minutes, 12 seconds)
2025-09-14 10:44:42,031 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:44:51,897 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4356.80371 ± 402.123
2025-09-14 10:44:51,897 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4377.9214), np.float32(4752.9814), np.float32(3890.9988), np.float32(3432.9526), np.float32(4648.818), np.float32(4529.7246), np.float32(4088.401), np.float32(4588.4224), np.float32(4618.7124), np.float32(4639.1104)]
2025-09-14 10:44:51,897 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:44:51,897 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4356.80) for latency 12
2025-09-14 10:44:51,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 42 minutes, 43 seconds)
2025-09-14 10:48:12,801 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:48:22,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3662.85986 ± 1302.721
2025-09-14 10:48:22,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3207.6938), np.float32(4706.0503), np.float32(2795.8398), np.float32(4541.48), np.float32(4611.7573), np.float32(4354.9497), np.float32(4810.0815), np.float32(1440.9904), np.float32(4788.6416), np.float32(1371.1118)]
2025-09-14 10:48:22,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:48:22,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 39 minutes)
2025-09-14 10:51:42,910 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:51:52,610 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3833.37842 ± 1301.829
2025-09-14 10:51:52,610 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4263.63), np.float32(4943.111), np.float32(4566.722), np.float32(4349.629), np.float32(1263.3899), np.float32(4810.9863), np.float32(1349.2162), np.float32(3880.946), np.float32(4788.2793), np.float32(4117.8774)]
2025-09-14 10:51:52,610 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:51:52,615 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 34 minutes, 54 seconds)
2025-09-14 10:55:09,002 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:55:18,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4336.46973 ± 803.689
2025-09-14 10:55:18,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4777.8066), np.float32(4720.052), np.float32(4716.694), np.float32(4655.765), np.float32(4489.746), np.float32(4426.6855), np.float32(4338.4), np.float32(4474.9165), np.float32(4796.377), np.float32(1968.2607)]
2025-09-14 10:55:18,409 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:55:18,413 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 30 minutes, 4 seconds)
2025-09-14 10:58:34,368 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 10:58:43,827 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4187.49707 ± 1066.202
2025-09-14 10:58:43,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1373.6437), np.float32(3284.4841), np.float32(4822.3623), np.float32(4651.3267), np.float32(4857.7075), np.float32(4843.2373), np.float32(4847.749), np.float32(5005.1816), np.float32(4006.6047), np.float32(4182.6772)]
2025-09-14 10:58:43,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:58:43,832 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 3 hours, 25 minutes, 18 seconds)
2025-09-14 11:01:50,445 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:01:59,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4370.98535 ± 1084.135
2025-09-14 11:01:59,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5035.4316), np.float32(4834.4155), np.float32(1362.3118), np.float32(4686.6323), np.float32(4817.1304), np.float32(4678.1924), np.float32(3552.7004), np.float32(5003.342), np.float32(5075.8506), np.float32(4663.8423)]
2025-09-14 11:01:59,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:01:59,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4370.99) for latency 12
2025-09-14 11:01:59,235 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 3 hours, 18 minutes, 37 seconds)
2025-09-14 11:05:02,477 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:05:11,262 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4924.51221 ± 195.223
2025-09-14 11:05:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5061.3906), np.float32(5084.2437), np.float32(4937.206), np.float32(5265.018), np.float32(4813.6), np.float32(4874.3604), np.float32(5116.2134), np.float32(4598.216), np.float32(4744.4785), np.float32(4750.3955)]
2025-09-14 11:05:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:05:11,263 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4924.51) for latency 12
2025-09-14 11:05:11,267 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 3 hours, 11 minutes, 40 seconds)
2025-09-14 11:08:05,772 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:08:13,741 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3782.54297 ± 1046.946
2025-09-14 11:08:13,741 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5072.3467), np.float32(4962.9824), np.float32(2045.785), np.float32(2837.3313), np.float32(4690.8555), np.float32(2647.4255), np.float32(3904.449), np.float32(4870.3525), np.float32(3835.7554), np.float32(2958.15)]
2025-09-14 11:08:13,741 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:08:13,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 3 hours, 3 minutes, 8 seconds)
2025-09-14 11:10:51,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:10:58,367 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4357.66699 ± 1168.447
2025-09-14 11:10:58,367 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5127.0312), np.float32(1611.2942), np.float32(5002.9927), np.float32(2657.7214), np.float32(5139.1123), np.float32(5190.471), np.float32(4849.829), np.float32(4165.79), np.float32(4964.8447), np.float32(4867.585)]
2025-09-14 11:10:58,367 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:10:58,371 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 52 minutes, 19 seconds)
2025-09-14 11:13:27,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:13:34,301 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4529.44531 ± 1128.652
2025-09-14 11:13:34,301 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4694.852), np.float32(4786.97), np.float32(4833.875), np.float32(5180.2485), np.float32(5239.9214), np.float32(4933.1206), np.float32(1205.7975), np.float32(5156.1113), np.float32(4613.879), np.float32(4649.675)]
2025-09-14 11:13:34,302 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:13:34,306 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 40 minutes, 17 seconds)
2025-09-14 11:16:02,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:16:09,781 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5107.59326 ± 264.397
2025-09-14 11:16:09,782 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5333.0225), np.float32(5133.798), np.float32(5142.7456), np.float32(5222.6494), np.float32(5320.154), np.float32(5362.607), np.float32(4754.2295), np.float32(5233.1978), np.float32(5083.7197), np.float32(4489.813)]
2025-09-14 11:16:09,782 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:16:09,782 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5107.59) for latency 12
2025-09-14 11:16:09,786 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 30 minutes, 15 seconds)
2025-09-14 11:18:38,833 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:18:45,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3046.24756 ± 1362.299
2025-09-14 11:18:45,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1669.1886), np.float32(2066.168), np.float32(2250.9885), np.float32(1745.0011), np.float32(2315.7615), np.float32(1746.5066), np.float32(5104.719), np.float32(4015.4207), np.float32(4817.81), np.float32(4730.912)]
2025-09-14 11:18:45,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:18:45,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 21 minutes, 9 seconds)
2025-09-14 11:21:14,803 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:21:21,628 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3453.15088 ± 1428.101
2025-09-14 11:21:21,628 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3991.645), np.float32(1517.897), np.float32(3428.2053), np.float32(4602.7827), np.float32(4373.0015), np.float32(5289.8105), np.float32(4907.0537), np.float32(978.0252), np.float32(3589.654), np.float32(1853.4336)]
2025-09-14 11:21:21,628 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:21:21,632 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 13 minutes, 56 seconds)
2025-09-14 11:23:50,778 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:23:57,600 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4395.82910 ± 1612.442
2025-09-14 11:23:57,601 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1193.3618), np.float32(4760.576), np.float32(1224.1716), np.float32(5075.582), np.float32(4743.113), np.float32(5422.012), np.float32(5409.6313), np.float32(5344.174), np.float32(5376.319), np.float32(5409.349)]
2025-09-14 11:23:57,601 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:23:57,605 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 9 minutes, 52 seconds)
2025-09-14 11:26:26,835 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:26:33,674 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5409.33350 ± 112.397
2025-09-14 11:26:33,674 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5508.4897), np.float32(5443.44), np.float32(5083.7935), np.float32(5433.731), np.float32(5384.338), np.float32(5450.467), np.float32(5442.8555), np.float32(5429.3926), np.float32(5465.5835), np.float32(5451.2476)]
2025-09-14 11:26:33,674 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:26:33,674 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5409.33) for latency 12
2025-09-14 11:26:33,678 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 7 minutes, 17 seconds)
2025-09-14 11:29:02,592 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:29:09,422 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4352.36621 ± 1084.933
2025-09-14 11:29:09,423 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3220.3542), np.float32(4535.3574), np.float32(4819.665), np.float32(3767.9014), np.float32(5493.4688), np.float32(5501.169), np.float32(1946.2026), np.float32(3934.946), np.float32(5029.037), np.float32(5275.5605)]
2025-09-14 11:29:09,423 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:29:09,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 4 minutes, 44 seconds)
2025-09-14 11:31:37,887 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:31:44,716 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5545.11865 ± 147.790
2025-09-14 11:31:44,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5606.074), np.float32(5410.174), np.float32(5629.618), np.float32(5572.203), np.float32(5615.3887), np.float32(5577.8394), np.float32(5604.106), np.float32(5691.94), np.float32(5592.804), np.float32(5151.0376)]
2025-09-14 11:31:44,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:31:44,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5545.12) for latency 12
2025-09-14 11:31:44,721 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 2 minutes, 3 seconds)
2025-09-14 11:34:13,441 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:34:20,279 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5257.22363 ± 168.070
2025-09-14 11:34:20,279 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5305.07), np.float32(5362.38), np.float32(5319.142), np.float32(5235.244), np.float32(4900.7627), np.float32(5432.3247), np.float32(5000.9614), np.float32(5308.1353), np.float32(5454.2373), np.float32(5253.976)]
2025-09-14 11:34:20,279 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:34:20,287 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 1 hour, 59 minutes, 23 seconds)
2025-09-14 11:36:49,290 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:36:56,126 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4354.32910 ± 1887.984
2025-09-14 11:36:56,126 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1266.3081), np.float32(5562.119), np.float32(5580.401), np.float32(5506.716), np.float32(5651.7485), np.float32(5569.901), np.float32(5604.2666), np.float32(5625.72), np.float32(1883.6299), np.float32(1292.4811)]
2025-09-14 11:36:56,127 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:36:56,131 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 56 minutes, 46 seconds)
2025-09-14 11:39:25,048 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:39:31,987 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5267.87793 ± 261.457
2025-09-14 11:39:31,987 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4941.651), np.float32(4687.778), np.float32(5403.6), np.float32(5429.3022), np.float32(5421.539), np.float32(5536.001), np.float32(5304.1733), np.float32(5072.4375), np.float32(5464.7856), np.float32(5417.5073)]
2025-09-14 11:39:31,987 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:39:31,992 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 54 minutes, 9 seconds)
2025-09-14 11:42:00,665 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:42:07,513 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5317.79395 ± 77.505
2025-09-14 11:42:07,513 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5429.8315), np.float32(5332.102), np.float32(5367.168), np.float32(5314.0376), np.float32(5127.674), np.float32(5333.2793), np.float32(5348.354), np.float32(5275.4297), np.float32(5378.1465), np.float32(5271.9126)]
2025-09-14 11:42:07,513 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:42:07,518 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 51 minutes, 31 seconds)
2025-09-14 11:44:36,262 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:44:43,195 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5621.46387 ± 54.891
2025-09-14 11:44:43,196 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5631.007), np.float32(5636.6484), np.float32(5616.179), np.float32(5705.9897), np.float32(5523.616), np.float32(5663.37), np.float32(5654.5884), np.float32(5524.769), np.float32(5648.445), np.float32(5610.0273)]
2025-09-14 11:44:43,196 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:44:43,196 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5621.46) for latency 12
2025-09-14 11:44:43,200 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 48 minutes, 59 seconds)
2025-09-14 11:47:12,145 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:47:18,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3934.38086 ± 1650.735
2025-09-14 11:47:18,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5479.677), np.float32(2700.8157), np.float32(5685.295), np.float32(2345.141), np.float32(1250.4517), np.float32(5300.916), np.float32(5483.2446), np.float32(3447.3958), np.float32(5560.0117), np.float32(2090.8618)]
2025-09-14 11:47:18,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:47:18,980 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 46 minutes, 25 seconds)
2025-09-14 11:49:47,913 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:49:54,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5577.25684 ± 57.932
2025-09-14 11:49:54,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5646.231), np.float32(5552.6514), np.float32(5587.2866), np.float32(5483.544), np.float32(5629.1147), np.float32(5629.186), np.float32(5524.8745), np.float32(5503.181), np.float32(5651.713), np.float32(5564.789)]
2025-09-14 11:49:54,745 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:49:54,750 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 43 minutes, 48 seconds)
2025-09-14 11:52:23,559 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:52:30,394 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5003.75195 ± 1690.647
2025-09-14 11:52:30,394 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5815.9775), np.float32(1209.6642), np.float32(5780.4746), np.float32(5860.641), np.float32(5831.952), np.float32(5876.905), np.float32(5851.331), np.float32(5867.1284), np.float32(5862.252), np.float32(2081.192)]
2025-09-14 11:52:30,394 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:52:30,399 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 41 minutes, 11 seconds)
2025-09-14 11:54:58,891 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:55:05,730 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4559.66309 ± 1552.803
2025-09-14 11:55:05,731 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5317.029), np.float32(5190.8096), np.float32(5501.7905), np.float32(1344.7108), np.float32(1699.3533), np.float32(5576.491), np.float32(5580.6655), np.float32(5294.978), np.float32(4476.99), np.float32(5613.8125)]
2025-09-14 11:55:05,731 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:55:05,736 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 38 minutes, 34 seconds)
2025-09-14 11:57:34,640 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 11:57:41,466 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5484.62354 ± 325.772
2025-09-14 11:57:41,466 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5634.852), np.float32(5603.1255), np.float32(5550.7627), np.float32(5566.529), np.float32(5650.0103), np.float32(5743.2026), np.float32(5490.167), np.float32(4543.9707), np.float32(5409.219), np.float32(5654.3936)]
2025-09-14 11:57:41,466 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:57:41,471 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 35 minutes, 59 seconds)
2025-09-14 12:00:10,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:00:17,164 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5623.88574 ± 105.389
2025-09-14 12:00:17,164 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5714.675), np.float32(5701.3057), np.float32(5750.23), np.float32(5528.517), np.float32(5571.2764), np.float32(5467.9907), np.float32(5640.0303), np.float32(5473.059), np.float32(5620.632), np.float32(5771.139)]
2025-09-14 12:00:17,165 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:00:17,165 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5623.89) for latency 12
2025-09-14 12:00:17,169 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 33 minutes, 22 seconds)
2025-09-14 12:02:45,782 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:02:52,618 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3627.01489 ± 1151.800
2025-09-14 12:02:52,618 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2601.26), np.float32(4611.128), np.float32(1967.8408), np.float32(2702.301), np.float32(4052.449), np.float32(4658.1465), np.float32(2652.765), np.float32(5268.487), np.float32(5023.3423), np.float32(2732.429)]
2025-09-14 12:02:52,618 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:02:52,623 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 30 minutes, 45 seconds)
2025-09-14 12:05:21,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:05:28,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5758.47803 ± 58.098
2025-09-14 12:05:28,270 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5777.059), np.float32(5736.7827), np.float32(5798.8887), np.float32(5792.3105), np.float32(5801.459), np.float32(5782.2837), np.float32(5738.7134), np.float32(5668.157), np.float32(5647.5815), np.float32(5841.543)]
2025-09-14 12:05:28,270 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:05:28,270 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5758.48) for latency 12
2025-09-14 12:05:28,275 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 28 minutes, 9 seconds)
2025-09-14 12:07:56,985 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:08:03,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5730.97705 ± 84.694
2025-09-14 12:08:03,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5707.1626), np.float32(5777.548), np.float32(5792.5767), np.float32(5729.3203), np.float32(5781.73), np.float32(5700.6646), np.float32(5496.716), np.float32(5764.888), np.float32(5801.375), np.float32(5757.7905)]
2025-09-14 12:08:03,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:08:03,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 25 minutes, 35 seconds)
2025-09-14 12:10:32,487 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:10:39,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5787.37451 ± 89.757
2025-09-14 12:10:39,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5839.2886), np.float32(5812.692), np.float32(5795.721), np.float32(5866.63), np.float32(5805.351), np.float32(5578.494), np.float32(5852.075), np.float32(5868.54), np.float32(5794.4536), np.float32(5660.499)]
2025-09-14 12:10:39,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:10:39,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5787.37) for latency 12
2025-09-14 12:10:39,319 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 22 minutes, 58 seconds)
2025-09-14 12:13:08,043 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:13:14,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4902.96484 ± 1796.012
2025-09-14 12:13:14,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1313.5605), np.float32(5758.772), np.float32(5769.5615), np.float32(5838.292), np.float32(5736.4688), np.float32(5829.947), np.float32(5811.429), np.float32(5860.052), np.float32(1309.7417), np.float32(5801.8247)]
2025-09-14 12:13:14,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:13:15,003 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 20 minutes, 22 seconds)
2025-09-14 12:15:44,456 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:15:51,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5813.14307 ± 85.006
2025-09-14 12:15:51,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5879.331), np.float32(5817.6914), np.float32(5832.87), np.float32(5875.4814), np.float32(5835.753), np.float32(5679.591), np.float32(5865.3633), np.float32(5869.6875), np.float32(5619.5776), np.float32(5856.082)]
2025-09-14 12:15:51,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:15:51,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5813.14) for latency 12
2025-09-14 12:15:51,409 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 17 minutes, 52 seconds)
2025-09-14 12:18:20,445 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:18:27,388 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5302.68896 ± 1293.163
2025-09-14 12:18:27,388 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5489.7993), np.float32(5841.044), np.float32(5763.1006), np.float32(5750.973), np.float32(5849.654), np.float32(5762.55), np.float32(5587.89), np.float32(5787.7485), np.float32(1436.122), np.float32(5758.013)]
2025-09-14 12:18:27,388 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:18:27,393 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 15 minutes, 18 seconds)
2025-09-14 12:20:55,993 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:21:02,935 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5287.06299 ± 1311.670
2025-09-14 12:21:02,935 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5723.838), np.float32(5574.7495), np.float32(5762.4062), np.float32(5764.2505), np.float32(5737.9053), np.float32(5667.764), np.float32(5688.715), np.float32(1356.8973), np.float32(5826.043), np.float32(5768.0635)]
2025-09-14 12:21:02,936 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:21:02,941 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 12 minutes, 43 seconds)
2025-09-14 12:23:31,575 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:23:38,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5430.20264 ± 891.099
2025-09-14 12:23:38,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5632.8906), np.float32(5824.041), np.float32(5691.6733), np.float32(2765.7617), np.float32(5825.8228), np.float32(5645.7983), np.float32(5789.362), np.float32(5788.058), np.float32(5634.927), np.float32(5703.696)]
2025-09-14 12:23:38,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:23:38,413 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 10 minutes, 7 seconds)
2025-09-14 12:26:07,543 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:26:14,370 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5684.78027 ± 49.432
2025-09-14 12:26:14,371 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5596.4883), np.float32(5650.3325), np.float32(5662.727), np.float32(5701.052), np.float32(5653.7334), np.float32(5697.8823), np.float32(5766.083), np.float32(5763.917), np.float32(5660.531), np.float32(5695.0576)]
2025-09-14 12:26:14,371 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:26:14,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 7 minutes, 32 seconds)
2025-09-14 12:28:43,095 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:28:49,934 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5758.17041 ± 199.450
2025-09-14 12:28:49,934 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5818.1807), np.float32(5688.316), np.float32(5914.5615), np.float32(5930.5967), np.float32(5270.1523), np.float32(5759.86), np.float32(5551.778), np.float32(5944.1704), np.float32(5819.2476), np.float32(5884.8384)]
2025-09-14 12:28:49,934 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:28:49,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 4 minutes, 52 seconds)
2025-09-14 12:31:18,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:31:25,534 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5003.97949 ± 1505.485
2025-09-14 12:31:25,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5808.458), np.float32(1315.2578), np.float32(5366.912), np.float32(5686.954), np.float32(5818.926), np.float32(5780.74), np.float32(5812.3374), np.float32(5717.6533), np.float32(2854.784), np.float32(5877.772)]
2025-09-14 12:31:25,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:31:25,540 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 2 minutes, 15 seconds)
2025-09-14 12:33:54,176 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:34:01,011 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5884.91504 ± 91.869
2025-09-14 12:34:01,011 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5940.6367), np.float32(5839.7437), np.float32(5642.6045), np.float32(5917.926), np.float32(5925.174), np.float32(5937.0625), np.float32(5945.5234), np.float32(5815.7124), np.float32(5946.981), np.float32(5937.7866)]
2025-09-14 12:34:01,011 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:34:01,011 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5884.92) for latency 12
2025-09-14 12:34:01,017 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 59 minutes, 39 seconds)
2025-09-14 12:36:29,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:36:36,517 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5780.48535 ± 126.893
2025-09-14 12:36:36,517 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5853.1587), np.float32(5833.7607), np.float32(5710.709), np.float32(5878.3726), np.float32(5820.773), np.float32(5806.7144), np.float32(5449.7), np.float32(5708.442), np.float32(5921.2153), np.float32(5822.006)]
2025-09-14 12:36:36,517 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:36,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 57 minutes, 3 seconds)
2025-09-14 12:39:05,421 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:39:12,118 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5879.64160 ± 51.284
2025-09-14 12:39:12,118 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5904.3276), np.float32(5881.167), np.float32(5850.633), np.float32(5897.727), np.float32(5743.6523), np.float32(5923.465), np.float32(5854.793), np.float32(5910.381), np.float32(5911.6025), np.float32(5918.666)]
2025-09-14 12:39:12,119 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:12,124 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 54 minutes, 26 seconds)
2025-09-14 12:41:41,149 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:41:47,873 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5961.74316 ± 34.983
2025-09-14 12:41:47,874 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5960.942), np.float32(5910.014), np.float32(5997.6343), np.float32(5935.0166), np.float32(5939.8223), np.float32(5926.5835), np.float32(5981.1875), np.float32(6020.2915), np.float32(5943.1035), np.float32(6002.8384)]
2025-09-14 12:41:47,874 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:41:47,874 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5961.74) for latency 12
2025-09-14 12:41:47,879 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 51 minutes, 51 seconds)
2025-09-14 12:44:17,060 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:44:23,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5767.67773 ± 79.254
2025-09-14 12:44:23,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5784.1055), np.float32(5813.3486), np.float32(5784.602), np.float32(5788.552), np.float32(5814.9375), np.float32(5698.087), np.float32(5838.8413), np.float32(5811.177), np.float32(5554.889), np.float32(5788.233)]
2025-09-14 12:44:23,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:44:23,791 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 49 minutes, 17 seconds)
2025-09-14 12:46:52,898 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:46:59,779 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5430.69482 ± 1423.533
2025-09-14 12:46:59,779 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5889.949), np.float32(5917.317), np.float32(5920.856), np.float32(5929.5806), np.float32(5892.9766), np.float32(5888.5645), np.float32(5895.949), np.float32(5896.8115), np.float32(1160.2946), np.float32(5914.654)]
2025-09-14 12:46:59,779 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:46:59,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 46 minutes, 43 seconds)
2025-09-14 12:49:28,970 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:49:35,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5806.56543 ± 80.041
2025-09-14 12:49:35,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5848.7896), np.float32(5833.3516), np.float32(5871.9224), np.float32(5830.7974), np.float32(5862.128), np.float32(5638.407), np.float32(5825.8423), np.float32(5836.4663), np.float32(5660.1143), np.float32(5857.837)]
2025-09-14 12:49:35,791 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:49:35,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 44 minutes, 9 seconds)
2025-09-14 12:52:04,524 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:52:11,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5306.46484 ± 1337.672
2025-09-14 12:52:11,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5804.969), np.float32(5547.4375), np.float32(5605.916), np.float32(1303.864), np.float32(5784.159), np.float32(5846.2583), np.float32(5872.6055), np.float32(5754.0576), np.float32(5745.8433), np.float32(5799.5337)]
2025-09-14 12:52:11,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:52:11,369 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 41 minutes, 33 seconds)
2025-09-14 12:54:40,002 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:54:46,803 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5668.04541 ± 105.661
2025-09-14 12:54:46,803 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5716.972), np.float32(5735.188), np.float32(5707.474), np.float32(5771.8706), np.float32(5607.016), np.float32(5742.731), np.float32(5644.3184), np.float32(5692.4775), np.float32(5680.352), np.float32(5382.0527)]
2025-09-14 12:54:46,803 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:54:46,809 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 38 minutes, 56 seconds)
2025-09-14 12:57:15,693 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:57:22,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5770.85059 ± 134.718
2025-09-14 12:57:22,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5830.6406), np.float32(5850.878), np.float32(5809.986), np.float32(5753.995), np.float32(5873.627), np.float32(5816.1484), np.float32(5400.105), np.float32(5872.5586), np.float32(5815.239), np.float32(5685.3335)]
2025-09-14 12:57:22,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:57:22,481 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 36 minutes, 20 seconds)
2025-09-14 12:59:51,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 12:59:58,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5671.11279 ± 63.193
2025-09-14 12:59:58,277 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5668.316), np.float32(5630.2236), np.float32(5729.713), np.float32(5667.439), np.float32(5573.046), np.float32(5675.7993), np.float32(5820.263), np.float32(5669.5435), np.float32(5658.1084), np.float32(5618.6763)]
2025-09-14 12:59:58,277 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:59:58,282 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 33 minutes, 44 seconds)
2025-09-14 13:02:27,247 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:02:33,979 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5724.94287 ± 51.054
2025-09-14 13:02:33,979 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5716.5776), np.float32(5767.0303), np.float32(5690.6025), np.float32(5712.1167), np.float32(5609.3687), np.float32(5704.0244), np.float32(5737.231), np.float32(5749.116), np.float32(5812.5923), np.float32(5750.7695)]
2025-09-14 13:02:33,980 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:33,985 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 31 minutes, 7 seconds)
2025-09-14 13:05:02,913 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:05:09,649 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5875.42383 ± 64.457
2025-09-14 13:05:09,650 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5896.5703), np.float32(5924.221), np.float32(5913.391), np.float32(5691.3555), np.float32(5868.492), np.float32(5895.134), np.float32(5910.026), np.float32(5854.219), np.float32(5906.0273), np.float32(5894.8057)]
2025-09-14 13:05:09,650 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:05:09,655 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 28 minutes, 32 seconds)
2025-09-14 13:07:38,765 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:07:45,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5515.54443 ± 412.844
2025-09-14 13:07:45,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5761.7476), np.float32(5678.5044), np.float32(5731.182), np.float32(5748.3447), np.float32(5770.282), np.float32(5649.8813), np.float32(4604.1353), np.float32(4792.08), np.float32(5755.3706), np.float32(5663.9165)]
2025-09-14 13:07:45,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:07:45,539 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 25 minutes, 57 seconds)
2025-09-14 13:10:14,597 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:10:21,380 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5799.34229 ± 51.018
2025-09-14 13:10:21,380 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5798.5884), np.float32(5856.522), np.float32(5845.8604), np.float32(5793.164), np.float32(5841.2544), np.float32(5794.6387), np.float32(5663.2437), np.float32(5810.896), np.float32(5788.442), np.float32(5800.8145)]
2025-09-14 13:10:21,380 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:10:21,387 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 23 minutes, 22 seconds)
2025-09-14 13:12:50,582 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:12:57,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5848.31641 ± 227.234
2025-09-14 13:12:57,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5932.9917), np.float32(5168.373), np.float32(5908.222), np.float32(5939.246), np.float32(5929.4595), np.float32(5919.6235), np.float32(5907.0312), np.float32(5959.646), np.float32(5914.577), np.float32(5903.993)]
2025-09-14 13:12:57,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:12:57,392 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 20 minutes, 46 seconds)
2025-09-14 13:15:25,986 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:15:32,797 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5845.46387 ± 37.265
2025-09-14 13:15:32,797 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5820.5015), np.float32(5833.3916), np.float32(5887.088), np.float32(5892.016), np.float32(5868.007), np.float32(5859.473), np.float32(5754.1484), np.float32(5847.7705), np.float32(5836.9346), np.float32(5855.3125)]
2025-09-14 13:15:32,797 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:15:32,803 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 18 minutes, 10 seconds)
2025-09-14 13:18:01,619 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:18:08,442 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5797.32275 ± 192.081
2025-09-14 13:18:08,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5276.111), np.float32(5649.5854), np.float32(5928.797), np.float32(5878.832), np.float32(5907.9165), np.float32(5772.7754), np.float32(5865.685), np.float32(5947.565), np.float32(5874.9204), np.float32(5871.0405)]
2025-09-14 13:18:08,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:18:08,449 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 15 minutes, 34 seconds)
2025-09-14 13:20:37,138 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:20:43,943 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5767.04541 ± 147.735
2025-09-14 13:20:43,943 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5851.3276), np.float32(5891.5474), np.float32(5849.066), np.float32(5832.567), np.float32(5826.5605), np.float32(5640.9546), np.float32(5373.083), np.float32(5738.45), np.float32(5846.624), np.float32(5820.275)]
2025-09-14 13:20:43,944 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:20:43,950 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 12 minutes, 58 seconds)
2025-09-14 13:23:12,709 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:23:19,553 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5339.76416 ± 1345.649
2025-09-14 13:23:19,553 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5737.1245), np.float32(5848.4097), np.float32(5744.5493), np.float32(5815.7734), np.float32(5825.8594), np.float32(5837.549), np.float32(5717.337), np.float32(1304.8206), np.float32(5774.5557), np.float32(5791.662)]
2025-09-14 13:23:19,553 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:23:19,559 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 10 minutes, 22 seconds)
2025-09-14 13:25:48,173 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:25:55,035 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5303.48145 ± 1045.427
2025-09-14 13:25:55,035 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5628.698), np.float32(5625.4233), np.float32(5677.2534), np.float32(5649.7485), np.float32(5656.838), np.float32(5676.647), np.float32(5653.915), np.float32(5634.364), np.float32(5664.3027), np.float32(2167.6284)]
2025-09-14 13:25:55,035 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:25:55,042 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 7 minutes, 46 seconds)
2025-09-14 13:28:24,034 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:28:30,862 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5848.46094 ± 49.806
2025-09-14 13:28:30,863 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5833.156), np.float32(5844.801), np.float32(5894.238), np.float32(5773.1436), np.float32(5889.074), np.float32(5883.973), np.float32(5887.731), np.float32(5745.5776), np.float32(5842.9507), np.float32(5889.9644)]
2025-09-14 13:28:30,863 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:28:30,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 11 seconds)
2025-09-14 13:30:59,586 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:31:06,432 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5852.29150 ± 30.610
2025-09-14 13:31:06,432 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5902.527), np.float32(5799.701), np.float32(5854.8086), np.float32(5890.204), np.float32(5857.126), np.float32(5847.9766), np.float32(5839.9233), np.float32(5814.2812), np.float32(5836.643), np.float32(5879.7197)]
2025-09-14 13:31:06,432 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:31:06,439 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 35 seconds)
2025-09-14 13:33:35,278 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 12...
2025-09-14 13:33:42,125 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5794.90771 ± 98.530
2025-09-14 13:33:42,125 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5793.1216), np.float32(5828.824), np.float32(5879.243), np.float32(5847.15), np.float32(5831.7407), np.float32(5864.5923), np.float32(5767.153), np.float32(5514.614), np.float32(5811.9546), np.float32(5810.6895)]
2025-09-14 13:33:42,125 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:33:42,132 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1251 [DEBUG]: Training session finished
