2025-09-14 13:36:11,382 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_21
2025-09-14 13:36:11,383 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_21
2025-09-14 13:36:11,383 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'21': <latency_env.delayed_mdp.ConstantDelay object at 0x7fdeb79d7b30>}
2025-09-14 13:36:11,383 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 13:36:11,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 13:36:11,498 baseline-bpql-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=143, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
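The printed policy module can be sketched in plain PyTorch as below. Only the layer sizes and the scale/shift tensors come from the log; the class and method names, and in particular the exact tanh rescaling formula inside `NNTanhRefit`, are assumptions (the formula chosen here maps a squashed sample onto [-1, 1], matching HalfCheetah's action bounds).

```python
import torch
import torch.nn as nn

class GaussianPolicySketch(nn.Module):
    """Hedged reconstruction of the logged NNGaussianPolicy (names assumed)."""

    def __init__(self, obs_dim=143, act_dim=6, hidden=256):
        super().__init__()
        # Shared trunk: flatten, then two 256-unit ReLU layers (sizes from the log).
        self.common_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)
        # scale=2, shift=-1 as printed; the rescaling below maps a (0, 1)
        # squashed sample onto [-1, 1] (formula assumed, not in the log).
        self.register_buffer("scale", torch.full((1, act_dim), 2.0))
        self.register_buffer("shift", torch.full((1, act_dim), -1.0))

    def forward(self, obs):
        h = self.common_head(obs)
        return self.mu_head(h), self.log_std_head(h)

    def act(self, obs):
        mu, log_std = self(obs)
        pre_tanh = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized sample
        squashed = 0.5 * (torch.tanh(pre_tanh) + 1.0)         # in (0, 1)
        return squashed * self.scale + self.shift             # in (-1, 1)
```

The 143-dim input is consistent with an augmented observation for delay 21 (17-dim HalfCheetah state plus a 21-step buffer of 6-dim actions), though the log does not state this decomposition.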
2025-09-14 13:36:11,498 baseline-bpql-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
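The printed Q network (`NNLayerConcat2`) flattens both of its inputs, concatenates them, and feeds the result through an MLP ending in a squeeze. A minimal sketch, with names assumed: the 23-unit input layer (17-dim HalfCheetah state + 6-dim action) suggests the critic is evaluated on the undelayed state rather than the 143-dim augmented observation, which would be consistent with BPQL's design, but that interpretation is not stated in the log.

```python
import torch
import torch.nn as nn

class QNetworkSketch(nn.Module):
    """Hedged reconstruction of the logged Q network (names assumed)."""

    def __init__(self, state_dim=17, act_dim=6, hidden=256):
        super().__init__()
        # Two 256-unit ReLU layers into a scalar head (sizes from the log).
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Flatten past the batch dim, concatenate on the last axis, and
        # squeeze the trailing singleton so Q-values have shape (batch,).
        x = torch.cat([state.flatten(1), action.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)
```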
2025-09-14 13:36:13,071 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 13:36:13,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 13:38:21,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:38:29,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -307.06598 ± 94.699
2025-09-14 13:38:29,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-341.81943, -296.56033, -285.4847, -117.152565, -511.23822, -219.4259, -297.8893, -344.97397, -322.6956, -333.41977]
2025-09-14 13:38:29,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:38:29,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-307.07) for latency 21
2025-09-14 13:38:29,074 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 3 hours, 44 minutes, 24 seconds)
2025-09-14 13:40:40,342 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:40:48,087 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -208.37476 ± 39.644
2025-09-14 13:40:48,088 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-270.65546, -123.95524, -230.94937, -182.19736, -244.72313, -227.4741, -176.27435, -231.03705, -191.90144, -204.58006]
2025-09-14 13:40:48,088 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:40:48,088 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-208.37) for latency 21
2025-09-14 13:40:48,090 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 3 hours, 44 minutes, 35 seconds)
2025-09-14 13:42:58,541 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:43:06,347 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -200.66530 ± 68.608
2025-09-14 13:43:06,347 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-147.10275, -157.75035, -299.1423, -269.33417, -225.05348, -231.19994, -291.63956, -164.68114, -126.854546, -93.89485]
2025-09-14 13:43:06,347 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:43:06,347 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-200.67) for latency 21
2025-09-14 13:43:06,349 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 3 hours, 42 minutes, 42 seconds)
2025-09-14 13:45:16,872 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:45:24,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -10.60423 ± 75.602
2025-09-14 13:45:24,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [13.030165, 8.249934, -16.334713, 28.489218, -117.20391, -164.4975, 73.68751, 44.452362, -53.043163, 77.12781]
2025-09-14 13:45:24,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:45:24,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-10.60) for latency 21
2025-09-14 13:45:24,604 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 3 hours, 40 minutes, 36 seconds)
2025-09-14 13:47:33,983 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:47:41,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 125.65392 ± 83.168
2025-09-14 13:47:41,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [53.034843, 52.612442, 88.749626, 88.3583, 237.74478, 51.685318, 298.77386, 131.69843, 64.64505, 189.23657]
2025-09-14 13:47:41,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:47:41,691 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (125.65) for latency 21
2025-09-14 13:47:41,693 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 3 hours, 38 minutes, 3 seconds)
2025-09-14 13:49:51,249 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:49:58,954 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 384.70642 ± 121.171
2025-09-14 13:49:58,955 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [381.82944, 336.58145, 464.11746, 480.64413, 653.853, 298.32065, 353.98764, 373.01288, 332.90326, 171.81422]
2025-09-14 13:49:58,955 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:49:58,955 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (384.71) for latency 21
2025-09-14 13:49:58,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 3 hours, 36 minutes, 9 seconds)
2025-09-14 13:52:09,364 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:52:17,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 695.05420 ± 143.331
2025-09-14 13:52:17,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [836.01184, 370.73187, 702.3483, 799.6626, 820.39667, 854.34314, 666.56384, 733.18994, 608.289, 559.0051]
2025-09-14 13:52:17,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:52:17,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (695.05) for latency 21
2025-09-14 13:52:17,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 3 hours, 33 minutes, 35 seconds)
2025-09-14 13:54:29,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:54:37,679 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 630.41431 ± 133.558
2025-09-14 13:54:37,679 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [515.1191, 713.13074, 827.89685, 530.6481, 739.8058, 394.44458, 580.1761, 566.6698, 817.2914, 618.961]
2025-09-14 13:54:37,679 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:54:37,682 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 3 hours, 32 minutes)
2025-09-14 13:56:48,947 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:56:56,656 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 840.83331 ± 63.635
2025-09-14 13:56:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [812.06024, 854.0404, 923.4473, 800.46967, 755.646, 769.765, 887.91254, 956.4577, 860.2884, 788.24585]
2025-09-14 13:56:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:56:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (840.83) for latency 21
2025-09-14 13:56:56,659 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 3 hours, 29 minutes, 55 seconds)
2025-09-14 13:59:06,488 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 13:59:14,204 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1072.24329 ± 91.758
2025-09-14 13:59:14,204 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1003.43933, 1146.7014, 1247.041, 1109.5114, 1065.4086, 990.53125, 891.5301, 1057.3575, 1114.5428, 1096.3694]
2025-09-14 13:59:14,204 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 13:59:14,204 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1072.24) for latency 21
2025-09-14 13:59:14,206 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 3 hours, 27 minutes, 45 seconds)
2025-09-14 14:01:23,748 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:01:31,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1082.93005 ± 113.832
2025-09-14 14:01:31,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1231.5573, 1228.2935, 1069.0098, 1042.8362, 1044.797, 1051.8754, 1119.3267, 1192.014, 827.58704, 1022.005]
2025-09-14 14:01:31,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:01:31,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1082.93) for latency 21
2025-09-14 14:01:31,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 3 hours, 25 minutes, 26 seconds)
2025-09-14 14:03:41,334 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:03:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1063.93140 ± 54.548
2025-09-14 14:03:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1098.0969, 1045.7515, 1065.6163, 1162.6578, 1048.4812, 1005.9119, 1137.6606, 968.6784, 1062.2802, 1044.1776]
2025-09-14 14:03:49,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:03:49,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 3 hours, 22 minutes, 59 seconds)
2025-09-14 14:05:58,583 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:06:06,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1061.10278 ± 59.904
2025-09-14 14:06:06,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [938.733, 1102.5034, 1035.6814, 1124.7245, 1047.7784, 1084.828, 994.2766, 1096.4552, 1147.8362, 1038.2107]
2025-09-14 14:06:06,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:06:06,302 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 3 hours, 19 minutes, 41 seconds)
2025-09-14 14:08:15,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:08:23,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1156.17163 ± 140.493
2025-09-14 14:08:23,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1031.1371, 1059.3842, 945.3602, 1167.3478, 1009.5458, 1176.5029, 1330.7188, 1229.0234, 1194.4932, 1418.2029]
2025-09-14 14:08:23,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:08:23,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1156.17) for latency 21
2025-09-14 14:08:23,725 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 3 hours, 16 minutes, 57 seconds)
2025-09-14 14:10:33,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:10:41,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1256.43542 ± 64.634
2025-09-14 14:10:41,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1234.796, 1192.8938, 1179.9318, 1341.6652, 1241.841, 1299.6635, 1342.5546, 1314.106, 1266.4333, 1150.4684]
2025-09-14 14:10:41,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:10:41,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1256.44) for latency 21
2025-09-14 14:10:41,279 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 3 hours, 14 minutes, 40 seconds)
2025-09-14 14:12:50,851 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:12:58,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1266.63831 ± 129.547
2025-09-14 14:12:58,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1136.5854, 1280.27, 1446.4734, 1203.6296, 1241.8766, 1409.3026, 1006.6523, 1373.3854, 1200.8893, 1367.3169]
2025-09-14 14:12:58,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:12:58,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1266.64) for latency 21
2025-09-14 14:12:58,557 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 3 hours, 12 minutes, 23 seconds)
2025-09-14 14:15:08,371 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:15:16,087 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1260.09619 ± 121.466
2025-09-14 14:15:16,087 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1260.2039, 1145.461, 1441.6527, 1123.2462, 1152.9556, 1155.7949, 1440.2366, 1271.3541, 1196.0897, 1413.9672]
2025-09-14 14:15:16,087 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:15:16,093 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 3 hours, 10 minutes, 4 seconds)
2025-09-14 14:17:25,904 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:17:33,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1320.12830 ± 127.519
2025-09-14 14:17:33,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1339.0603, 1149.3983, 1588.1365, 1216.0443, 1327.5513, 1207.0778, 1486.7179, 1332.745, 1332.1506, 1222.4028]
2025-09-14 14:17:33,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:17:33,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1320.13) for latency 21
2025-09-14 14:17:33,634 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 3 hours, 7 minutes, 52 seconds)
2025-09-14 14:19:43,321 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:19:51,043 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1425.68042 ± 152.310
2025-09-14 14:19:51,043 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1475.4591, 1281.31, 1428.6039, 1682.8315, 1360.0353, 1677.9005, 1514.9254, 1244.7922, 1293.111, 1297.835]
2025-09-14 14:19:51,043 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:19:51,043 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1425.68) for latency 21
2025-09-14 14:19:51,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 3 hours, 5 minutes, 34 seconds)
2025-09-14 14:22:00,674 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:22:08,399 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1283.10303 ± 105.535
2025-09-14 14:22:08,399 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1208.1055, 1330.4001, 1149.7383, 1397.111, 1274.0459, 1486.4677, 1301.2521, 1338.9485, 1213.5253, 1131.4369]
2025-09-14 14:22:08,399 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:22:08,402 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 3 minutes, 13 seconds)
2025-09-14 14:24:19,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:24:27,233 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1427.60242 ± 103.303
2025-09-14 14:24:27,233 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1272.0687, 1338.2911, 1399.9135, 1416.4537, 1483.0946, 1462.2869, 1479.9391, 1373.1239, 1674.64, 1376.213]
2025-09-14 14:24:27,234 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:24:27,234 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1427.60) for latency 21
2025-09-14 14:24:27,237 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 1 minute, 21 seconds)
2025-09-14 14:26:36,842 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:26:44,551 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1686.91638 ± 388.162
2025-09-14 14:26:44,551 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1197.249, 1614.7151, 1402.5858, 1878.5972, 1331.2484, 1446.6354, 1942.668, 1780.6273, 2627.211, 1647.6274]
2025-09-14 14:26:44,552 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:26:44,552 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1686.92) for latency 21
2025-09-14 14:26:44,554 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 2 hours, 58 minutes, 59 seconds)
2025-09-14 14:28:54,406 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:29:02,155 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1385.65393 ± 170.675
2025-09-14 14:29:02,155 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1261.0751, 1296.4283, 1312.8214, 1257.1981, 1480.0494, 1262.4738, 1653.2667, 1177.3229, 1706.3153, 1449.5879]
2025-09-14 14:29:02,156 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:29:02,158 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 2 hours, 56 minutes, 43 seconds)
2025-09-14 14:31:14,127 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:31:21,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1300.53601 ± 234.388
2025-09-14 14:31:21,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1305.534, 1414.0168, 1039.6519, 1484.7225, 827.7454, 1144.6631, 1463.5934, 1369.8853, 1262.7871, 1692.7605]
2025-09-14 14:31:21,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:31:21,896 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 2 hours, 55 minutes)
2025-09-14 14:33:31,594 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:33:39,361 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1396.77893 ± 116.815
2025-09-14 14:33:39,361 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1460.4645, 1391.2512, 1366.4092, 1611.7329, 1246.9946, 1452.1469, 1484.2441, 1234.3564, 1252.8091, 1467.381]
2025-09-14 14:33:39,361 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:33:39,364 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 2 hours, 52 minutes, 44 seconds)
2025-09-14 14:35:48,488 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:35:56,236 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1390.29688 ± 194.073
2025-09-14 14:35:56,236 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1230.2778, 1298.3671, 1416.8168, 1352.3259, 1195.4446, 1807.7517, 1234.6582, 1211.0848, 1552.5343, 1603.7075]
2025-09-14 14:35:56,236 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:35:56,239 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 2 hours, 49 minutes, 57 seconds)
2025-09-14 14:38:05,606 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:38:13,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1441.69348 ± 420.032
2025-09-14 14:38:13,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1996.1392, 1639.6761, 1229.0348, 1266.4906, 1230.8917, 442.316, 1724.907, 1866.7407, 1629.201, 1391.5369]
2025-09-14 14:38:13,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:38:13,351 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 2 hours, 47 minutes, 36 seconds)
2025-09-14 14:40:22,702 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:40:30,435 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2259.31909 ± 719.575
2025-09-14 14:40:30,435 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2116.2588, 3483.3247, 1801.6213, 3100.0374, 1439.2266, 1649.5128, 2983.0786, 1579.8932, 1624.304, 2815.9338]
2025-09-14 14:40:30,435 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:40:30,435 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2259.32) for latency 21
2025-09-14 14:40:30,439 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 2 hours, 45 minutes, 11 seconds)
2025-09-14 14:42:39,741 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:42:47,478 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1735.89319 ± 445.318
2025-09-14 14:42:47,478 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2325.7625, 1337.59, 1249.0864, 1543.9432, 2567.8884, 1475.6548, 1291.6241, 1603.8276, 1750.4915, 2213.0635]
2025-09-14 14:42:47,478 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:42:47,482 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 2 hours, 42 minutes, 15 seconds)
2025-09-14 14:44:56,597 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:45:04,334 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1606.65405 ± 539.533
2025-09-14 14:45:04,335 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1422.2305, 1261.3329, 1410.1663, 1573.9955, 1331.5486, 1575.6836, 1411.1865, 3193.7417, 1318.8737, 1567.7812]
2025-09-14 14:45:04,335 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:45:04,338 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 2 hours, 39 minutes, 49 seconds)
2025-09-14 14:47:13,463 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:47:21,188 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1697.29590 ± 583.017
2025-09-14 14:47:21,189 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1624.6628, 3296.7656, 1219.3574, 1469.7506, 1631.9095, 1733.9493, 2036.5784, 1378.4448, 1256.2167, 1325.3247]
2025-09-14 14:47:21,189 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:47:21,192 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 2 hours, 37 minutes, 32 seconds)
2025-09-14 14:49:30,984 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:49:38,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1900.18750 ± 707.118
2025-09-14 14:49:38,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1542.5259, 1458.5194, 3344.653, 2203.753, 1223.1906, 3083.1238, 1288.6901, 1554.0884, 1542.9683, 1760.3628]
2025-09-14 14:49:38,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:49:38,739 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 2 hours, 35 minutes, 21 seconds)
2025-09-14 14:51:48,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:51:56,172 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1798.74121 ± 370.662
2025-09-14 14:51:56,172 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2061.243, 1220.4181, 2022.7993, 1404.7777, 1669.5532, 1632.3021, 2599.7786, 1887.5769, 1563.7974, 1925.1667]
2025-09-14 14:51:56,172 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:51:56,175 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 2 hours, 33 minutes, 8 seconds)
2025-09-14 14:54:05,781 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:54:13,544 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1492.32800 ± 408.792
2025-09-14 14:54:13,544 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1429.0331, 1275.6705, 1621.4889, 1316.6506, 2232.9832, 668.9974, 1202.1787, 1903.3154, 1475.9889, 1796.9727]
2025-09-14 14:54:13,544 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:54:13,548 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 2 hours, 30 minutes, 56 seconds)
2025-09-14 14:56:23,071 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:56:30,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1512.38940 ± 238.949
2025-09-14 14:56:30,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1882.0254, 1496.208, 1230.7758, 1676.0104, 1304.1637, 1224.6564, 1485.3243, 1312.8304, 1901.9788, 1609.9214]
2025-09-14 14:56:30,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:56:30,820 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 2 hours, 28 minutes, 44 seconds)
2025-09-14 14:58:40,175 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 14:58:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1753.32788 ± 465.691
2025-09-14 14:58:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1453.0969, 1422.7997, 1814.5466, 1814.785, 2200.6184, 1469.2483, 1573.1904, 1292.9629, 1556.4036, 2935.6282]
2025-09-14 14:58:47,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:58:47,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 2 hours, 26 minutes, 29 seconds)
2025-09-14 15:00:57,175 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:01:04,929 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1895.16638 ± 487.281
2025-09-14 15:01:04,930 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1716.3064, 1835.5444, 2034.0077, 3278.6577, 1790.7479, 1607.4221, 1580.462, 1705.1655, 1927.3293, 1476.0201]
2025-09-14 15:01:04,930 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:01:04,933 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 2 hours, 24 minutes, 6 seconds)
2025-09-14 15:03:14,245 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:03:21,977 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2166.89160 ± 602.099
2025-09-14 15:03:21,977 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2077.7441, 3282.0552, 1405.442, 1721.5773, 1812.1417, 3096.7527, 2267.7122, 1501.7977, 1989.0299, 2514.6638]
2025-09-14 15:03:21,977 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:03:21,981 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 2 hours, 21 minutes, 43 seconds)
2025-09-14 15:05:31,348 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:05:39,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2084.99463 ± 858.118
2025-09-14 15:05:39,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1377.2085, 1498.1249, 1289.7361, 1504.2474, 1439.6913, 3481.443, 2184.6045, 3145.3318, 1526.8608, 3402.698]
2025-09-14 15:05:39,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:05:39,095 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 2 hours, 19 minutes, 23 seconds)
2025-09-14 15:07:48,696 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:07:56,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2227.64917 ± 755.328
2025-09-14 15:07:56,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3408.4292, 2634.9312, 1376.1766, 1606.4209, 1867.3606, 1365.073, 1596.894, 3512.6177, 2519.7668, 2388.8225]
2025-09-14 15:07:56,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:07:56,442 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 2 hours, 17 minutes, 7 seconds)
2025-09-14 15:10:06,325 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:10:14,051 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2527.85034 ± 645.997
2025-09-14 15:10:14,051 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2573.4194, 1774.9122, 2326.642, 2813.6055, 1347.6406, 3003.015, 1856.472, 3309.2715, 3310.5398, 2962.9873]
2025-09-14 15:10:14,051 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:10:14,051 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2527.85) for latency 21
2025-09-14 15:10:14,055 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 2 hours, 14 minutes, 56 seconds)
2025-09-14 15:12:26,519 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:12:34,275 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1766.96155 ± 377.915
2025-09-14 15:12:34,275 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2312.6548, 1460.5242, 1827.0681, 2144.1917, 1287.0977, 1365.5985, 2272.06, 1576.2811, 2029.6589, 1394.4801]
2025-09-14 15:12:34,275 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:12:34,279 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 13 minutes, 16 seconds)
2025-09-14 15:14:48,198 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:14:55,882 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3964.04736 ± 148.764
2025-09-14 15:14:55,882 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3915.1726, 3728.865, 4029.7017, 3990.2612, 3665.6648, 4066.224, 3978.4797, 4153.532, 4001.4775, 4111.094]
2025-09-14 15:14:55,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:14:55,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3964.05) for latency 21
2025-09-14 15:14:55,887 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 11 minutes, 50 seconds)
2025-09-14 15:17:09,292 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:17:16,991 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1478.78186 ± 183.650
2025-09-14 15:17:16,991 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1631.5454, 1532.7095, 1684.4673, 1277.2657, 1726.1615, 1411.0013, 1165.1423, 1655.4968, 1334.9249, 1369.1046]
2025-09-14 15:17:16,991 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:17:16,995 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 2 hours, 10 minutes, 16 seconds)
2025-09-14 15:19:29,807 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:19:37,485 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4373.22363 ± 229.644
2025-09-14 15:19:37,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3749.997, 4597.383, 4553.6733, 4402.665, 4437.182, 4502.7363, 4285.4736, 4524.2026, 4368.0576, 4310.864]
2025-09-14 15:19:37,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:19:37,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4373.22) for latency 21
2025-09-14 15:19:37,489 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 8 minutes, 31 seconds)
2025-09-14 15:21:51,791 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:21:59,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4406.41113 ± 717.862
2025-09-14 15:21:59,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4712.06, 4763.59, 4616.473, 4730.045, 4827.2314, 4781.8604, 2746.258, 3240.4622, 4859.115, 4787.0156]
2025-09-14 15:21:59,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:21:59,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4406.41) for latency 21
2025-09-14 15:21:59,507 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 6 minutes, 58 seconds)
2025-09-14 15:24:27,989 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:24:36,714 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4510.53564 ± 1078.646
2025-09-14 15:24:36,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4900.996, 4817.4756, 4884.602, 4851.8228, 4906.0854, 1276.4001, 4793.4414, 4897.6714, 4882.165, 4894.6963]
2025-09-14 15:24:36,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:24:36,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4510.54) for latency 21
2025-09-14 15:24:36,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 7 minutes, 37 seconds)
2025-09-14 15:26:59,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:27:07,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4869.35059 ± 71.660
2025-09-14 15:27:07,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4863.605, 4894.691, 4899.401, 4871.1133, 4890.443, 4887.269, 4926.579, 4667.615, 4938.6445, 4854.1484]
2025-09-14 15:27:07,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:27:07,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4869.35) for latency 21
2025-09-14 15:27:07,519 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 6 minutes, 48 seconds)
2025-09-14 15:29:23,399 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:29:31,155 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4943.93164 ± 38.286
2025-09-14 15:29:31,155 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4953.9624, 4900.3496, 4973.4985, 4983.0884, 4967.8643, 4934.626, 4949.5234, 4910.1216, 4997.111, 4869.175]
2025-09-14 15:29:31,156 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:29:31,156 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4943.93) for latency 21
2025-09-14 15:29:31,160 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 4 minutes, 48 seconds)
2025-09-14 15:31:47,194 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:31:54,944 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4915.68213 ± 66.554
2025-09-14 15:31:54,944 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4981.585, 5021.887, 4937.0107, 4921.0396, 4920.3774, 4783.191, 4977.8823, 4857.6006, 4861.5547, 4894.6924]
2025-09-14 15:31:54,944 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:31:54,948 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 2 minutes, 54 seconds)
2025-09-14 15:34:11,192 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:34:18,914 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5167.31738 ± 166.650
2025-09-14 15:34:18,915 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5283.3657, 5271.8345, 5205.6777, 5273.8247, 4865.8613, 5220.2656, 5280.8613, 5281.6694, 5170.148, 4819.6685]
2025-09-14 15:34:18,915 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:34:18,915 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5167.32) for latency 21
2025-09-14 15:34:18,919 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 46 seconds)
2025-09-14 15:36:37,144 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:36:44,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5239.84814 ± 172.381
2025-09-14 15:36:44,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5427.3516, 4853.786, 5133.2305, 5342.3296, 5331.568, 5367.673, 5251.4473, 5229.847, 5049.067, 5412.1807]
2025-09-14 15:36:44,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:36:44,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5239.85) for latency 21
2025-09-14 15:36:44,860 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 1 hour, 56 minutes, 30 seconds)
2025-09-14 15:39:00,840 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:39:08,528 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4504.94629 ± 1440.529
2025-09-14 15:39:08,529 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2005.2539), np.float32(1309.1719), np.float32(5304.2705), np.float32(5140.2646), np.float32(5374.528), np.float32(5376.1094), np.float32(5273.6943), np.float32(4867.1704), np.float32(5349.7583), np.float32(5049.2466)]
2025-09-14 15:39:08,529 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:39:08,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 1 hour, 52 minutes, 57 seconds)
2025-09-14 15:41:24,355 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:41:32,074 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4773.85156 ± 1091.791
2025-09-14 15:41:32,074 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1797.4108), np.float32(5526.347), np.float32(5210.1), np.float32(5408.6294), np.float32(5392.699), np.float32(5189.3726), np.float32(5322.789), np.float32(5290.5854), np.float32(4705.8003), np.float32(3894.779)]
2025-09-14 15:41:32,074 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:41:32,078 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 1 hour, 50 minutes, 32 seconds)
2025-09-14 15:43:47,795 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:43:55,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4461.86621 ± 1434.045
2025-09-14 15:43:55,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5144.756), np.float32(5490.2676), np.float32(5293.171), np.float32(3963.9827), np.float32(5672.605), np.float32(4487.8164), np.float32(2109.5476), np.float32(5519.6294), np.float32(1467.2776), np.float32(5469.612)]
2025-09-14 15:43:55,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:43:55,489 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 1 hour, 48 minutes, 4 seconds)
2025-09-14 15:46:11,645 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:46:19,337 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4086.44922 ± 1836.948
2025-09-14 15:46:19,337 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5391.8257), np.float32(5443.7207), np.float32(5368.2524), np.float32(1235.5576), np.float32(5442.582), np.float32(5251.699), np.float32(4827.8867), np.float32(1302.9905), np.float32(1338.4012), np.float32(5261.5747)]
2025-09-14 15:46:19,337 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:46:19,342 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 1 hour, 45 minutes, 39 seconds)
2025-09-14 15:48:35,707 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:48:43,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5033.97168 ± 687.702
2025-09-14 15:48:43,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5042.046), np.float32(5294.839), np.float32(5252.1836), np.float32(5472.937), np.float32(5396.5884), np.float32(4273.552), np.float32(5425.494), np.float32(5416.1455), np.float32(3250.661), np.float32(5515.2666)]
2025-09-14 15:48:43,404 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:48:43,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 1 hour, 42 minutes, 59 seconds)
2025-09-14 15:50:59,807 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:51:07,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4680.75732 ± 1201.725
2025-09-14 15:51:07,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4989.951), np.float32(1260.6135), np.float32(4923.12), np.float32(5205.9307), np.float32(5229.1255), np.float32(5359.6016), np.float32(4209.8096), np.float32(5472.4673), np.float32(5518.596), np.float32(4638.358)]
2025-09-14 15:51:07,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:51:07,527 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 1 hour, 40 minutes, 39 seconds)
2025-09-14 15:53:24,485 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:53:32,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4738.95605 ± 1268.506
2025-09-14 15:53:32,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5498.48), np.float32(5206.799), np.float32(5439.59), np.float32(5597.4316), np.float32(1532.6514), np.float32(4997.8784), np.float32(5517.3984), np.float32(5495.2856), np.float32(3158.0818), np.float32(4945.96)]
2025-09-14 15:53:32,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:53:32,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 1 hour, 38 minutes, 24 seconds)
2025-09-14 15:55:48,340 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:55:56,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5253.55957 ± 317.272
2025-09-14 15:55:56,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4413.9565), np.float32(5301.178), np.float32(5488.071), np.float32(5334.74), np.float32(5488.3057), np.float32(5434.328), np.float32(5099.3247), np.float32(5545.9927), np.float32(5352.226), np.float32(5077.471)]
2025-09-14 15:55:56,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:55:56,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5253.56) for latency 21
2025-09-14 15:55:56,050 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 1 hour, 36 minutes, 4 seconds)
2025-09-14 15:58:12,274 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 15:58:20,026 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5357.98730 ± 289.421
2025-09-14 15:58:20,026 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5288.726), np.float32(5474.722), np.float32(5428.664), np.float32(5535.8438), np.float32(5363.404), np.float32(4526.6465), np.float32(5512.035), np.float32(5404.7925), np.float32(5447.1113), np.float32(5597.9272)]
2025-09-14 15:58:20,026 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:58:20,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5357.99) for latency 21
2025-09-14 15:58:20,031 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 33 minutes, 41 seconds)
2025-09-14 16:00:36,344 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:00:44,068 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5080.65918 ± 670.853
2025-09-14 16:00:44,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5536.406), np.float32(4899.941), np.float32(5331.2305), np.float32(5352.1104), np.float32(5273.8076), np.float32(5272.1294), np.float32(5458.1147), np.float32(3124.3386), np.float32(5255.959), np.float32(5302.5513)]
2025-09-14 16:00:44,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:00:44,073 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 31 minutes, 17 seconds)
2025-09-14 16:03:00,277 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:03:07,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5359.32080 ± 111.837
2025-09-14 16:03:07,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5371.1), np.float32(5476.6675), np.float32(5479.7603), np.float32(5336.597), np.float32(5359.3237), np.float32(5306.811), np.float32(5461.2603), np.float32(5076.2476), np.float32(5400.418), np.float32(5325.025)]
2025-09-14 16:03:07,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:03:07,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5359.32) for latency 21
2025-09-14 16:03:07,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 28 minutes, 51 seconds)
2025-09-14 16:05:24,629 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:05:32,357 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5156.38428 ± 949.738
2025-09-14 16:05:32,357 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5459.6133), np.float32(5282.8037), np.float32(5486.545), np.float32(5485.834), np.float32(5435.2866), np.float32(5573.4614), np.float32(5514.939), np.float32(2316.1616), np.float32(5455.224), np.float32(5553.9746)]
2025-09-14 16:05:32,357 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:05:32,362 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 26 minutes, 25 seconds)
2025-09-14 16:07:48,939 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:07:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5215.69580 ± 450.051
2025-09-14 16:07:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4391.495), np.float32(5456.267), np.float32(5407.0566), np.float32(5433.5093), np.float32(5564.447), np.float32(5450.2944), np.float32(4348.102), np.float32(5000.0093), np.float32(5531.5337), np.float32(5574.2427)]
2025-09-14 16:07:56,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:07:56,662 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 24 minutes, 4 seconds)
2025-09-14 16:10:13,000 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:10:20,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4487.25684 ± 1259.395
2025-09-14 16:10:20,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2071.2346), np.float32(2977.6462), np.float32(5050.9644), np.float32(5494.8604), np.float32(5547.043), np.float32(5431.622), np.float32(4383.186), np.float32(2942.194), np.float32(5489.17), np.float32(5484.643)]
2025-09-14 16:10:20,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:10:20,720 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 21 minutes, 40 seconds)
2025-09-14 16:12:37,545 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:12:45,253 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5116.86182 ± 1050.503
2025-09-14 16:12:45,254 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5395.8877), np.float32(5520.588), np.float32(5284.389), np.float32(5507.272), np.float32(5535.495), np.float32(5498.243), np.float32(1972.6309), np.float32(5511.171), np.float32(5484.858), np.float32(5458.081)]
2025-09-14 16:12:45,254 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:12:45,258 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 19 minutes, 19 seconds)
2025-09-14 16:15:01,770 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:15:09,476 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5494.73096 ± 49.187
2025-09-14 16:15:09,476 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5466.036), np.float32(5480.833), np.float32(5557.359), np.float32(5473.703), np.float32(5407.872), np.float32(5482.464), np.float32(5537.954), np.float32(5442.629), np.float32(5566.4717), np.float32(5531.984)]
2025-09-14 16:15:09,476 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:15:09,476 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5494.73) for latency 21
2025-09-14 16:15:09,481 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 16 minutes, 57 seconds)
2025-09-14 16:17:25,829 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:17:33,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5239.38623 ± 607.502
2025-09-14 16:17:33,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5537.4287), np.float32(5474.472), np.float32(5451.4272), np.float32(5519.217), np.float32(5432.3955), np.float32(3486.2422), np.float32(5508.5273), np.float32(4980.3438), np.float32(5634.35), np.float32(5369.458)]
2025-09-14 16:17:33,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:17:33,540 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 14 minutes, 31 seconds)
2025-09-14 16:19:49,881 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:19:57,606 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5372.58740 ± 175.405
2025-09-14 16:19:57,606 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5470.4883), np.float32(5291.1475), np.float32(5484.841), np.float32(5501.7324), np.float32(5130.9604), np.float32(4966.709), np.float32(5455.382), np.float32(5474.7256), np.float32(5454.7974), np.float32(5495.092)]
2025-09-14 16:19:57,606 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:19:57,611 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 12 minutes, 5 seconds)
2025-09-14 16:22:13,787 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:22:21,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5528.83154 ± 85.265
2025-09-14 16:22:21,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5374.815), np.float32(5638.5015), np.float32(5502.689), np.float32(5492.065), np.float32(5622.4165), np.float32(5388.235), np.float32(5566.948), np.float32(5556.9214), np.float32(5584.533), np.float32(5561.191)]
2025-09-14 16:22:21,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:22:21,511 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5528.83) for latency 21
2025-09-14 16:22:21,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 9 minutes, 40 seconds)
2025-09-14 16:24:37,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:24:45,451 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5285.11035 ± 706.421
2025-09-14 16:24:45,451 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5626.2227), np.float32(5428.4985), np.float32(5630.928), np.float32(5396.53), np.float32(5580.407), np.float32(5345.702), np.float32(5593.25), np.float32(5668.75), np.float32(3192.792), np.float32(5388.0195)]
2025-09-14 16:24:45,451 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:24:45,456 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 7 minutes, 13 seconds)
2025-09-14 16:27:01,604 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:27:09,340 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5076.65869 ± 1033.946
2025-09-14 16:27:09,340 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2052.2505), np.float32(5548.913), np.float32(5384.2563), np.float32(5374.027), np.float32(5602.592), np.float32(5603.97), np.float32(4775.8564), np.float32(5554.143), np.float32(5494.2646), np.float32(5376.3135)]
2025-09-14 16:27:09,340 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:27:09,345 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 4 minutes, 47 seconds)
2025-09-14 16:29:25,643 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:29:33,393 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5129.77441 ± 1057.361
2025-09-14 16:29:33,393 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5599.724), np.float32(5667.127), np.float32(5560.433), np.float32(5709.1504), np.float32(4549.808), np.float32(5613.0464), np.float32(5520.188), np.float32(5655.1577), np.float32(5312.859), np.float32(2110.2554)]
2025-09-14 16:29:33,393 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:29:33,398 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 2 minutes, 23 seconds)
2025-09-14 16:31:49,521 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:31:57,268 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5630.36670 ± 91.548
2025-09-14 16:31:57,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5453.6094), np.float32(5712.837), np.float32(5724.4688), np.float32(5672.2104), np.float32(5589.3486), np.float32(5646.5776), np.float32(5615.4736), np.float32(5771.3755), np.float32(5520.873), np.float32(5596.8936)]
2025-09-14 16:31:57,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:31:57,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5630.37) for latency 21
2025-09-14 16:31:57,274 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 59 minutes, 58 seconds)
2025-09-14 16:34:13,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:34:21,066 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5665.74561 ± 162.062
2025-09-14 16:34:21,066 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5688.437), np.float32(5835.036), np.float32(5723.5815), np.float32(5714.877), np.float32(5618.596), np.float32(5628.6367), np.float32(5765.337), np.float32(5223.415), np.float32(5801.654), np.float32(5657.8916)]
2025-09-14 16:34:21,066 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:34:21,066 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5665.75) for latency 21
2025-09-14 16:34:21,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 57 minutes, 33 seconds)
2025-09-14 16:36:37,644 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:36:45,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5530.79980 ± 80.964
2025-09-14 16:36:45,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5560.2476), np.float32(5530.183), np.float32(5572.9365), np.float32(5537.51), np.float32(5593.1836), np.float32(5488.5083), np.float32(5622.3027), np.float32(5312.8027), np.float32(5524.2397), np.float32(5566.09)]
2025-09-14 16:36:45,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:36:45,389 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 55 minutes, 11 seconds)
2025-09-14 16:39:01,436 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:39:09,152 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5124.65479 ± 1306.069
2025-09-14 16:39:09,152 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5574.712), np.float32(1211.8054), np.float32(5396.777), np.float32(5507.9487), np.float32(5624.571), np.float32(5655.579), np.float32(5542.5938), np.float32(5537.788), np.float32(5588.9146), np.float32(5605.8613)]
2025-09-14 16:39:09,152 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:39:09,157 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 52 minutes, 47 seconds)
2025-09-14 16:41:25,213 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:41:32,972 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5565.63770 ± 29.332
2025-09-14 16:41:32,972 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5593.6514), np.float32(5565.7095), np.float32(5635.977), np.float32(5569.154), np.float32(5523.624), np.float32(5543.939), np.float32(5548.5654), np.float32(5559.43), np.float32(5567.681), np.float32(5548.65)]
2025-09-14 16:41:32,972 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:41:32,977 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 50 minutes, 22 seconds)
2025-09-14 16:43:49,238 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:43:56,968 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4649.15186 ± 1775.558
2025-09-14 16:43:56,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5786.314), np.float32(2197.0452), np.float32(5777.3457), np.float32(5825.791), np.float32(5779.3604), np.float32(5784.7715), np.float32(2529.3533), np.float32(5803.1724), np.float32(1207.6003), np.float32(5800.7666)]
2025-09-14 16:43:56,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:43:56,974 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 47 minutes, 58 seconds)
2025-09-14 16:46:13,728 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:46:21,458 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4791.70361 ± 1740.142
2025-09-14 16:46:21,459 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5625.957), np.float32(5655.092), np.float32(5658.601), np.float32(5692.8853), np.float32(5696.301), np.float32(5591.4746), np.float32(1330.3312), np.float32(1294.8483), np.float32(5757.0234), np.float32(5614.518)]
2025-09-14 16:46:21,459 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:46:21,464 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 45 minutes, 37 seconds)
2025-09-14 16:48:37,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:48:45,692 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5626.74121 ± 42.412
2025-09-14 16:48:45,692 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5578.843), np.float32(5600.2725), np.float32(5630.389), np.float32(5680.5825), np.float32(5616.946), np.float32(5722.788), np.float32(5587.3965), np.float32(5640.454), np.float32(5610.6543), np.float32(5599.086)]
2025-09-14 16:48:45,692 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:48:45,698 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 43 minutes, 13 seconds)
2025-09-14 16:51:02,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:51:09,755 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5393.54004 ± 988.147
2025-09-14 16:51:09,755 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5650.669), np.float32(5739.111), np.float32(5733.1323), np.float32(5729.1562), np.float32(5753.6846), np.float32(5562.3804), np.float32(5675.807), np.float32(2437.8687), np.float32(5799.067), np.float32(5854.5283)]
2025-09-14 16:51:09,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:51:09,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 40 minutes, 50 seconds)
2025-09-14 16:53:25,714 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:53:33,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5773.66406 ± 106.132
2025-09-14 16:53:33,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5754.012), np.float32(5788.0522), np.float32(5600.774), np.float32(5611.056), np.float32(5821.232), np.float32(5768.278), np.float32(5973.523), np.float32(5821.299), np.float32(5869.582), np.float32(5728.833)]
2025-09-14 16:53:33,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:53:33,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5773.66) for latency 21
2025-09-14 16:53:33,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 38 minutes, 25 seconds)
2025-09-14 16:55:49,406 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:55:57,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5727.10840 ± 125.081
2025-09-14 16:55:57,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5822.8286), np.float32(5697.8896), np.float32(5848.012), np.float32(5405.3633), np.float32(5707.7476), np.float32(5663.2695), np.float32(5794.793), np.float32(5834.2524), np.float32(5689.8345), np.float32(5807.0894)]
2025-09-14 16:55:57,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:55:57,120 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 36 minutes)
2025-09-14 16:58:13,054 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 16:58:20,783 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4525.48535 ± 1851.300
2025-09-14 16:58:20,783 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5758.3945), np.float32(5738.3726), np.float32(5807.2607), np.float32(1224.4943), np.float32(5779.977), np.float32(5562.063), np.float32(1251.4789), np.float32(2845.0151), np.float32(5486.6987), np.float32(5801.0933)]
2025-09-14 16:58:20,783 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:58:20,788 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 33 minutes, 34 seconds)
2025-09-14 17:00:36,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:00:44,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5666.67285 ± 56.749
2025-09-14 17:00:44,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5662.248), np.float32(5631.0767), np.float32(5659.407), np.float32(5636.6226), np.float32(5649.539), np.float32(5586.7207), np.float32(5743.8013), np.float32(5602.2266), np.float32(5754.3755), np.float32(5740.7153)]
2025-09-14 17:00:44,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:00:44,489 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 31 minutes, 8 seconds)
2025-09-14 17:03:01,039 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:03:08,736 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4990.96143 ± 1366.269
2025-09-14 17:03:08,736 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4910.1436), np.float32(5728.6274), np.float32(5746.7812), np.float32(3881.8386), np.float32(5481.911), np.float32(5648.038), np.float32(5729.9355), np.float32(5794.444), np.float32(1261.4297), np.float32(5726.466)]
2025-09-14 17:03:08,736 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:03:08,742 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 28 minutes, 45 seconds)
2025-09-14 17:05:24,850 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:05:32,562 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5143.38770 ± 1307.389
2025-09-14 17:05:32,562 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5598.79), np.float32(5683.7104), np.float32(5610.535), np.float32(5656.7383), np.float32(5663.733), np.float32(5653.534), np.float32(5094.625), np.float32(5601.9478), np.float32(5618.204), np.float32(1252.0566)]
2025-09-14 17:05:32,562 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:05:32,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 26 minutes, 22 seconds)
2025-09-14 17:07:49,246 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:07:56,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5671.70605 ± 71.077
2025-09-14 17:07:56,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5716.4644), np.float32(5473.835), np.float32(5673.937), np.float32(5714.931), np.float32(5678.6543), np.float32(5720.162), np.float32(5718.821), np.float32(5678.39), np.float32(5708.5317), np.float32(5633.333)]
2025-09-14 17:07:56,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:07:56,958 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 23 minutes, 59 seconds)
2025-09-14 17:10:13,167 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:10:20,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5754.72217 ± 48.089
2025-09-14 17:10:20,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5679.0854), np.float32(5754.6045), np.float32(5776.8564), np.float32(5747.3115), np.float32(5725.2197), np.float32(5796.5), np.float32(5721.7456), np.float32(5708.6665), np.float32(5856.502), np.float32(5780.7305)]
2025-09-14 17:10:20,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:10:20,875 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 21 minutes, 36 seconds)
2025-09-14 17:12:37,019 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:12:44,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5601.96191 ± 39.386
2025-09-14 17:12:44,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5642.4487), np.float32(5598.412), np.float32(5643.075), np.float32(5612.0015), np.float32(5502.072), np.float32(5626.6675), np.float32(5584.0273), np.float32(5580.0317), np.float32(5602.8926), np.float32(5627.9976)]
2025-09-14 17:12:44,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:12:44,768 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 19 minutes, 12 seconds)
2025-09-14 17:15:01,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:15:09,440 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5166.59863 ± 1282.667
2025-09-14 17:15:09,440 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1320.3896), np.float32(5630.0586), np.float32(5510.313), np.float32(5626.062), np.float32(5630.796), np.float32(5566.1606), np.float32(5575.3716), np.float32(5586.1562), np.float32(5572.0454), np.float32(5648.628)]
2025-09-14 17:15:09,440 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:15:09,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 16 minutes, 48 seconds)
2025-09-14 17:17:25,342 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:17:33,068 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4824.86035 ± 1791.441
2025-09-14 17:17:33,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5736.6416), np.float32(5778.511), np.float32(1252.5813), np.float32(5528.4995), np.float32(5693.959), np.float32(5754.74), np.float32(1236.6372), np.float32(5731.288), np.float32(5771.6587), np.float32(5764.082)]
2025-09-14 17:17:33,069 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:17:33,075 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 14 minutes, 24 seconds)
2025-09-14 17:19:49,178 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:19:56,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5317.63330 ± 1356.866
2025-09-14 17:19:56,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5800.052), np.float32(1248.7047), np.float32(5804.685), np.float32(5705.703), np.float32(5789.4297), np.float32(5781.382), np.float32(5831.225), np.float32(5757.916), np.float32(5706.003), np.float32(5751.232)]
2025-09-14 17:19:56,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:19:56,899 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 11 minutes, 59 seconds)
2025-09-14 17:22:13,367 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:22:21,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5657.74512 ± 88.215
2025-09-14 17:22:21,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5687.2466), np.float32(5733.6685), np.float32(5543.0405), np.float32(5785.5044), np.float32(5711.6626), np.float32(5473.771), np.float32(5641.5645), np.float32(5689.6294), np.float32(5612.8794), np.float32(5698.483)]
2025-09-14 17:22:21,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:22:21,106 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 9 minutes, 36 seconds)
2025-09-14 17:24:37,340 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:24:45,054 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5635.66846 ± 67.550
2025-09-14 17:24:45,055 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5645.7134), np.float32(5590.9775), np.float32(5549.3325), np.float32(5685.597), np.float32(5698.087), np.float32(5714.2837), np.float32(5697.2085), np.float32(5586.452), np.float32(5677.917), np.float32(5511.1133)]
2025-09-14 17:24:45,055 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:24:45,060 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 7 minutes, 12 seconds)
2025-09-14 17:27:06,558 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:27:14,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5322.77051 ± 1369.453
2025-09-14 17:27:14,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5840.599), np.float32(5794.8315), np.float32(5817.3193), np.float32(5722.259), np.float32(5622.2593), np.float32(5798.6816), np.float32(1218.5471), np.float32(5776.1533), np.float32(5839.755), np.float32(5797.3022)]
2025-09-14 17:27:14,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:27:14,452 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 4 minutes, 50 seconds)
2025-09-14 17:29:35,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:29:43,794 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5181.56152 ± 1064.221
2025-09-14 17:29:43,795 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2827.301), np.float32(3312.0754), np.float32(5693.9995), np.float32(5742.1943), np.float32(5795.868), np.float32(5714.52), np.float32(5499.9653), np.float32(5706.663), np.float32(5758.5635), np.float32(5764.468)]
2025-09-14 17:29:43,795 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:29:43,801 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 26 seconds)
2025-09-14 17:32:06,713 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 21...
2025-09-14 17:32:14,530 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5737.90771 ± 61.394
2025-09-14 17:32:14,530 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5751.6807), np.float32(5791.827), np.float32(5583.876), np.float32(5824.636), np.float32(5757.633), np.float32(5754.919), np.float32(5738.017), np.float32(5709.419), np.float32(5763.6924), np.float32(5703.3774)]
2025-09-14 17:32:14,530 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 17:32:14,536 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1251 [DEBUG]: Training session finished
