2025-09-14 10:35:06,008 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_15
2025-09-14 10:35:06,008 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_15
2025-09-14 10:35:06,008 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'15': <latency_env.delayed_mdp.ConstantDelay object at 0x7f9a4c10bc80>}
2025-09-14 10:35:06,008 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 10:35:06,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 10:35:06,130 baseline-bpql-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=107, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 10:35:06,130 baseline-bpql-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 10:35:07,878 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 10:35:07,879 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 10:37:35,200 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:37:41,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -426.86603 ± 120.962
2025-09-14 10:37:41,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-560.2064, -547.2803, -350.75272, -523.75507, -282.5724, -542.74554, -351.56747, -289.74176, -550.6777, -269.36087]
2025-09-14 10:37:41,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:37:41,761 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-426.87) for latency 15
2025-09-14 10:37:41,764 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 13 minutes, 54 seconds)
2025-09-14 10:40:11,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:40:17,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -314.91574 ± 65.066
2025-09-14 10:40:17,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-248.934, -257.79453, -324.61517, -399.44662, -391.92688, -255.38669, -374.48123, -234.11171, -392.04263, -270.41794]
2025-09-14 10:40:17,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:40:17,759 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-314.92) for latency 15
2025-09-14 10:40:17,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 13 minutes, 4 seconds)
2025-09-14 10:42:47,459 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:42:53,951 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -182.40207 ± 60.459
2025-09-14 10:42:53,951 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-211.57751, -108.49747, -270.35156, -155.0209, -165.78252, -163.39325, -269.79752, -144.43733, -91.427605, -243.73499]
2025-09-14 10:42:53,951 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:42:53,951 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-182.40) for latency 15
2025-09-14 10:42:53,955 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 11 minutes, 9 seconds)
2025-09-14 10:45:23,557 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:45:29,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 244.42221 ± 142.495
2025-09-14 10:45:29,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [340.68106, 220.91307, 280.24725, 98.645874, 94.278725, 377.52032, 93.557724, 436.56192, 438.89172, 62.924324]
2025-09-14 10:45:29,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:45:29,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (244.42) for latency 15
2025-09-14 10:45:29,997 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 8 minutes, 50 seconds)
2025-09-14 10:47:59,716 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:48:06,313 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 134.81383 ± 222.108
2025-09-14 10:48:06,313 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [226.41515, -117.590004, 195.04408, 57.254875, -79.812035, 205.28358, -117.510124, 551.52527, 449.1006, -21.573133]
2025-09-14 10:48:06,313 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:48:06,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 6 minutes, 30 seconds)
2025-09-14 10:50:45,949 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:50:52,497 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 302.38269 ± 372.541
2025-09-14 10:50:52,497 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [656.22504, 63.609802, 225.41162, 104.28995, 350.77783, 741.57196, 152.25565, -483.56796, 342.16986, 871.08307]
2025-09-14 10:50:52,497 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:50:52,497 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (302.38) for latency 15
2025-09-14 10:50:52,500 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 7 minutes, 45 seconds)
2025-09-14 10:53:32,573 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:53:39,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 377.06381 ± 399.910
2025-09-14 10:53:39,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [842.74023, 34.90754, 419.24026, -544.1632, 511.77734, 767.2852, 166.66354, 589.4961, 237.39546, 745.29565]
2025-09-14 10:53:39,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:53:39,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (377.06) for latency 15
2025-09-14 10:53:39,087 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 8 minutes, 24 seconds)
2025-09-14 10:56:19,296 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:56:25,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 680.98370 ± 267.923
2025-09-14 10:56:25,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1111.1495, 303.5901, 919.0561, 1098.8383, 537.41736, 678.9317, 512.1268, 716.98334, 376.1229, 555.62115]
2025-09-14 10:56:25,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:56:25,790 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (680.98) for latency 15
2025-09-14 10:56:25,793 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 8 minutes, 57 seconds)
2025-09-14 10:59:06,056 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 10:59:12,518 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1084.13062 ± 225.255
2025-09-14 10:59:12,518 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1118.3257, 728.20667, 1268.1697, 1180.2943, 707.29584, 961.2814, 1349.8807, 1333.7468, 945.2853, 1248.82]
2025-09-14 10:59:12,518 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:59:12,518 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1084.13) for latency 15
2025-09-14 10:59:12,521 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 9 minutes, 29 seconds)
2025-09-14 11:01:52,758 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:01:59,228 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1060.62671 ± 239.180
2025-09-14 11:01:59,228 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [858.7783, 1393.4423, 1354.0408, 850.7654, 890.6682, 1407.0427, 866.3323, 801.43243, 1223.5121, 960.25415]
2025-09-14 11:01:59,228 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:01:59,231 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 9 minutes, 52 seconds)
2025-09-14 11:04:39,243 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:04:45,700 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 961.76740 ± 206.402
2025-09-14 11:04:45,700 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1162.9944, 1009.07367, 919.7507, 785.987, 839.24146, 890.9955, 805.52325, 1473.5774, 978.4347, 752.0952]
2025-09-14 11:04:45,700 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:04:45,704 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 7 minutes, 11 seconds)
2025-09-14 11:07:25,439 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:07:31,948 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1095.12488 ± 259.415
2025-09-14 11:07:31,949 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1346.2366, 973.1754, 939.0864, 1009.71783, 1116.2789, 887.74884, 865.88464, 856.18225, 1237.2357, 1719.7021]
2025-09-14 11:07:31,949 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:07:31,949 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1095.12) for latency 15
2025-09-14 11:07:31,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 4 minutes, 18 seconds)
2025-09-14 11:10:12,212 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:10:18,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1899.13513 ± 587.290
2025-09-14 11:10:18,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1290.8411, 1904.9869, 1639.8942, 2676.3738, 1305.992, 2905.501, 1346.2736, 1294.4929, 2432.8818, 2194.114]
2025-09-14 11:10:18,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:10:18,719 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1899.14) for latency 15
2025-09-14 11:10:18,723 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 1 minute, 32 seconds)
2025-09-14 11:12:59,118 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:13:05,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1585.86401 ± 310.844
2025-09-14 11:13:05,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1852.3422, 1960.7008, 1580.5796, 1093.1842, 2131.0083, 1716.6421, 1303.8518, 1495.8374, 1433.0873, 1291.4062]
2025-09-14 11:13:05,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:13:05,661 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 3 hours, 58 minutes, 50 seconds)
2025-09-14 11:15:46,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:15:53,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1509.89038 ± 290.418
2025-09-14 11:15:53,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1406.6241, 1794.0714, 1996.6808, 1577.5251, 1329.3358, 1116.9003, 1258.3219, 1670.4696, 1814.9602, 1134.0149]
2025-09-14 11:15:53,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:15:53,447 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 3 hours, 56 minutes, 21 seconds)
2025-09-14 11:18:33,355 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:18:39,797 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1863.43518 ± 430.433
2025-09-14 11:18:39,797 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2186.2422, 2213.3792, 1251.2374, 2203.1067, 2169.6658, 2351.5193, 1319.0858, 2138.373, 1355.8202, 1445.9221]
2025-09-14 11:18:39,798 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:18:39,801 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 3 hours, 53 minutes, 32 seconds)
2025-09-14 11:21:20,285 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:21:26,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2238.04346 ± 382.703
2025-09-14 11:21:26,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2716.862, 2021.0314, 2177.8984, 1766.7875, 2610.8174, 1951.9362, 2047.2218, 1700.8695, 2724.1592, 2662.8489]
2025-09-14 11:21:26,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:21:26,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2238.04) for latency 15
2025-09-14 11:21:26,732 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 3 hours, 50 minutes, 57 seconds)
2025-09-14 11:24:06,587 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:24:13,044 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1615.06274 ± 523.958
2025-09-14 11:24:13,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2649.2466, 2403.3164, 1149.3999, 1161.2118, 1286.2366, 1930.4447, 1722.934, 1135.3402, 1200.8761, 1511.6223]
2025-09-14 11:24:13,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:24:13,048 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 3 hours, 48 minutes, 2 seconds)
2025-09-14 11:26:53,206 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:26:59,711 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1522.67310 ± 306.152
2025-09-14 11:26:59,711 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1586.1364, 1162.7473, 1690.2798, 1646.2965, 2298.755, 1342.5015, 1317.9661, 1350.2957, 1540.6637, 1291.0875]
2025-09-14 11:26:59,711 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:26:59,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 3 hours, 45 minutes, 11 seconds)
2025-09-14 11:29:40,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:29:48,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2709.35986 ± 643.801
2025-09-14 11:29:48,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3238.694, 1285.5812, 3260.9768, 1996.4233, 3055.4421, 2933.3455, 2884.4639, 2505.7537, 2396.0889, 3536.829]
2025-09-14 11:29:48,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:29:48,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2709.36) for latency 15
2025-09-14 11:29:48,172 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 3 hours, 42 minutes, 35 seconds)
2025-09-14 11:32:30,916 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:32:37,579 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2455.54126 ± 485.282
2025-09-14 11:32:37,579 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2883.6052, 1841.537, 2806.755, 3031.6455, 2658.9849, 1878.8905, 2833.3237, 1581.0519, 2677.715, 2361.904]
2025-09-14 11:32:37,579 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:32:37,583 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 40 minutes, 36 seconds)
2025-09-14 11:35:19,434 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:35:26,023 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2407.25439 ± 676.175
2025-09-14 11:35:26,023 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3156.5513, 1272.8094, 2695.66, 2112.0818, 2195.3308, 3074.5464, 1218.5961, 2784.3213, 2443.4758, 3119.1694]
2025-09-14 11:35:26,023 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:35:26,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 3 hours, 38 minutes, 13 seconds)
2025-09-14 11:38:06,737 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:38:13,194 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1961.40308 ± 667.503
2025-09-14 11:38:13,194 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1724.2975, 2275.2422, 1829.2823, 3273.0432, 2278.915, 2840.426, 1185.2466, 1328.9443, 1172.8862, 1705.749]
2025-09-14 11:38:13,194 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:38:13,198 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 3 hours, 35 minutes, 38 seconds)
2025-09-14 11:40:53,296 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:40:59,772 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1530.95593 ± 900.310
2025-09-14 11:40:59,772 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1292.0847, 1436.9028, 228.30661, 2084.3672, 1345.339, 1362.5273, 292.02408, 1319.6001, 2636.4595, 3311.9485]
2025-09-14 11:40:59,772 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:40:59,777 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 3 hours, 32 minutes, 48 seconds)
2025-09-14 11:43:39,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:43:46,176 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2910.97607 ± 694.400
2025-09-14 11:43:46,176 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2756.5957, 3053.6582, 3373.8794, 3484.0132, 3365.4297, 1392.736, 1835.7844, 3506.3298, 2973.028, 3368.3074]
2025-09-14 11:43:46,176 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:43:46,176 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2910.98) for latency 15
2025-09-14 11:43:46,181 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 3 hours, 29 minutes, 30 seconds)
2025-09-14 11:46:25,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:46:32,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2155.82031 ± 973.935
2025-09-14 11:46:32,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1614.6805), np.float32(2262.703), np.float32(3427.5547), np.float32(988.7279), np.float32(1131.7235), np.float32(3422.1714), np.float32(1870.5308), np.float32(3762.1982), np.float32(1287.4396), np.float32(1790.4738)]
2025-09-14 11:46:32,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:46:32,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 3 hours, 25 minutes, 52 seconds)
2025-09-14 11:49:12,480 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:49:19,042 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2636.54053 ± 767.437
2025-09-14 11:49:19,042 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1573.6875), np.float32(3110.6404), np.float32(3576.3445), np.float32(3534.7168), np.float32(2028.6135), np.float32(1950.4354), np.float32(2778.1553), np.float32(3421.9246), np.float32(2895.0479), np.float32(1495.8396)]
2025-09-14 11:49:19,042 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:49:19,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 3 hours, 22 minutes, 42 seconds)
2025-09-14 11:51:59,092 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:52:05,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2296.47119 ± 657.205
2025-09-14 11:52:05,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1519.1923), np.float32(2147.6572), np.float32(2393.5955), np.float32(2358.126), np.float32(3437.846), np.float32(1384.7142), np.float32(3004.4304), np.float32(2031.0974), np.float32(3040.248), np.float32(1647.8029)]
2025-09-14 11:52:05,653 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:52:05,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 3 hours, 19 minutes, 47 seconds)
2025-09-14 11:54:45,772 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:54:52,306 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3137.84937 ± 589.734
2025-09-14 11:54:52,306 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3803.033), np.float32(3881.0347), np.float32(2192.757), np.float32(2753.872), np.float32(3329.7478), np.float32(2100.9377), np.float32(2988.374), np.float32(3325.684), np.float32(3437.1902), np.float32(3565.8628)]
2025-09-14 11:54:52,307 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:54:52,307 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3137.85) for latency 15
2025-09-14 11:54:52,310 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 3 hours, 17 minutes, 1 second)
2025-09-14 11:57:32,445 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 11:57:38,878 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3427.69263 ± 447.663
2025-09-14 11:57:38,878 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2565.0383), np.float32(3630.3665), np.float32(3457.4036), np.float32(3773.6318), np.float32(4068.4036), np.float32(3216.8452), np.float32(2793.376), np.float32(3855.7595), np.float32(3625.0908), np.float32(3291.0115)]
2025-09-14 11:57:38,878 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 11:57:38,879 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3427.69) for latency 15
2025-09-14 11:57:38,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 3 hours, 14 minutes, 17 seconds)
2025-09-14 12:00:19,490 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:00:25,953 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3003.01782 ± 966.172
2025-09-14 12:00:25,953 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3925.9014), np.float32(3815.7021), np.float32(2188.645), np.float32(1527.2043), np.float32(3972.776), np.float32(3337.2979), np.float32(3571.6067), np.float32(3926.416), np.float32(2229.7397), np.float32(1534.8884)]
2025-09-14 12:00:25,954 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:00:25,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 11 minutes, 45 seconds)
2025-09-14 12:03:06,186 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:03:12,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2889.37256 ± 950.764
2025-09-14 12:03:12,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3647.1794), np.float32(2027.3712), np.float32(3557.6406), np.float32(3538.451), np.float32(3865.9465), np.float32(4187.6846), np.float32(1393.9315), np.float32(2160.6877), np.float32(1687.788), np.float32(2827.0444)]
2025-09-14 12:03:12,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:03:12,641 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 8 minutes, 56 seconds)
2025-09-14 12:05:52,460 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:05:58,982 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2782.71460 ± 1039.813
2025-09-14 12:05:58,982 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1858.1394), np.float32(3877.6936), np.float32(2207.7148), np.float32(3833.7163), np.float32(1385.7178), np.float32(1428.0494), np.float32(2728.4219), np.float32(4425.8965), np.float32(3632.5178), np.float32(2449.2798)]
2025-09-14 12:05:58,983 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:05:58,987 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 6 minutes, 6 seconds)
2025-09-14 12:08:38,501 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:08:45,060 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2469.79761 ± 1042.411
2025-09-14 12:08:45,060 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1574.3854), np.float32(4326.1885), np.float32(1316.7274), np.float32(2864.398), np.float32(2949.4553), np.float32(1436.929), np.float32(4204.5845), np.float32(2227.8765), np.float32(1658.072), np.float32(2139.3591)]
2025-09-14 12:08:45,060 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:08:45,066 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 3 minutes, 12 seconds)
2025-09-14 12:11:25,493 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:11:31,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3184.91431 ± 987.507
2025-09-14 12:11:31,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3604.1387), np.float32(1276.739), np.float32(2556.8137), np.float32(3701.1023), np.float32(3799.425), np.float32(3982.4062), np.float32(4419.3613), np.float32(3904.0222), np.float32(1727.1606), np.float32(2877.9739)]
2025-09-14 12:11:31,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:11:31,981 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 30 seconds)
2025-09-14 12:14:11,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:14:18,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3775.59619 ± 583.857
2025-09-14 12:14:18,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3932.8247), np.float32(3255.13), np.float32(3262.1086), np.float32(3984.005), np.float32(4313.6235), np.float32(4174.1523), np.float32(4031.9), np.float32(4293.4014), np.float32(4114.862), np.float32(2393.9512)]
2025-09-14 12:14:18,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:14:18,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3775.60) for latency 15
2025-09-14 12:14:18,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 2 hours, 57 minutes, 36 seconds)
2025-09-14 12:16:58,694 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:17:05,132 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3181.17578 ± 1012.061
2025-09-14 12:17:05,132 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1400.3046), np.float32(4028.9856), np.float32(1307.0723), np.float32(4336.051), np.float32(3618.814), np.float32(2663.4382), np.float32(3879.1448), np.float32(3623.787), np.float32(3784.1824), np.float32(3169.978)]
2025-09-14 12:17:05,133 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:17:05,137 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 2 hours, 54 minutes, 49 seconds)
2025-09-14 12:19:45,798 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:19:52,249 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3214.29810 ± 1006.117
2025-09-14 12:19:52,249 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3926.9358), np.float32(3368.202), np.float32(4172.2905), np.float32(3797.111), np.float32(1234.6807), np.float32(3419.7869), np.float32(3350.6401), np.float32(3537.5), np.float32(1316.525), np.float32(4019.3096)]
2025-09-14 12:19:52,249 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:19:52,254 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 2 hours, 52 minutes, 12 seconds)
2025-09-14 12:22:32,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:22:38,791 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3302.39453 ± 1133.130
2025-09-14 12:22:38,791 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1263.363), np.float32(4161.5205), np.float32(3887.0374), np.float32(4128.3657), np.float32(1951.126), np.float32(3828.0547), np.float32(3884.7761), np.float32(3770.0837), np.float32(1633.2257), np.float32(4516.391)]
2025-09-14 12:22:38,792 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:22:38,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 2 hours, 49 minutes, 31 seconds)
2025-09-14 12:25:18,543 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:25:25,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1898.43982 ± 543.682
2025-09-14 12:25:25,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2361.2744), np.float32(1415.9932), np.float32(2926.7336), np.float32(1507.773), np.float32(1624.6959), np.float32(1533.6383), np.float32(2096.147), np.float32(1336.0997), np.float32(2666.0947), np.float32(1515.9493)]
2025-09-14 12:25:25,114 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:25:25,119 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 2 hours, 46 minutes, 37 seconds)
2025-09-14 12:28:05,351 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:28:11,911 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3180.37354 ± 1040.621
2025-09-14 12:28:11,912 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4235.7817), np.float32(4173.002), np.float32(3799.8962), np.float32(1482.4747), np.float32(3845.7463), np.float32(2260.9397), np.float32(1482.529), np.float32(3978.3306), np.float32(3837.5771), np.float32(2707.4587)]
2025-09-14 12:28:11,912 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:28:11,917 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 2 hours, 43 minutes, 54 seconds)
2025-09-14 12:30:52,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:30:58,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3443.35010 ± 779.190
2025-09-14 12:30:58,603 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3877.4666), np.float32(4046.3945), np.float32(3825.5923), np.float32(1811.5901), np.float32(2086.7703), np.float32(3958.57), np.float32(3938.4814), np.float32(3596.2996), np.float32(4002.4788), np.float32(3289.8591)]
2025-09-14 12:30:58,603 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:30:58,608 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 41 minutes, 8 seconds)
2025-09-14 12:33:39,282 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:33:45,835 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2102.87622 ± 679.856
2025-09-14 12:33:45,836 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1994.7362), np.float32(1895.6898), np.float32(1154.2698), np.float32(3328.1245), np.float32(2093.111), np.float32(1592.6791), np.float32(2959.1982), np.float32(1421.9637), np.float32(2858.2183), np.float32(1730.7701)]
2025-09-14 12:33:45,836 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:33:45,841 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 38 minutes, 22 seconds)
2025-09-14 12:36:26,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:36:33,014 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3640.53979 ± 944.296
2025-09-14 12:36:33,014 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4507.264), np.float32(2425.745), np.float32(4334.986), np.float32(4480.3154), np.float32(2759.5037), np.float32(2168.4011), np.float32(4355.426), np.float32(4388.94), np.float32(2642.6848), np.float32(4342.132)]
2025-09-14 12:36:33,014 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:36:33,020 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 2 hours, 35 minutes, 43 seconds)
2025-09-14 12:39:13,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:39:19,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2578.66528 ± 1053.120
2025-09-14 12:39:19,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2012.219), np.float32(3010.5332), np.float32(1353.5583), np.float32(4811.138), np.float32(1619.9906), np.float32(2131.973), np.float32(3813.8096), np.float32(1650.365), np.float32(2159.0715), np.float32(3223.994)]
2025-09-14 12:39:19,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:39:19,519 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 32 minutes, 58 seconds)
2025-09-14 12:41:59,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:42:05,942 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2787.30518 ± 752.514
2025-09-14 12:42:05,943 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3119.1558), np.float32(2153.8098), np.float32(4140.6104), np.float32(3231.8022), np.float32(1892.9442), np.float32(2291.6313), np.float32(2239.9487), np.float32(3939.5251), np.float32(2745.4866), np.float32(2118.1392)]
2025-09-14 12:42:05,943 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:42:05,947 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 30 minutes, 7 seconds)
2025-09-14 12:44:46,071 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:44:52,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3202.80859 ± 1146.735
2025-09-14 12:44:52,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3482.0264), np.float32(1842.0623), np.float32(3260.437), np.float32(3794.9263), np.float32(4288.0317), np.float32(3982.369), np.float32(4159.2646), np.float32(4449.6484), np.float32(1371.9991), np.float32(1397.3195)]
2025-09-14 12:44:52,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:44:52,641 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 27 minutes, 20 seconds)
2025-09-14 12:47:33,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:47:39,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3698.25000 ± 915.504
2025-09-14 12:47:39,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4693.3857), np.float32(3792.6965), np.float32(2953.5432), np.float32(4286.1953), np.float32(3953.4998), np.float32(4724.324), np.float32(3143.6997), np.float32(3324.6191), np.float32(1625.2263), np.float32(4485.311)]
2025-09-14 12:47:39,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:47:39,981 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 24 minutes, 35 seconds)
2025-09-14 12:50:20,802 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:50:27,235 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3564.92651 ± 823.130
2025-09-14 12:50:27,235 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2731.7717), np.float32(2352.4543), np.float32(2642.049), np.float32(3991.542), np.float32(4773.986), np.float32(4256.187), np.float32(3066.4036), np.float32(3749.227), np.float32(3372.5356), np.float32(4713.109)]
2025-09-14 12:50:27,235 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:50:27,241 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 21 minutes, 49 seconds)
2025-09-14 12:53:08,164 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:53:14,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3980.73242 ± 722.334
2025-09-14 12:53:14,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4581.9854), np.float32(4606.0156), np.float32(4196.255), np.float32(4696.1367), np.float32(3426.569), np.float32(4247.5117), np.float32(3435.918), np.float32(4784.236), np.float32(2516.1042), np.float32(3316.5903)]
2025-09-14 12:53:14,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:53:14,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3980.73) for latency 15
2025-09-14 12:53:14,628 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 19 minutes, 11 seconds)
2025-09-14 12:55:54,278 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:56:00,837 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2370.65601 ± 909.976
2025-09-14 12:56:00,837 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2484.032), np.float32(1646.6107), np.float32(2768.172), np.float32(2426.3535), np.float32(1581.6042), np.float32(3908.759), np.float32(2154.1387), np.float32(1371.4637), np.float32(3977.6978), np.float32(1387.7312)]
2025-09-14 12:56:00,837 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:56:00,842 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 16 minutes, 21 seconds)
2025-09-14 12:59:07,836 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 12:59:15,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2310.85693 ± 485.692
2025-09-14 12:59:15,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2011.1302), np.float32(2327.6482), np.float32(2648.5952), np.float32(1914.743), np.float32(3182.1895), np.float32(1631.1744), np.float32(2728.4387), np.float32(2737.149), np.float32(1664.0626), np.float32(2263.4385)]
2025-09-14 12:59:15,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:59:15,275 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 18 minutes, 1 second)
2025-09-14 13:02:22,016 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:02:29,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2906.28149 ± 1084.394
2025-09-14 13:02:29,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4009.9722), np.float32(1736.0897), np.float32(2557.023), np.float32(3386.1665), np.float32(4208.2783), np.float32(3867.1382), np.float32(1512.289), np.float32(1929.277), np.float32(4213.2393), np.float32(1643.3422)]
2025-09-14 13:02:29,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:02:29,330 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 19 minutes, 19 seconds)
2025-09-14 13:05:38,514 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:05:46,241 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3933.64697 ± 1103.618
2025-09-14 13:05:46,242 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4489.247), np.float32(1628.8301), np.float32(1917.8201), np.float32(4571.677), np.float32(4607.181), np.float32(4710.332), np.float32(4646.6304), np.float32(3881.1152), np.float32(4485.1113), np.float32(4398.5234)]
2025-09-14 13:05:46,242 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:05:46,247 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 20 minutes, 54 seconds)
2025-09-14 13:08:56,272 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:09:03,658 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3087.38818 ± 977.811
2025-09-14 13:09:03,658 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3805.489), np.float32(4479.651), np.float32(3217.2866), np.float32(1571.6665), np.float32(2140.8928), np.float32(3458.8699), np.float32(1407.4894), np.float32(3518.792), np.float32(3365.845), np.float32(3907.8984)]
2025-09-14 13:09:03,658 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:09:03,664 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 22 minutes, 21 seconds)
2025-09-14 13:12:13,501 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:12:21,106 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3956.00586 ± 967.804
2025-09-14 13:12:21,106 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4414.3145), np.float32(4472.9453), np.float32(4127.627), np.float32(4513.398), np.float32(4541.2397), np.float32(3973.4626), np.float32(4812.119), np.float32(1528.36), np.float32(2779.1572), np.float32(4397.433)]
2025-09-14 13:12:21,106 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:12:21,112 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 23 minutes, 46 seconds)
2025-09-14 13:15:31,350 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:15:38,703 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3383.89795 ± 1160.610
2025-09-14 13:15:38,703 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2106.2131), np.float32(3446.7183), np.float32(5225.669), np.float32(1570.1537), np.float32(4018.1282), np.float32(2723.6682), np.float32(4527.379), np.float32(4537.4307), np.float32(3579.3389), np.float32(2104.282)]
2025-09-14 13:15:38,703 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:15:38,709 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 20 minutes, 57 seconds)
2025-09-14 13:18:48,863 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:18:56,378 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4434.64355 ± 1191.970
2025-09-14 13:18:56,378 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2044.9059), np.float32(5028.368), np.float32(5079.8438), np.float32(2402.1453), np.float32(5189.2705), np.float32(5277.794), np.float32(5237.2266), np.float32(5079.3765), np.float32(5293.5303), np.float32(3713.9746)]
2025-09-14 13:18:56,378 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:18:56,378 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4434.64) for latency 15
2025-09-14 13:18:56,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 18 minutes, 11 seconds)
2025-09-14 13:22:06,603 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:22:14,311 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4819.28125 ± 283.499
2025-09-14 13:22:14,311 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4994.6104), np.float32(4987.789), np.float32(5001.06), np.float32(5052.0083), np.float32(5005.728), np.float32(4406.387), np.float32(4179.4253), np.float32(4826.296), np.float32(4732.6865), np.float32(5006.8213)]
2025-09-14 13:22:14,311 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:22:14,311 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4819.28) for latency 15
2025-09-14 13:22:14,317 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 15 minutes, 2 seconds)
2025-09-14 13:25:24,627 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:25:32,105 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3869.58008 ± 1210.256
2025-09-14 13:25:32,105 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2062.0347), np.float32(3288.751), np.float32(4502.2314), np.float32(2096.504), np.float32(2434.6345), np.float32(5076.432), np.float32(4962.915), np.float32(5269.2144), np.float32(4623.8135), np.float32(4379.2695)]
2025-09-14 13:25:32,105 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:25:32,111 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 11 minutes, 47 seconds)
2025-09-14 13:28:42,409 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:28:49,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4673.25488 ± 1272.934
2025-09-14 13:28:49,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1786.7233), np.float32(5326.682), np.float32(5320.658), np.float32(5335.7925), np.float32(5530.035), np.float32(5452.1016), np.float32(5278.1025), np.float32(4667.154), np.float32(5435.446), np.float32(2599.8528)]
2025-09-14 13:28:49,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:28:49,673 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 2 hours, 8 minutes, 30 seconds)
2025-09-14 13:31:58,291 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:32:05,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4048.54248 ± 1412.747
2025-09-14 13:32:05,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1878.6635), np.float32(5355.557), np.float32(5402.039), np.float32(3920.814), np.float32(4673.7173), np.float32(4879.7534), np.float32(4914.8774), np.float32(2167.6199), np.float32(1914.1887), np.float32(5378.1953)]
2025-09-14 13:32:05,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:32:06,000 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 2 hours, 5 minutes, 3 seconds)
2025-09-14 13:35:15,730 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:35:23,338 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5041.64746 ± 550.926
2025-09-14 13:35:23,338 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5422.116), np.float32(5539.3135), np.float32(4420.409), np.float32(4341.2817), np.float32(5583.548), np.float32(5005.6167), np.float32(5466.121), np.float32(4094.022), np.float32(4893.178), np.float32(5650.868)]
2025-09-14 13:35:23,339 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:35:23,339 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5041.65) for latency 15
2025-09-14 13:35:23,345 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 2 hours, 1 minute, 43 seconds)
2025-09-14 13:38:33,585 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:38:40,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4476.56689 ± 1171.766
2025-09-14 13:38:40,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5160.9546), np.float32(5116.9277), np.float32(5135.012), np.float32(4407.677), np.float32(3627.1763), np.float32(5078.848), np.float32(5168.9717), np.float32(4620.456), np.float32(5186.395), np.float32(1263.2485)]
2025-09-14 13:38:40,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:38:40,890 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 58 minutes, 23 seconds)
2025-09-14 13:41:50,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:41:57,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5393.95020 ± 31.231
2025-09-14 13:41:57,910 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5348.641), np.float32(5357.287), np.float32(5422.2705), np.float32(5361.8003), np.float32(5417.6294), np.float32(5416.8276), np.float32(5356.0977), np.float32(5418.675), np.float32(5416.904), np.float32(5423.372)]
2025-09-14 13:41:57,910 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:41:57,910 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5393.95) for latency 15
2025-09-14 13:41:57,916 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 55 minutes)
2025-09-14 13:44:54,999 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:45:02,218 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4247.74463 ± 1106.143
2025-09-14 13:45:02,218 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5465.8374), np.float32(5372.2153), np.float32(4161.947), np.float32(3674.1152), np.float32(5493.737), np.float32(4879.7056), np.float32(2298.3037), np.float32(3721.982), np.float32(2579.472), np.float32(4830.131)]
2025-09-14 13:45:02,219 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:45:02,225 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 50 minutes, 13 seconds)
2025-09-14 13:48:00,267 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:48:07,057 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4485.69434 ± 1361.521
2025-09-14 13:48:07,057 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4666.782), np.float32(5572.6836), np.float32(5579.082), np.float32(1702.0488), np.float32(5622.4346), np.float32(5472.9946), np.float32(5367.787), np.float32(2444.646), np.float32(3499.8877), np.float32(4928.5967)]
2025-09-14 13:48:07,057 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:48:07,063 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 45 minutes, 43 seconds)
2025-09-14 13:51:04,068 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:51:11,144 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4931.03418 ± 596.034
2025-09-14 13:51:11,144 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5345.204), np.float32(5151.651), np.float32(5348.5723), np.float32(5171.167), np.float32(5227.8896), np.float32(4161.4907), np.float32(5167.6143), np.float32(4846.831), np.float32(3474.2817), np.float32(5415.639)]
2025-09-14 13:51:11,145 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:51:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 41 minutes, 5 seconds)
2025-09-14 13:54:06,865 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:54:14,101 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5217.01318 ± 896.298
2025-09-14 13:54:14,102 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3342.1067), np.float32(5683.492), np.float32(5519.6865), np.float32(5661.055), np.float32(5696.688), np.float32(5701.2695), np.float32(5693.944), np.float32(5653.0835), np.float32(5702.851), np.float32(3515.9568)]
2025-09-14 13:54:14,102 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:54:14,108 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 36 minutes, 25 seconds)
2025-09-14 13:57:11,240 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 13:57:18,490 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4080.38330 ± 1486.471
2025-09-14 13:57:18,490 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4921.698), np.float32(5304.3164), np.float32(5385.6934), np.float32(5395.733), np.float32(5051.274), np.float32(1612.5642), np.float32(2757.4336), np.float32(5530.056), np.float32(2615.8142), np.float32(2229.2507)]
2025-09-14 13:57:18,490 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:57:18,497 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 32 minutes, 3 seconds)
2025-09-14 14:00:16,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:00:23,375 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4178.08301 ± 1667.510
2025-09-14 14:00:23,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5118.89), np.float32(5718.883), np.float32(5676.1084), np.float32(5617.2744), np.float32(2582.1042), np.float32(5127.716), np.float32(2036.7234), np.float32(2894.8884), np.float32(5699.7954), np.float32(1308.4479)]
2025-09-14 14:00:23,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:00:23,382 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 29 minutes, 2 seconds)
2025-09-14 14:03:20,377 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:03:27,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4650.18799 ± 1013.460
2025-09-14 14:03:27,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4184.0615), np.float32(5507.1523), np.float32(5184.0786), np.float32(2205.639), np.float32(5124.266), np.float32(5558.5386), np.float32(5377.5244), np.float32(5234.4575), np.float32(4536.823), np.float32(3589.337)]
2025-09-14 14:03:27,637 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:03:27,643 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 25 minutes, 55 seconds)
2025-09-14 14:06:25,269 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:06:32,487 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5561.69141 ± 235.711
2025-09-14 14:06:32,488 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5746.5923), np.float32(5660.3467), np.float32(5641.0845), np.float32(4910.7305), np.float32(5648.9434), np.float32(5691.9043), np.float32(5631.925), np.float32(5614.871), np.float32(5380.5054), np.float32(5690.0117)]
2025-09-14 14:06:32,488 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:06:32,488 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5561.69) for latency 15
2025-09-14 14:06:32,494 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 22 minutes, 55 seconds)
2025-09-14 14:09:30,787 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:09:37,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4689.04980 ± 1531.938
2025-09-14 14:09:37,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5054.3965), np.float32(5671.073), np.float32(5235.809), np.float32(1882.4935), np.float32(5132.3857), np.float32(5595.142), np.float32(5635.9634), np.float32(5602.1294), np.float32(5639.216), np.float32(1441.8857)]
2025-09-14 14:09:37,631 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:09:37,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 20 minutes, 2 seconds)
2025-09-14 14:12:35,509 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:12:42,418 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5337.84082 ± 834.277
2025-09-14 14:12:42,418 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2903.8643), np.float32(5640.4165), np.float32(5714.458), np.float32(5679.561), np.float32(5692.451), np.float32(5032.6636), np.float32(5643.5234), np.float32(5693.114), np.float32(5702.664), np.float32(5675.6924)]
2025-09-14 14:12:42,418 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:12:42,425 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 16 minutes, 59 seconds)
2025-09-14 14:15:38,748 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:15:45,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4722.93018 ± 1502.536
2025-09-14 14:15:45,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1842.1912), np.float32(5708.9688), np.float32(5705.0073), np.float32(5627.7627), np.float32(5772.9717), np.float32(5650.9214), np.float32(2072.458), np.float32(5614.3423), np.float32(3690.965), np.float32(5543.7104)]
2025-09-14 14:15:45,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:15:45,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 13 minutes, 48 seconds)
2025-09-14 14:18:43,358 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:18:50,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5199.07031 ± 1498.293
2025-09-14 14:18:50,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5700.985), np.float32(5682.3496), np.float32(5690.847), np.float32(5483.7383), np.float32(5704.188), np.float32(710.69104), np.float32(5782.283), np.float32(5728.58), np.float32(5706.499), np.float32(5800.537)]
2025-09-14 14:18:50,499 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:18:50,505 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 10 minutes, 45 seconds)
2025-09-14 14:21:38,024 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:21:44,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5687.77979 ± 47.753
2025-09-14 14:21:44,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5563.462), np.float32(5653.528), np.float32(5748.9883), np.float32(5727.198), np.float32(5705.4873), np.float32(5700.034), np.float32(5696.1353), np.float32(5685.9805), np.float32(5705.86), np.float32(5691.124)]
2025-09-14 14:21:44,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:21:44,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5687.78) for latency 15
2025-09-14 14:21:44,472 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 6 minutes, 52 seconds)
2025-09-14 14:24:20,196 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:24:26,644 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5695.26660 ± 33.998
2025-09-14 14:24:26,644 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5733.396), np.float32(5680.012), np.float32(5726.181), np.float32(5695.2363), np.float32(5717.4727), np.float32(5619.3335), np.float32(5662.6094), np.float32(5711.259), np.float32(5728.045), np.float32(5679.1226)]
2025-09-14 14:24:26,644 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:24:26,644 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5695.27) for latency 15
2025-09-14 14:24:26,649 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 2 minutes, 13 seconds)
2025-09-14 14:27:01,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:27:08,576 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5305.54297 ± 1328.602
2025-09-14 14:27:08,576 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5790.5015), np.float32(5770.413), np.float32(5417.7705), np.float32(5809.8804), np.float32(1333.6825), np.float32(5784.9604), np.float32(5777.3174), np.float32(5808.2007), np.float32(5777.001), np.float32(5785.7046)]
2025-09-14 14:27:08,576 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:27:08,582 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 57 minutes, 44 seconds)
2025-09-14 14:29:44,926 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:29:51,617 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5521.53711 ± 364.715
2025-09-14 14:29:51,617 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5630.3984), np.float32(5681.653), np.float32(5726.749), np.float32(5606.5977), np.float32(5681.16), np.float32(5646.111), np.float32(5610.637), np.float32(5586.284), np.float32(5611.492), np.float32(4434.286)]
2025-09-14 14:29:51,617 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:29:51,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 53 minutes, 33 seconds)
2025-09-14 14:32:28,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:32:34,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5722.56543 ± 112.305
2025-09-14 14:32:34,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5767.379, 5760.996, 5701.673, 5776.722, 5797.9775, 5725.8164, 5753.0474, 5398.9756, 5731.9385, 5811.1304]
2025-09-14 14:32:34,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:32:34,998 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5722.57) for latency 15
2025-09-14 14:32:35,003 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 49 minutes, 28 seconds)
2025-09-14 14:35:10,536 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:35:17,142 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5680.65479 ± 58.867
2025-09-14 14:35:17,142 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5670.4224, 5699.4307, 5760.981, 5720.4683, 5703.0283, 5647.9785, 5555.503, 5620.2393, 5675.383, 5753.108]
2025-09-14 14:35:17,142 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:35:17,150 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 46 minutes, 3 seconds)
2025-09-14 14:37:52,090 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:37:58,730 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5143.46924 ± 1412.860
2025-09-14 14:37:58,730 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5811.548, 5706.393, 5747.0425, 5825.0474, 5618.311, 5799.168, 5771.208, 1120.578, 5760.974, 4274.422]
2025-09-14 14:37:58,730 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:37:58,736 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 43 minutes, 18 seconds)
2025-09-14 14:40:34,789 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:40:41,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5716.83447 ± 34.448
2025-09-14 14:40:41,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5684.433, 5681.2466, 5750.6626, 5649.455, 5771.866, 5721.339, 5721.0664, 5738.071, 5714.809, 5735.394]
2025-09-14 14:40:41,427 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:40:41,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 40 minutes, 38 seconds)
2025-09-14 14:43:17,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:43:23,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5190.98682 ± 1298.218
2025-09-14 14:43:23,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5614.2197, 5646.714, 1298.8032, 5637.537, 5664.067, 5493.7607, 5618.6133, 5665.599, 5634.632, 5635.922]
2025-09-14 14:43:23,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:43:23,500 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 37 minutes, 53 seconds)
2025-09-14 14:45:59,028 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:46:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5803.72168 ± 37.141
2025-09-14 14:46:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5812.594, 5745.2173, 5807.363, 5731.307, 5804.42, 5809.3286, 5837.9067, 5842.4326, 5795.3213, 5851.327]
2025-09-14 14:46:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:46:05,654 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5803.72) for latency 15
2025-09-14 14:46:05,660 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 35 minutes, 7 seconds)
2025-09-14 14:48:39,081 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:48:45,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5835.51611 ± 37.202
2025-09-14 14:48:45,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5815.553, 5846.0205, 5804.911, 5875.223, 5832.702, 5804.024, 5908.0957, 5808.934, 5786.378, 5873.316]
2025-09-14 14:48:45,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:48:45,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5835.52) for latency 15
2025-09-14 14:48:45,641 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 32 minutes, 20 seconds)
2025-09-14 14:51:21,663 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:51:28,282 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5734.66846 ± 42.598
2025-09-14 14:51:28,282 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5773.911, 5770.0483, 5758.651, 5744.2446, 5705.6855, 5703.1133, 5753.587, 5790.7964, 5643.1865, 5703.4556]
2025-09-14 14:51:28,283 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:51:28,288 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 29 minutes, 41 seconds)
2025-09-14 14:54:03,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:54:10,110 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5735.31885 ± 61.631
2025-09-14 14:54:10,111 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5572.1353, 5733.662, 5744.1074, 5769.015, 5759.186, 5758.2007, 5803.9673, 5717.6387, 5790.7534, 5704.518]
2025-09-14 14:54:10,111 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:54:10,116 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 26 minutes, 57 seconds)
2025-09-14 14:56:43,729 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:56:50,143 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5804.60840 ± 51.081
2025-09-14 14:56:50,143 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5752.2036, 5725.862, 5856.785, 5749.921, 5796.648, 5875.0396, 5808.3115, 5790.784, 5883.3315, 5807.197]
2025-09-14 14:56:50,143 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:56:50,150 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 24 minutes, 11 seconds)
2025-09-14 14:59:23,934 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 14:59:30,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5336.60400 ± 1300.665
2025-09-14 14:59:30,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1438.9841, 5740.1655, 5771.05, 5807.6113, 5751.3203, 5789.0737, 5818.0806, 5781.348, 5856.4727, 5611.933]
2025-09-14 14:59:30,510 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 14:59:30,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 21 minutes, 27 seconds)
2025-09-14 15:02:05,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:02:12,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5676.17773 ± 145.269
2025-09-14 15:02:12,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5705.3125, 5759.5024, 5734.2046, 5265.0625, 5723.3833, 5713.679, 5782.559, 5734.3896, 5592.8516, 5750.8315]
2025-09-14 15:02:12,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:02:12,315 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 18 minutes, 49 seconds)
2025-09-14 15:04:47,627 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:04:54,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5787.68311 ± 35.020
2025-09-14 15:04:54,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5750.7, 5747.3013, 5835.6304, 5782.285, 5748.6973, 5833.802, 5807.5415, 5750.728, 5793.8213, 5826.3228]
2025-09-14 15:04:54,027 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:04:54,033 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 16 minutes, 6 seconds)
2025-09-14 15:07:29,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:07:36,078 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5862.29590 ± 55.134
2025-09-14 15:07:36,078 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5902.484, 5906.0474, 5901.141, 5858.442, 5886.9824, 5856.881, 5857.8657, 5843.246, 5710.7476, 5899.125]
2025-09-14 15:07:36,078 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:07:36,079 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5862.30) for latency 15
2025-09-14 15:07:36,085 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 13 minutes, 25 seconds)
2025-09-14 15:10:10,878 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:10:17,505 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5738.77441 ± 61.141
2025-09-14 15:10:17,505 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5703.168, 5843.0674, 5807.9526, 5766.4736, 5741.018, 5600.0884, 5733.0317, 5722.2217, 5742.945, 5727.779]
2025-09-14 15:10:17,505 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:10:17,512 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 10 minutes, 45 seconds)
2025-09-14 15:12:52,547 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:12:59,064 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5713.62744 ± 214.027
2025-09-14 15:12:59,064 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5887.445, 5867.963, 5372.6113, 5746.829, 5855.7075, 5920.6753, 5248.7495, 5764.956, 5691.6494, 5779.6875]
2025-09-14 15:12:59,064 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:12:59,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 8 minutes, 5 seconds)
2025-09-14 15:15:34,358 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:15:40,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5425.87988 ± 1181.040
2025-09-14 15:15:40,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5818.9326, 5890.6772, 1885.346, 5821.723, 5815.8203, 5790.9033, 5735.412, 5819.712, 5782.977, 5897.2944]
2025-09-14 15:15:40,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:15:40,835 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 5 minutes, 23 seconds)
2025-09-14 15:18:16,559 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:18:22,947 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5818.54639 ± 233.581
2025-09-14 15:18:22,948 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5864.118, 5727.039, 5142.8535, 5910.264, 5924.32, 5974.1855, 5931.5894, 5893.824, 5901.9536, 5915.316]
2025-09-14 15:18:22,948 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:18:22,954 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 2 minutes, 41 seconds)
2025-09-14 15:20:58,170 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 15...
2025-09-14 15:21:04,827 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5813.67725 ± 66.188
2025-09-14 15:21:04,827 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5834.637, 5899.8384, 5823.91, 5651.2974, 5790.4844, 5785.7124, 5855.1895, 5859.387, 5771.785, 5864.532]
2025-09-14 15:21:04,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 15:21:04,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1251 [DEBUG]: Training session finished
