2025-09-11 02:54:45,521 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval/halfcheetah/bpql-noise_0.000-delay_6
2025-09-11 02:54:45,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval/halfcheetah/bpql-noise_0.000-delay_6
2025-09-11 02:54:45,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'6': <latency_env.delayed_mdp.ConstantDelay object at 0x7dc5dfb67a70>}
2025-09-11 02:54:45,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-11 02:54:45,525 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-11 02:54:45,531 baseline-bpql-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=53, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-11 02:54:45,531 baseline-bpql-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-11 02:54:46,281 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-11 02:54:46,281 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-11 02:57:21,429 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 02:57:36,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -343.17465 ± 72.484
2025-09-11 02:57:36,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-299.83838, -289.6197, -288.40744, -476.885, -353.55167, -282.45407, -451.81143, -366.48276, -377.48868, -245.20746]
2025-09-11 02:57:36,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 02:57:36,706 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-343.17) for latency 6
2025-09-11 02:57:36,706 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 4 hours, 41 minutes, 12 seconds)
2025-09-11 03:00:22,652 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:00:37,893 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -213.72800 ± 29.346
2025-09-11 03:00:37,894 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [-224.27904, -143.45885, -236.72021, -198.90909, -196.39314, -204.83646, -255.8746, -232.36089, -213.08612, -231.36165]
2025-09-11 03:00:37,894 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:00:37,894 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-213.73) for latency 6
2025-09-11 03:00:37,894 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 4 hours, 47 minutes, 9 seconds)
2025-09-11 03:03:23,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:03:39,036 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -11.00730 ± 89.206
2025-09-11 03:03:39,037 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [174.42386, -15.545943, -89.54818, 115.811264, -49.173542, -95.60434, -85.71074, 23.19203, -95.84883, 7.9313803]
2025-09-11 03:03:39,037 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:03:39,037 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-11.01) for latency 6
2025-09-11 03:03:39,037 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 4 hours, 47 minutes, 5 seconds)
2025-09-11 03:06:25,179 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:06:40,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 377.07016 ± 210.502
2025-09-11 03:06:40,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [543.32263, 323.12512, 453.021, 7.7002797, 461.2634, 438.94778, 820.3083, 173.69917, 292.6191, 256.69492]
2025-09-11 03:06:40,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:06:40,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (377.07) for latency 6
2025-09-11 03:06:40,439 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 4 hours, 45 minutes, 39 seconds)
2025-09-11 03:09:26,457 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:09:41,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 863.36670 ± 506.506
2025-09-11 03:09:41,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1520.6779, 241.71506, 580.6665, 1347.7836, 1388.8431, 200.31744, 314.41635, 1286.3762, 1219.9531, 532.91754]
2025-09-11 03:09:41,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:09:41,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (863.37) for latency 6
2025-09-11 03:09:41,686 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 4 hours, 43 minutes, 32 seconds)
2025-09-11 03:12:27,652 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:12:42,858 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1424.64807 ± 596.861
2025-09-11 03:12:42,858 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [2092.3716, 1627.812, 778.8534, 794.02, 2106.7712, 1853.812, 1274.9674, 1746.5944, 244.16151, 1727.1172]
2025-09-11 03:12:42,858 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:12:42,858 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1424.65) for latency 6
2025-09-11 03:12:42,859 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 4 hours, 43 minutes, 55 seconds)
2025-09-11 03:15:28,882 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:15:44,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1978.15137 ± 913.747
2025-09-11 03:15:44,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [1179.7052, 2406.8848, 535.1461, 563.567, 1569.872, 2852.8462, 2572.1152, 2122.5715, 3206.2031, 2772.6035]
2025-09-11 03:15:44,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:15:44,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1978.15) for latency 6
2025-09-11 03:15:44,084 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 4 hours, 40 minutes, 55 seconds)
2025-09-11 03:18:30,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:18:45,960 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2906.29932 ± 1325.757
2025-09-11 03:18:45,960 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3742.7288, 510.0614, 521.8693, 4068.2134, 2337.3154, 2712.0889, 4140.8657, 3669.5808, 3214.6602, 4145.611]
2025-09-11 03:18:45,961 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:18:45,961 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (2906.30) for latency 6
2025-09-11 03:18:45,962 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 4 hours, 38 minutes, 7 seconds)
2025-09-11 03:21:32,013 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:21:47,229 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 2445.86865 ± 756.263
2025-09-11 03:21:47,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [3168.7014, 2429.9338, 1408.7216, 2144.5798, 3697.2397, 1725.2913, 3200.5476, 1339.9982, 2830.5303, 2513.1418]
2025-09-11 03:21:47,230 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:21:47,231 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 4 hours, 35 minutes, 3 seconds)
2025-09-11 03:24:34,621 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:24:49,850 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3929.05615 ± 736.832
2025-09-11 03:24:49,850 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4211.3306, 4305.791, 1753.0483, 3958.8606, 4081.078, 4174.583, 3979.9595, 4405.3384, 4191.6445, 4228.9287]
2025-09-11 03:24:49,850 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:24:49,850 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3929.06) for latency 6
2025-09-11 03:24:49,852 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 4 hours, 32 minutes, 26 seconds)
2025-09-11 03:27:35,624 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:27:50,855 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4298.45117 ± 711.968
2025-09-11 03:27:50,855 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4643.2446, 4453.7783, 4796.851, 4745.1895, 4655.2974, 2798.8638, 4762.2266, 4569.8164, 4577.997, 2981.2495]
2025-09-11 03:27:50,855 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:27:50,855 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4298.45) for latency 6
2025-09-11 03:27:50,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 4 hours, 29 minutes, 22 seconds)
2025-09-11 03:30:36,886 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:30:52,167 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4721.78955 ± 135.263
2025-09-11 03:30:52,167 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4547.4185, 4774.0845, 4442.9683, 4779.533, 4909.999, 4672.355, 4719.6323, 4888.1084, 4776.5645, 4707.229]
2025-09-11 03:30:52,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:30:52,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4721.79) for latency 6
2025-09-11 03:30:52,169 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 4 hours, 26 minutes, 22 seconds)
2025-09-11 03:33:38,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:33:53,289 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4762.59570 ± 293.485
2025-09-11 03:33:53,289 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4628.5166, 5053.4854, 4740.6216, 4937.503, 4782.3345, 5074.4297, 4031.2139, 4617.277, 4728.1426, 5032.4346]
2025-09-11 03:33:53,289 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:33:53,289 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4762.60) for latency 6
2025-09-11 03:33:53,290 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 4 hours, 23 minutes, 7 seconds)
2025-09-11 03:36:39,412 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:36:54,634 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4365.45605 ± 1177.403
2025-09-11 03:36:54,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4977.469, 5128.036, 4928.3965, 4821.8486, 4705.637, 4654.063, 925.49255, 4891.6294, 4128.88, 4493.11]
2025-09-11 03:36:54,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:36:54,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 4 hours, 20 minutes, 7 seconds)
2025-09-11 03:39:40,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:39:56,161 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4689.00391 ± 431.474
2025-09-11 03:39:56,161 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4728.9004, 3483.7048, 4968.0146, 4722.4146, 4553.3945, 4689.9023, 4868.0527, 4813.4443, 5150.266, 4911.9463]
2025-09-11 03:39:56,162 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:39:56,163 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 4 hours, 16 minutes, 47 seconds)
2025-09-11 03:42:42,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:42:57,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5020.38281 ± 249.469
2025-09-11 03:42:57,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4444.829, 5006.0415, 5224.968, 5272.3667, 4713.6797, 5133.5635, 5150.604, 5013.167, 5265.255, 4979.3574]
2025-09-11 03:42:57,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:42:57,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5020.38) for latency 6
2025-09-11 03:42:57,556 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 4 hours, 13 minutes, 52 seconds)
2025-09-11 03:45:43,819 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:45:59,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4975.21387 ± 193.607
2025-09-11 03:45:59,083 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5106.385, 4786.927, 5150.6094, 4772.9956, 5105.294, 5121.8433, 4916.0356, 4568.6196, 5056.8594, 5166.5674]
2025-09-11 03:45:59,084 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:45:59,085 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 4 hours, 10 minutes, 54 seconds)
2025-09-11 03:48:45,545 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:49:00,950 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5042.71875 ± 219.449
2025-09-11 03:49:00,950 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5129.574, 5370.892, 4854.9746, 5110.1313, 5274.718, 4794.548, 5165.8994, 4603.2227, 5093.9224, 5029.3027]
2025-09-11 03:49:00,950 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:49:00,950 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5042.72) for latency 6
2025-09-11 03:49:00,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 4 hours, 8 minutes, 5 seconds)
2025-09-11 03:51:48,265 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:52:03,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5244.41016 ± 115.753
2025-09-11 03:52:03,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5316.529, 5164.9653, 5333.621, 5066.1978, 5181.9224, 5334.6997, 5354.821, 5361.6533, 5291.539, 5038.151]
2025-09-11 03:52:03,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:52:03,495 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5244.41) for latency 6
2025-09-11 03:52:03,496 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 4 hours, 5 minutes, 23 seconds)
2025-09-11 03:54:49,721 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:55:04,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4910.40332 ± 833.045
2025-09-11 03:55:04,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5099.035, 5069.929, 5307.171, 5304.5654, 5275.8604, 5170.5264, 5395.9756, 5328.1206, 4670.8027, 2482.045]
2025-09-11 03:55:04,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:55:04,942 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 4 hours, 2 minutes, 20 seconds)
2025-09-11 03:57:51,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 03:58:06,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5101.67139 ± 135.436
2025-09-11 03:58:06,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [4870.107, 5244.9497, 5145.6675, 4981.6675, 4884.633, 5240.2715, 5245.8945, 5120.684, 5107.4707, 5175.3716]
2025-09-11 03:58:06,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 03:58:06,477 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 3 hours, 59 minutes, 20 seconds)
2025-09-11 04:00:52,724 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:01:07,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4744.00244 ± 1234.855
2025-09-11 04:01:07,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5277.5034, 5344.899, 5230.841, 5122.5454, 5235.3657, 5342.903, 5137.5093, 4640.9844, 1086.1127, 5021.36]
2025-09-11 04:01:07,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 04:01:07,961 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 3 hours, 56 minutes, 18 seconds)
2025-09-11 04:03:54,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:04:09,688 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5089.88818 ± 159.368
2025-09-11 04:04:09,689 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5198.9717, 5051.3257, 5218.8643, 5185.5933, 4820.738, 5267.128, 5218.577, 4814.3545, 5153.517, 4969.817]
2025-09-11 04:04:09,689 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 04:04:09,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 3 hours, 53 minutes, 14 seconds)
2025-09-11 04:06:56,405 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:07:11,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5331.06396 ± 111.928
2025-09-11 04:07:11,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5248.312, 5463.9, 5137.2974, 5287.4697, 5435.7905, 5194.17, 5438.335, 5374.475, 5454.544, 5276.3467]
2025-09-11 04:07:11,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 04:07:11,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5331.06) for latency 6
2025-09-11 04:07:11,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 3 hours, 50 minutes, 3 seconds)
2025-09-11 04:09:57,460 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:10:12,683 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5184.95312 ± 227.438
2025-09-11 04:10:12,683 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [5402.1167, 5400.1235, 5452.1304, 5227.9023, 5342.6045, 4876.828, 5211.958, 5220.038, 4771.508, 4944.322]
2025-09-11 04:10:12,683 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-11 04:10:12,685 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 3 hours, 46 minutes, 56 seconds)
2025-09-11 04:12:58,763 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:13:13,973 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4968.55420 ± 1292.190
2025-09-11 04:13:13,973 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5394.2515), np.float32(5417.327), np.float32(5460.526), np.float32(1095.5752), np.float32(5423.5723), np.float32(5306.8467), np.float32(5452.4287), np.float32(5286.832), np.float32(5401.9404), np.float32(5446.2417)]
2025-09-11 04:13:13,973 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:13:13,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 3 hours, 43 minutes, 50 seconds)
2025-09-11 04:16:00,039 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:16:15,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5280.97021 ± 259.267
2025-09-11 04:16:15,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5392.119), np.float32(5463.689), np.float32(5313.47), np.float32(5027.7793), np.float32(5460.887), np.float32(5288.2334), np.float32(5498.7573), np.float32(4605.497), np.float32(5433.0386), np.float32(5326.2314)]
2025-09-11 04:16:15,314 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:16:15,316 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 3 hours, 40 minutes, 47 seconds)
2025-09-11 04:19:01,171 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:19:16,372 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5475.16504 ± 32.179
2025-09-11 04:19:16,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5449.036), np.float32(5495.3076), np.float32(5492.2666), np.float32(5474.5015), np.float32(5468.7847), np.float32(5468.5903), np.float32(5535.237), np.float32(5412.585), np.float32(5503.676), np.float32(5451.668)]
2025-09-11 04:19:16,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:19:16,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5475.17) for latency 6
2025-09-11 04:19:16,375 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 3 hours, 37 minutes, 36 seconds)
2025-09-11 04:22:03,592 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:22:18,843 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4752.79346 ± 1212.774
2025-09-11 04:22:18,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5232.495), np.float32(1141.1973), np.float32(5127.6333), np.float32(5276.779), np.float32(5200.169), np.float32(5263.241), np.float32(5227.2207), np.float32(4747.436), np.float32(5090.1885), np.float32(5221.5757)]
2025-09-11 04:22:18,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:22:18,846 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 3 hours, 34 minutes, 42 seconds)
2025-09-11 04:25:04,720 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:25:19,927 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5426.43652 ± 100.064
2025-09-11 04:25:19,927 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5484.9844), np.float32(5472.441), np.float32(5404.871), np.float32(5145.863), np.float32(5484.5146), np.float32(5438.636), np.float32(5481.876), np.float32(5469.508), np.float32(5497.983), np.float32(5383.6895)]
2025-09-11 04:25:19,927 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:25:19,929 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 3 hours, 31 minutes, 41 seconds)
2025-09-11 04:28:06,028 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:28:21,210 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5490.21680 ± 72.313
2025-09-11 04:28:21,210 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5504.704), np.float32(5502.6807), np.float32(5547.4463), np.float32(5417.2466), np.float32(5328.2246), np.float32(5528.9214), np.float32(5499.414), np.float32(5582.221), np.float32(5554.726), np.float32(5436.582)]
2025-09-11 04:28:21,210 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:28:21,210 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5490.22) for latency 6
2025-09-11 04:28:21,212 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 3 hours, 28 minutes, 39 seconds)
2025-09-11 04:31:07,072 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:31:22,259 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5359.44238 ± 190.952
2025-09-11 04:31:22,259 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5342.321), np.float32(5391.571), np.float32(5456.582), np.float32(5339.2295), np.float32(5428.3667), np.float32(5340.0244), np.float32(5548.1855), np.float32(4822.307), np.float32(5500.188), np.float32(5425.649)]
2025-09-11 04:31:22,259 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:31:22,261 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 3 hours, 25 minutes, 34 seconds)
2025-09-11 04:34:07,992 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:34:23,193 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5438.58838 ± 99.763
2025-09-11 04:34:23,193 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5485.8296), np.float32(5406.709), np.float32(5434.1777), np.float32(5489.7075), np.float32(5561.535), np.float32(5259.4443), np.float32(5466.3433), np.float32(5515.249), np.float32(5513.271), np.float32(5253.62)]
2025-09-11 04:34:23,193 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:34:23,195 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 3 hours, 22 minutes, 31 seconds)
2025-09-11 04:37:11,559 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:37:26,755 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5260.15918 ± 170.721
2025-09-11 04:37:26,755 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5422.285), np.float32(5071.23), np.float32(5088.3013), np.float32(5529.4883), np.float32(5089.572), np.float32(5339.1807), np.float32(5384.273), np.float32(5442.9087), np.float32(5151.606), np.float32(5082.7446)]
2025-09-11 04:37:26,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:37:26,758 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 3 hours, 19 minutes, 44 seconds)
2025-09-11 04:40:12,649 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:40:27,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5412.24609 ± 61.663
2025-09-11 04:40:27,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5534.9146), np.float32(5370.338), np.float32(5443.42), np.float32(5423.928), np.float32(5359.8765), np.float32(5284.958), np.float32(5415.452), np.float32(5442.009), np.float32(5422.393), np.float32(5425.174)]
2025-09-11 04:40:27,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:40:27,871 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 3 hours, 16 minutes, 43 seconds)
2025-09-11 04:43:13,996 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:43:29,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5216.80908 ± 94.130
2025-09-11 04:43:29,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5225.74), np.float32(5119.906), np.float32(5222.2256), np.float32(5339.358), np.float32(5110.568), np.float32(5097.8506), np.float32(5131.3633), np.float32(5312.842), np.float32(5362.766), np.float32(5245.473)]
2025-09-11 04:43:29,187 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:43:29,189 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 3 hours, 13 minutes, 42 seconds)
2025-09-11 04:46:15,166 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:46:30,375 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5458.03369 ± 61.050
2025-09-11 04:46:30,375 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5489.4067), np.float32(5515.6416), np.float32(5436.6167), np.float32(5408.529), np.float32(5463.6206), np.float32(5578.6113), np.float32(5369.742), np.float32(5395.1865), np.float32(5504.6133), np.float32(5418.365)]
2025-09-11 04:46:30,375 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:46:30,378 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 3 hours, 10 minutes, 42 seconds)
2025-09-11 04:49:16,567 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:49:31,831 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5535.13379 ± 117.560
2025-09-11 04:49:31,831 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5675.831), np.float32(5575.5347), np.float32(5542.8945), np.float32(5537.6377), np.float32(5513.7764), np.float32(5218.013), np.float32(5607.379), np.float32(5485.112), np.float32(5588.4214), np.float32(5606.736)]
2025-09-11 04:49:31,831 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:49:31,831 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5535.13) for latency 6
2025-09-11 04:49:31,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 3 hours, 7 minutes, 47 seconds)
2025-09-11 04:52:17,727 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:52:33,032 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5376.85352 ± 167.171
2025-09-11 04:52:33,032 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5432.3613), np.float32(5304.3545), np.float32(5489.8335), np.float32(5336.3477), np.float32(5495.1284), np.float32(5540.99), np.float32(5367.371), np.float32(5385.786), np.float32(4926.471), np.float32(5489.886)]
2025-09-11 04:52:33,032 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:52:33,034 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 3 hours, 4 minutes, 16 seconds)
2025-09-11 04:55:19,103 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:55:34,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5448.96973 ± 129.650
2025-09-11 04:55:34,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5508.8066), np.float32(5409.692), np.float32(5552.563), np.float32(5418.3535), np.float32(5535.041), np.float32(5237.0854), np.float32(5563.0195), np.float32(5491.662), np.float32(5191.2607), np.float32(5582.2095)]
2025-09-11 04:55:34,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:55:34,379 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 3 hours, 1 minute, 18 seconds)
2025-09-11 04:58:20,590 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 04:58:35,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5479.22705 ± 56.888
2025-09-11 04:58:35,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5455.2705), np.float32(5457.4595), np.float32(5410.8223), np.float32(5530.7866), np.float32(5497.6914), np.float32(5602.1475), np.float32(5433.529), np.float32(5438.3755), np.float32(5531.7754), np.float32(5434.4087)]
2025-09-11 04:58:35,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 04:58:35,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 2 hours, 58 minutes, 18 seconds)
2025-09-11 05:01:22,130 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:01:37,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5512.94238 ± 86.734
2025-09-11 05:01:37,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5485.593), np.float32(5501.5757), np.float32(5594.8486), np.float32(5569.494), np.float32(5494.6655), np.float32(5556.8994), np.float32(5273.0815), np.float32(5539.5894), np.float32(5550.5015), np.float32(5563.176)]
2025-09-11 05:01:37,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:01:37,330 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 2 hours, 55 minutes, 20 seconds)
2025-09-11 05:04:23,333 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:04:38,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5506.78027 ± 95.840
2025-09-11 05:04:38,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5592.011), np.float32(5545.523), np.float32(5531.2217), np.float32(5475.2656), np.float32(5238.0796), np.float32(5529.058), np.float32(5529.724), np.float32(5589.1836), np.float32(5543.279), np.float32(5494.453)]
2025-09-11 05:04:38,636 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:04:38,639 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 2 hours, 52 minutes, 17 seconds)
2025-09-11 05:07:24,235 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:07:39,430 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5143.30811 ± 815.907
2025-09-11 05:07:39,430 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4871.4766), np.float32(5135.064), np.float32(2845.4575), np.float32(5778.4478), np.float32(5628.9453), np.float32(5088.5796), np.float32(5607.355), np.float32(5522.4907), np.float32(5675.1963), np.float32(5280.069)]
2025-09-11 05:07:39,430 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:07:39,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 2 hours, 49 minutes, 11 seconds)
2025-09-11 05:10:25,604 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:10:40,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5335.55371 ± 220.394
2025-09-11 05:10:40,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5394.6836), np.float32(4963.8657), np.float32(5442.9863), np.float32(5396.2466), np.float32(5554.985), np.float32(5331.2285), np.float32(5349.0747), np.float32(5318.802), np.float32(5674.631), np.float32(4929.0366)]
2025-09-11 05:10:40,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:10:40,831 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 2 hours, 46 minutes, 10 seconds)
2025-09-11 05:13:26,872 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:13:42,128 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4914.25879 ± 1027.293
2025-09-11 05:13:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5281.1377), np.float32(5590.6035), np.float32(5195.318), np.float32(5589.8257), np.float32(3819.5193), np.float32(2211.294), np.float32(5182.5347), np.float32(5248.143), np.float32(5404.608), np.float32(5619.6045)]
2025-09-11 05:13:42,129 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:13:42,132 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 2 hours, 43 minutes, 8 seconds)
2025-09-11 05:16:28,368 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:16:43,601 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5687.97705 ± 50.034
2025-09-11 05:16:43,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5676.606), np.float32(5783.0376), np.float32(5713.033), np.float32(5605.8667), np.float32(5707.3296), np.float32(5662.8086), np.float32(5614.852), np.float32(5733.774), np.float32(5687.7354), np.float32(5694.7275)]
2025-09-11 05:16:43,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:16:43,602 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5687.98) for latency 6
2025-09-11 05:16:43,605 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 2 hours, 40 minutes, 6 seconds)
2025-09-11 05:19:29,647 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:19:44,883 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5159.44775 ± 766.107
2025-09-11 05:19:44,884 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5510.2734), np.float32(5525.23), np.float32(3749.5376), np.float32(3563.897), np.float32(5236.1333), np.float32(5625.3574), np.float32(5355.445), np.float32(5776.8613), np.float32(5648.5186), np.float32(5603.2236)]
2025-09-11 05:19:44,884 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:19:44,886 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 2 hours, 37 minutes, 4 seconds)
2025-09-11 05:22:30,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:22:46,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5600.78613 ± 178.059
2025-09-11 05:22:46,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5202.355), np.float32(5732.5356), np.float32(5643.81), np.float32(5636.585), np.float32(5302.913), np.float32(5691.2383), np.float32(5710.128), np.float32(5668.408), np.float32(5726.41), np.float32(5693.477)]
2025-09-11 05:22:46,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:22:46,154 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 2 hours, 34 minutes, 8 seconds)
2025-09-11 05:25:32,199 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:25:47,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5636.52002 ± 164.310
2025-09-11 05:25:47,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5736.302), np.float32(5714.214), np.float32(5757.0254), np.float32(5698.557), np.float32(5729.2705), np.float32(5716.3936), np.float32(5701.3184), np.float32(5293.868), np.float32(5326.6895), np.float32(5691.56)]
2025-09-11 05:25:47,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:25:47,456 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 2 hours, 31 minutes, 6 seconds)
2025-09-11 05:28:33,160 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:28:48,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5613.41699 ± 120.815
2025-09-11 05:28:48,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5379.859), np.float32(5532.3296), np.float32(5696.354), np.float32(5682.8105), np.float32(5719.1846), np.float32(5732.475), np.float32(5616.8555), np.float32(5418.751), np.float32(5705.989), np.float32(5649.562)]
2025-09-11 05:28:48,373 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:28:48,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 2 hours, 28 minutes, 1 second)
2025-09-11 05:31:34,356 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:31:49,611 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5691.91064 ± 52.188
2025-09-11 05:31:49,611 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5655.7397), np.float32(5727.208), np.float32(5658.0854), np.float32(5740.962), np.float32(5641.3374), np.float32(5619.5654), np.float32(5805.0537), np.float32(5672.2446), np.float32(5695.0923), np.float32(5703.8174)]
2025-09-11 05:31:49,611 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:31:49,611 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5691.91) for latency 6
2025-09-11 05:31:49,615 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 2 hours, 24 minutes, 57 seconds)
2025-09-11 05:34:36,131 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:34:51,305 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5452.90967 ± 544.023
2025-09-11 05:34:51,305 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5370.648), np.float32(5627.693), np.float32(5669.111), np.float32(5799.3833), np.float32(5409.9917), np.float32(3881.7205), np.float32(5614.4604), np.float32(5857.8506), np.float32(5743.8794), np.float32(5554.358)]
2025-09-11 05:34:51,305 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:34:51,308 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 2 hours, 22 minutes)
2025-09-11 05:37:37,154 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:37:52,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5708.69580 ± 35.470
2025-09-11 05:37:52,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5703.072), np.float32(5732.8066), np.float32(5758.002), np.float32(5649.56), np.float32(5742.3423), np.float32(5732.0854), np.float32(5722.6885), np.float32(5660.392), np.float32(5667.124), np.float32(5718.8823)]
2025-09-11 05:37:52,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:37:52,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5708.70) for latency 6
2025-09-11 05:37:52,366 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 2 hours, 18 minutes, 57 seconds)
2025-09-11 05:40:38,248 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:40:53,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5542.79736 ± 197.150
2025-09-11 05:40:53,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5534.218), np.float32(5644.912), np.float32(5784.9326), np.float32(5407.2236), np.float32(5567.4194), np.float32(5399.8945), np.float32(5214.5327), np.float32(5758.935), np.float32(5808.1636), np.float32(5307.743)]
2025-09-11 05:40:53,486 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:40:53,489 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 2 hours, 15 minutes, 54 seconds)
2025-09-11 05:43:40,810 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:43:56,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5763.09033 ± 78.192
2025-09-11 05:43:56,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5831.03), np.float32(5823.003), np.float32(5798.4478), np.float32(5806.6143), np.float32(5844.057), np.float32(5645.44), np.float32(5751.907), np.float32(5592.953), np.float32(5780.088), np.float32(5757.363)]
2025-09-11 05:43:56,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:43:56,070 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5763.09) for latency 6
2025-09-11 05:43:56,073 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 2 hours, 13 minutes, 7 seconds)
2025-09-11 05:46:42,395 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:46:57,567 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5512.49316 ± 217.435
2025-09-11 05:46:57,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5782.781), np.float32(5714.181), np.float32(5630.475), np.float32(5646.069), np.float32(5678.5835), np.float32(5298.188), np.float32(5385.673), np.float32(5608.3105), np.float32(5088.0215), np.float32(5292.652)]
2025-09-11 05:46:57,568 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:46:57,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 2 hours, 10 minutes, 8 seconds)
2025-09-11 05:49:43,820 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:49:58,967 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5714.41602 ± 59.934
2025-09-11 05:49:58,967 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5781.2773), np.float32(5728.104), np.float32(5765.8125), np.float32(5723.4775), np.float32(5706.2803), np.float32(5769.219), np.float32(5569.125), np.float32(5654.744), np.float32(5705.0303), np.float32(5741.089)]
2025-09-11 05:49:58,967 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:49:58,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 2 hours, 7 minutes, 4 seconds)
2025-09-11 05:52:44,990 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:53:00,200 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5737.49316 ± 46.975
2025-09-11 05:53:00,200 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5760.5933), np.float32(5819.194), np.float32(5768.8423), np.float32(5683.3633), np.float32(5677.6353), np.float32(5767.166), np.float32(5761.6055), np.float32(5664.892), np.float32(5753.4766), np.float32(5718.164)]
2025-09-11 05:53:00,200 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:53:00,204 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 2 hours, 4 minutes, 4 seconds)
2025-09-11 05:55:46,237 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:56:01,465 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5484.24707 ± 651.997
2025-09-11 05:56:01,465 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5629.9956), np.float32(5563.0103), np.float32(5727.676), np.float32(5754.064), np.float32(3561.9197), np.float32(5445.912), np.float32(5726.2446), np.float32(5731.5493), np.float32(5796.8657), np.float32(5905.234)]
2025-09-11 05:56:01,465 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:56:01,468 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 2 hours, 1 minute, 3 seconds)
2025-09-11 05:58:47,528 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 05:59:02,749 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5729.20703 ± 82.527
2025-09-11 05:59:02,750 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5805.7866), np.float32(5714.1875), np.float32(5668.5513), np.float32(5761.2), np.float32(5760.634), np.float32(5760.248), np.float32(5750.037), np.float32(5790.571), np.float32(5507.3857), np.float32(5773.468)]
2025-09-11 05:59:02,750 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 05:59:02,753 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 1 hour, 57 minutes, 52 seconds)
2025-09-11 06:01:48,727 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:02:03,929 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5443.68115 ± 883.824
2025-09-11 06:02:03,929 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5780.6226), np.float32(5726.848), np.float32(5760.3013), np.float32(5654.23), np.float32(2794.8196), np.float32(5720.4126), np.float32(5769.817), np.float32(5688.578), np.float32(5774.264), np.float32(5766.917)]
2025-09-11 06:02:03,929 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:02:03,933 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 1 hour, 54 minutes, 48 seconds)
2025-09-11 06:04:51,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:05:06,266 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5578.35645 ± 85.878
2025-09-11 06:05:06,267 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5692.2773), np.float32(5417.5347), np.float32(5734.664), np.float32(5579.979), np.float32(5532.7), np.float32(5620.6763), np.float32(5567.8857), np.float32(5567.6953), np.float32(5571.3267), np.float32(5498.8213)]
2025-09-11 06:05:06,267 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:05:06,270 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 1 hour, 51 minutes, 54 seconds)
2025-09-11 06:07:52,281 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:08:07,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5785.05420 ± 105.056
2025-09-11 06:08:07,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5869.3296), np.float32(5634.5767), np.float32(5589.4634), np.float32(5879.787), np.float32(5769.6675), np.float32(5717.351), np.float32(5794.2075), np.float32(5933.264), np.float32(5864.7427), np.float32(5798.1523)]
2025-09-11 06:08:07,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:08:07,638 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5785.05) for latency 6
2025-09-11 06:08:07,642 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 1 hour, 48 minutes, 53 seconds)
2025-09-11 06:10:53,853 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:11:09,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5224.04834 ± 611.530
2025-09-11 06:11:09,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(3448.4917), np.float32(5175.8057), np.float32(5509.7256), np.float32(5474.3896), np.float32(5171.77), np.float32(5649.63), np.float32(5429.309), np.float32(5284.9263), np.float32(5565.826), np.float32(5530.6084)]
2025-09-11 06:11:09,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:11:09,049 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 1 hour, 45 minutes, 53 seconds)
2025-09-11 06:13:55,026 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:14:10,300 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5741.59619 ± 114.788
2025-09-11 06:14:10,300 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5781.059), np.float32(5627.1064), np.float32(5813.834), np.float32(5779.7495), np.float32(5811.8677), np.float32(5742.585), np.float32(5441.541), np.float32(5762.947), np.float32(5817.8555), np.float32(5837.412)]
2025-09-11 06:14:10,300 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:14:10,304 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 1 hour, 42 minutes, 51 seconds)
2025-09-11 06:16:56,624 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:17:11,836 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5757.31543 ± 44.824
2025-09-11 06:17:11,837 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5787.2373), np.float32(5766.3545), np.float32(5671.127), np.float32(5810.2446), np.float32(5762.6978), np.float32(5689.8687), np.float32(5756.1143), np.float32(5734.0083), np.float32(5813.433), np.float32(5782.0684)]
2025-09-11 06:17:11,837 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:17:11,840 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 1 hour, 39 minutes, 52 seconds)
2025-09-11 06:19:57,725 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:20:13,009 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5627.88232 ± 240.288
2025-09-11 06:20:13,010 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5469.546), np.float32(5539.8774), np.float32(5824.413), np.float32(6037.202), np.float32(5279.1367), np.float32(5668.635), np.float32(5224.8403), np.float32(5761.28), np.float32(5811.4097), np.float32(5662.483)]
2025-09-11 06:20:13,010 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:20:13,013 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 1 hour, 36 minutes, 43 seconds)
2025-09-11 06:22:58,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:23:13,917 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5672.55518 ± 102.099
2025-09-11 06:23:13,917 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5673.413), np.float32(5672.302), np.float32(5549.1743), np.float32(5740.711), np.float32(5763.9604), np.float32(5697.5435), np.float32(5830.564), np.float32(5538.593), np.float32(5509.6465), np.float32(5749.6436)]
2025-09-11 06:23:13,917 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:23:13,921 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 1 hour, 33 minutes, 38 seconds)
2025-09-11 06:25:59,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:26:15,033 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5553.81152 ± 167.929
2025-09-11 06:26:15,033 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5701.985), np.float32(5437.085), np.float32(5574.222), np.float32(5220.6616), np.float32(5783.969), np.float32(5613.797), np.float32(5331.0596), np.float32(5697.768), np.float32(5535.6084), np.float32(5641.96)]
2025-09-11 06:26:15,033 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:26:15,036 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 1 hour, 30 minutes, 35 seconds)
2025-09-11 06:29:01,600 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:29:16,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5417.85400 ± 303.486
2025-09-11 06:29:16,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5716.0054), np.float32(4669.0845), np.float32(5567.8555), np.float32(5554.8867), np.float32(5233.5415), np.float32(5368.2363), np.float32(5777.5073), np.float32(5224.6514), np.float32(5547.5586), np.float32(5519.213)]
2025-09-11 06:29:16,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:29:16,838 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 1 hour, 27 minutes, 37 seconds)
2025-09-11 06:32:03,021 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:32:18,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5814.92432 ± 104.447
2025-09-11 06:32:18,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5669.1313), np.float32(5841.0938), np.float32(5638.215), np.float32(5699.5117), np.float32(5836.4507), np.float32(5824.284), np.float32(5913.8867), np.float32(5876.334), np.float32(5970.899), np.float32(5879.4365)]
2025-09-11 06:32:18,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:32:18,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5814.92) for latency 6
2025-09-11 06:32:18,220 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 1 hour, 24 minutes, 35 seconds)
2025-09-11 06:35:04,539 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:35:19,708 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5640.30469 ± 193.567
2025-09-11 06:35:19,708 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5894.322), np.float32(5558.934), np.float32(5512.5337), np.float32(5372.6772), np.float32(5805.153), np.float32(5909.5674), np.float32(5383.9), np.float32(5636.593), np.float32(5508.217), np.float32(5821.1465)]
2025-09-11 06:35:19,708 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:35:19,712 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 1 hour, 21 minutes, 36 seconds)
2025-09-11 06:38:05,247 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:38:20,449 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5928.64697 ± 57.137
2025-09-11 06:38:20,449 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5872.124), np.float32(6010.344), np.float32(5897.6147), np.float32(5819.8237), np.float32(5979.2524), np.float32(5911.0537), np.float32(5998.032), np.float32(5944.2676), np.float32(5958.8955), np.float32(5895.063)]
2025-09-11 06:38:20,449 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:38:20,449 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5928.65) for latency 6
2025-09-11 06:38:20,453 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 1 hour, 18 minutes, 33 seconds)
2025-09-11 06:41:06,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:41:21,581 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5731.24561 ± 202.236
2025-09-11 06:41:21,581 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5841.566), np.float32(5839.933), np.float32(6000.7134), np.float32(5427.9727), np.float32(5788.704), np.float32(5304.034), np.float32(5664.108), np.float32(5885.6646), np.float32(5807.542), np.float32(5752.219)]
2025-09-11 06:41:21,581 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:41:21,585 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 1 hour, 15 minutes, 32 seconds)
2025-09-11 06:44:07,869 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:44:23,149 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5576.58984 ± 288.735
2025-09-11 06:44:23,149 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4836.369), np.float32(5635.287), np.float32(5686.797), np.float32(5730.532), np.float32(5843.361), np.float32(5436.3813), np.float32(5384.654), np.float32(5778.013), np.float32(5856.273), np.float32(5578.229)]
2025-09-11 06:44:23,149 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:44:23,153 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 1 hour, 12 minutes, 30 seconds)
2025-09-11 06:47:09,185 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:47:24,364 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5734.91016 ± 122.743
2025-09-11 06:47:24,364 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5779.167), np.float32(5577.9775), np.float32(5707.597), np.float32(5825.184), np.float32(5761.481), np.float32(5755.104), np.float32(5761.739), np.float32(5954.272), np.float32(5746.781), np.float32(5479.7964)]
2025-09-11 06:47:24,364 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:47:24,368 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 1 hour, 9 minutes, 28 seconds)
2025-09-11 06:50:10,022 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:50:25,213 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5801.08887 ± 197.762
2025-09-11 06:50:25,213 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5989.0693), np.float32(5732.4277), np.float32(5942.1436), np.float32(5960.9966), np.float32(5895.339), np.float32(5827.8125), np.float32(5309.9233), np.float32(5870.2983), np.float32(5595.0986), np.float32(5887.785)]
2025-09-11 06:50:25,213 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:50:25,218 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 1 hour, 6 minutes, 24 seconds)
2025-09-11 06:53:13,247 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:53:28,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5519.92578 ± 596.143
2025-09-11 06:53:28,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5706.533), np.float32(5767.0254), np.float32(5848.984), np.float32(5635.1143), np.float32(5861.5557), np.float32(5873.335), np.float32(5841.6567), np.float32(3831.2988), np.float32(5660.409), np.float32(5173.348)]
2025-09-11 06:53:28,515 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:53:28,519 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 3 minutes, 33 seconds)
2025-09-11 06:56:14,332 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:56:29,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5896.92139 ± 58.955
2025-09-11 06:56:29,533 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5774.436), np.float32(5940.1836), np.float32(5871.738), np.float32(5950.2446), np.float32(5906.26), np.float32(5889.5103), np.float32(5956.533), np.float32(5890.7188), np.float32(5821.782), np.float32(5967.808)]
2025-09-11 06:56:29,534 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:56:29,538 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 31 seconds)
2025-09-11 06:59:15,645 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 06:59:30,859 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5861.01807 ± 107.776
2025-09-11 06:59:30,860 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5789.4263), np.float32(5758.358), np.float32(5900.074), np.float32(5978.7744), np.float32(5995.5503), np.float32(5948.5493), np.float32(5705.64), np.float32(5886.645), np.float32(5697.609), np.float32(5949.5537)]
2025-09-11 06:59:30,860 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 06:59:30,864 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 57 minutes, 29 seconds)
2025-09-11 07:02:16,906 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:02:32,117 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5953.00586 ± 55.095
2025-09-11 07:02:32,117 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5938.295), np.float32(5912.588), np.float32(6020.4546), np.float32(5924.5186), np.float32(5974.132), np.float32(5989.2007), np.float32(5917.4126), np.float32(5920.362), np.float32(6063.215), np.float32(5869.884)]
2025-09-11 07:02:32,117 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:02:32,117 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5953.01) for latency 6
2025-09-11 07:02:32,121 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 54 minutes, 27 seconds)
2025-09-11 07:05:18,567 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:05:33,746 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5635.11523 ± 200.167
2025-09-11 07:05:33,747 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5897.7915), np.float32(5796.518), np.float32(5611.815), np.float32(5158.0625), np.float32(5458.13), np.float32(5747.5376), np.float32(5525.9937), np.float32(5695.669), np.float32(5733.138), np.float32(5726.499)]
2025-09-11 07:05:33,747 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:05:33,751 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 51 minutes, 29 seconds)
2025-09-11 07:08:20,127 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:08:35,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5968.66895 ± 105.583
2025-09-11 07:08:35,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5674.5405), np.float32(6019.0117), np.float32(5959.8354), np.float32(6064.4004), np.float32(5994.214), np.float32(5922.708), np.float32(6030.268), np.float32(5984.8286), np.float32(6045.023), np.float32(5991.8584)]
2025-09-11 07:08:35,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:08:35,408 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5968.67) for latency 6
2025-09-11 07:08:35,413 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 48 minutes, 22 seconds)
2025-09-11 07:11:21,603 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:11:36,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5732.52783 ± 97.440
2025-09-11 07:11:36,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5654.6655), np.float32(5829.7446), np.float32(5739.03), np.float32(5748.0864), np.float32(5571.8535), np.float32(5658.2183), np.float32(5885.748), np.float32(5785.329), np.float32(5829.906), np.float32(5622.6943)]
2025-09-11 07:11:36,825 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:11:36,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 45 minutes, 21 seconds)
2025-09-11 07:14:23,041 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:14:38,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5808.64062 ± 141.570
2025-09-11 07:14:38,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5689.684), np.float32(5887.65), np.float32(5652.3555), np.float32(5924.3545), np.float32(5937.9854), np.float32(5597.8706), np.float32(5896.1406), np.float32(5970.411), np.float32(5617.6323), np.float32(5912.3228)]
2025-09-11 07:14:38,286 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:14:38,291 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 42 minutes, 20 seconds)
2025-09-11 07:17:24,647 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:17:39,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5541.27002 ± 1116.373
2025-09-11 07:17:39,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5833.576), np.float32(5943.092), np.float32(5804.668), np.float32(2198.1467), np.float32(5964.329), np.float32(5939.227), np.float32(5928.715), np.float32(5841.2046), np.float32(5916.737), np.float32(6042.9995)]
2025-09-11 07:17:39,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:17:39,873 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 39 minutes, 20 seconds)
2025-09-11 07:20:26,476 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:20:41,733 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5831.17432 ± 120.261
2025-09-11 07:20:41,733 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5891.7163), np.float32(5762.263), np.float32(5851.701), np.float32(6095.853), np.float32(5800.955), np.float32(5926.1743), np.float32(5626.2764), np.float32(5826.1123), np.float32(5714.7407), np.float32(5815.9443)]
2025-09-11 07:20:41,733 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:20:41,738 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 36 minutes, 19 seconds)
2025-09-11 07:23:28,232 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:23:43,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5947.33008 ± 91.195
2025-09-11 07:23:43,443 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5999.998), np.float32(5793.5073), np.float32(5981.1987), np.float32(5992.0444), np.float32(5993.0166), np.float32(6017.027), np.float32(5759.821), np.float32(5904.125), np.float32(6023.51), np.float32(6009.0513)]
2025-09-11 07:23:43,444 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:23:43,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 33 minutes, 17 seconds)
2025-09-11 07:26:29,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:26:44,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5427.92871 ± 872.713
2025-09-11 07:26:44,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5633.249), np.float32(5834.297), np.float32(5934.405), np.float32(5818.956), np.float32(5635.2285), np.float32(5721.122), np.float32(5571.3486), np.float32(5760.78), np.float32(2834.1577), np.float32(5535.743)]
2025-09-11 07:26:44,940 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:26:44,947 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 30 minutes, 16 seconds)
2025-09-11 07:29:31,597 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:29:46,819 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5936.12744 ± 93.788
2025-09-11 07:29:46,819 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5975.9927), np.float32(5930.889), np.float32(5935.6045), np.float32(5921.686), np.float32(6023.1245), np.float32(6062.2266), np.float32(5821.326), np.float32(6031.3687), np.float32(5735.48), np.float32(5923.5747)]
2025-09-11 07:29:46,819 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:29:46,824 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 27 minutes, 15 seconds)
2025-09-11 07:32:33,077 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:32:48,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5811.40186 ± 112.571
2025-09-11 07:32:48,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5762.5767), np.float32(5894.8154), np.float32(5847.489), np.float32(5621.8022), np.float32(5639.042), np.float32(5840.11), np.float32(5725.2373), np.float32(5947.626), np.float32(5940.4097), np.float32(5894.91)]
2025-09-11 07:32:48,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:32:48,333 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 24 minutes, 13 seconds)
2025-09-11 07:35:34,321 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:35:49,536 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5483.70020 ± 196.776
2025-09-11 07:35:49,536 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5814.131), np.float32(5665.7373), np.float32(5552.778), np.float32(5241.5576), np.float32(5171.911), np.float32(5237.652), np.float32(5529.472), np.float32(5454.647), np.float32(5569.6323), np.float32(5599.4844)]
2025-09-11 07:35:49,536 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:35:49,541 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 21 minutes, 10 seconds)
2025-09-11 07:38:35,672 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:38:50,843 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5874.57520 ± 131.257
2025-09-11 07:38:50,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5709.3027), np.float32(6060.4595), np.float32(5644.014), np.float32(5959.6865), np.float32(6002.056), np.float32(5846.349), np.float32(5924.698), np.float32(6004.544), np.float32(5768.8228), np.float32(5825.822)]
2025-09-11 07:38:50,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:38:50,852 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 18 minutes, 8 seconds)
2025-09-11 07:41:36,670 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:41:51,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5483.83545 ± 173.556
2025-09-11 07:41:51,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5582.89), np.float32(5439.2773), np.float32(5564.0283), np.float32(5721.919), np.float32(5229.039), np.float32(5376.989), np.float32(5504.8584), np.float32(5235.864), np.float32(5771.673), np.float32(5411.8154)]
2025-09-11 07:41:51,868 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:41:51,873 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 15 minutes, 6 seconds)
2025-09-11 07:44:37,859 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:44:53,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5945.45410 ± 55.153
2025-09-11 07:44:53,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5892.0527), np.float32(6002.8564), np.float32(5940.6416), np.float32(5858.8516), np.float32(5991.0996), np.float32(5970.01), np.float32(5858.116), np.float32(6019.6963), np.float32(5945.4), np.float32(5975.819)]
2025-09-11 07:44:53,045 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:44:53,050 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 12 minutes, 4 seconds)
2025-09-11 07:47:38,968 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:47:54,174 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5807.23682 ± 108.313
2025-09-11 07:47:54,174 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5740.0273), np.float32(5821.3813), np.float32(5862.378), np.float32(5715.473), np.float32(5899.2515), np.float32(5840.4824), np.float32(5889.93), np.float32(5848.5024), np.float32(5539.9844), np.float32(5914.952)]
2025-09-11 07:47:54,174 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:47:54,179 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 9 minutes, 3 seconds)
2025-09-11 07:50:40,241 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:50:55,461 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5709.31152 ± 214.319
2025-09-11 07:50:55,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5306.4165), np.float32(5768.9854), np.float32(5724.9824), np.float32(5940.737), np.float32(5848.89), np.float32(5563.9077), np.float32(5414.4185), np.float32(5915.895), np.float32(5644.326), np.float32(5964.5566)]
2025-09-11 07:50:55,462 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:50:55,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 6 minutes, 2 seconds)
2025-09-11 07:53:41,867 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:53:57,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5734.43750 ± 121.271
2025-09-11 07:53:57,067 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5819.296), np.float32(5784.0596), np.float32(5866.095), np.float32(5723.4697), np.float32(5787.119), np.float32(5855.687), np.float32(5446.4243), np.float32(5721.093), np.float32(5595.7886), np.float32(5745.3447)]
2025-09-11 07:53:57,068 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:53:57,073 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 3 minutes, 1 second)
2025-09-11 07:56:43,231 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 6...
2025-09-11 07:56:58,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6027.12549 ± 50.790
2025-09-11 07:56:58,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6071.0225), np.float32(6023.995), np.float32(6026.0615), np.float32(6022.2515), np.float32(6069.161), np.float32(6011.851), np.float32(5998.091), np.float32(6070.561), np.float32(6079.179), np.float32(5899.082)]
2025-09-11 07:56:58,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-11 07:56:58,484 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6027.13) for latency 6
2025-09-11 07:56:58,490 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1251 [DEBUG]: Training session finished
