2025-09-14 08:43:01,587 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1108 [DEBUG]: logdir: _logs/noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_3
2025-09-14 08:43:01,587 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1109 [DEBUG]: trainer_prefix: noise-eval-v2/halfcheetah/bpql-noise_0.000-delay_3
2025-09-14 08:43:01,588 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1110 [DEBUG]: args.trainer_eval_latencies: {'3': <latency_env.delayed_mdp.ConstantDelay object at 0x7ff478397920>}
2025-09-14 08:43:01,588 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1111 [DEBUG]: using device: cpu
2025-09-14 08:43:01,591 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1133 [INFO]: Creating new trainer
2025-09-14 08:43:01,711 baseline-bpql-halfcheetah:113 [DEBUG]: pi network:
NNGaussianPolicy(
  (common_head): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=35, out_features=256, bias=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
  )
  (mu_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (log_std_head): Sequential(
    (0): Linear(in_features=256, out_features=6, bias=True)
    (1): Unflatten(dim=1, unflattened_size=(6,))
  )
  (tanh_refit): NNTanhRefit(scale: tensor([[2., 2., 2., 2., 2., 2.]]), shift: tensor([[-1., -1., -1., -1., -1., -1.]]))
)
2025-09-14 08:43:01,711 baseline-bpql-halfcheetah:114 [DEBUG]: q network:
NNLayerConcat2(
  dim: -1
  (next): Sequential(
    (0): Linear(in_features=23, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=1, bias=True)
    (5): NNLayerSqueeze(dim: -1)
  )
  (init_left): Flatten(start_dim=1, end_dim=-1)
  (init_right): Flatten(start_dim=1, end_dim=-1)
)
2025-09-14 08:43:03,291 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1194 [DEBUG]: Starting training session...
2025-09-14 08:43:03,291 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 1/100
2025-09-14 08:47:21,296 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 08:47:26,594 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -183.23561 ± 139.368
2025-09-14 08:47:26,595 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(-420.14053), np.float32(-344.07654), np.float32(-6.996011), np.float32(-168.79723), np.float32(-254.1713), np.float32(-78.29726), np.float32(-23.34764), np.float32(-340.6486), np.float32(-93.71882), np.float32(-102.16206)]
2025-09-14 08:47:26,595 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:47:26,595 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-183.24) for latency 3
2025-09-14 08:47:26,597 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 2/100 (estimated time remaining: 7 hours, 14 minutes, 27 seconds)
2025-09-14 08:51:11,734 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 08:51:16,900 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: -102.96818 ± 197.866
2025-09-14 08:51:16,900 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(160.91055), np.float32(121.84497), np.float32(55.9481), np.float32(131.22874), np.float32(-212.04022), np.float32(-175.24057), np.float32(-136.6898), np.float32(-437.57263), np.float32(-214.9039), np.float32(-323.16702)]
2025-09-14 08:51:16,900 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:51:16,900 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (-102.97) for latency 3
2025-09-14 08:51:16,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 3/100 (estimated time remaining: 6 hours, 43 minutes, 6 seconds)
2025-09-14 08:54:52,400 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 08:54:57,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 538.33862 ± 207.262
2025-09-14 08:54:57,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(404.01968), np.float32(444.91797), np.float32(750.0212), np.float32(490.04944), np.float32(688.3807), np.float32(245.28424), np.float32(877.69727), np.float32(419.78024), np.float32(288.75916), np.float32(774.4762)]
2025-09-14 08:54:57,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:54:57,571 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (538.34) for latency 3
2025-09-14 08:54:57,573 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 4/100 (estimated time remaining: 6 hours, 24 minutes, 55 seconds)
2025-09-14 08:59:21,358 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 08:59:26,308 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1972.44177 ± 203.675
2025-09-14 08:59:26,308 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1991.455), np.float32(2205.519), np.float32(2059.1924), np.float32(1777.4033), np.float32(1943.8098), np.float32(2133.7593), np.float32(2106.3767), np.float32(2012.2089), np.float32(1458.4829), np.float32(2036.2119)]
2025-09-14 08:59:26,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 08:59:26,309 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (1972.44) for latency 3
2025-09-14 08:59:26,312 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 5/100 (estimated time remaining: 6 hours, 33 minutes, 12 seconds)
2025-09-14 09:05:06,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:05:11,383 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3163.84326 ± 939.454
2025-09-14 09:05:11,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1959.2429), np.float32(3929.7534), np.float32(2698.0437), np.float32(3754.3943), np.float32(3753.1145), np.float32(4059.2207), np.float32(3057.0864), np.float32(3842.5935), np.float32(3534.0188), np.float32(1050.964)]
2025-09-14 09:05:11,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:05:11,384 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3163.84) for latency 3
2025-09-14 09:05:11,386 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 6/100 (estimated time remaining: 7 hours, 33 seconds)
2025-09-14 09:09:07,503 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:09:12,697 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 1784.90002 ± 957.938
2025-09-14 09:09:12,697 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(1099.636), np.float32(3132.453), np.float32(1754.9496), np.float32(1273.1638), np.float32(999.2274), np.float32(2982.1584), np.float32(736.90155), np.float32(3356.8105), np.float32(1731.5175), np.float32(782.1828)]
2025-09-14 09:09:12,697 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:09:12,700 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 7/100 (estimated time remaining: 6 hours, 49 minutes, 14 seconds)
2025-09-14 09:12:45,467 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:12:50,656 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 3565.23560 ± 1215.745
2025-09-14 09:12:50,656 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4189.8516), np.float32(3681.767), np.float32(4380.9214), np.float32(1434.0853), np.float32(4290.697), np.float32(4180.0903), np.float32(906.5951), np.float32(4182.4844), np.float32(4238.0947), np.float32(4167.7705)]
2025-09-14 09:12:50,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:12:50,657 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (3565.24) for latency 3
2025-09-14 09:12:50,659 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 8/100 (estimated time remaining: 6 hours, 41 minutes, 3 seconds)
2025-09-14 09:18:35,918 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:18:41,099 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4870.62598 ± 60.403
2025-09-14 09:18:41,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4804.94), np.float32(4856.317), np.float32(4856.5527), np.float32(4825.521), np.float32(4870.4453), np.float32(4795.7085), np.float32(4878.509), np.float32(4907.761), np.float32(4890.613), np.float32(5019.889)]
2025-09-14 09:18:41,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:18:41,100 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (4870.63) for latency 3
2025-09-14 09:18:41,102 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 9/100 (estimated time remaining: 7 hours, 16 minutes, 32 seconds)
2025-09-14 09:22:59,587 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:23:04,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4858.83350 ± 122.693
2025-09-14 09:23:04,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(4775.6733), np.float32(5003.173), np.float32(4777.7783), np.float32(4764.821), np.float32(5108.873), np.float32(4948.263), np.float32(4723.719), np.float32(4812.9653), np.float32(4924.5913), np.float32(4748.4824)]
2025-09-14 09:23:04,715 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:23:04,718 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 10/100 (estimated time remaining: 7 hours, 10 minutes, 14 seconds)
2025-09-14 09:28:59,477 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:29:04,580 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5219.57080 ± 67.449
2025-09-14 09:29:04,580 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5155.965), np.float32(5232.029), np.float32(5329.1978), np.float32(5130.0874), np.float32(5148.6465), np.float32(5315.727), np.float32(5245.723), np.float32(5270.773), np.float32(5207.482), np.float32(5160.079)]
2025-09-14 09:29:04,580 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:29:04,580 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5219.57) for latency 3
2025-09-14 09:29:04,582 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 11/100 (estimated time remaining: 7 hours, 9 minutes, 57 seconds)
2025-09-14 09:34:25,838 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:34:31,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 4603.00488 ± 1173.740
2025-09-14 09:34:31,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5352.727), np.float32(5147.008), np.float32(5143.0234), np.float32(5328.5366), np.float32(4021.5928), np.float32(1459.1052), np.float32(4857.638), np.float32(5417.355), np.float32(5408.4253), np.float32(3894.642)]
2025-09-14 09:34:31,012 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:34:31,015 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 12/100 (estimated time remaining: 7 hours, 30 minutes, 25 seconds)
2025-09-14 09:38:19,967 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:38:25,119 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5407.45410 ± 67.066
2025-09-14 09:38:25,120 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5409.252), np.float32(5420.234), np.float32(5233.0464), np.float32(5482.201), np.float32(5442.954), np.float32(5377.4263), np.float32(5466.014), np.float32(5455.5605), np.float32(5377.5693), np.float32(5410.2793)]
2025-09-14 09:38:25,120 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:38:25,120 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5407.45) for latency 3
2025-09-14 09:38:25,123 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 13/100 (estimated time remaining: 7 hours, 30 minutes, 6 seconds)
2025-09-14 09:42:35,522 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:42:40,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5446.46973 ± 101.659
2025-09-14 09:42:40,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5317.0986), np.float32(5599.6445), np.float32(5371.5273), np.float32(5446.5415), np.float32(5311.3765), np.float32(5521.19), np.float32(5544.3286), np.float32(5352.1597), np.float32(5568.993), np.float32(5431.8384)]
2025-09-14 09:42:40,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:42:40,622 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5446.47) for latency 3
2025-09-14 09:42:40,624 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 14/100 (estimated time remaining: 6 hours, 57 minutes, 27 seconds)
2025-09-14 09:46:56,174 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:47:01,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5379.85693 ± 83.097
2025-09-14 09:47:01,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5339.7935), np.float32(5546.395), np.float32(5341.8477), np.float32(5236.958), np.float32(5458.1216), np.float32(5431.372), np.float32(5404.927), np.float32(5316.269), np.float32(5318.3335), np.float32(5404.554)]
2025-09-14 09:47:01,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:47:01,366 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 15/100 (estimated time remaining: 6 hours, 51 minutes, 50 seconds)
2025-09-14 09:52:23,481 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:52:28,633 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5744.84131 ± 115.415
2025-09-14 09:52:28,633 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5833.568), np.float32(5594.4146), np.float32(5689.482), np.float32(5766.666), np.float32(5795.1987), np.float32(5518.193), np.float32(5807.3066), np.float32(5797.9844), np.float32(5937.646), np.float32(5707.955)]
2025-09-14 09:52:28,633 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:52:28,633 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5744.84) for latency 3
2025-09-14 09:52:28,637 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 16/100 (estimated time remaining: 6 hours, 37 minutes, 48 seconds)
2025-09-14 09:59:13,388 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 09:59:18,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5653.36035 ± 60.047
2025-09-14 09:59:18,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5780.861), np.float32(5648.129), np.float32(5645.19), np.float32(5584.7725), np.float32(5748.2495), np.float32(5601.3403), np.float32(5634.096), np.float32(5607.5938), np.float32(5625.0586), np.float32(5658.3086)]
2025-09-14 09:59:18,433 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 09:59:18,438 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 17/100 (estimated time remaining: 6 hours, 56 minutes, 28 seconds)
2025-09-14 10:04:08,197 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:04:13,338 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5845.96240 ± 166.525
2025-09-14 10:04:13,339 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6003.4224), np.float32(5889.2314), np.float32(5944.686), np.float32(5926.332), np.float32(5408.442), np.float32(5774.5327), np.float32(5947.266), np.float32(5812.464), np.float32(5765.6196), np.float32(5987.63)]
2025-09-14 10:04:13,339 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:04:13,339 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5845.96) for latency 3
2025-09-14 10:04:13,342 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 18/100 (estimated time remaining: 7 hours, 8 minutes, 20 seconds)
2025-09-14 10:08:45,255 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:08:50,304 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5451.68311 ± 946.229
2025-09-14 10:08:50,304 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5913.0186), np.float32(5915.8706), np.float32(5631.2266), np.float32(5944.9536), np.float32(5992.3623), np.float32(5385.9814), np.float32(2666.3103), np.float32(5794.898), np.float32(5702.6704), np.float32(5569.5415)]
2025-09-14 10:08:50,304 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:08:50,307 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 19/100 (estimated time remaining: 7 hours, 9 minutes, 2 seconds)
2025-09-14 10:12:25,023 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:12:30,128 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5913.11572 ± 311.245
2025-09-14 10:12:30,128 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5977.709), np.float32(6054.9136), np.float32(5963.6626), np.float32(6123.8833), np.float32(5013.3984), np.float32(5929.4453), np.float32(6140.4277), np.float32(6001.1006), np.float32(6071.167), np.float32(5855.448)]
2025-09-14 10:12:30,128 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:12:30,128 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (5913.12) for latency 3
2025-09-14 10:12:30,132 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 20/100 (estimated time remaining: 6 hours, 52 minutes, 46 seconds)
2025-09-14 10:17:32,843 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:17:37,864 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6005.22412 ± 174.910
2025-09-14 10:17:37,864 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6155.5522), np.float32(5851.3047), np.float32(5752.1094), np.float32(6167.038), np.float32(5933.117), np.float32(5937.996), np.float32(5890.0723), np.float32(5915.5693), np.float32(6079.8237), np.float32(6369.658)]
2025-09-14 10:17:37,864 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:17:37,864 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6005.22) for latency 3
2025-09-14 10:17:37,867 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 21/100 (estimated time remaining: 6 hours, 42 minutes, 27 seconds)
2025-09-14 10:22:08,303 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:22:13,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6059.92725 ± 185.513
2025-09-14 10:22:13,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(5624.935), np.float32(6083.6904), np.float32(6151.4976), np.float32(6188.1206), np.float32(5872.9463), np.float32(6025.912), np.float32(6335.4644), np.float32(6067.1094), np.float32(6057.561), np.float32(6192.0366)]
2025-09-14 10:22:13,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:22:13,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6059.93) for latency 3
2025-09-14 10:22:13,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 22/100 (estimated time remaining: 6 hours, 2 minutes, 3 seconds)
2025-09-14 10:28:10,838 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:28:15,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6214.14648 ± 135.786
2025-09-14 10:28:15,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6067.102), np.float32(5993.2905), np.float32(6369.5728), np.float32(6283.767), np.float32(5988.9688), np.float32(6220.5176), np.float32(6325.336), np.float32(6273.4194), np.float32(6315.869), np.float32(6303.6216)]
2025-09-14 10:28:15,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:28:15,902 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6214.15) for latency 3
2025-09-14 10:28:15,905 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 23/100 (estimated time remaining: 6 hours, 15 minutes, 3 seconds)
2025-09-14 10:32:23,814 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:32:28,862 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6250.51855 ± 168.786
2025-09-14 10:32:28,862 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6373.614), np.float32(6324.4644), np.float32(6057.686), np.float32(6293.432), np.float32(6024.5415), np.float32(6559.476), np.float32(6324.2275), np.float32(6381.1074), np.float32(6070.6587), np.float32(6095.977)]
2025-09-14 10:32:28,862 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:32:28,862 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6250.52) for latency 3
2025-09-14 10:32:28,865 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 24/100 (estimated time remaining: 6 hours, 4 minutes, 5 seconds)
2025-09-14 10:37:24,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:37:29,751 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 5202.78418 ± 1218.575
2025-09-14 10:37:29,751 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(2215.3918), np.float32(6135.8228), np.float32(6056.8877), np.float32(6077.4463), np.float32(4883.464), np.float32(4483.2524), np.float32(4250.03), np.float32(6006.0044), np.float32(6306.8994), np.float32(5612.645)]
2025-09-14 10:37:29,751 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:37:29,755 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 25/100 (estimated time remaining: 6 hours, 19 minutes, 54 seconds)
2025-09-14 10:40:50,491 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:40:55,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6456.14307 ± 96.457
2025-09-14 10:40:55,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(6478.809), np.float32(6462.6006), np.float32(6466.664), np.float32(6589.804), np.float32(6558.4097), np.float32(6409.0376), np.float32(6332.4043), np.float32(6563.5146), np.float32(6271.043), np.float32(6429.1396)]
2025-09-14 10:40:55,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 10:40:55,516 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6456.14) for latency 3
2025-09-14 10:40:55,519 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 26/100 (estimated time remaining: 5 hours, 49 minutes, 24 seconds)
2025-09-14 10:46:06,737 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:46:11,713 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6387.48633 ± 162.325
2025-09-14 10:46:11,714 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6145.19, 6476.5674, 6436.8003, 6425.743, 6285.8804, 6182.661, 6526.2324, 6204.6484, 6622.053, 6569.085]
2025-09-14 10:46:11,714 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:46:11,717 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 27/100 (estimated time remaining: 5 hours, 54 minutes, 48 seconds)
2025-09-14 10:50:31,742 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:50:36,832 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6659.24316 ± 101.960
2025-09-14 10:50:36,832 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6544.7446, 6728.114, 6546.4956, 6463.818, 6814.461, 6684.7393, 6727.592, 6726.894, 6693.975, 6661.5903]
2025-09-14 10:50:36,832 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:50:36,832 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6659.24) for latency 3
2025-09-14 10:50:36,835 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 28/100 (estimated time remaining: 5 hours, 26 minutes, 17 seconds)
2025-09-14 10:54:36,365 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:54:41,463 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6150.06543 ± 974.665
2025-09-14 10:54:41,464 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6586.353, 6340.1206, 6633.6025, 6466.0444, 3266.165, 6162.028, 6550.5957, 6649.9043, 6244.8516, 6600.9917]
2025-09-14 10:54:41,464 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:54:41,469 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 29/100 (estimated time remaining: 5 hours, 19 minutes, 49 seconds)
2025-09-14 10:59:17,550 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 10:59:22,731 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6114.01025 ± 1433.255
2025-09-14 10:59:22,731 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6607.039, 1841.5985, 6521.754, 6227.787, 6581.133, 6848.271, 6670.765, 6734.513, 6436.1904, 6671.047]
2025-09-14 10:59:22,731 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 10:59:22,734 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 30/100 (estimated time remaining: 5 hours, 10 minutes, 44 seconds)
2025-09-14 11:04:09,274 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:04:14,316 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6838.07324 ± 126.051
2025-09-14 11:04:14,316 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6980.779, 6942.15, 6587.591, 6790.336, 6980.2876, 6826.635, 6868.44, 6935.233, 6661.9917, 6807.2896]
2025-09-14 11:04:14,316 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:04:14,316 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6838.07) for latency 3
2025-09-14 11:04:14,320 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 31/100 (estimated time remaining: 5 hours, 26 minutes, 23 seconds)
2025-09-14 11:07:58,620 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:08:03,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6949.48926 ± 92.235
2025-09-14 11:08:03,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7022.809, 6935.0054, 6912.4272, 7089.8633, 6925.2534, 6809.8857, 6982.4146, 6786.1533, 7045.904, 6985.173]
2025-09-14 11:08:03,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:08:03,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (6949.49) for latency 3
2025-09-14 11:08:03,760 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 32/100 (estimated time remaining: 5 hours, 1 minute, 46 seconds)
2025-09-14 11:13:16,135 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:13:21,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6669.20947 ± 325.564
2025-09-14 11:13:21,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6810.8247, 6699.26, 5835.4175, 6822.311, 6354.0566, 6788.2754, 6674.7827, 6939.229, 6739.6904, 7028.243]
2025-09-14 11:13:21,209 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:13:21,213 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 33/100 (estimated time remaining: 5 hours, 9 minutes, 15 seconds)
2025-09-14 11:17:12,310 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:17:17,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6911.98193 ± 128.407
2025-09-14 11:17:17,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6725.803, 6845.731, 6809.7373, 7148.421, 6801.129, 7038.264, 6878.569, 6842.891, 6968.413, 7060.86]
2025-09-14 11:17:17,276 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:17:17,280 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 34/100 (estimated time remaining: 5 hours, 2 minutes, 47 seconds)
2025-09-14 11:21:10,632 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:21:15,811 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6761.42725 ± 544.854
2025-09-14 11:21:15,812 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6906.016, 5159.655, 6944.9736, 6731.332, 6946.8716, 6888.5977, 7032.698, 6907.688, 7185.042, 6911.399]
2025-09-14 11:21:15,812 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:21:15,815 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 35/100 (estimated time remaining: 4 hours, 48 minutes, 52 seconds)
2025-09-14 11:26:11,552 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:26:16,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7275.07812 ± 126.064
2025-09-14 11:26:16,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7426.871, 7303.6724, 7101.779, 7171.538, 7310.1587, 7127.6714, 7476.811, 7311.6846, 7140.138, 7380.4604]
2025-09-14 11:26:16,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:26:16,668 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7275.08) for latency 3
2025-09-14 11:26:16,672 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 36/100 (estimated time remaining: 4 hours, 46 minutes, 30 seconds)
2025-09-14 11:30:37,860 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:30:43,030 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7227.35254 ± 176.349
2025-09-14 11:30:43,031 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6929.5815, 6945.9243, 7344.7573, 7169.065, 7537.035, 7385.421, 7253.871, 7221.956, 7286.72, 7199.193]
2025-09-14 11:30:43,031 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:30:43,036 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 37/100 (estimated time remaining: 4 hours, 49 minutes, 58 seconds)
2025-09-14 11:34:36,934 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:34:41,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7002.45605 ± 150.826
2025-09-14 11:34:41,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7043.801, 6939.195, 7075.145, 6706.666, 7078.953, 6981.5054, 6875.118, 6887.542, 7211.5464, 7225.0933]
2025-09-14 11:34:41,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:34:41,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 38/100 (estimated time remaining: 4 hours, 28 minutes, 57 seconds)
2025-09-14 11:38:00,080 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:38:05,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7416.79785 ± 88.654
2025-09-14 11:38:05,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7256.663, 7438.992, 7307.3804, 7501.7476, 7446.4814, 7476.467, 7354.539, 7378.337, 7568.3955, 7438.9717]
2025-09-14 11:38:05,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:38:05,168 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7416.80) for latency 3
2025-09-14 11:38:05,172 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 39/100 (estimated time remaining: 4 hours, 17 minutes, 53 seconds)
2025-09-14 11:42:45,872 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:42:51,054 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7338.77246 ± 69.761
2025-09-14 11:42:51,055 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7376.819, 7428.0244, 7359.698, 7346.5664, 7282.608, 7465.9253, 7265.9116, 7223.6904, 7326.8555, 7311.6323]
2025-09-14 11:42:51,055 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:42:51,059 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 40/100 (estimated time remaining: 4 hours, 23 minutes, 21 seconds)
2025-09-14 11:47:18,567 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:47:23,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7128.82715 ± 159.111
2025-09-14 11:47:23,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7242.7935, 7066.339, 7284.352, 7170.2534, 7133.702, 7282.952, 7064.707, 7074.767, 7246.614, 6721.7905]
2025-09-14 11:47:23,735 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:47:23,740 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 41/100 (estimated time remaining: 4 hours, 13 minutes, 24 seconds)
2025-09-14 11:51:23,242 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:51:28,387 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7386.34473 ± 114.843
2025-09-14 11:51:28,387 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7449.4746, 7516.529, 7496.6304, 7210.046, 7167.4995, 7365.174, 7310.212, 7425.1963, 7441.1353, 7481.5444]
2025-09-14 11:51:28,387 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:51:28,391 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 42/100 (estimated time remaining: 4 hours, 4 minutes, 55 seconds)
2025-09-14 11:57:44,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 11:57:49,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 6724.19238 ± 211.157
2025-09-14 11:57:49,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [6233.3516, 6766.197, 6866.8, 6882.3164, 6544.4375, 6809.255, 6793.7993, 6961.617, 6526.215, 6857.933]
2025-09-14 11:57:49,324 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 11:57:49,328 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 43/100 (estimated time remaining: 4 hours, 28 minutes, 13 seconds)
2025-09-14 12:02:40,733 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:02:45,824 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7463.15527 ± 115.107
2025-09-14 12:02:45,824 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7518.973, 7555.491, 7218.2163, 7306.966, 7587.7305, 7461.756, 7484.117, 7395.2886, 7539.5435, 7563.4688]
2025-09-14 12:02:45,824 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:02:45,824 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7463.16) for latency 3
2025-09-14 12:02:45,829 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 44/100 (estimated time remaining: 4 hours, 41 minutes, 19 seconds)
2025-09-14 12:07:57,788 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:08:02,942 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7391.62500 ± 159.887
2025-09-14 12:08:02,942 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7499.319, 7489.151, 7436.6157, 7475.826, 7023.0283, 7155.81, 7376.0522, 7446.9556, 7552.206, 7461.2925]
2025-09-14 12:08:02,942 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:08:02,946 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 45/100 (estimated time remaining: 4 hours, 42 minutes, 13 seconds)
2025-09-14 12:12:50,372 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:12:55,358 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7414.61475 ± 85.722
2025-09-14 12:12:55,358 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7463.134, 7398.169, 7288.874, 7250.0254, 7502.617, 7482.3735, 7402.9644, 7366.8594, 7488.529, 7502.6025]
2025-09-14 12:12:55,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:12:55,363 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 46/100 (estimated time remaining: 4 hours, 40 minutes, 47 seconds)
2025-09-14 12:17:55,007 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:18:00,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7371.55371 ± 121.980
2025-09-14 12:18:00,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7361.422, 7318.855, 7481.2773, 7160.508, 7565.583, 7254.1475, 7404.6304, 7361.23, 7272.4087, 7535.4766]
2025-09-14 12:18:00,091 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:18:00,095 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 47/100 (estimated time remaining: 4 hours, 46 minutes, 30 seconds)
2025-09-14 12:21:54,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:21:59,158 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7500.15527 ± 93.675
2025-09-14 12:21:59,159 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7440.363, 7623.07, 7527.8374, 7546.4116, 7588.8057, 7543.255, 7331.466, 7349.9854, 7479.4355, 7570.9233]
2025-09-14 12:21:59,159 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:21:59,159 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7500.16) for latency 3
2025-09-14 12:21:59,163 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 48/100 (estimated time remaining: 4 hours, 16 minutes, 8 seconds)
2025-09-14 12:27:34,194 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:27:39,265 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7411.59082 ± 111.415
2025-09-14 12:27:39,265 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7392.924, 7407.391, 7416.239, 7583.9756, 7474.2812, 7409.822, 7435.7017, 7213.336, 7548.2373, 7234.002]
2025-09-14 12:27:39,265 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:27:39,270 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 49/100 (estimated time remaining: 4 hours, 18 minutes, 51 seconds)
2025-09-14 12:32:00,524 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:32:05,721 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7528.73340 ± 77.204
2025-09-14 12:32:05,721 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7580.1875, 7565.6855, 7417.328, 7440.97, 7604.9624, 7624.0938, 7473.647, 7556.6553, 7421.92, 7601.8794]
2025-09-14 12:32:05,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:32:05,722 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7528.73) for latency 3
2025-09-14 12:32:05,726 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 50/100 (estimated time remaining: 4 hours, 5 minutes, 16 seconds)
2025-09-14 12:35:50,969 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:35:55,958 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7512.69629 ± 211.982
2025-09-14 12:35:55,958 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7650.165, 7659.072, 7568.662, 7522.956, 7690.897, 7457.6777, 7438.0464, 7667.061, 6929.0205, 7543.3984]
2025-09-14 12:35:55,958 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:35:55,963 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 51/100 (estimated time remaining: 3 hours, 50 minutes, 5 seconds)
2025-09-14 12:40:35,875 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:40:40,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7688.57812 ± 84.713
2025-09-14 12:40:40,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7762.7104, 7796.1567, 7740.983, 7712.6416, 7713.655, 7585.208, 7772.3564, 7668.063, 7606.0815, 7527.927]
2025-09-14 12:40:40,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:40:40,971 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7688.58) for latency 3
2025-09-14 12:40:40,976 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 52/100 (estimated time remaining: 3 hours, 42 minutes, 16 seconds)
2025-09-14 12:47:22,647 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:47:27,682 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7166.53125 ± 56.767
2025-09-14 12:47:27,683 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7255.3354, 7191.455, 7135.034, 7147.822, 7186.269, 7168.321, 7190.4697, 7040.8164, 7227.1685, 7122.618]
2025-09-14 12:47:27,683 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:47:27,687 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 53/100 (estimated time remaining: 4 hours, 4 minutes, 33 seconds)
2025-09-14 12:51:55,376 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:52:00,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7639.24512 ± 183.368
2025-09-14 12:52:00,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [7237.695, 7643.8267, 7800.928, 7875.9766, 7388.409, 7752.302, 7730.607, 7637.981, 7600.429, 7724.3003]
2025-09-14 12:52:00,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
2025-09-14 12:52:00,452 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 54/100 (estimated time remaining: 3 hours, 48 minutes, 55 seconds)
2025-09-14 12:57:51,888 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 12:57:56,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7420.40771 ± 215.723
2025-09-14 12:57:56,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7370.636), np.float32(7210.061), np.float32(7378.313), np.float32(7624.7803), np.float32(7432.3843), np.float32(7317.0757), np.float32(7639.6465), np.float32(7543.141), np.float32(6963.573), np.float32(7724.4673)]
2025-09-14 12:57:56,829 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 12:57:56,834 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 55/100 (estimated time remaining: 3 hours, 57 minutes, 50 seconds)
2025-09-14 13:01:49,569 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:01:54,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7582.14746 ± 88.149
2025-09-14 13:01:54,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7599.22), np.float32(7479.638), np.float32(7549.6265), np.float32(7587.627), np.float32(7446.041), np.float32(7596.3945), np.float32(7683.1724), np.float32(7635.9478), np.float32(7497.7563), np.float32(7746.0513)]
2025-09-14 13:01:54,667 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:01:54,672 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 56/100 (estimated time remaining: 3 hours, 53 minutes, 48 seconds)
2025-09-14 13:08:30,555 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:08:35,523 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7533.51416 ± 100.446
2025-09-14 13:08:35,523 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7580.7725), np.float32(7325.8813), np.float32(7434.002), np.float32(7589.761), np.float32(7643.981), np.float32(7540.7495), np.float32(7672.9824), np.float32(7575.04), np.float32(7438.369), np.float32(7533.6006)]
2025-09-14 13:08:35,523 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:08:35,528 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 57/100 (estimated time remaining: 4 hours, 5 minutes, 36 seconds)
2025-09-14 13:13:06,898 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:13:12,018 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7633.22266 ± 160.557
2025-09-14 13:13:12,018 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7410.8076), np.float32(7660.286), np.float32(7873.8276), np.float32(7604.7905), np.float32(7698.8403), np.float32(7670.699), np.float32(7714.4087), np.float32(7319.0054), np.float32(7815.1904), np.float32(7564.372)]
2025-09-14 13:13:12,018 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:13:12,023 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 58/100 (estimated time remaining: 3 hours, 41 minutes, 21 seconds)
2025-09-14 13:17:33,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:17:38,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7584.84912 ± 113.933
2025-09-14 13:17:38,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7639.409), np.float32(7683.9253), np.float32(7618.4307), np.float32(7615.509), np.float32(7557.0366), np.float32(7705.629), np.float32(7270.3403), np.float32(7567.1084), np.float32(7587.628), np.float32(7603.475)]
2025-09-14 13:17:38,796 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:17:38,802 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 59/100 (estimated time remaining: 3 hours, 35 minutes, 22 seconds)
2025-09-14 13:23:20,552 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:23:25,594 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7550.50244 ± 102.191
2025-09-14 13:23:25,595 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7366.4326), np.float32(7503.0083), np.float32(7618.966), np.float32(7469.0767), np.float32(7634.0205), np.float32(7667.556), np.float32(7709.6743), np.float32(7578.107), np.float32(7478.7046), np.float32(7479.4756)]
2025-09-14 13:23:25,595 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:23:25,599 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 60/100 (estimated time remaining: 3 hours, 28 minutes, 55 seconds)
2025-09-14 13:28:09,703 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:28:14,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7693.88525 ± 87.727
2025-09-14 13:28:14,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7642.06), np.float32(7616.8057), np.float32(7672.9136), np.float32(7692.9575), np.float32(7795.0425), np.float32(7734.147), np.float32(7773.731), np.float32(7753.569), np.float32(7765.9863), np.float32(7491.6416)]
2025-09-14 13:28:14,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:28:14,816 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7693.89) for latency 3
2025-09-14 13:28:14,821 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 61/100 (estimated time remaining: 3 hours, 30 minutes, 41 seconds)
2025-09-14 13:33:58,810 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:34:03,870 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7613.18115 ± 149.652
2025-09-14 13:34:03,870 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7540.6963), np.float32(7623.8433), np.float32(7693.495), np.float32(7583.4688), np.float32(7591.366), np.float32(7830.715), np.float32(7238.645), np.float32(7755.221), np.float32(7606.719), np.float32(7667.6426)]
2025-09-14 13:34:03,870 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:34:03,875 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 62/100 (estimated time remaining: 3 hours, 18 minutes, 41 seconds)
2025-09-14 13:37:30,930 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:37:35,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7723.58740 ± 65.231
2025-09-14 13:37:35,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7757.8174), np.float32(7746.271), np.float32(7740.4346), np.float32(7798.663), np.float32(7792.1167), np.float32(7711.7427), np.float32(7771.0806), np.float32(7637.2964), np.float32(7583.184), np.float32(7697.2646)]
2025-09-14 13:37:35,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:37:35,957 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7723.59) for latency 3
2025-09-14 13:37:35,962 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 63/100 (estimated time remaining: 3 hours, 5 minutes, 25 seconds)
2025-09-14 13:41:11,618 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:41:16,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7791.64844 ± 60.594
2025-09-14 13:41:16,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7802.22), np.float32(7702.247), np.float32(7813.551), np.float32(7746.388), np.float32(7753.85), np.float32(7869.3154), np.float32(7757.851), np.float32(7919.581), np.float32(7793.164), np.float32(7758.3154)]
2025-09-14 13:41:16,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:41:16,666 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7791.65) for latency 3
2025-09-14 13:41:16,671 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 64/100 (estimated time remaining: 2 hours, 54 minutes, 52 seconds)
2025-09-14 13:46:09,605 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:46:14,764 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7735.96582 ± 137.069
2025-09-14 13:46:14,764 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7848.9473), np.float32(7789.836), np.float32(7663.745), np.float32(7867.2866), np.float32(7395.8325), np.float32(7698.351), np.float32(7884.1665), np.float32(7673.3403), np.float32(7721.286), np.float32(7816.8584)]
2025-09-14 13:46:14,765 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:46:14,769 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 65/100 (estimated time remaining: 2 hours, 44 minutes, 18 seconds)
2025-09-14 13:50:17,216 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:50:22,320 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7636.28760 ± 107.118
2025-09-14 13:50:22,321 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7440.8193), np.float32(7632.0273), np.float32(7733.228), np.float32(7662.336), np.float32(7650.8574), np.float32(7682.176), np.float32(7455.811), np.float32(7783.223), np.float32(7595.1655), np.float32(7727.2314)]
2025-09-14 13:50:22,321 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:50:22,326 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 66/100 (estimated time remaining: 2 hours, 34 minutes, 52 seconds)
2025-09-14 13:57:12,613 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 13:57:17,628 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7724.10254 ± 125.098
2025-09-14 13:57:17,629 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7805.9067), np.float32(7540.0093), np.float32(7763.4805), np.float32(7610.5034), np.float32(7801.5835), np.float32(7811.439), np.float32(7476.6343), np.float32(7844.6226), np.float32(7828.4224), np.float32(7758.422)]
2025-09-14 13:57:17,629 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 13:57:17,635 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 67/100 (estimated time remaining: 2 hours, 37 minutes, 57 seconds)
2025-09-14 14:03:49,875 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:03:55,013 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7659.23340 ± 113.477
2025-09-14 14:03:55,013 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7676.9507), np.float32(7855.8057), np.float32(7630.3374), np.float32(7753.971), np.float32(7492.9395), np.float32(7622.432), np.float32(7628.877), np.float32(7491.93), np.float32(7633.4893), np.float32(7805.6)]
2025-09-14 14:03:55,013 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:03:55,018 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 68/100 (estimated time remaining: 2 hours, 53 minutes, 41 seconds)
2025-09-14 14:07:59,839 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:08:04,988 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7175.87744 ± 1318.990
2025-09-14 14:08:04,989 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7638.3022), np.float32(7633.774), np.float32(7694.0293), np.float32(3230.6045), np.float32(7645.8086), np.float32(7607.2656), np.float32(7791.562), np.float32(7370.153), np.float32(7570.261), np.float32(7577.015)]
2025-09-14 14:08:04,989 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:08:04,994 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 69/100 (estimated time remaining: 2 hours, 51 minutes, 33 seconds)
2025-09-14 14:12:50,763 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:12:55,839 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7845.94287 ± 56.567
2025-09-14 14:12:55,839 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7915.224), np.float32(7902.372), np.float32(7791.1675), np.float32(7863.8516), np.float32(7729.276), np.float32(7782.2695), np.float32(7891.0483), np.float32(7851.846), np.float32(7857.851), np.float32(7874.525)]
2025-09-14 14:12:55,839 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:12:55,839 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7845.94) for latency 3
2025-09-14 14:12:55,845 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 70/100 (estimated time remaining: 2 hours, 45 minutes, 26 seconds)
2025-09-14 14:18:25,424 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:18:30,445 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7825.54541 ± 100.851
2025-09-14 14:18:30,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7805.1514), np.float32(7959.1963), np.float32(7873.3784), np.float32(7935.4517), np.float32(7764.573), np.float32(7805.5513), np.float32(7972.245), np.float32(7747.756), np.float32(7652.2114), np.float32(7739.935)]
2025-09-14 14:18:30,446 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:18:30,451 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 71/100 (estimated time remaining: 2 hours, 48 minutes, 48 seconds)
2025-09-14 14:22:34,856 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:22:39,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7654.26562 ± 109.410
2025-09-14 14:22:39,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7666.82), np.float32(7470.5225), np.float32(7704.8613), np.float32(7655.3286), np.float32(7566.0703), np.float32(7780.1826), np.float32(7698.9883), np.float32(7489.509), np.float32(7684.203), np.float32(7826.1685)]
2025-09-14 14:22:39,830 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:22:39,835 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 72/100 (estimated time remaining: 2 hours, 27 minutes, 8 seconds)
2025-09-14 14:26:29,705 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:26:34,843 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7689.43359 ± 102.701
2025-09-14 14:26:34,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7646.583), np.float32(7790.702), np.float32(7635.972), np.float32(7762.4106), np.float32(7703.109), np.float32(7784.537), np.float32(7863.388), np.float32(7598.5103), np.float32(7577.588), np.float32(7531.539)]
2025-09-14 14:26:34,844 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:26:34,849 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 73/100 (estimated time remaining: 2 hours, 6 minutes, 55 seconds)
2025-09-14 14:30:13,405 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:30:18,546 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7735.44434 ± 126.873
2025-09-14 14:30:18,546 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7693.1914), np.float32(7770.13), np.float32(7774.276), np.float32(7744.783), np.float32(7665.75), np.float32(7880.059), np.float32(7943.5493), np.float32(7813.9653), np.float32(7546.1753), np.float32(7522.5684)]
2025-09-14 14:30:18,547 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:30:18,551 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 74/100 (estimated time remaining: 2 hours, 1 second)
2025-09-14 14:39:00,660 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:39:05,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7860.52344 ± 52.041
2025-09-14 14:39:05,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7825.9814), np.float32(7821.638), np.float32(7929.43), np.float32(7882.9453), np.float32(7794.924), np.float32(7876.079), np.float32(7960.861), np.float32(7879.023), np.float32(7802.8027), np.float32(7831.543)]
2025-09-14 14:39:05,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:39:05,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7860.52) for latency 3
2025-09-14 14:39:05,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 75/100 (estimated time remaining: 2 hours, 16 minutes, 3 seconds)
2025-09-14 14:44:10,198 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:44:15,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7718.37012 ± 151.583
2025-09-14 14:44:15,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7830.092), np.float32(7811.875), np.float32(7749.01), np.float32(7595.333), np.float32(7810.7666), np.float32(7396.917), np.float32(7854.6514), np.float32(7770.037), np.float32(7851.0156), np.float32(7514.0054)]
2025-09-14 14:44:15,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:44:15,365 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 76/100 (estimated time remaining: 2 hours, 8 minutes, 44 seconds)
2025-09-14 14:49:06,106 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:49:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7946.55615 ± 148.845
2025-09-14 14:49:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8055.8267), np.float32(8090.472), np.float32(7913.9663), np.float32(7751.009), np.float32(7835.577), np.float32(7718.604), np.float32(8109.825), np.float32(8126.028), np.float32(7818.035), np.float32(8046.2197)]
2025-09-14 14:49:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:49:11,151 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (7946.56) for latency 3
2025-09-14 14:49:11,156 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 77/100 (estimated time remaining: 2 hours, 7 minutes, 18 seconds)
2025-09-14 14:54:12,448 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:54:17,534 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7099.51953 ± 86.369
2025-09-14 14:54:17,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7128.837), np.float32(7045.262), np.float32(7137.884), np.float32(6898.1274), np.float32(7109.205), np.float32(7174.756), np.float32(7144.406), np.float32(7233.482), np.float32(7052.2866), np.float32(7070.948)]
2025-09-14 14:54:17,535 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:54:17,540 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 78/100 (estimated time remaining: 2 hours, 7 minutes, 28 seconds)
2025-09-14 14:58:25,753 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 14:58:30,756 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8098.92871 ± 61.659
2025-09-14 14:58:30,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8087.5083), np.float32(8069.303), np.float32(8096.484), np.float32(8172.765), np.float32(8112.7363), np.float32(8017.5244), np.float32(7981.727), np.float32(8195.478), np.float32(8128.202), np.float32(8127.561)]
2025-09-14 14:58:30,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 14:58:30,757 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (8098.93) for latency 3
2025-09-14 14:58:30,762 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 79/100 (estimated time remaining: 2 hours, 4 minutes, 5 seconds)
2025-09-14 15:01:39,029 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:01:44,183 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7753.67188 ± 113.562
2025-09-14 15:01:44,183 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7817.822), np.float32(7883.038), np.float32(7802.391), np.float32(7670.984), np.float32(7746.8813), np.float32(7538.9004), np.float32(7808.2437), np.float32(7644.2383), np.float32(7687.715), np.float32(7936.497)]
2025-09-14 15:01:44,183 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:01:44,190 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 80/100 (estimated time remaining: 1 hour, 35 minutes, 5 seconds)
2025-09-14 15:06:17,597 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:06:22,658 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7971.58203 ± 81.551
2025-09-14 15:06:22,659 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7972.971), np.float32(7969.658), np.float32(7994.157), np.float32(8009.742), np.float32(8035.9614), np.float32(8039.242), np.float32(7743.449), np.float32(8009.644), np.float32(7936.4653), np.float32(8004.5303)]
2025-09-14 15:06:22,659 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:06:22,664 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 81/100 (estimated time remaining: 1 hour, 28 minutes, 29 seconds)
2025-09-14 15:10:18,108 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:10:23,198 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7937.13916 ± 64.153
2025-09-14 15:10:23,199 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8016.7993), np.float32(7964.346), np.float32(7829.5996), np.float32(7870.22), np.float32(7920.5317), np.float32(7944.587), np.float32(8046.1636), np.float32(7882.19), np.float32(7984.066), np.float32(7912.888)]
2025-09-14 15:10:23,199 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:10:23,205 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 82/100 (estimated time remaining: 1 hour, 20 minutes, 33 seconds)
2025-09-14 15:15:46,690 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:15:51,821 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7590.00635 ± 72.500
2025-09-14 15:15:51,821 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7660.164), np.float32(7576.3945), np.float32(7640.622), np.float32(7629.6826), np.float32(7631.4766), np.float32(7637.44), np.float32(7516.349), np.float32(7582.3945), np.float32(7617.4995), np.float32(7408.043)]
2025-09-14 15:15:51,822 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:15:51,827 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 83/100 (estimated time remaining: 1 hour, 17 minutes, 39 seconds)
2025-09-14 15:19:02,980 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:19:08,064 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7945.27051 ± 76.323
2025-09-14 15:19:08,065 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7833.6455), np.float32(8059.454), np.float32(7846.8667), np.float32(7938.62), np.float32(7993.4756), np.float32(8058.381), np.float32(7989.9194), np.float32(7893.326), np.float32(7950.252), np.float32(7888.769)]
2025-09-14 15:19:08,065 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:19:08,071 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 84/100 (estimated time remaining: 1 hour, 10 minutes, 6 seconds)
2025-09-14 15:23:41,807 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:23:46,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7985.99219 ± 133.129
2025-09-14 15:23:46,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8093.8745), np.float32(7763.6074), np.float32(7973.636), np.float32(7907.9097), np.float32(8018.647), np.float32(8202.43), np.float32(8162.734), np.float32(7831.702), np.float32(7904.1235), np.float32(8001.2617)]
2025-09-14 15:23:46,909 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:23:46,915 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 85/100 (estimated time remaining: 1 hour, 10 minutes, 32 seconds)
2025-09-14 15:28:25,746 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:28:30,788 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7898.66260 ± 98.363
2025-09-14 15:28:30,788 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7984.0645), np.float32(7950.9546), np.float32(7702.0923), np.float32(7821.823), np.float32(7990.075), np.float32(7961.39), np.float32(7838.9243), np.float32(7873.314), np.float32(7823.693), np.float32(8040.286)]
2025-09-14 15:28:30,788 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:28:30,794 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 86/100 (estimated time remaining: 1 hour, 6 minutes, 24 seconds)
2025-09-14 15:31:34,212 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:31:39,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7862.72656 ± 119.476
2025-09-14 15:31:39,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8048.2124), np.float32(7870.322), np.float32(7681.285), np.float32(7714.021), np.float32(7856.0176), np.float32(7983.0835), np.float32(7820.363), np.float32(7780.7197), np.float32(7838.443), np.float32(8034.7935)]
2025-09-14 15:31:39,293 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:31:39,299 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 87/100 (estimated time remaining: 59 minutes, 33 seconds)
2025-09-14 15:35:56,781 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:36:01,905 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7947.56104 ± 56.904
2025-09-14 15:36:01,905 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7870.4155), np.float32(7828.19), np.float32(7975.8184), np.float32(7938.68), np.float32(7948.96), np.float32(7992.2153), np.float32(7978.426), np.float32(8035.232), np.float32(7936.6797), np.float32(7970.9893)]
2025-09-14 15:36:01,905 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:36:01,911 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 88/100 (estimated time remaining: 52 minutes, 26 seconds)
2025-09-14 15:41:59,742 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:42:04,890 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7876.65332 ± 148.634
2025-09-14 15:42:04,890 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7979.8403), np.float32(7703.5903), np.float32(7957.7334), np.float32(7501.612), np.float32(7981.355), np.float32(7860.727), np.float32(7943.507), np.float32(7923.2783), np.float32(7990.4487), np.float32(7924.4365)]
2025-09-14 15:42:04,890 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:42:04,896 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 89/100 (estimated time remaining: 55 minutes, 4 seconds)
2025-09-14 15:46:30,972 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:46:36,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8033.27197 ± 122.200
2025-09-14 15:46:36,047 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8079.423), np.float32(8036.214), np.float32(8074.8384), np.float32(7823.5674), np.float32(8262.566), np.float32(8071.146), np.float32(7916.5586), np.float32(7915.365), np.float32(8166.549), np.float32(7986.4917)]
2025-09-14 15:46:36,047 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:46:36,053 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 90/100 (estimated time remaining: 50 minutes, 12 seconds)
2025-09-14 15:50:55,371 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:51:00,336 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8041.80371 ± 76.347
2025-09-14 15:51:00,336 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7894.499), np.float32(8054.041), np.float32(7996.0737), np.float32(8048.6304), np.float32(8026.9355), np.float32(8015.652), np.float32(8162.5815), np.float32(7982.0747), np.float32(8077.5786), np.float32(8159.9653)]
2025-09-14 15:51:00,337 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:51:00,343 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 91/100 (estimated time remaining: 44 minutes, 59 seconds)
2025-09-14 15:54:34,104 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:54:39,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8159.27979 ± 50.121
2025-09-14 15:54:39,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8038.7686), np.float32(8181.801), np.float32(8139.941), np.float32(8172.2075), np.float32(8220.086), np.float32(8214.396), np.float32(8118.0903), np.float32(8145.741), np.float32(8181.722), np.float32(8180.0503)]
2025-09-14 15:54:39,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:54:39,191 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1226 [INFO]: New best (8159.28) for latency 3
2025-09-14 15:54:39,198 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 92/100 (estimated time remaining: 41 minutes, 23 seconds)
2025-09-14 15:57:37,150 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 15:57:42,256 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7296.59863 ± 160.655
2025-09-14 15:57:42,257 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7196.081), np.float32(6972.668), np.float32(7480.394), np.float32(7158.8594), np.float32(7489.8667), np.float32(7433.8325), np.float32(7419.762), np.float32(7188.7334), np.float32(7254.6177), np.float32(7371.162)]
2025-09-14 15:57:42,257 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 15:57:42,264 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 93/100 (estimated time remaining: 34 minutes, 40 seconds)
2025-09-14 16:01:25,359 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:01:30,458 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8110.30176 ± 89.259
2025-09-14 16:01:30,459 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8118.6177), np.float32(8212.597), np.float32(7997.132), np.float32(8111.237), np.float32(7920.969), np.float32(8149.64), np.float32(8175.1406), np.float32(8119.3286), np.float32(8225.422), np.float32(8072.93)]
2025-09-14 16:01:30,459 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:01:30,466 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 94/100 (estimated time remaining: 27 minutes, 11 seconds)
2025-09-14 16:05:51,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:05:57,039 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7932.15771 ± 161.693
2025-09-14 16:05:57,039 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7774.4224), np.float32(7935.9175), np.float32(7999.634), np.float32(8165.4487), np.float32(7969.7505), np.float32(7873.5674), np.float32(7907.9614), np.float32(8022.7827), np.float32(8105.7275), np.float32(7566.3687)]
2025-09-14 16:05:57,039 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:05:57,046 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 95/100 (estimated time remaining: 23 minutes, 13 seconds)
2025-09-14 16:11:46,051 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:11:50,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7627.04297 ± 983.339
2025-09-14 16:11:50,952 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8023.1616), np.float32(7843.3384), np.float32(7957.188), np.float32(7807.559), np.float32(7820.4473), np.float32(8065.268), np.float32(4695.92), np.float32(7935.187), np.float32(7935.772), np.float32(8186.5835)]
2025-09-14 16:11:50,953 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:11:50,959 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 96/100 (estimated time remaining: 20 minutes, 50 seconds)
2025-09-14 16:16:12,768 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:16:17,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8013.59277 ± 136.291
2025-09-14 16:16:17,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8185.1226), np.float32(7841.8013), np.float32(8135.0503), np.float32(7797.5347), np.float32(8090.3813), np.float32(8083.8857), np.float32(8063.202), np.float32(8113.625), np.float32(7808.4854), np.float32(8016.846)]
2025-09-14 16:16:17,785 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:16:17,792 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 97/100 (estimated time remaining: 17 minutes, 18 seconds)
2025-09-14 16:21:11,686 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:21:16,821 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7875.56006 ± 104.962
2025-09-14 16:21:16,821 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7879.2847), np.float32(7907.725), np.float32(7636.7163), np.float32(7901.2705), np.float32(7988.053), np.float32(7832.3765), np.float32(7936.126), np.float32(8037.5815), np.float32(7832.456), np.float32(7804.017)]
2025-09-14 16:21:16,822 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:21:16,828 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 98/100 (estimated time remaining: 14 minutes, 8 seconds)
2025-09-14 16:25:31,352 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:25:36,474 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7638.76416 ± 84.064
2025-09-14 16:25:36,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(7467.385), np.float32(7670.656), np.float32(7650.47), np.float32(7702.7363), np.float32(7709.018), np.float32(7496.786), np.float32(7626.509), np.float32(7685.2075), np.float32(7731.353), np.float32(7647.517)]
2025-09-14 16:25:36,475 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:25:36,482 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 99/100 (estimated time remaining: 9 minutes, 38 seconds)
2025-09-14 16:30:22,227 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:30:27,326 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 8072.53662 ± 60.652
2025-09-14 16:30:27,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8113.9805), np.float32(8046.8022), np.float32(8030.8433), np.float32(8157.7104), np.float32(8023.4966), np.float32(8136.0625), np.float32(8104.1753), np.float32(7953.1675), np.float32(8122.1387), np.float32(8036.989)]
2025-09-14 16:30:27,327 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:30:27,334 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1199 [INFO]: Iteration 100/100 (estimated time remaining: 4 minutes, 54 seconds)
2025-09-14 16:36:43,820 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1214 [DEBUG]: Evaluating for latency 3...
2025-09-14 16:36:48,974 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1221 [DEBUG]: Total Reward: 7968.52051 ± 161.186
2025-09-14 16:36:48,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1222 [DEBUG]: All rewards: [np.float32(8068.0156), np.float32(7952.491), np.float32(7685.0024), np.float32(7978.313), np.float32(8198.794), np.float32(8038.216), np.float32(7785.132), np.float32(8175.7993), np.float32(8015.233), np.float32(7788.206)]
2025-09-14 16:36:48,975 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1223 [DEBUG]: All trajectory lengths: [np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0), np.float32(1000.0)]
2025-09-14 16:36:48,982 latency_env.delayed_mdp:training_loop(baseline-bpql-halfcheetah):1251 [DEBUG]: Training session finished
