Logging to experiments/gym_cheetahO01/oct31/w350e3_Durl_seed2341
Print configuration .....
{'env_name': 'gym_cheetahO01', 'random_seeds': [4321, 2314, 2341, 3421], 'save_variables': False, 'model_save_dir': '/tmp/gym_cheetahO01_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.0958514213562012
Validation loss = 0.5189859867095947
Validation loss = 0.6541313529014587
Validation loss = 0.6616743803024292
Validation loss = 0.640098512172699
Validation loss = 0.7057974338531494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 1.899274230003357
Validation loss = 0.6281936168670654
Validation loss = 0.5960037112236023
Validation loss = 0.6354615092277527
Validation loss = 0.5860515832901001
Validation loss = 0.6003513932228088
Validation loss = 0.7673337459564209
Validation loss = 1.079140305519104
Validation loss = 0.8903045654296875
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.2996108531951904
Validation loss = 0.5215333104133606
Validation loss = 0.5575882196426392
Validation loss = 0.583286702632904
Validation loss = 0.5684998035430908
Validation loss = 0.8396872878074646
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.9434590339660645
Validation loss = 0.5479012727737427
Validation loss = 0.6307691931724548
Validation loss = 0.7038718461990356
Validation loss = 0.7369853258132935
Validation loss = 0.8269882798194885
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 1.723659873008728
Validation loss = 0.5498285293579102
Validation loss = 0.6934716701507568
Validation loss = 0.6293610334396362
Validation loss = 0.7674366235733032
Validation loss = 0.7740256786346436
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 202
average number of affinization = 28.857142857142858
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 43
average number of affinization = 30.625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 166
average number of affinization = 45.666666666666664
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 47
average number of affinization = 45.8
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 167
average number of affinization = 56.81818181818182
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 53
average number of affinization = 56.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -350     |
| Iteration     | 0        |
| MaximumReturn | -291     |
| MinimumReturn | -410     |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2524130940437317
Validation loss = 0.22289693355560303
Validation loss = 0.19901497662067413
Validation loss = 0.2011410892009735
Validation loss = 0.2035650908946991
Validation loss = 0.19762134552001953
Validation loss = 0.20934657752513885
Validation loss = 0.2399597018957138
Validation loss = 0.20660322904586792
Validation loss = 0.20616070926189423
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2627830505371094
Validation loss = 0.21823455393314362
Validation loss = 0.2033335268497467
Validation loss = 0.2000390738248825
Validation loss = 0.2197163701057434
Validation loss = 0.2028745710849762
Validation loss = 0.20654448866844177
Validation loss = 0.20533783733844757
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.26726847887039185
Validation loss = 0.22393468022346497
Validation loss = 0.19920386373996735
Validation loss = 0.19782617688179016
Validation loss = 0.1907615214586258
Validation loss = 0.2078535407781601
Validation loss = 0.1979992538690567
Validation loss = 0.21137265861034393
Validation loss = 0.20503592491149902
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.25971516966819763
Validation loss = 0.2147025465965271
Validation loss = 0.20159924030303955
Validation loss = 0.21503575146198273
Validation loss = 0.20115846395492554
Validation loss = 0.19783242046833038
Validation loss = 0.19842688739299774
Validation loss = 0.19980530440807343
Validation loss = 0.2039099931716919
Validation loss = 0.21840515732765198
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2557254433631897
Validation loss = 0.21218687295913696
Validation loss = 0.21088621020317078
Validation loss = 0.2160673290491104
Validation loss = 0.19927573204040527
Validation loss = 0.19414812326431274
Validation loss = 0.2015916407108307
Validation loss = 0.2072432041168213
Validation loss = 0.2034481316804886
Validation loss = 0.20106056332588196
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 478
average number of affinization = 88.92307692307692
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 401
average number of affinization = 111.21428571428571
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 420
average number of affinization = 131.8
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 223
average number of affinization = 137.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 425
average number of affinization = 154.41176470588235
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 446
average number of affinization = 170.61111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -359     |
| Iteration     | 1        |
| MaximumReturn | -283     |
| MinimumReturn | -437     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21888671815395355
Validation loss = 0.18435005843639374
Validation loss = 0.18930989503860474
Validation loss = 0.1940961331129074
Validation loss = 0.20256222784519196
Validation loss = 0.19129061698913574
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21662770211696625
Validation loss = 0.1845192313194275
Validation loss = 0.18149976432323456
Validation loss = 0.182067409157753
Validation loss = 0.190235897898674
Validation loss = 0.20840738713741302
Validation loss = 0.22293846309185028
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21174286305904388
Validation loss = 0.19120155274868011
Validation loss = 0.18312855064868927
Validation loss = 0.1830487847328186
Validation loss = 0.1878582388162613
Validation loss = 0.18832464516162872
Validation loss = 0.18913470208644867
Validation loss = 0.1875334233045578
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21320687234401703
Validation loss = 0.18263810873031616
Validation loss = 0.19495363533496857
Validation loss = 0.19252116978168488
Validation loss = 0.18724803626537323
Validation loss = 0.19008241593837738
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21288341283798218
Validation loss = 0.17979608476161957
Validation loss = 0.184814453125
Validation loss = 0.18912191689014435
Validation loss = 0.18610064685344696
Validation loss = 0.18611373007297516
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 433
average number of affinization = 184.42105263157896
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 426
average number of affinization = 196.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 439
average number of affinization = 208.04761904761904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 437
average number of affinization = 218.45454545454547
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 456
average number of affinization = 228.7826086956522
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 465
average number of affinization = 238.625
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -240     |
| Iteration     | 2        |
| MaximumReturn | -147     |
| MinimumReturn | -283     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17968089878559113
Validation loss = 0.18632814288139343
Validation loss = 0.18113812804222107
Validation loss = 0.2373746931552887
Validation loss = 0.1896885633468628
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1765790730714798
Validation loss = 0.179152712225914
Validation loss = 0.18638406693935394
Validation loss = 0.18270379304885864
Validation loss = 0.18567079305648804
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1829342544078827
Validation loss = 0.17945557832717896
Validation loss = 0.18079179525375366
Validation loss = 0.20158851146697998
Validation loss = 0.18561527132987976
Validation loss = 0.18430590629577637
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18149378895759583
Validation loss = 0.17559696733951569
Validation loss = 0.18043871223926544
Validation loss = 0.1897316575050354
Validation loss = 0.18636538088321686
Validation loss = 0.1846189945936203
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17385441064834595
Validation loss = 0.1794419288635254
Validation loss = 0.18302834033966064
Validation loss = 0.23627540469169617
Validation loss = 0.19938091933727264
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 521
average number of affinization = 249.92
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 538
average number of affinization = 261.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 558
average number of affinization = 272.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 561
average number of affinization = 282.32142857142856
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 602
average number of affinization = 293.3448275862069
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 554
average number of affinization = 302.03333333333336
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 76.9     |
| Iteration     | 3        |
| MaximumReturn | 153      |
| MinimumReturn | -12.1    |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17720867693424225
Validation loss = 0.1804412603378296
Validation loss = 0.17950519919395447
Validation loss = 0.17940352857112885
Validation loss = 0.18532247841358185
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17089639604091644
Validation loss = 0.17519384622573853
Validation loss = 0.1789383441209793
Validation loss = 0.17697134613990784
Validation loss = 0.19072186946868896
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17207665741443634
Validation loss = 0.17295010387897491
Validation loss = 0.17654529213905334
Validation loss = 0.18218210339546204
Validation loss = 0.17914871871471405
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1722736358642578
Validation loss = 0.1748773157596588
Validation loss = 0.17738120257854462
Validation loss = 0.17707598209381104
Validation loss = 0.17823217809200287
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17056718468666077
Validation loss = 0.17401860654354095
Validation loss = 0.17656314373016357
Validation loss = 0.18031542003154755
Validation loss = 0.1811869740486145
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 605
average number of affinization = 311.80645161290323
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 567
average number of affinization = 319.78125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 604
average number of affinization = 328.3939393939394
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 593
average number of affinization = 336.1764705882353
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 629
average number of affinization = 344.54285714285714
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 569
average number of affinization = 350.77777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 594      |
| Iteration     | 4        |
| MaximumReturn | 692      |
| MinimumReturn | 499      |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17232365906238556
Validation loss = 0.1747206598520279
Validation loss = 0.1850273758172989
Validation loss = 0.1756402850151062
Validation loss = 0.17334234714508057
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1702491044998169
Validation loss = 0.17198842763900757
Validation loss = 0.18040312826633453
Validation loss = 0.18010342121124268
Validation loss = 0.17445440590381622
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17252510786056519
Validation loss = 0.17413806915283203
Validation loss = 0.17695946991443634
Validation loss = 0.19848501682281494
Validation loss = 0.17402754724025726
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16879095137119293
Validation loss = 0.17474530637264252
Validation loss = 0.1731429547071457
Validation loss = 0.17626738548278809
Validation loss = 0.19488872587680817
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1702302247285843
Validation loss = 0.17217029631137848
Validation loss = 0.1717100590467453
Validation loss = 0.17614400386810303
Validation loss = 0.17473191022872925
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 668
average number of affinization = 359.35135135135135
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 652
average number of affinization = 367.05263157894734
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 673
average number of affinization = 374.8974358974359
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 680
average number of affinization = 382.525
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 665
average number of affinization = 389.4146341463415
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 707
average number of affinization = 396.9761904761905
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.04e+03 |
| Iteration     | 5        |
| MaximumReturn | 1.34e+03 |
| MinimumReturn | 336      |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16770951449871063
Validation loss = 0.16803953051567078
Validation loss = 0.17175666987895966
Validation loss = 0.1673443466424942
Validation loss = 0.17068108916282654
Validation loss = 0.17747752368450165
Validation loss = 0.17507179081439972
Validation loss = 0.18338944017887115
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16895343363285065
Validation loss = 0.16956742107868195
Validation loss = 0.18011856079101562
Validation loss = 0.16899169981479645
Validation loss = 0.18040971457958221
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16821227967739105
Validation loss = 0.16793501377105713
Validation loss = 0.17383089661598206
Validation loss = 0.18253697454929352
Validation loss = 0.17425203323364258
Validation loss = 0.1745537966489792
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1663619577884674
Validation loss = 0.176014706492424
Validation loss = 0.17758528888225555
Validation loss = 0.17094288766384125
Validation loss = 0.1709553450345993
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16695347428321838
Validation loss = 0.16830691695213318
Validation loss = 0.1694350689649582
Validation loss = 0.16879835724830627
Validation loss = 0.17195363342761993
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 781
average number of affinization = 405.90697674418607
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 727
average number of affinization = 413.20454545454544
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 733
average number of affinization = 420.31111111111113
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 757
average number of affinization = 427.6304347826087
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 725
average number of affinization = 433.9574468085106
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 733
average number of affinization = 440.1875
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.38e+03 |
| Iteration     | 6        |
| MaximumReturn | 1.67e+03 |
| MinimumReturn | 417      |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1637898087501526
Validation loss = 0.16750271618366241
Validation loss = 0.16755597293376923
Validation loss = 0.16639459133148193
Validation loss = 0.16600772738456726
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16262593865394592
Validation loss = 0.1612909734249115
Validation loss = 0.16555273532867432
Validation loss = 0.16210466623306274
Validation loss = 0.1624128818511963
Validation loss = 0.16490893065929413
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1659954935312271
Validation loss = 0.16303905844688416
Validation loss = 0.1667604297399521
Validation loss = 0.1768854707479477
Validation loss = 0.16737939417362213
Validation loss = 0.1705016791820526
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16400691866874695
Validation loss = 0.16237998008728027
Validation loss = 0.16067050397396088
Validation loss = 0.1645652949810028
Validation loss = 0.16490718722343445
Validation loss = 0.1685597002506256
Validation loss = 0.1644240915775299
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16043291985988617
Validation loss = 0.16300973296165466
Validation loss = 0.1606098711490631
Validation loss = 0.1631014049053192
Validation loss = 0.1628410816192627
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 785
average number of affinization = 447.2244897959184
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 805
average number of affinization = 454.38
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 803
average number of affinization = 461.2156862745098
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 778
average number of affinization = 467.3076923076923
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 790
average number of affinization = 473.39622641509436
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 793
average number of affinization = 479.31481481481484
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.41e+03 |
| Iteration     | 7        |
| MaximumReturn | 2.07e+03 |
| MinimumReturn | -194     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1595948189496994
Validation loss = 0.15813034772872925
Validation loss = 0.1601009964942932
Validation loss = 0.16340938210487366
Validation loss = 0.16095399856567383
Validation loss = 0.16124214231967926
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1603286862373352
Validation loss = 0.16389280557632446
Validation loss = 0.16009745001792908
Validation loss = 0.1591324359178543
Validation loss = 0.15865494310855865
Validation loss = 0.15979453921318054
Validation loss = 0.16207028925418854
Validation loss = 0.16758327186107635
Validation loss = 0.16109997034072876
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.15972772240638733
Validation loss = 0.15832220017910004
Validation loss = 0.16511328518390656
Validation loss = 0.15882112085819244
Validation loss = 0.16288897395133972
Validation loss = 0.16712769865989685
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16027501225471497
Validation loss = 0.16027003526687622
Validation loss = 0.15830188989639282
Validation loss = 0.15888269245624542
Validation loss = 0.16812103986740112
Validation loss = 0.161356583237648
Validation loss = 0.16125553846359253
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.15561668574810028
Validation loss = 0.15665897727012634
Validation loss = 0.1558389514684677
Validation loss = 0.15937577188014984
Validation loss = 0.15756665170192719
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 822
average number of affinization = 485.54545454545456
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 824
average number of affinization = 491.5892857142857
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 841
average number of affinization = 497.719298245614
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 830
average number of affinization = 503.44827586206895
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 784
average number of affinization = 508.20338983050846
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 821
average number of affinization = 513.4166666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.81e+03 |
| Iteration     | 8        |
| MaximumReturn | 2.16e+03 |
| MinimumReturn | 1.16e+03 |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.15537799894809723
Validation loss = 0.15821747481822968
Validation loss = 0.1539011299610138
Validation loss = 0.15712231397628784
Validation loss = 0.1554616391658783
Validation loss = 0.159209206700325
Validation loss = 0.16267208755016327
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.15914098918437958
Validation loss = 0.15598472952842712
Validation loss = 0.15556874871253967
Validation loss = 0.15899553894996643
Validation loss = 0.15819229185581207
Validation loss = 0.15754421055316925
Validation loss = 0.15834805369377136
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.15739957988262177
Validation loss = 0.15561571717262268
Validation loss = 0.15615883469581604
Validation loss = 0.15879838168621063
Validation loss = 0.1585795134305954
Validation loss = 0.15736068785190582
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1559590995311737
Validation loss = 0.15597236156463623
Validation loss = 0.15749654173851013
Validation loss = 0.15724526345729828
Validation loss = 0.1567527949810028
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.15376752614974976
Validation loss = 0.15219630300998688
Validation loss = 0.15259388089179993
Validation loss = 0.15674413740634918
Validation loss = 0.16017737984657288
Validation loss = 0.15520021319389343
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 802
average number of affinization = 518.1475409836065
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 815
average number of affinization = 522.9354838709677
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 846
average number of affinization = 528.063492063492
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 824
average number of affinization = 532.6875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 841
average number of affinization = 537.4307692307692
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 833
average number of affinization = 541.9090909090909
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.79e+03 |
| Iteration     | 9        |
| MaximumReturn | 2.14e+03 |
| MinimumReturn | 1.39e+03 |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1521548181772232
Validation loss = 0.15335623919963837
Validation loss = 0.15883928537368774
Validation loss = 0.15395498275756836
Validation loss = 0.15433906018733978
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1524731069803238
Validation loss = 0.15336643159389496
Validation loss = 0.15298181772232056
Validation loss = 0.15358518064022064
Validation loss = 0.15427392721176147
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1521202176809311
Validation loss = 0.15342293679714203
Validation loss = 0.15369288623332977
Validation loss = 0.15699703991413116
Validation loss = 0.15629293024539948
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.15407820045948029
Validation loss = 0.15410198271274567
Validation loss = 0.1519719511270523
Validation loss = 0.15392465889453888
Validation loss = 0.15649330615997314
Validation loss = 0.15639322996139526
Validation loss = 0.15522770583629608
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.15629933774471283
Validation loss = 0.15162920951843262
Validation loss = 0.15022553503513336
Validation loss = 0.1526608169078827
Validation loss = 0.1517992615699768
Validation loss = 0.1531735062599182
Validation loss = 0.15469729900360107
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 873
average number of affinization = 546.8507462686567
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 881
average number of affinization = 551.7647058823529
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 860
average number of affinization = 556.231884057971
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 837
average number of affinization = 560.2428571428571
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 871
average number of affinization = 564.6197183098592
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 872
average number of affinization = 568.8888888888889
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.06e+03 |
| Iteration     | 10       |
| MaximumReturn | 2.27e+03 |
| MinimumReturn | 1.94e+03 |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1474606841802597
Validation loss = 0.14982430636882782
Validation loss = 0.1486525982618332
Validation loss = 0.15079839527606964
Validation loss = 0.15334244072437286
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14808416366577148
Validation loss = 0.14923851191997528
Validation loss = 0.15326260030269623
Validation loss = 0.15026678144931793
Validation loss = 0.149894580245018
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1483220010995865
Validation loss = 0.1490275263786316
Validation loss = 0.14971104264259338
Validation loss = 0.15162129700183868
Validation loss = 0.15430085361003876
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.15118764340877533
Validation loss = 0.15103425085544586
Validation loss = 0.14863373339176178
Validation loss = 0.15274988114833832
Validation loss = 0.150663360953331
Validation loss = 0.15250132977962494
Validation loss = 0.15381477773189545
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.147047221660614
Validation loss = 0.14741235971450806
Validation loss = 0.14829695224761963
Validation loss = 0.1500825732946396
Validation loss = 0.15103961527347565
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 884
average number of affinization = 573.2054794520548
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 902
average number of affinization = 577.6486486486486
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 866
average number of affinization = 581.4933333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 885
average number of affinization = 585.4868421052631
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 879
average number of affinization = 589.2987012987013
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 883
average number of affinization = 593.0641025641025
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.77e+03 |
| Iteration     | 11       |
| MaximumReturn | 2.27e+03 |
| MinimumReturn | 163      |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.14836637675762177
Validation loss = 0.144746795296669
Validation loss = 0.14643529057502747
Validation loss = 0.14758118987083435
Validation loss = 0.14887240529060364
Validation loss = 0.14980237185955048
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14584454894065857
Validation loss = 0.14857257902622223
Validation loss = 0.1453457623720169
Validation loss = 0.14605917036533356
Validation loss = 0.1492745727300644
Validation loss = 0.14985589683055878
Validation loss = 0.14897283911705017
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1480865627527237
Validation loss = 0.1464424580335617
Validation loss = 0.14643611013889313
Validation loss = 0.15117861330509186
Validation loss = 0.15043461322784424
Validation loss = 0.1478714495897293
Validation loss = 0.14827023446559906
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.14631298184394836
Validation loss = 0.14610354602336884
Validation loss = 0.14680829644203186
Validation loss = 0.1460193246603012
Validation loss = 0.14716379344463348
Validation loss = 0.1474987119436264
Validation loss = 0.1491016000509262
Validation loss = 0.14836543798446655
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.14637568593025208
Validation loss = 0.1448826640844345
Validation loss = 0.14406313002109528
Validation loss = 0.14501839876174927
Validation loss = 0.14889679849147797
Validation loss = 0.14687985181808472
Validation loss = 0.14679287374019623
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 890
average number of affinization = 596.8227848101266
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 896
average number of affinization = 600.5625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 891
average number of affinization = 604.1481481481482
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 877
average number of affinization = 607.4756097560976
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 899
average number of affinization = 610.9879518072289
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 892
average number of affinization = 614.3333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.96e+03 |
| Iteration     | 12       |
| MaximumReturn | 2.41e+03 |
| MinimumReturn | 775      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1451375037431717
Validation loss = 0.14717213809490204
Validation loss = 0.14373734593391418
Validation loss = 0.1466662585735321
Validation loss = 0.14580772817134857
Validation loss = 0.14558382332324982
Validation loss = 0.14759734272956848
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14366546273231506
Validation loss = 0.143064484000206
Validation loss = 0.14468753337860107
Validation loss = 0.14371085166931152
Validation loss = 0.14526726305484772
Validation loss = 0.14624522626399994
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.14561888575553894
Validation loss = 0.14273668825626373
Validation loss = 0.14521722495555878
Validation loss = 0.1452993005514145
Validation loss = 0.14611898362636566
Validation loss = 0.14475129544734955
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1447029858827591
Validation loss = 0.1435295045375824
Validation loss = 0.14327862858772278
Validation loss = 0.1460607796907425
Validation loss = 0.1458429992198944
Validation loss = 0.14447762072086334
Validation loss = 0.14475341141223907
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.14404462277889252
Validation loss = 0.14456011354923248
Validation loss = 0.14432279765605927
Validation loss = 0.14458398520946503
Validation loss = 0.14632420241832733
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 930
average number of affinization = 618.0470588235294
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 908
average number of affinization = 621.4186046511628
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 906
average number of affinization = 624.6896551724138
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 921
average number of affinization = 628.0568181818181
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 918
average number of affinization = 631.314606741573
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 921
average number of affinization = 634.5333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.27e+03 |
| Iteration     | 13       |
| MaximumReturn | 2.37e+03 |
| MinimumReturn | 2.05e+03 |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.14298458397388458
Validation loss = 0.14438238739967346
Validation loss = 0.1426977664232254
Validation loss = 0.14282606542110443
Validation loss = 0.14524216949939728
Validation loss = 0.142807736992836
Validation loss = 0.14390535652637482
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1412196308374405
Validation loss = 0.14023606479167938
Validation loss = 0.14105255901813507
Validation loss = 0.14223404228687286
Validation loss = 0.14202268421649933
Validation loss = 0.14235523343086243
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.14353543519973755
Validation loss = 0.14203716814517975
Validation loss = 0.1418433040380478
Validation loss = 0.14365988969802856
Validation loss = 0.14416831731796265
Validation loss = 0.1426491141319275
Validation loss = 0.14375166594982147
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.14543260633945465
Validation loss = 0.14151430130004883
Validation loss = 0.1439497023820877
Validation loss = 0.14556775987148285
Validation loss = 0.14355328679084778
Validation loss = 0.14355354011058807
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.14016489684581757
Validation loss = 0.14096103608608246
Validation loss = 0.1445857137441635
Validation loss = 0.14344839751720428
Validation loss = 0.14299854636192322
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 914
average number of affinization = 637.6043956043956
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 922
average number of affinization = 640.695652173913
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 920
average number of affinization = 643.6989247311828
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 916
average number of affinization = 646.5957446808511
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 931
average number of affinization = 649.5894736842105
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 915
average number of affinization = 652.3541666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.25e+03 |
| Iteration     | 14       |
| MaximumReturn | 2.86e+03 |
| MinimumReturn | 1.4e+03  |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13979099690914154
Validation loss = 0.14068755507469177
Validation loss = 0.14179794490337372
Validation loss = 0.1415013074874878
Validation loss = 0.14070528745651245
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14124369621276855
Validation loss = 0.1384696066379547
Validation loss = 0.13982811570167542
Validation loss = 0.14045073091983795
Validation loss = 0.14014413952827454
Validation loss = 0.1405428647994995
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1398628205060959
Validation loss = 0.14060449600219727
Validation loss = 0.1406669020652771
Validation loss = 0.1403329074382782
Validation loss = 0.14264927804470062
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1410020887851715
Validation loss = 0.14027124643325806
Validation loss = 0.1418253481388092
Validation loss = 0.13977840542793274
Validation loss = 0.14004558324813843
Validation loss = 0.14186778664588928
Validation loss = 0.14044120907783508
Validation loss = 0.14105407893657684
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13862338662147522
Validation loss = 0.13964663445949554
Validation loss = 0.1392357498407364
Validation loss = 0.1403069794178009
Validation loss = 0.14073653519153595
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 937
average number of affinization = 655.2886597938144
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 946
average number of affinization = 658.2551020408164
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 909
average number of affinization = 660.7878787878788
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 944
average number of affinization = 663.62
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 902
average number of affinization = 665.980198019802
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 910
average number of affinization = 668.3725490196078
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.39e+03 |
| Iteration     | 15       |
| MaximumReturn | 2.47e+03 |
| MinimumReturn | 194      |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13935422897338867
Validation loss = 0.1388438642024994
Validation loss = 0.13877113163471222
Validation loss = 0.13852256536483765
Validation loss = 0.13901330530643463
Validation loss = 0.1394232213497162
Validation loss = 0.14084923267364502
Validation loss = 0.1403713822364807
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1395728886127472
Validation loss = 0.13834063708782196
Validation loss = 0.14048324525356293
Validation loss = 0.13870105147361755
Validation loss = 0.13901247084140778
Validation loss = 0.14044378697872162
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13975250720977783
Validation loss = 0.13945989310741425
Validation loss = 0.1392311453819275
Validation loss = 0.1398501843214035
Validation loss = 0.1391395926475525
Validation loss = 0.14134809374809265
Validation loss = 0.14141464233398438
Validation loss = 0.13882668316364288
Validation loss = 0.14197339117527008
Validation loss = 0.14050047099590302
Validation loss = 0.14163216948509216
Validation loss = 0.1419219821691513
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.14062629640102386
Validation loss = 0.13798344135284424
Validation loss = 0.13947202265262604
Validation loss = 0.13993410766124725
Validation loss = 0.14067400991916656
Validation loss = 0.13972370326519012
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13991084694862366
Validation loss = 0.14015591144561768
Validation loss = 0.14024554193019867
Validation loss = 0.13850955665111542
Validation loss = 0.139219269156456
Validation loss = 0.14092938601970673
Validation loss = 0.13874761760234833
Validation loss = 0.1404792070388794
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 955
average number of affinization = 671.1553398058253
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 947
average number of affinization = 673.8076923076923
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 926
average number of affinization = 676.2095238095238
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 918
average number of affinization = 678.4905660377359
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 949
average number of affinization = 681.018691588785
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 942
average number of affinization = 683.4351851851852
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.68e+03 |
| Iteration     | 16       |
| MaximumReturn | 2.53e+03 |
| MinimumReturn | -89.8    |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13817831873893738
Validation loss = 0.13658477365970612
Validation loss = 0.13750582933425903
Validation loss = 0.13918152451515198
Validation loss = 0.13912378251552582
Validation loss = 0.14000660181045532
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13641174137592316
Validation loss = 0.1364894062280655
Validation loss = 0.1373997926712036
Validation loss = 0.13789844512939453
Validation loss = 0.13765183091163635
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13779927790164948
Validation loss = 0.13737665116786957
Validation loss = 0.13767941296100616
Validation loss = 0.13923993706703186
Validation loss = 0.13719266653060913
Validation loss = 0.13951729238033295
Validation loss = 0.13985081017017365
Validation loss = 0.1383589804172516
Validation loss = 0.13860328495502472
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13960999250411987
Validation loss = 0.13868169486522675
Validation loss = 0.1387420892715454
Validation loss = 0.13724131882190704
Validation loss = 0.13871674239635468
Validation loss = 0.13816601037979126
Validation loss = 0.13805970549583435
Validation loss = 0.13921459019184113
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1374484896659851
Validation loss = 0.13855381309986115
Validation loss = 0.13659554719924927
Validation loss = 0.13752180337905884
Validation loss = 0.13922441005706787
Validation loss = 0.13882511854171753
Validation loss = 0.13822680711746216
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 951
average number of affinization = 685.8899082568807
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 953
average number of affinization = 688.3181818181819
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 956
average number of affinization = 690.7297297297297
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 950
average number of affinization = 693.0446428571429
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 952
average number of affinization = 695.3362831858407
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 928
average number of affinization = 697.3771929824561
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.91e+03 |
| Iteration     | 17       |
| MaximumReturn | 2.41e+03 |
| MinimumReturn | 141      |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13616470992565155
Validation loss = 0.13504429161548615
Validation loss = 0.13559916615486145
Validation loss = 0.1361810564994812
Validation loss = 0.13582409918308258
Validation loss = 0.13613839447498322
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1361512392759323
Validation loss = 0.13500891625881195
Validation loss = 0.13493013381958008
Validation loss = 0.13537460565567017
Validation loss = 0.1368342787027359
Validation loss = 0.13663172721862793
Validation loss = 0.1368134617805481
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13711042702198029
Validation loss = 0.1360231637954712
Validation loss = 0.13667471706867218
Validation loss = 0.1370319426059723
Validation loss = 0.1371873915195465
Validation loss = 0.13769865036010742
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13725681602954865
Validation loss = 0.13512514531612396
Validation loss = 0.13561563193798065
Validation loss = 0.13611122965812683
Validation loss = 0.13646511733531952
Validation loss = 0.1366768181324005
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1358470767736435
Validation loss = 0.13567426800727844
Validation loss = 0.13620926439762115
Validation loss = 0.13634628057479858
Validation loss = 0.13632428646087646
Validation loss = 0.13551636040210724
Validation loss = 0.13562388718128204
Validation loss = 0.13660871982574463
Validation loss = 0.1362825483083725
Validation loss = 0.13696648180484772
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 956
average number of affinization = 699.6260869565217
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 956
average number of affinization = 701.8362068965517
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 942
average number of affinization = 703.8888888888889
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 959
average number of affinization = 706.0508474576271
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 940
average number of affinization = 708.0168067226891
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 962
average number of affinization = 710.1333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.87e+03 |
| Iteration     | 18       |
| MaximumReturn | 2.59e+03 |
| MinimumReturn | 485      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1352032870054245
Validation loss = 0.13454455137252808
Validation loss = 0.1351042240858078
Validation loss = 0.13541020452976227
Validation loss = 0.13579270243644714
Validation loss = 0.13483083248138428
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13570763170719147
Validation loss = 0.13485115766525269
Validation loss = 0.1331179440021515
Validation loss = 0.1342705339193344
Validation loss = 0.13526979088783264
Validation loss = 0.13512073457241058
Validation loss = 0.13582901656627655
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1355210840702057
Validation loss = 0.13684284687042236
Validation loss = 0.1355745792388916
Validation loss = 0.1353437602519989
Validation loss = 0.13614018261432648
Validation loss = 0.1356218308210373
Validation loss = 0.1367202252149582
Validation loss = 0.13757486641407013
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13697628676891327
Validation loss = 0.13483014702796936
Validation loss = 0.13470833003520966
Validation loss = 0.13386766612529755
Validation loss = 0.1362563818693161
Validation loss = 0.1345359981060028
Validation loss = 0.1359597146511078
Validation loss = 0.13482420146465302
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13721933960914612
Validation loss = 0.13432428240776062
Validation loss = 0.13530318439006805
Validation loss = 0.13445071876049042
Validation loss = 0.1347358226776123
Validation loss = 0.13643357157707214
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 958
average number of affinization = 712.1818181818181
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 942
average number of affinization = 714.0655737704918
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 960
average number of affinization = 716.0650406504066
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 910
average number of affinization = 717.6290322580645
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 943
average number of affinization = 719.432
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 950
average number of affinization = 721.2619047619048
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.42e+03 |
| Iteration     | 19       |
| MaximumReturn | 2.44e+03 |
| MinimumReturn | -409     |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13542430102825165
Validation loss = 0.1329118013381958
Validation loss = 0.13366718590259552
Validation loss = 0.13413333892822266
Validation loss = 0.13466964662075043
Validation loss = 0.1345032900571823
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13361474871635437
Validation loss = 0.13431710004806519
Validation loss = 0.1338447779417038
Validation loss = 0.13326439261436462
Validation loss = 0.1343311220407486
Validation loss = 0.1342027634382248
Validation loss = 0.13340380787849426
Validation loss = 0.1346067190170288
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1359119862318039
Validation loss = 0.13374021649360657
Validation loss = 0.13479992747306824
Validation loss = 0.13565300405025482
Validation loss = 0.1350962370634079
Validation loss = 0.13574250042438507
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13481998443603516
Validation loss = 0.13427959382534027
Validation loss = 0.13417385518550873
Validation loss = 0.13393019139766693
Validation loss = 0.134566068649292
Validation loss = 0.13471268117427826
Validation loss = 0.13477188348770142
Validation loss = 0.13378369808197021
Validation loss = 0.13493932783603668
Validation loss = 0.13382314145565033
Validation loss = 0.1336127519607544
Validation loss = 0.13426117599010468
Validation loss = 0.13368292152881622
Validation loss = 0.13441139459609985
Validation loss = 0.1340184360742569
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13437773287296295
Validation loss = 0.13424836099147797
Validation loss = 0.13353422284126282
Validation loss = 0.13453978300094604
Validation loss = 0.13447387516498566
Validation loss = 0.13505423069000244
Validation loss = 0.13395726680755615
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 942
average number of affinization = 723.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 949
average number of affinization = 724.765625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 952
average number of affinization = 726.5271317829457
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 936
average number of affinization = 728.1384615384616
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 959
average number of affinization = 729.9007633587786
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 934
average number of affinization = 731.4469696969697
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.3e+03  |
| Iteration     | 20       |
| MaximumReturn | 2.53e+03 |
| MinimumReturn | 133      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1328914314508438
Validation loss = 0.13238315284252167
Validation loss = 0.1328047513961792
Validation loss = 0.13392406702041626
Validation loss = 0.13312698900699615
Validation loss = 0.1336934119462967
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13433526456356049
Validation loss = 0.13325637578964233
Validation loss = 0.13250146806240082
Validation loss = 0.13468150794506073
Validation loss = 0.1332399994134903
Validation loss = 0.13267308473587036
Validation loss = 0.1327342987060547
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13365955650806427
Validation loss = 0.13380606472492218
Validation loss = 0.13316792249679565
Validation loss = 0.13419288396835327
Validation loss = 0.13411161303520203
Validation loss = 0.13468848168849945
Validation loss = 0.13471928238868713
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13315166532993317
Validation loss = 0.13185416162014008
Validation loss = 0.13299034535884857
Validation loss = 0.1328321248292923
Validation loss = 0.13307923078536987
Validation loss = 0.13363203406333923
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1343902349472046
Validation loss = 0.13169099390506744
Validation loss = 0.13226860761642456
Validation loss = 0.13347189128398895
Validation loss = 0.13458196818828583
Validation loss = 0.1330721527338028
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 960
average number of affinization = 733.1654135338346
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 953
average number of affinization = 734.8059701492538
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 963
average number of affinization = 736.4962962962964
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 955
average number of affinization = 738.1029411764706
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 964
average number of affinization = 739.7518248175182
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 940
average number of affinization = 741.2028985507246
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.84e+03 |
| Iteration     | 21       |
| MaximumReturn | 2.48e+03 |
| MinimumReturn | 245      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13274279236793518
Validation loss = 0.13110795617103577
Validation loss = 0.131235271692276
Validation loss = 0.13307638466358185
Validation loss = 0.13128267228603363
Validation loss = 0.1353318840265274
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13281036913394928
Validation loss = 0.13168089091777802
Validation loss = 0.13307708501815796
Validation loss = 0.1319318264722824
Validation loss = 0.13177096843719482
Validation loss = 0.13265761733055115
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1337268203496933
Validation loss = 0.1319197714328766
Validation loss = 0.13308794796466827
Validation loss = 0.1326848417520523
Validation loss = 0.13330884277820587
Validation loss = 0.13358910381793976
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13313132524490356
Validation loss = 0.13137806951999664
Validation loss = 0.13255952298641205
Validation loss = 0.13212764263153076
Validation loss = 0.1323690414428711
Validation loss = 0.13236616551876068
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13225050270557404
Validation loss = 0.13082574307918549
Validation loss = 0.13166329264640808
Validation loss = 0.13145655393600464
Validation loss = 0.13231495022773743
Validation loss = 0.1336762011051178
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 968
average number of affinization = 742.8345323741007
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 960
average number of affinization = 744.3857142857142
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 945
average number of affinization = 745.8085106382979
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 949
average number of affinization = 747.2394366197183
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 952
average number of affinization = 748.6713286713286
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 953
average number of affinization = 750.0902777777778
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.44e+03 |
| Iteration     | 22       |
| MaximumReturn | 2.56e+03 |
| MinimumReturn | 2.33e+03 |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13105621933937073
Validation loss = 0.13090698421001434
Validation loss = 0.13201536238193512
Validation loss = 0.1311732679605484
Validation loss = 0.13178597390651703
Validation loss = 0.13209523260593414
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13147062063217163
Validation loss = 0.1295524388551712
Validation loss = 0.1299300342798233
Validation loss = 0.13135617971420288
Validation loss = 0.13112175464630127
Validation loss = 0.13050484657287598
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.132241353392601
Validation loss = 0.1313389092683792
Validation loss = 0.13354337215423584
Validation loss = 0.1315230280160904
Validation loss = 0.1336100697517395
Validation loss = 0.13270889222621918
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13030071556568146
Validation loss = 0.13049215078353882
Validation loss = 0.130343958735466
Validation loss = 0.13198383152484894
Validation loss = 0.13144630193710327
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13127826154232025
Validation loss = 0.129572793841362
Validation loss = 0.13044820725917816
Validation loss = 0.1304667592048645
Validation loss = 0.1297052800655365
Validation loss = 0.13174961507320404
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 957
average number of affinization = 751.5172413793103
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 953
average number of affinization = 752.8972602739726
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 958
average number of affinization = 754.2925170068028
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 951
average number of affinization = 755.6216216216217
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 943
average number of affinization = 756.8791946308725
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 965
average number of affinization = 758.2666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.74e+03 |
| Iteration     | 23       |
| MaximumReturn | 2.56e+03 |
| MinimumReturn | 726      |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13097764551639557
Validation loss = 0.12849636375904083
Validation loss = 0.13014574348926544
Validation loss = 0.13109374046325684
Validation loss = 0.13032199442386627
Validation loss = 0.12992513179779053
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13012729585170746
Validation loss = 0.12904556095600128
Validation loss = 0.13031136989593506
Validation loss = 0.13113388419151306
Validation loss = 0.13116571307182312
Validation loss = 0.1293494552373886
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13175660371780396
Validation loss = 0.12946121394634247
Validation loss = 0.13170874118804932
Validation loss = 0.13197235763072968
Validation loss = 0.13084626197814941
Validation loss = 0.13153843581676483
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.13240686058998108
Validation loss = 0.12961173057556152
Validation loss = 0.1311240792274475
Validation loss = 0.13032986223697662
Validation loss = 0.13009300827980042
Validation loss = 0.1292954385280609
Validation loss = 0.12925703823566437
Validation loss = 0.1307913064956665
Validation loss = 0.12965895235538483
Validation loss = 0.1308095008134842
Validation loss = 0.12972843647003174
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1292886734008789
Validation loss = 0.12928131222724915
Validation loss = 0.12971971929073334
Validation loss = 0.12997210025787354
Validation loss = 0.13087248802185059
Validation loss = 0.12890370190143585
Validation loss = 0.12985099852085114
Validation loss = 0.1305035948753357
Validation loss = 0.13025325536727905
Validation loss = 0.13018038868904114
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 977
average number of affinization = 759.7152317880794
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 967
average number of affinization = 761.078947368421
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 967
average number of affinization = 762.4248366013072
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 961
average number of affinization = 763.7142857142857
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 957
average number of affinization = 764.9612903225807
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 964
average number of affinization = 766.2371794871794
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.34e+03 |
| Iteration     | 24       |
| MaximumReturn | 2.62e+03 |
| MinimumReturn | 2.02e+03 |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1299503743648529
Validation loss = 0.12871919572353363
Validation loss = 0.13009029626846313
Validation loss = 0.1295963078737259
Validation loss = 0.12971840798854828
Validation loss = 0.1300753951072693
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12986868619918823
Validation loss = 0.12795118987560272
Validation loss = 0.12810060381889343
Validation loss = 0.12926746904850006
Validation loss = 0.12935371696949005
Validation loss = 0.12866076827049255
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13096743822097778
Validation loss = 0.12929747998714447
Validation loss = 0.13032472133636475
Validation loss = 0.1300727277994156
Validation loss = 0.13073962926864624
Validation loss = 0.12979842722415924
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12974753975868225
Validation loss = 0.12824492156505585
Validation loss = 0.12874390184879303
Validation loss = 0.12905432283878326
Validation loss = 0.12982603907585144
Validation loss = 0.12963367998600006
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1304754614830017
Validation loss = 0.12862470746040344
Validation loss = 0.12848716974258423
Validation loss = 0.12863953411579132
Validation loss = 0.12927447259426117
Validation loss = 0.1282593160867691
Validation loss = 0.12964726984500885
Validation loss = 0.12975765764713287
Validation loss = 0.1289965808391571
Validation loss = 0.12943623960018158
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 958
average number of affinization = 767.4585987261147
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 963
average number of affinization = 768.6962025316456
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 967
average number of affinization = 769.9433962264151
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 961
average number of affinization = 771.1375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 966
average number of affinization = 772.3478260869565
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 966
average number of affinization = 773.5432098765432
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.51e+03 |
| Iteration     | 25       |
| MaximumReturn | 2.52e+03 |
| MinimumReturn | 120      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13017256557941437
Validation loss = 0.12799426913261414
Validation loss = 0.12936502695083618
Validation loss = 0.12927044928073883
Validation loss = 0.12892386317253113
Validation loss = 0.12924660742282867
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1299852579832077
Validation loss = 0.12801840901374817
Validation loss = 0.12859205901622772
Validation loss = 0.1292596459388733
Validation loss = 0.12882521748542786
Validation loss = 0.12888172268867493
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1302286833524704
Validation loss = 0.12893706560134888
Validation loss = 0.12978771328926086
Validation loss = 0.12964504957199097
Validation loss = 0.13012944161891937
Validation loss = 0.1293542981147766
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1289684772491455
Validation loss = 0.12775620818138123
Validation loss = 0.12788495421409607
Validation loss = 0.1275629699230194
Validation loss = 0.12897323071956635
Validation loss = 0.1287861317396164
Validation loss = 0.12870821356773376
Validation loss = 0.12927263975143433
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.12908591330051422
Validation loss = 0.12808233499526978
Validation loss = 0.1278231292963028
Validation loss = 0.12847919762134552
Validation loss = 0.1280551254749298
Validation loss = 0.12841349840164185
Validation loss = 0.128326416015625
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 970
average number of affinization = 774.7484662576687
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 971
average number of affinization = 775.9451219512196
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 975
average number of affinization = 777.1515151515151
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 978
average number of affinization = 778.3614457831326
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 978
average number of affinization = 779.556886227545
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 976
average number of affinization = 780.7261904761905
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.97e+03 |
| Iteration     | 26       |
| MaximumReturn | 2.39e+03 |
| MinimumReturn | 915      |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12742753326892853
Validation loss = 0.12730661034584045
Validation loss = 0.1272168904542923
Validation loss = 0.1285945028066635
Validation loss = 0.12715037167072296
Validation loss = 0.12804196774959564
Validation loss = 0.12831725180149078
Validation loss = 0.12831038236618042
Validation loss = 0.1280449777841568
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1280204951763153
Validation loss = 0.12765875458717346
Validation loss = 0.12748777866363525
Validation loss = 0.1284411996603012
Validation loss = 0.12749947607517242
Validation loss = 0.12738823890686035
Validation loss = 0.12751439213752747
Validation loss = 0.12794964015483856
Validation loss = 0.12811219692230225
Validation loss = 0.12693484127521515
Validation loss = 0.12733666598796844
Validation loss = 0.1271393597126007
Validation loss = 0.12752306461334229
Validation loss = 0.12725992500782013
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1294563114643097
Validation loss = 0.12841854989528656
Validation loss = 0.12784907221794128
Validation loss = 0.12886236608028412
Validation loss = 0.1285293847322464
Validation loss = 0.12857162952423096
Validation loss = 0.1293061524629593
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12814272940158844
Validation loss = 0.1269608587026596
Validation loss = 0.12695729732513428
Validation loss = 0.12808212637901306
Validation loss = 0.1269562542438507
Validation loss = 0.12838587164878845
Validation loss = 0.12860439717769623
Validation loss = 0.127860426902771
Validation loss = 0.12735597789287567
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.129598930478096
Validation loss = 0.12670786678791046
Validation loss = 0.12765510380268097
Validation loss = 0.12714464962482452
Validation loss = 0.12731435894966125
Validation loss = 0.12801074981689453
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 975
average number of affinization = 781.8757396449704
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 969
average number of affinization = 782.9764705882353
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 975
average number of affinization = 784.0994152046784
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 961
average number of affinization = 785.1279069767442
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 972
average number of affinization = 786.2080924855492
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 969
average number of affinization = 787.2586206896551
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.59e+03 |
| Iteration     | 27       |
| MaximumReturn | 2.75e+03 |
| MinimumReturn | 2.35e+03 |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1273968666791916
Validation loss = 0.12672697007656097
Validation loss = 0.12764737010002136
Validation loss = 0.12755508720874786
Validation loss = 0.1266210973262787
Validation loss = 0.12737052142620087
Validation loss = 0.1266850233078003
Validation loss = 0.1271948218345642
Validation loss = 0.12727919220924377
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12718231976032257
Validation loss = 0.12557974457740784
Validation loss = 0.12641210854053497
Validation loss = 0.1258784383535385
Validation loss = 0.12655438482761383
Validation loss = 0.12662814557552338
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.128961980342865
Validation loss = 0.12684716284275055
Validation loss = 0.1273699402809143
Validation loss = 0.12764178216457367
Validation loss = 0.12858161330223083
Validation loss = 0.1276208758354187
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12813767790794373
Validation loss = 0.1262700855731964
Validation loss = 0.127385213971138
Validation loss = 0.12703190743923187
Validation loss = 0.12643317878246307
Validation loss = 0.12609778344631195
Validation loss = 0.12714819610118866
Validation loss = 0.12687291204929352
Validation loss = 0.12638741731643677
Validation loss = 0.12654438614845276
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1275845468044281
Validation loss = 0.12582285702228546
Validation loss = 0.12647922337055206
Validation loss = 0.12635116279125214
Validation loss = 0.12649428844451904
Validation loss = 0.12649689614772797
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 977
average number of affinization = 788.3428571428572
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 961
average number of affinization = 789.3238636363636
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 973
average number of affinization = 790.361581920904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 974
average number of affinization = 791.3932584269663
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 969
average number of affinization = 792.3854748603352
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 960
average number of affinization = 793.3166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.54e+03 |
| Iteration     | 28       |
| MaximumReturn | 2.38e+03 |
| MinimumReturn | 208      |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12922519445419312
Validation loss = 0.12629400193691254
Validation loss = 0.1272653043270111
Validation loss = 0.12720461189746857
Validation loss = 0.1257762312889099
Validation loss = 0.12582045793533325
Validation loss = 0.1272580772638321
Validation loss = 0.12678614258766174
Validation loss = 0.1263023465871811
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12699884176254272
Validation loss = 0.12586745619773865
Validation loss = 0.12616926431655884
Validation loss = 0.12611433863639832
Validation loss = 0.12687820196151733
Validation loss = 0.12589356303215027
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12771965563297272
Validation loss = 0.126247838139534
Validation loss = 0.12760019302368164
Validation loss = 0.12715575098991394
Validation loss = 0.12771382927894592
Validation loss = 0.12783852219581604
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1284533590078354
Validation loss = 0.12554776668548584
Validation loss = 0.12623481452465057
Validation loss = 0.12610813975334167
Validation loss = 0.12590987980365753
Validation loss = 0.12601500749588013
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1258297860622406
Validation loss = 0.12569919228553772
Validation loss = 0.12624981999397278
Validation loss = 0.1259451061487198
Validation loss = 0.12573465704917908
Validation loss = 0.1269104927778244
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 960
average number of affinization = 794.2375690607735
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 952
average number of affinization = 795.1043956043956
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 985
average number of affinization = 796.1420765027323
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 981
average number of affinization = 797.1467391304348
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 951
average number of affinization = 797.9783783783784
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 979
average number of affinization = 798.9516129032259
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.55e+03 |
| Iteration     | 29       |
| MaximumReturn | 2.59e+03 |
| MinimumReturn | 444      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1256542056798935
Validation loss = 0.12521585822105408
Validation loss = 0.12604454159736633
Validation loss = 0.12646189332008362
Validation loss = 0.12586678564548492
Validation loss = 0.12515288591384888
Validation loss = 0.1254449486732483
Validation loss = 0.12622861564159393
Validation loss = 0.12456503510475159
Validation loss = 0.12547440826892853
Validation loss = 0.1268044114112854
Validation loss = 0.12587971985340118
Validation loss = 0.12626983225345612
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1274702250957489
Validation loss = 0.12521038949489594
Validation loss = 0.12527257204055786
Validation loss = 0.12580864131450653
Validation loss = 0.12634064257144928
Validation loss = 0.12550386786460876
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12713167071342468
Validation loss = 0.12639662623405457
Validation loss = 0.12752561271190643
Validation loss = 0.12737713754177094
Validation loss = 0.12704063951969147
Validation loss = 0.12725228071212769
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12676291167736053
Validation loss = 0.12531150877475739
Validation loss = 0.12585122883319855
Validation loss = 0.12579059600830078
Validation loss = 0.12667793035507202
Validation loss = 0.12678763270378113
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1258312612771988
Validation loss = 0.1241055279970169
Validation loss = 0.12599146366119385
Validation loss = 0.1254780888557434
Validation loss = 0.12682507932186127
Validation loss = 0.12735304236412048
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 953
average number of affinization = 799.7754010695187
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 948
average number of affinization = 800.563829787234
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 970
average number of affinization = 801.4603174603175
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 969
average number of affinization = 802.3421052631579
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 968
average number of affinization = 803.2094240837696
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 977
average number of affinization = 804.1145833333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.56e+03 |
| Iteration     | 30       |
| MaximumReturn | 2.55e+03 |
| MinimumReturn | -138     |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12711060047149658
Validation loss = 0.12451022863388062
Validation loss = 0.1248362585902214
Validation loss = 0.12555697560310364
Validation loss = 0.1253412514925003
Validation loss = 0.1261466145515442
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12660004198551178
Validation loss = 0.1242312490940094
Validation loss = 0.12482002377510071
Validation loss = 0.1253175437450409
Validation loss = 0.1252901256084442
Validation loss = 0.12533098459243774
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12774249911308289
Validation loss = 0.1259976625442505
Validation loss = 0.12589313089847565
Validation loss = 0.12639397382736206
Validation loss = 0.12732133269309998
Validation loss = 0.12719273567199707
Validation loss = 0.12665045261383057
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12547490000724792
Validation loss = 0.12514573335647583
Validation loss = 0.1244564801454544
Validation loss = 0.12572039663791656
Validation loss = 0.125016987323761
Validation loss = 0.1253010332584381
Validation loss = 0.1261659562587738
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1273198425769806
Validation loss = 0.12436723709106445
Validation loss = 0.12535510957241058
Validation loss = 0.12531310319900513
Validation loss = 0.12514622509479523
Validation loss = 0.12591156363487244
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 979
average number of affinization = 805.020725388601
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 983
average number of affinization = 805.9381443298969
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 970
average number of affinization = 806.7794871794872
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 975
average number of affinization = 807.6377551020408
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 977
average number of affinization = 808.4974619289341
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 983
average number of affinization = 809.3787878787879
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.45e+03 |
| Iteration     | 31       |
| MaximumReturn | 2.71e+03 |
| MinimumReturn | 2.04e+03 |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1252274066209793
Validation loss = 0.12398089468479156
Validation loss = 0.1252954602241516
Validation loss = 0.12440001219511032
Validation loss = 0.12409603595733643
Validation loss = 0.12484301626682281
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1243964284658432
Validation loss = 0.12382736802101135
Validation loss = 0.12449315190315247
Validation loss = 0.12424453347921371
Validation loss = 0.1238546073436737
Validation loss = 0.12452605366706848
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1260673999786377
Validation loss = 0.12591774761676788
Validation loss = 0.12586717307567596
Validation loss = 0.1262962818145752
Validation loss = 0.12513776123523712
Validation loss = 0.12561579048633575
Validation loss = 0.12516091763973236
Validation loss = 0.12679816782474518
Validation loss = 0.1258435696363449
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1249258890748024
Validation loss = 0.12405246496200562
Validation loss = 0.12550094723701477
Validation loss = 0.12460003793239594
Validation loss = 0.12450041621923447
Validation loss = 0.12413179874420166
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.12503647804260254
Validation loss = 0.12456432729959488
Validation loss = 0.1249428242444992
Validation loss = 0.12532857060432434
Validation loss = 0.1249699741601944
Validation loss = 0.12430104613304138
Validation loss = 0.12478221207857132
Validation loss = 0.12426955997943878
Validation loss = 0.12390092015266418
Validation loss = 0.12386277318000793
Validation loss = 0.12524138391017914
Validation loss = 0.12474852055311203
Validation loss = 0.12422902882099152
Validation loss = 0.12580277025699615
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 965
average number of affinization = 810.1608040201005
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 985
average number of affinization = 811.035
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 977
average number of affinization = 811.860696517413
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 975
average number of affinization = 812.6683168316831
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 983
average number of affinization = 813.5073891625616
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 975
average number of affinization = 814.2990196078431
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.11e+03 |
| Iteration     | 32       |
| MaximumReturn | 2.45e+03 |
| MinimumReturn | 1.46e+03 |
| TotalSamples  | 136000   |
----------------------------
