Logging to experiments/hopper/oct31/w350e03_Durl_seed2231
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.47307896614074707
Validation loss = 0.22882641851902008
Validation loss = 0.19187213480472565
Validation loss = 0.18591928482055664
Validation loss = 0.18742313981056213
Validation loss = 0.19413813948631287
Validation loss = 0.1922641396522522
Validation loss = 0.2038457691669464
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46809306740760803
Validation loss = 0.22472083568572998
Validation loss = 0.18837107717990875
Validation loss = 0.18292340636253357
Validation loss = 0.18477825820446014
Validation loss = 0.19703951478004456
Validation loss = 0.19801411032676697
Validation loss = 0.19518151879310608
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44333121180534363
Validation loss = 0.22506922483444214
Validation loss = 0.18892064690589905
Validation loss = 0.18715015053749084
Validation loss = 0.18031594157218933
Validation loss = 0.18963634967803955
Validation loss = 0.18993115425109863
Validation loss = 0.1961786448955536
Validation loss = 0.20382778346538544
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35475555062294006
Validation loss = 0.21192437410354614
Validation loss = 0.1855074018239975
Validation loss = 0.1879969984292984
Validation loss = 0.1904071867465973
Validation loss = 0.18604694306850433
Validation loss = 0.19691500067710876
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4633393883705139
Validation loss = 0.22224187850952148
Validation loss = 0.19353973865509033
Validation loss = 0.1870187669992447
Validation loss = 0.18900048732757568
Validation loss = 0.19172999262809753
Validation loss = 0.2167517989873886
Validation loss = 0.1902400255203247
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 327
average number of affinization = 46.714285714285715
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 347
average number of affinization = 84.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 371
average number of affinization = 116.11111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 352
average number of affinization = 139.7
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 336
average number of affinization = 157.54545454545453
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 347
average number of affinization = 173.33333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.59e+03 |
| Iteration     | 0         |
| MaximumReturn | -1.42e+03 |
| MinimumReturn | -1.74e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24397167563438416
Validation loss = 0.1926380842924118
Validation loss = 0.18035992980003357
Validation loss = 0.18468248844146729
Validation loss = 0.18617689609527588
Validation loss = 0.18565374612808228
Validation loss = 0.18331609666347504
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24031734466552734
Validation loss = 0.19678300619125366
Validation loss = 0.19801469147205353
Validation loss = 0.18499113619327545
Validation loss = 0.18613415956497192
Validation loss = 0.18540340662002563
Validation loss = 0.19172701239585876
Validation loss = 0.1850232630968094
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.23940831422805786
Validation loss = 0.19542506337165833
Validation loss = 0.18885165452957153
Validation loss = 0.19045253098011017
Validation loss = 0.18477830290794373
Validation loss = 0.19226402044296265
Validation loss = 0.18414127826690674
Validation loss = 0.17886967957019806
Validation loss = 0.18530219793319702
Validation loss = 0.19181977212429047
Validation loss = 0.18678632378578186
Validation loss = 0.1921708732843399
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22882160544395447
Validation loss = 0.19282080233097076
Validation loss = 0.18721003830432892
Validation loss = 0.18020415306091309
Validation loss = 0.1831236481666565
Validation loss = 0.1792207509279251
Validation loss = 0.17099764943122864
Validation loss = 0.18468305468559265
Validation loss = 0.17387546598911285
Validation loss = 0.18718889355659485
Validation loss = 0.19018246233463287
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24400240182876587
Validation loss = 0.19650809466838837
Validation loss = 0.18536296486854553
Validation loss = 0.18457499146461487
Validation loss = 0.1839473396539688
Validation loss = 0.1807178258895874
Validation loss = 0.17387071251869202
Validation loss = 0.1745864450931549
Validation loss = 0.1781223714351654
Validation loss = 0.17338916659355164
Validation loss = 0.18710151314735413
Validation loss = 0.1800183653831482
Validation loss = 0.18570928275585175
Validation loss = 0.1718437224626541
Validation loss = 0.17963360249996185
Validation loss = 0.18060745298862457
Validation loss = 0.18059198558330536
Validation loss = 0.19129204750061035
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 194.07692307692307
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 492
average number of affinization = 215.35714285714286
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 497
average number of affinization = 234.13333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 248.125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 436
average number of affinization = 259.1764705882353
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 270.3333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.73e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.32e+03 |
| MinimumReturn | -2.01e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2886222302913666
Validation loss = 0.24959544837474823
Validation loss = 0.2557957172393799
Validation loss = 0.23774783313274384
Validation loss = 0.24080683290958405
Validation loss = 0.23595111072063446
Validation loss = 0.24283523857593536
Validation loss = 0.2355254888534546
Validation loss = 0.23878340423107147
Validation loss = 0.24552349746227264
Validation loss = 0.248642697930336
Validation loss = 0.2351377010345459
Validation loss = 0.24461834132671356
Validation loss = 0.24016179144382477
Validation loss = 0.24605530500411987
Validation loss = 0.23560290038585663
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30243828892707825
Validation loss = 0.26220500469207764
Validation loss = 0.26231375336647034
Validation loss = 0.24643971025943756
Validation loss = 0.24047665297985077
Validation loss = 0.2322903871536255
Validation loss = 0.23530347645282745
Validation loss = 0.24264824390411377
Validation loss = 0.23339207470417023
Validation loss = 0.23424722254276276
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2963877022266388
Validation loss = 0.2523132264614105
Validation loss = 0.24938668310642242
Validation loss = 0.23723244667053223
Validation loss = 0.23339486122131348
Validation loss = 0.23700757324695587
Validation loss = 0.2413628250360489
Validation loss = 0.23553413152694702
Validation loss = 0.25243785977363586
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2861471474170685
Validation loss = 0.2482500523328781
Validation loss = 0.24499821662902832
Validation loss = 0.24126039445400238
Validation loss = 0.23958416283130646
Validation loss = 0.23249979317188263
Validation loss = 0.23349161446094513
Validation loss = 0.24835489690303802
Validation loss = 0.23111766576766968
Validation loss = 0.23761169612407684
Validation loss = 0.240613654255867
Validation loss = 0.23157672584056854
Validation loss = 0.25757521390914917
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30298635363578796
Validation loss = 0.2444583773612976
Validation loss = 0.2416604906320572
Validation loss = 0.23688757419586182
Validation loss = 0.2472168356180191
Validation loss = 0.2309895008802414
Validation loss = 0.23937958478927612
Validation loss = 0.22928698360919952
Validation loss = 0.2333848476409912
Validation loss = 0.262638121843338
Validation loss = 0.23061944544315338
Validation loss = 0.24532681703567505
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 624
average number of affinization = 288.94736842105266
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 636
average number of affinization = 306.3
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 643
average number of affinization = 322.3333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 641
average number of affinization = 336.8181818181818
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 587
average number of affinization = 347.69565217391306
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 621
average number of affinization = 359.0833333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.69e+03 |
| Iteration     | 2         |
| MaximumReturn | -2.38e+03 |
| MinimumReturn | -2.79e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35168182849884033
Validation loss = 0.23078659176826477
Validation loss = 0.22848638892173767
Validation loss = 0.22265198826789856
Validation loss = 0.21585015952587128
Validation loss = 0.20786508917808533
Validation loss = 0.21178759634494781
Validation loss = 0.2078152298927307
Validation loss = 0.21128486096858978
Validation loss = 0.21203874051570892
Validation loss = 0.20793989300727844
Validation loss = 0.20888009667396545
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30280518531799316
Validation loss = 0.23749399185180664
Validation loss = 0.23003312945365906
Validation loss = 0.2226838618516922
Validation loss = 0.21771857142448425
Validation loss = 0.2260386198759079
Validation loss = 0.22164049744606018
Validation loss = 0.22624075412750244
Validation loss = 0.22709596157073975
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2737150192260742
Validation loss = 0.23493653535842896
Validation loss = 0.2244676649570465
Validation loss = 0.22365376353263855
Validation loss = 0.227645143866539
Validation loss = 0.21594029664993286
Validation loss = 0.22594726085662842
Validation loss = 0.22214362025260925
Validation loss = 0.217906191945076
Validation loss = 0.21850816905498505
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28756484389305115
Validation loss = 0.2305077463388443
Validation loss = 0.22111359238624573
Validation loss = 0.23181882500648499
Validation loss = 0.21422675251960754
Validation loss = 0.2256617695093155
Validation loss = 0.2180197387933731
Validation loss = 0.2189028561115265
Validation loss = 0.225786492228508
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29956620931625366
Validation loss = 0.23103812336921692
Validation loss = 0.22171324491500854
Validation loss = 0.2171659916639328
Validation loss = 0.22212642431259155
Validation loss = 0.2230425775051117
Validation loss = 0.22278600931167603
Validation loss = 0.22738948464393616
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 379
average number of affinization = 359.88
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 361.38461538461536
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 409
average number of affinization = 363.14814814814815
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 364.2857142857143
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 366.44827586206895
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 433
average number of affinization = 368.6666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.3e+03  |
| Iteration     | 3         |
| MaximumReturn | -1.17e+03 |
| MinimumReturn | -1.43e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20945389568805695
Validation loss = 0.18131494522094727
Validation loss = 0.1832098811864853
Validation loss = 0.17837682366371155
Validation loss = 0.16924962401390076
Validation loss = 0.17389997839927673
Validation loss = 0.16843684017658234
Validation loss = 0.16792839765548706
Validation loss = 0.1661292314529419
Validation loss = 0.1744648665189743
Validation loss = 0.16672183573246002
Validation loss = 0.16368074715137482
Validation loss = 0.16647323966026306
Validation loss = 0.17222566902637482
Validation loss = 0.16865788400173187
Validation loss = 0.16751766204833984
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23283842206001282
Validation loss = 0.1974245011806488
Validation loss = 0.19070535898208618
Validation loss = 0.18615534901618958
Validation loss = 0.1817699372768402
Validation loss = 0.17944838106632233
Validation loss = 0.17845289409160614
Validation loss = 0.18162164092063904
Validation loss = 0.1756025105714798
Validation loss = 0.17835953831672668
Validation loss = 0.1773148626089096
Validation loss = 0.18635712563991547
Validation loss = 0.1743430197238922
Validation loss = 0.17465919256210327
Validation loss = 0.1746681034564972
Validation loss = 0.17749305069446564
Validation loss = 0.17415675520896912
Validation loss = 0.17193233966827393
Validation loss = 0.17139950394630432
Validation loss = 0.17605307698249817
Validation loss = 0.17313988506793976
Validation loss = 0.17811566591262817
Validation loss = 0.17354588210582733
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2093847543001175
Validation loss = 0.19074761867523193
Validation loss = 0.18631306290626526
Validation loss = 0.18214896321296692
Validation loss = 0.1747068464756012
Validation loss = 0.17579582333564758
Validation loss = 0.17419162392616272
Validation loss = 0.17848438024520874
Validation loss = 0.1712266057729721
Validation loss = 0.1808921992778778
Validation loss = 0.1737280786037445
Validation loss = 0.1691851168870926
Validation loss = 0.17691799998283386
Validation loss = 0.1727340817451477
Validation loss = 0.16583406925201416
Validation loss = 0.17133952677249908
Validation loss = 0.16930963099002838
Validation loss = 0.16915935277938843
Validation loss = 0.16894611716270447
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2175043821334839
Validation loss = 0.19296692311763763
Validation loss = 0.18565520644187927
Validation loss = 0.1786443144083023
Validation loss = 0.17718133330345154
Validation loss = 0.1758025884628296
Validation loss = 0.17149847745895386
Validation loss = 0.1838279366493225
Validation loss = 0.17558932304382324
Validation loss = 0.16849596798419952
Validation loss = 0.16761288046836853
Validation loss = 0.17289207875728607
Validation loss = 0.16720351576805115
Validation loss = 0.1718733012676239
Validation loss = 0.17049238085746765
Validation loss = 0.16885465383529663
Validation loss = 0.1710730791091919
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24785558879375458
Validation loss = 0.20250773429870605
Validation loss = 0.18727096915245056
Validation loss = 0.18698087334632874
Validation loss = 0.19024872779846191
Validation loss = 0.18610772490501404
Validation loss = 0.18455539643764496
Validation loss = 0.1782473921775818
Validation loss = 0.17836368083953857
Validation loss = 0.17755085229873657
Validation loss = 0.17854037880897522
Validation loss = 0.19376614689826965
Validation loss = 0.174344003200531
Validation loss = 0.17570000886917114
Validation loss = 0.17346474528312683
Validation loss = 0.1726667284965515
Validation loss = 0.19535329937934875
Validation loss = 0.1807876080274582
Validation loss = 0.1725860834121704
Validation loss = 0.17376650869846344
Validation loss = 0.1718670129776001
Validation loss = 0.1738506406545639
Validation loss = 0.1737542599439621
Validation loss = 0.18190011382102966
Validation loss = 0.17041228711605072
Validation loss = 0.17096595466136932
Validation loss = 0.16874299943447113
Validation loss = 0.1725156605243683
Validation loss = 0.1674879491329193
Validation loss = 0.16568902134895325
Validation loss = 0.17644965648651123
Validation loss = 0.17388302087783813
Validation loss = 0.17239397764205933
Validation loss = 0.17500966787338257
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 372.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 492
average number of affinization = 375.75
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 436
average number of affinization = 377.57575757575756
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 470
average number of affinization = 380.29411764705884
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 483
average number of affinization = 383.22857142857146
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 469
average number of affinization = 385.6111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.21e+03 |
| Iteration     | 4         |
| MaximumReturn | -1.05e+03 |
| MinimumReturn | -1.31e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.15923412144184113
Validation loss = 0.14085735380649567
Validation loss = 0.1427651345729828
Validation loss = 0.141313374042511
Validation loss = 0.14240048825740814
Validation loss = 0.1378813087940216
Validation loss = 0.13968224823474884
Validation loss = 0.13655751943588257
Validation loss = 0.13645322620868683
Validation loss = 0.13692589104175568
Validation loss = 0.13347089290618896
Validation loss = 0.1407538205385208
Validation loss = 0.1378258764743805
Validation loss = 0.13719379901885986
Validation loss = 0.13756366074085236
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18214726448059082
Validation loss = 0.15078745782375336
Validation loss = 0.14833742380142212
Validation loss = 0.14814506471157074
Validation loss = 0.15005546808242798
Validation loss = 0.15043905377388
Validation loss = 0.14427655935287476
Validation loss = 0.14435133337974548
Validation loss = 0.14327625930309296
Validation loss = 0.14133045077323914
Validation loss = 0.14472003281116486
Validation loss = 0.14451105892658234
Validation loss = 0.13947844505310059
Validation loss = 0.13742370903491974
Validation loss = 0.1428305059671402
Validation loss = 0.13862727582454681
Validation loss = 0.13840574026107788
Validation loss = 0.13367745280265808
Validation loss = 0.13561518490314484
Validation loss = 0.1385205090045929
Validation loss = 0.13999813795089722
Validation loss = 0.13865166902542114
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1805635690689087
Validation loss = 0.1480431705713272
Validation loss = 0.14233426749706268
Validation loss = 0.14623421430587769
Validation loss = 0.13961078226566315
Validation loss = 0.1431926041841507
Validation loss = 0.14109204709529877
Validation loss = 0.14017486572265625
Validation loss = 0.13362839818000793
Validation loss = 0.1447352170944214
Validation loss = 0.14161467552185059
Validation loss = 0.14340369403362274
Validation loss = 0.1374656707048416
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16796933114528656
Validation loss = 0.14989475905895233
Validation loss = 0.14289139211177826
Validation loss = 0.14408525824546814
Validation loss = 0.16300402581691742
Validation loss = 0.13974489271640778
Validation loss = 0.15034569799900055
Validation loss = 0.13542203605175018
Validation loss = 0.13893486559391022
Validation loss = 0.13822539150714874
Validation loss = 0.1353994607925415
Validation loss = 0.1392887383699417
Validation loss = 0.1383049339056015
Validation loss = 0.13828465342521667
Validation loss = 0.13384006917476654
Validation loss = 0.1323917806148529
Validation loss = 0.1345411092042923
Validation loss = 0.13338765501976013
Validation loss = 0.13511455059051514
Validation loss = 0.14230065047740936
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16406501829624176
Validation loss = 0.14674153923988342
Validation loss = 0.14160315692424774
Validation loss = 0.14380131661891937
Validation loss = 0.13984522223472595
Validation loss = 0.1437777727842331
Validation loss = 0.1412595510482788
Validation loss = 0.13726530969142914
Validation loss = 0.1462307721376419
Validation loss = 0.13877169787883759
Validation loss = 0.136783167719841
Validation loss = 0.13581104576587677
Validation loss = 0.13666467368602753
Validation loss = 0.14022916555404663
Validation loss = 0.1379302442073822
Validation loss = 0.13998191058635712
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 354
average number of affinization = 384.7567567567568
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 384.7368421052632
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 365
average number of affinization = 384.2307692307692
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 345
average number of affinization = 383.25
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 332
average number of affinization = 382.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 382.14285714285717
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -423     |
| Iteration     | 5        |
| MaximumReturn | -115     |
| MinimumReturn | -628     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.15704429149627686
Validation loss = 0.11958379298448563
Validation loss = 0.11851394921541214
Validation loss = 0.1104850098490715
Validation loss = 0.11042438447475433
Validation loss = 0.11080869287252426
Validation loss = 0.11471498757600784
Validation loss = 0.11288745701313019
Validation loss = 0.11055020242929459
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13695935904979706
Validation loss = 0.12198866158723831
Validation loss = 0.12779764831066132
Validation loss = 0.11446277797222137
Validation loss = 0.11215630918741226
Validation loss = 0.11502797156572342
Validation loss = 0.1101769208908081
Validation loss = 0.11407484859228134
Validation loss = 0.11143125593662262
Validation loss = 0.11957568675279617
Validation loss = 0.11540890485048294
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1418086141347885
Validation loss = 0.12425149977207184
Validation loss = 0.11891267448663712
Validation loss = 0.11591265350580215
Validation loss = 0.1164293885231018
Validation loss = 0.11362993717193604
Validation loss = 0.12029097974300385
Validation loss = 0.11464160680770874
Validation loss = 0.11346781253814697
Validation loss = 0.11406558752059937
Validation loss = 0.11551009863615036
Validation loss = 0.11329174786806107
Validation loss = 0.1162780299782753
Validation loss = 0.11471565812826157
Validation loss = 0.11353661119937897
Validation loss = 0.10967783629894257
Validation loss = 0.11481727659702301
Validation loss = 0.11413853615522385
Validation loss = 0.11534867435693741
Validation loss = 0.10970955342054367
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1507827192544937
Validation loss = 0.13052615523338318
Validation loss = 0.11425555497407913
Validation loss = 0.11412955075502396
Validation loss = 0.11278556287288666
Validation loss = 0.11502303183078766
Validation loss = 0.11186189949512482
Validation loss = 0.11028845608234406
Validation loss = 0.1136743500828743
Validation loss = 0.11104150116443634
Validation loss = 0.11254775524139404
Validation loss = 0.11214137077331543
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1426318734884262
Validation loss = 0.1255151778459549
Validation loss = 0.12154259532690048
Validation loss = 0.11418463289737701
Validation loss = 0.11751075834035873
Validation loss = 0.11376044899225235
Validation loss = 0.11617975682020187
Validation loss = 0.11280407756567001
Validation loss = 0.11219130456447601
Validation loss = 0.11721473932266235
Validation loss = 0.11106348782777786
Validation loss = 0.11067898571491241
Validation loss = 0.11334723979234695
Validation loss = 0.11362447589635849
Validation loss = 0.1110454648733139
Validation loss = 0.11099635064601898
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 383.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 384.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 445
average number of affinization = 385.35555555555555
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 386.2173913043478
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 420
average number of affinization = 386.93617021276594
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 434
average number of affinization = 387.9166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -625     |
| Iteration     | 6        |
| MaximumReturn | -434     |
| MinimumReturn | -819     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11481016874313354
Validation loss = 0.10641058534383774
Validation loss = 0.10264045745134354
Validation loss = 0.0983162596821785
Validation loss = 0.09976699203252792
Validation loss = 0.10206028819084167
Validation loss = 0.09888198971748352
Validation loss = 0.10571051388978958
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12042221426963806
Validation loss = 0.10512077808380127
Validation loss = 0.10386133193969727
Validation loss = 0.10615105926990509
Validation loss = 0.10480150580406189
Validation loss = 0.10126812011003494
Validation loss = 0.10938908159732819
Validation loss = 0.10274428129196167
Validation loss = 0.10413244366645813
Validation loss = 0.10146192461252213
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11850747466087341
Validation loss = 0.10331644117832184
Validation loss = 0.1049196720123291
Validation loss = 0.0993766337633133
Validation loss = 0.09844550490379333
Validation loss = 0.09700420498847961
Validation loss = 0.10248614847660065
Validation loss = 0.09882768988609314
Validation loss = 0.09746664762496948
Validation loss = 0.10103391110897064
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1307353377342224
Validation loss = 0.10506220906972885
Validation loss = 0.09788778424263
Validation loss = 0.09943176060914993
Validation loss = 0.11234656721353531
Validation loss = 0.10026494413614273
Validation loss = 0.09810008853673935
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11189274489879608
Validation loss = 0.10316437482833862
Validation loss = 0.09967243671417236
Validation loss = 0.1008361279964447
Validation loss = 0.10112603008747101
Validation loss = 0.10030995309352875
Validation loss = 0.0988360121846199
Validation loss = 0.09890693426132202
Validation loss = 0.10156390070915222
Validation loss = 0.09831889718770981
Validation loss = 0.10483276844024658
Validation loss = 0.10143467783927917
Validation loss = 0.10103599727153778
Validation loss = 0.09744749963283539
Validation loss = 0.09764888882637024
Validation loss = 0.10433626174926758
Validation loss = 0.09814661741256714
Validation loss = 0.0944095253944397
Validation loss = 0.09553144872188568
Validation loss = 0.09853821992874146
Validation loss = 0.09649494290351868
Validation loss = 0.11530835926532745
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 352
average number of affinization = 387.18367346938777
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 386.3
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 349
average number of affinization = 385.5686274509804
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 354
average number of affinization = 384.96153846153845
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 357
average number of affinization = 384.4339622641509
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 314
average number of affinization = 383.1296296296296
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -28      |
| Iteration     | 7        |
| MaximumReturn | 529      |
| MinimumReturn | -251     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11318066716194153
Validation loss = 0.0969652608036995
Validation loss = 0.09007825702428818
Validation loss = 0.09271921962499619
Validation loss = 0.09021273255348206
Validation loss = 0.08977028727531433
Validation loss = 0.08636000752449036
Validation loss = 0.09839361906051636
Validation loss = 0.09324301034212112
Validation loss = 0.08924052119255066
Validation loss = 0.0921030044555664
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10937952995300293
Validation loss = 0.097623310983181
Validation loss = 0.09892915934324265
Validation loss = 0.09054496884346008
Validation loss = 0.08879479765892029
Validation loss = 0.08747658133506775
Validation loss = 0.09446727484464645
Validation loss = 0.1016906201839447
Validation loss = 0.0903482437133789
Validation loss = 0.0920250415802002
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11177073419094086
Validation loss = 0.09271931648254395
Validation loss = 0.09609904140233994
Validation loss = 0.09435833990573883
Validation loss = 0.09110367298126221
Validation loss = 0.09040229767560959
Validation loss = 0.08866777271032333
Validation loss = 0.08888334035873413
Validation loss = 0.09211605042219162
Validation loss = 0.09128463268280029
Validation loss = 0.08902226388454437
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11532490700483322
Validation loss = 0.09139563143253326
Validation loss = 0.09096840769052505
Validation loss = 0.08794836699962616
Validation loss = 0.09127417951822281
Validation loss = 0.0877261608839035
Validation loss = 0.08952350169420242
Validation loss = 0.08767275512218475
Validation loss = 0.09061133116483688
Validation loss = 0.08954887092113495
Validation loss = 0.08901526033878326
Validation loss = 0.08658650517463684
Validation loss = 0.08668401837348938
Validation loss = 0.0890783816576004
Validation loss = 0.08793769776821136
Validation loss = 0.08901092410087585
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11519268900156021
Validation loss = 0.09125027060508728
Validation loss = 0.08764877915382385
Validation loss = 0.08855021744966507
Validation loss = 0.08824559301137924
Validation loss = 0.08674351871013641
Validation loss = 0.08413884788751602
Validation loss = 0.08604811877012253
Validation loss = 0.09107330441474915
Validation loss = 0.08579444140195847
Validation loss = 0.08365556597709656
Validation loss = 0.0844203531742096
Validation loss = 0.089290089905262
Validation loss = 0.08385458588600159
Validation loss = 0.088229238986969
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 382.96363636363634
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 320
average number of affinization = 381.8392857142857
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 330
average number of affinization = 380.9298245614035
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 284
average number of affinization = 379.2586206896552
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 342
average number of affinization = 378.6271186440678
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 354
average number of affinization = 378.21666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -513      |
| Iteration     | 8         |
| MaximumReturn | -58.8     |
| MinimumReturn | -2.16e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10233569145202637
Validation loss = 0.08430976420640945
Validation loss = 0.08029983192682266
Validation loss = 0.08014796674251556
Validation loss = 0.08013863861560822
Validation loss = 0.08410318195819855
Validation loss = 0.0792364776134491
Validation loss = 0.08624239265918732
Validation loss = 0.07834596931934357
Validation loss = 0.07919667661190033
Validation loss = 0.07961618900299072
Validation loss = 0.08731520920991898
Validation loss = 0.08104346692562103
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11529663950204849
Validation loss = 0.08948265761137009
Validation loss = 0.0829184278845787
Validation loss = 0.0818011537194252
Validation loss = 0.08302034437656403
Validation loss = 0.08587446808815002
Validation loss = 0.07955309003591537
Validation loss = 0.08805485814809799
Validation loss = 0.08087669312953949
Validation loss = 0.08122240751981735
Validation loss = 0.08002115786075592
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10173586755990982
Validation loss = 0.08016066998243332
Validation loss = 0.08648823946714401
Validation loss = 0.08175569772720337
Validation loss = 0.08192678540945053
Validation loss = 0.07938523590564728
Validation loss = 0.08201905339956284
Validation loss = 0.07938793301582336
Validation loss = 0.07904072105884552
Validation loss = 0.08651530742645264
Validation loss = 0.0788152664899826
Validation loss = 0.07720454037189484
Validation loss = 0.08004587888717651
Validation loss = 0.08542466163635254
Validation loss = 0.07677720487117767
Validation loss = 0.07985755801200867
Validation loss = 0.07994227856397629
Validation loss = 0.0769377201795578
Validation loss = 0.07923635095357895
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1016523465514183
Validation loss = 0.08129146695137024
Validation loss = 0.08226045221090317
Validation loss = 0.07793834805488586
Validation loss = 0.08045685291290283
Validation loss = 0.07730665802955627
Validation loss = 0.07823896408081055
Validation loss = 0.0794534757733345
Validation loss = 0.07898451387882233
Validation loss = 0.07763934135437012
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10461874306201935
Validation loss = 0.08021629601716995
Validation loss = 0.07876528799533844
Validation loss = 0.07560061663389206
Validation loss = 0.07512002438306808
Validation loss = 0.07945729792118073
Validation loss = 0.07810825109481812
Validation loss = 0.0771658793091774
Validation loss = 0.07517673075199127
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 294
average number of affinization = 376.8360655737705
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 310
average number of affinization = 375.758064516129
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 270
average number of affinization = 374.07936507936506
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 282
average number of affinization = 372.640625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 273
average number of affinization = 371.10769230769233
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 258
average number of affinization = 369.3939393939394
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.62e+03 |
| Iteration     | 9         |
| MaximumReturn | -1.82e+03 |
| MinimumReturn | -2.88e+03 |
| TotalSamples  | 44000     |
-----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09116042405366898
Validation loss = 0.08191642165184021
Validation loss = 0.08216861635446548
Validation loss = 0.08005648106336594
Validation loss = 0.08372128754854202
Validation loss = 0.08354739844799042
Validation loss = 0.08075127750635147
Validation loss = 0.08097832649946213
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11563415080308914
Validation loss = 0.08408255875110626
Validation loss = 0.08345285803079605
Validation loss = 0.08193433284759521
Validation loss = 0.0826929360628128
Validation loss = 0.08529739826917648
Validation loss = 0.08116895705461502
Validation loss = 0.08645010739564896
Validation loss = 0.08065610378980637
Validation loss = 0.08311396092176437
Validation loss = 0.085506372153759
Validation loss = 0.0795440524816513
Validation loss = 0.07782041281461716
Validation loss = 0.084312804043293
Validation loss = 0.08104486763477325
Validation loss = 0.0827970802783966
Validation loss = 0.07985644042491913
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09187710285186768
Validation loss = 0.08123239874839783
Validation loss = 0.0790863186120987
Validation loss = 0.08040674030780792
Validation loss = 0.08459091931581497
Validation loss = 0.08189713209867477
Validation loss = 0.0773608386516571
Validation loss = 0.07918132096529007
Validation loss = 0.08327087014913559
Validation loss = 0.07779762148857117
Validation loss = 0.07798871397972107
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0967666357755661
Validation loss = 0.08420490473508835
Validation loss = 0.0806346908211708
Validation loss = 0.08491774648427963
Validation loss = 0.0799843966960907
Validation loss = 0.08324050158262253
Validation loss = 0.08005344867706299
Validation loss = 0.08526492118835449
Validation loss = 0.0827716588973999
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09526702761650085
Validation loss = 0.07966521382331848
Validation loss = 0.08084416389465332
Validation loss = 0.07882024347782135
Validation loss = 0.08094090968370438
Validation loss = 0.08026336878538132
Validation loss = 0.08207019418478012
Validation loss = 0.08463814109563828
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 279
average number of affinization = 368.04477611940297
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 358
average number of affinization = 367.8970588235294
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 368.27536231884056
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 368.6714285714286
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 332
average number of affinization = 368.15492957746477
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 368.625
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -486      |
| Iteration     | 10        |
| MaximumReturn | 645       |
| MinimumReturn | -2.44e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09458611160516739
Validation loss = 0.08430828899145126
Validation loss = 0.07441892474889755
Validation loss = 0.07798514515161514
Validation loss = 0.08293996751308441
Validation loss = 0.07557425647974014
Validation loss = 0.07868561148643494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09292461723089218
Validation loss = 0.07719592750072479
Validation loss = 0.07690301537513733
Validation loss = 0.07516370713710785
Validation loss = 0.08056557178497314
Validation loss = 0.07578957825899124
Validation loss = 0.07877745479345322
Validation loss = 0.07640651613473892
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0874645784497261
Validation loss = 0.08699295669794083
Validation loss = 0.07552092522382736
Validation loss = 0.07768652588129044
Validation loss = 0.07586387544870377
Validation loss = 0.07497978210449219
Validation loss = 0.07463958114385605
Validation loss = 0.0788697600364685
Validation loss = 0.07933709770441055
Validation loss = 0.0800490751862526
Validation loss = 0.0756007730960846
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09267785400152206
Validation loss = 0.08430560678243637
Validation loss = 0.07618249207735062
Validation loss = 0.07877440750598907
Validation loss = 0.078091561794281
Validation loss = 0.07583843916654587
Validation loss = 0.07903266698122025
Validation loss = 0.08036583662033081
Validation loss = 0.0784711241722107
Validation loss = 0.07428238540887833
Validation loss = 0.07367362827062607
Validation loss = 0.07868622988462448
Validation loss = 0.07589084655046463
Validation loss = 0.07484403997659683
Validation loss = 0.08386854082345963
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09827744215726852
Validation loss = 0.077333964407444
Validation loss = 0.07557410001754761
Validation loss = 0.07493904232978821
Validation loss = 0.0874464139342308
Validation loss = 0.07429907470941544
Validation loss = 0.07271497696638107
Validation loss = 0.07628803700208664
Validation loss = 0.07446891069412231
Validation loss = 0.0792054533958435
Validation loss = 0.0752752423286438
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 346
average number of affinization = 368.3150684931507
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 253
average number of affinization = 366.7567567567568
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 413
average number of affinization = 367.37333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 365
average number of affinization = 367.3421052631579
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 367.61038961038963
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 368.0
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -439      |
| Iteration     | 11        |
| MaximumReturn | 258       |
| MinimumReturn | -1.89e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09203274548053741
Validation loss = 0.0736066997051239
Validation loss = 0.0737571194767952
Validation loss = 0.07230149954557419
Validation loss = 0.07735086232423782
Validation loss = 0.07679040729999542
Validation loss = 0.07252942770719528
Validation loss = 0.07479787617921829
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09119720011949539
Validation loss = 0.07597667723894119
Validation loss = 0.07129884511232376
Validation loss = 0.07158102095127106
Validation loss = 0.07186669111251831
Validation loss = 0.07274273037910461
Validation loss = 0.07401882112026215
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09832821786403656
Validation loss = 0.0757041722536087
Validation loss = 0.07174691557884216
Validation loss = 0.07009611278772354
Validation loss = 0.07159572094678879
Validation loss = 0.07438115775585175
Validation loss = 0.0708007961511612
Validation loss = 0.07138847559690475
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08393775671720505
Validation loss = 0.07095508277416229
Validation loss = 0.071942999958992
Validation loss = 0.06987316906452179
Validation loss = 0.07214533537626266
Validation loss = 0.06932586431503296
Validation loss = 0.06918752938508987
Validation loss = 0.06948782503604889
Validation loss = 0.0745929628610611
Validation loss = 0.07411433011293411
Validation loss = 0.07016170024871826
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08600544929504395
Validation loss = 0.07166391611099243
Validation loss = 0.07021861523389816
Validation loss = 0.07046525180339813
Validation loss = 0.07003624737262726
Validation loss = 0.07159945368766785
Validation loss = 0.07375575602054596
Validation loss = 0.06973100453615189
Validation loss = 0.07102491706609726
Validation loss = 0.07556997984647751
Validation loss = 0.06910105794668198
Validation loss = 0.0695824921131134
Validation loss = 0.07161212712526321
Validation loss = 0.07287082076072693
Validation loss = 0.07119148224592209
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 371
average number of affinization = 368.0379746835443
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 368.3375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 365
average number of affinization = 368.2962962962963
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 368
average number of affinization = 368.2926829268293
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 353
average number of affinization = 368.10843373493975
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 368.42857142857144
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -29.1    |
| Iteration     | 12       |
| MaximumReturn | 332      |
| MinimumReturn | -586     |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07938926666975021
Validation loss = 0.06922348588705063
Validation loss = 0.06959068030118942
Validation loss = 0.06541599333286285
Validation loss = 0.06951975077390671
Validation loss = 0.06470930576324463
Validation loss = 0.06737362593412399
Validation loss = 0.06626155227422714
Validation loss = 0.0737779289484024
Validation loss = 0.06484536826610565
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08438344299793243
Validation loss = 0.06722410768270493
Validation loss = 0.06678880006074905
Validation loss = 0.06498022377490997
Validation loss = 0.06720957905054092
Validation loss = 0.06763261556625366
Validation loss = 0.07213294506072998
Validation loss = 0.06447644531726837
Validation loss = 0.06554864346981049
Validation loss = 0.0667470172047615
Validation loss = 0.07292170077562332
Validation loss = 0.06473729014396667
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06991766393184662
Validation loss = 0.06934015452861786
Validation loss = 0.06475239247083664
Validation loss = 0.06696908921003342
Validation loss = 0.07093624025583267
Validation loss = 0.06488176435232162
Validation loss = 0.06585387140512466
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07012235373258591
Validation loss = 0.06475414335727692
Validation loss = 0.06582914292812347
Validation loss = 0.06518588215112686
Validation loss = 0.0684262290596962
Validation loss = 0.06599225848913193
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08791057765483856
Validation loss = 0.0646260604262352
Validation loss = 0.062073756009340286
Validation loss = 0.06524805724620819
Validation loss = 0.06589436531066895
Validation loss = 0.06953684240579605
Validation loss = 0.06371679157018661
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 368.9294117647059
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 433
average number of affinization = 369.6744186046512
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 370.2298850574713
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 284
average number of affinization = 369.25
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 369.80898876404495
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 327
average number of affinization = 369.3333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -788      |
| Iteration     | 13        |
| MaximumReturn | 592       |
| MinimumReturn | -2.56e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07395359873771667
Validation loss = 0.0649523213505745
Validation loss = 0.06533438712358475
Validation loss = 0.07012847065925598
Validation loss = 0.06581031531095505
Validation loss = 0.06480852514505386
Validation loss = 0.06202225387096405
Validation loss = 0.06291574239730835
Validation loss = 0.06373495608568192
Validation loss = 0.0697840079665184
Validation loss = 0.06815239787101746
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07321727275848389
Validation loss = 0.06322810053825378
Validation loss = 0.06420957297086716
Validation loss = 0.06935453414916992
Validation loss = 0.06339077651500702
Validation loss = 0.06728441268205643
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07217249274253845
Validation loss = 0.06397806107997894
Validation loss = 0.06267828494310379
Validation loss = 0.06500772386789322
Validation loss = 0.06682025641202927
Validation loss = 0.0655781477689743
Validation loss = 0.06589808315038681
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07136022299528122
Validation loss = 0.06314554065465927
Validation loss = 0.06281121075153351
Validation loss = 0.0632564052939415
Validation loss = 0.06288182735443115
Validation loss = 0.061099160462617874
Validation loss = 0.0621473528444767
Validation loss = 0.06503520905971527
Validation loss = 0.06202555447816849
Validation loss = 0.06834910809993744
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07214467972517014
Validation loss = 0.06541875004768372
Validation loss = 0.06287186592817307
Validation loss = 0.0621301531791687
Validation loss = 0.06661191582679749
Validation loss = 0.06267520785331726
Validation loss = 0.06379898637533188
Validation loss = 0.06183677911758423
Validation loss = 0.061848629266023636
Validation loss = 0.06581566482782364
Validation loss = 0.06094447150826454
Validation loss = 0.06093040481209755
Validation loss = 0.06264831125736237
Validation loss = 0.07220984250307083
Validation loss = 0.06009150296449661
Validation loss = 0.06105964258313179
Validation loss = 0.06107223406434059
Validation loss = 0.06821373105049133
Validation loss = 0.061276063323020935
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 369.7362637362637
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 370.5217391304348
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 371.1720430107527
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 371.9255319148936
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 372.53684210526313
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 373.2604166666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 119      |
| Iteration     | 14       |
| MaximumReturn | 898      |
| MinimumReturn | -343     |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06562864780426025
Validation loss = 0.06185368448495865
Validation loss = 0.05887043476104736
Validation loss = 0.06386516243219376
Validation loss = 0.0588095486164093
Validation loss = 0.057891249656677246
Validation loss = 0.06033460795879364
Validation loss = 0.05932927131652832
Validation loss = 0.05899824947118759
Validation loss = 0.05786754935979843
Validation loss = 0.05859995260834694
Validation loss = 0.05904918164014816
Validation loss = 0.05695794150233269
Validation loss = 0.05895973742008209
Validation loss = 0.05759323760867119
Validation loss = 0.06104860082268715
Validation loss = 0.05933733657002449
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06741224229335785
Validation loss = 0.06003668159246445
Validation loss = 0.058924078941345215
Validation loss = 0.06076451763510704
Validation loss = 0.06033175066113472
Validation loss = 0.06267821788787842
Validation loss = 0.05867069214582443
Validation loss = 0.06257794797420502
Validation loss = 0.060906264930963516
Validation loss = 0.061144132167100906
Validation loss = 0.05923556163907051
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07423201203346252
Validation loss = 0.057932592928409576
Validation loss = 0.05871227756142616
Validation loss = 0.05913131311535835
Validation loss = 0.059513870626688004
Validation loss = 0.06022932752966881
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07249888777732849
Validation loss = 0.057895056903362274
Validation loss = 0.05964316800236702
Validation loss = 0.05814129859209061
Validation loss = 0.05958183482289314
Validation loss = 0.06047264486551285
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06384279578924179
Validation loss = 0.057160068303346634
Validation loss = 0.05585239455103874
Validation loss = 0.05947307124733925
Validation loss = 0.058440208435058594
Validation loss = 0.05672997981309891
Validation loss = 0.05568268895149231
Validation loss = 0.06049223989248276
Validation loss = 0.060027070343494415
Validation loss = 0.0552547350525856
Validation loss = 0.05949726700782776
Validation loss = 0.057614028453826904
Validation loss = 0.05547142028808594
Validation loss = 0.05894513055682182
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 347
average number of affinization = 372.9896907216495
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 513
average number of affinization = 374.4183673469388
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 453
average number of affinization = 375.2121212121212
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 464
average number of affinization = 376.1
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 501
average number of affinization = 377.33663366336634
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 531
average number of affinization = 378.84313725490193
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -280      |
| Iteration     | 15        |
| MaximumReturn | 1.19e+03  |
| MinimumReturn | -2.04e+03 |
| TotalSamples  | 68000     |
-----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0671907439827919
Validation loss = 0.0588982030749321
Validation loss = 0.05997634679079056
Validation loss = 0.05979805812239647
Validation loss = 0.05903460830450058
Validation loss = 0.0615057535469532
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06941723823547363
Validation loss = 0.057842958718538284
Validation loss = 0.06080639362335205
Validation loss = 0.05951816216111183
Validation loss = 0.061970870941877365
Validation loss = 0.05948479101061821
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06997819244861603
Validation loss = 0.06437534093856812
Validation loss = 0.06144445016980171
Validation loss = 0.059362396597862244
Validation loss = 0.06245804578065872
Validation loss = 0.059762414544820786
Validation loss = 0.059923965483903885
Validation loss = 0.07007085531949997
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07458017021417618
Validation loss = 0.05865055322647095
Validation loss = 0.057481177151203156
Validation loss = 0.05832865834236145
Validation loss = 0.05851730331778526
Validation loss = 0.059179846197366714
Validation loss = 0.059446342289447784
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08321943879127502
Validation loss = 0.057205311954021454
Validation loss = 0.05860301852226257
Validation loss = 0.0559246800839901
Validation loss = 0.06192491203546524
Validation loss = 0.05844993516802788
Validation loss = 0.05678561329841614
Validation loss = 0.05752156674861908
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 597
average number of affinization = 380.9611650485437
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 535
average number of affinization = 382.4423076923077
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 420
average number of affinization = 382.8
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 473
average number of affinization = 383.6509433962264
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 385.0373831775701
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 385.19444444444446
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.36e+03 |
| Iteration     | 16        |
| MaximumReturn | 92.1      |
| MinimumReturn | -2.25e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07924921810626984
Validation loss = 0.06979508697986603
Validation loss = 0.06794580072164536
Validation loss = 0.0660896971821785
Validation loss = 0.06702134758234024
Validation loss = 0.07028089463710785
Validation loss = 0.07501009106636047
Validation loss = 0.06570599973201752
Validation loss = 0.06719449162483215
Validation loss = 0.06732552498579025
Validation loss = 0.06935390084981918
Validation loss = 0.0673888549208641
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07592816650867462
Validation loss = 0.0688876137137413
Validation loss = 0.06856881827116013
Validation loss = 0.06887975335121155
Validation loss = 0.0670647844672203
Validation loss = 0.07136843353509903
Validation loss = 0.07213767617940903
Validation loss = 0.06771102547645569
Validation loss = 0.06984000653028488
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07263873517513275
Validation loss = 0.06815033406019211
Validation loss = 0.07166134566068649
Validation loss = 0.06898350268602371
Validation loss = 0.06959980726242065
Validation loss = 0.07073095440864563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08024360239505768
Validation loss = 0.06626225262880325
Validation loss = 0.06988771259784698
Validation loss = 0.06883943825960159
Validation loss = 0.0709025040268898
Validation loss = 0.06741508096456528
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07283929735422134
Validation loss = 0.06748982518911362
Validation loss = 0.06615009158849716
Validation loss = 0.06748685985803604
Validation loss = 0.06528947502374649
Validation loss = 0.06654351204633713
Validation loss = 0.06595151871442795
Validation loss = 0.06731871515512466
Validation loss = 0.06994513422250748
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 384.9908256880734
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 385.1090909090909
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 369
average number of affinization = 384.963963963964
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 384.9017857142857
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 384.9911504424779
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 385.10526315789474
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -46.1    |
| Iteration     | 17       |
| MaximumReturn | 268      |
| MinimumReturn | -711     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07173224538564682
Validation loss = 0.06627124547958374
Validation loss = 0.06433374434709549
Validation loss = 0.06681489199399948
Validation loss = 0.06480170041322708
Validation loss = 0.06551632285118103
Validation loss = 0.06462555378675461
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06809435039758682
Validation loss = 0.06523285061120987
Validation loss = 0.0634080320596695
Validation loss = 0.0634540244936943
Validation loss = 0.06438460201025009
Validation loss = 0.06227847933769226
Validation loss = 0.07112214714288712
Validation loss = 0.06611881405115128
Validation loss = 0.06544986367225647
Validation loss = 0.06560084223747253
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07386182248592377
Validation loss = 0.0645160973072052
Validation loss = 0.06867963075637817
Validation loss = 0.06433569639921188
Validation loss = 0.06592295318841934
Validation loss = 0.06933703273534775
Validation loss = 0.0644594356417656
Validation loss = 0.06556222587823868
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07423525303602219
Validation loss = 0.060712236911058426
Validation loss = 0.06255000084638596
Validation loss = 0.06445784866809845
Validation loss = 0.06753754615783691
Validation loss = 0.06213259696960449
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07382462173700333
Validation loss = 0.06374582648277283
Validation loss = 0.06098955497145653
Validation loss = 0.06419210135936737
Validation loss = 0.06979287415742874
Validation loss = 0.062460560351610184
Validation loss = 0.06128067895770073
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 437
average number of affinization = 385.55652173913046
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 356
average number of affinization = 385.30172413793105
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 385
average number of affinization = 385.2991452991453
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 409
average number of affinization = 385.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 365
average number of affinization = 385.327731092437
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 385.2916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 312      |
| Iteration     | 18       |
| MaximumReturn | 828      |
| MinimumReturn | -163     |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06656835973262787
Validation loss = 0.06350797414779663
Validation loss = 0.0608060359954834
Validation loss = 0.06523720920085907
Validation loss = 0.06787166744470596
Validation loss = 0.06116359308362007
Validation loss = 0.062230903655290604
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06932942569255829
Validation loss = 0.0681743174791336
Validation loss = 0.061220813542604446
Validation loss = 0.06202378123998642
Validation loss = 0.06622450798749924
Validation loss = 0.06134214252233505
Validation loss = 0.05930965021252632
Validation loss = 0.06911425292491913
Validation loss = 0.06105329841375351
Validation loss = 0.07126803696155548
Validation loss = 0.0656089037656784
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0685863345861435
Validation loss = 0.06236384063959122
Validation loss = 0.06053368002176285
Validation loss = 0.06541407853364944
Validation loss = 0.06285963952541351
Validation loss = 0.06347496807575226
Validation loss = 0.06061787158250809
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06769596040248871
Validation loss = 0.05949602276086807
Validation loss = 0.06074799224734306
Validation loss = 0.06325377523899078
Validation loss = 0.06129125505685806
Validation loss = 0.05985576659440994
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07085859775543213
Validation loss = 0.0611928328871727
Validation loss = 0.059329818934202194
Validation loss = 0.06438930332660675
Validation loss = 0.0640912726521492
Validation loss = 0.06393791735172272
Validation loss = 0.06151539087295532
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 393
average number of affinization = 385.35537190082647
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 382
average number of affinization = 385.327868852459
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 362
average number of affinization = 385.1382113821138
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 385.4032258064516
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 361
average number of affinization = 385.208
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 373
average number of affinization = 385.1111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -194     |
| Iteration     | 19       |
| MaximumReturn | 467      |
| MinimumReturn | -468     |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06760487705469131
Validation loss = 0.0589265413582325
Validation loss = 0.05926821380853653
Validation loss = 0.06086837127804756
Validation loss = 0.06096749007701874
Validation loss = 0.05665011703968048
Validation loss = 0.05715632438659668
Validation loss = 0.06027012690901756
Validation loss = 0.05752747878432274
Validation loss = 0.06248435005545616
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06257420033216476
Validation loss = 0.059889573603868484
Validation loss = 0.060393862426280975
Validation loss = 0.05832964554429054
Validation loss = 0.05915597081184387
Validation loss = 0.06360165774822235
Validation loss = 0.06329556554555893
Validation loss = 0.06048355624079704
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06462506204843521
Validation loss = 0.05718015506863594
Validation loss = 0.05840201675891876
Validation loss = 0.061260536313056946
Validation loss = 0.060834549367427826
Validation loss = 0.05767209827899933
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06136230006814003
Validation loss = 0.05639214813709259
Validation loss = 0.05846652388572693
Validation loss = 0.05945425480604172
Validation loss = 0.05724690109491348
Validation loss = 0.059858255088329315
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06378456950187683
Validation loss = 0.05604787543416023
Validation loss = 0.055629655718803406
Validation loss = 0.05763896182179451
Validation loss = 0.058547042310237885
Validation loss = 0.06088272109627724
Validation loss = 0.058712106198072433
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 385.31496062992125
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 385.8984375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 385.9612403100775
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 426
average number of affinization = 386.2692307692308
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 413
average number of affinization = 386.4732824427481
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 386.5984848484849
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 312      |
| Iteration     | 20       |
| MaximumReturn | 949      |
| MinimumReturn | -601     |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06145768240094185
Validation loss = 0.05567357689142227
Validation loss = 0.055932942777872086
Validation loss = 0.059963420033454895
Validation loss = 0.054692402482032776
Validation loss = 0.056582894176244736
Validation loss = 0.05618410184979439
Validation loss = 0.06233889237046242
Validation loss = 0.055381737649440765
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07042533159255981
Validation loss = 0.05853843688964844
Validation loss = 0.05661225691437721
Validation loss = 0.06010136753320694
Validation loss = 0.05747130885720253
Validation loss = 0.05880655348300934
Validation loss = 0.05760281905531883
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0656152069568634
Validation loss = 0.05560505390167236
Validation loss = 0.05764411762356758
Validation loss = 0.0568922758102417
Validation loss = 0.05818263068795204
Validation loss = 0.06205538287758827
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05812902748584747
Validation loss = 0.05297306552529335
Validation loss = 0.05337722226977348
Validation loss = 0.05625927820801735
Validation loss = 0.05465438589453697
Validation loss = 0.05583773925900459
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06282815337181091
Validation loss = 0.05467423051595688
Validation loss = 0.05440225452184677
Validation loss = 0.05481143295764923
Validation loss = 0.056511711329221725
Validation loss = 0.05480941757559776
Validation loss = 0.052630528807640076
Validation loss = 0.05710664018988609
Validation loss = 0.05352453887462616
Validation loss = 0.054615091532468796
Validation loss = 0.05469337850809097
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 356
average number of affinization = 386.36842105263156
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 386.65671641791045
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 375
average number of affinization = 386.57037037037037
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 371
average number of affinization = 386.45588235294116
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 377
average number of affinization = 386.3868613138686
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 357
average number of affinization = 386.17391304347825
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 65.8     |
| Iteration     | 21       |
| MaximumReturn | 664      |
| MinimumReturn | -435     |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05604051053524017
Validation loss = 0.05390435829758644
Validation loss = 0.057081472128629684
Validation loss = 0.05837716907262802
Validation loss = 0.05286288261413574
Validation loss = 0.05887512490153313
Validation loss = 0.055579572916030884
Validation loss = 0.052724603563547134
Validation loss = 0.054739516228437424
Validation loss = 0.05249116197228432
Validation loss = 0.05177979916334152
Validation loss = 0.05411858484148979
Validation loss = 0.05532706901431084
Validation loss = 0.05230078846216202
Validation loss = 0.055318474769592285
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05762164667248726
Validation loss = 0.05512461066246033
Validation loss = 0.05409815534949303
Validation loss = 0.05346965417265892
Validation loss = 0.05944708734750748
Validation loss = 0.05783930420875549
Validation loss = 0.05545135587453842
Validation loss = 0.053837958723306656
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05743454396724701
Validation loss = 0.05776553973555565
Validation loss = 0.05635333061218262
Validation loss = 0.05471168830990791
Validation loss = 0.05427396669983864
Validation loss = 0.05598796531558037
Validation loss = 0.052414558827877045
Validation loss = 0.05709119513630867
Validation loss = 0.05315103381872177
Validation loss = 0.054380375891923904
Validation loss = 0.053902506828308105
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05637067183852196
Validation loss = 0.05366305634379387
Validation loss = 0.05396508798003197
Validation loss = 0.05221140757203102
Validation loss = 0.05448267236351967
Validation loss = 0.058299463242292404
Validation loss = 0.05130968987941742
Validation loss = 0.05480840802192688
Validation loss = 0.052342191338539124
Validation loss = 0.05131697282195091
Validation loss = 0.05484579876065254
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05910708010196686
Validation loss = 0.05309440940618515
Validation loss = 0.054319094866514206
Validation loss = 0.05464181676506996
Validation loss = 0.05236518755555153
Validation loss = 0.052972253412008286
Validation loss = 0.05229516699910164
Validation loss = 0.05301279574632645
Validation loss = 0.054327938705682755
Validation loss = 0.05119726061820984
Validation loss = 0.05804448947310448
Validation loss = 0.05198540911078453
Validation loss = 0.05081305280327797
Validation loss = 0.0544445663690567
Validation loss = 0.05328863486647606
Validation loss = 0.05188887193799019
Validation loss = 0.053801558911800385
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 386.38129496402877
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 376
average number of affinization = 386.3071428571429
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 407
average number of affinization = 386.45390070921985
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 386.3169014084507
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 373
average number of affinization = 386.2237762237762
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 386.24305555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 438      |
| Iteration     | 22       |
| MaximumReturn | 993      |
| MinimumReturn | 68.7     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06603458523750305
Validation loss = 0.049620091915130615
Validation loss = 0.052284907549619675
Validation loss = 0.05082646384835243
Validation loss = 0.052245814353227615
Validation loss = 0.04970352724194527
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05606967210769653
Validation loss = 0.05286876857280731
Validation loss = 0.05860672891139984
Validation loss = 0.0543547123670578
Validation loss = 0.05561453104019165
Validation loss = 0.05368531867861748
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05545875430107117
Validation loss = 0.05158945545554161
Validation loss = 0.05386173352599144
Validation loss = 0.05123386159539223
Validation loss = 0.0527830608189106
Validation loss = 0.0513431690633297
Validation loss = 0.05114765465259552
Validation loss = 0.05112946033477783
Validation loss = 0.05236120894551277
Validation loss = 0.052811432629823685
Validation loss = 0.04986928775906563
Validation loss = 0.05190999433398247
Validation loss = 0.05107268691062927
Validation loss = 0.052671562880277634
Validation loss = 0.0513613261282444
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05390921235084534
Validation loss = 0.049951810389757156
Validation loss = 0.0498725064098835
Validation loss = 0.05091550946235657
Validation loss = 0.05184486508369446
Validation loss = 0.05219026282429695
Validation loss = 0.049064576625823975
Validation loss = 0.05022653937339783
Validation loss = 0.05043099448084831
Validation loss = 0.05195598676800728
Validation loss = 0.049790024757385254
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.0527840293943882
Validation loss = 0.048988502472639084
Validation loss = 0.05225002393126488
Validation loss = 0.05084320902824402
Validation loss = 0.05035797134041786
Validation loss = 0.050543446093797684
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 386.3724137931035
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 386.4109589041096
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 386.578231292517
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 423
average number of affinization = 386.8243243243243
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 362
average number of affinization = 386.6577181208054
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 401
average number of affinization = 386.75333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 396      |
| Iteration     | 23       |
| MaximumReturn | 1.07e+03 |
| MinimumReturn | -271     |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05392526462674141
Validation loss = 0.049384187906980515
Validation loss = 0.049012504518032074
Validation loss = 0.05032612383365631
Validation loss = 0.04909994453191757
Validation loss = 0.04812446981668472
Validation loss = 0.05219639837741852
Validation loss = 0.04725894331932068
Validation loss = 0.04848600551486015
Validation loss = 0.047961071133613586
Validation loss = 0.048424605280160904
Validation loss = 0.05027909204363823
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05344729125499725
Validation loss = 0.05162421613931656
Validation loss = 0.05151885002851486
Validation loss = 0.05373268201947212
Validation loss = 0.05082620307803154
Validation loss = 0.05193260684609413
Validation loss = 0.055412646383047104
Validation loss = 0.04965507984161377
Validation loss = 0.05285932496190071
Validation loss = 0.051298871636390686
Validation loss = 0.05002575367689133
Validation loss = 0.05060417950153351
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.050854090601205826
Validation loss = 0.050723716616630554
Validation loss = 0.049771737307310104
Validation loss = 0.05063111335039139
Validation loss = 0.05581085383892059
Validation loss = 0.04802510887384415
Validation loss = 0.04663888365030289
Validation loss = 0.049105409532785416
Validation loss = 0.04733574017882347
Validation loss = 0.050497908145189285
Validation loss = 0.05261249095201492
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.051922157406806946
Validation loss = 0.049161024391651154
Validation loss = 0.0485362708568573
Validation loss = 0.0497870072722435
Validation loss = 0.04909699410200119
Validation loss = 0.04993328079581261
Validation loss = 0.0490972176194191
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04959438368678093
Validation loss = 0.04901498928666115
Validation loss = 0.04891249164938927
Validation loss = 0.04989662766456604
Validation loss = 0.04913735017180443
Validation loss = 0.04820767045021057
Validation loss = 0.04847569763660431
Validation loss = 0.04975292086601257
Validation loss = 0.05039652809500694
Validation loss = 0.047429345548152924
Validation loss = 0.04932696372270584
Validation loss = 0.04782329127192497
Validation loss = 0.04798700287938118
Validation loss = 0.04684731364250183
Validation loss = 0.047839175909757614
Validation loss = 0.05079540237784386
Validation loss = 0.04767061769962311
Validation loss = 0.04724617302417755
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 386.71523178807945
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 386.82236842105266
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 346
average number of affinization = 386.55555555555554
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 386.61038961038963
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 368
average number of affinization = 386.49032258064517
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 391
average number of affinization = 386.5192307692308
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 103      |
| Iteration     | 24       |
| MaximumReturn | 1.17e+03 |
| MinimumReturn | -338     |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.049172282218933105
Validation loss = 0.04694032669067383
Validation loss = 0.04728620499372482
Validation loss = 0.04735210910439491
Validation loss = 0.04545331746339798
Validation loss = 0.049722712486982346
Validation loss = 0.04716174304485321
Validation loss = 0.04840993881225586
Validation loss = 0.04662501439452171
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05442093312740326
Validation loss = 0.04899875447154045
Validation loss = 0.04854036867618561
Validation loss = 0.05386326462030411
Validation loss = 0.04977085813879967
Validation loss = 0.04769138991832733
Validation loss = 0.0519423633813858
Validation loss = 0.04798436909914017
Validation loss = 0.04862574115395546
Validation loss = 0.048765625804662704
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.048042137175798416
Validation loss = 0.04692854732275009
Validation loss = 0.04679810628294945
Validation loss = 0.048162415623664856
Validation loss = 0.047743361443281174
Validation loss = 0.04844057187438011
Validation loss = 0.04688853397965431
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04962873458862305
Validation loss = 0.048056330531835556
Validation loss = 0.04672933369874954
Validation loss = 0.05075144022703171
Validation loss = 0.04584423825144768
Validation loss = 0.049161020666360855
Validation loss = 0.046299971640110016
Validation loss = 0.047375086694955826
Validation loss = 0.04726551100611687
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05093596503138542
Validation loss = 0.04649857431650162
Validation loss = 0.046290211379528046
Validation loss = 0.04824892804026604
Validation loss = 0.04634711518883705
Validation loss = 0.04473457485437393
Validation loss = 0.04657979682087898
Validation loss = 0.0451592355966568
Validation loss = 0.045955874025821686
Validation loss = 0.045699283480644226
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 409
average number of affinization = 386.6624203821656
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 386.9367088607595
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 408
average number of affinization = 387.0691823899371
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 387.26875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 387.32919254658384
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 413
average number of affinization = 387.48765432098764
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 344      |
| Iteration     | 25       |
| MaximumReturn | 670      |
| MinimumReturn | 144      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04650229960680008
Validation loss = 0.044358160346746445
Validation loss = 0.04530998691916466
Validation loss = 0.04746784642338753
Validation loss = 0.04561495780944824
Validation loss = 0.045810990035533905
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.049570225179195404
Validation loss = 0.04721824452280998
Validation loss = 0.050214871764183044
Validation loss = 0.04586310312151909
Validation loss = 0.04638919234275818
Validation loss = 0.05120769143104553
Validation loss = 0.04659102484583855
Validation loss = 0.047075316309928894
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04638231173157692
Validation loss = 0.044922664761543274
Validation loss = 0.04710569977760315
Validation loss = 0.04539038985967636
Validation loss = 0.046477481722831726
Validation loss = 0.04428936541080475
Validation loss = 0.0448232963681221
Validation loss = 0.04596734419465065
Validation loss = 0.044880397617816925
Validation loss = 0.04532159864902496
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0505172461271286
Validation loss = 0.045793548226356506
Validation loss = 0.044831663370132446
Validation loss = 0.04409537836909294
Validation loss = 0.04492023214697838
Validation loss = 0.04582567885518074
Validation loss = 0.04798448458313942
Validation loss = 0.044411543756723404
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.046980492770671844
Validation loss = 0.04343635216355324
Validation loss = 0.044701457023620605
Validation loss = 0.04454007372260094
Validation loss = 0.04384623467922211
Validation loss = 0.044646382331848145
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 387.5950920245399
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 387.9085365853659
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 410
average number of affinization = 388.04242424242426
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 388.15060240963857
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 428
average number of affinization = 388.38922155688624
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 462
average number of affinization = 388.82738095238096
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 371      |
| Iteration     | 26       |
| MaximumReturn | 652      |
| MinimumReturn | -26.3    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04691705107688904
Validation loss = 0.044286515563726425
Validation loss = 0.043121982365846634
Validation loss = 0.04394034668803215
Validation loss = 0.04689229279756546
Validation loss = 0.04454663023352623
Validation loss = 0.0436687245965004
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04683459550142288
Validation loss = 0.0491817332804203
Validation loss = 0.04462716355919838
Validation loss = 0.04762713238596916
Validation loss = 0.04430520534515381
Validation loss = 0.04555675759911537
Validation loss = 0.04711705818772316
Validation loss = 0.044125624001026154
Validation loss = 0.045223213732242584
Validation loss = 0.04538816213607788
Validation loss = 0.04656267166137695
Validation loss = 0.04759484529495239
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.044968653470277786
Validation loss = 0.043081991374492645
Validation loss = 0.045985836535692215
Validation loss = 0.04326152056455612
Validation loss = 0.04216259717941284
Validation loss = 0.04876643419265747
Validation loss = 0.041985537856817245
Validation loss = 0.04425746947526932
Validation loss = 0.04384132847189903
Validation loss = 0.042985763400793076
Validation loss = 0.04212365299463272
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.043807294219732285
Validation loss = 0.04426328465342522
Validation loss = 0.044626347720623016
Validation loss = 0.043926529586315155
Validation loss = 0.04761764407157898
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.046504907310009
Validation loss = 0.042987432330846786
Validation loss = 0.04519047960639
Validation loss = 0.04363522306084633
Validation loss = 0.04355508089065552
Validation loss = 0.04260239377617836
Validation loss = 0.04563770070672035
Validation loss = 0.04295860603451729
Validation loss = 0.04193735867738724
Validation loss = 0.0428193137049675
Validation loss = 0.04342098906636238
Validation loss = 0.04269647225737572
Validation loss = 0.04125363379716873
Validation loss = 0.043090302497148514
Validation loss = 0.042064037173986435
Validation loss = 0.04283836483955383
Validation loss = 0.04311428591609001
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 377
average number of affinization = 388.7573964497041
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 360
average number of affinization = 388.5882352941176
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 352
average number of affinization = 388.37426900584796
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 404
average number of affinization = 388.4651162790698
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 362
average number of affinization = 388.3121387283237
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 428
average number of affinization = 388.5402298850575
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 416      |
| Iteration     | 27       |
| MaximumReturn | 1.18e+03 |
| MinimumReturn | -615     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05384746566414833
Validation loss = 0.043203189969062805
Validation loss = 0.04317355901002884
Validation loss = 0.04321201890707016
Validation loss = 0.04253481328487396
Validation loss = 0.04212705418467522
Validation loss = 0.042561110109090805
Validation loss = 0.04724450409412384
Validation loss = 0.04226938262581825
Validation loss = 0.043387770652770996
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.046686284244060516
Validation loss = 0.04306871071457863
Validation loss = 0.04363676533102989
Validation loss = 0.04351572319865227
Validation loss = 0.045103564858436584
Validation loss = 0.04502443969249725
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.043488141149282455
Validation loss = 0.048478394746780396
Validation loss = 0.04186131805181503
Validation loss = 0.04216728359460831
Validation loss = 0.044171713292598724
Validation loss = 0.04089310020208359
Validation loss = 0.04323296621441841
Validation loss = 0.044680628925561905
Validation loss = 0.042327024042606354
Validation loss = 0.04501345381140709
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04800673946738243
Validation loss = 0.04402245581150055
Validation loss = 0.04408727586269379
Validation loss = 0.043509341776371
Validation loss = 0.04299350455403328
Validation loss = 0.04252180457115173
Validation loss = 0.042972318828105927
Validation loss = 0.043459925800561905
Validation loss = 0.04390840604901314
Validation loss = 0.04499613121151924
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04387890174984932
Validation loss = 0.041887275874614716
Validation loss = 0.04129963368177414
Validation loss = 0.041787032037973404
Validation loss = 0.04048026353120804
Validation loss = 0.041184958070516586
Validation loss = 0.04249986633658409
Validation loss = 0.04478282108902931
Validation loss = 0.040832795202732086
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 229
average number of affinization = 387.62857142857143
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 387.65340909090907
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 255
average number of affinization = 386.9039548022599
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 386.8876404494382
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 386.9553072625698
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 265
average number of affinization = 386.27777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -109      |
| Iteration     | 28        |
| MaximumReturn | 1.48e+03  |
| MinimumReturn | -2.33e+03 |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04517175257205963
Validation loss = 0.04145345836877823
Validation loss = 0.041867319494485855
Validation loss = 0.041320107877254486
Validation loss = 0.042781464755535126
Validation loss = 0.04308294877409935
Validation loss = 0.04390568286180496
Validation loss = 0.045395780354738235
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04724951460957527
Validation loss = 0.04269803315401077
Validation loss = 0.04418213292956352
Validation loss = 0.0414537638425827
Validation loss = 0.043026991188526154
Validation loss = 0.04301157221198082
Validation loss = 0.04138237237930298
Validation loss = 0.04329882934689522
Validation loss = 0.04566923901438713
Validation loss = 0.04286499693989754
Validation loss = 0.04592078551650047
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04410395398736
Validation loss = 0.0421542190015316
Validation loss = 0.04141334444284439
Validation loss = 0.042954009026288986
Validation loss = 0.041986484080553055
Validation loss = 0.04482544958591461
Validation loss = 0.04223690927028656
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04306011274456978
Validation loss = 0.0450139045715332
Validation loss = 0.04515964537858963
Validation loss = 0.04290012642741203
Validation loss = 0.04282209649682045
Validation loss = 0.041012953966856
Validation loss = 0.044269438832998276
Validation loss = 0.04261987283825874
Validation loss = 0.043445054441690445
Validation loss = 0.04339024797081947
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04216168075799942
Validation loss = 0.04099952057003975
Validation loss = 0.04266967996954918
Validation loss = 0.040922027081251144
Validation loss = 0.04293987527489662
Validation loss = 0.040764667093753815
Validation loss = 0.04248865693807602
Validation loss = 0.042473204433918
Validation loss = 0.04169144108891487
Validation loss = 0.04147457703948021
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 386.70165745856355
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 438
average number of affinization = 386.9835164835165
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 104
average number of affinization = 385.43715846994536
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 366
average number of affinization = 385.33152173913044
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 207
average number of affinization = 384.36756756756756
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 384.6774193548387
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -85.4     |
| Iteration     | 29        |
| MaximumReturn | 803       |
| MinimumReturn | -2.26e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04396553337574005
Validation loss = 0.04402715340256691
Validation loss = 0.04298173263669014
Validation loss = 0.04115939140319824
Validation loss = 0.04295701906085014
Validation loss = 0.042489711195230484
Validation loss = 0.043663930147886276
Validation loss = 0.04320784658193588
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04495296627283096
Validation loss = 0.04361727833747864
Validation loss = 0.043683554977178574
Validation loss = 0.04380258917808533
Validation loss = 0.04383281618356705
Validation loss = 0.04293594881892204
Validation loss = 0.04293268173933029
Validation loss = 0.04140952602028847
Validation loss = 0.042797934263944626
Validation loss = 0.042542796581983566
Validation loss = 0.043217554688453674
Validation loss = 0.043622374534606934
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.042602188885211945
Validation loss = 0.04176810383796692
Validation loss = 0.04148275405168533
Validation loss = 0.04261750727891922
Validation loss = 0.041121359914541245
Validation loss = 0.04135129600763321
Validation loss = 0.04031526297330856
Validation loss = 0.040466997772455215
Validation loss = 0.04221785068511963
Validation loss = 0.041136790066957474
Validation loss = 0.04346723482012749
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04271586984395981
Validation loss = 0.04116291552782059
Validation loss = 0.04208655655384064
Validation loss = 0.04199473187327385
Validation loss = 0.039925917983055115
Validation loss = 0.04405020922422409
Validation loss = 0.04217886924743652
Validation loss = 0.04174386337399483
Validation loss = 0.04185032099485397
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04162855073809624
Validation loss = 0.04075033962726593
Validation loss = 0.04080626741051674
Validation loss = 0.04103178530931473
Validation loss = 0.039804812520742416
Validation loss = 0.03966950252652168
Validation loss = 0.042094819247722626
Validation loss = 0.03962138667702675
Validation loss = 0.041083257645368576
Validation loss = 0.04212068393826485
Validation loss = 0.03953488916158676
Validation loss = 0.04083862155675888
Validation loss = 0.039571117609739304
Validation loss = 0.04177691787481308
Validation loss = 0.04016084969043732
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 88
average number of affinization = 383.09090909090907
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 383.38829787234044
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 465
average number of affinization = 383.8201058201058
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 450
average number of affinization = 384.1684210526316
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 384.26701570680626
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 295
average number of affinization = 383.8020833333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -128      |
| Iteration     | 30        |
| MaximumReturn | 1.69e+03  |
| MinimumReturn | -2.65e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04214534908533096
Validation loss = 0.043985988944768906
Validation loss = 0.04281223565340042
Validation loss = 0.042510874569416046
Validation loss = 0.0420282781124115
Validation loss = 0.04451673477888107
Validation loss = 0.04118233546614647
Validation loss = 0.04223152622580528
Validation loss = 0.04198069125413895
Validation loss = 0.039638765156269073
Validation loss = 0.04210317134857178
Validation loss = 0.0435100756585598
Validation loss = 0.04120243713259697
Validation loss = 0.04109961539506912
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.042651742696762085
Validation loss = 0.04156943038105965
Validation loss = 0.04432164877653122
Validation loss = 0.042756009846925735
Validation loss = 0.04305485263466835
Validation loss = 0.041441865265369415
Validation loss = 0.04193946719169617
Validation loss = 0.043329060077667236
Validation loss = 0.04192902147769928
Validation loss = 0.0406608060002327
Validation loss = 0.04235701262950897
Validation loss = 0.04133814573287964
Validation loss = 0.047886840999126434
Validation loss = 0.041794583201408386
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.041912272572517395
Validation loss = 0.039928875863552094
Validation loss = 0.041268572211265564
Validation loss = 0.040446687489748
Validation loss = 0.042589835822582245
Validation loss = 0.04031238704919815
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.045148495584726334
Validation loss = 0.040144458413124084
Validation loss = 0.04312862455844879
Validation loss = 0.042537860572338104
Validation loss = 0.04388756677508354
Validation loss = 0.0416533425450325
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04100527614355087
Validation loss = 0.04068992659449577
Validation loss = 0.04047151654958725
Validation loss = 0.04316870868206024
Validation loss = 0.041310012340545654
Validation loss = 0.040334682911634445
Validation loss = 0.039911605417728424
Validation loss = 0.040198907256126404
Validation loss = 0.04258798062801361
Validation loss = 0.04087621718645096
Validation loss = 0.040379464626312256
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 423
average number of affinization = 384.00518134715026
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 368
average number of affinization = 383.9226804123711
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 384.14358974358976
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 384.2551020408163
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 407
average number of affinization = 384.3705583756345
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 441
average number of affinization = 384.65656565656565
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 882      |
| Iteration     | 31       |
| MaximumReturn | 1.35e+03 |
| MinimumReturn | 371      |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04070139303803444
Validation loss = 0.04102746397256851
Validation loss = 0.04000797122716904
Validation loss = 0.04120512679219246
Validation loss = 0.041787151247262955
Validation loss = 0.04182520508766174
Validation loss = 0.04142629727721214
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.041289474815130234
Validation loss = 0.04418652877211571
Validation loss = 0.04115786775946617
Validation loss = 0.042368318885564804
Validation loss = 0.040015414357185364
Validation loss = 0.041471775621175766
Validation loss = 0.041882652789354324
Validation loss = 0.04078296199440956
Validation loss = 0.041688479483127594
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04093696549534798
Validation loss = 0.04004001244902611
Validation loss = 0.04022006690502167
Validation loss = 0.04040636867284775
Validation loss = 0.04036589711904526
Validation loss = 0.04023544862866402
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04761284589767456
Validation loss = 0.039888378232717514
Validation loss = 0.0424480140209198
Validation loss = 0.04044037312269211
Validation loss = 0.04045261815190315
Validation loss = 0.041065316647291183
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04068634286522865
Validation loss = 0.03983265161514282
Validation loss = 0.038681551814079285
Validation loss = 0.03912908583879471
Validation loss = 0.039362214505672455
Validation loss = 0.03825795650482178
Validation loss = 0.03822781518101692
Validation loss = 0.03979936242103577
Validation loss = 0.03923055902123451
Validation loss = 0.04008966684341431
Validation loss = 0.038789052516222
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 384.64824120603015
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 426
average number of affinization = 384.855
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 385.2437810945274
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 349
average number of affinization = 385.06435643564356
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 385.1527093596059
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 385.36764705882354
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 739      |
| Iteration     | 32       |
| MaximumReturn | 1.28e+03 |
| MinimumReturn | 281      |
| TotalSamples  | 136000   |
----------------------------
