Logging to experiments/gym_fswimmer/nov4/SO01w350e1_seed2312
Print configuration .....
{'env_name': 'gym_fswimmer', 'random_seeds': [2312, 1231, 2631, 5543], 'save_variables': False, 'model_save_dir': '/tmp/gym_fswimmer_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5948647260665894
Validation loss = 0.3922995924949646
Validation loss = 0.33297258615493774
Validation loss = 0.31681013107299805
Validation loss = 0.31399214267730713
Validation loss = 0.3127298355102539
Validation loss = 0.3106233477592468
Validation loss = 0.3130667209625244
Validation loss = 0.3160264492034912
Validation loss = 0.31442540884017944
Validation loss = 0.3206152617931366
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7833093404769897
Validation loss = 0.3989008069038391
Validation loss = 0.3410189747810364
Validation loss = 0.3209555745124817
Validation loss = 0.31374478340148926
Validation loss = 0.32131606340408325
Validation loss = 0.31466537714004517
Validation loss = 0.3116084933280945
Validation loss = 0.3186929225921631
Validation loss = 0.3152152895927429
Validation loss = 0.31611168384552
Validation loss = 0.3328148424625397
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6705715656280518
Validation loss = 0.40362808108329773
Validation loss = 0.34561070799827576
Validation loss = 0.32084184885025024
Validation loss = 0.31488916277885437
Validation loss = 0.3147355318069458
Validation loss = 0.3196612000465393
Validation loss = 0.3137394189834595
Validation loss = 0.31381022930145264
Validation loss = 0.31769612431526184
Validation loss = 0.3159329891204834
Validation loss = 0.3204032778739929
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6544090509414673
Validation loss = 0.39886143803596497
Validation loss = 0.33519989252090454
Validation loss = 0.31984448432922363
Validation loss = 0.31383854150772095
Validation loss = 0.3130987882614136
Validation loss = 0.3186388611793518
Validation loss = 0.31754356622695923
Validation loss = 0.31821107864379883
Validation loss = 0.32397618889808655
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6934680938720703
Validation loss = 0.40347820520401
Validation loss = 0.3389256000518799
Validation loss = 0.32247287034988403
Validation loss = 0.3218602240085602
Validation loss = 0.3137853443622589
Validation loss = 0.3145916163921356
Validation loss = 0.31455790996551514
Validation loss = 0.3139646053314209
Validation loss = 0.31806015968322754
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 5
average number of affinization = 0.7142857142857143
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 1
average number of affinization = 0.75
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 3
average number of affinization = 1.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 4
average number of affinization = 1.3
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 4
average number of affinization = 1.5454545454545454
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 3
average number of affinization = 1.6666666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 83.9     |
| Iteration     | 0        |
| MaximumReturn | 103      |
| MinimumReturn | 73.5     |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.27393078804016113
Validation loss = 0.21627382934093475
Validation loss = 0.21429817378520966
Validation loss = 0.21570850908756256
Validation loss = 0.21331658959388733
Validation loss = 0.22087958455085754
Validation loss = 0.21552661061286926
Validation loss = 0.2191852629184723
Validation loss = 0.22026857733726501
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.26422733068466187
Validation loss = 0.21709609031677246
Validation loss = 0.21397820115089417
Validation loss = 0.2218644767999649
Validation loss = 0.21834121644496918
Validation loss = 0.2227216362953186
Validation loss = 0.22226352989673615
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25586456060409546
Validation loss = 0.21619530022144318
Validation loss = 0.2157120704650879
Validation loss = 0.21681085228919983
Validation loss = 0.21601217985153198
Validation loss = 0.21837086975574493
Validation loss = 0.22556090354919434
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2564007341861725
Validation loss = 0.21509933471679688
Validation loss = 0.2156684547662735
Validation loss = 0.2166844606399536
Validation loss = 0.21835923194885254
Validation loss = 0.21485598385334015
Validation loss = 0.21826493740081787
Validation loss = 0.22842982411384583
Validation loss = 0.22248506546020508
Validation loss = 0.22394970059394836
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2502002418041229
Validation loss = 0.2158026248216629
Validation loss = 0.21561110019683838
Validation loss = 0.21730823814868927
Validation loss = 0.21837957203388214
Validation loss = 0.2203260213136673
Validation loss = 0.21799087524414062
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 35
average number of affinization = 4.230769230769231
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 27
average number of affinization = 5.857142857142857
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 29
average number of affinization = 7.4
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 39
average number of affinization = 9.375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 36
average number of affinization = 10.941176470588236
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 57
average number of affinization = 13.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 31.9     |
| Iteration     | 1        |
| MaximumReturn | 45.5     |
| MinimumReturn | 24       |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2102510780096054
Validation loss = 0.1948026865720749
Validation loss = 0.19869635999202728
Validation loss = 0.20020340383052826
Validation loss = 0.204023540019989
Validation loss = 0.2023950070142746
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2051098495721817
Validation loss = 0.19817368686199188
Validation loss = 0.19819124042987823
Validation loss = 0.20264965295791626
Validation loss = 0.20283211767673492
Validation loss = 0.20104289054870605
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2043079137802124
Validation loss = 0.19845354557037354
Validation loss = 0.20185166597366333
Validation loss = 0.19970017671585083
Validation loss = 0.20404082536697388
Validation loss = 0.2027699500322342
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.20268285274505615
Validation loss = 0.1995689868927002
Validation loss = 0.2051117867231369
Validation loss = 0.20429913699626923
Validation loss = 0.1995164006948471
Validation loss = 0.20382589101791382
Validation loss = 0.20776230096817017
Validation loss = 0.20602057874202728
Validation loss = 0.20822089910507202
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21216435730457306
Validation loss = 0.19717885553836823
Validation loss = 0.20054399967193604
Validation loss = 0.19901114702224731
Validation loss = 0.2007266730070114
Validation loss = 0.20056253671646118
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 2
average number of affinization = 12.894736842105264
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 7
average number of affinization = 12.6
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 1
average number of affinization = 12.047619047619047
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 12
average number of affinization = 12.045454545454545
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 1
average number of affinization = 11.565217391304348
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 0
average number of affinization = 11.083333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 235      |
| Iteration     | 2        |
| MaximumReturn | 242      |
| MinimumReturn | 220      |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18970298767089844
Validation loss = 0.1845080703496933
Validation loss = 0.1861254721879959
Validation loss = 0.18714138865470886
Validation loss = 0.18573707342147827
Validation loss = 0.19049689173698425
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18511304259300232
Validation loss = 0.18468117713928223
Validation loss = 0.1815350353717804
Validation loss = 0.18851734697818756
Validation loss = 0.18401652574539185
Validation loss = 0.18699148297309875
Validation loss = 0.18712230026721954
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1891535520553589
Validation loss = 0.18536418676376343
Validation loss = 0.1845700442790985
Validation loss = 0.19071701169013977
Validation loss = 0.1864229142665863
Validation loss = 0.1878899335861206
Validation loss = 0.19033247232437134
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18793261051177979
Validation loss = 0.18697980046272278
Validation loss = 0.19167226552963257
Validation loss = 0.19123172760009766
Validation loss = 0.19245824217796326
Validation loss = 0.1939747929573059
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18620815873146057
Validation loss = 0.18346752226352692
Validation loss = 0.18300125002861023
Validation loss = 0.18323703110218048
Validation loss = 0.18459481000900269
Validation loss = 0.1860067993402481
Validation loss = 0.18858163058757782
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 24
average number of affinization = 11.6
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 20
average number of affinization = 11.923076923076923
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 18
average number of affinization = 12.148148148148149
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 40
average number of affinization = 13.142857142857142
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 37
average number of affinization = 13.96551724137931
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 5
average number of affinization = 13.666666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 258      |
| Iteration     | 3        |
| MaximumReturn | 264      |
| MinimumReturn | 255      |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18502779304981232
Validation loss = 0.18137523531913757
Validation loss = 0.18685239553451538
Validation loss = 0.18631061911582947
Validation loss = 0.18726180493831635
Validation loss = 0.1903788149356842
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18381597101688385
Validation loss = 0.1849455088376999
Validation loss = 0.1814277619123459
Validation loss = 0.18485139310359955
Validation loss = 0.1839846968650818
Validation loss = 0.18313458561897278
Validation loss = 0.1858179122209549
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18417419493198395
Validation loss = 0.18023914098739624
Validation loss = 0.1845671534538269
Validation loss = 0.18787746131420135
Validation loss = 0.18379774689674377
Validation loss = 0.1866592913866043
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18977254629135132
Validation loss = 0.18706196546554565
Validation loss = 0.18865616619586945
Validation loss = 0.19253702461719513
Validation loss = 0.18766595423221588
Validation loss = 0.18838942050933838
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18106594681739807
Validation loss = 0.18040411174297333
Validation loss = 0.1792658567428589
Validation loss = 0.18396107852458954
Validation loss = 0.1822960376739502
Validation loss = 0.18406376242637634
Validation loss = 0.18506675958633423
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 127
average number of affinization = 17.322580645161292
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 122
average number of affinization = 20.59375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 106
average number of affinization = 23.181818181818183
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 123
average number of affinization = 26.11764705882353
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 117
average number of affinization = 28.714285714285715
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 113
average number of affinization = 31.055555555555557
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 214      |
| Iteration     | 4        |
| MaximumReturn | 217      |
| MinimumReturn | 210      |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1863250881433487
Validation loss = 0.1828518956899643
Validation loss = 0.18466497957706451
Validation loss = 0.18510954082012177
Validation loss = 0.18383584916591644
Validation loss = 0.1861070841550827
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18498682975769043
Validation loss = 0.18176133930683136
Validation loss = 0.1825646609067917
Validation loss = 0.18306541442871094
Validation loss = 0.18761028349399567
Validation loss = 0.1860780566930771
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1800621747970581
Validation loss = 0.18179671466350555
Validation loss = 0.18328706920146942
Validation loss = 0.1870715618133545
Validation loss = 0.1836978942155838
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18370692431926727
Validation loss = 0.1841116100549698
Validation loss = 0.1865098923444748
Validation loss = 0.18673883378505707
Validation loss = 0.1907465010881424
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18200059235095978
Validation loss = 0.18295203149318695
Validation loss = 0.1822454333305359
Validation loss = 0.18886379897594452
Validation loss = 0.18581657111644745
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 181
average number of affinization = 35.108108108108105
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 175
average number of affinization = 38.78947368421053
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 190
average number of affinization = 42.666666666666664
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 212
average number of affinization = 46.9
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 160
average number of affinization = 49.65853658536585
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 168
average number of affinization = 52.476190476190474
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 188      |
| Iteration     | 5        |
| MaximumReturn | 193      |
| MinimumReturn | 184      |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18336108326911926
Validation loss = 0.18206532299518585
Validation loss = 0.18351693451404572
Validation loss = 0.1856348216533661
Validation loss = 0.1856115162372589
Validation loss = 0.188840851187706
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18394829332828522
Validation loss = 0.18527695536613464
Validation loss = 0.18571460247039795
Validation loss = 0.18586191534996033
Validation loss = 0.18731434643268585
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18466946482658386
Validation loss = 0.18553665280342102
Validation loss = 0.1848522275686264
Validation loss = 0.18261365592479706
Validation loss = 0.1858789175748825
Validation loss = 0.18620850145816803
Validation loss = 0.18779146671295166
Validation loss = 0.18768031895160675
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18467681109905243
Validation loss = 0.18588313460350037
Validation loss = 0.1863173246383667
Validation loss = 0.1884404569864273
Validation loss = 0.18708017468452454
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18292653560638428
Validation loss = 0.18328531086444855
Validation loss = 0.1859661489725113
Validation loss = 0.18538466095924377
Validation loss = 0.18547680974006653
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 192
average number of affinization = 55.72093023255814
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 197
average number of affinization = 58.93181818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 124
average number of affinization = 60.37777777777778
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 194
average number of affinization = 63.28260869565217
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 194
average number of affinization = 66.06382978723404
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 173
average number of affinization = 68.29166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 152      |
| Iteration     | 6        |
| MaximumReturn | 158      |
| MinimumReturn | 141      |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18846364319324493
Validation loss = 0.1895167976617813
Validation loss = 0.19002877175807953
Validation loss = 0.19076302647590637
Validation loss = 0.19404236972332
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18831725418567657
Validation loss = 0.18920595943927765
Validation loss = 0.1873471736907959
Validation loss = 0.18919065594673157
Validation loss = 0.1923614889383316
Validation loss = 0.1924147754907608
Validation loss = 0.19348105788230896
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19031713902950287
Validation loss = 0.1888439655303955
Validation loss = 0.1904417872428894
Validation loss = 0.19181008636951447
Validation loss = 0.19480453431606293
Validation loss = 0.19258061051368713
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1917986273765564
Validation loss = 0.19319766759872437
Validation loss = 0.19217871129512787
Validation loss = 0.19210350513458252
Validation loss = 0.19062720239162445
Validation loss = 0.19482630491256714
Validation loss = 0.19365829229354858
Validation loss = 0.19368058443069458
Validation loss = 0.1958986520767212
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18644438683986664
Validation loss = 0.1884579062461853
Validation loss = 0.1876874566078186
Validation loss = 0.18998286128044128
Validation loss = 0.18811580538749695
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 237
average number of affinization = 71.73469387755102
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 219
average number of affinization = 74.68
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 189
average number of affinization = 76.92156862745098
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 218
average number of affinization = 79.63461538461539
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 235
average number of affinization = 82.56603773584905
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 206
average number of affinization = 84.85185185185185
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 181      |
| Iteration     | 7        |
| MaximumReturn | 186      |
| MinimumReturn | 171      |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.191864475607872
Validation loss = 0.19112688302993774
Validation loss = 0.1909964233636856
Validation loss = 0.19242970645427704
Validation loss = 0.19297797977924347
Validation loss = 0.1924990862607956
Validation loss = 0.19431599974632263
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1912737488746643
Validation loss = 0.19237230718135834
Validation loss = 0.1902492344379425
Validation loss = 0.19315454363822937
Validation loss = 0.19577980041503906
Validation loss = 0.19686995446681976
Validation loss = 0.19671203196048737
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19255739450454712
Validation loss = 0.19019563496112823
Validation loss = 0.19451682269573212
Validation loss = 0.19421890377998352
Validation loss = 0.19603905081748962
Validation loss = 0.1967044621706009
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.196429044008255
Validation loss = 0.19972147047519684
Validation loss = 0.19521993398666382
Validation loss = 0.1989555060863495
Validation loss = 0.19760474562644958
Validation loss = 0.1982504278421402
Validation loss = 0.20177358388900757
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1869201362133026
Validation loss = 0.18810687959194183
Validation loss = 0.18889468908309937
Validation loss = 0.19072864949703217
Validation loss = 0.18959838151931763
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 175
average number of affinization = 86.49090909090908
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 215
average number of affinization = 88.78571428571429
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 169
average number of affinization = 90.19298245614036
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 234
average number of affinization = 92.67241379310344
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 198
average number of affinization = 94.45762711864407
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 208
average number of affinization = 96.35
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 175      |
| Iteration     | 8        |
| MaximumReturn | 184      |
| MinimumReturn | 166      |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19418637454509735
Validation loss = 0.19272038340568542
Validation loss = 0.19585934281349182
Validation loss = 0.19449466466903687
Validation loss = 0.19877737760543823
Validation loss = 0.19552025198936462
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19573691487312317
Validation loss = 0.19487598538398743
Validation loss = 0.19960109889507294
Validation loss = 0.19766566157341003
Validation loss = 0.19851192831993103
Validation loss = 0.2030293196439743
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19409196078777313
Validation loss = 0.1940142810344696
Validation loss = 0.1953917294740677
Validation loss = 0.19567754864692688
Validation loss = 0.1983632594347
Validation loss = 0.2002040147781372
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19752436876296997
Validation loss = 0.19809190928936005
Validation loss = 0.19900184869766235
Validation loss = 0.20018169283866882
Validation loss = 0.20225663483142853
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19088175892829895
Validation loss = 0.19174161553382874
Validation loss = 0.19089408218860626
Validation loss = 0.19081974029541016
Validation loss = 0.19347737729549408
Validation loss = 0.19551058113574982
Validation loss = 0.19722935557365417
Validation loss = 0.19743797183036804
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 203
average number of affinization = 98.09836065573771
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 215
average number of affinization = 99.98387096774194
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 208
average number of affinization = 101.6984126984127
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 225
average number of affinization = 103.625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 193
average number of affinization = 105.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 199
average number of affinization = 106.42424242424242
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 137      |
| Iteration     | 9        |
| MaximumReturn | 144      |
| MinimumReturn | 134      |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19990316033363342
Validation loss = 0.19833050668239594
Validation loss = 0.19972985982894897
Validation loss = 0.1996331363916397
Validation loss = 0.20039024949073792
Validation loss = 0.2018713355064392
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19708126783370972
Validation loss = 0.20052604377269745
Validation loss = 0.20178698003292084
Validation loss = 0.20545953512191772
Validation loss = 0.20382176339626312
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19912561774253845
Validation loss = 0.20112743973731995
Validation loss = 0.20042918622493744
Validation loss = 0.20426040887832642
Validation loss = 0.2029806226491928
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2027963101863861
Validation loss = 0.20105266571044922
Validation loss = 0.20278741419315338
Validation loss = 0.20247173309326172
Validation loss = 0.20351052284240723
Validation loss = 0.20563596487045288
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19594191014766693
Validation loss = 0.19772721827030182
Validation loss = 0.19829672574996948
Validation loss = 0.1990625262260437
Validation loss = 0.19985021650791168
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 229
average number of affinization = 108.25373134328358
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 215
average number of affinization = 109.82352941176471
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 240
average number of affinization = 111.71014492753623
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 233
average number of affinization = 113.44285714285714
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 245
average number of affinization = 115.29577464788733
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 210
average number of affinization = 116.61111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 121      |
| Iteration     | 10       |
| MaximumReturn | 127      |
| MinimumReturn | 109      |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20081646740436554
Validation loss = 0.20167142152786255
Validation loss = 0.2025495022535324
Validation loss = 0.20718474686145782
Validation loss = 0.21025753021240234
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20149610936641693
Validation loss = 0.2042863816022873
Validation loss = 0.20479579269886017
Validation loss = 0.20750097930431366
Validation loss = 0.20998673141002655
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20343239605426788
Validation loss = 0.20168626308441162
Validation loss = 0.206031933426857
Validation loss = 0.20755980908870697
Validation loss = 0.2075764685869217
Validation loss = 0.20874249935150146
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.20604555308818817
Validation loss = 0.20651543140411377
Validation loss = 0.21113692224025726
Validation loss = 0.2086985558271408
Validation loss = 0.21187646687030792
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19988258183002472
Validation loss = 0.20318599045276642
Validation loss = 0.20165133476257324
Validation loss = 0.20396803319454193
Validation loss = 0.20604431629180908
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 235
average number of affinization = 118.23287671232876
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 230
average number of affinization = 119.74324324324324
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 235
average number of affinization = 121.28
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 224
average number of affinization = 122.63157894736842
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 224
average number of affinization = 123.94805194805195
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 228
average number of affinization = 125.28205128205128
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 146      |
| Iteration     | 11       |
| MaximumReturn | 153      |
| MinimumReturn | 139      |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20675024390220642
Validation loss = 0.20586572587490082
Validation loss = 0.2100800722837448
Validation loss = 0.2100057303905487
Validation loss = 0.21225684881210327
Validation loss = 0.2119530588388443
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20778559148311615
Validation loss = 0.20749571919441223
Validation loss = 0.2113475203514099
Validation loss = 0.2102123349905014
Validation loss = 0.21330632269382477
Validation loss = 0.2149481624364853
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20854409039020538
Validation loss = 0.20948536694049835
Validation loss = 0.2108415961265564
Validation loss = 0.21095864474773407
Validation loss = 0.21472035348415375
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21031524240970612
Validation loss = 0.2097095549106598
Validation loss = 0.21016357839107513
Validation loss = 0.2157287895679474
Validation loss = 0.21478727459907532
Validation loss = 0.21655996143817902
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20342397689819336
Validation loss = 0.2051386833190918
Validation loss = 0.2067965716123581
Validation loss = 0.2066618800163269
Validation loss = 0.2097759246826172
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 224
average number of affinization = 126.53164556962025
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 234
average number of affinization = 127.875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 222
average number of affinization = 129.03703703703704
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 227
average number of affinization = 130.23170731707316
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 224
average number of affinization = 131.36144578313252
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 221
average number of affinization = 132.42857142857142
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 142      |
| Iteration     | 12       |
| MaximumReturn | 149      |
| MinimumReturn | 137      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21030865609645844
Validation loss = 0.21197305619716644
Validation loss = 0.21269917488098145
Validation loss = 0.21504072844982147
Validation loss = 0.21720676124095917
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21052123606204987
Validation loss = 0.21216367185115814
Validation loss = 0.2157222181558609
Validation loss = 0.21616415679454803
Validation loss = 0.2166299968957901
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21059098839759827
Validation loss = 0.21270795166492462
Validation loss = 0.21664251387119293
Validation loss = 0.21603728830814362
Validation loss = 0.215109720826149
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2143860012292862
Validation loss = 0.21295808255672455
Validation loss = 0.217858225107193
Validation loss = 0.21927644312381744
Validation loss = 0.2198716253042221
Validation loss = 0.2226341962814331
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20729851722717285
Validation loss = 0.2103230059146881
Validation loss = 0.21116258203983307
Validation loss = 0.21113303303718567
Validation loss = 0.2129756063222885
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 250
average number of affinization = 133.81176470588235
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 242
average number of affinization = 135.06976744186048
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 250
average number of affinization = 136.39080459770116
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 250
average number of affinization = 137.6818181818182
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 236
average number of affinization = 138.7865168539326
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 257
average number of affinization = 140.1
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 115      |
| Iteration     | 13       |
| MaximumReturn | 120      |
| MinimumReturn | 106      |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21509775519371033
Validation loss = 0.2165423184633255
Validation loss = 0.21966518461704254
Validation loss = 0.22093088924884796
Validation loss = 0.22053639590740204
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2158036231994629
Validation loss = 0.21745239198207855
Validation loss = 0.21971386671066284
Validation loss = 0.22094115614891052
Validation loss = 0.22337205708026886
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21506479382514954
Validation loss = 0.218732088804245
Validation loss = 0.218776136636734
Validation loss = 0.22130154073238373
Validation loss = 0.22403934597969055
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22030381858348846
Validation loss = 0.21950048208236694
Validation loss = 0.22170762717723846
Validation loss = 0.22476063668727875
Validation loss = 0.22476093471050262
Validation loss = 0.2288956493139267
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21148838102817535
Validation loss = 0.21040727198123932
Validation loss = 0.21390393376350403
Validation loss = 0.2168606072664261
Validation loss = 0.21765968203544617
Validation loss = 0.22192366421222687
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 239
average number of affinization = 141.1868131868132
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 242
average number of affinization = 142.2826086956522
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 247
average number of affinization = 143.40860215053763
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 246
average number of affinization = 144.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 247
average number of affinization = 145.57894736842104
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 259
average number of affinization = 146.76041666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 126      |
| Iteration     | 14       |
| MaximumReturn | 138      |
| MinimumReturn | 113      |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22093608975410461
Validation loss = 0.22198480367660522
Validation loss = 0.22137919068336487
Validation loss = 0.22433426976203918
Validation loss = 0.22678595781326294
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22037363052368164
Validation loss = 0.22204309701919556
Validation loss = 0.22622239589691162
Validation loss = 0.2265390157699585
Validation loss = 0.22813710570335388
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2200315296649933
Validation loss = 0.22068989276885986
Validation loss = 0.22767981886863708
Validation loss = 0.2255059778690338
Validation loss = 0.23033256828784943
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22568829357624054
Validation loss = 0.22921794652938843
Validation loss = 0.2302365005016327
Validation loss = 0.23192983865737915
Validation loss = 0.23515138030052185
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2191055566072464
Validation loss = 0.2181997001171112
Validation loss = 0.22228875756263733
Validation loss = 0.2239006906747818
Validation loss = 0.22636070847511292
Validation loss = 0.22926324605941772
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 245
average number of affinization = 147.77319587628867
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 221
average number of affinization = 148.5204081632653
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 224
average number of affinization = 149.2828282828283
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 217
average number of affinization = 149.96
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 224
average number of affinization = 150.69306930693068
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 231
average number of affinization = 151.48039215686273
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 15       |
| MaximumReturn | 135      |
| MinimumReturn | 114      |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2251138538122177
Validation loss = 0.2281579226255417
Validation loss = 0.23095667362213135
Validation loss = 0.23246082663536072
Validation loss = 0.2350902110338211
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22733308374881744
Validation loss = 0.23008452355861664
Validation loss = 0.23056326806545258
Validation loss = 0.23339799046516418
Validation loss = 0.23549982905387878
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22710160911083221
Validation loss = 0.23154222965240479
Validation loss = 0.23228982090950012
Validation loss = 0.23352816700935364
Validation loss = 0.23318159580230713
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23232457041740417
Validation loss = 0.23652324080467224
Validation loss = 0.23857063055038452
Validation loss = 0.23974648118019104
Validation loss = 0.242177352309227
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22763511538505554
Validation loss = 0.22924745082855225
Validation loss = 0.233422189950943
Validation loss = 0.2350248098373413
Validation loss = 0.23691847920417786
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 235
average number of affinization = 152.29126213592232
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 206
average number of affinization = 152.80769230769232
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 214
average number of affinization = 153.3904761904762
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 223
average number of affinization = 154.04716981132074
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 222
average number of affinization = 154.6822429906542
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 211
average number of affinization = 155.2037037037037
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 132      |
| Iteration     | 16       |
| MaximumReturn | 140      |
| MinimumReturn | 125      |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23341873288154602
Validation loss = 0.2338990569114685
Validation loss = 0.23607410490512848
Validation loss = 0.23814545571804047
Validation loss = 0.24050576984882355
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2333959937095642
Validation loss = 0.2355210781097412
Validation loss = 0.23782780766487122
Validation loss = 0.23701055347919464
Validation loss = 0.2398248016834259
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2349928766489029
Validation loss = 0.2363424301147461
Validation loss = 0.2370491623878479
Validation loss = 0.24086973071098328
Validation loss = 0.2416982352733612
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23978549242019653
Validation loss = 0.24176008999347687
Validation loss = 0.24150197207927704
Validation loss = 0.24696557223796844
Validation loss = 0.24585217237472534
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2335404008626938
Validation loss = 0.23568612337112427
Validation loss = 0.23764580488204956
Validation loss = 0.2434808909893036
Validation loss = 0.24143224954605103
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 218
average number of affinization = 155.77981651376146
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 243
average number of affinization = 156.57272727272726
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 220
average number of affinization = 157.14414414414415
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 209
average number of affinization = 157.60714285714286
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 230
average number of affinization = 158.24778761061947
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 229
average number of affinization = 158.8684210526316
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 144      |
| Iteration     | 17       |
| MaximumReturn | 152      |
| MinimumReturn | 138      |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2369532585144043
Validation loss = 0.2421804964542389
Validation loss = 0.2398928552865982
Validation loss = 0.2447238266468048
Validation loss = 0.2478809356689453
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2380235642194748
Validation loss = 0.23963913321495056
Validation loss = 0.240343376994133
Validation loss = 0.24125142395496368
Validation loss = 0.24423496425151825
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24212679266929626
Validation loss = 0.23936650156974792
Validation loss = 0.2429882138967514
Validation loss = 0.24296286702156067
Validation loss = 0.24539794027805328
Validation loss = 0.24523316323757172
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24226248264312744
Validation loss = 0.24220775067806244
Validation loss = 0.24565286934375763
Validation loss = 0.2468818873167038
Validation loss = 0.24804875254631042
Validation loss = 0.2502240240573883
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23773102462291718
Validation loss = 0.2394922375679016
Validation loss = 0.24265263974666595
Validation loss = 0.2423992156982422
Validation loss = 0.24323563277721405
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 218
average number of affinization = 159.38260869565218
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 224
average number of affinization = 159.93965517241378
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 230
average number of affinization = 160.53846153846155
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 213
average number of affinization = 160.98305084745763
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 216
average number of affinization = 161.4453781512605
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 213
average number of affinization = 161.875
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 135      |
| Iteration     | 18       |
| MaximumReturn | 141      |
| MinimumReturn | 129      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23888058960437775
Validation loss = 0.23911896347999573
Validation loss = 0.24298501014709473
Validation loss = 0.24283842742443085
Validation loss = 0.24513404071331024
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23705148696899414
Validation loss = 0.24014148116111755
Validation loss = 0.24257047474384308
Validation loss = 0.24370762705802917
Validation loss = 0.24602237343788147
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24438190460205078
Validation loss = 0.24147097766399384
Validation loss = 0.2444715052843094
Validation loss = 0.24585728347301483
Validation loss = 0.24698933959007263
Validation loss = 0.24882598221302032
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24435639381408691
Validation loss = 0.24555960297584534
Validation loss = 0.24777254462242126
Validation loss = 0.24817028641700745
Validation loss = 0.2507384717464447
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23922137916088104
Validation loss = 0.2429492473602295
Validation loss = 0.2421909123659134
Validation loss = 0.24620279669761658
Validation loss = 0.24844463169574738
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 239
average number of affinization = 162.51239669421489
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 259
average number of affinization = 163.30327868852459
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 244
average number of affinization = 163.95934959349594
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 255
average number of affinization = 164.69354838709677
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 252
average number of affinization = 165.392
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 251
average number of affinization = 166.07142857142858
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 117      |
| Iteration     | 19       |
| MaximumReturn | 123      |
| MinimumReturn | 113      |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24184033274650574
Validation loss = 0.24352653324604034
Validation loss = 0.24661904573440552
Validation loss = 0.24738849699497223
Validation loss = 0.248006209731102
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24201293289661407
Validation loss = 0.24393154680728912
Validation loss = 0.24438226222991943
Validation loss = 0.24831950664520264
Validation loss = 0.24955476820468903
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24828356504440308
Validation loss = 0.2451377809047699
Validation loss = 0.24931760132312775
Validation loss = 0.2513415217399597
Validation loss = 0.2529073655605316
Validation loss = 0.25307145714759827
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24738509953022003
Validation loss = 0.2496841996908188
Validation loss = 0.250889390707016
Validation loss = 0.25139516592025757
Validation loss = 0.2539604902267456
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24309562146663666
Validation loss = 0.2449730485677719
Validation loss = 0.24696390330791473
Validation loss = 0.24841615557670593
Validation loss = 0.25204166769981384
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 260
average number of affinization = 166.81102362204723
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 248
average number of affinization = 167.4453125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 263
average number of affinization = 168.1860465116279
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 258
average number of affinization = 168.87692307692308
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 267
average number of affinization = 169.6259541984733
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 285
average number of affinization = 170.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 122      |
| Iteration     | 20       |
| MaximumReturn | 127      |
| MinimumReturn | 116      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24390491843223572
Validation loss = 0.24540895223617554
Validation loss = 0.2494942992925644
Validation loss = 0.25090810656547546
Validation loss = 0.2512994110584259
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2448427528142929
Validation loss = 0.24719201028347015
Validation loss = 0.24908976256847382
Validation loss = 0.24978691339492798
Validation loss = 0.2516958713531494
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24947918951511383
Validation loss = 0.25095418095588684
Validation loss = 0.25408312678337097
Validation loss = 0.25374773144721985
Validation loss = 0.2554187476634979
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24960646033287048
Validation loss = 0.25406286120414734
Validation loss = 0.25411558151245117
Validation loss = 0.2530280351638794
Validation loss = 0.2551243305206299
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24746374785900116
Validation loss = 0.25024452805519104
Validation loss = 0.2518201768398285
Validation loss = 0.2517143785953522
Validation loss = 0.253339558839798
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 214
average number of affinization = 170.82706766917292
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 230
average number of affinization = 171.26865671641792
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 235
average number of affinization = 171.74074074074073
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 220
average number of affinization = 172.09558823529412
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 222
average number of affinization = 172.45985401459853
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 205
average number of affinization = 172.69565217391303
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 179      |
| Iteration     | 21       |
| MaximumReturn | 190      |
| MinimumReturn | 171      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24865826964378357
Validation loss = 0.24974432587623596
Validation loss = 0.2508757710456848
Validation loss = 0.2537068724632263
Validation loss = 0.25504744052886963
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2505466043949127
Validation loss = 0.24979136884212494
Validation loss = 0.24998946487903595
Validation loss = 0.25519177317619324
Validation loss = 0.25571343302726746
Validation loss = 0.2585952579975128
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25283727049827576
Validation loss = 0.2550152838230133
Validation loss = 0.2544388175010681
Validation loss = 0.25808364152908325
Validation loss = 0.2608542740345001
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.25306639075279236
Validation loss = 0.2528356611728668
Validation loss = 0.2567012906074524
Validation loss = 0.2554222643375397
Validation loss = 0.2587277889251709
Validation loss = 0.2590763568878174
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2496849000453949
Validation loss = 0.2516261637210846
Validation loss = 0.2541055977344513
Validation loss = 0.2564461827278137
Validation loss = 0.2540597915649414
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 146
average number of affinization = 172.50359712230215
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 137
average number of affinization = 172.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 143
average number of affinization = 172.04255319148936
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 136
average number of affinization = 171.7887323943662
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 136
average number of affinization = 171.53846153846155
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 136
average number of affinization = 171.29166666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 266      |
| Iteration     | 22       |
| MaximumReturn | 268      |
| MinimumReturn | 265      |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24708135426044464
Validation loss = 0.24973785877227783
Validation loss = 0.2539465129375458
Validation loss = 0.25400200486183167
Validation loss = 0.2549661695957184
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24995960295200348
Validation loss = 0.2517032027244568
Validation loss = 0.2559226155281067
Validation loss = 0.2544349431991577
Validation loss = 0.2562629282474518
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25038424134254456
Validation loss = 0.25345441699028015
Validation loss = 0.25580599904060364
Validation loss = 0.2564190626144409
Validation loss = 0.25719964504241943
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.25510939955711365
Validation loss = 0.25372907519340515
Validation loss = 0.2561070919036865
Validation loss = 0.260579913854599
Validation loss = 0.25872984528541565
Validation loss = 0.2607637941837311
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24698777496814728
Validation loss = 0.2507003843784332
Validation loss = 0.2528354227542877
Validation loss = 0.2551739513874054
Validation loss = 0.2566154897212982
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 140
average number of affinization = 171.0758620689655
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 156
average number of affinization = 170.97260273972603
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 139
average number of affinization = 170.75510204081633
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 148
average number of affinization = 170.60135135135135
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 151
average number of affinization = 170.46979865771812
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 160
average number of affinization = 170.4
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 251      |
| Iteration     | 23       |
| MaximumReturn | 256      |
| MinimumReturn | 245      |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24861697852611542
Validation loss = 0.24995021522045135
Validation loss = 0.2522805333137512
Validation loss = 0.2544771432876587
Validation loss = 0.25472959876060486
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.25046879053115845
Validation loss = 0.2501714825630188
Validation loss = 0.25313693284988403
Validation loss = 0.2574254274368286
Validation loss = 0.2559153437614441
Validation loss = 0.2564346492290497
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24973197281360626
Validation loss = 0.2505744993686676
Validation loss = 0.25492167472839355
Validation loss = 0.2560192346572876
Validation loss = 0.25659435987472534
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.251801997423172
Validation loss = 0.2547778785228729
Validation loss = 0.2574845850467682
Validation loss = 0.2569029927253723
Validation loss = 0.2572125494480133
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2502335011959076
Validation loss = 0.2498137503862381
Validation loss = 0.25286954641342163
Validation loss = 0.2548283636569977
Validation loss = 0.25850480794906616
Validation loss = 0.2582634687423706
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 179
average number of affinization = 170.4569536423841
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 159
average number of affinization = 170.3815789473684
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 172
average number of affinization = 170.3921568627451
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 154
average number of affinization = 170.28571428571428
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 179
average number of affinization = 170.34193548387097
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 158
average number of affinization = 170.26282051282053
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 246      |
| Iteration     | 24       |
| MaximumReturn | 248      |
| MinimumReturn | 242      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2474707067012787
Validation loss = 0.24790503084659576
Validation loss = 0.25287896394729614
Validation loss = 0.2514592111110687
Validation loss = 0.25367027521133423
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2497325986623764
Validation loss = 0.24885839223861694
Validation loss = 0.2512560486793518
Validation loss = 0.25247955322265625
Validation loss = 0.253004252910614
Validation loss = 0.25571733713150024
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24863125383853912
Validation loss = 0.2514435648918152
Validation loss = 0.25062793493270874
Validation loss = 0.25364282727241516
Validation loss = 0.2543383240699768
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.251330703496933
Validation loss = 0.25086167454719543
Validation loss = 0.2531392276287079
Validation loss = 0.2546232044696808
Validation loss = 0.2550041675567627
Validation loss = 0.2560448944568634
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24862615764141083
Validation loss = 0.2510625720024109
Validation loss = 0.25148800015449524
Validation loss = 0.25147679448127747
Validation loss = 0.2531718611717224
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 179
average number of affinization = 170.3184713375796
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 186
average number of affinization = 170.41772151898735
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 191
average number of affinization = 170.54716981132074
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 176
average number of affinization = 170.58125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 172
average number of affinization = 170.59006211180125
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 173
average number of affinization = 170.60493827160494
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 235      |
| Iteration     | 25       |
| MaximumReturn | 237      |
| MinimumReturn | 232      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2465372085571289
Validation loss = 0.24821080267429352
Validation loss = 0.24768276512622833
Validation loss = 0.24930055439472198
Validation loss = 0.2502191960811615
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2490289956331253
Validation loss = 0.2503349483013153
Validation loss = 0.2511938512325287
Validation loss = 0.25107356905937195
Validation loss = 0.2535690665245056
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24834594130516052
Validation loss = 0.2473210096359253
Validation loss = 0.25104981660842896
Validation loss = 0.25181320309638977
Validation loss = 0.25117236375808716
Validation loss = 0.2521321177482605
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.25098010897636414
Validation loss = 0.25000184774398804
Validation loss = 0.2535359859466553
Validation loss = 0.2530699074268341
Validation loss = 0.253374844789505
Validation loss = 0.25495046377182007
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24703167378902435
Validation loss = 0.24843718111515045
Validation loss = 0.24848565459251404
Validation loss = 0.24962908029556274
Validation loss = 0.25225958228111267
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 177
average number of affinization = 170.64417177914112
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 181
average number of affinization = 170.70731707317074
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 188
average number of affinization = 170.8121212121212
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 183
average number of affinization = 170.8855421686747
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 166
average number of affinization = 170.8562874251497
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 192
average number of affinization = 170.98214285714286
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 232      |
| Iteration     | 26       |
| MaximumReturn | 236      |
| MinimumReturn | 227      |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24640688300132751
Validation loss = 0.24460458755493164
Validation loss = 0.24602244794368744
Validation loss = 0.24784335494041443
Validation loss = 0.24918653070926666
Validation loss = 0.2513454854488373
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24639464914798737
Validation loss = 0.24690912663936615
Validation loss = 0.24788008630275726
Validation loss = 0.25016894936561584
Validation loss = 0.24904121458530426
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24620914459228516
Validation loss = 0.24707730114459991
Validation loss = 0.24844740331172943
Validation loss = 0.24968065321445465
Validation loss = 0.250522643327713
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24866367876529694
Validation loss = 0.24804091453552246
Validation loss = 0.25073519349098206
Validation loss = 0.25107434391975403
Validation loss = 0.250392347574234
Validation loss = 0.2521153390407562
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24503742158412933
Validation loss = 0.24578294157981873
Validation loss = 0.24717208743095398
Validation loss = 0.2477448284626007
Validation loss = 0.24878430366516113
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 202
average number of affinization = 171.1656804733728
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 199
average number of affinization = 171.3294117647059
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 176
average number of affinization = 171.35672514619884
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 193
average number of affinization = 171.4825581395349
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 199
average number of affinization = 171.64161849710982
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 187
average number of affinization = 171.72988505747125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 222      |
| Iteration     | 27       |
| MaximumReturn | 225      |
| MinimumReturn | 217      |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24410194158554077
Validation loss = 0.24521595239639282
Validation loss = 0.24580460786819458
Validation loss = 0.24872943758964539
Validation loss = 0.24887794256210327
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24309763312339783
Validation loss = 0.2436409294605255
Validation loss = 0.24594618380069733
Validation loss = 0.24759991466999054
Validation loss = 0.24622821807861328
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24526499211788177
Validation loss = 0.24460318684577942
Validation loss = 0.24702148139476776
Validation loss = 0.24708378314971924
Validation loss = 0.24900613725185394
Validation loss = 0.2500525414943695
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2466755211353302
Validation loss = 0.24639303982257843
Validation loss = 0.24713099002838135
Validation loss = 0.2501636743545532
Validation loss = 0.2506483793258667
Validation loss = 0.24938145279884338
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2433832734823227
Validation loss = 0.24410119652748108
Validation loss = 0.24374736845493317
Validation loss = 0.24670156836509705
Validation loss = 0.24725854396820068
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 209
average number of affinization = 171.94285714285715
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 194
average number of affinization = 172.0681818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 192
average number of affinization = 172.180790960452
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 199
average number of affinization = 172.3314606741573
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 190
average number of affinization = 172.43016759776538
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 193
average number of affinization = 172.54444444444445
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 233      |
| Iteration     | 28       |
| MaximumReturn | 235      |
| MinimumReturn | 232      |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24127385020256042
Validation loss = 0.24229799211025238
Validation loss = 0.24431900680065155
Validation loss = 0.24410390853881836
Validation loss = 0.24623392522335052
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2432538866996765
Validation loss = 0.2416183203458786
Validation loss = 0.24395401775836945
Validation loss = 0.2437606155872345
Validation loss = 0.24443671107292175
Validation loss = 0.24461963772773743
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24300867319107056
Validation loss = 0.24360044300556183
Validation loss = 0.24417416751384735
Validation loss = 0.24900789558887482
Validation loss = 0.24473364651203156
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24503231048583984
Validation loss = 0.2456558346748352
Validation loss = 0.24546828866004944
Validation loss = 0.24884741008281708
Validation loss = 0.24707536399364471
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24111101031303406
Validation loss = 0.24150559306144714
Validation loss = 0.24332159757614136
Validation loss = 0.24455063045024872
Validation loss = 0.24393337965011597
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 223
average number of affinization = 172.8232044198895
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 229
average number of affinization = 173.13186813186815
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 217
average number of affinization = 173.37158469945356
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 221
average number of affinization = 173.6304347826087
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 216
average number of affinization = 173.85945945945946
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 217
average number of affinization = 174.09139784946237
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 214      |
| Iteration     | 29       |
| MaximumReturn | 217      |
| MinimumReturn | 212      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2394576072692871
Validation loss = 0.2404191941022873
Validation loss = 0.24036364257335663
Validation loss = 0.2407824844121933
Validation loss = 0.2438148856163025
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2409786432981491
Validation loss = 0.24110400676727295
Validation loss = 0.24273164570331573
Validation loss = 0.24138280749320984
Validation loss = 0.24479669332504272
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24136775732040405
Validation loss = 0.24011974036693573
Validation loss = 0.24366962909698486
Validation loss = 0.242280513048172
Validation loss = 0.24491912126541138
Validation loss = 0.24457168579101562
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24187897145748138
Validation loss = 0.24118919670581818
Validation loss = 0.24462442100048065
Validation loss = 0.2433406561613083
Validation loss = 0.24400867521762848
Validation loss = 0.24530631303787231
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23924632370471954
Validation loss = 0.23947854340076447
Validation loss = 0.24012164771556854
Validation loss = 0.24323438107967377
Validation loss = 0.24279724061489105
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 228
average number of affinization = 174.37967914438502
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 233
average number of affinization = 174.69148936170214
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 232
average number of affinization = 174.994708994709
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 233
average number of affinization = 175.3
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 233
average number of affinization = 175.6020942408377
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 241
average number of affinization = 175.94270833333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 213      |
| Iteration     | 30       |
| MaximumReturn | 217      |
| MinimumReturn | 210      |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.237909734249115
Validation loss = 0.23820015788078308
Validation loss = 0.23834362626075745
Validation loss = 0.24039724469184875
Validation loss = 0.23991860449314117
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2392912209033966
Validation loss = 0.23769648373126984
Validation loss = 0.24147379398345947
Validation loss = 0.2402673065662384
Validation loss = 0.2407253533601761
Validation loss = 0.24178574979305267
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24021783471107483
Validation loss = 0.24107417464256287
Validation loss = 0.24122487008571625
Validation loss = 0.24087631702423096
Validation loss = 0.2433907389640808
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24017831683158875
Validation loss = 0.24053947627544403
Validation loss = 0.23994776606559753
Validation loss = 0.24029025435447693
Validation loss = 0.2427961230278015
Validation loss = 0.2421397566795349
Validation loss = 0.24276432394981384
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23664240539073944
Validation loss = 0.23773999512195587
Validation loss = 0.23939888179302216
Validation loss = 0.23991996049880981
Validation loss = 0.24012339115142822
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 233
average number of affinization = 176.23834196891193
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 233
average number of affinization = 176.53092783505156
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 238
average number of affinization = 176.84615384615384
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 234
average number of affinization = 177.1377551020408
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 230
average number of affinization = 177.40609137055839
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 236
average number of affinization = 177.7020202020202
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 219      |
| Iteration     | 31       |
| MaximumReturn | 222      |
| MinimumReturn | 215      |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23557132482528687
Validation loss = 0.23574499785900116
Validation loss = 0.23745787143707275
Validation loss = 0.23721209168434143
Validation loss = 0.23954319953918457
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23717208206653595
Validation loss = 0.23621302843093872
Validation loss = 0.23733508586883545
Validation loss = 0.23852147161960602
Validation loss = 0.23911643028259277
Validation loss = 0.24010539054870605
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.237947016954422
Validation loss = 0.23731738328933716
Validation loss = 0.23821797966957092
Validation loss = 0.2388700544834137
Validation loss = 0.23922264575958252
Validation loss = 0.24044175446033478
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23800288140773773
Validation loss = 0.23850998282432556
Validation loss = 0.24139295518398285
Validation loss = 0.23936356604099274
Validation loss = 0.24042797088623047
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.235398530960083
Validation loss = 0.23552632331848145
Validation loss = 0.23754215240478516
Validation loss = 0.23725129663944244
Validation loss = 0.23836340010166168
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 240
average number of affinization = 178.01507537688443
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 234
average number of affinization = 178.295
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 231
average number of affinization = 178.55721393034827
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 225
average number of affinization = 178.7871287128713
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 233
average number of affinization = 179.05418719211823
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 245
average number of affinization = 179.37745098039215
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 202      |
| Iteration     | 32       |
| MaximumReturn | 208      |
| MinimumReturn | 195      |
| TotalSamples  | 136000   |
----------------------------
