Logging to experiments/gym_fswimmer/SO01/Wed-02-Nov-2022-04-25-22-PM-CDT_gym_fswimmer_trpo_iteration_20_seed2312
Print configuration .....
{'env_name': 'gym_fswimmer', 'random_seeds': [2312, 1231, 2631, 5543], 'save_variables': False, 'model_save_dir': '/tmp/gym_fswimmer_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6777771711349487
Validation loss = 0.40742552280426025
Validation loss = 0.35211795568466187
Validation loss = 0.3303389549255371
Validation loss = 0.3278079628944397
Validation loss = 0.3241390883922577
Validation loss = 0.3194494843482971
Validation loss = 0.32472360134124756
Validation loss = 0.3347023129463196
Validation loss = 0.32818421721458435
Validation loss = 0.3366011679172516
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.663345217704773
Validation loss = 0.40748023986816406
Validation loss = 0.3532558083534241
Validation loss = 0.3344247341156006
Validation loss = 0.32355302572250366
Validation loss = 0.3242987096309662
Validation loss = 0.32254141569137573
Validation loss = 0.32119977474212646
Validation loss = 0.3214873671531677
Validation loss = 0.32366567850112915
Validation loss = 0.32873037457466125
Validation loss = 0.3345644772052765
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8521976470947266
Validation loss = 0.41652712225914
Validation loss = 0.35421305894851685
Validation loss = 0.3319846987724304
Validation loss = 0.32431185245513916
Validation loss = 0.32701367139816284
Validation loss = 0.3255746364593506
Validation loss = 0.32421600818634033
Validation loss = 0.3192235231399536
Validation loss = 0.3289451003074646
Validation loss = 0.32788577675819397
Validation loss = 0.3348282277584076
Validation loss = 0.33941149711608887
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6405792236328125
Validation loss = 0.40642014145851135
Validation loss = 0.3432976007461548
Validation loss = 0.3331279158592224
Validation loss = 0.32680392265319824
Validation loss = 0.3238007426261902
Validation loss = 0.3255905508995056
Validation loss = 0.3181154727935791
Validation loss = 0.3249492645263672
Validation loss = 0.32741057872772217
Validation loss = 0.33007633686065674
Validation loss = 0.3334795832633972
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6547174453735352
Validation loss = 0.3982042372226715
Validation loss = 0.3508419990539551
Validation loss = 0.33169472217559814
Validation loss = 0.3238522410392761
Validation loss = 0.32228219509124756
Validation loss = 0.3213554918766022
Validation loss = 0.3198022246360779
Validation loss = 0.3332479000091553
Validation loss = 0.33634912967681885
Validation loss = 0.33831578493118286
Validation loss = 0.3229745328426361
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 107      |
| Iteration     | 0        |
| MaximumReturn | 117      |
| MinimumReturn | 95       |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29017648100852966
Validation loss = 0.21740715205669403
Validation loss = 0.21071867644786835
Validation loss = 0.20585504174232483
Validation loss = 0.2042849063873291
Validation loss = 0.20520426332950592
Validation loss = 0.2068086862564087
Validation loss = 0.21370987594127655
Validation loss = 0.2059587985277176
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29258638620376587
Validation loss = 0.21796372532844543
Validation loss = 0.20817874372005463
Validation loss = 0.2079550325870514
Validation loss = 0.20836319029331207
Validation loss = 0.20622947812080383
Validation loss = 0.2107679396867752
Validation loss = 0.21681730449199677
Validation loss = 0.21281512081623077
Validation loss = 0.21833330392837524
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29285168647766113
Validation loss = 0.21777713298797607
Validation loss = 0.21099352836608887
Validation loss = 0.20526713132858276
Validation loss = 0.21116629242897034
Validation loss = 0.20936432480812073
Validation loss = 0.2150241881608963
Validation loss = 0.21386896073818207
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2895318269729614
Validation loss = 0.219103142619133
Validation loss = 0.21063725650310516
Validation loss = 0.20763052999973297
Validation loss = 0.2117852121591568
Validation loss = 0.2066040188074112
Validation loss = 0.2090204954147339
Validation loss = 0.2094571441411972
Validation loss = 0.211322620511055
Validation loss = 0.21482233703136444
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27313220500946045
Validation loss = 0.21523553133010864
Validation loss = 0.20765767991542816
Validation loss = 0.20552565157413483
Validation loss = 0.20615358650684357
Validation loss = 0.20236335694789886
Validation loss = 0.20700517296791077
Validation loss = 0.20514541864395142
Validation loss = 0.21077701449394226
Validation loss = 0.2222336381673813
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 109      |
| Iteration     | 1        |
| MaximumReturn | 117      |
| MinimumReturn | 105      |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18059556186199188
Validation loss = 0.17543406784534454
Validation loss = 0.17471815645694733
Validation loss = 0.1791832000017166
Validation loss = 0.17997516691684723
Validation loss = 0.17477470636367798
Validation loss = 0.17789874970912933
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1876232624053955
Validation loss = 0.1773131936788559
Validation loss = 0.17894776165485382
Validation loss = 0.17720215022563934
Validation loss = 0.1854357123374939
Validation loss = 0.18922704458236694
Validation loss = 0.18242166936397552
Validation loss = 0.18352395296096802
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18432797491550446
Validation loss = 0.1738974004983902
Validation loss = 0.17481613159179688
Validation loss = 0.1779649257659912
Validation loss = 0.17994976043701172
Validation loss = 0.17705373466014862
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18584324419498444
Validation loss = 0.17640288174152374
Validation loss = 0.17969447374343872
Validation loss = 0.17719854414463043
Validation loss = 0.17859448492527008
Validation loss = 0.18535463511943817
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18243171274662018
Validation loss = 0.175995334982872
Validation loss = 0.17571844160556793
Validation loss = 0.17447753250598907
Validation loss = 0.18156032264232635
Validation loss = 0.1861754208803177
Validation loss = 0.18503884971141815
Validation loss = 0.18155060708522797
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 133      |
| Iteration     | 2        |
| MaximumReturn | 142      |
| MinimumReturn | 123      |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1720041036605835
Validation loss = 0.17268598079681396
Validation loss = 0.1751306653022766
Validation loss = 0.1725788712501526
Validation loss = 0.17369818687438965
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17380720376968384
Validation loss = 0.1700088381767273
Validation loss = 0.17500793933868408
Validation loss = 0.1736723780632019
Validation loss = 0.17546188831329346
Validation loss = 0.1759401261806488
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1727168709039688
Validation loss = 0.17479735612869263
Validation loss = 0.17147545516490936
Validation loss = 0.17112860083580017
Validation loss = 0.17253969609737396
Validation loss = 0.17645803093910217
Validation loss = 0.18093529343605042
Validation loss = 0.17518484592437744
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17251558601856232
Validation loss = 0.16891460120677948
Validation loss = 0.17042027413845062
Validation loss = 0.17683473229408264
Validation loss = 0.1745835691690445
Validation loss = 0.17479729652404785
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17608776688575745
Validation loss = 0.16918089985847473
Validation loss = 0.16849184036254883
Validation loss = 0.1726628988981247
Validation loss = 0.17606592178344727
Validation loss = 0.17395079135894775
Validation loss = 0.17541247606277466
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 124      |
| Iteration     | 3        |
| MaximumReturn | 134      |
| MinimumReturn | 105      |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17431437969207764
Validation loss = 0.17236049473285675
Validation loss = 0.1705206334590912
Validation loss = 0.1721406877040863
Validation loss = 0.16964149475097656
Validation loss = 0.17090561985969543
Validation loss = 0.1728399395942688
Validation loss = 0.17657515406608582
Validation loss = 0.17846006155014038
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17420583963394165
Validation loss = 0.1739351898431778
Validation loss = 0.17713692784309387
Validation loss = 0.17436489462852478
Validation loss = 0.1767408549785614
Validation loss = 0.1774667203426361
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1727750599384308
Validation loss = 0.17279909551143646
Validation loss = 0.17455130815505981
Validation loss = 0.17279362678527832
Validation loss = 0.17205430567264557
Validation loss = 0.17901387810707092
Validation loss = 0.17545823752880096
Validation loss = 0.1795317381620407
Validation loss = 0.17921575903892517
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1717178225517273
Validation loss = 0.17523270845413208
Validation loss = 0.1704479604959488
Validation loss = 0.17769834399223328
Validation loss = 0.17651063203811646
Validation loss = 0.17456907033920288
Validation loss = 0.17932334542274475
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17716941237449646
Validation loss = 0.17083796858787537
Validation loss = 0.1716892421245575
Validation loss = 0.175678551197052
Validation loss = 0.1819610744714737
Validation loss = 0.17557820677757263
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 115      |
| Iteration     | 4        |
| MaximumReturn | 127      |
| MinimumReturn | 100      |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18266351521015167
Validation loss = 0.17802560329437256
Validation loss = 0.18183499574661255
Validation loss = 0.1812589317560196
Validation loss = 0.18296056985855103
Validation loss = 0.18244610726833344
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17315149307250977
Validation loss = 0.17555685341358185
Validation loss = 0.17937611043453217
Validation loss = 0.17969374358654022
Validation loss = 0.18011993169784546
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1841840147972107
Validation loss = 0.180392786860466
Validation loss = 0.1805819422006607
Validation loss = 0.18534569442272186
Validation loss = 0.18282003700733185
Validation loss = 0.18701517581939697
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17456376552581787
Validation loss = 0.17594091594219208
Validation loss = 0.1771278828382492
Validation loss = 0.18046528100967407
Validation loss = 0.1811475157737732
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1735166758298874
Validation loss = 0.1785368174314499
Validation loss = 0.17831645905971527
Validation loss = 0.1792430877685547
Validation loss = 0.1814911961555481
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 103      |
| Iteration     | 5        |
| MaximumReturn | 107      |
| MinimumReturn | 96       |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17756445705890656
Validation loss = 0.18132463097572327
Validation loss = 0.18004374206066132
Validation loss = 0.182634636759758
Validation loss = 0.18292032182216644
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17912717163562775
Validation loss = 0.17977866530418396
Validation loss = 0.18505942821502686
Validation loss = 0.1801748126745224
Validation loss = 0.18276748061180115
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1875067502260208
Validation loss = 0.18356481194496155
Validation loss = 0.18211781978607178
Validation loss = 0.1888839304447174
Validation loss = 0.18656162917613983
Validation loss = 0.18657302856445312
Validation loss = 0.18835891783237457
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17778070271015167
Validation loss = 0.17707492411136627
Validation loss = 0.17849576473236084
Validation loss = 0.18093164265155792
Validation loss = 0.1781608760356903
Validation loss = 0.1813817322254181
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18055379390716553
Validation loss = 0.18335609138011932
Validation loss = 0.17907290160655975
Validation loss = 0.18269579112529755
Validation loss = 0.18368975818157196
Validation loss = 0.18145610392093658
Validation loss = 0.18395932018756866
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 95.4     |
| Iteration     | 6        |
| MaximumReturn | 99       |
| MinimumReturn | 94.1     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18204310536384583
Validation loss = 0.1813921481370926
Validation loss = 0.1818649172782898
Validation loss = 0.18179020285606384
Validation loss = 0.18366482853889465
Validation loss = 0.19047939777374268
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18202495574951172
Validation loss = 0.18116435408592224
Validation loss = 0.18338140845298767
Validation loss = 0.18296131491661072
Validation loss = 0.18704238533973694
Validation loss = 0.18691864609718323
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1849154233932495
Validation loss = 0.1853926181793213
Validation loss = 0.18659453094005585
Validation loss = 0.1935226321220398
Validation loss = 0.18931075930595398
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18208658695220947
Validation loss = 0.18007531762123108
Validation loss = 0.18283572793006897
Validation loss = 0.18615317344665527
Validation loss = 0.18598224222660065
Validation loss = 0.18808354437351227
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18311649560928345
Validation loss = 0.18375903367996216
Validation loss = 0.18462680280208588
Validation loss = 0.18305227160453796
Validation loss = 0.1860811561346054
Validation loss = 0.18877044320106506
Validation loss = 0.18902654945850372
Validation loss = 0.19290253520011902
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 110      |
| Iteration     | 7        |
| MaximumReturn | 116      |
| MinimumReturn | 106      |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1878145933151245
Validation loss = 0.1869744211435318
Validation loss = 0.18952606618404388
Validation loss = 0.1897565871477127
Validation loss = 0.19345472753047943
Validation loss = 0.1945914477109909
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18805573880672455
Validation loss = 0.18725578486919403
Validation loss = 0.18960045278072357
Validation loss = 0.1889890879392624
Validation loss = 0.19172337651252747
Validation loss = 0.1926257312297821
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1898464411497116
Validation loss = 0.19117018580436707
Validation loss = 0.18994377553462982
Validation loss = 0.1923655867576599
Validation loss = 0.19474679231643677
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1871151626110077
Validation loss = 0.18946853280067444
Validation loss = 0.1870867908000946
Validation loss = 0.19010037183761597
Validation loss = 0.18963971734046936
Validation loss = 0.19139836728572845
Validation loss = 0.19227242469787598
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19191616773605347
Validation loss = 0.19411493837833405
Validation loss = 0.19112521409988403
Validation loss = 0.19290262460708618
Validation loss = 0.19305729866027832
Validation loss = 0.19776111841201782
Validation loss = 0.19910463690757751
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 118      |
| Iteration     | 8        |
| MaximumReturn | 124      |
| MinimumReturn | 112      |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19446611404418945
Validation loss = 0.19027528166770935
Validation loss = 0.19396910071372986
Validation loss = 0.1973607987165451
Validation loss = 0.20066773891448975
Validation loss = 0.2011193335056305
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19197611510753632
Validation loss = 0.19438175857067108
Validation loss = 0.1965460479259491
Validation loss = 0.19795462489128113
Validation loss = 0.19915194809436798
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1916564702987671
Validation loss = 0.1943926215171814
Validation loss = 0.19638171792030334
Validation loss = 0.1971074342727661
Validation loss = 0.197551429271698
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19383420050144196
Validation loss = 0.190342515707016
Validation loss = 0.1941731870174408
Validation loss = 0.19826935231685638
Validation loss = 0.1981695592403412
Validation loss = 0.2013721764087677
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19387155771255493
Validation loss = 0.19667252898216248
Validation loss = 0.1975858509540558
Validation loss = 0.2007063329219818
Validation loss = 0.20266017317771912
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 123      |
| Iteration     | 9        |
| MaximumReturn | 125      |
| MinimumReturn | 121      |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19968105852603912
Validation loss = 0.19886428117752075
Validation loss = 0.19971241056919098
Validation loss = 0.2037764936685562
Validation loss = 0.20262651145458221
Validation loss = 0.20483988523483276
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1953587532043457
Validation loss = 0.19338716566562653
Validation loss = 0.20130567252635956
Validation loss = 0.1967926025390625
Validation loss = 0.20175665616989136
Validation loss = 0.20396284759044647
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19578509032726288
Validation loss = 0.19832827150821686
Validation loss = 0.20127294957637787
Validation loss = 0.20286956429481506
Validation loss = 0.20536860823631287
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19520306587219238
Validation loss = 0.19499491155147552
Validation loss = 0.19937968254089355
Validation loss = 0.2017814964056015
Validation loss = 0.20392900705337524
Validation loss = 0.20661993324756622
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20001143217086792
Validation loss = 0.2032509744167328
Validation loss = 0.2016492486000061
Validation loss = 0.204851433634758
Validation loss = 0.2047441452741623
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 94.3     |
| Iteration     | 10       |
| MaximumReturn | 102      |
| MinimumReturn | 86.6     |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20316725969314575
Validation loss = 0.20275670289993286
Validation loss = 0.20727930963039398
Validation loss = 0.2070874720811844
Validation loss = 0.20879928767681122
Validation loss = 0.2132856398820877
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20274920761585236
Validation loss = 0.20351047813892365
Validation loss = 0.20477521419525146
Validation loss = 0.20456166565418243
Validation loss = 0.20809118449687958
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20370686054229736
Validation loss = 0.20139630138874054
Validation loss = 0.2022857666015625
Validation loss = 0.20634375512599945
Validation loss = 0.2068004161119461
Validation loss = 0.2103743553161621
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.20170961320400238
Validation loss = 0.2042868733406067
Validation loss = 0.20600645244121552
Validation loss = 0.20839376747608185
Validation loss = 0.21162645518779755
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2025739550590515
Validation loss = 0.2052706480026245
Validation loss = 0.2076416164636612
Validation loss = 0.209753155708313
Validation loss = 0.21045689284801483
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 103      |
| Iteration     | 11       |
| MaximumReturn | 110      |
| MinimumReturn | 92.4     |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20795577764511108
Validation loss = 0.21324966847896576
Validation loss = 0.2114674150943756
Validation loss = 0.21311694383621216
Validation loss = 0.21670788526535034
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20478342473506927
Validation loss = 0.20805588364601135
Validation loss = 0.20873525738716125
Validation loss = 0.20955437421798706
Validation loss = 0.21302613615989685
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20719169080257416
Validation loss = 0.2080734372138977
Validation loss = 0.21063090860843658
Validation loss = 0.21354688704013824
Validation loss = 0.21647262573242188
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2058379352092743
Validation loss = 0.20657934248447418
Validation loss = 0.21109215915203094
Validation loss = 0.2102719396352768
Validation loss = 0.21447817981243134
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2050502598285675
Validation loss = 0.21047796308994293
Validation loss = 0.21173222362995148
Validation loss = 0.2175820767879486
Validation loss = 0.21591416001319885
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 12       |
| MaximumReturn | 132      |
| MinimumReturn | 118      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21174359321594238
Validation loss = 0.216457799077034
Validation loss = 0.21702933311462402
Validation loss = 0.21971122920513153
Validation loss = 0.22233523428440094
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20929162204265594
Validation loss = 0.20998349785804749
Validation loss = 0.21564218401908875
Validation loss = 0.21549682319164276
Validation loss = 0.2175733745098114
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2114182561635971
Validation loss = 0.21347211301326752
Validation loss = 0.2177550494670868
Validation loss = 0.21945904195308685
Validation loss = 0.2185419499874115
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2163574993610382
Validation loss = 0.21366949379444122
Validation loss = 0.21523107588291168
Validation loss = 0.21662844717502594
Validation loss = 0.22178712487220764
Validation loss = 0.22224213182926178
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21643875539302826
Validation loss = 0.21401311457157135
Validation loss = 0.2164788544178009
Validation loss = 0.2191532850265503
Validation loss = 0.22091054916381836
Validation loss = 0.22239328920841217
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 137      |
| Iteration     | 13       |
| MaximumReturn | 140      |
| MinimumReturn | 130      |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21529483795166016
Validation loss = 0.21891354024410248
Validation loss = 0.22110356390476227
Validation loss = 0.2249840348958969
Validation loss = 0.22581903636455536
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21477511525154114
Validation loss = 0.2124808132648468
Validation loss = 0.21691139042377472
Validation loss = 0.22040390968322754
Validation loss = 0.2226022183895111
Validation loss = 0.22382313013076782
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21434496343135834
Validation loss = 0.2151314616203308
Validation loss = 0.22056730091571808
Validation loss = 0.22409306466579437
Validation loss = 0.22047796845436096
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2164136916399002
Validation loss = 0.21906781196594238
Validation loss = 0.22016844153404236
Validation loss = 0.2245911806821823
Validation loss = 0.2260548621416092
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21593675017356873
Validation loss = 0.22050794959068298
Validation loss = 0.22121573984622955
Validation loss = 0.22338125109672546
Validation loss = 0.22539642453193665
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 128      |
| Iteration     | 14       |
| MaximumReturn | 136      |
| MinimumReturn | 122      |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22065775096416473
Validation loss = 0.2223503589630127
Validation loss = 0.2243260145187378
Validation loss = 0.22687137126922607
Validation loss = 0.22557169198989868
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21940681338310242
Validation loss = 0.22519807517528534
Validation loss = 0.22159622609615326
Validation loss = 0.22656145691871643
Validation loss = 0.22735625505447388
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21818161010742188
Validation loss = 0.22060906887054443
Validation loss = 0.22316959500312805
Validation loss = 0.22303363680839539
Validation loss = 0.2247334122657776
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2195637971162796
Validation loss = 0.2237415462732315
Validation loss = 0.22460459172725677
Validation loss = 0.22643572092056274
Validation loss = 0.22914035618305206
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22034785151481628
Validation loss = 0.22175751626491547
Validation loss = 0.2245175838470459
Validation loss = 0.22573548555374146
Validation loss = 0.22868525981903076
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 132      |
| Iteration     | 15       |
| MaximumReturn | 136      |
| MinimumReturn | 127      |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2204972356557846
Validation loss = 0.22521807253360748
Validation loss = 0.2258448451757431
Validation loss = 0.22770856320858002
Validation loss = 0.23039627075195312
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22017882764339447
Validation loss = 0.2230178266763687
Validation loss = 0.22204293310642242
Validation loss = 0.22886332869529724
Validation loss = 0.22927847504615784
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21948984265327454
Validation loss = 0.22168242931365967
Validation loss = 0.22574104368686676
Validation loss = 0.22582954168319702
Validation loss = 0.22847433388233185
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22247518599033356
Validation loss = 0.2237672656774521
Validation loss = 0.2261582911014557
Validation loss = 0.22989697754383087
Validation loss = 0.2272120863199234
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2210487574338913
Validation loss = 0.2247462272644043
Validation loss = 0.223775252699852
Validation loss = 0.22807669639587402
Validation loss = 0.22924503684043884
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 16       |
| MaximumReturn | 132      |
| MinimumReturn | 120      |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22039391100406647
Validation loss = 0.22074945271015167
Validation loss = 0.22463543713092804
Validation loss = 0.2257266640663147
Validation loss = 0.2262212634086609
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2212531566619873
Validation loss = 0.2204979509115219
Validation loss = 0.22563447058200836
Validation loss = 0.22617687284946442
Validation loss = 0.2272939682006836
Validation loss = 0.22831392288208008
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22152480483055115
Validation loss = 0.22218914330005646
Validation loss = 0.22193458676338196
Validation loss = 0.22486169636249542
Validation loss = 0.22745174169540405
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22051168978214264
Validation loss = 0.22337010502815247
Validation loss = 0.22233064472675323
Validation loss = 0.2253713309764862
Validation loss = 0.2302100956439972
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21870192885398865
Validation loss = 0.22306330502033234
Validation loss = 0.2251376360654831
Validation loss = 0.2267402559518814
Validation loss = 0.2271622270345688
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 123      |
| Iteration     | 17       |
| MaximumReturn | 128      |
| MinimumReturn | 115      |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22267360985279083
Validation loss = 0.22120751440525055
Validation loss = 0.2225637435913086
Validation loss = 0.2247394323348999
Validation loss = 0.22489522397518158
Validation loss = 0.2292764037847519
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22199499607086182
Validation loss = 0.22326961159706116
Validation loss = 0.2248377799987793
Validation loss = 0.22562561929225922
Validation loss = 0.22707296907901764
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21907733380794525
Validation loss = 0.22092227637767792
Validation loss = 0.2228812426328659
Validation loss = 0.2246321588754654
Validation loss = 0.22634129226207733
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22272895276546478
Validation loss = 0.22296850383281708
Validation loss = 0.22499997913837433
Validation loss = 0.22534535825252533
Validation loss = 0.2275533378124237
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2210519164800644
Validation loss = 0.22082898020744324
Validation loss = 0.2244684249162674
Validation loss = 0.22642505168914795
Validation loss = 0.22514192759990692
Validation loss = 0.2264363318681717
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 18       |
| MaximumReturn | 130      |
| MinimumReturn | 115      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22459575533866882
Validation loss = 0.2228422909975052
Validation loss = 0.22544312477111816
Validation loss = 0.22539444267749786
Validation loss = 0.22773206233978271
Validation loss = 0.22866661846637726
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22180119156837463
Validation loss = 0.2233930379152298
Validation loss = 0.2277056872844696
Validation loss = 0.22643227875232697
Validation loss = 0.22928857803344727
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22052033245563507
Validation loss = 0.22184494137763977
Validation loss = 0.2236500233411789
Validation loss = 0.22488293051719666
Validation loss = 0.226877361536026
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22499164938926697
Validation loss = 0.2230534553527832
Validation loss = 0.22632288932800293
Validation loss = 0.22601938247680664
Validation loss = 0.2283724993467331
Validation loss = 0.22769734263420105
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22379633784294128
Validation loss = 0.22420945763587952
Validation loss = 0.2258949726819992
Validation loss = 0.22682905197143555
Validation loss = 0.22747007012367249
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 123      |
| Iteration     | 19       |
| MaximumReturn | 126      |
| MinimumReturn | 119      |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22252216935157776
Validation loss = 0.22405792772769928
Validation loss = 0.22472767531871796
Validation loss = 0.2260633260011673
Validation loss = 0.2276371568441391
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2219071090221405
Validation loss = 0.2226024568080902
Validation loss = 0.22631286084651947
Validation loss = 0.22586782276630402
Validation loss = 0.22684577107429504
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22201678156852722
Validation loss = 0.22282713651657104
Validation loss = 0.2229706197977066
Validation loss = 0.2253875732421875
Validation loss = 0.22598232328891754
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2225046455860138
Validation loss = 0.22270023822784424
Validation loss = 0.22408780455589294
Validation loss = 0.22587354481220245
Validation loss = 0.22599121928215027
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2232668101787567
Validation loss = 0.22316239774227142
Validation loss = 0.22407940030097961
Validation loss = 0.22529684007167816
Validation loss = 0.22809462249279022
Validation loss = 0.2276511937379837
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 116      |
| Iteration     | 20       |
| MaximumReturn | 124      |
| MinimumReturn | 106      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22429804503917694
Validation loss = 0.22388434410095215
Validation loss = 0.2242681384086609
Validation loss = 0.2279571145772934
Validation loss = 0.22829346358776093
Validation loss = 0.22953635454177856
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22229161858558655
Validation loss = 0.2236987054347992
Validation loss = 0.2246025651693344
Validation loss = 0.22535432875156403
Validation loss = 0.22759661078453064
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22262020409107208
Validation loss = 0.22244690358638763
Validation loss = 0.22732548415660858
Validation loss = 0.22662463784217834
Validation loss = 0.22840595245361328
Validation loss = 0.22837942838668823
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22284460067749023
Validation loss = 0.22457066178321838
Validation loss = 0.2251409888267517
Validation loss = 0.22588394582271576
Validation loss = 0.22768530249595642
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22395282983779907
Validation loss = 0.22426553070545197
Validation loss = 0.22612394392490387
Validation loss = 0.2240598052740097
Validation loss = 0.22836436331272125
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 119      |
| Iteration     | 21       |
| MaximumReturn | 127      |
| MinimumReturn | 113      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22333648800849915
Validation loss = 0.22523581981658936
Validation loss = 0.226740300655365
Validation loss = 0.22827377915382385
Validation loss = 0.2303520143032074
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22422796487808228
Validation loss = 0.22386398911476135
Validation loss = 0.22731076180934906
Validation loss = 0.22740507125854492
Validation loss = 0.2285016030073166
Validation loss = 0.2295546680688858
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22480079531669617
Validation loss = 0.22539746761322021
Validation loss = 0.22781908512115479
Validation loss = 0.22742870450019836
Validation loss = 0.22930586338043213
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22284206748008728
Validation loss = 0.22378116846084595
Validation loss = 0.22585591673851013
Validation loss = 0.22648397088050842
Validation loss = 0.22743426263332367
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22420638799667358
Validation loss = 0.22578391432762146
Validation loss = 0.22832219302654266
Validation loss = 0.22821380198001862
Validation loss = 0.2275071144104004
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 22       |
| MaximumReturn | 131      |
| MinimumReturn | 120      |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22749416530132294
Validation loss = 0.22582630813121796
Validation loss = 0.22782842814922333
Validation loss = 0.2284950166940689
Validation loss = 0.22887630760669708
Validation loss = 0.2299410104751587
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2252119779586792
Validation loss = 0.22462047636508942
Validation loss = 0.22762124240398407
Validation loss = 0.22773337364196777
Validation loss = 0.2279050350189209
Validation loss = 0.2305542677640915
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22397136688232422
Validation loss = 0.22556941211223602
Validation loss = 0.22834108769893646
Validation loss = 0.22809211909770966
Validation loss = 0.22752587497234344
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22426556050777435
Validation loss = 0.2244192361831665
Validation loss = 0.2254692167043686
Validation loss = 0.22610406577587128
Validation loss = 0.2293977290391922
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22469846904277802
Validation loss = 0.22451388835906982
Validation loss = 0.2267986685037613
Validation loss = 0.22620107233524323
Validation loss = 0.22868633270263672
Validation loss = 0.22842907905578613
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 132      |
| Iteration     | 23       |
| MaximumReturn | 134      |
| MinimumReturn | 128      |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2236098349094391
Validation loss = 0.2248116433620453
Validation loss = 0.22700853645801544
Validation loss = 0.22657510638237
Validation loss = 0.22747348248958588
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2226860076189041
Validation loss = 0.22429050505161285
Validation loss = 0.22676187753677368
Validation loss = 0.22629351913928986
Validation loss = 0.22763191163539886
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22437025606632233
Validation loss = 0.2243874967098236
Validation loss = 0.2249392718076706
Validation loss = 0.2272721827030182
Validation loss = 0.22561559081077576
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22093811631202698
Validation loss = 0.2221345156431198
Validation loss = 0.22382380068302155
Validation loss = 0.22472359240055084
Validation loss = 0.22832058370113373
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2240375131368637
Validation loss = 0.2233777493238449
Validation loss = 0.22444236278533936
Validation loss = 0.22471600770950317
Validation loss = 0.2266397625207901
Validation loss = 0.22867538034915924
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 24       |
| MaximumReturn | 129      |
| MinimumReturn | 122      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22407802939414978
Validation loss = 0.22343124449253082
Validation loss = 0.22440025210380554
Validation loss = 0.2243584841489792
Validation loss = 0.225726917386055
Validation loss = 0.22697210311889648
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22282005846500397
Validation loss = 0.22238685190677643
Validation loss = 0.22373920679092407
Validation loss = 0.2260020673274994
Validation loss = 0.22700278460979462
Validation loss = 0.22586096823215485
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22107869386672974
Validation loss = 0.2214210033416748
Validation loss = 0.22405098378658295
Validation loss = 0.22546683251857758
Validation loss = 0.22509993612766266
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2219022810459137
Validation loss = 0.2244805246591568
Validation loss = 0.22321783006191254
Validation loss = 0.22725875675678253
Validation loss = 0.22687172889709473
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22228384017944336
Validation loss = 0.22578296065330505
Validation loss = 0.22502118349075317
Validation loss = 0.2250608503818512
Validation loss = 0.22636684775352478
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 25       |
| MaximumReturn | 132      |
| MinimumReturn | 118      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2227645069360733
Validation loss = 0.22250154614448547
Validation loss = 0.2254294753074646
Validation loss = 0.22515243291854858
Validation loss = 0.2265356332063675
Validation loss = 0.22761735320091248
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22295130789279938
Validation loss = 0.22258040308952332
Validation loss = 0.22565214335918427
Validation loss = 0.2257377654314041
Validation loss = 0.22743207216262817
Validation loss = 0.22701436281204224
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22311387956142426
Validation loss = 0.22110731899738312
Validation loss = 0.22357778251171112
Validation loss = 0.22455328702926636
Validation loss = 0.22581587731838226
Validation loss = 0.22533294558525085
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22056972980499268
Validation loss = 0.22278176248073578
Validation loss = 0.22278867661952972
Validation loss = 0.22522224485874176
Validation loss = 0.2260272353887558
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22229737043380737
Validation loss = 0.2224786877632141
Validation loss = 0.22542919218540192
Validation loss = 0.22632162272930145
Validation loss = 0.22601182758808136
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 132      |
| Iteration     | 26       |
| MaximumReturn | 134      |
| MinimumReturn | 131      |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22401900589466095
Validation loss = 0.22357134521007538
Validation loss = 0.22498784959316254
Validation loss = 0.22587482631206512
Validation loss = 0.22537243366241455
Validation loss = 0.22773964703083038
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22410549223423004
Validation loss = 0.22346441447734833
Validation loss = 0.22418823838233948
Validation loss = 0.22622573375701904
Validation loss = 0.22852174937725067
Validation loss = 0.2272753268480301
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22224852442741394
Validation loss = 0.2215319722890854
Validation loss = 0.2237604409456253
Validation loss = 0.22501005232334137
Validation loss = 0.22577254474163055
Validation loss = 0.22578828036785126
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2223869115114212
Validation loss = 0.2232327163219452
Validation loss = 0.22398731112480164
Validation loss = 0.22340790927410126
Validation loss = 0.2267255038022995
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22286851704120636
Validation loss = 0.22219683229923248
Validation loss = 0.22383789718151093
Validation loss = 0.224831223487854
Validation loss = 0.22562967240810394
Validation loss = 0.22533096373081207
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 132      |
| Iteration     | 27       |
| MaximumReturn | 136      |
| MinimumReturn | 129      |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2242439240217209
Validation loss = 0.22451189160346985
Validation loss = 0.2250250279903412
Validation loss = 0.2264453023672104
Validation loss = 0.22712978720664978
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22494234144687653
Validation loss = 0.22340823709964752
Validation loss = 0.22411096096038818
Validation loss = 0.2250489741563797
Validation loss = 0.2262582778930664
Validation loss = 0.2272542268037796
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22302471101284027
Validation loss = 0.2232256829738617
Validation loss = 0.22439396381378174
Validation loss = 0.2244967520236969
Validation loss = 0.22469954192638397
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22162431478500366
Validation loss = 0.22265325486660004
Validation loss = 0.22372977435588837
Validation loss = 0.2251715213060379
Validation loss = 0.2257673293352127
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2213928997516632
Validation loss = 0.2239494025707245
Validation loss = 0.22361265122890472
Validation loss = 0.22449927031993866
Validation loss = 0.22559480369091034
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 127      |
| Iteration     | 28       |
| MaximumReturn | 128      |
| MinimumReturn | 125      |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2246040552854538
Validation loss = 0.22487373650074005
Validation loss = 0.22712111473083496
Validation loss = 0.22737573087215424
Validation loss = 0.22948044538497925
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2252212017774582
Validation loss = 0.22477978467941284
Validation loss = 0.22523097693920135
Validation loss = 0.22784855961799622
Validation loss = 0.22673295438289642
Validation loss = 0.22854235768318176
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2224425971508026
Validation loss = 0.22496657073497772
Validation loss = 0.22531434893608093
Validation loss = 0.22508786618709564
Validation loss = 0.22612883150577545
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.222106471657753
Validation loss = 0.22344772517681122
Validation loss = 0.22469595074653625
Validation loss = 0.2248672991991043
Validation loss = 0.22855155169963837
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22448071837425232
Validation loss = 0.22446848452091217
Validation loss = 0.22582849860191345
Validation loss = 0.2262388914823532
Validation loss = 0.22642773389816284
Validation loss = 0.22641147673130035
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 127      |
| Iteration     | 29       |
| MaximumReturn | 133      |
| MinimumReturn | 122      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22429607808589935
Validation loss = 0.22585144639015198
Validation loss = 0.22543583810329437
Validation loss = 0.22759710252285004
Validation loss = 0.22844502329826355
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2256099283695221
Validation loss = 0.22395208477973938
Validation loss = 0.2254934012889862
Validation loss = 0.22690044343471527
Validation loss = 0.22788558900356293
Validation loss = 0.22859661281108856
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22314472496509552
Validation loss = 0.22374138236045837
Validation loss = 0.2273380309343338
Validation loss = 0.22720946371555328
Validation loss = 0.22697345912456512
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22521604597568512
Validation loss = 0.2233170121908188
Validation loss = 0.22477734088897705
Validation loss = 0.22691413760185242
Validation loss = 0.22641363739967346
Validation loss = 0.22655196487903595
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2250375747680664
Validation loss = 0.22419042885303497
Validation loss = 0.22666539251804352
Validation loss = 0.2257481962442398
Validation loss = 0.22703851759433746
Validation loss = 0.22793209552764893
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 127      |
| Iteration     | 30       |
| MaximumReturn | 133      |
| MinimumReturn | 125      |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22424745559692383
Validation loss = 0.22368863224983215
Validation loss = 0.22506360709667206
Validation loss = 0.22545741498470306
Validation loss = 0.22665712237358093
Validation loss = 0.22799524664878845
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22461974620819092
Validation loss = 0.22380375862121582
Validation loss = 0.22594410181045532
Validation loss = 0.22783569991588593
Validation loss = 0.2265906035900116
Validation loss = 0.22744134068489075
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22296226024627686
Validation loss = 0.22314128279685974
Validation loss = 0.22431136667728424
Validation loss = 0.22660665214061737
Validation loss = 0.22523175179958344
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22340914607048035
Validation loss = 0.22226350009441376
Validation loss = 0.22530972957611084
Validation loss = 0.22658969461917877
Validation loss = 0.22666719555854797
Validation loss = 0.22705033421516418
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2238079011440277
Validation loss = 0.22514915466308594
Validation loss = 0.22494269907474518
Validation loss = 0.2263723462820053
Validation loss = 0.2267681360244751
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 125      |
| Iteration     | 31       |
| MaximumReturn | 129      |
| MinimumReturn | 121      |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22395247220993042
Validation loss = 0.2260301262140274
Validation loss = 0.22643621265888214
Validation loss = 0.22747288644313812
Validation loss = 0.22760988771915436
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22422030568122864
Validation loss = 0.2249196171760559
Validation loss = 0.2270221710205078
Validation loss = 0.22782106697559357
Validation loss = 0.22682742774486542
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22385895252227783
Validation loss = 0.22381795942783356
Validation loss = 0.22743916511535645
Validation loss = 0.22651441395282745
Validation loss = 0.22514386475086212
Validation loss = 0.2273155003786087
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2245534509420395
Validation loss = 0.22525256872177124
Validation loss = 0.22469854354858398
Validation loss = 0.2261848747730255
Validation loss = 0.22732850909233093
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2246548980474472
Validation loss = 0.2238263040781021
Validation loss = 0.22640177607536316
Validation loss = 0.2269250750541687
Validation loss = 0.226026713848114
Validation loss = 0.22682230174541473
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 130      |
| Iteration     | 32       |
| MaximumReturn | 132      |
| MinimumReturn | 126      |
| TotalSamples  | 136000   |
----------------------------
