Logging to experiments/gym_cheetahO01/gym_cheetahO01/Fri-28-Oct-2022-08-59-10-PM-CDT_gym_cheetahO01_trpo_iteration_20_seed4321
Printing configuration...
{'env_name': 'gym_cheetahO01',
 'random_seeds': [4321, 2314, 2341, 3421],
 'save_variables': False,
 'model_save_dir': '/tmp/gym_cheetahO01_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 33,
 'num_path_random': 6,
 'num_path_onpol': 6,
 'env_horizon': 1000,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'intrinsic_reward_only': False,
              'external_reward_evaluation_interval': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9,
                              'kl_clip': 0.0001, 'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh',
            'reinitialize_every_itr': False},
 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20,
          'batch_size': 50000, 'gae': 0.95},
 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20,
                     'batch_size': 50000, 'gae': 0.95},
 'algo': 'trpo'}
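For orientation, a small, abbreviated sketch of how values in this configuration map onto the lines that follow (the dict below is a hand-trimmed copy of the one above; the variable names are illustrative only):

    # Illustrative only: abbreviated copy of the printed configuration.
    config = {
        'env_name': 'gym_cheetahO01',
        'onpol_iters': 33,            # outer iterations ("itr #0", "itr #1", ...)
        'num_path_onpol': 6,          # 6 on-policy rollouts per iteration
        'env_horizon': 1000,          # up to 1000 timesteps per rollout
        'dynamics': {'ensemble_model_count': 5, 'n_layers': 4, 'hidden_size': 1000,
                     'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200},
        'trpo': {'iterations': 20, 'batch_size': 50000, 'gamma': 0.99, 'gae': 0.95},
    }

    ensemble_size = config['dynamics']['ensemble_model_count']  # "Fitting model k ... of 5 models"
    trpo_iters = config['trpo']['iterations']                    # 20 "Obtaining samples ..." lines
    print(ensemble_size, trpo_iters)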
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
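The six random paths above each contain up to env_horizon = 1000 transitions collected under a uniform-random policy. A minimal sketch of such a sampler, assuming the classic Gym step API (obs, reward, done, info); the function name and path layout are illustrative, not the experiment's own code:

    # Minimal sketch of random-rollout collection under a uniform-random policy.
    def sample_random_paths(env, num_paths=6, horizon=1000):
        paths, total_timesteps = [], 0
        for i in range(num_paths):
            print(f"Path {i} | total_timesteps {total_timesteps}.")
            obs = env.reset()
            observations, actions, next_observations, rewards = [], [], [], []
            for _ in range(horizon):
                act = env.action_space.sample()            # uniform-random action
                next_obs, rew, done, _ = env.step(act)
                observations.append(obs)
                actions.append(act)
                next_observations.append(next_obs)
                rewards.append(rew)
                obs = next_obs
                if done:
                    break
            total_timesteps += len(observations)
            paths.append({'observations': observations, 'actions': actions,
                          'next_observations': next_observations, 'rewards': rewards})
        return paths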
Creating normalization for training data.
Done creating normalization for training data.
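"Normalization" here refers to per-dimension statistics (mean and standard deviation) computed from the random-rollout data so that dynamics-model inputs and targets can be standardized. A minimal sketch under that assumption (names are illustrative):

    import numpy as np

    # Sketch of per-dimension normalization statistics over collected rollout data.
    def compute_normalization(data, eps=1e-8):
        mean = data.mean(axis=0)
        std = data.std(axis=0) + eps   # eps guards against division by zero on constant dims
        return mean, std

    # Usage (hypothetical `paths` list from the random rollouts above):
    # obs = np.concatenate([np.asarray(p['observations']) for p in paths])
    # obs_mean, obs_std = compute_normalization(obs)
    # normalized_obs = (obs - obs_mean) / obs_std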
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized.
Train dynamics model with intrinsic reward only? False
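The "ensemble of 5 dynamics models" above denotes five independently initialized networks trained on the same data. A minimal sketch of one member's architecture, using the printed hyperparameters (n_layers=4, hidden_size=1000, relu); PyTorch is used purely for illustration, the true model.dynamics.NNDynamicsModel is not shown in this log, and the observation/action dimensions below are placeholders:

    import torch.nn as nn

    # Sketch of one ensemble member: an MLP mapping (state, action) -> next state
    # (or state delta), matching n_layers=4, hidden_size=1000, activation='relu'.
    def make_dynamics_net(obs_dim, act_dim, hidden_size=1000, n_layers=4):
        layers, in_dim = [], obs_dim + act_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_size), nn.ReLU()]
            in_dim = hidden_size
        layers.append(nn.Linear(in_dim, obs_dim))
        return nn.Sequential(*layers)

    # Five independently initialized members, per 'ensemble_model_count': 5
    # (obs_dim/act_dim are placeholders, not the env's true dimensions).
    ensemble = [make_dynamics_net(obs_dim=17, act_dim=6) for _ in range(5)]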
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4413784146308899
Validation loss = 0.2152242660522461
Validation loss = 0.16880226135253906
Validation loss = 0.15794974565505981
Validation loss = 0.15675142407417297
Validation loss = 0.15565523505210876
Validation loss = 0.15318958461284637
Validation loss = 0.15766075253486633
Validation loss = 0.1529536098241806
Validation loss = 0.16178278625011444
Validation loss = 0.1613570749759674
Validation loss = 0.16181311011314392
Validation loss = 0.15646663308143616
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.560768723487854
Validation loss = 0.21686036884784698
Validation loss = 0.1709703952074051
Validation loss = 0.16014277935028076
Validation loss = 0.15552955865859985
Validation loss = 0.155117928981781
Validation loss = 0.15605959296226501
Validation loss = 0.16524581611156464
Validation loss = 0.17703521251678467
Validation loss = 0.16454847157001495
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.47667354345321655
Validation loss = 0.21140176057815552
Validation loss = 0.17218880355358124
Validation loss = 0.1593073159456253
Validation loss = 0.15512797236442566
Validation loss = 0.1533680111169815
Validation loss = 0.1560681015253067
Validation loss = 0.1607588231563568
Validation loss = 0.15827998518943787
Validation loss = 0.1619594246149063
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6431095600128174
Validation loss = 0.21424314379692078
Validation loss = 0.172575443983078
Validation loss = 0.1614099144935608
Validation loss = 0.1570950448513031
Validation loss = 0.1623431146144867
Validation loss = 0.1559309959411621
Validation loss = 0.1552639603614807
Validation loss = 0.1742093563079834
Validation loss = 0.1613551825284958
Validation loss = 0.15856918692588806
Validation loss = 0.16112327575683594
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4745147228240967
Validation loss = 0.21152743697166443
Validation loss = 0.17230573296546936
Validation loss = 0.1585327684879303
Validation loss = 0.15566816926002502
Validation loss = 0.15853047370910645
Validation loss = 0.15693262219429016
Validation loss = 0.15628373622894287
Validation loss = 0.15462908148765564
Validation loss = 0.15361624956130981
Validation loss = 0.15593385696411133
Validation loss = 0.1661733090877533
Validation loss = 0.1584598422050476
Validation loss = 0.15885412693023682
Done fitting dynamics.
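Each "Validation loss = ..." line above corresponds to one evaluation on held-out data during that member's fit; the varying number of lines per model suggests an early-stopping-style criterion. A minimal sketch of a single member's fit under that assumption (PyTorch and the patience rule are both assumptions, not the code that produced this log); hyperparameters mirror the printed config (learning_rate=0.001, batch_size=1000, epochs=200):

    import torch
    import torch.nn as nn

    # Sketch: fit one ensemble member, print validation MSE after each epoch,
    # and stop once it has not improved for `patience` epochs.
    def fit_dynamics_model(net, train_x, train_y, val_x, val_y,
                           lr=1e-3, batch_size=1000, max_epochs=200, patience=5):
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        best_val, epochs_since_best = float('inf'), 0
        for _ in range(max_epochs):
            perm = torch.randperm(len(train_x))
            for start in range(0, len(train_x), batch_size):
                idx = perm[start:start + batch_size]
                opt.zero_grad()
                loss_fn(net(train_x[idx]), train_y[idx]).backward()
                opt.step()
            with torch.no_grad():
                val_loss = loss_fn(net(val_x), val_y).item()
            print(f"Validation loss = {val_loss}")
            if val_loss < best_val:
                best_val, epochs_since_best = val_loss, 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:
                    break   # validation loss stalled; stop this member's fit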
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
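The 20 "Obtaining samples for iteration k..." lines correspond to the configured trpo.iterations = 20, each gathering roughly trpo.batch_size = 50000 timesteps by rolling the current policy out inside the learned dynamics ensemble rather than the real environment. A skeleton of that inner loop; both callables are hypothetical stand-ins for machinery not shown in this log:

    # Skeleton of the inner policy-optimization loop (illustrative only).
    def train_policy_with_trpo(sample_model_rollout, trpo_update,
                               trpo_iterations=20, batch_size=50000):
        for it in range(trpo_iterations):
            print(f"Obtaining samples for iteration {it}...")
            paths, timesteps = [], 0
            while timesteps < batch_size:
                path = sample_model_rollout()     # imagined rollout under the current policy
                paths.append(path)
                timesteps += len(path['rewards'])
            trpo_update(paths)                    # one KL-constrained (TRPO) policy step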
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -282     |
| Iteration     | 0        |
| MaximumReturn | -182     |
| MinimumReturn | -365     |
| TotalSamples  | 8000     |
----------------------------
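The table above summarizes the six on-policy evaluation rollouts for this iteration. A minimal sketch of how such a summary could be assembled (assuming each path carries a 'rewards' sequence; this is not the logger actually used here):

    import numpy as np

    # Sketch: per-path returns summarized in the same tabular form as the log above.
    def log_return_stats(paths, iteration, total_samples):
        returns = [float(np.sum(p['rewards'])) for p in paths]
        stats = [('AverageReturn', np.mean(returns)),
                 ('Iteration', iteration),
                 ('MaximumReturn', np.max(returns)),
                 ('MinimumReturn', np.min(returns)),
                 ('TotalSamples', total_samples)]
        print('-' * 28)
        for key, val in stats:
            print(f"| {key:<13} | {val:<8.4g} |")
        print('-' * 28)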
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20323702692985535
Validation loss = 0.17784450948238373
Validation loss = 0.17643554508686066
Validation loss = 0.17932866513729095
Validation loss = 0.1843670755624771
Validation loss = 0.18184883892536163
Validation loss = 0.20243799686431885
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1977590173482895
Validation loss = 0.1761327087879181
Validation loss = 0.18132075667381287
Validation loss = 0.19769828021526337
Validation loss = 0.2650756537914276
Validation loss = 0.2033717930316925
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19905322790145874
Validation loss = 0.17691601812839508
Validation loss = 0.1762576401233673
Validation loss = 0.17461000382900238
Validation loss = 0.17770567536354065
Validation loss = 0.17689990997314453
Validation loss = 0.17793744802474976
Validation loss = 0.19973836839199066
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19996969401836395
Validation loss = 0.175997793674469
Validation loss = 0.17539311945438385
Validation loss = 0.18189993500709534
Validation loss = 0.1734108030796051
Validation loss = 0.1847091019153595
Validation loss = 0.17531917989253998
Validation loss = 0.1869385540485382
Validation loss = 0.18457508087158203
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2016269564628601
Validation loss = 0.17572364211082458
Validation loss = 0.25038933753967285
Validation loss = 0.18908247351646423
Validation loss = 0.17796765267848969
Validation loss = 0.2364572435617447
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -34.5    |
| Iteration     | 1        |
| MaximumReturn | 46.4     |
| MinimumReturn | -107     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17249564826488495
Validation loss = 0.17419278621673584
Validation loss = 0.17304496467113495
Validation loss = 0.1710277646780014
Validation loss = 0.17057180404663086
Validation loss = 0.18883931636810303
Validation loss = 0.19021427631378174
Validation loss = 0.1758461445569992
Validation loss = 0.18163056671619415
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17337240278720856
Validation loss = 0.16971875727176666
Validation loss = 0.17741625010967255
Validation loss = 0.17131949961185455
Validation loss = 0.16447661817073822
Validation loss = 0.17442212998867035
Validation loss = 0.17600999772548676
Validation loss = 0.17383973300457
Validation loss = 0.1754852533340454
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16891562938690186
Validation loss = 0.16958732903003693
Validation loss = 0.16742415726184845
Validation loss = 0.17198039591312408
Validation loss = 0.17504918575286865
Validation loss = 0.17753590643405914
Validation loss = 0.17611415684223175
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17011411488056183
Validation loss = 0.17100845277309418
Validation loss = 0.17272908985614777
Validation loss = 0.17499136924743652
Validation loss = 0.2093570977449417
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17324693500995636
Validation loss = 0.16912607848644257
Validation loss = 0.16975705325603485
Validation loss = 0.17151685059070587
Validation loss = 0.1692322939634323
Validation loss = 0.16618286073207855
Validation loss = 0.18517784774303436
Validation loss = 0.1705118864774704
Validation loss = 0.17424069344997406
Validation loss = 0.17120565474033356
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -305     |
| Iteration     | 2        |
| MaximumReturn | 324      |
| MinimumReturn | -603     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4185701012611389
Validation loss = 0.45593488216400146
Validation loss = 0.4614722728729248
Validation loss = 0.47339844703674316
Validation loss = 0.49398985505104065
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.45853158831596375
Validation loss = 0.4805550277233124
Validation loss = 0.512920081615448
Validation loss = 0.49822235107421875
Validation loss = 0.47332024574279785
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5043284296989441
Validation loss = 0.5167707204818726
Validation loss = 0.5164293050765991
Validation loss = 0.5352722406387329
Validation loss = 0.5667214393615723
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3966779112815857
Validation loss = 0.4237533211708069
Validation loss = 0.4378091096878052
Validation loss = 0.4363376498222351
Validation loss = 0.411594957113266
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.43766725063323975
Validation loss = 0.5333759784698486
Validation loss = 0.49082115292549133
Validation loss = 0.49790114164352417
Validation loss = 0.4978163242340088
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -217     |
| Iteration     | 3        |
| MaximumReturn | 177      |
| MinimumReturn | -469     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.192378431558609
Validation loss = 0.18404068052768707
Validation loss = 0.1774832308292389
Validation loss = 0.18290071189403534
Validation loss = 0.17798639833927155
Validation loss = 0.1793302446603775
Validation loss = 0.18173958361148834
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19058308005332947
Validation loss = 0.18771111965179443
Validation loss = 0.1778240203857422
Validation loss = 0.1887994110584259
Validation loss = 0.1785334050655365
Validation loss = 0.18748226761817932
Validation loss = 0.18114981055259705
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.191008523106575
Validation loss = 0.17683523893356323
Validation loss = 0.18841448426246643
Validation loss = 0.17498831450939178
Validation loss = 0.17950153350830078
Validation loss = 0.19813258945941925
Validation loss = 0.18661272525787354
Validation loss = 0.1866699606180191
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18602421879768372
Validation loss = 0.1749667376279831
Validation loss = 0.1734556257724762
Validation loss = 0.18003608286380768
Validation loss = 0.17663617432117462
Validation loss = 0.20463474094867706
Validation loss = 0.17609061300754547
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1904371678829193
Validation loss = 0.176168754696846
Validation loss = 0.18289723992347717
Validation loss = 0.17744359374046326
Validation loss = 0.18045002222061157
Validation loss = 0.1803213357925415
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 6.08     |
| Iteration     | 4        |
| MaximumReturn | 256      |
| MinimumReturn | -733     |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17781370878219604
Validation loss = 0.1844366043806076
Validation loss = 0.18084703385829926
Validation loss = 0.1777193546295166
Validation loss = 0.17857052385807037
Validation loss = 0.17764432728290558
Validation loss = 0.18162114918231964
Validation loss = 0.18074630200862885
Validation loss = 0.18091696500778198
Validation loss = 0.18001586198806763
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17625324428081512
Validation loss = 0.17373813688755035
Validation loss = 0.17944644391536713
Validation loss = 0.19029317796230316
Validation loss = 0.23042626678943634
Validation loss = 0.17919595539569855
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17761938273906708
Validation loss = 0.182529017329216
Validation loss = 0.1765117645263672
Validation loss = 0.17578738927841187
Validation loss = 0.18011029064655304
Validation loss = 0.19348283112049103
Validation loss = 0.17991316318511963
Validation loss = 0.18250709772109985
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1771034151315689
Validation loss = 0.1783093810081482
Validation loss = 0.17652052640914917
Validation loss = 0.17994610965251923
Validation loss = 0.17771349847316742
Validation loss = 0.18650059401988983
Validation loss = 0.1782611608505249
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17571385204792023
Validation loss = 0.18114395439624786
Validation loss = 0.17442817986011505
Validation loss = 0.17449207603931427
Validation loss = 0.176058828830719
Validation loss = 0.17968229949474335
Validation loss = 0.18032167851924896
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -333     |
| Iteration     | 5        |
| MaximumReturn | 160      |
| MinimumReturn | -887     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17757487297058105
Validation loss = 0.17667019367218018
Validation loss = 0.1759415566921234
Validation loss = 0.18269293010234833
Validation loss = 0.18370197713375092
Validation loss = 0.17971499264240265
Validation loss = 0.18012063205242157
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17630909383296967
Validation loss = 0.18881694972515106
Validation loss = 0.17614050209522247
Validation loss = 0.17734427750110626
Validation loss = 0.17778292298316956
Validation loss = 0.17582157254219055
Validation loss = 0.181890606880188
Validation loss = 0.17884325981140137
Validation loss = 0.176066592335701
Validation loss = 0.1793462187051773
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.176641583442688
Validation loss = 0.1761275678873062
Validation loss = 0.17784945666790009
Validation loss = 0.18222269415855408
Validation loss = 0.1774158775806427
Validation loss = 0.1804148405790329
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17431628704071045
Validation loss = 0.17362631857395172
Validation loss = 0.18411895632743835
Validation loss = 0.1881236881017685
Validation loss = 0.1798659712076187
Validation loss = 0.19474783539772034
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18052251636981964
Validation loss = 0.17433211207389832
Validation loss = 0.17180784046649933
Validation loss = 0.18182621896266937
Validation loss = 0.1829608976840973
Validation loss = 0.17822493612766266
Validation loss = 0.1816001981496811
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -242     |
| Iteration     | 6        |
| MaximumReturn | 122      |
| MinimumReturn | -793     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18227817118167877
Validation loss = 0.1797669529914856
Validation loss = 0.18442651629447937
Validation loss = 0.177690327167511
Validation loss = 0.18074534833431244
Validation loss = 0.1816435605287552
Validation loss = 0.1810828298330307
Validation loss = 0.1810149848461151
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17498961091041565
Validation loss = 0.17789345979690552
Validation loss = 0.17795273661613464
Validation loss = 0.17672498524188995
Validation loss = 0.18605875968933105
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17696109414100647
Validation loss = 0.1785922348499298
Validation loss = 0.17906548082828522
Validation loss = 0.17818190157413483
Validation loss = 0.17865075170993805
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17686040699481964
Validation loss = 0.17782816290855408
Validation loss = 0.1820528209209442
Validation loss = 0.1818685680627823
Validation loss = 0.1850275993347168
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17372669279575348
Validation loss = 0.17448784410953522
Validation loss = 0.17632222175598145
Validation loss = 0.1752711683511734
Validation loss = 0.1808416247367859
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 196      |
| Iteration     | 7        |
| MaximumReturn | 608      |
| MinimumReturn | -635     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1783575415611267
Validation loss = 0.18069766461849213
Validation loss = 0.1807083785533905
Validation loss = 0.18100863695144653
Validation loss = 0.17921698093414307
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17495255172252655
Validation loss = 0.17702847719192505
Validation loss = 0.17903874814510345
Validation loss = 0.18090471625328064
Validation loss = 0.18300339579582214
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17855964601039886
Validation loss = 0.1768799126148224
Validation loss = 0.18753041326999664
Validation loss = 0.17785009741783142
Validation loss = 0.17891165614128113
Validation loss = 0.17799323797225952
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17825739085674286
Validation loss = 0.1766584813594818
Validation loss = 0.17620520293712616
Validation loss = 0.18614116311073303
Validation loss = 0.1805000752210617
Validation loss = 0.18310798704624176
Validation loss = 0.17888575792312622
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17475774884223938
Validation loss = 0.17655211687088013
Validation loss = 0.1809971034526825
Validation loss = 0.17740744352340698
Validation loss = 0.18152296543121338
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -195     |
| Iteration     | 8        |
| MaximumReturn | -49      |
| MinimumReturn | -434     |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18113143742084503
Validation loss = 0.17887386679649353
Validation loss = 0.18066681921482086
Validation loss = 0.17816868424415588
Validation loss = 0.18332138657569885
Validation loss = 0.18454653024673462
Validation loss = 0.1819416731595993
Validation loss = 0.18228109180927277
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17988808453083038
Validation loss = 0.1755352020263672
Validation loss = 0.18070213496685028
Validation loss = 0.17903412878513336
Validation loss = 0.18007488548755646
Validation loss = 0.1811015009880066
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17694401741027832
Validation loss = 0.1804230660200119
Validation loss = 0.18284958600997925
Validation loss = 0.1822669804096222
Validation loss = 0.1813264787197113
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17855006456375122
Validation loss = 0.1789960414171219
Validation loss = 0.18356376886367798
Validation loss = 0.18028095364570618
Validation loss = 0.17913231253623962
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17691245675086975
Validation loss = 0.1774209439754486
Validation loss = 0.1793447583913803
Validation loss = 0.1807814985513687
Validation loss = 0.17902466654777527
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 5.11     |
| Iteration     | 9        |
| MaximumReturn | 596      |
| MinimumReturn | -297     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18186043202877045
Validation loss = 0.18186281621456146
Validation loss = 0.1820603311061859
Validation loss = 0.1876213699579239
Validation loss = 0.18021763861179352
Validation loss = 0.18629080057144165
Validation loss = 0.1818610578775406
Validation loss = 0.18082799017429352
Validation loss = 0.18259134888648987
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18019118905067444
Validation loss = 0.17821626365184784
Validation loss = 0.17944298684597015
Validation loss = 0.17958807945251465
Validation loss = 0.1792244017124176
Validation loss = 0.18223491311073303
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1766015738248825
Validation loss = 0.17783574759960175
Validation loss = 0.1787133663892746
Validation loss = 0.17950107157230377
Validation loss = 0.17896980047225952
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17849253118038177
Validation loss = 0.17991262674331665
Validation loss = 0.1806623935699463
Validation loss = 0.17869098484516144
Validation loss = 0.18152786791324615
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17810457944869995
Validation loss = 0.17584024369716644
Validation loss = 0.18250204622745514
Validation loss = 0.1837102323770523
Validation loss = 0.19042561948299408
Validation loss = 0.181941419839859
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 180      |
| Iteration     | 10       |
| MaximumReturn | 524      |
| MinimumReturn | -137     |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18522799015045166
Validation loss = 0.18120038509368896
Validation loss = 0.180853009223938
Validation loss = 0.1824154108762741
Validation loss = 0.18269400298595428
Validation loss = 0.18047793209552765
Validation loss = 0.1841173619031906
Validation loss = 0.18303529918193817
Validation loss = 0.18085689842700958
Validation loss = 0.18153925240039825
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17951864004135132
Validation loss = 0.17969490587711334
Validation loss = 0.17937004566192627
Validation loss = 0.1791897565126419
Validation loss = 0.18160957098007202
Validation loss = 0.1831352561712265
Validation loss = 0.18604803085327148
Validation loss = 0.18188081681728363
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17911772429943085
Validation loss = 0.1781633347272873
Validation loss = 0.1803596466779709
Validation loss = 0.1813211292028427
Validation loss = 0.18079209327697754
Validation loss = 0.18187832832336426
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17907476425170898
Validation loss = 0.17842034995555878
Validation loss = 0.18277128040790558
Validation loss = 0.17883507907390594
Validation loss = 0.18088524043560028
Validation loss = 0.1805247813463211
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18058602511882782
Validation loss = 0.17904464900493622
Validation loss = 0.18034054338932037
Validation loss = 0.185267373919487
Validation loss = 0.18331186473369598
Validation loss = 0.18420292437076569
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -356     |
| Iteration     | 11       |
| MaximumReturn | -102     |
| MinimumReturn | -484     |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1837533414363861
Validation loss = 0.17957016825675964
Validation loss = 0.18111148476600647
Validation loss = 0.18233631551265717
Validation loss = 0.18073683977127075
Validation loss = 0.18207040429115295
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18340285122394562
Validation loss = 0.1783616691827774
Validation loss = 0.1810607612133026
Validation loss = 0.18368442356586456
Validation loss = 0.18037065863609314
Validation loss = 0.18334811925888062
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18111848831176758
Validation loss = 0.17743445932865143
Validation loss = 0.18164052069187164
Validation loss = 0.18649467825889587
Validation loss = 0.18200431764125824
Validation loss = 0.18041542172431946
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18268603086471558
Validation loss = 0.18258172273635864
Validation loss = 0.18072988092899323
Validation loss = 0.1847941130399704
Validation loss = 0.180181086063385
Validation loss = 0.18288518488407135
Validation loss = 0.1807313710451126
Validation loss = 0.18032263219356537
Validation loss = 0.18009532988071442
Validation loss = 0.1810757964849472
Validation loss = 0.18209876120090485
Validation loss = 0.18029369413852692
Validation loss = 0.18230241537094116
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18084467947483063
Validation loss = 0.1785968840122223
Validation loss = 0.1798197329044342
Validation loss = 0.181284099817276
Validation loss = 0.1798238754272461
Validation loss = 0.1811947524547577
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -17.4    |
| Iteration     | 12       |
| MaximumReturn | 564      |
| MinimumReturn | -271     |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18173936009407043
Validation loss = 0.18008151650428772
Validation loss = 0.17951717972755432
Validation loss = 0.18053437769412994
Validation loss = 0.18374252319335938
Validation loss = 0.1811499148607254
Validation loss = 0.1802201271057129
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18024004995822906
Validation loss = 0.1792057901620865
Validation loss = 0.1803896576166153
Validation loss = 0.183648481965065
Validation loss = 0.1802309900522232
Validation loss = 0.18091122806072235
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18205687403678894
Validation loss = 0.18343274295330048
Validation loss = 0.1791856437921524
Validation loss = 0.18018631637096405
Validation loss = 0.1799910068511963
Validation loss = 0.1828087568283081
Validation loss = 0.18077106773853302
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18125374615192413
Validation loss = 0.17856727540493011
Validation loss = 0.1812959909439087
Validation loss = 0.17963971197605133
Validation loss = 0.1808219850063324
Validation loss = 0.18140625953674316
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18064139783382416
Validation loss = 0.1789156049489975
Validation loss = 0.18379731476306915
Validation loss = 0.18035706877708435
Validation loss = 0.18232153356075287
Validation loss = 0.17956817150115967
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -178     |
| Iteration     | 13       |
| MaximumReturn | 77.2     |
| MinimumReturn | -330     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1795087605714798
Validation loss = 0.17674000561237335
Validation loss = 0.1787124127149582
Validation loss = 0.17878654599189758
Validation loss = 0.1815214455127716
Validation loss = 0.18084019422531128
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1841730922460556
Validation loss = 0.17805153131484985
Validation loss = 0.17860080301761627
Validation loss = 0.17803923785686493
Validation loss = 0.17903003096580505
Validation loss = 0.17908482253551483
Validation loss = 0.17832820117473602
Validation loss = 0.1779099404811859
Validation loss = 0.18227648735046387
Validation loss = 0.18221688270568848
Validation loss = 0.17807826399803162
Validation loss = 0.18104828894138336
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1806916892528534
Validation loss = 0.17909683287143707
Validation loss = 0.18172764778137207
Validation loss = 0.18011392652988434
Validation loss = 0.18079045414924622
Validation loss = 0.17851966619491577
Validation loss = 0.1815401166677475
Validation loss = 0.18308964371681213
Validation loss = 0.1803111881017685
Validation loss = 0.18222051858901978
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1815803498029709
Validation loss = 0.17655058205127716
Validation loss = 0.17942914366722107
Validation loss = 0.1780218631029129
Validation loss = 0.18070940673351288
Validation loss = 0.1800978034734726
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18028345704078674
Validation loss = 0.1782216876745224
Validation loss = 0.1791801154613495
Validation loss = 0.17992013692855835
Validation loss = 0.18049634993076324
Validation loss = 0.18040922284126282
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -335     |
| Iteration     | 14       |
| MaximumReturn | -91.4    |
| MinimumReturn | -514     |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17980501055717468
Validation loss = 0.17769210040569305
Validation loss = 0.1773410439491272
Validation loss = 0.17815501987934113
Validation loss = 0.17791739106178284
Validation loss = 0.17847692966461182
Validation loss = 0.1791214495897293
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.182148277759552
Validation loss = 0.1779031753540039
Validation loss = 0.17658081650733948
Validation loss = 0.17772793769836426
Validation loss = 0.17725922167301178
Validation loss = 0.17820967733860016
Validation loss = 0.17785191535949707
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18023329973220825
Validation loss = 0.17887946963310242
Validation loss = 0.17836041748523712
Validation loss = 0.18055644631385803
Validation loss = 0.1798323094844818
Validation loss = 0.17810335755348206
Validation loss = 0.1781911700963974
Validation loss = 0.17920836806297302
Validation loss = 0.17801925539970398
Validation loss = 0.17749565839767456
Validation loss = 0.18013983964920044
Validation loss = 0.17804083228111267
Validation loss = 0.17790868878364563
Validation loss = 0.17637330293655396
Validation loss = 0.1782771348953247
Validation loss = 0.17675244808197021
Validation loss = 0.17994028329849243
Validation loss = 0.17695516347885132
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18068534135818481
Validation loss = 0.17654046416282654
Validation loss = 0.18069210648536682
Validation loss = 0.17698153853416443
Validation loss = 0.18041712045669556
Validation loss = 0.178211510181427
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17902781069278717
Validation loss = 0.17891669273376465
Validation loss = 0.17917993664741516
Validation loss = 0.18075305223464966
Validation loss = 0.17937257885932922
Validation loss = 0.1799072027206421
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 460      |
| Iteration     | 15       |
| MaximumReturn | 1.05e+03 |
| MinimumReturn | -526     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17928561568260193
Validation loss = 0.17694289982318878
Validation loss = 0.17626459896564484
Validation loss = 0.17613577842712402
Validation loss = 0.1786249876022339
Validation loss = 0.17739635705947876
Validation loss = 0.17828480899333954
Validation loss = 0.17791971564292908
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17967955768108368
Validation loss = 0.17545725405216217
Validation loss = 0.1760251224040985
Validation loss = 0.18029457330703735
Validation loss = 0.17687702178955078
Validation loss = 0.17791545391082764
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17944450676441193
Validation loss = 0.17579558491706848
Validation loss = 0.17606297135353088
Validation loss = 0.17586827278137207
Validation loss = 0.1789843738079071
Validation loss = 0.17634893953800201
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17888440191745758
Validation loss = 0.17541086673736572
Validation loss = 0.17708271741867065
Validation loss = 0.17858226597309113
Validation loss = 0.18127188086509705
Validation loss = 0.17681953310966492
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18048956990242004
Validation loss = 0.17879915237426758
Validation loss = 0.17801211774349213
Validation loss = 0.17910322546958923
Validation loss = 0.1785222291946411
Validation loss = 0.1788134127855301
Validation loss = 0.17878621816635132
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -260     |
| Iteration     | 16       |
| MaximumReturn | -26.6    |
| MinimumReturn | -489     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17803272604942322
Validation loss = 0.17485836148262024
Validation loss = 0.17622818052768707
Validation loss = 0.1768915057182312
Validation loss = 0.1758553385734558
Validation loss = 0.17799167335033417
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1784047931432724
Validation loss = 0.1754494458436966
Validation loss = 0.17527051270008087
Validation loss = 0.1759854555130005
Validation loss = 0.17775161564350128
Validation loss = 0.17477236688137054
Validation loss = 0.17644792795181274
Validation loss = 0.17731225490570068
Validation loss = 0.17426349222660065
Validation loss = 0.176728755235672
Validation loss = 0.1748514175415039
Validation loss = 0.17772100865840912
Validation loss = 0.17441633343696594
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17806074023246765
Validation loss = 0.17668768763542175
Validation loss = 0.17701667547225952
Validation loss = 0.17491774260997772
Validation loss = 0.17476624250411987
Validation loss = 0.17516852915287018
Validation loss = 0.17750291526317596
Validation loss = 0.17840735614299774
Validation loss = 0.17565776407718658
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17608220875263214
Validation loss = 0.17597787082195282
Validation loss = 0.17681194841861725
Validation loss = 0.17583248019218445
Validation loss = 0.17618268728256226
Validation loss = 0.17472699284553528
Validation loss = 0.17542091012001038
Validation loss = 0.17694224417209625
Validation loss = 0.1752493977546692
Validation loss = 0.17573294043540955
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18064580857753754
Validation loss = 0.17614798247814178
Validation loss = 0.17674559354782104
Validation loss = 0.17686143517494202
Validation loss = 0.17917953431606293
Validation loss = 0.17573107779026031
Validation loss = 0.17852270603179932
Validation loss = 0.1781378984451294
Validation loss = 0.17585676908493042
Validation loss = 0.17709803581237793
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -55.8    |
| Iteration     | 17       |
| MaximumReturn | 294      |
| MinimumReturn | -366     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1812227964401245
Validation loss = 0.1728726178407669
Validation loss = 0.17452514171600342
Validation loss = 0.17762742936611176
Validation loss = 0.17417573928833008
Validation loss = 0.17365656793117523
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17787687480449677
Validation loss = 0.17214353382587433
Validation loss = 0.17441920936107635
Validation loss = 0.17383211851119995
Validation loss = 0.1748620867729187
Validation loss = 0.17619645595550537
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17769458889961243
Validation loss = 0.1752205342054367
Validation loss = 0.1728225201368332
Validation loss = 0.174269899725914
Validation loss = 0.17335014045238495
Validation loss = 0.173715740442276
Validation loss = 0.17321960628032684
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17719021439552307
Validation loss = 0.17368820309638977
Validation loss = 0.17316758632659912
Validation loss = 0.17342345416545868
Validation loss = 0.1746705323457718
Validation loss = 0.17443907260894775
Validation loss = 0.17505918443202972
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1783091127872467
Validation loss = 0.17424994707107544
Validation loss = 0.17551228404045105
Validation loss = 0.1743558794260025
Validation loss = 0.17454814910888672
Validation loss = 0.17461247742176056
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 21.7     |
| Iteration     | 18       |
| MaximumReturn | 668      |
| MinimumReturn | -395     |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1755186766386032
Validation loss = 0.17364700138568878
Validation loss = 0.1731901466846466
Validation loss = 0.17228999733924866
Validation loss = 0.1748092919588089
Validation loss = 0.17347343266010284
Validation loss = 0.17258647084236145
Validation loss = 0.17424628138542175
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17590048909187317
Validation loss = 0.17176659405231476
Validation loss = 0.17348134517669678
Validation loss = 0.17249658703804016
Validation loss = 0.17194412648677826
Validation loss = 0.172832652926445
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17820163071155548
Validation loss = 0.17371883988380432
Validation loss = 0.17320950329303741
Validation loss = 0.17479553818702698
Validation loss = 0.1728994995355606
Validation loss = 0.17349417507648468
Validation loss = 0.17207810282707214
Validation loss = 0.17302148044109344
Validation loss = 0.17202280461788177
Validation loss = 0.17245618999004364
Validation loss = 0.17267802357673645
Validation loss = 0.1713414490222931
Validation loss = 0.17151708900928497
Validation loss = 0.1728677898645401
Validation loss = 0.17270427942276
Validation loss = 0.17252571880817413
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17388221621513367
Validation loss = 0.17178699374198914
Validation loss = 0.1740165799856186
Validation loss = 0.17312562465667725
Validation loss = 0.1730469912290573
Validation loss = 0.17377297580242157
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17825500667095184
Validation loss = 0.1727152317762375
Validation loss = 0.1741560697555542
Validation loss = 0.1745717078447342
Validation loss = 0.17461180686950684
Validation loss = 0.17402413487434387
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 41.6     |
| Iteration     | 19       |
| MaximumReturn | 459      |
| MinimumReturn | -280     |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17454306781291962
Validation loss = 0.1712648570537567
Validation loss = 0.17263095080852509
Validation loss = 0.17194640636444092
Validation loss = 0.17111952602863312
Validation loss = 0.1728934943675995
Validation loss = 0.17158129811286926
Validation loss = 0.17226728796958923
Validation loss = 0.17141717672348022
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17416436970233917
Validation loss = 0.17058499157428741
Validation loss = 0.17084376513957977
Validation loss = 0.17114567756652832
Validation loss = 0.1721896380186081
Validation loss = 0.17264020442962646
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17250528931617737
Validation loss = 0.1709812432527542
Validation loss = 0.17052127420902252
Validation loss = 0.17236390709877014
Validation loss = 0.17098312079906464
Validation loss = 0.1717236042022705
Validation loss = 0.17056307196617126
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17538902163505554
Validation loss = 0.17164988815784454
Validation loss = 0.1728057861328125
Validation loss = 0.17201881110668182
Validation loss = 0.17125113308429718
Validation loss = 0.17236843705177307
Validation loss = 0.17216703295707703
Validation loss = 0.17255879938602448
Validation loss = 0.17232906818389893
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17680683732032776
Validation loss = 0.17213046550750732
Validation loss = 0.17311044037342072
Validation loss = 0.17280566692352295
Validation loss = 0.17271515727043152
Validation loss = 0.17285265028476715
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 100      |
| Iteration     | 20       |
| MaximumReturn | 635      |
| MinimumReturn | -306     |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17479881644248962
Validation loss = 0.16991104185581207
Validation loss = 0.1707843840122223
Validation loss = 0.17174400389194489
Validation loss = 0.17142759263515472
Validation loss = 0.17079663276672363
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17300423979759216
Validation loss = 0.16904090344905853
Validation loss = 0.17005902528762817
Validation loss = 0.1718863993883133
Validation loss = 0.16998223960399628
Validation loss = 0.17002640664577484
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17205367982387543
Validation loss = 0.16889439523220062
Validation loss = 0.16901034116744995
Validation loss = 0.17124591767787933
Validation loss = 0.16969352960586548
Validation loss = 0.17133110761642456
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17430001497268677
Validation loss = 0.17155176401138306
Validation loss = 0.17126309871673584
Validation loss = 0.17309652268886566
Validation loss = 0.1721595674753189
Validation loss = 0.17345759272575378
Validation loss = 0.17147476971149445
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17477494478225708
Validation loss = 0.1719248741865158
Validation loss = 0.17214371263980865
Validation loss = 0.1728505939245224
Validation loss = 0.17236702144145966
Validation loss = 0.1723138689994812
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -273     |
| Iteration     | 21       |
| MaximumReturn | 53.4     |
| MinimumReturn | -444     |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17270436882972717
Validation loss = 0.16982625424861908
Validation loss = 0.1717381626367569
Validation loss = 0.17073054611682892
Validation loss = 0.17043843865394592
Validation loss = 0.17086827754974365
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1725197285413742
Validation loss = 0.16913838684558868
Validation loss = 0.16865533590316772
Validation loss = 0.16993208229541779
Validation loss = 0.1693408340215683
Validation loss = 0.16989664733409882
Validation loss = 0.16939225792884827
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17177537083625793
Validation loss = 0.16830943524837494
Validation loss = 0.16952960193157196
Validation loss = 0.17142944037914276
Validation loss = 0.16851180791854858
Validation loss = 0.1694575548171997
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17430351674556732
Validation loss = 0.17032507061958313
Validation loss = 0.17142094671726227
Validation loss = 0.1707114726305008
Validation loss = 0.1712397336959839
Validation loss = 0.17092084884643555
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17430804669857025
Validation loss = 0.17120404541492462
Validation loss = 0.17275236546993256
Validation loss = 0.17058540880680084
Validation loss = 0.17289407551288605
Validation loss = 0.1710958480834961
Validation loss = 0.1714482456445694
Validation loss = 0.17176924645900726
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -304     |
| Iteration     | 22       |
| MaximumReturn | 114      |
| MinimumReturn | -547     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.171625018119812
Validation loss = 0.16903024911880493
Validation loss = 0.16937176883220673
Validation loss = 0.17108875513076782
Validation loss = 0.17024336755275726
Validation loss = 0.16958940029144287
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17163199186325073
Validation loss = 0.1690949946641922
Validation loss = 0.17026309669017792
Validation loss = 0.16792668402194977
Validation loss = 0.168664813041687
Validation loss = 0.16934877634048462
Validation loss = 0.16832071542739868
Validation loss = 0.16905970871448517
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1719266176223755
Validation loss = 0.16810660064220428
Validation loss = 0.16906766593456268
Validation loss = 0.17042003571987152
Validation loss = 0.16896818578243256
Validation loss = 0.168477401137352
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17071624100208282
Validation loss = 0.17004673182964325
Validation loss = 0.16979950666427612
Validation loss = 0.1709435135126114
Validation loss = 0.17113667726516724
Validation loss = 0.17131955921649933
Validation loss = 0.1700807809829712
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17638200521469116
Validation loss = 0.16971488296985626
Validation loss = 0.17131024599075317
Validation loss = 0.17123694717884064
Validation loss = 0.17220042645931244
Validation loss = 0.17205245792865753
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -210     |
| Iteration     | 23       |
| MaximumReturn | 424      |
| MinimumReturn | -472     |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17060667276382446
Validation loss = 0.16844682395458221
Validation loss = 0.16916555166244507
Validation loss = 0.16921977698802948
Validation loss = 0.16984528303146362
Validation loss = 0.1693006455898285
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16945916414260864
Validation loss = 0.16760367155075073
Validation loss = 0.16755437850952148
Validation loss = 0.16648876667022705
Validation loss = 0.16787990927696228
Validation loss = 0.16770388185977936
Validation loss = 0.1675059199333191
Validation loss = 0.16796046495437622
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1707547903060913
Validation loss = 0.16789193451404572
Validation loss = 0.1693010777235031
Validation loss = 0.16822433471679688
Validation loss = 0.16909223794937134
Validation loss = 0.16881737112998962
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1736675500869751
Validation loss = 0.16877694427967072
Validation loss = 0.1710977703332901
Validation loss = 0.1698138564825058
Validation loss = 0.16905255615711212
Validation loss = 0.16920819878578186
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17281986773014069
Validation loss = 0.17003794014453888
Validation loss = 0.17009219527244568
Validation loss = 0.17046499252319336
Validation loss = 0.17029082775115967
Validation loss = 0.17130577564239502
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -432     |
| Iteration     | 24       |
| MaximumReturn | -10      |
| MinimumReturn | -579     |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17143385112285614
Validation loss = 0.16844652593135834
Validation loss = 0.17007851600646973
Validation loss = 0.16865165531635284
Validation loss = 0.16920100152492523
Validation loss = 0.1690087467432022
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16862604022026062
Validation loss = 0.16693204641342163
Validation loss = 0.16629385948181152
Validation loss = 0.16745296120643616
Validation loss = 0.16873976588249207
Validation loss = 0.16754139959812164
Validation loss = 0.16708344221115112
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16983360052108765
Validation loss = 0.16726154088974
Validation loss = 0.1692519187927246
Validation loss = 0.16708454489707947
Validation loss = 0.16783472895622253
Validation loss = 0.16770049929618835
Validation loss = 0.16685113310813904
Validation loss = 0.16798655688762665
Validation loss = 0.1686570942401886
Validation loss = 0.16705836355686188
Validation loss = 0.1683761477470398
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17280179262161255
Validation loss = 0.16831587255001068
Validation loss = 0.16898681223392487
Validation loss = 0.17016921937465668
Validation loss = 0.16844044625759125
Validation loss = 0.16993196308612823
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17382976412773132
Validation loss = 0.1690906435251236
Validation loss = 0.16922639310359955
Validation loss = 0.17059046030044556
Validation loss = 0.1692608743906021
Validation loss = 0.16983549296855927
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -297     |
| Iteration     | 25       |
| MaximumReturn | 46.8     |
| MinimumReturn | -508     |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16943208873271942
Validation loss = 0.16729898750782013
Validation loss = 0.1687299907207489
Validation loss = 0.16881822049617767
Validation loss = 0.1687307506799698
Validation loss = 0.16833488643169403
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1683562994003296
Validation loss = 0.1662416309118271
Validation loss = 0.16761691868305206
Validation loss = 0.16683045029640198
Validation loss = 0.16688013076782227
Validation loss = 0.167913019657135
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1685655415058136
Validation loss = 0.16602084040641785
Validation loss = 0.1671413630247116
Validation loss = 0.1681971251964569
Validation loss = 0.16749636828899384
Validation loss = 0.1671707034111023
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17015627026557922
Validation loss = 0.1681324690580368
Validation loss = 0.16876237094402313
Validation loss = 0.16903704404830933
Validation loss = 0.16924002766609192
Validation loss = 0.16857366263866425
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17324961721897125
Validation loss = 0.1684264987707138
Validation loss = 0.1693369597196579
Validation loss = 0.17012378573417664
Validation loss = 0.17009712755680084
Validation loss = 0.17073029279708862
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 391      |
| Iteration     | 26       |
| MaximumReturn | 922      |
| MinimumReturn | -432     |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1709080934524536
Validation loss = 0.167673259973526
Validation loss = 0.16775356233119965
Validation loss = 0.16783370077610016
Validation loss = 0.16870297491550446
Validation loss = 0.1676618605852127
Validation loss = 0.1691398173570633
Validation loss = 0.1675526201725006
Validation loss = 0.16709133982658386
Validation loss = 0.1674700528383255
Validation loss = 0.16776660084724426
Validation loss = 0.16729113459587097
Validation loss = 0.16813281178474426
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16905543208122253
Validation loss = 0.16595053672790527
Validation loss = 0.16702283918857574
Validation loss = 0.1671372354030609
Validation loss = 0.16953130066394806
Validation loss = 0.16667795181274414
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16924233734607697
Validation loss = 0.1668795496225357
Validation loss = 0.16676568984985352
Validation loss = 0.16753850877285004
Validation loss = 0.16726423799991608
Validation loss = 0.16797246038913727
Validation loss = 0.16594718396663666
Validation loss = 0.16762201488018036
Validation loss = 0.16642150282859802
Validation loss = 0.16658441722393036
Validation loss = 0.16644585132598877
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1704169511795044
Validation loss = 0.16813591122627258
Validation loss = 0.1679261028766632
Validation loss = 0.16854505240917206
Validation loss = 0.16838571429252625
Validation loss = 0.16781504452228546
Validation loss = 0.16801902651786804
Validation loss = 0.16803035140037537
Validation loss = 0.16700956225395203
Validation loss = 0.16823649406433105
Validation loss = 0.16750453412532806
Validation loss = 0.1682358980178833
Validation loss = 0.16760630905628204
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1702723354101181
Validation loss = 0.16799834370613098
Validation loss = 0.16835597157478333
Validation loss = 0.1682058721780777
Validation loss = 0.1688172072172165
Validation loss = 0.16915641725063324
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 596      |
| Iteration     | 27       |
| MaximumReturn | 1.21e+03 |
| MinimumReturn | -519     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17003561556339264
Validation loss = 0.16603220999240875
Validation loss = 0.1665293574333191
Validation loss = 0.1677802950143814
Validation loss = 0.1675211787223816
Validation loss = 0.1678941696882248
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16923905909061432
Validation loss = 0.1651826947927475
Validation loss = 0.16597244143486023
Validation loss = 0.16651253402233124
Validation loss = 0.16591420769691467
Validation loss = 0.1679389327764511
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16860798001289368
Validation loss = 0.1660364419221878
Validation loss = 0.16651591658592224
Validation loss = 0.16661721467971802
Validation loss = 0.16589640080928802
Validation loss = 0.16684769093990326
Validation loss = 0.16717295348644257
Validation loss = 0.1668989211320877
Validation loss = 0.1670297235250473
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16963638365268707
Validation loss = 0.16477584838867188
Validation loss = 0.16700318455696106
Validation loss = 0.16751061379909515
Validation loss = 0.1665215790271759
Validation loss = 0.16687831282615662
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17149777710437775
Validation loss = 0.16669690608978271
Validation loss = 0.16855357587337494
Validation loss = 0.16822823882102966
Validation loss = 0.16937583684921265
Validation loss = 0.16900183260440826
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -113     |
| Iteration     | 28       |
| MaximumReturn | 222      |
| MinimumReturn | -456     |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1684001386165619
Validation loss = 0.16683770716190338
Validation loss = 0.16594596207141876
Validation loss = 0.16757424175739288
Validation loss = 0.16650310158729553
Validation loss = 0.1679953932762146
Validation loss = 0.16751956939697266
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16838288307189941
Validation loss = 0.16617193818092346
Validation loss = 0.16761310398578644
Validation loss = 0.1656470149755478
Validation loss = 0.1667873114347458
Validation loss = 0.16708824038505554
Validation loss = 0.1668141633272171
Validation loss = 0.1665046513080597
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16974996030330658
Validation loss = 0.16590937972068787
Validation loss = 0.16622045636177063
Validation loss = 0.16686801612377167
Validation loss = 0.16664128005504608
Validation loss = 0.1674797236919403
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16910073161125183
Validation loss = 0.16564637422561646
Validation loss = 0.16828684508800507
Validation loss = 0.16705311834812164
Validation loss = 0.167131245136261
Validation loss = 0.16738560795783997
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1692010462284088
Validation loss = 0.16680021584033966
Validation loss = 0.16836120188236237
Validation loss = 0.1674882471561432
Validation loss = 0.16849656403064728
Validation loss = 0.1681959629058838
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -224     |
| Iteration     | 29       |
| MaximumReturn | 181      |
| MinimumReturn | -460     |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16910286247730255
Validation loss = 0.16567493975162506
Validation loss = 0.16659726202487946
Validation loss = 0.1681768298149109
Validation loss = 0.16629040241241455
Validation loss = 0.16610953211784363
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16751590371131897
Validation loss = 0.16404704749584198
Validation loss = 0.16621801257133484
Validation loss = 0.16636185348033905
Validation loss = 0.16518478095531464
Validation loss = 0.16703428328037262
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1687784641981125
Validation loss = 0.16609002649784088
Validation loss = 0.16566866636276245
Validation loss = 0.165052592754364
Validation loss = 0.1653904914855957
Validation loss = 0.16542240977287292
Validation loss = 0.1664460003376007
Validation loss = 0.16625374555587769
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16900227963924408
Validation loss = 0.16547991335391998
Validation loss = 0.1665296107530594
Validation loss = 0.16727764904499054
Validation loss = 0.16733016073703766
Validation loss = 0.16629087924957275
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16905559599399567
Validation loss = 0.16561344265937805
Validation loss = 0.16679389774799347
Validation loss = 0.16667422652244568
Validation loss = 0.1679679900407791
Validation loss = 0.16751794517040253
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 429      |
| Iteration     | 30       |
| MaximumReturn | 1.17e+03 |
| MinimumReturn | -114     |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.168612539768219
Validation loss = 0.16522850096225739
Validation loss = 0.16754916310310364
Validation loss = 0.16739967465400696
Validation loss = 0.16681742668151855
Validation loss = 0.16558969020843506
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16639026999473572
Validation loss = 0.16509106755256653
Validation loss = 0.16563302278518677
Validation loss = 0.16601523756980896
Validation loss = 0.16667374968528748
Validation loss = 0.16540342569351196
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16751717031002045
Validation loss = 0.16412989795207977
Validation loss = 0.16572988033294678
Validation loss = 0.16638699173927307
Validation loss = 0.16512933373451233
Validation loss = 0.16494685411453247
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16934359073638916
Validation loss = 0.1649058312177658
Validation loss = 0.16758927702903748
Validation loss = 0.16585330665111542
Validation loss = 0.1666869819164276
Validation loss = 0.1657596081495285
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1690434217453003
Validation loss = 0.1650213599205017
Validation loss = 0.1662958711385727
Validation loss = 0.16676825284957886
Validation loss = 0.16705651581287384
Validation loss = 0.1667424440383911
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -91.4    |
| Iteration     | 31       |
| MaximumReturn | 347      |
| MinimumReturn | -513     |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16709735989570618
Validation loss = 0.16565006971359253
Validation loss = 0.1661984771490097
Validation loss = 0.16669689118862152
Validation loss = 0.16669964790344238
Validation loss = 0.16577160358428955
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16763314604759216
Validation loss = 0.1652781069278717
Validation loss = 0.16542091965675354
Validation loss = 0.16539302468299866
Validation loss = 0.16509807109832764
Validation loss = 0.1655152142047882
Validation loss = 0.16515851020812988
Validation loss = 0.16502074897289276
Validation loss = 0.16562925279140472
Validation loss = 0.16532140970230103
Validation loss = 0.165615975856781
Validation loss = 0.16468869149684906
Validation loss = 0.1644476056098938
Validation loss = 0.1642901748418808
Validation loss = 0.16425031423568726
Validation loss = 0.16526710987091064
Validation loss = 0.16458532214164734
Validation loss = 0.1654846966266632
Validation loss = 0.16485825181007385
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16675056517124176
Validation loss = 0.16389848291873932
Validation loss = 0.165206640958786
Validation loss = 0.16456085443496704
Validation loss = 0.1647520214319229
Validation loss = 0.16487915813922882
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16852426528930664
Validation loss = 0.16493922472000122
Validation loss = 0.16769592463970184
Validation loss = 0.165921151638031
Validation loss = 0.16566652059555054
Validation loss = 0.1656707227230072
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16773958504199982
Validation loss = 0.16560541093349457
Validation loss = 0.1662236899137497
Validation loss = 0.16719986498355865
Validation loss = 0.1663181483745575
Validation loss = 0.16705022752285004
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -55.3    |
| Iteration     | 32       |
| MaximumReturn | 609      |
| MinimumReturn | -503     |
| TotalSamples  | 136000   |
----------------------------
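
A minimal, self-contained sketch for extracting the per-iteration summary tables above (AverageReturn, Iteration, MaximumReturn, MinimumReturn, TotalSamples) into CSV for plotting. This script is not part of the experiment code; it only assumes the exact row layout shown in this log, e.g. "| AverageReturn | -55.3    |".

import csv
import re
import sys

# Matches one summary-table row, e.g. "| AverageReturn | -260     |"
ROW = re.compile(r"^\|\s*(\w+)\s*\|\s*([-+0-9.eE]+)\s*\|$")

def parse_summaries(path):
    """Return one dict per summary table (i.e. one per outer iteration)."""
    records, current = [], {}
    with open(path) as f:
        for line in f:
            m = ROW.match(line.strip())
            if m:
                current[m.group(1)] = float(m.group(2))
            elif current:
                # Any non-row line (e.g. the dashed border) closes the table.
                records.append(current)
                current = {}
    if current:
        records.append(current)
    return records

if __name__ == "__main__":
    records = parse_summaries(sys.argv[1])
    if records:
        writer = csv.DictWriter(sys.stdout, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)

Run as "python parse_summaries.py <logfile> > returns.csv"; plotting AverageReturn against Iteration from the resulting CSV gives the learning curve for this run.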
