Logging to experiments/hopper/hopperO01/Tue-01-Nov-2022-09-35-15-AM-CDT_hopper_trpo_iteration_20_seed2431
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.8836572170257568
Validation loss = 0.6817021369934082
Validation loss = 0.6630652546882629
Validation loss = 0.6928933262825012
Validation loss = 0.6932342052459717
Validation loss = 0.7218468189239502
Validation loss = 0.719406247138977
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9079097509384155
Validation loss = 0.6809521913528442
Validation loss = 0.6623558402061462
Validation loss = 0.6734427213668823
Validation loss = 0.7053118944168091
Validation loss = 0.7130551338195801
Validation loss = 0.7418016195297241
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7586491107940674
Validation loss = 0.6758354306221008
Validation loss = 0.6659390926361084
Validation loss = 0.6822822093963623
Validation loss = 0.7034627795219421
Validation loss = 0.7329761981964111
Validation loss = 0.7571285963058472
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8613938093185425
Validation loss = 0.6874008178710938
Validation loss = 0.6702551245689392
Validation loss = 0.6839975118637085
Validation loss = 0.7079488635063171
Validation loss = 0.7213515639305115
Validation loss = 0.7350751161575317
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8606401085853577
Validation loss = 0.6818424463272095
Validation loss = 0.6636792421340942
Validation loss = 0.6717399954795837
Validation loss = 0.6955710649490356
Validation loss = 0.7347455620765686
Validation loss = 0.7413586378097534
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.09e+03 |
| Iteration     | 0         |
| MaximumReturn | -723      |
| MinimumReturn | -2.98e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6869140863418579
Validation loss = 0.6952054500579834
Validation loss = 0.7112222909927368
Validation loss = 0.7198944091796875
Validation loss = 0.7443913221359253
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.683715283870697
Validation loss = 0.6791084408760071
Validation loss = 0.6978483200073242
Validation loss = 0.7145708203315735
Validation loss = 0.7576017379760742
Validation loss = 0.7738388776779175
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6911221742630005
Validation loss = 0.6837894916534424
Validation loss = 0.7182586789131165
Validation loss = 0.7319785356521606
Validation loss = 0.755579948425293
Validation loss = 0.7948522567749023
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6840845942497253
Validation loss = 0.6771762371063232
Validation loss = 0.6926544904708862
Validation loss = 0.704283595085144
Validation loss = 0.7497100830078125
Validation loss = 0.7618309855461121
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6932579874992371
Validation loss = 0.6758966445922852
Validation loss = 0.7005788683891296
Validation loss = 0.7221671342849731
Validation loss = 0.7342665791511536
Validation loss = 0.7663290500640869
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.18e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.7e+03  |
| MinimumReturn | -2.73e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6970512866973877
Validation loss = 0.7146158218383789
Validation loss = 0.7448734641075134
Validation loss = 0.7730348706245422
Validation loss = 0.7858921885490417
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7033219933509827
Validation loss = 0.7312593460083008
Validation loss = 0.7595749497413635
Validation loss = 0.7951211333274841
Validation loss = 0.7966728806495667
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.707766056060791
Validation loss = 0.7500920295715332
Validation loss = 0.7587714195251465
Validation loss = 0.8106595873832703
Validation loss = 0.810642659664154
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7062592506408691
Validation loss = 0.7299652099609375
Validation loss = 0.7898778915405273
Validation loss = 0.7876651287078857
Validation loss = 0.8247433304786682
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.703610897064209
Validation loss = 0.7455920577049255
Validation loss = 0.7805728316307068
Validation loss = 0.7995738387107849
Validation loss = 0.8188300132751465
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.53e+03 |
| Iteration     | 2         |
| MaximumReturn | -1.88e+03 |
| MinimumReturn | -3.03e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7585826516151428
Validation loss = 0.7817642688751221
Validation loss = 0.801287055015564
Validation loss = 0.8165144324302673
Validation loss = 0.8343716263771057
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7565723061561584
Validation loss = 0.7874014377593994
Validation loss = 0.8013631105422974
Validation loss = 0.8241102695465088
Validation loss = 0.8320967555046082
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7519512176513672
Validation loss = 0.7980388402938843
Validation loss = 0.8081634044647217
Validation loss = 0.8271602988243103
Validation loss = 0.8557249307632446
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7701970934867859
Validation loss = 0.7822408080101013
Validation loss = 0.7981947064399719
Validation loss = 0.8177287578582764
Validation loss = 0.8460870981216431
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7607205510139465
Validation loss = 0.8123931288719177
Validation loss = 0.7927201986312866
Validation loss = 0.8358546495437622
Validation loss = 0.8414292335510254
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -3.06e+03 |
| Iteration     | 3         |
| MaximumReturn | -2.99e+03 |
| MinimumReturn | -3.12e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7427843809127808
Validation loss = 0.793855607509613
Validation loss = 0.8023451566696167
Validation loss = 0.8255870938301086
Validation loss = 0.8434944152832031
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7300289869308472
Validation loss = 0.7890685796737671
Validation loss = 0.8090906143188477
Validation loss = 0.8195698857307434
Validation loss = 0.8370224237442017
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7219346761703491
Validation loss = 0.8044863939285278
Validation loss = 0.8160398602485657
Validation loss = 0.8329303860664368
Validation loss = 0.8297408819198608
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7344030737876892
Validation loss = 0.7878691554069519
Validation loss = 0.8067458271980286
Validation loss = 0.8188574910163879
Validation loss = 0.8180215954780579
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7482901811599731
Validation loss = 0.8006499409675598
Validation loss = 0.8058080673217773
Validation loss = 0.8227859735488892
Validation loss = 0.8313989639282227
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.59e+03 |
| Iteration     | 4         |
| MaximumReturn | -1.31e+03 |
| MinimumReturn | -3.06e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7521542906761169
Validation loss = 0.7813341617584229
Validation loss = 0.7891770005226135
Validation loss = 0.8001000285148621
Validation loss = 0.8078816533088684
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7506606578826904
Validation loss = 0.7793523669242859
Validation loss = 0.7807977795600891
Validation loss = 0.7937412858009338
Validation loss = 0.7996869683265686
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7516810894012451
Validation loss = 0.7858392596244812
Validation loss = 0.7903618216514587
Validation loss = 0.7993481159210205
Validation loss = 0.803581953048706
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7386929988861084
Validation loss = 0.7671058177947998
Validation loss = 0.7786087989807129
Validation loss = 0.7860276699066162
Validation loss = 0.7915482521057129
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.763650119304657
Validation loss = 0.7755851149559021
Validation loss = 0.7928091883659363
Validation loss = 0.7992570400238037
Validation loss = 0.8117443919181824
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.39e+03 |
| Iteration     | 5         |
| MaximumReturn | -2.13e+03 |
| MinimumReturn | -2.91e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7593625783920288
Validation loss = 0.7608293890953064
Validation loss = 0.7682644724845886
Validation loss = 0.7730454206466675
Validation loss = 0.7886806130409241
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7497779726982117
Validation loss = 0.7638159394264221
Validation loss = 0.7658806443214417
Validation loss = 0.7818877100944519
Validation loss = 0.7825738191604614
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7641304135322571
Validation loss = 0.7618446350097656
Validation loss = 0.7688603401184082
Validation loss = 0.7824219465255737
Validation loss = 0.7948461174964905
Validation loss = 0.7827399373054504
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7492944002151489
Validation loss = 0.7626823782920837
Validation loss = 0.7588030099868774
Validation loss = 0.783112108707428
Validation loss = 0.7723283171653748
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.752679705619812
Validation loss = 0.7702497243881226
Validation loss = 0.768075168132782
Validation loss = 0.7750009298324585
Validation loss = 0.7799170613288879
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.25e+03 |
| Iteration     | 6         |
| MaximumReturn | -1.34e+03 |
| MinimumReturn | -2.98e+03 |
| TotalSamples  | 32000     |
-----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7061522006988525
Validation loss = 0.7101114988327026
Validation loss = 0.7240172028541565
Validation loss = 0.7278017997741699
Validation loss = 0.7426536679267883
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6969850659370422
Validation loss = 0.7213170528411865
Validation loss = 0.7140095233917236
Validation loss = 0.7281148433685303
Validation loss = 0.7454727292060852
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7010799646377563
Validation loss = 0.7226935625076294
Validation loss = 0.7125890254974365
Validation loss = 0.7347350120544434
Validation loss = 0.7424073815345764
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.687904417514801
Validation loss = 0.7102857828140259
Validation loss = 0.7104750871658325
Validation loss = 0.7261281609535217
Validation loss = 0.7252252101898193
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6981150507926941
Validation loss = 0.7075474262237549
Validation loss = 0.7261151075363159
Validation loss = 0.7264614105224609
Validation loss = 0.7418422102928162
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.58e+03 |
| Iteration     | 7         |
| MaximumReturn | -1.64e+03 |
| MinimumReturn | -2.93e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.9735110402107239
Validation loss = 1.1014171838760376
Validation loss = 1.1447405815124512
Validation loss = 1.1236001253128052
Validation loss = 1.2243982553482056
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9528977274894714
Validation loss = 1.0090062618255615
Validation loss = 1.032078742980957
Validation loss = 1.0568994283676147
Validation loss = 1.09633207321167
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.9935760498046875
Validation loss = 1.0735409259796143
Validation loss = 1.1093120574951172
Validation loss = 1.1203218698501587
Validation loss = 1.1595783233642578
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.9562793374061584
Validation loss = 1.003969669342041
Validation loss = 1.0445353984832764
Validation loss = 1.0676523447036743
Validation loss = 1.1032700538635254
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.9902464747428894
Validation loss = 1.0397849082946777
Validation loss = 1.0727639198303223
Validation loss = 1.0935267210006714
Validation loss = 1.136788249015808
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.08e+03 |
| Iteration     | 8         |
| MaximumReturn | -739      |
| MinimumReturn | -2.43e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7082511782646179
Validation loss = 0.7467634677886963
Validation loss = 0.7585643529891968
Validation loss = 0.7775542736053467
Validation loss = 0.8020251393318176
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6959758400917053
Validation loss = 0.7330397367477417
Validation loss = 0.7539186477661133
Validation loss = 0.7553590536117554
Validation loss = 0.7684633135795593
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7052620649337769
Validation loss = 0.7403963208198547
Validation loss = 0.7686538696289062
Validation loss = 0.7751052379608154
Validation loss = 0.784591794013977
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7033625841140747
Validation loss = 0.7378708720207214
Validation loss = 0.7627792358398438
Validation loss = 0.7798272967338562
Validation loss = 0.7855392694473267
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7023252248764038
Validation loss = 0.7334483861923218
Validation loss = 0.7523713111877441
Validation loss = 0.7702232599258423
Validation loss = 0.7775850296020508
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.33e+03 |
| Iteration     | 9         |
| MaximumReturn | -2.14e+03 |
| MinimumReturn | -2.52e+03 |
| TotalSamples  | 44000     |
-----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7502312064170837
Validation loss = 0.7684190273284912
Validation loss = 0.7892115116119385
Validation loss = 0.806032121181488
Validation loss = 0.7926256060600281
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7319097518920898
Validation loss = 0.7518298625946045
Validation loss = 0.7659124135971069
Validation loss = 0.7772737741470337
Validation loss = 0.7818353772163391
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7519323229789734
Validation loss = 0.7793219089508057
Validation loss = 0.7895664572715759
Validation loss = 0.8069570064544678
Validation loss = 0.8147003054618835
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7473636865615845
Validation loss = 0.7728244662284851
Validation loss = 0.7883762121200562
Validation loss = 0.7903723120689392
Validation loss = 0.7965236902236938
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7367393970489502
Validation loss = 0.7674249410629272
Validation loss = 0.772991418838501
Validation loss = 0.7866694927215576
Validation loss = 0.7985069155693054
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.03e+03 |
| Iteration     | 10        |
| MaximumReturn | -1.3e+03  |
| MinimumReturn | -2.47e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7828006744384766
Validation loss = 0.7983075976371765
Validation loss = 0.8163847923278809
Validation loss = 0.8175883889198303
Validation loss = 0.8247029185295105
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7690026164054871
Validation loss = 0.7842782139778137
Validation loss = 0.7847158908843994
Validation loss = 0.7977540493011475
Validation loss = 0.8080589175224304
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7864626049995422
Validation loss = 0.7893552780151367
Validation loss = 0.8046396374702454
Validation loss = 0.8140941262245178
Validation loss = 0.8172733187675476
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.76982182264328
Validation loss = 0.7944979667663574
Validation loss = 0.794708251953125
Validation loss = 0.8140832781791687
Validation loss = 0.8193824887275696
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7675137519836426
Validation loss = 0.7883744239807129
Validation loss = 0.8030581474304199
Validation loss = 0.8120391964912415
Validation loss = 0.8084747195243835
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.43e+03 |
| Iteration     | 11        |
| MaximumReturn | -1.99e+03 |
| MinimumReturn | -2.64e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.807587742805481
Validation loss = 0.7938191890716553
Validation loss = 0.8132259845733643
Validation loss = 0.8255764842033386
Validation loss = 0.8229361772537231
Validation loss = 0.834509015083313
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7903315424919128
Validation loss = 0.7786601185798645
Validation loss = 0.791448175907135
Validation loss = 0.8004075288772583
Validation loss = 0.7969850301742554
Validation loss = 0.8129240870475769
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8044418692588806
Validation loss = 0.800446629524231
Validation loss = 0.8043003678321838
Validation loss = 0.8272238969802856
Validation loss = 0.8148318529129028
Validation loss = 0.8293526768684387
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7882339954376221
Validation loss = 0.8012142777442932
Validation loss = 0.8003963232040405
Validation loss = 0.8027026057243347
Validation loss = 0.8244372606277466
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7919496297836304
Validation loss = 0.7893310189247131
Validation loss = 0.7998218536376953
Validation loss = 0.801239013671875
Validation loss = 0.8111728429794312
Validation loss = 0.8119288682937622
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.71e+03 |
| Iteration     | 12        |
| MaximumReturn | -1.47e+03 |
| MinimumReturn | -2.34e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.782914936542511
Validation loss = 0.7826175689697266
Validation loss = 0.7811905741691589
Validation loss = 0.7882389426231384
Validation loss = 0.7938594818115234
Validation loss = 0.7869216203689575
Validation loss = 0.795781135559082
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7623305320739746
Validation loss = 0.7634760737419128
Validation loss = 0.7672147750854492
Validation loss = 0.7796404957771301
Validation loss = 0.7840590476989746
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7657910585403442
Validation loss = 0.7747550010681152
Validation loss = 0.7807310819625854
Validation loss = 0.7794588208198547
Validation loss = 0.7891029119491577
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7612472772598267
Validation loss = 0.7650600671768188
Validation loss = 0.7684939503669739
Validation loss = 0.7824587821960449
Validation loss = 0.7857867479324341
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7686063647270203
Validation loss = 0.7693161964416504
Validation loss = 0.7760566473007202
Validation loss = 0.7804818749427795
Validation loss = 0.7844918966293335
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.53e+03 |
| Iteration     | 13        |
| MaximumReturn | -1.26e+03 |
| MinimumReturn | -1.74e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7784465551376343
Validation loss = 0.7795961499214172
Validation loss = 0.7859782576560974
Validation loss = 0.7928925156593323
Validation loss = 0.7949455976486206
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7633118033409119
Validation loss = 0.7844924926757812
Validation loss = 0.774912416934967
Validation loss = 0.7793652415275574
Validation loss = 0.7871099710464478
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7647877335548401
Validation loss = 0.7706019282341003
Validation loss = 0.7819913029670715
Validation loss = 0.7849571704864502
Validation loss = 0.7953553795814514
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7592923641204834
Validation loss = 0.7694094777107239
Validation loss = 0.7785805463790894
Validation loss = 0.7859888076782227
Validation loss = 0.7923814654350281
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7740002870559692
Validation loss = 0.7768176794052124
Validation loss = 0.7844511866569519
Validation loss = 0.7960203289985657
Validation loss = 0.7913256883621216
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.21e+03 |
| Iteration     | 14        |
| MaximumReturn | -856      |
| MinimumReturn | -1.43e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7604942321777344
Validation loss = 0.7722011208534241
Validation loss = 0.7686869502067566
Validation loss = 0.773773193359375
Validation loss = 0.7792309522628784
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7507396936416626
Validation loss = 0.755947470664978
Validation loss = 0.765113115310669
Validation loss = 0.7759488224983215
Validation loss = 0.7726553082466125
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7742211818695068
Validation loss = 0.7633402347564697
Validation loss = 0.7715154886245728
Validation loss = 0.774111270904541
Validation loss = 0.7803711891174316
Validation loss = 0.7794902920722961
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7612098455429077
Validation loss = 0.7658579349517822
Validation loss = 0.7661559581756592
Validation loss = 0.7800641059875488
Validation loss = 0.7764546871185303
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7620821595191956
Validation loss = 0.7608027458190918
Validation loss = 0.773545503616333
Validation loss = 0.7760782241821289
Validation loss = 0.7806114554405212
Validation loss = 0.7811211943626404
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -705      |
| Iteration     | 15        |
| MaximumReturn | 92.2      |
| MinimumReturn | -1.45e+03 |
| TotalSamples  | 68000     |
-----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7433388233184814
Validation loss = 0.7480350136756897
Validation loss = 0.7593905925750732
Validation loss = 0.7603341341018677
Validation loss = 0.7638432383537292
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7497766613960266
Validation loss = 0.7563337683677673
Validation loss = 0.7556340098381042
Validation loss = 0.7666735649108887
Validation loss = 0.760578989982605
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.752112627029419
Validation loss = 0.7546325922012329
Validation loss = 0.7649027705192566
Validation loss = 0.7651205062866211
Validation loss = 0.7665665745735168
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7458183169364929
Validation loss = 0.7577166557312012
Validation loss = 0.7617470622062683
Validation loss = 0.7631945013999939
Validation loss = 0.7642403244972229
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7542480826377869
Validation loss = 0.7581108212471008
Validation loss = 0.757498562335968
Validation loss = 0.7712509632110596
Validation loss = 0.7683110237121582
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.19e+03 |
| Iteration     | 16        |
| MaximumReturn | -364      |
| MinimumReturn | -2.31e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.740139901638031
Validation loss = 0.7497683167457581
Validation loss = 0.7494856715202332
Validation loss = 0.7577398419380188
Validation loss = 0.7587709426879883
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7496674656867981
Validation loss = 0.747356116771698
Validation loss = 0.7508575320243835
Validation loss = 0.7574403285980225
Validation loss = 0.7592613697052002
Validation loss = 0.7602269649505615
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7647412419319153
Validation loss = 0.7509253025054932
Validation loss = 0.755041778087616
Validation loss = 0.7551509737968445
Validation loss = 0.7664430141448975
Validation loss = 0.7628066539764404
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7584676146507263
Validation loss = 0.7498757243156433
Validation loss = 0.7534445524215698
Validation loss = 0.7602143883705139
Validation loss = 0.7559536099433899
Validation loss = 0.7534311413764954
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7623188495635986
Validation loss = 0.7576764225959778
Validation loss = 0.7591057419776917
Validation loss = 0.762501060962677
Validation loss = 0.7625075578689575
Validation loss = 0.7645058631896973
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.32e+03 |
| Iteration     | 17        |
| MaximumReturn | -827      |
| MinimumReturn | -1.71e+03 |
| TotalSamples  | 76000     |
-----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7402885556221008
Validation loss = 0.7432958483695984
Validation loss = 0.7532013654708862
Validation loss = 0.7527822852134705
Validation loss = 0.7574281096458435
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7392602562904358
Validation loss = 0.7353385090827942
Validation loss = 0.7516418695449829
Validation loss = 0.7490783333778381
Validation loss = 0.7590442895889282
Validation loss = 0.7525321841239929
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7330721020698547
Validation loss = 0.7456004619598389
Validation loss = 0.752164363861084
Validation loss = 0.7541014552116394
Validation loss = 0.7495428919792175
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7425937652587891
Validation loss = 0.7407310605049133
Validation loss = 0.7495813965797424
Validation loss = 0.7496770620346069
Validation loss = 0.7525689601898193
Validation loss = 0.7517290115356445
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7511429786682129
Validation loss = 0.7504182457923889
Validation loss = 0.7535123229026794
Validation loss = 0.756691038608551
Validation loss = 0.7556408047676086
Validation loss = 0.7556414008140564
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.05e+03 |
| Iteration     | 18        |
| MaximumReturn | -618      |
| MinimumReturn | -1.74e+03 |
| TotalSamples  | 80000     |
-----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7451267242431641
Validation loss = 0.7424925565719604
Validation loss = 0.7472864985466003
Validation loss = 0.7498217225074768
Validation loss = 0.7510940432548523
Validation loss = 0.753596305847168
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.735954225063324
Validation loss = 0.743066132068634
Validation loss = 0.7494211196899414
Validation loss = 0.7463289499282837
Validation loss = 0.7505382299423218
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7418266534805298
Validation loss = 0.7451453804969788
Validation loss = 0.7512403726577759
Validation loss = 0.749554455280304
Validation loss = 0.7483096718788147
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7417590022087097
Validation loss = 0.7440679669380188
Validation loss = 0.7453426122665405
Validation loss = 0.7486652135848999
Validation loss = 0.7441068887710571
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7466198801994324
Validation loss = 0.7495172619819641
Validation loss = 0.751079797744751
Validation loss = 0.7519748210906982
Validation loss = 0.7598228454589844
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.15e+03 |
| Iteration     | 19        |
| MaximumReturn | -800      |
| MinimumReturn | -1.98e+03 |
| TotalSamples  | 84000     |
-----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7359500527381897
Validation loss = 0.7445090413093567
Validation loss = 0.7466604113578796
Validation loss = 0.745185375213623
Validation loss = 0.7494722604751587
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7444741129875183
Validation loss = 0.7431589961051941
Validation loss = 0.7454245686531067
Validation loss = 0.7480881810188293
Validation loss = 0.7507414817810059
Validation loss = 0.7442197203636169
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7466307282447815
Validation loss = 0.746078610420227
Validation loss = 0.7500430345535278
Validation loss = 0.7512987852096558
Validation loss = 0.75616055727005
Validation loss = 0.757143497467041
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7447643876075745
Validation loss = 0.7357350587844849
Validation loss = 0.7475401759147644
Validation loss = 0.7467179298400879
Validation loss = 0.7480128407478333
Validation loss = 0.7488629221916199
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7418974041938782
Validation loss = 0.7412030100822449
Validation loss = 0.748130202293396
Validation loss = 0.7503248453140259
Validation loss = 0.746374785900116
Validation loss = 0.7513532042503357
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.53e+03 |
| Iteration     | 20        |
| MaximumReturn | -897      |
| MinimumReturn | -2.29e+03 |
| TotalSamples  | 88000     |
-----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7308850884437561
Validation loss = 0.7405369281768799
Validation loss = 0.7411637902259827
Validation loss = 0.7450270056724548
Validation loss = 0.7457563877105713
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7321447730064392
Validation loss = 0.735995352268219
Validation loss = 0.7399618625640869
Validation loss = 0.7434893250465393
Validation loss = 0.7451668381690979
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7486875653266907
Validation loss = 0.7390422224998474
Validation loss = 0.7422912120819092
Validation loss = 0.7440906763076782
Validation loss = 0.7495318055152893
Validation loss = 0.7501926422119141
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7386608719825745
Validation loss = 0.7314798831939697
Validation loss = 0.7437102794647217
Validation loss = 0.7404889464378357
Validation loss = 0.7331159114837646
Validation loss = 0.7381269931793213
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7466039657592773
Validation loss = 0.736080527305603
Validation loss = 0.7386752963066101
Validation loss = 0.7380481362342834
Validation loss = 0.741520345211029
Validation loss = 0.7423458695411682
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.25e+03 |
| Iteration     | 21        |
| MaximumReturn | -727      |
| MinimumReturn | -2.33e+03 |
| TotalSamples  | 92000     |
-----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7384654879570007
Validation loss = 0.7355566620826721
Validation loss = 0.7381671667098999
Validation loss = 0.7395920157432556
Validation loss = 0.7388736009597778
Validation loss = 0.7441665530204773
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7303600907325745
Validation loss = 0.7297816872596741
Validation loss = 0.7368308901786804
Validation loss = 0.7394733428955078
Validation loss = 0.7362513542175293
Validation loss = 0.7371501326560974
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7282952070236206
Validation loss = 0.7295970320701599
Validation loss = 0.7403702735900879
Validation loss = 0.7412815093994141
Validation loss = 0.7383025884628296
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7443179488182068
Validation loss = 0.7271722555160522
Validation loss = 0.7308328747749329
Validation loss = 0.7354558110237122
Validation loss = 0.7325962781906128
Validation loss = 0.733066737651825
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7338693141937256
Validation loss = 0.7342169284820557
Validation loss = 0.7364193201065063
Validation loss = 0.7395937442779541
Validation loss = 0.7413582801818848
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -638     |
| Iteration     | 22       |
| MaximumReturn | -454     |
| MinimumReturn | -900     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7291285395622253
Validation loss = 0.7302737236022949
Validation loss = 0.7330935597419739
Validation loss = 0.7324704527854919
Validation loss = 0.7359514832496643
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7395356297492981
Validation loss = 0.7239726185798645
Validation loss = 0.731044590473175
Validation loss = 0.7359724044799805
Validation loss = 0.7302090525627136
Validation loss = 0.7325196862220764
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7339689135551453
Validation loss = 0.732964038848877
Validation loss = 0.7370781898498535
Validation loss = 0.7370942234992981
Validation loss = 0.7368132472038269
Validation loss = 0.7362655997276306
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7213608622550964
Validation loss = 0.7300317287445068
Validation loss = 0.7333695888519287
Validation loss = 0.7345817685127258
Validation loss = 0.7388680577278137
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7319801449775696
Validation loss = 0.7323467135429382
Validation loss = 0.7368901371955872
Validation loss = 0.7354481220245361
Validation loss = 0.7337409853935242
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -627     |
| Iteration     | 23       |
| MaximumReturn | -352     |
| MinimumReturn | -826     |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7329832315444946
Validation loss = 0.7277992367744446
Validation loss = 0.7318594455718994
Validation loss = 0.7361359596252441
Validation loss = 0.7344297766685486
Validation loss = 0.7337465882301331
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7281202673912048
Validation loss = 0.7321839928627014
Validation loss = 0.7296909093856812
Validation loss = 0.7346423864364624
Validation loss = 0.7354786396026611
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7259094715118408
Validation loss = 0.7316332459449768
Validation loss = 0.7335805296897888
Validation loss = 0.7397444844245911
Validation loss = 0.7410529851913452
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7306566834449768
Validation loss = 0.7233532071113586
Validation loss = 0.7262930274009705
Validation loss = 0.7312569618225098
Validation loss = 0.7317869663238525
Validation loss = 0.7319114208221436
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7272450923919678
Validation loss = 0.727349042892456
Validation loss = 0.7375767230987549
Validation loss = 0.7364022135734558
Validation loss = 0.7367359399795532
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.13e+03 |
| Iteration     | 24        |
| MaximumReturn | 38.2      |
| MinimumReturn | -2.54e+03 |
| TotalSamples  | 104000    |
-----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7304815053939819
Validation loss = 0.7245980501174927
Validation loss = 0.730880856513977
Validation loss = 0.727473795413971
Validation loss = 0.7314094305038452
Validation loss = 0.7314630150794983
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7313639521598816
Validation loss = 0.7257252335548401
Validation loss = 0.7333130240440369
Validation loss = 0.7311198115348816
Validation loss = 0.734096884727478
Validation loss = 0.7308003306388855
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7254931926727295
Validation loss = 0.7229043245315552
Validation loss = 0.7332344055175781
Validation loss = 0.7380162477493286
Validation loss = 0.7360039949417114
Validation loss = 0.7365599274635315
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7250657677650452
Validation loss = 0.7206570506095886
Validation loss = 0.7278558015823364
Validation loss = 0.7304279804229736
Validation loss = 0.7278769612312317
Validation loss = 0.725563108921051
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7264548540115356
Validation loss = 0.7268492579460144
Validation loss = 0.7297207713127136
Validation loss = 0.7317887544631958
Validation loss = 0.7313966155052185
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.95e+03 |
| Iteration     | 25        |
| MaximumReturn | -466      |
| MinimumReturn | -2.66e+03 |
| TotalSamples  | 108000    |
-----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7298805713653564
Validation loss = 0.7252450585365295
Validation loss = 0.7273910045623779
Validation loss = 0.7317742705345154
Validation loss = 0.731821596622467
Validation loss = 0.7315787076950073
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7280524969100952
Validation loss = 0.7305610179901123
Validation loss = 0.7347903847694397
Validation loss = 0.7341828942298889
Validation loss = 0.7378990650177002
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7311483025550842
Validation loss = 0.7275501489639282
Validation loss = 0.7323293089866638
Validation loss = 0.7320086359977722
Validation loss = 0.7345820069313049
Validation loss = 0.7331985235214233
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7342513799667358
Validation loss = 0.7262654304504395
Validation loss = 0.7341107130050659
Validation loss = 0.7291607856750488
Validation loss = 0.7336409091949463
Validation loss = 0.7316541075706482
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7254869937896729
Validation loss = 0.7298964262008667
Validation loss = 0.7341552972793579
Validation loss = 0.7350910902023315
Validation loss = 0.7333550453186035
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.08e+03 |
| Iteration     | 26        |
| MaximumReturn | -391      |
| MinimumReturn | -2.32e+03 |
| TotalSamples  | 112000    |
-----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7279693484306335
Validation loss = 0.7251600623130798
Validation loss = 0.7269096374511719
Validation loss = 0.7289946675300598
Validation loss = 0.730829656124115
Validation loss = 0.7293633222579956
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7259855270385742
Validation loss = 0.7249569892883301
Validation loss = 0.7318461537361145
Validation loss = 0.7329928278923035
Validation loss = 0.7346888780593872
Validation loss = 0.736742377281189
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7276214361190796
Validation loss = 0.7279897332191467
Validation loss = 0.7300191521644592
Validation loss = 0.7335224747657776
Validation loss = 0.7330769300460815
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7251859903335571
Validation loss = 0.7253008484840393
Validation loss = 0.7272063493728638
Validation loss = 0.727658212184906
Validation loss = 0.7267629504203796
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7309868931770325
Validation loss = 0.7312389612197876
Validation loss = 0.7273334860801697
Validation loss = 0.732076108455658
Validation loss = 0.7302356362342834
Validation loss = 0.7352977395057678
Validation loss = 0.7328954339027405
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.43e+03 |
| Iteration     | 27        |
| MaximumReturn | -537      |
| MinimumReturn | -2.76e+03 |
| TotalSamples  | 116000    |
-----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.732428789138794
Validation loss = 0.7205208539962769
Validation loss = 0.7256976366043091
Validation loss = 0.7290706038475037
Validation loss = 0.7276180982589722
Validation loss = 0.7289210557937622
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7325290441513062
Validation loss = 0.7270131707191467
Validation loss = 0.7305423617362976
Validation loss = 0.7331162691116333
Validation loss = 0.7341463565826416
Validation loss = 0.7303556799888611
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7283560633659363
Validation loss = 0.7249876260757446
Validation loss = 0.7309852242469788
Validation loss = 0.7353527545928955
Validation loss = 0.7321708798408508
Validation loss = 0.7346187829971313
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7280880212783813
Validation loss = 0.7235775589942932
Validation loss = 0.7283148765563965
Validation loss = 0.7289032340049744
Validation loss = 0.7236440777778625
Validation loss = 0.7279414534568787
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7277361154556274
Validation loss = 0.7226658463478088
Validation loss = 0.728257417678833
Validation loss = 0.7291450500488281
Validation loss = 0.7284048199653625
Validation loss = 0.729898989200592
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.69e+03 |
| Iteration     | 28        |
| MaximumReturn | -575      |
| MinimumReturn | -2.74e+03 |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.727401852607727
Validation loss = 0.7276594042778015
Validation loss = 0.7303177118301392
Validation loss = 0.7296627759933472
Validation loss = 0.72987300157547
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7311296463012695
Validation loss = 0.7338294982910156
Validation loss = 0.7292220592498779
Validation loss = 0.7389025092124939
Validation loss = 0.7351827025413513
Validation loss = 0.7367404699325562
Validation loss = 0.7338723540306091
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7314183712005615
Validation loss = 0.7276495099067688
Validation loss = 0.7336048483848572
Validation loss = 0.7359216809272766
Validation loss = 0.7350788116455078
Validation loss = 0.735931932926178
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7340792417526245
Validation loss = 0.7264255881309509
Validation loss = 0.7317975163459778
Validation loss = 0.7302616238594055
Validation loss = 0.7290942668914795
Validation loss = 0.7320582270622253
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7257810235023499
Validation loss = 0.7259880304336548
Validation loss = 0.7301260232925415
Validation loss = 0.7305173277854919
Validation loss = 0.7310218811035156
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.39e+03 |
| Iteration     | 29        |
| MaximumReturn | -890      |
| MinimumReturn | -2.27e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7323816418647766
Validation loss = 0.7267701029777527
Validation loss = 0.7312480807304382
Validation loss = 0.7288787364959717
Validation loss = 0.7288523316383362
Validation loss = 0.7342009544372559
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7276366949081421
Validation loss = 0.7263148427009583
Validation loss = 0.7332447171211243
Validation loss = 0.7372004985809326
Validation loss = 0.7334390878677368
Validation loss = 0.736994206905365
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7307829856872559
Validation loss = 0.7303897738456726
Validation loss = 0.7359063029289246
Validation loss = 0.7378382682800293
Validation loss = 0.7363058924674988
Validation loss = 0.7347753643989563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7308350205421448
Validation loss = 0.7238971590995789
Validation loss = 0.7303386926651001
Validation loss = 0.7360775470733643
Validation loss = 0.7317109704017639
Validation loss = 0.7309945225715637
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7312641739845276
Validation loss = 0.7261939644813538
Validation loss = 0.7327789664268494
Validation loss = 0.7353932857513428
Validation loss = 0.7325193881988525
Validation loss = 0.7327563166618347
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.66e+03 |
| Iteration     | 30        |
| MaximumReturn | -1.2e+03  |
| MinimumReturn | -2.83e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7250330448150635
Validation loss = 0.7279487252235413
Validation loss = 0.7301868796348572
Validation loss = 0.7282577753067017
Validation loss = 0.732252836227417
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7245407104492188
Validation loss = 0.7285814881324768
Validation loss = 0.7303350567817688
Validation loss = 0.7302596569061279
Validation loss = 0.7313766479492188
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7283209562301636
Validation loss = 0.7271501421928406
Validation loss = 0.7327984571456909
Validation loss = 0.7332111597061157
Validation loss = 0.7369140982627869
Validation loss = 0.7370312809944153
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7265163660049438
Validation loss = 0.7264684438705444
Validation loss = 0.7326890230178833
Validation loss = 0.7315723896026611
Validation loss = 0.7332226037979126
Validation loss = 0.732742965221405
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7271578311920166
Validation loss = 0.7253453135490417
Validation loss = 0.7291452884674072
Validation loss = 0.7300771474838257
Validation loss = 0.7332499623298645
Validation loss = 0.7322375774383545
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.2e+03  |
| Iteration     | 31        |
| MaximumReturn | -967      |
| MinimumReturn | -1.43e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.72572922706604
Validation loss = 0.7312589883804321
Validation loss = 0.7320260405540466
Validation loss = 0.7341772317886353
Validation loss = 0.7363454699516296
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.731262743473053
Validation loss = 0.7284175753593445
Validation loss = 0.7346890568733215
Validation loss = 0.7351967096328735
Validation loss = 0.7328613996505737
Validation loss = 0.733245313167572
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7314490079879761
Validation loss = 0.7341209650039673
Validation loss = 0.7365292310714722
Validation loss = 0.7356775403022766
Validation loss = 0.7328312993049622
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7266585230827332
Validation loss = 0.7252493500709534
Validation loss = 0.7323657274246216
Validation loss = 0.734078586101532
Validation loss = 0.7336490750312805
Validation loss = 0.7358161211013794
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7273958325386047
Validation loss = 0.7284061908721924
Validation loss = 0.7301518321037292
Validation loss = 0.7327292561531067
Validation loss = 0.7316097617149353
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.7e+03  |
| Iteration     | 32        |
| MaximumReturn | -1.02e+03 |
| MinimumReturn | -2.61e+03 |
| TotalSamples  | 136000    |
-----------------------------
