Logging to experiments/hopper/hopperO01/Tue-01-Nov-2022-09-35-15-AM-CDT_hopper_trpo_iteration_20_seed1234
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.9094932079315186
Validation loss = 0.6674730777740479
Validation loss = 0.6495957374572754
Validation loss = 0.6513131856918335
Validation loss = 0.6556215286254883
Validation loss = 0.6716224551200867
Validation loss = 0.7022049427032471
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.8990565538406372
Validation loss = 0.6797657608985901
Validation loss = 0.6485223770141602
Validation loss = 0.6454019546508789
Validation loss = 0.6478968858718872
Validation loss = 0.6836206912994385
Validation loss = 0.727555513381958
Validation loss = 0.815673291683197
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.890730619430542
Validation loss = 0.6740823984146118
Validation loss = 0.6486344337463379
Validation loss = 0.6408135890960693
Validation loss = 0.6469109058380127
Validation loss = 0.6653914451599121
Validation loss = 0.6901541948318481
Validation loss = 0.72934889793396
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8090508580207825
Validation loss = 0.6704025268554688
Validation loss = 0.6547094583511353
Validation loss = 0.6457428932189941
Validation loss = 0.6557037830352783
Validation loss = 0.6993918418884277
Validation loss = 0.7300575375556946
Validation loss = 0.7795250415802002
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8723270893096924
Validation loss = 0.6752969026565552
Validation loss = 0.6503336429595947
Validation loss = 0.642264723777771
Validation loss = 0.6584886312484741
Validation loss = 0.6829190254211426
Validation loss = 0.716884434223175
Validation loss = 0.7912636995315552
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.52e+03 |
| Iteration     | 0         |
| MaximumReturn | -2.49e+03 |
| MinimumReturn | -2.54e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7023639678955078
Validation loss = 0.624850869178772
Validation loss = 0.6250114440917969
Validation loss = 0.6268319487571716
Validation loss = 0.6407904028892517
Validation loss = 0.6772274971008301
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7116650342941284
Validation loss = 0.6374626755714417
Validation loss = 0.625095784664154
Validation loss = 0.6220189332962036
Validation loss = 0.6382524371147156
Validation loss = 0.652560830116272
Validation loss = 0.6836779117584229
Validation loss = 0.6994320154190063
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7131824493408203
Validation loss = 0.6342742443084717
Validation loss = 0.6216223835945129
Validation loss = 0.6403794884681702
Validation loss = 0.6395854353904724
Validation loss = 0.6710065007209778
Validation loss = 0.6932089328765869
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7212167978286743
Validation loss = 0.6445858478546143
Validation loss = 0.6282789707183838
Validation loss = 0.6515761017799377
Validation loss = 0.6551090478897095
Validation loss = 0.67903733253479
Validation loss = 0.7024421095848083
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.717952311038971
Validation loss = 0.6354355812072754
Validation loss = 0.6309643983840942
Validation loss = 0.6258741617202759
Validation loss = 0.6467781066894531
Validation loss = 0.6789868474006653
Validation loss = 0.6953022480010986
Validation loss = 0.7219525575637817
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.46e+03 |
| Iteration     | 1         |
| MaximumReturn | -2.45e+03 |
| MinimumReturn | -2.49e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.616890549659729
Validation loss = 0.6099381446838379
Validation loss = 0.6207484602928162
Validation loss = 0.6360591650009155
Validation loss = 0.6347949504852295
Validation loss = 0.6704239249229431
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6285216808319092
Validation loss = 0.628265380859375
Validation loss = 0.6468493938446045
Validation loss = 0.669771671295166
Validation loss = 0.6740962862968445
Validation loss = 0.6902757287025452
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.62982577085495
Validation loss = 0.6406345963478088
Validation loss = 0.6430412530899048
Validation loss = 0.6529425978660583
Validation loss = 0.6622738838195801
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6293938159942627
Validation loss = 0.6309149861335754
Validation loss = 0.6538026332855225
Validation loss = 0.6523621678352356
Validation loss = 0.6725156903266907
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6300660371780396
Validation loss = 0.6501514315605164
Validation loss = 0.6545454859733582
Validation loss = 0.6647368669509888
Validation loss = 0.6848586201667786
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.39e+03 |
| Iteration     | 2         |
| MaximumReturn | -2.17e+03 |
| MinimumReturn | -2.49e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6113051176071167
Validation loss = 0.620669960975647
Validation loss = 0.6298010349273682
Validation loss = 0.650225043296814
Validation loss = 0.6675907373428345
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6106207370758057
Validation loss = 0.6305558681488037
Validation loss = 0.653793454170227
Validation loss = 0.6719028949737549
Validation loss = 0.6847137212753296
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6123075485229492
Validation loss = 0.6183481216430664
Validation loss = 0.6456883549690247
Validation loss = 0.6667617559432983
Validation loss = 0.6675736904144287
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6214402914047241
Validation loss = 0.6313046216964722
Validation loss = 0.645677924156189
Validation loss = 0.6532220840454102
Validation loss = 0.6756701469421387
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6154191493988037
Validation loss = 0.6274489164352417
Validation loss = 0.6421012282371521
Validation loss = 0.6685028076171875
Validation loss = 0.6820037364959717
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.16e+03 |
| Iteration     | 3         |
| MaximumReturn | -596      |
| MinimumReturn | -1.87e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6144607067108154
Validation loss = 0.6480032205581665
Validation loss = 0.6715859770774841
Validation loss = 0.6978504061698914
Validation loss = 0.7130438089370728
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6315039992332458
Validation loss = 0.6576545834541321
Validation loss = 0.6835815906524658
Validation loss = 0.7062554359436035
Validation loss = 0.7231126427650452
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6190492510795593
Validation loss = 0.6522666215896606
Validation loss = 0.673866868019104
Validation loss = 0.6871955394744873
Validation loss = 0.704479455947876
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6152766346931458
Validation loss = 0.6517300009727478
Validation loss = 0.6737613677978516
Validation loss = 0.6867523789405823
Validation loss = 0.7164671421051025
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6149604916572571
Validation loss = 0.6457161903381348
Validation loss = 0.6666079163551331
Validation loss = 0.6854226589202881
Validation loss = 0.7224128842353821
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.79e+03 |
| Iteration     | 4         |
| MaximumReturn | -388      |
| MinimumReturn | -3.23e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6618237495422363
Validation loss = 0.7520923614501953
Validation loss = 0.7742049694061279
Validation loss = 0.7985028624534607
Validation loss = 0.8107128739356995
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6891794204711914
Validation loss = 0.7521681189537048
Validation loss = 0.7696345448493958
Validation loss = 0.8021113872528076
Validation loss = 0.7994387745857239
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6640350818634033
Validation loss = 0.7256714701652527
Validation loss = 0.7598772048950195
Validation loss = 0.7895717024803162
Validation loss = 0.7815156579017639
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6728787422180176
Validation loss = 0.7457197308540344
Validation loss = 0.7647996544837952
Validation loss = 0.7919483780860901
Validation loss = 0.8054792881011963
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6876111030578613
Validation loss = 0.7344563007354736
Validation loss = 0.7495131492614746
Validation loss = 0.778403103351593
Validation loss = 0.7904157638549805
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -3.47e+03 |
| Iteration     | 5         |
| MaximumReturn | -3.31e+03 |
| MinimumReturn | -3.66e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.696388304233551
Validation loss = 0.7613068222999573
Validation loss = 0.761725902557373
Validation loss = 0.7736359238624573
Validation loss = 0.7884188890457153
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6853801608085632
Validation loss = 0.7539197206497192
Validation loss = 0.7663195729255676
Validation loss = 0.7791970372200012
Validation loss = 0.7890034317970276
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6898494362831116
Validation loss = 0.7351252436637878
Validation loss = 0.7675107717514038
Validation loss = 0.7703757286071777
Validation loss = 0.7764437794685364
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7060961723327637
Validation loss = 0.7518537640571594
Validation loss = 0.7642145752906799
Validation loss = 0.781152069568634
Validation loss = 0.8000990748405457
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7049304246902466
Validation loss = 0.7434979677200317
Validation loss = 0.7576574683189392
Validation loss = 0.7691248655319214
Validation loss = 0.7832413911819458
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.53e+03 |
| Iteration     | 6         |
| MaximumReturn | -1.65e+03 |
| MinimumReturn | -3.34e+03 |
| TotalSamples  | 32000     |
-----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6845743656158447
Validation loss = 0.7134294509887695
Validation loss = 0.7287650108337402
Validation loss = 0.7265522480010986
Validation loss = 0.7348095178604126
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6684495806694031
Validation loss = 0.7147405743598938
Validation loss = 0.7216355204582214
Validation loss = 0.7339468002319336
Validation loss = 0.7343993186950684
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6636945009231567
Validation loss = 0.69157874584198
Validation loss = 0.720276415348053
Validation loss = 0.735482394695282
Validation loss = 0.7401689291000366
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6633992195129395
Validation loss = 0.7079770565032959
Validation loss = 0.7148548364639282
Validation loss = 0.7274746894836426
Validation loss = 0.7336095571517944
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6659673452377319
Validation loss = 0.6962176561355591
Validation loss = 0.7070866227149963
Validation loss = 0.7180851697921753
Validation loss = 0.723560094833374
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.05e+03 |
| Iteration     | 7         |
| MaximumReturn | -1.71e+03 |
| MinimumReturn | -3.02e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6966339945793152
Validation loss = 0.718652606010437
Validation loss = 0.7295973300933838
Validation loss = 0.7359386086463928
Validation loss = 0.7462146878242493
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6858559250831604
Validation loss = 0.713291347026825
Validation loss = 0.7267596125602722
Validation loss = 0.7324549555778503
Validation loss = 0.7458176612854004
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.695863664150238
Validation loss = 0.7238843441009521
Validation loss = 0.7308855652809143
Validation loss = 0.7465219497680664
Validation loss = 0.7510439157485962
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.683042585849762
Validation loss = 0.7130451798439026
Validation loss = 0.7214096784591675
Validation loss = 0.7335107922554016
Validation loss = 0.7368589043617249
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6857523918151855
Validation loss = 0.7064453363418579
Validation loss = 0.7185907959938049
Validation loss = 0.7280438542366028
Validation loss = 0.729158341884613
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.8e+03  |
| Iteration     | 8         |
| MaximumReturn | -1.35e+03 |
| MinimumReturn | -2.52e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7121859192848206
Validation loss = 0.7114524841308594
Validation loss = 0.725995659828186
Validation loss = 0.7397230863571167
Validation loss = 0.7445856332778931
Validation loss = 0.7437666058540344
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.699840247631073
Validation loss = 0.7124287486076355
Validation loss = 0.7255417704582214
Validation loss = 0.7330725193023682
Validation loss = 0.7413550019264221
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7024492621421814
Validation loss = 0.7218180298805237
Validation loss = 0.7329529523849487
Validation loss = 0.732494056224823
Validation loss = 0.7441233992576599
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6986867785453796
Validation loss = 0.7031711935997009
Validation loss = 0.7211693525314331
Validation loss = 0.7272418737411499
Validation loss = 0.7364844679832458
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7002300024032593
Validation loss = 0.7057280540466309
Validation loss = 0.7131360173225403
Validation loss = 0.7215660810470581
Validation loss = 0.733562171459198
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.15e+03 |
| Iteration     | 9         |
| MaximumReturn | -892      |
| MinimumReturn | -1.41e+03 |
| TotalSamples  | 44000     |
-----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7257423996925354
Validation loss = 0.7223941683769226
Validation loss = 0.7314901351928711
Validation loss = 0.7444005608558655
Validation loss = 0.7489389181137085
Validation loss = 0.7444825768470764
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7111022472381592
Validation loss = 0.7211647033691406
Validation loss = 0.7230404019355774
Validation loss = 0.7336673736572266
Validation loss = 0.7370738387107849
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7188684940338135
Validation loss = 0.72405606508255
Validation loss = 0.7340565323829651
Validation loss = 0.738115668296814
Validation loss = 0.7464050054550171
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7096558809280396
Validation loss = 0.7140153050422668
Validation loss = 0.7215496301651001
Validation loss = 0.7343168258666992
Validation loss = 0.7339732646942139
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6967747211456299
Validation loss = 0.7089455127716064
Validation loss = 0.7190415859222412
Validation loss = 0.7281495928764343
Validation loss = 0.7300258874893188
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.49e+03 |
| Iteration     | 10        |
| MaximumReturn | -1.8e+03  |
| MinimumReturn | -2.79e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7204675674438477
Validation loss = 0.729360818862915
Validation loss = 0.7275226712226868
Validation loss = 0.7382451891899109
Validation loss = 0.7462404370307922
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7156334519386292
Validation loss = 0.7212209701538086
Validation loss = 0.728502094745636
Validation loss = 0.7381053566932678
Validation loss = 0.736929178237915
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7197751998901367
Validation loss = 0.7174484729766846
Validation loss = 0.7358024716377258
Validation loss = 0.7366137504577637
Validation loss = 0.7366484999656677
Validation loss = 0.7435958385467529
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7231356501579285
Validation loss = 0.7164783477783203
Validation loss = 0.7260242104530334
Validation loss = 0.7291571497917175
Validation loss = 0.7377716898918152
Validation loss = 0.7404009699821472
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7002552151679993
Validation loss = 0.7145505547523499
Validation loss = 0.7199263572692871
Validation loss = 0.7207719683647156
Validation loss = 0.731215238571167
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.2e+03  |
| Iteration     | 11        |
| MaximumReturn | -1.59e+03 |
| MinimumReturn | -2.81e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7311222553253174
Validation loss = 0.7291571497917175
Validation loss = 0.7307808995246887
Validation loss = 0.741924524307251
Validation loss = 0.7407350540161133
Validation loss = 0.7462562918663025
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7214199900627136
Validation loss = 0.725678026676178
Validation loss = 0.7378604412078857
Validation loss = 0.7399251461029053
Validation loss = 0.7376833558082581
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7235695123672485
Validation loss = 0.720140814781189
Validation loss = 0.7319570779800415
Validation loss = 0.7350436449050903
Validation loss = 0.7429482936859131
Validation loss = 0.7432606220245361
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7252871990203857
Validation loss = 0.7197430729866028
Validation loss = 0.731673002243042
Validation loss = 0.7351466417312622
Validation loss = 0.7403075098991394
Validation loss = 0.7432412505149841
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7204224467277527
Validation loss = 0.713409423828125
Validation loss = 0.7175760865211487
Validation loss = 0.7337691783905029
Validation loss = 0.7347427606582642
Validation loss = 0.7434413433074951
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.41e+03 |
| Iteration     | 12        |
| MaximumReturn | -2.35e+03 |
| MinimumReturn | -2.53e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7168307900428772
Validation loss = 0.7278717756271362
Validation loss = 0.7382305264472961
Validation loss = 0.7420989871025085
Validation loss = 0.7471280097961426
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7230880856513977
Validation loss = 0.7273085713386536
Validation loss = 0.729651927947998
Validation loss = 0.7366840243339539
Validation loss = 0.7441496253013611
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7222040891647339
Validation loss = 0.7286868095397949
Validation loss = 0.7333981394767761
Validation loss = 0.7432500123977661
Validation loss = 0.7425097823143005
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7214747667312622
Validation loss = 0.7262808084487915
Validation loss = 0.7330343127250671
Validation loss = 0.7427350282669067
Validation loss = 0.7441965937614441
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7174715399742126
Validation loss = 0.7330085039138794
Validation loss = 0.7324695587158203
Validation loss = 0.736836850643158
Validation loss = 0.7433847784996033
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.16e+03 |
| Iteration     | 13        |
| MaximumReturn | -2.03e+03 |
| MinimumReturn | -2.26e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7224683165550232
Validation loss = 0.7257148027420044
Validation loss = 0.7313920259475708
Validation loss = 0.737921416759491
Validation loss = 0.742973268032074
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7259739637374878
Validation loss = 0.7297138571739197
Validation loss = 0.7338701486587524
Validation loss = 0.7425681352615356
Validation loss = 0.7427396774291992
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7254288196563721
Validation loss = 0.7235000729560852
Validation loss = 0.7351019978523254
Validation loss = 0.7348383069038391
Validation loss = 0.7402291893959045
Validation loss = 0.7471228241920471
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7302954792976379
Validation loss = 0.7312505841255188
Validation loss = 0.7334463596343994
Validation loss = 0.7411046624183655
Validation loss = 0.7448611855506897
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7408124208450317
Validation loss = 0.7302483320236206
Validation loss = 0.7393314838409424
Validation loss = 0.7358368635177612
Validation loss = 0.7458905577659607
Validation loss = 0.7470600008964539
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.88e+03 |
| Iteration     | 14        |
| MaximumReturn | -1.47e+03 |
| MinimumReturn | -2.88e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7260653376579285
Validation loss = 0.7275079488754272
Validation loss = 0.7362210750579834
Validation loss = 0.7417281866073608
Validation loss = 0.7417647838592529
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7293077707290649
Validation loss = 0.7266274094581604
Validation loss = 0.7414723634719849
Validation loss = 0.7405816316604614
Validation loss = 0.7440551519393921
Validation loss = 0.7498540282249451
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7371062636375427
Validation loss = 0.7287841439247131
Validation loss = 0.7359637022018433
Validation loss = 0.7398260831832886
Validation loss = 0.7440369725227356
Validation loss = 0.7430520057678223
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7326040267944336
Validation loss = 0.7357794046401978
Validation loss = 0.7421269416809082
Validation loss = 0.7406877875328064
Validation loss = 0.7458341717720032
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7279301881790161
Validation loss = 0.7333155870437622
Validation loss = 0.738518238067627
Validation loss = 0.746338963508606
Validation loss = 0.747084379196167
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.1e+03  |
| Iteration     | 15        |
| MaximumReturn | -1.88e+03 |
| MinimumReturn | -2.54e+03 |
| TotalSamples  | 68000     |
-----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7345096468925476
Validation loss = 0.7297732830047607
Validation loss = 0.7369468808174133
Validation loss = 0.7418603301048279
Validation loss = 0.7346736192703247
Validation loss = 0.7373670339584351
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7274024486541748
Validation loss = 0.7329912781715393
Validation loss = 0.7339983582496643
Validation loss = 0.737160325050354
Validation loss = 0.7443191409111023
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7330573797225952
Validation loss = 0.7289033532142639
Validation loss = 0.733716607093811
Validation loss = 0.7401278018951416
Validation loss = 0.7378217577934265
Validation loss = 0.740612804889679
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7335469722747803
Validation loss = 0.7282851338386536
Validation loss = 0.7430020570755005
Validation loss = 0.7468880414962769
Validation loss = 0.7458811402320862
Validation loss = 0.7507857084274292
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7285270690917969
Validation loss = 0.7318150997161865
Validation loss = 0.7364919185638428
Validation loss = 0.737551212310791
Validation loss = 0.7457886934280396
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2e+03    |
| Iteration     | 16        |
| MaximumReturn | -1.81e+03 |
| MinimumReturn | -2.56e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7292129993438721
Validation loss = 0.7259237766265869
Validation loss = 0.7345085144042969
Validation loss = 0.7311941981315613
Validation loss = 0.7348909378051758
Validation loss = 0.7353124022483826
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7272742986679077
Validation loss = 0.7241162061691284
Validation loss = 0.7285409569740295
Validation loss = 0.7338719964027405
Validation loss = 0.7366735935211182
Validation loss = 0.7369630932807922
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7254233360290527
Validation loss = 0.7260608077049255
Validation loss = 0.7311245799064636
Validation loss = 0.7280277013778687
Validation loss = 0.7349035143852234
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7341586947441101
Validation loss = 0.7290985584259033
Validation loss = 0.739106297492981
Validation loss = 0.7368840575218201
Validation loss = 0.7395696043968201
Validation loss = 0.7414922714233398
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7283028364181519
Validation loss = 0.7313886880874634
Validation loss = 0.7271606922149658
Validation loss = 0.7342939972877502
Validation loss = 0.7389931678771973
Validation loss = 0.7448475360870361
Validation loss = 0.7379525303840637
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.87e+03 |
| Iteration     | 17        |
| MaximumReturn | -1.48e+03 |
| MinimumReturn | -2.11e+03 |
| TotalSamples  | 76000     |
-----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7187730669975281
Validation loss = 0.7249611616134644
Validation loss = 0.729401707649231
Validation loss = 0.7340714335441589
Validation loss = 0.7321498990058899
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7273640036582947
Validation loss = 0.7259879112243652
Validation loss = 0.7300452589988708
Validation loss = 0.7315510511398315
Validation loss = 0.7323623895645142
Validation loss = 0.7329793572425842
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7195027470588684
Validation loss = 0.7236275672912598
Validation loss = 0.7267314195632935
Validation loss = 0.7327885031700134
Validation loss = 0.7303885817527771
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.728449285030365
Validation loss = 0.7240120768547058
Validation loss = 0.7310199737548828
Validation loss = 0.7388325929641724
Validation loss = 0.7431132793426514
Validation loss = 0.7401810884475708
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7264451384544373
Validation loss = 0.7298427224159241
Validation loss = 0.733017086982727
Validation loss = 0.7323336601257324
Validation loss = 0.7353777885437012
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.07e+03 |
| Iteration     | 18        |
| MaximumReturn | -1.82e+03 |
| MinimumReturn | -2.77e+03 |
| TotalSamples  | 80000     |
-----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7300553321838379
Validation loss = 0.7347398996353149
Validation loss = 0.7354422211647034
Validation loss = 0.7348387837409973
Validation loss = 0.7414363622665405
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7296401858329773
Validation loss = 0.7272976636886597
Validation loss = 0.7325610518455505
Validation loss = 0.7347163558006287
Validation loss = 0.7441881895065308
Validation loss = 0.7392218708992004
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7238410711288452
Validation loss = 0.731212317943573
Validation loss = 0.7310433387756348
Validation loss = 0.7336710691452026
Validation loss = 0.7343800663948059
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7346669435501099
Validation loss = 0.7355321645736694
Validation loss = 0.740511417388916
Validation loss = 0.7432698607444763
Validation loss = 0.7465818524360657
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.726354718208313
Validation loss = 0.7309485673904419
Validation loss = 0.7388201355934143
Validation loss = 0.7419111728668213
Validation loss = 0.7392438650131226
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.89e+03 |
| Iteration     | 19        |
| MaximumReturn | -1.76e+03 |
| MinimumReturn | -2.17e+03 |
| TotalSamples  | 84000     |
-----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7202964425086975
Validation loss = 0.7166142463684082
Validation loss = 0.726433515548706
Validation loss = 0.7266913652420044
Validation loss = 0.7287569046020508
Validation loss = 0.7298744320869446
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7174555063247681
Validation loss = 0.7188729047775269
Validation loss = 0.7268575429916382
Validation loss = 0.7290406227111816
Validation loss = 0.7308356165885925
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7224326133728027
Validation loss = 0.7204959392547607
Validation loss = 0.7264466881752014
Validation loss = 0.7299839854240417
Validation loss = 0.7288657426834106
Validation loss = 0.7274508476257324
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7222458720207214
Validation loss = 0.7271069884300232
Validation loss = 0.7327403426170349
Validation loss = 0.7311296463012695
Validation loss = 0.7361937165260315
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7199650406837463
Validation loss = 0.7262092232704163
Validation loss = 0.7299879193305969
Validation loss = 0.7324188947677612
Validation loss = 0.7302969098091125
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.76e+03 |
| Iteration     | 20        |
| MaximumReturn | -1.23e+03 |
| MinimumReturn | -2.99e+03 |
| TotalSamples  | 88000     |
-----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7262687087059021
Validation loss = 0.7210790514945984
Validation loss = 0.7295436859130859
Validation loss = 0.7311426401138306
Validation loss = 0.7352890372276306
Validation loss = 0.7312738299369812
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.715648889541626
Validation loss = 0.7211886048316956
Validation loss = 0.7299264669418335
Validation loss = 0.7381996512413025
Validation loss = 0.7327911257743835
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7228589653968811
Validation loss = 0.7209306359291077
Validation loss = 0.7262553572654724
Validation loss = 0.7300710678100586
Validation loss = 0.7289839386940002
Validation loss = 0.7286996841430664
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7174953818321228
Validation loss = 0.7310505509376526
Validation loss = 0.731110155582428
Validation loss = 0.7381724715232849
Validation loss = 0.7388710379600525
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7281153202056885
Validation loss = 0.7318561673164368
Validation loss = 0.7344003319740295
Validation loss = 0.7370744347572327
Validation loss = 0.7387443780899048
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.53e+03 |
| Iteration     | 21        |
| MaximumReturn | -1.46e+03 |
| MinimumReturn | -1.66e+03 |
| TotalSamples  | 92000     |
-----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7230609655380249
Validation loss = 0.7244530320167542
Validation loss = 0.7271715998649597
Validation loss = 0.7312511801719666
Validation loss = 0.7293061017990112
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7223095893859863
Validation loss = 0.7210325002670288
Validation loss = 0.7239526510238647
Validation loss = 0.7318863272666931
Validation loss = 0.7320297956466675
Validation loss = 0.7327575087547302
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7220064997673035
Validation loss = 0.7213007211685181
Validation loss = 0.7221013903617859
Validation loss = 0.7285934686660767
Validation loss = 0.7244374752044678
Validation loss = 0.72704017162323
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7261220812797546
Validation loss = 0.7250938415527344
Validation loss = 0.734153151512146
Validation loss = 0.7335861325263977
Validation loss = 0.7358524203300476
Validation loss = 0.7354649305343628
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7298412322998047
Validation loss = 0.7224801778793335
Validation loss = 0.7323896884918213
Validation loss = 0.7370052933692932
Validation loss = 0.7363929748535156
Validation loss = 0.7389694452285767
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.57e+03 |
| Iteration     | 22        |
| MaximumReturn | -1.49e+03 |
| MinimumReturn | -1.81e+03 |
| TotalSamples  | 96000     |
-----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7256547808647156
Validation loss = 0.7251358032226562
Validation loss = 0.733738899230957
Validation loss = 0.7322132587432861
Validation loss = 0.7325709462165833
Validation loss = 0.7349188327789307
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7274215221405029
Validation loss = 0.7346766591072083
Validation loss = 0.7348267436027527
Validation loss = 0.7330144047737122
Validation loss = 0.7389476895332336
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.730776846408844
Validation loss = 0.7269940376281738
Validation loss = 0.7282187342643738
Validation loss = 0.7306006550788879
Validation loss = 0.7313124537467957
Validation loss = 0.7312381267547607
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7271650433540344
Validation loss = 0.7306819558143616
Validation loss = 0.7331086993217468
Validation loss = 0.7407421469688416
Validation loss = 0.7377254962921143
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7311010360717773
Validation loss = 0.7299866080284119
Validation loss = 0.7375940680503845
Validation loss = 0.7445384860038757
Validation loss = 0.741053581237793
Validation loss = 0.745185375213623
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.72e+03 |
| Iteration     | 23        |
| MaximumReturn | -1.42e+03 |
| MinimumReturn | -2.85e+03 |
| TotalSamples  | 100000    |
-----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7294489145278931
Validation loss = 0.7249858379364014
Validation loss = 0.7332473993301392
Validation loss = 0.7323411703109741
Validation loss = 0.7315488457679749
Validation loss = 0.7313719391822815
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7230880260467529
Validation loss = 0.721233606338501
Validation loss = 0.7319185137748718
Validation loss = 0.7355998158454895
Validation loss = 0.7359519004821777
Validation loss = 0.7373130917549133
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7216914296150208
Validation loss = 0.7196238040924072
Validation loss = 0.7311294674873352
Validation loss = 0.7303941249847412
Validation loss = 0.7271240949630737
Validation loss = 0.7288665771484375
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.731303870677948
Validation loss = 0.7314122915267944
Validation loss = 0.7354117035865784
Validation loss = 0.7364349365234375
Validation loss = 0.733723521232605
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7323921918869019
Validation loss = 0.729794442653656
Validation loss = 0.7363820672035217
Validation loss = 0.7363788485527039
Validation loss = 0.74135422706604
Validation loss = 0.7350097894668579
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.34e+03 |
| Iteration     | 24        |
| MaximumReturn | -974      |
| MinimumReturn | -1.6e+03  |
| TotalSamples  | 104000    |
-----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7207722067832947
Validation loss = 0.7224212288856506
Validation loss = 0.7280516028404236
Validation loss = 0.7297028303146362
Validation loss = 0.731011152267456
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7249612808227539
Validation loss = 0.7272418141365051
Validation loss = 0.7332581877708435
Validation loss = 0.7356429696083069
Validation loss = 0.733440101146698
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.719140350818634
Validation loss = 0.7196455597877502
Validation loss = 0.7258220314979553
Validation loss = 0.7322801351547241
Validation loss = 0.727380633354187
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7278319001197815
Validation loss = 0.7292028069496155
Validation loss = 0.7302405834197998
Validation loss = 0.7382618188858032
Validation loss = 0.7360193729400635
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7272071838378906
Validation loss = 0.7289224863052368
Validation loss = 0.7338608503341675
Validation loss = 0.7354142665863037
Validation loss = 0.7361882925033569
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.58e+03 |
| Iteration     | 25        |
| MaximumReturn | -1e+03    |
| MinimumReturn | -2.77e+03 |
| TotalSamples  | 108000    |
-----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7179003953933716
Validation loss = 0.7216757535934448
Validation loss = 0.7261673808097839
Validation loss = 0.7295059561729431
Validation loss = 0.7243794202804565
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7265266180038452
Validation loss = 0.7219131588935852
Validation loss = 0.728905975818634
Validation loss = 0.7323542833328247
Validation loss = 0.7313832640647888
Validation loss = 0.7296722531318665
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7142311334609985
Validation loss = 0.7186129689216614
Validation loss = 0.7213357090950012
Validation loss = 0.7260243892669678
Validation loss = 0.7224221229553223
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7277408838272095
Validation loss = 0.7303100228309631
Validation loss = 0.7325826287269592
Validation loss = 0.7349222898483276
Validation loss = 0.733208954334259
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7217957377433777
Validation loss = 0.7280734181404114
Validation loss = 0.7336962819099426
Validation loss = 0.7329638600349426
Validation loss = 0.7327222228050232
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.54e+03 |
| Iteration     | 26        |
| MaximumReturn | -1.47e+03 |
| MinimumReturn | -1.67e+03 |
| TotalSamples  | 112000    |
-----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7212068438529968
Validation loss = 0.7258886694908142
Validation loss = 0.7256227135658264
Validation loss = 0.7247964143753052
Validation loss = 0.730595588684082
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7192813158035278
Validation loss = 0.7232099175453186
Validation loss = 0.7287067174911499
Validation loss = 0.7282498478889465
Validation loss = 0.7299184799194336
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7184848785400391
Validation loss = 0.7229346632957458
Validation loss = 0.7253805994987488
Validation loss = 0.7266077399253845
Validation loss = 0.726722240447998
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7269824147224426
Validation loss = 0.7294949293136597
Validation loss = 0.7296441197395325
Validation loss = 0.7331708073616028
Validation loss = 0.7350103259086609
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7262126207351685
Validation loss = 0.7272772789001465
Validation loss = 0.7282962799072266
Validation loss = 0.7303584218025208
Validation loss = 0.732329785823822
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.37e+03 |
| Iteration     | 27        |
| MaximumReturn | -1.27e+03 |
| MinimumReturn | -1.43e+03 |
| TotalSamples  | 116000    |
-----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7214581966400146
Validation loss = 0.7214329242706299
Validation loss = 0.7217068076133728
Validation loss = 0.7253258228302002
Validation loss = 0.7243764996528625
Validation loss = 0.7264825105667114
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7223800420761108
Validation loss = 0.7219852209091187
Validation loss = 0.7285898327827454
Validation loss = 0.7261900305747986
Validation loss = 0.7320818305015564
Validation loss = 0.7259948253631592
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7171584963798523
Validation loss = 0.7204382419586182
Validation loss = 0.7221060991287231
Validation loss = 0.728164792060852
Validation loss = 0.7295987606048584
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7249680161476135
Validation loss = 0.7274860143661499
Validation loss = 0.7307686805725098
Validation loss = 0.7349712252616882
Validation loss = 0.7295205593109131
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7201529145240784
Validation loss = 0.7258937954902649
Validation loss = 0.7225658893585205
Validation loss = 0.7331594228744507
Validation loss = 0.7341916561126709
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.23e+03 |
| Iteration     | 28        |
| MaximumReturn | -1.09e+03 |
| MinimumReturn | -1.4e+03  |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7183066606521606
Validation loss = 0.7166450023651123
Validation loss = 0.7255827784538269
Validation loss = 0.7268336415290833
Validation loss = 0.7290812134742737
Validation loss = 0.725170910358429
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.719074547290802
Validation loss = 0.7217017412185669
Validation loss = 0.7264114022254944
Validation loss = 0.7279418110847473
Validation loss = 0.72904372215271
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7187944650650024
Validation loss = 0.7212395071983337
Validation loss = 0.7262579798698425
Validation loss = 0.7250658869743347
Validation loss = 0.7303094267845154
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7191088795661926
Validation loss = 0.7294113636016846
Validation loss = 0.7304283380508423
Validation loss = 0.7358938455581665
Validation loss = 0.7354772686958313
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7224041819572449
Validation loss = 0.724628210067749
Validation loss = 0.7325597405433655
Validation loss = 0.7308093309402466
Validation loss = 0.7309957146644592
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.03e+03 |
| Iteration     | 29        |
| MaximumReturn | -558      |
| MinimumReturn | -1.31e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7159562706947327
Validation loss = 0.7205417156219482
Validation loss = 0.7239995002746582
Validation loss = 0.7264657616615295
Validation loss = 0.7238927483558655
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7172717452049255
Validation loss = 0.723400354385376
Validation loss = 0.7282036542892456
Validation loss = 0.7250390648841858
Validation loss = 0.7240025997161865
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7161421179771423
Validation loss = 0.7239729166030884
Validation loss = 0.7247925400733948
Validation loss = 0.726199746131897
Validation loss = 0.7276461720466614
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7227616310119629
Validation loss = 0.726184070110321
Validation loss = 0.7332499027252197
Validation loss = 0.7302024960517883
Validation loss = 0.7326374650001526
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7217722535133362
Validation loss = 0.7244074940681458
Validation loss = 0.728818953037262
Validation loss = 0.7307018041610718
Validation loss = 0.7353761196136475
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.5e+03  |
| Iteration     | 30        |
| MaximumReturn | -1.34e+03 |
| MinimumReturn | -1.62e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7179822325706482
Validation loss = 0.7217100262641907
Validation loss = 0.7284775972366333
Validation loss = 0.7303632497787476
Validation loss = 0.7275744080543518
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7215934991836548
Validation loss = 0.7225401997566223
Validation loss = 0.7292417287826538
Validation loss = 0.7261741757392883
Validation loss = 0.7280136942863464
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7136756181716919
Validation loss = 0.7233198881149292
Validation loss = 0.7249031066894531
Validation loss = 0.7307611703872681
Validation loss = 0.7257460355758667
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7196367383003235
Validation loss = 0.7276700735092163
Validation loss = 0.7295736074447632
Validation loss = 0.7312734127044678
Validation loss = 0.7326798439025879
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7211031913757324
Validation loss = 0.7286535501480103
Validation loss = 0.7324119210243225
Validation loss = 0.7334491014480591
Validation loss = 0.7307589054107666
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.57e+03 |
| Iteration     | 31        |
| MaximumReturn | -1.46e+03 |
| MinimumReturn | -1.66e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7203630805015564
Validation loss = 0.7218403816223145
Validation loss = 0.7222921252250671
Validation loss = 0.7285112738609314
Validation loss = 0.72735196352005
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.718003511428833
Validation loss = 0.7203760147094727
Validation loss = 0.7253631353378296
Validation loss = 0.7327048778533936
Validation loss = 0.7315429449081421
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7182788848876953
Validation loss = 0.722245991230011
Validation loss = 0.7259156703948975
Validation loss = 0.7272918820381165
Validation loss = 0.7279773354530334
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.726025402545929
Validation loss = 0.7287443280220032
Validation loss = 0.732815682888031
Validation loss = 0.730089545249939
Validation loss = 0.7332944869995117
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7229199409484863
Validation loss = 0.7271186113357544
Validation loss = 0.7361863851547241
Validation loss = 0.7306125164031982
Validation loss = 0.7350199818611145
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.29e+03 |
| Iteration     | 32        |
| MaximumReturn | -958      |
| MinimumReturn | -1.61e+03 |
| TotalSamples  | 136000    |
-----------------------------
