Logging to experiments/gym_fswimmer/SO01/Wed-02-Nov-2022-04-25-22-PM-CDT_gym_fswimmer_trpo_iteration_20_seed2631
Print configuration .....
{'env_name': 'gym_fswimmer',
 'random_seeds': [2312, 1231, 2631, 5543],
 'save_variables': False,
 'model_save_dir': '/tmp/gym_fswimmer_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 33,
 'num_path_random': 6,
 'num_path_onpol': 6,
 'env_horizon': 1000,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'intrinsic_reward_only': False,
              'external_reward_evaluation_interval': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False},
 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95},
 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95},
 'algo': 'trpo'}
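The configuration above fixes the outer loop that the rest of this log traces: for each of onpol_iters iterations the run fits the 5-member dynamics ensemble, trains the policy with TRPO against the learned models, collects num_path_onpol = 6 on-policy paths of env_horizon = 1000 steps, and refreshes the data normalization. A minimal Python sketch of that loop follows; run_experiment and the callables passed into it (fit_dynamics, train_policy_trpo, rollout, update_normalization) are hypothetical stand-ins, not this repository's actual API.

    def run_experiment(config, env, dynamics_ensemble, policy,
                       fit_dynamics, train_policy_trpo, rollout, update_normalization):
        # Warm-up data from purely random behaviour ("Generating random rollouts." below).
        data = rollout(env, policy=None, num_paths=config['num_path_random'],
                       horizon=config['env_horizon'])
        for itr in range(config['start_onpol_iter'], config['onpol_iters']):
            print(f'itr #{itr} | ')
            fit_dynamics(dynamics_ensemble, data)                  # "Fitting dynamics."
            train_policy_trpo(policy, dynamics_ensemble, config['trpo'])
            new_paths = rollout(env, policy, num_paths=config['num_path_onpol'],
                                horizon=config['env_horizon'])     # on-policy rollouts
            data = data + new_paths
            update_normalization(data)                             # "Updating normalization."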
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
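The six random paths are generated with uniformly sampled actions. A sketch under assumed names (sample_random_rollouts is not from this codebase, and the classic 4-tuple gym step signature is assumed) shows where the cumulative "total_timesteps" counter in the lines above comes from:

    def sample_random_rollouts(env, num_paths=6, horizon=1000):
        paths, total_timesteps = [], 0
        for path_idx in range(num_paths):
            print(f'Path {path_idx} | total_timesteps {total_timesteps}.')
            obs = env.reset()
            observations, actions, next_observations, rewards = [], [], [], []
            for _ in range(horizon):
                act = env.action_space.sample()           # purely random exploration
                next_obs, rew, done, _ = env.step(act)    # classic gym 4-tuple (assumed)
                observations.append(obs)
                actions.append(act)
                next_observations.append(next_obs)
                rewards.append(rew)
                obs = next_obs
                if done:
                    break
            total_timesteps += len(actions)
            paths.append({'observations': observations, 'actions': actions,
                          'next_observations': next_observations, 'rewards': rewards})
        return paths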
Creating normalization for training data.
Done creating normalization for training data.
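What the normalization actually stores is not shown in the log; a common choice in model-based RL, sketched here as an assumption, is the mean and standard deviation of observations, actions and state deltas over the collected paths:

    import numpy as np

    def compute_normalization(paths, eps=1e-6):
        obs    = np.concatenate([np.asarray(p['observations']) for p in paths])
        acts   = np.concatenate([np.asarray(p['actions']) for p in paths])
        deltas = np.concatenate([np.asarray(p['next_observations']) -
                                 np.asarray(p['observations']) for p in paths])
        stats = {}
        for name, arr in (('obs', obs), ('act', acts), ('delta', deltas)):
            stats[name] = {'mean': arr.mean(axis=0), 'std': arr.std(axis=0) + eps}
        return stats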
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized
Train dynamics model with intrinsic reward only? False
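Per the configuration, each ensemble member is a 4-layer, 1000-unit relu network mapping a (state, action) pair to a predicted state delta. The actual model.dynamics.NNDynamicsModel interface is not shown here; the sketch below is a minimal standalone equivalent written with torch, purely for illustration:

    import torch.nn as nn

    def make_dynamics_net(obs_dim, act_dim, n_layers=4, hidden_size=1000):
        layers, in_dim = [], obs_dim + act_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_size), nn.ReLU()]
            in_dim = hidden_size
        layers.append(nn.Linear(in_dim, obs_dim))   # predicts the (normalized) state delta
        return nn.Sequential(*layers)

    def make_ensemble(obs_dim, act_dim, ensemble_model_count=5):
        return [make_dynamics_net(obs_dim, act_dim) for _ in range(ensemble_model_count)]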
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7963378429412842
Validation loss = 0.4122346043586731
Validation loss = 0.3554103374481201
Validation loss = 0.33730417490005493
Validation loss = 0.33211344480514526
Validation loss = 0.32953065633773804
Validation loss = 0.32484176754951477
Validation loss = 0.32614296674728394
Validation loss = 0.33761322498321533
Validation loss = 0.3374238610267639
Validation loss = 0.3436833620071411
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6740789413452148
Validation loss = 0.40969598293304443
Validation loss = 0.35091543197631836
Validation loss = 0.34091633558273315
Validation loss = 0.3265092670917511
Validation loss = 0.3274047076702118
Validation loss = 0.33402830362319946
Validation loss = 0.32896822690963745
Validation loss = 0.3347166180610657
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6457765102386475
Validation loss = 0.40526527166366577
Validation loss = 0.3512849807739258
Validation loss = 0.3342720866203308
Validation loss = 0.32876601815223694
Validation loss = 0.3274717330932617
Validation loss = 0.33759433031082153
Validation loss = 0.32675158977508545
Validation loss = 0.3403165936470032
Validation loss = 0.3430975079536438
Validation loss = 0.3374549150466919
Validation loss = 0.35651683807373047
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.750724732875824
Validation loss = 0.4202612042427063
Validation loss = 0.35630249977111816
Validation loss = 0.34156346321105957
Validation loss = 0.327551007270813
Validation loss = 0.3272858262062073
Validation loss = 0.3288739323616028
Validation loss = 0.334175705909729
Validation loss = 0.3283134698867798
Validation loss = 0.34441396594047546
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.687274694442749
Validation loss = 0.41053566336631775
Validation loss = 0.35145220160484314
Validation loss = 0.33459436893463135
Validation loss = 0.33156511187553406
Validation loss = 0.32800352573394775
Validation loss = 0.32983630895614624
Validation loss = 0.34519967436790466
Validation loss = 0.346401572227478
Validation loss = 0.34163665771484375
Done fitting dynamics.
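Each ensemble member prints one validation loss per epoch and stops well before the configured 200 epochs, which suggests validation-based early stopping. The stopping rule below (patience on the best validation loss) is an assumption rather than something the log states; the sketch only mirrors the printed output:

    import copy
    import torch
    import torch.nn.functional as F

    def fit_one_model(model, optimizer, train_loader, x_val, y_val,
                      max_epochs=200, patience=5):
        best_loss, best_state, bad_epochs = float('inf'), None, 0
        for _ in range(max_epochs):
            model.train()
            for x_batch, y_batch in train_loader:
                optimizer.zero_grad()
                loss = F.mse_loss(model(x_batch), y_batch)
                loss.backward()
                optimizer.step()
            model.eval()
            with torch.no_grad():
                val_loss = F.mse_loss(model(x_val), y_val).item()
            print(f'Validation loss = {val_loss}')
            if val_loss < best_loss:
                best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        model.load_state_dict(best_state)
        return best_loss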
Updating randomness.
Done updating randomness.
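"Updating randomness" is not explained by the log. One plausible reading, given enable_particle_ensemble = True and particles = 5, is that the per-particle assignment of ensemble members is redrawn once per outer iteration; the snippet below is only a guess at that behaviour, not the project's code:

    import numpy as np

    def update_randomness(num_particles=5, ensemble_model_count=5, rng=None):
        rng = rng or np.random.default_rng()
        # one ensemble-member index per imagined particle, redrawn each outer iteration
        return rng.integers(low=0, high=ensemble_model_count, size=num_particles)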
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
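The 20 "Obtaining samples" lines correspond to trpo.iterations = 20 rounds of model-based policy optimization: each round gathers roughly batch_size = 50000 samples of horizon 200 from the learned dynamics and applies one trust-region update with gamma = 0.99, GAE lambda = 0.95 and KL step size 0.01. Outline only, with the sampler and update step passed in as hypothetical callables:

    def train_policy_on_model(policy, ensemble, trpo_cfg, sample_from_model, trpo_update):
        # sample_from_model and trpo_update are caller-supplied stand-ins for the
        # project's own model-based sampler and TRPO step.
        for it in range(trpo_cfg['iterations']):                        # 20 in this run
            print(f'Obtaining samples for iteration {it}...')
            paths = sample_from_model(policy, ensemble,
                                      horizon=trpo_cfg['horizon'],          # 200
                                      min_samples=trpo_cfg['batch_size'])   # 50000
            trpo_update(policy, paths,
                        gamma=trpo_cfg['gamma'],        # 0.99
                        gae_lambda=trpo_cfg['gae'],     # 0.95
                        max_kl=trpo_cfg['step_size'])   # 0.01
        print('Done training policy.')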
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -64      |
| Iteration     | 0        |
| MaximumReturn | -60.6    |
| MinimumReturn | -66.8    |
| TotalSamples  | 8000     |
----------------------------
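The return statistics in each table are simple summaries of the six freshly collected on-policy paths; TotalSamples is a cumulative environment-sample counter maintained by the training loop and is not recomputed here. A small sketch of the per-iteration summary (formatting approximated):

    import numpy as np

    def log_return_stats(paths, iteration):
        returns = np.array([np.sum(p['rewards']) for p in paths])
        print('----------------------------')
        print(f'| AverageReturn | {returns.mean():<8.3g} |')
        print(f'| Iteration     | {iteration:<8d} |')
        print(f'| MaximumReturn | {returns.max():<8.3g} |')
        print(f'| MinimumReturn | {returns.min():<8.3g} |')
        print('----------------------------')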
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29452529549598694
Validation loss = 0.2373015433549881
Validation loss = 0.2315608561038971
Validation loss = 0.23334932327270508
Validation loss = 0.23745308816432953
Validation loss = 0.2318378984928131
Validation loss = 0.23708440363407135
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2967437207698822
Validation loss = 0.23464149236679077
Validation loss = 0.2313031554222107
Validation loss = 0.23182062804698944
Validation loss = 0.2263399362564087
Validation loss = 0.23468643426895142
Validation loss = 0.231187641620636
Validation loss = 0.23394057154655457
Validation loss = 0.23409023880958557
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30387675762176514
Validation loss = 0.2382361739873886
Validation loss = 0.2383325845003128
Validation loss = 0.23171217739582062
Validation loss = 0.2333662509918213
Validation loss = 0.23797345161437988
Validation loss = 0.23289164900779724
Validation loss = 0.24038681387901306
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3138350248336792
Validation loss = 0.24036924540996552
Validation loss = 0.23129600286483765
Validation loss = 0.23179064691066742
Validation loss = 0.2290758192539215
Validation loss = 0.2296299785375595
Validation loss = 0.23174244165420532
Validation loss = 0.22955955564975739
Validation loss = 0.23513086140155792
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29995715618133545
Validation loss = 0.23610740900039673
Validation loss = 0.22759538888931274
Validation loss = 0.23166146874427795
Validation loss = 0.22910144925117493
Validation loss = 0.22959211468696594
Validation loss = 0.2309565544128418
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.14     |
| Iteration     | 1        |
| MaximumReturn | 20.9     |
| MinimumReturn | -8.17    |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24763862788677216
Validation loss = 0.23879851400852203
Validation loss = 0.24195091426372528
Validation loss = 0.248046413064003
Validation loss = 0.24443574249744415
Validation loss = 0.24655795097351074
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2532396614551544
Validation loss = 0.2403538078069687
Validation loss = 0.23957706987857819
Validation loss = 0.2429426908493042
Validation loss = 0.24272428452968597
Validation loss = 0.25277915596961975
Validation loss = 0.24419696629047394
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25020378828048706
Validation loss = 0.2487642765045166
Validation loss = 0.24820949137210846
Validation loss = 0.24906104803085327
Validation loss = 0.2621370255947113
Validation loss = 0.24935883283615112
Validation loss = 0.2631068527698517
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2524760663509369
Validation loss = 0.24895079433918
Validation loss = 0.24099957942962646
Validation loss = 0.2447466105222702
Validation loss = 0.25424253940582275
Validation loss = 0.2584184408187866
Validation loss = 0.25181806087493896
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2497193068265915
Validation loss = 0.24477450549602509
Validation loss = 0.23687727749347687
Validation loss = 0.24408073723316193
Validation loss = 0.24394209682941437
Validation loss = 0.2445259541273117
Validation loss = 0.2509356141090393
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.31    |
| Iteration     | 2        |
| MaximumReturn | 21       |
| MinimumReturn | -16.9    |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2728523015975952
Validation loss = 0.27673614025115967
Validation loss = 0.27975544333457947
Validation loss = 0.27374204993247986
Validation loss = 0.28338494896888733
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.27211087942123413
Validation loss = 0.2733441889286041
Validation loss = 0.2798894941806793
Validation loss = 0.28628408908843994
Validation loss = 0.29243209958076477
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2786102294921875
Validation loss = 0.28813058137893677
Validation loss = 0.2845938205718994
Validation loss = 0.29414820671081543
Validation loss = 0.29065370559692383
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2709587514400482
Validation loss = 0.2677173316478729
Validation loss = 0.2767219543457031
Validation loss = 0.2811795771121979
Validation loss = 0.2911328971385956
Validation loss = 0.2920348048210144
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2720118463039398
Validation loss = 0.28034788370132446
Validation loss = 0.2703419625759125
Validation loss = 0.2758590579032898
Validation loss = 0.2849442958831787
Validation loss = 0.2881978154182434
Validation loss = 0.2951492369174957
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 9.92     |
| Iteration     | 3        |
| MaximumReturn | 28.8     |
| MinimumReturn | -12.6    |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2740451395511627
Validation loss = 0.27940115332603455
Validation loss = 0.28396672010421753
Validation loss = 0.288849413394928
Validation loss = 0.2860650420188904
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.27248328924179077
Validation loss = 0.27961212396621704
Validation loss = 0.2791939377784729
Validation loss = 0.2852047383785248
Validation loss = 0.28300702571868896
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29043570160865784
Validation loss = 0.28038060665130615
Validation loss = 0.28447282314300537
Validation loss = 0.2915337085723877
Validation loss = 0.29972851276397705
Validation loss = 0.2929995059967041
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2766132652759552
Validation loss = 0.28361377120018005
Validation loss = 0.28387224674224854
Validation loss = 0.2939978241920471
Validation loss = 0.2891901135444641
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28165021538734436
Validation loss = 0.2776375710964203
Validation loss = 0.2833513617515564
Validation loss = 0.2889503538608551
Validation loss = 0.29825150966644287
Validation loss = 0.2890973687171936
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.58     |
| Iteration     | 4        |
| MaximumReturn | 8.82     |
| MinimumReturn | -3.98    |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2851978540420532
Validation loss = 0.29196977615356445
Validation loss = 0.28939804434776306
Validation loss = 0.2978263199329376
Validation loss = 0.29519590735435486
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2836703062057495
Validation loss = 0.2867638170719147
Validation loss = 0.2887350022792816
Validation loss = 0.29761746525764465
Validation loss = 0.2949143946170807
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.28850796818733215
Validation loss = 0.2858271598815918
Validation loss = 0.2920167148113251
Validation loss = 0.3007323443889618
Validation loss = 0.29567644000053406
Validation loss = 0.3010075092315674
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2890298366546631
Validation loss = 0.28826484084129333
Validation loss = 0.2933882772922516
Validation loss = 0.30303773283958435
Validation loss = 0.2948608696460724
Validation loss = 0.3001956045627594
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28641578555107117
Validation loss = 0.2966438829898834
Validation loss = 0.293317973613739
Validation loss = 0.2986867129802704
Validation loss = 0.2981492578983307
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 16.6     |
| Iteration     | 5        |
| MaximumReturn | 28.4     |
| MinimumReturn | 1.15     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29647496342658997
Validation loss = 0.30904683470726013
Validation loss = 0.29791519045829773
Validation loss = 0.30499759316444397
Validation loss = 0.30249762535095215
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29079702496528625
Validation loss = 0.2967044413089752
Validation loss = 0.2956976294517517
Validation loss = 0.29961279034614563
Validation loss = 0.2966204285621643
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29999491572380066
Validation loss = 0.29669898748397827
Validation loss = 0.3031623661518097
Validation loss = 0.30414220690727234
Validation loss = 0.3071133494377136
Validation loss = 0.31823235750198364
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2955654263496399
Validation loss = 0.2954845726490021
Validation loss = 0.29822173714637756
Validation loss = 0.30346089601516724
Validation loss = 0.30857232213020325
Validation loss = 0.3035069406032562
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2991180419921875
Validation loss = 0.30302953720092773
Validation loss = 0.3027988374233246
Validation loss = 0.3060855269432068
Validation loss = 0.30383560061454773
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 18.4     |
| Iteration     | 6        |
| MaximumReturn | 26.8     |
| MinimumReturn | 1.12     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30056196451187134
Validation loss = 0.30324751138687134
Validation loss = 0.2984481751918793
Validation loss = 0.3067746162414551
Validation loss = 0.30628132820129395
Validation loss = 0.3051837682723999
Validation loss = 0.3122265338897705
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2975693643093109
Validation loss = 0.29346925020217896
Validation loss = 0.29890552163124084
Validation loss = 0.3012601137161255
Validation loss = 0.30711209774017334
Validation loss = 0.30861717462539673
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3040582239627838
Validation loss = 0.30253157019615173
Validation loss = 0.307552307844162
Validation loss = 0.30725347995758057
Validation loss = 0.31432992219924927
Validation loss = 0.30935347080230713
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3052985668182373
Validation loss = 0.29992833733558655
Validation loss = 0.30350473523139954
Validation loss = 0.30808335542678833
Validation loss = 0.31365370750427246
Validation loss = 0.31747448444366455
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30534446239471436
Validation loss = 0.3002382218837738
Validation loss = 0.3052464723587036
Validation loss = 0.3100377023220062
Validation loss = 0.3132089376449585
Validation loss = 0.31590601801872253
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 10.1     |
| Iteration     | 7        |
| MaximumReturn | 15       |
| MinimumReturn | 3.33     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30921170115470886
Validation loss = 0.3069509267807007
Validation loss = 0.3115263283252716
Validation loss = 0.313960999250412
Validation loss = 0.3142293691635132
Validation loss = 0.3137780725955963
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3026089668273926
Validation loss = 0.3019121289253235
Validation loss = 0.3061803877353668
Validation loss = 0.30494698882102966
Validation loss = 0.3089447617530823
Validation loss = 0.3129009008407593
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30643773078918457
Validation loss = 0.31205207109451294
Validation loss = 0.31157127022743225
Validation loss = 0.3139699101448059
Validation loss = 0.3150373697280884
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30448269844055176
Validation loss = 0.3112429678440094
Validation loss = 0.3094332218170166
Validation loss = 0.31207019090652466
Validation loss = 0.31441935896873474
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30635005235671997
Validation loss = 0.3092459738254547
Validation loss = 0.3143368661403656
Validation loss = 0.3071800172328949
Validation loss = 0.31330764293670654
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 9.29     |
| Iteration     | 8        |
| MaximumReturn | 16.1     |
| MinimumReturn | -0.00624 |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3131493926048279
Validation loss = 0.3107360005378723
Validation loss = 0.31270813941955566
Validation loss = 0.3124375343322754
Validation loss = 0.3197086453437805
Validation loss = 0.3199825584888458
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3071843087673187
Validation loss = 0.3105190694332123
Validation loss = 0.3122187554836273
Validation loss = 0.31037405133247375
Validation loss = 0.3170773386955261
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30940574407577515
Validation loss = 0.3138652443885803
Validation loss = 0.3131921887397766
Validation loss = 0.3151579797267914
Validation loss = 0.31782034039497375
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31206122040748596
Validation loss = 0.31099632382392883
Validation loss = 0.31417667865753174
Validation loss = 0.3155568838119507
Validation loss = 0.32056570053100586
Validation loss = 0.3242444694042206
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30784517526626587
Validation loss = 0.3083884119987488
Validation loss = 0.3140774965286255
Validation loss = 0.3153635561466217
Validation loss = 0.31612980365753174
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 10.3     |
| Iteration     | 9        |
| MaximumReturn | 16.5     |
| MinimumReturn | 0.539    |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31759822368621826
Validation loss = 0.3202582597732544
Validation loss = 0.3192467987537384
Validation loss = 0.3185310661792755
Validation loss = 0.32236289978027344
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31093981862068176
Validation loss = 0.31014484167099
Validation loss = 0.3161301910877228
Validation loss = 0.3158232569694519
Validation loss = 0.3195875883102417
Validation loss = 0.3164248764514923
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3135899007320404
Validation loss = 0.3143382668495178
Validation loss = 0.3172549307346344
Validation loss = 0.316794216632843
Validation loss = 0.31859657168388367
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3143961727619171
Validation loss = 0.31967851519584656
Validation loss = 0.31879615783691406
Validation loss = 0.3203105330467224
Validation loss = 0.3289516866207123
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31328895688056946
Validation loss = 0.3159319758415222
Validation loss = 0.3182244300842285
Validation loss = 0.31796911358833313
Validation loss = 0.3216734230518341
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 27       |
| Iteration     | 10       |
| MaximumReturn | 33.3     |
| MinimumReturn | 19.3     |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31566479802131653
Validation loss = 0.3190340995788574
Validation loss = 0.32061755657196045
Validation loss = 0.3259344696998596
Validation loss = 0.3272632956504822
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3140285015106201
Validation loss = 0.3122542202472687
Validation loss = 0.3144187927246094
Validation loss = 0.32177600264549255
Validation loss = 0.32438576221466064
Validation loss = 0.3227216899394989
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31657537817955017
Validation loss = 0.31395551562309265
Validation loss = 0.3187962472438812
Validation loss = 0.319561630487442
Validation loss = 0.322456955909729
Validation loss = 0.32799842953681946
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31559234857559204
Validation loss = 0.3173713684082031
Validation loss = 0.31690654158592224
Validation loss = 0.3199269771575928
Validation loss = 0.32552823424339294
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3193035125732422
Validation loss = 0.31301864981651306
Validation loss = 0.3141520917415619
Validation loss = 0.31821194291114807
Validation loss = 0.32406505942344666
Validation loss = 0.3230398893356323
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 36       |
| Iteration     | 11       |
| MaximumReturn | 41.9     |
| MinimumReturn | 31.1     |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3147856891155243
Validation loss = 0.31616300344467163
Validation loss = 0.31859689950942993
Validation loss = 0.324677050113678
Validation loss = 0.3268193006515503
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3170273005962372
Validation loss = 0.3148609399795532
Validation loss = 0.3221014738082886
Validation loss = 0.3236437141895294
Validation loss = 0.32538750767707825
Validation loss = 0.3277219235897064
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3166990578174591
Validation loss = 0.31541940569877625
Validation loss = 0.31985214352607727
Validation loss = 0.3239469528198242
Validation loss = 0.3276100158691406
Validation loss = 0.32622072100639343
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3180742561817169
Validation loss = 0.3180365264415741
Validation loss = 0.32435479760169983
Validation loss = 0.3270410895347595
Validation loss = 0.3323390781879425
Validation loss = 0.3318389058113098
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3150958716869354
Validation loss = 0.31763818860054016
Validation loss = 0.3191938102245331
Validation loss = 0.32023414969444275
Validation loss = 0.3236926794052124
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 19.5     |
| Iteration     | 12       |
| MaximumReturn | 29.1     |
| MinimumReturn | 3.49     |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3212045133113861
Validation loss = 0.3266896903514862
Validation loss = 0.32962319254875183
Validation loss = 0.33073002099990845
Validation loss = 0.3358043134212494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3260252773761749
Validation loss = 0.3242199420928955
Validation loss = 0.33435681462287903
Validation loss = 0.33959126472473145
Validation loss = 0.3355798125267029
Validation loss = 0.34212830662727356
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32888534665107727
Validation loss = 0.3263189196586609
Validation loss = 0.3323504328727722
Validation loss = 0.3361647427082062
Validation loss = 0.34312185645103455
Validation loss = 0.34542086720466614
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32624688744544983
Validation loss = 0.326887309551239
Validation loss = 0.333542138338089
Validation loss = 0.33983349800109863
Validation loss = 0.3358791768550873
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3191893994808197
Validation loss = 0.3257383406162262
Validation loss = 0.3274647891521454
Validation loss = 0.3281913697719574
Validation loss = 0.33558669686317444
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 33.8     |
| Iteration     | 13       |
| MaximumReturn | 45.2     |
| MinimumReturn | 22.2     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32744112610816956
Validation loss = 0.33173853158950806
Validation loss = 0.3334774374961853
Validation loss = 0.3355596363544464
Validation loss = 0.34576550126075745
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3317452371120453
Validation loss = 0.3397723436355591
Validation loss = 0.33924955129623413
Validation loss = 0.34245505928993225
Validation loss = 0.3499068319797516
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33510345220565796
Validation loss = 0.34388267993927
Validation loss = 0.3420332670211792
Validation loss = 0.34971901774406433
Validation loss = 0.3475643992424011
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3277412950992584
Validation loss = 0.3355836272239685
Validation loss = 0.3366183638572693
Validation loss = 0.33774977922439575
Validation loss = 0.34382864832878113
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3248685896396637
Validation loss = 0.3277546465396881
Validation loss = 0.3389008641242981
Validation loss = 0.34220677614212036
Validation loss = 0.34389379620552063
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 41.7     |
| Iteration     | 14       |
| MaximumReturn | 51.9     |
| MinimumReturn | 35.2     |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3340071439743042
Validation loss = 0.33594799041748047
Validation loss = 0.34218478202819824
Validation loss = 0.34510087966918945
Validation loss = 0.34911710023880005
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3361564874649048
Validation loss = 0.33868682384490967
Validation loss = 0.3433570861816406
Validation loss = 0.3465563654899597
Validation loss = 0.35063087940216064
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3377293348312378
Validation loss = 0.3464951813220978
Validation loss = 0.3442569673061371
Validation loss = 0.34693658351898193
Validation loss = 0.35177502036094666
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33859413862228394
Validation loss = 0.33828917145729065
Validation loss = 0.34133774042129517
Validation loss = 0.3479275107383728
Validation loss = 0.34691253304481506
Validation loss = 0.35332000255584717
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3316752016544342
Validation loss = 0.3409464955329895
Validation loss = 0.34080740809440613
Validation loss = 0.3449121117591858
Validation loss = 0.3487498164176941
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 32.1     |
| Iteration     | 15       |
| MaximumReturn | 41.7     |
| MinimumReturn | 26.8     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3420843482017517
Validation loss = 0.3452194035053253
Validation loss = 0.3465605676174164
Validation loss = 0.3539890944957733
Validation loss = 0.35637176036834717
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3455485701560974
Validation loss = 0.34979188442230225
Validation loss = 0.3545618951320648
Validation loss = 0.3540199100971222
Validation loss = 0.36012065410614014
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3422154188156128
Validation loss = 0.34754282236099243
Validation loss = 0.351028174161911
Validation loss = 0.35607820749282837
Validation loss = 0.3588133454322815
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34261372685432434
Validation loss = 0.34809422492980957
Validation loss = 0.35328131914138794
Validation loss = 0.3565340042114258
Validation loss = 0.35909226536750793
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3413674831390381
Validation loss = 0.34430766105651855
Validation loss = 0.34895089268684387
Validation loss = 0.35499370098114014
Validation loss = 0.3589169979095459
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 24.6     |
| Iteration     | 16       |
| MaximumReturn | 41.7     |
| MinimumReturn | 14.9     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3499301075935364
Validation loss = 0.35172417759895325
Validation loss = 0.35559484362602234
Validation loss = 0.35746681690216064
Validation loss = 0.36463049054145813
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3525717854499817
Validation loss = 0.35397371649742126
Validation loss = 0.3604981601238251
Validation loss = 0.3669077754020691
Validation loss = 0.37218010425567627
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.35482949018478394
Validation loss = 0.35818639397621155
Validation loss = 0.35917332768440247
Validation loss = 0.36407604813575745
Validation loss = 0.36373865604400635
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35282090306282043
Validation loss = 0.35521233081817627
Validation loss = 0.35808664560317993
Validation loss = 0.36486950516700745
Validation loss = 0.3652988076210022
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.35287386178970337
Validation loss = 0.35330596566200256
Validation loss = 0.3559795320034027
Validation loss = 0.3646928071975708
Validation loss = 0.36313536763191223
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 21.7     |
| Iteration     | 17       |
| MaximumReturn | 35       |
| MinimumReturn | 8.91     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.36011767387390137
Validation loss = 0.36031022667884827
Validation loss = 0.3668838441371918
Validation loss = 0.37216928601264954
Validation loss = 0.3711543679237366
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3585011661052704
Validation loss = 0.3592498302459717
Validation loss = 0.3646294176578522
Validation loss = 0.36937570571899414
Validation loss = 0.37310901284217834
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3601493835449219
Validation loss = 0.3610249161720276
Validation loss = 0.3654990792274475
Validation loss = 0.36919698119163513
Validation loss = 0.37089699506759644
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.36248740553855896
Validation loss = 0.3644079566001892
Validation loss = 0.36738571524620056
Validation loss = 0.36926451325416565
Validation loss = 0.37264713644981384
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3580302894115448
Validation loss = 0.3619995415210724
Validation loss = 0.36685702204704285
Validation loss = 0.3674588203430176
Validation loss = 0.3717198669910431
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 24       |
| Iteration     | 18       |
| MaximumReturn | 35.5     |
| MinimumReturn | 6.87     |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.36540645360946655
Validation loss = 0.36764344573020935
Validation loss = 0.3704003691673279
Validation loss = 0.3746096193790436
Validation loss = 0.37696999311447144
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3692658841609955
Validation loss = 0.36657652258872986
Validation loss = 0.37258028984069824
Validation loss = 0.3748261332511902
Validation loss = 0.37638741731643677
Validation loss = 0.37710103392601013
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3719233572483063
Validation loss = 0.36923083662986755
Validation loss = 0.37641000747680664
Validation loss = 0.37697720527648926
Validation loss = 0.3783131539821625
Validation loss = 0.37982746958732605
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.36746707558631897
Validation loss = 0.37058526277542114
Validation loss = 0.37583601474761963
Validation loss = 0.37452903389930725
Validation loss = 0.3791534900665283
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3674045205116272
Validation loss = 0.3711117208003998
Validation loss = 0.3738449811935425
Validation loss = 0.37421488761901855
Validation loss = 0.3770802915096283
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 22       |
| Iteration     | 19       |
| MaximumReturn | 32.4     |
| MinimumReturn | 3.75     |
| TotalSamples  | 84000    |
----------------------------
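
Tying the pieces together, each "itr #N" block follows the same outer loop: refit the dynamics ensemble on all data collected so far, refresh sampling randomness, retrain the policy against the learned model, roll the new policy out in the real environment, and fold those paths back into the dataset and its input normalization. A condensed sketch of that loop, reusing the hypothetical helpers sketched above (the dataset interface here is also an assumption):

    def run_outer_loop(env, policy, models, dataset, sample_fn, update_fn,
                       n_outer_iters):
        for itr in range(n_outer_iters):
            print("itr #%d | " % itr)
            fit_ensemble(models, dataset.train, dataset.val)
            # ("Updating randomness." in the log likely reseeds sampling noise.)
            train_policy_trpo(policy, models, sample_fn, update_fn)
            paths = generate_rollouts(env, policy)
            dataset.add(paths)                  # assumed dataset interface
            dataset.update_normalization()      # assumed dataset interface
            returns = [sum(step[2] for step in p) for p in paths]
            log_iteration_stats(itr, returns, dataset.total_samples)
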
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.37126293778419495
Validation loss = 0.37659746408462524
Validation loss = 0.3773341178894043
Validation loss = 0.3809641897678375
Validation loss = 0.38106417655944824
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3724552094936371
Validation loss = 0.3716651201248169
Validation loss = 0.38109925389289856
Validation loss = 0.3813942074775696
Validation loss = 0.38464346528053284
Validation loss = 0.3851030468940735
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3739316463470459
Validation loss = 0.3752133548259735
Validation loss = 0.3820936977863312
Validation loss = 0.38294896483421326
Validation loss = 0.38464075326919556
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.37649229168891907
Validation loss = 0.3754085898399353
Validation loss = 0.3809339106082916
Validation loss = 0.3847442865371704
Validation loss = 0.3876170516014099
Validation loss = 0.3853684365749359
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37351664900779724
Validation loss = 0.37654346227645874
Validation loss = 0.3782654404640198
Validation loss = 0.3817121386528015
Validation loss = 0.3852640390396118
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.6     |
| Iteration     | 20       |
| MaximumReturn | 32.2     |
| MinimumReturn | -9.28    |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3794384002685547
Validation loss = 0.379412442445755
Validation loss = 0.38398391008377075
Validation loss = 0.3842892646789551
Validation loss = 0.38649922609329224
Validation loss = 0.3894536793231964
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3774273991584778
Validation loss = 0.38021907210350037
Validation loss = 0.38080736994743347
Validation loss = 0.38703596591949463
Validation loss = 0.3907693028450012
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3790324032306671
Validation loss = 0.3836226463317871
Validation loss = 0.3850599527359009
Validation loss = 0.38967007398605347
Validation loss = 0.3879484236240387
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3783490061759949
Validation loss = 0.38279998302459717
Validation loss = 0.3889123797416687
Validation loss = 0.3886682391166687
Validation loss = 0.388296902179718
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.382131963968277
Validation loss = 0.38120919466018677
Validation loss = 0.38324907422065735
Validation loss = 0.38528287410736084
Validation loss = 0.38795098662376404
Validation loss = 0.38817188143730164
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 6.13     |
| Iteration     | 21       |
| MaximumReturn | 26.1     |
| MinimumReturn | -6.95    |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38363370299339294
Validation loss = 0.3836071491241455
Validation loss = 0.3878120481967926
Validation loss = 0.3888818919658661
Validation loss = 0.39173656702041626
Validation loss = 0.39300060272216797
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38360491394996643
Validation loss = 0.38669735193252563
Validation loss = 0.3862055838108063
Validation loss = 0.3901880979537964
Validation loss = 0.3883605897426605
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3779159486293793
Validation loss = 0.3840647041797638
Validation loss = 0.3868825435638428
Validation loss = 0.3875013291835785
Validation loss = 0.39005860686302185
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3791208267211914
Validation loss = 0.38514381647109985
Validation loss = 0.3858910799026489
Validation loss = 0.39104291796684265
Validation loss = 0.38688868284225464
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38157930970191956
Validation loss = 0.3823724687099457
Validation loss = 0.3866305947303772
Validation loss = 0.3907073140144348
Validation loss = 0.3917034864425659
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.67     |
| Iteration     | 22       |
| MaximumReturn | 14.4     |
| MinimumReturn | -2.76    |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3872304856777191
Validation loss = 0.3875654637813568
Validation loss = 0.3913630545139313
Validation loss = 0.39391326904296875
Validation loss = 0.39359068870544434
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3867826461791992
Validation loss = 0.38787126541137695
Validation loss = 0.3919009268283844
Validation loss = 0.3928121328353882
Validation loss = 0.39558693766593933
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3870859146118164
Validation loss = 0.3879367411136627
Validation loss = 0.3900664150714874
Validation loss = 0.39228758215904236
Validation loss = 0.39214444160461426
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38624534010887146
Validation loss = 0.3902052938938141
Validation loss = 0.392097145318985
Validation loss = 0.39479002356529236
Validation loss = 0.39768004417419434
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3857412338256836
Validation loss = 0.38866379857063293
Validation loss = 0.3944748342037201
Validation loss = 0.3924272060394287
Validation loss = 0.3952638804912567
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.54     |
| Iteration     | 23       |
| MaximumReturn | 16.7     |
| MinimumReturn | -5.69    |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3944271504878998
Validation loss = 0.3926927149295807
Validation loss = 0.39553210139274597
Validation loss = 0.396036297082901
Validation loss = 0.3989448845386505
Validation loss = 0.3996826410293579
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3922722339630127
Validation loss = 0.3944074213504791
Validation loss = 0.39512526988983154
Validation loss = 0.3941279649734497
Validation loss = 0.39888209104537964
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.39275386929512024
Validation loss = 0.3927353620529175
Validation loss = 0.39646872878074646
Validation loss = 0.39926016330718994
Validation loss = 0.3982057273387909
Validation loss = 0.39984631538391113
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3890552818775177
Validation loss = 0.3915365934371948
Validation loss = 0.39338573813438416
Validation loss = 0.3966653048992157
Validation loss = 0.3999629616737366
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39368969202041626
Validation loss = 0.3925626277923584
Validation loss = 0.39876115322113037
Validation loss = 0.3983020484447479
Validation loss = 0.3998197317123413
Validation loss = 0.4006942808628082
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.6      |
| Iteration     | 24       |
| MaximumReturn | 25.5     |
| MinimumReturn | -8.39    |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39050695300102234
Validation loss = 0.3945040702819824
Validation loss = 0.3952387273311615
Validation loss = 0.4025220572948456
Validation loss = 0.39958620071411133
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39096757769584656
Validation loss = 0.3916109502315521
Validation loss = 0.39559924602508545
Validation loss = 0.3971576392650604
Validation loss = 0.40094268321990967
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3904819190502167
Validation loss = 0.39376217126846313
Validation loss = 0.3983318507671356
Validation loss = 0.3995722234249115
Validation loss = 0.4004526436328888
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3929758369922638
Validation loss = 0.3922995328903198
Validation loss = 0.3968086540699005
Validation loss = 0.3976691663265228
Validation loss = 0.3980085551738739
Validation loss = 0.4006405174732208
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39218756556510925
Validation loss = 0.3936130702495575
Validation loss = 0.3986448049545288
Validation loss = 0.3977970480918884
Validation loss = 0.39830201864242554
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.792   |
| Iteration     | 25       |
| MaximumReturn | 13.7     |
| MinimumReturn | -14.1    |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.393367201089859
Validation loss = 0.39579901099205017
Validation loss = 0.39730098843574524
Validation loss = 0.39995333552360535
Validation loss = 0.39963021874427795
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.394125759601593
Validation loss = 0.3940666913986206
Validation loss = 0.39845725893974304
Validation loss = 0.3993426561355591
Validation loss = 0.3998924195766449
Validation loss = 0.4009273946285248
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3934100866317749
Validation loss = 0.3961429297924042
Validation loss = 0.39631423354148865
Validation loss = 0.3994366228580475
Validation loss = 0.4004049301147461
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39376187324523926
Validation loss = 0.3971841633319855
Validation loss = 0.39861106872558594
Validation loss = 0.4008578062057495
Validation loss = 0.3987255096435547
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3944057822227478
Validation loss = 0.396723210811615
Validation loss = 0.39721858501434326
Validation loss = 0.399826318025589
Validation loss = 0.40043872594833374
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.83     |
| Iteration     | 26       |
| MaximumReturn | 31       |
| MinimumReturn | -12.5    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3936310410499573
Validation loss = 0.3984798789024353
Validation loss = 0.4017334580421448
Validation loss = 0.401474267244339
Validation loss = 0.40132611989974976
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39405494928359985
Validation loss = 0.39866235852241516
Validation loss = 0.4004092812538147
Validation loss = 0.40196624398231506
Validation loss = 0.4035337269306183
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.39573660492897034
Validation loss = 0.39963310956954956
Validation loss = 0.3997158110141754
Validation loss = 0.4026091396808624
Validation loss = 0.40545541048049927
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39734214544296265
Validation loss = 0.39820989966392517
Validation loss = 0.3997178077697754
Validation loss = 0.40155261754989624
Validation loss = 0.40473809838294983
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3924809992313385
Validation loss = 0.3993202745914459
Validation loss = 0.4026477038860321
Validation loss = 0.4023987650871277
Validation loss = 0.40610024333000183
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 7.11     |
| Iteration     | 27       |
| MaximumReturn | 24.6     |
| MinimumReturn | -12.3    |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39994722604751587
Validation loss = 0.40144917368888855
Validation loss = 0.4021322429180145
Validation loss = 0.40552592277526855
Validation loss = 0.40601223707199097
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39894899725914
Validation loss = 0.40071901679039
Validation loss = 0.40262776613235474
Validation loss = 0.4041304886341095
Validation loss = 0.405912846326828
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3985351622104645
Validation loss = 0.4015067517757416
Validation loss = 0.4034782350063324
Validation loss = 0.4059087038040161
Validation loss = 0.40566951036453247
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39878711104393005
Validation loss = 0.39892399311065674
Validation loss = 0.4027394652366638
Validation loss = 0.40282270312309265
Validation loss = 0.4071468412876129
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40229547023773193
Validation loss = 0.3995874524116516
Validation loss = 0.4053564667701721
Validation loss = 0.405562162399292
Validation loss = 0.40518784523010254
Validation loss = 0.4064580798149109
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 13.2     |
| Iteration     | 28       |
| MaximumReturn | 29.2     |
| MinimumReturn | -7.62    |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.402904212474823
Validation loss = 0.4043939411640167
Validation loss = 0.40733084082603455
Validation loss = 0.4085836410522461
Validation loss = 0.40854206681251526
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4009280800819397
Validation loss = 0.4040813148021698
Validation loss = 0.40689507126808167
Validation loss = 0.4082633852958679
Validation loss = 0.4094606637954712
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4054722189903259
Validation loss = 0.4068884253501892
Validation loss = 0.40631914138793945
Validation loss = 0.41085827350616455
Validation loss = 0.4084993302822113
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4002440571784973
Validation loss = 0.4030020534992218
Validation loss = 0.4061185419559479
Validation loss = 0.4080638289451599
Validation loss = 0.40916314721107483
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4009180963039398
Validation loss = 0.40417057275772095
Validation loss = 0.40835803747177124
Validation loss = 0.40815943479537964
Validation loss = 0.4067733883857727
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 0.79     |
| Iteration     | 29       |
| MaximumReturn | 25.9     |
| MinimumReturn | -10.4    |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40304023027420044
Validation loss = 0.40434208512306213
Validation loss = 0.4078642725944519
Validation loss = 0.40945112705230713
Validation loss = 0.41185736656188965
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4052717089653015
Validation loss = 0.4046964645385742
Validation loss = 0.409158319234848
Validation loss = 0.4081929624080658
Validation loss = 0.41401559114456177
Validation loss = 0.41076046228408813
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4032924771308899
Validation loss = 0.40517961978912354
Validation loss = 0.40923619270324707
Validation loss = 0.40968433022499084
Validation loss = 0.40990665555000305
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.40300890803337097
Validation loss = 0.4051266610622406
Validation loss = 0.4094792604446411
Validation loss = 0.41070157289505005
Validation loss = 0.4102877378463745
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4045793414115906
Validation loss = 0.40537863969802856
Validation loss = 0.4068111777305603
Validation loss = 0.41045984625816345
Validation loss = 0.4101550579071045
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 9.11     |
| Iteration     | 30       |
| MaximumReturn | 27.6     |
| MinimumReturn | -10.9    |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4067856967449188
Validation loss = 0.4097900688648224
Validation loss = 0.4101022481918335
Validation loss = 0.413108766078949
Validation loss = 0.41356873512268066
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.40924033522605896
Validation loss = 0.4113914966583252
Validation loss = 0.4121054708957672
Validation loss = 0.4141213297843933
Validation loss = 0.4155486822128296
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4072608947753906
Validation loss = 0.40750837326049805
Validation loss = 0.41202402114868164
Validation loss = 0.41242921352386475
Validation loss = 0.4126952588558197
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4051973223686218
Validation loss = 0.4084739089012146
Validation loss = 0.4125455617904663
Validation loss = 0.4108624756336212
Validation loss = 0.41249343752861023
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40653979778289795
Validation loss = 0.40952980518341064
Validation loss = 0.4097784459590912
Validation loss = 0.41156280040740967
Validation loss = 0.41117730736732483
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.6      |
| Iteration     | 31       |
| MaximumReturn | 19.8     |
| MinimumReturn | -14.1    |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40659499168395996
Validation loss = 0.41177433729171753
Validation loss = 0.4138698875904083
Validation loss = 0.41434574127197266
Validation loss = 0.4151477813720703
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41225486993789673
Validation loss = 0.41159212589263916
Validation loss = 0.41414493322372437
Validation loss = 0.41482216119766235
Validation loss = 0.41729453206062317
Validation loss = 0.4174809455871582
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4084852635860443
Validation loss = 0.411687970161438
Validation loss = 0.41277119517326355
Validation loss = 0.4154338836669922
Validation loss = 0.417082279920578
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4071871340274811
Validation loss = 0.41072455048561096
Validation loss = 0.41586029529571533
Validation loss = 0.412473201751709
Validation loss = 0.4144379794597626
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40941017866134644
Validation loss = 0.41017580032348633
Validation loss = 0.4127639830112457
Validation loss = 0.41363608837127686
Validation loss = 0.41456127166748047
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.0813  |
| Iteration     | 32       |
| MaximumReturn | 22.8     |
| MinimumReturn | -14.5    |
| TotalSamples  | 136000   |
----------------------------
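
The log ends with the iteration-32 summary above (136000 real-environment samples in total). To trace the learning curve from a log in this format, a small parser can pull out the Iteration / AverageReturn pairs from the summary tables (the log file path below is a placeholder):

    import re

    def read_learning_curve(log_path):
        # Extract (iteration, average_return) pairs from the summary tables.
        iters, returns = [], []
        with open(log_path) as f:
            for line in f:
                m = re.match(r"\|\s*(AverageReturn|Iteration)\s*\|\s*(-?[0-9.]+)\s*\|",
                             line)
                if not m:
                    continue
                if m.group(1) == "AverageReturn":
                    returns.append(float(m.group(2)))
                else:
                    iters.append(int(m.group(2)))
        return list(zip(iters, returns))

    # e.g. read_learning_curve("run.log")  # placeholder path
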
