Logging to experiments/gym_fswimmer/SO01/Wed-02-Nov-2022-04-25-22-PM-CDT_gym_fswimmer_trpo_iteration_20_seed5543
Printing configuration ...
{'env_name': 'gym_fswimmer',
 'random_seeds': [2312, 1231, 2631, 5543],
 'save_variables': False,
 'model_save_dir': '/tmp/gym_fswimmer_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 33,
 'num_path_random': 6,
 'num_path_onpol': 6,
 'env_horizon': 1000,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'intrinsic_reward_only': False,
              'external_reward_evaluation_interval': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9,
                              'kl_clip': 0.0001, 'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh',
            'reinitialize_every_itr': False},
 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20,
          'batch_size': 50000, 'gae': 0.95},
 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20,
                     'batch_size': 50000, 'gae': 0.95},
 'algo': 'trpo'}
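The run directory on the first line encodes a timestamp, environment, algorithm, TRPO iteration count, and seed. A hypothetical reconstruction of that naming scheme (the helper name and base path are assumptions, not code from this repository):

```python
import os
import time

def make_log_dir(base, env_name, algo, trpo_iters, seed):
    # Produces names like the one at the top of this log, e.g.
    # Wed-02-Nov-2022-04-25-22-PM-CDT_gym_fswimmer_trpo_iteration_20_seed5543
    stamp = time.strftime('%a-%d-%b-%Y-%I-%M-%S-%p-%Z')
    run = f'{stamp}_{env_name}_{algo}_iteration_{trpo_iters}_seed{seed}'
    return os.path.join(base, run)

print(make_log_dir('experiments/gym_fswimmer/SO01', 'gym_fswimmer',
                   'trpo', 20, 5543))
```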
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
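Per the config, this phase collects num_path_random = 6 paths of env_horizon = 1000 steps each, with actions drawn uniformly at random. A minimal sketch of such a collection loop, assuming a 2022-era Gym-style environment API (the helper name is hypothetical); note the counter is printed before each path, which is why "Path 0" reports 0 timesteps:

```python
import numpy as np

def sample_random_paths(env, num_paths=6, horizon=1000):
    """Collect rollouts under a uniform-random policy (Gym-style API)."""
    paths, total_timesteps = [], 0
    for p in range(num_paths):
        print(f'Path {p} | total_timesteps {total_timesteps}.')
        obs = env.reset()
        observations, actions, next_observations, rewards = [], [], [], []
        for _ in range(horizon):
            act = env.action_space.sample()          # uniform random action
            next_obs, rew, done, _ = env.step(act)
            observations.append(obs)
            actions.append(act)
            next_observations.append(next_obs)
            rewards.append(rew)
            obs = next_obs
            if done:
                break
        total_timesteps += len(rewards)
        paths.append({'observations': np.array(observations),
                      'actions': np.array(actions),
                      'next_observations': np.array(next_observations),
                      'rewards': np.array(rewards)})
    return paths
```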
Creating normalization for training data.
Done creating normalization for training data.
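"Creating normalization" presumably computes per-dimension statistics over the random data so that dynamics-model inputs and targets can be standardized; a minimal sketch under that assumption (helper names are hypothetical):

```python
import numpy as np

def compute_normalization(data, eps=1e-6):
    """Per-dimension mean/std for standardizing dynamics-model data.

    `data` maps names (e.g. 'observations', 'actions', 'deltas') to
    stacked arrays of shape (N, dim). The epsilon guards against
    zero-variance dimensions.
    """
    return {key: (arr.mean(axis=0), arr.std(axis=0) + eps)
            for key, arr in data.items()}

def normalize(x, stats_pair):
    mean, std = stats_pair
    return (x - mean) / std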
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized.
Train dynamics model with intrinsic reward only? False
Pre-training enabled: using intrinsic reward only during pre-training.
Pre-training dynamics model for 0 iterations (skipped)...
Done pre-training dynamics model.
Using external reward only.
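The dynamics model initialized above is an ensemble of five feed-forward networks ('n_layers': 4, 'hidden_size': 1000, ReLU activations). A sketch of what one such ensemble could look like in PyTorch; this is an assumed reconstruction, not the repository's NNDynamicsModel, and the observation/action sizes below are placeholder values:

```python
import torch.nn as nn

def make_dynamics_net(obs_dim, act_dim, hidden_size=1000, n_layers=4):
    """One ensemble member: maps (state, action) to a next-state delta."""
    layers, in_dim = [], obs_dim + act_dim
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden_size), nn.ReLU()]
        in_dim = hidden_size
    layers.append(nn.Linear(in_dim, obs_dim))
    return nn.Sequential(*layers)

# Five independently initialized members, per 'ensemble_model_count': 5.
# obs_dim/act_dim here are example sizes, not the env's true dimensions.
ensemble = [make_dynamics_net(obs_dim=8, act_dim=2) for _ in range(5)]
```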
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6772875785827637
Validation loss = 0.42794090509414673
Validation loss = 0.3566340208053589
Validation loss = 0.3291546404361725
Validation loss = 0.31734275817871094
Validation loss = 0.31499946117401123
Validation loss = 0.31743860244750977
Validation loss = 0.3167399764060974
Validation loss = 0.32026374340057373
Validation loss = 0.31778085231781006
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6408864259719849
Validation loss = 0.38569533824920654
Validation loss = 0.3454388976097107
Validation loss = 0.3232596516609192
Validation loss = 0.325788676738739
Validation loss = 0.3205452263355255
Validation loss = 0.3146336078643799
Validation loss = 0.3126363456249237
Validation loss = 0.31339603662490845
Validation loss = 0.3170968294143677
Validation loss = 0.3234745264053345
Validation loss = 0.3265131711959839
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7393914461135864
Validation loss = 0.43511420488357544
Validation loss = 0.3562338352203369
Validation loss = 0.334981769323349
Validation loss = 0.3206425607204437
Validation loss = 0.3163570165634155
Validation loss = 0.31894123554229736
Validation loss = 0.31358760595321655
Validation loss = 0.3118455111980438
Validation loss = 0.3216465413570404
Validation loss = 0.32236146926879883
Validation loss = 0.320884108543396
Validation loss = 0.3266393542289734
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7297703623771667
Validation loss = 0.4265601634979248
Validation loss = 0.36124616861343384
Validation loss = 0.32878297567367554
Validation loss = 0.31985604763031006
Validation loss = 0.316017746925354
Validation loss = 0.3129303455352783
Validation loss = 0.3146361708641052
Validation loss = 0.316228449344635
Validation loss = 0.3322490453720093
Validation loss = 0.3190968334674835
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.9250509738922119
Validation loss = 0.42935818433761597
Validation loss = 0.35749509930610657
Validation loss = 0.3341349959373474
Validation loss = 0.326424777507782
Validation loss = 0.3172941207885742
Validation loss = 0.3146493434906006
Validation loss = 0.31291794776916504
Validation loss = 0.32053831219673157
Validation loss = 0.316658079624176
Validation loss = 0.3174327611923218
Validation loss = 0.33891114592552185
Done fitting dynamics.
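Although 'epochs': 200 is configured, each ensemble member prints only a handful of validation losses before fitting moves on, which is consistent with early stopping on validation loss. A sketch of such a stopping rule (the patience value and helper signatures are guesses, not the repository's code):

```python
import copy

def fit_with_early_stopping(model, train_step, validate,
                            max_epochs=200, patience=5):
    """Train until validation loss stops improving.

    `train_step()` runs one epoch of minibatch training; `validate()`
    returns the current validation loss. The uneven number of
    'Validation loss' lines per model above matches this kind of rule.
    """
    best_loss, best_state, bad_epochs = float('inf'), None, 0
    for _ in range(max_epochs):
        train_step()
        val_loss = validate()
        print(f'Validation loss = {val_loss}')
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # keep the best checkpoint
    return best_loss
```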
Updating randomness.
Done updating randomness.
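"Updating randomness" is not explained by the log. Given 'enable_particle_ensemble': True and 'particles': 5 in the config, one plausible reading is that the random assignment of imagined particles to ensemble members is re-drawn between outer iterations; the sketch below shows that interpretation only (an assumption, not confirmed by the source):

```python
import numpy as np

def resample_particle_assignment(num_particles=5, num_models=5, rng=None):
    """Hypothetical: re-draw which ensemble member propagates each particle.

    Fixing this assignment for a whole policy-optimization phase and
    re-drawing it between outer iterations is one common way to combine
    particle rollouts with a model ensemble.
    """
    rng = rng or np.random.default_rng()
    return rng.integers(num_models, size=num_particles)
```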
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
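Each "Training policy using TRPO" phase runs 20 policy iterations entirely inside the learned dynamics model (horizon 200, batch_size 50000, per the 'trpo' config block), with GAE(lambda = 0.95) advantages. A self-contained sketch of the standard GAE recursion those updates rely on (textbook GAE, not code taken from this repository):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has one extra entry for the bootstrap value of the final
    state. gamma and lam match the 'trpo' config block above.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```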
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
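After each batch of real rollouts, the normalization statistics are refreshed with the new data. One standard way to do this incrementally is a streaming mean/variance merge (Chan et al.'s parallel update); a sketch under the assumption that statistics are kept per dimension:

```python
import numpy as np

class RunningStats:
    """Streaming per-dimension mean/variance (parallel-merge update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # sum of squared deviations from the mean

    def update(self, batch):
        """Merge a (N, dim) batch into the running statistics."""
        n_b = batch.shape[0]
        mean_b = batch.mean(axis=0)
        m2_b = ((batch - mean_b) ** 2).sum(axis=0)
        delta = mean_b - self.mean
        total = self.n + n_b
        self.mean += delta * n_b / total
        self.m2 += m2_b + delta ** 2 * self.n * n_b / total
        self.n = total

    @property
    def std(self):
        return np.sqrt(self.m2 / max(self.n, 1)) + 1e-6
```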
----------------------------
| AverageReturn | -7.19    |
| Iteration     | 0        |
| MaximumReturn | 3.44     |
| MinimumReturn | -16.4    |
| TotalSamples  | 8000     |
----------------------------
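Each iteration closes with an rllab-style table of per-path return statistics. A hypothetical reconstruction of how those rows could be computed and printed from the six on-policy paths (TotalSamples is an experiment-wide counter maintained elsewhere in the training loop):

```python
import numpy as np

def log_iteration_stats(paths, itr, total_samples):
    """Print a stats table like the ones in this log (assumed layout)."""
    returns = [p['rewards'].sum() for p in paths]
    rows = [('AverageReturn', float(np.mean(returns))),
            ('Iteration', itr),
            ('MaximumReturn', float(np.max(returns))),
            ('MinimumReturn', float(np.min(returns))),
            ('TotalSamples', total_samples)]
    print('-' * 28)
    for key, val in rows:
        text = f'{val:g}' if isinstance(val, float) else str(val)
        print(f'| {key:<13} | {text:<8} |')
    print('-' * 28)
```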
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3734182119369507
Validation loss = 0.30810022354125977
Validation loss = 0.3126280605792999
Validation loss = 0.30549976229667664
Validation loss = 0.31128233671188354
Validation loss = 0.3063049018383026
Validation loss = 0.3117779493331909
Validation loss = 0.316184401512146
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.36589375138282776
Validation loss = 0.31033605337142944
Validation loss = 0.31406170129776
Validation loss = 0.30711326003074646
Validation loss = 0.3088877201080322
Validation loss = 0.30907177925109863
Validation loss = 0.32025304436683655
Validation loss = 0.3324943482875824
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.35835281014442444
Validation loss = 0.30603039264678955
Validation loss = 0.31071582436561584
Validation loss = 0.3059898316860199
Validation loss = 0.31942713260650635
Validation loss = 0.3160715401172638
Validation loss = 0.31848955154418945
Validation loss = 0.31749770045280457
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3579389750957489
Validation loss = 0.31330516934394836
Validation loss = 0.31101715564727783
Validation loss = 0.308452844619751
Validation loss = 0.3036681115627289
Validation loss = 0.3116282820701599
Validation loss = 0.32777735590934753
Validation loss = 0.3120889663696289
Validation loss = 0.3181813061237335
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3607129156589508
Validation loss = 0.31822624802589417
Validation loss = 0.30918535590171814
Validation loss = 0.3079475164413452
Validation loss = 0.30989110469818115
Validation loss = 0.31149235367774963
Validation loss = 0.3165735900402069
Validation loss = 0.3212544023990631
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.02    |
| Iteration     | 1        |
| MaximumReturn | 15.9     |
| MinimumReturn | -17.2    |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3087291419506073
Validation loss = 0.3062326908111572
Validation loss = 0.30399230122566223
Validation loss = 0.32086852192878723
Validation loss = 0.3133988678455353
Validation loss = 0.3193831443786621
Validation loss = 0.31488141417503357
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.318813294172287
Validation loss = 0.31090524792671204
Validation loss = 0.31624355912208557
Validation loss = 0.32302507758140564
Validation loss = 0.3197442293167114
Validation loss = 0.32327544689178467
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.320547491312027
Validation loss = 0.30251815915107727
Validation loss = 0.3062427043914795
Validation loss = 0.3179280757904053
Validation loss = 0.31731048226356506
Validation loss = 0.32673802971839905
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3171674311161041
Validation loss = 0.30773571133613586
Validation loss = 0.31539955735206604
Validation loss = 0.3168749511241913
Validation loss = 0.31079742312431335
Validation loss = 0.3212984800338745
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31160226464271545
Validation loss = 0.3052866756916046
Validation loss = 0.3145153224468231
Validation loss = 0.31476953625679016
Validation loss = 0.31677305698394775
Validation loss = 0.32372716069221497
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.09    |
| Iteration     | 2        |
| MaximumReturn | 17.7     |
| MinimumReturn | -24.3    |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3078519105911255
Validation loss = 0.3051184415817261
Validation loss = 0.3136230707168579
Validation loss = 0.3153381645679474
Validation loss = 0.3281254172325134
Validation loss = 0.33168312907218933
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3108331859111786
Validation loss = 0.3124822974205017
Validation loss = 0.3162482976913452
Validation loss = 0.3131909966468811
Validation loss = 0.3286173939704895
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3097374439239502
Validation loss = 0.31264984607696533
Validation loss = 0.31546294689178467
Validation loss = 0.3234618306159973
Validation loss = 0.3190239369869232
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.311935156583786
Validation loss = 0.3106882572174072
Validation loss = 0.3167697787284851
Validation loss = 0.315128892660141
Validation loss = 0.32314684987068176
Validation loss = 0.33089035749435425
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3017919063568115
Validation loss = 0.3109533190727234
Validation loss = 0.31173625588417053
Validation loss = 0.3202195167541504
Validation loss = 0.32746583223342896
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.78     |
| Iteration     | 3        |
| MaximumReturn | 17.5     |
| MinimumReturn | -27.4    |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31711632013320923
Validation loss = 0.3234211802482605
Validation loss = 0.3235018849372864
Validation loss = 0.3347165286540985
Validation loss = 0.33025094866752625
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3181714415550232
Validation loss = 0.31689712405204773
Validation loss = 0.3189016580581665
Validation loss = 0.31941676139831543
Validation loss = 0.328228622674942
Validation loss = 0.334187775850296
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3088867664337158
Validation loss = 0.3129492402076721
Validation loss = 0.31633007526397705
Validation loss = 0.3222200870513916
Validation loss = 0.322998583316803
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31379714608192444
Validation loss = 0.3218974471092224
Validation loss = 0.3227551281452179
Validation loss = 0.3234362006187439
Validation loss = 0.322212815284729
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3171862065792084
Validation loss = 0.31030720472335815
Validation loss = 0.32450783252716064
Validation loss = 0.32152196764945984
Validation loss = 0.32160133123397827
Validation loss = 0.32868871092796326
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.55    |
| Iteration     | 4        |
| MaximumReturn | 15.7     |
| MinimumReturn | -24.1    |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32182231545448303
Validation loss = 0.3254256546497345
Validation loss = 0.3255535662174225
Validation loss = 0.32838550209999084
Validation loss = 0.3362334072589874
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3199969232082367
Validation loss = 0.32756513357162476
Validation loss = 0.3245716094970703
Validation loss = 0.3283744156360626
Validation loss = 0.344877690076828
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3234829902648926
Validation loss = 0.31908121705055237
Validation loss = 0.32195281982421875
Validation loss = 0.32363954186439514
Validation loss = 0.3332926630973816
Validation loss = 0.33119311928749084
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31852009892463684
Validation loss = 0.32238712906837463
Validation loss = 0.32664185762405396
Validation loss = 0.3321802318096161
Validation loss = 0.3287643492221832
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.318555623292923
Validation loss = 0.3178984224796295
Validation loss = 0.33460476994514465
Validation loss = 0.32934603095054626
Validation loss = 0.3334111273288727
Validation loss = 0.3326347768306732
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.43     |
| Iteration     | 5        |
| MaximumReturn | 15.8     |
| MinimumReturn | -16.9    |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33685019612312317
Validation loss = 0.33378905057907104
Validation loss = 0.3373441696166992
Validation loss = 0.3461744785308838
Validation loss = 0.35240045189857483
Validation loss = 0.3520537316799164
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33674049377441406
Validation loss = 0.33199402689933777
Validation loss = 0.34652239084243774
Validation loss = 0.34474068880081177
Validation loss = 0.34718963503837585
Validation loss = 0.3546341061592102
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3325157165527344
Validation loss = 0.3351432681083679
Validation loss = 0.34331056475639343
Validation loss = 0.3510463833808899
Validation loss = 0.3508024215698242
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33626362681388855
Validation loss = 0.33692029118537903
Validation loss = 0.3387942314147949
Validation loss = 0.34159961342811584
Validation loss = 0.3488242030143738
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33911246061325073
Validation loss = 0.33899638056755066
Validation loss = 0.33968380093574524
Validation loss = 0.3516283333301544
Validation loss = 0.3549072742462158
Validation loss = 0.3520622253417969
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.98    |
| Iteration     | 6        |
| MaximumReturn | 12.1     |
| MinimumReturn | -25.1    |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3466936945915222
Validation loss = 0.3487508296966553
Validation loss = 0.35068047046661377
Validation loss = 0.35328179597854614
Validation loss = 0.36491748690605164
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3499826192855835
Validation loss = 0.3488132655620575
Validation loss = 0.3564274311065674
Validation loss = 0.3556945323944092
Validation loss = 0.3654935657978058
Validation loss = 0.3652159571647644
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34816882014274597
Validation loss = 0.3501054346561432
Validation loss = 0.3473421037197113
Validation loss = 0.3591279983520508
Validation loss = 0.35858118534088135
Validation loss = 0.3652474284172058
Validation loss = 0.370872437953949
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3398085832595825
Validation loss = 0.3423723578453064
Validation loss = 0.3440573811531067
Validation loss = 0.35014086961746216
Validation loss = 0.3561910390853882
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3470161259174347
Validation loss = 0.352483868598938
Validation loss = 0.35934484004974365
Validation loss = 0.3584445118904114
Validation loss = 0.3683900237083435
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.2      |
| Iteration     | 7        |
| MaximumReturn | 14.2     |
| MinimumReturn | -20.7    |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35442718863487244
Validation loss = 0.3627694249153137
Validation loss = 0.3605414628982544
Validation loss = 0.36883270740509033
Validation loss = 0.3703053295612335
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3623453676700592
Validation loss = 0.36099278926849365
Validation loss = 0.3680775463581085
Validation loss = 0.36849355697631836
Validation loss = 0.3740392327308655
Validation loss = 0.37927913665771484
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3644317388534546
Validation loss = 0.3647601902484894
Validation loss = 0.365897536277771
Validation loss = 0.3810061514377594
Validation loss = 0.37725672125816345
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35375067591667175
Validation loss = 0.35208022594451904
Validation loss = 0.3565455973148346
Validation loss = 0.3611847162246704
Validation loss = 0.36758071184158325
Validation loss = 0.3811790347099304
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3540293276309967
Validation loss = 0.3605004847049713
Validation loss = 0.3690670430660248
Validation loss = 0.3745003044605255
Validation loss = 0.37205788493156433
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.7    |
| Iteration     | 8        |
| MaximumReturn | 16.7     |
| MinimumReturn | -18.6    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.37015095353126526
Validation loss = 0.3691673278808594
Validation loss = 0.37602680921554565
Validation loss = 0.38862138986587524
Validation loss = 0.3798418641090393
Validation loss = 0.3950602114200592
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.37883058190345764
Validation loss = 0.3797551393508911
Validation loss = 0.38689976930618286
Validation loss = 0.38839736580848694
Validation loss = 0.3957657814025879
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3773292303085327
Validation loss = 0.37582331895828247
Validation loss = 0.38403528928756714
Validation loss = 0.39051467180252075
Validation loss = 0.3933154046535492
Validation loss = 0.3959234654903412
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3614315390586853
Validation loss = 0.37025177478790283
Validation loss = 0.3741108775138855
Validation loss = 0.37713927030563354
Validation loss = 0.3870003819465637
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37286999821662903
Validation loss = 0.37276309728622437
Validation loss = 0.38182657957077026
Validation loss = 0.38124844431877136
Validation loss = 0.3858093023300171
Validation loss = 0.38702818751335144
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.644   |
| Iteration     | 9        |
| MaximumReturn | 18.9     |
| MinimumReturn | -16.3    |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38906237483024597
Validation loss = 0.38663479685783386
Validation loss = 0.39034223556518555
Validation loss = 0.3983630836009979
Validation loss = 0.4033513069152832
Validation loss = 0.4053333103656769
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3825468122959137
Validation loss = 0.39168497920036316
Validation loss = 0.3978954255580902
Validation loss = 0.40065285563468933
Validation loss = 0.4041603207588196
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3863702118396759
Validation loss = 0.3950679898262024
Validation loss = 0.4065439999103546
Validation loss = 0.39666613936424255
Validation loss = 0.40490835905075073
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3752025365829468
Validation loss = 0.3826199769973755
Validation loss = 0.38759392499923706
Validation loss = 0.3946913778781891
Validation loss = 0.39196765422821045
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3843149244785309
Validation loss = 0.38550546765327454
Validation loss = 0.39055487513542175
Validation loss = 0.3966923952102661
Validation loss = 0.4072374105453491
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.07    |
| Iteration     | 10       |
| MaximumReturn | 15.7     |
| MinimumReturn | -21.4    |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40400493144989014
Validation loss = 0.4005555212497711
Validation loss = 0.40512028336524963
Validation loss = 0.4136364459991455
Validation loss = 0.41987964510917664
Validation loss = 0.4194738566875458
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3996526002883911
Validation loss = 0.4022054374217987
Validation loss = 0.4116272032260895
Validation loss = 0.41348525881767273
Validation loss = 0.4172334671020508
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40250346064567566
Validation loss = 0.40080687403678894
Validation loss = 0.4102543294429779
Validation loss = 0.41151881217956543
Validation loss = 0.416422039270401
Validation loss = 0.4186953604221344
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3981914222240448
Validation loss = 0.39532479643821716
Validation loss = 0.4011203944683075
Validation loss = 0.40260472893714905
Validation loss = 0.40934285521507263
Validation loss = 0.4124929904937744
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39504411816596985
Validation loss = 0.39797794818878174
Validation loss = 0.4053424298763275
Validation loss = 0.40804633498191833
Validation loss = 0.4118046760559082
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10      |
| Iteration     | 11       |
| MaximumReturn | 16       |
| MinimumReturn | -23.5    |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.41727691888809204
Validation loss = 0.41500991582870483
Validation loss = 0.4196776747703552
Validation loss = 0.4202454388141632
Validation loss = 0.42765501141548157
Validation loss = 0.43358612060546875
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41017457842826843
Validation loss = 0.41339918971061707
Validation loss = 0.4105919301509857
Validation loss = 0.42187821865081787
Validation loss = 0.4215424358844757
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4118117690086365
Validation loss = 0.41324475407600403
Validation loss = 0.41939693689346313
Validation loss = 0.4253592789173126
Validation loss = 0.4283367991447449
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.40990206599235535
Validation loss = 0.40849006175994873
Validation loss = 0.4172768294811249
Validation loss = 0.42081528902053833
Validation loss = 0.42245468497276306
Validation loss = 0.43044739961624146
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39623865485191345
Validation loss = 0.4078831970691681
Validation loss = 0.4113275110721588
Validation loss = 0.4161698520183563
Validation loss = 0.41965290904045105
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.93    |
| Iteration     | 12       |
| MaximumReturn | 18.1     |
| MinimumReturn | -22      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42617321014404297
Validation loss = 0.43110594153404236
Validation loss = 0.4348897635936737
Validation loss = 0.4356260299682617
Validation loss = 0.43908244371414185
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42050594091415405
Validation loss = 0.4275330603122711
Validation loss = 0.42436912655830383
Validation loss = 0.43186745047569275
Validation loss = 0.4338023364543915
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4161706566810608
Validation loss = 0.4229128658771515
Validation loss = 0.4287976622581482
Validation loss = 0.43210241198539734
Validation loss = 0.4412086009979248
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4289107322692871
Validation loss = 0.4210534691810608
Validation loss = 0.42678990960121155
Validation loss = 0.43296271562576294
Validation loss = 0.4362489581108093
Validation loss = 0.4393279552459717
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4130919575691223
Validation loss = 0.4179742932319641
Validation loss = 0.43044358491897583
Validation loss = 0.4285799562931061
Validation loss = 0.4355326294898987
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.07    |
| Iteration     | 13       |
| MaximumReturn | 17.4     |
| MinimumReturn | -24.1    |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4293459951877594
Validation loss = 0.43113672733306885
Validation loss = 0.43906429409980774
Validation loss = 0.44146454334259033
Validation loss = 0.4444320797920227
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42915043234825134
Validation loss = 0.4292512834072113
Validation loss = 0.4367130696773529
Validation loss = 0.43887463212013245
Validation loss = 0.4425927996635437
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4290210008621216
Validation loss = 0.43357688188552856
Validation loss = 0.43800461292266846
Validation loss = 0.44147172570228577
Validation loss = 0.44440731406211853
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4300476610660553
Validation loss = 0.4343375861644745
Validation loss = 0.44312673807144165
Validation loss = 0.44643279910087585
Validation loss = 0.44308236241340637
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.42057177424430847
Validation loss = 0.4288334548473358
Validation loss = 0.42998006939888
Validation loss = 0.43629294633865356
Validation loss = 0.4343700110912323
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -16.6    |
| Iteration     | 14       |
| MaximumReturn | 2.85     |
| MinimumReturn | -24.5    |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.43219953775405884
Validation loss = 0.44163647294044495
Validation loss = 0.4482758641242981
Validation loss = 0.45195984840393066
Validation loss = 0.45684802532196045
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4349137842655182
Validation loss = 0.44383475184440613
Validation loss = 0.4401332437992096
Validation loss = 0.4487871527671814
Validation loss = 0.4523658752441406
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.43215689063072205
Validation loss = 0.4429172873497009
Validation loss = 0.4455817639827728
Validation loss = 0.4457772374153137
Validation loss = 0.4543224573135376
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44326239824295044
Validation loss = 0.4420377016067505
Validation loss = 0.4476153254508972
Validation loss = 0.45148104429244995
Validation loss = 0.45433545112609863
Validation loss = 0.45881739258766174
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4345390796661377
Validation loss = 0.4399740993976593
Validation loss = 0.4400079846382141
Validation loss = 0.4472876489162445
Validation loss = 0.4522109627723694
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.159   |
| Iteration     | 15       |
| MaximumReturn | 12.8     |
| MinimumReturn | -23.2    |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4445725679397583
Validation loss = 0.4473802447319031
Validation loss = 0.4516276717185974
Validation loss = 0.45644330978393555
Validation loss = 0.4589274525642395
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.44029560685157776
Validation loss = 0.4475381374359131
Validation loss = 0.4492150545120239
Validation loss = 0.44975796341896057
Validation loss = 0.4528714418411255
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4430682957172394
Validation loss = 0.44939061999320984
Validation loss = 0.4528862237930298
Validation loss = 0.45582902431488037
Validation loss = 0.4567233622074127
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44966405630111694
Validation loss = 0.4495909512042999
Validation loss = 0.4535320997238159
Validation loss = 0.4641043245792389
Validation loss = 0.46209409832954407
Validation loss = 0.4654316008090973
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4386521577835083
Validation loss = 0.4467662274837494
Validation loss = 0.44953468441963196
Validation loss = 0.45321735739707947
Validation loss = 0.45445653796195984
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.53    |
| Iteration     | 16       |
| MaximumReturn | 11       |
| MinimumReturn | -16.2    |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.44893115758895874
Validation loss = 0.45110204815864563
Validation loss = 0.45810550451278687
Validation loss = 0.456906795501709
Validation loss = 0.46499115228652954
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.44683128595352173
Validation loss = 0.45321041345596313
Validation loss = 0.45737484097480774
Validation loss = 0.45573604106903076
Validation loss = 0.4648207426071167
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4453646242618561
Validation loss = 0.45151886343955994
Validation loss = 0.45533397793769836
Validation loss = 0.4573996365070343
Validation loss = 0.46460431814193726
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44986486434936523
Validation loss = 0.4538730978965759
Validation loss = 0.45554250478744507
Validation loss = 0.4617077112197876
Validation loss = 0.4671344757080078
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4461154341697693
Validation loss = 0.4448390007019043
Validation loss = 0.45458513498306274
Validation loss = 0.45733579993247986
Validation loss = 0.4588075876235962
Validation loss = 0.4645581841468811
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.76    |
| Iteration     | 17       |
| MaximumReturn | 15.1     |
| MinimumReturn | -26.5    |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.448792964220047
Validation loss = 0.45140019059181213
Validation loss = 0.45888158679008484
Validation loss = 0.46215322613716125
Validation loss = 0.461925208568573
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4470030665397644
Validation loss = 0.45375773310661316
Validation loss = 0.4551069736480713
Validation loss = 0.4564136564731598
Validation loss = 0.46069028973579407
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44623002409935
Validation loss = 0.45241686701774597
Validation loss = 0.4581730365753174
Validation loss = 0.458143949508667
Validation loss = 0.4612216353416443
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4518476724624634
Validation loss = 0.4567135274410248
Validation loss = 0.4597398340702057
Validation loss = 0.4599689245223999
Validation loss = 0.46565327048301697
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4503403604030609
Validation loss = 0.4527674615383148
Validation loss = 0.4619787037372589
Validation loss = 0.45911508798599243
Validation loss = 0.46069350838661194
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.07    |
| Iteration     | 18       |
| MaximumReturn | 13.8     |
| MinimumReturn | -22.2    |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4526631832122803
Validation loss = 0.45686349272727966
Validation loss = 0.4629729390144348
Validation loss = 0.46820956468582153
Validation loss = 0.46047544479370117
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4550187587738037
Validation loss = 0.4542485177516937
Validation loss = 0.45928317308425903
Validation loss = 0.46224063634872437
Validation loss = 0.4650491178035736
Validation loss = 0.463374525308609
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.45294761657714844
Validation loss = 0.4552876353263855
Validation loss = 0.4584757685661316
Validation loss = 0.4622238278388977
Validation loss = 0.4641185700893402
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4559154510498047
Validation loss = 0.4633353352546692
Validation loss = 0.4637886583805084
Validation loss = 0.46935930848121643
Validation loss = 0.4664344787597656
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.45357847213745117
Validation loss = 0.454868882894516
Validation loss = 0.4560125470161438
Validation loss = 0.4633707106113434
Validation loss = 0.4608813226222992
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 0.776    |
| Iteration     | 19       |
| MaximumReturn | 17.1     |
| MinimumReturn | -15.1    |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4577707052230835
Validation loss = 0.4557397961616516
Validation loss = 0.4625754952430725
Validation loss = 0.4679538607597351
Validation loss = 0.47058069705963135
Validation loss = 0.4699324667453766
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46519824862480164
Validation loss = 0.46214866638183594
Validation loss = 0.4626893103122711
Validation loss = 0.4630504846572876
Validation loss = 0.47016283869743347
Validation loss = 0.4671004116535187
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4589933156967163
Validation loss = 0.46298322081565857
Validation loss = 0.4627424478530884
Validation loss = 0.4642898142337799
Validation loss = 0.47210392355918884
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.45753106474876404
Validation loss = 0.4618212878704071
Validation loss = 0.4614662826061249
Validation loss = 0.4670746326446533
Validation loss = 0.47193849086761475
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4528944492340088
Validation loss = 0.4608127474784851
Validation loss = 0.46190816164016724
Validation loss = 0.4651205539703369
Validation loss = 0.4686611592769623
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.07     |
| Iteration     | 20       |
| MaximumReturn | 19.5     |
| MinimumReturn | -20      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4577701687812805
Validation loss = 0.46142756938934326
Validation loss = 0.46339675784111023
Validation loss = 0.46988049149513245
Validation loss = 0.47017860412597656
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4600622057914734
Validation loss = 0.4617665112018585
Validation loss = 0.46238958835601807
Validation loss = 0.46551772952079773
Validation loss = 0.4663122296333313
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.45584431290626526
Validation loss = 0.4637516140937805
Validation loss = 0.4660608172416687
Validation loss = 0.46870413422584534
Validation loss = 0.4690536558628082
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46073225140571594
Validation loss = 0.46227481961250305
Validation loss = 0.4656181335449219
Validation loss = 0.46821966767311096
Validation loss = 0.4740569293498993
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.45995646715164185
Validation loss = 0.45969727635383606
Validation loss = 0.4626395106315613
Validation loss = 0.46482670307159424
Validation loss = 0.4666726291179657
Validation loss = 0.4683743715286255
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.506   |
| Iteration     | 21       |
| MaximumReturn | 20.3     |
| MinimumReturn | -21.4    |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4597283601760864
Validation loss = 0.4647122621536255
Validation loss = 0.4694893956184387
Validation loss = 0.46908038854599
Validation loss = 0.4699545204639435
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4574023485183716
Validation loss = 0.46476638317108154
Validation loss = 0.46771135926246643
Validation loss = 0.4686601459980011
Validation loss = 0.4692026972770691
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.46390414237976074
Validation loss = 0.47070813179016113
Validation loss = 0.46427738666534424
Validation loss = 0.47109857201576233
Validation loss = 0.47203516960144043
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4635070860385895
Validation loss = 0.4630783200263977
Validation loss = 0.4660719037055969
Validation loss = 0.47040417790412903
Validation loss = 0.47235995531082153
Validation loss = 0.47376495599746704
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.45823901891708374
Validation loss = 0.4615212678909302
Validation loss = 0.4692249596118927
Validation loss = 0.47245216369628906
Validation loss = 0.4702011048793793
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2       |
| Iteration     | 22       |
| MaximumReturn | 21.1     |
| MinimumReturn | -24.5    |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.46134820580482483
Validation loss = 0.4655916690826416
Validation loss = 0.4668467938899994
Validation loss = 0.47246718406677246
Validation loss = 0.46970319747924805
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46247151494026184
Validation loss = 0.4640289545059204
Validation loss = 0.4669133722782135
Validation loss = 0.4689856767654419
Validation loss = 0.4746222496032715
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.46306684613227844
Validation loss = 0.46511104702949524
Validation loss = 0.4689064025878906
Validation loss = 0.471317857503891
Validation loss = 0.47072887420654297
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46596816182136536
Validation loss = 0.46571794152259827
Validation loss = 0.4700559377670288
Validation loss = 0.4709603786468506
Validation loss = 0.47184133529663086
Validation loss = 0.4754733741283417
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.46074333786964417
Validation loss = 0.4645911753177643
Validation loss = 0.4674764573574066
Validation loss = 0.4727555215358734
Validation loss = 0.4724481999874115
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.33    |
| Iteration     | 23       |
| MaximumReturn | 18.6     |
| MinimumReturn | -23.5    |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4652649164199829
Validation loss = 0.46417874097824097
Validation loss = 0.4678313136100769
Validation loss = 0.472663015127182
Validation loss = 0.46842652559280396
Validation loss = 0.4715462028980255
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4639405906200409
Validation loss = 0.46815404295921326
Validation loss = 0.46785110235214233
Validation loss = 0.46917960047721863
Validation loss = 0.47105181217193604
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4645616412162781
Validation loss = 0.4656473994255066
Validation loss = 0.4704662561416626
Validation loss = 0.4750487804412842
Validation loss = 0.4728471338748932
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46516937017440796
Validation loss = 0.4675697684288025
Validation loss = 0.46973782777786255
Validation loss = 0.4719219207763672
Validation loss = 0.47416117787361145
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.46188271045684814
Validation loss = 0.4647904634475708
Validation loss = 0.4692731201648712
Validation loss = 0.46993958950042725
Validation loss = 0.47212740778923035
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.31    |
| Iteration     | 24       |
| MaximumReturn | 19.7     |
| MinimumReturn | -24.4    |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.465204119682312
Validation loss = 0.4697529673576355
Validation loss = 0.4690132737159729
Validation loss = 0.4699857831001282
Validation loss = 0.47168394923210144
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46294593811035156
Validation loss = 0.46800681948661804
Validation loss = 0.46941375732421875
Validation loss = 0.47155430912971497
Validation loss = 0.47337689995765686
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4626447260379791
Validation loss = 0.47117042541503906
Validation loss = 0.4710249602794647
Validation loss = 0.4702775776386261
Validation loss = 0.476165235042572
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46465444564819336
Validation loss = 0.46676164865493774
Validation loss = 0.47442370653152466
Validation loss = 0.4744519591331482
Validation loss = 0.47455692291259766
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4659411311149597
Validation loss = 0.46621230244636536
Validation loss = 0.47195324301719666
Validation loss = 0.4715934097766876
Validation loss = 0.4740350544452667
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.34     |
| Iteration     | 25       |
| MaximumReturn | 16.9     |
| MinimumReturn | -23      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4673101603984833
Validation loss = 0.4652060270309448
Validation loss = 0.4706267714500427
Validation loss = 0.47565916180610657
Validation loss = 0.47553160786628723
Validation loss = 0.47645050287246704
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46816813945770264
Validation loss = 0.4663577377796173
Validation loss = 0.47409382462501526
Validation loss = 0.4752133786678314
Validation loss = 0.47226282954216003
Validation loss = 0.47919440269470215
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4695585072040558
Validation loss = 0.4700755476951599
Validation loss = 0.47382086515426636
Validation loss = 0.4754623770713806
Validation loss = 0.47464773058891296
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46991002559661865
Validation loss = 0.4711376428604126
Validation loss = 0.47230982780456543
Validation loss = 0.47406458854675293
Validation loss = 0.4748595654964447
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.464399129152298
Validation loss = 0.4722428619861603
Validation loss = 0.4713962972164154
Validation loss = 0.47555312514305115
Validation loss = 0.47644445300102234
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.17    |
| Iteration     | 26       |
| MaximumReturn | 8.73     |
| MinimumReturn | -23.1    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4698706269264221
Validation loss = 0.4716046154499054
Validation loss = 0.47243595123291016
Validation loss = 0.47624102234840393
Validation loss = 0.47509145736694336
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4658355414867401
Validation loss = 0.4715346395969391
Validation loss = 0.47242340445518494
Validation loss = 0.473731130361557
Validation loss = 0.47608837485313416
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4700796902179718
Validation loss = 0.47158485651016235
Validation loss = 0.47545263171195984
Validation loss = 0.47567227482795715
Validation loss = 0.4791223704814911
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4664972722530365
Validation loss = 0.4690193831920624
Validation loss = 0.4753232002258301
Validation loss = 0.47973352670669556
Validation loss = 0.4766312539577484
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.46870777010917664
Validation loss = 0.4709872305393219
Validation loss = 0.4750922620296478
Validation loss = 0.4757988750934601
Validation loss = 0.47756579518318176
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.9     |
| Iteration     | 27       |
| MaximumReturn | 16.7     |
| MinimumReturn | -18.4    |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4721958339214325
Validation loss = 0.47256630659103394
Validation loss = 0.4755493700504303
Validation loss = 0.47651803493499756
Validation loss = 0.4754103720188141
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.46964704990386963
Validation loss = 0.4687248468399048
Validation loss = 0.4739196300506592
Validation loss = 0.4753658175468445
Validation loss = 0.4753531217575073
Validation loss = 0.4768575429916382
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.46801307797431946
Validation loss = 0.4727197289466858
Validation loss = 0.4736414849758148
Validation loss = 0.47965919971466064
Validation loss = 0.48005273938179016
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4690376818180084
Validation loss = 0.472029447555542
Validation loss = 0.47787052392959595
Validation loss = 0.47529637813568115
Validation loss = 0.47881853580474854
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4719877243041992
Validation loss = 0.47277364134788513
Validation loss = 0.47380000352859497
Validation loss = 0.47871503233909607
Validation loss = 0.4806225299835205
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.16    |
| Iteration     | 28       |
| MaximumReturn | 21.1     |
| MinimumReturn | -24.1    |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4734533727169037
Validation loss = 0.4742073714733124
Validation loss = 0.47556304931640625
Validation loss = 0.4788725972175598
Validation loss = 0.4792889952659607
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.47444671392440796
Validation loss = 0.4741533100605011
Validation loss = 0.47737348079681396
Validation loss = 0.47789469361305237
Validation loss = 0.4789992868900299
Validation loss = 0.4791439175605774
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.47159507870674133
Validation loss = 0.4763764441013336
Validation loss = 0.4774150252342224
Validation loss = 0.47998395562171936
Validation loss = 0.48246535658836365
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4726203382015228
Validation loss = 0.47579944133758545
Validation loss = 0.4759356379508972
Validation loss = 0.48164117336273193
Validation loss = 0.4794490933418274
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4753846824169159
Validation loss = 0.4762624204158783
Validation loss = 0.4787035286426544
Validation loss = 0.4799570143222809
Validation loss = 0.48110231757164
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 6.07     |
| Iteration     | 29       |
| MaximumReturn | 18.5     |
| MinimumReturn | -12.2    |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.47222673892974854
Validation loss = 0.4772305190563202
Validation loss = 0.48160961270332336
Validation loss = 0.479235976934433
Validation loss = 0.4812805652618408
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4752677381038666
Validation loss = 0.4758499264717102
Validation loss = 0.4791082441806793
Validation loss = 0.4793175756931305
Validation loss = 0.480326384305954
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4760759770870209
Validation loss = 0.47809284925460815
Validation loss = 0.48108136653900146
Validation loss = 0.4799499809741974
Validation loss = 0.48164305090904236
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.472228467464447
Validation loss = 0.47582918405532837
Validation loss = 0.4798165261745453
Validation loss = 0.4818561375141144
Validation loss = 0.48142632842063904
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4759809970855713
Validation loss = 0.47834670543670654
Validation loss = 0.47992074489593506
Validation loss = 0.48185572028160095
Validation loss = 0.48594942688941956
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.81    |
| Iteration     | 30       |
| MaximumReturn | 19.6     |
| MinimumReturn | -20.5    |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4778304100036621
Validation loss = 0.47843432426452637
Validation loss = 0.48075807094573975
Validation loss = 0.48115211725234985
Validation loss = 0.48208481073379517
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.47825807332992554
Validation loss = 0.4765683114528656
Validation loss = 0.47828778624534607
Validation loss = 0.48375171422958374
Validation loss = 0.4796532392501831
Validation loss = 0.4839106500148773
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4790537655353546
Validation loss = 0.4778773784637451
Validation loss = 0.4826105237007141
Validation loss = 0.4829018712043762
Validation loss = 0.4850752353668213
Validation loss = 0.4861239492893219
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4744637608528137
Validation loss = 0.48006922006607056
Validation loss = 0.48020756244659424
Validation loss = 0.4810812175273895
Validation loss = 0.483012318611145
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.47667068243026733
Validation loss = 0.48020488023757935
Validation loss = 0.482101708650589
Validation loss = 0.4853644073009491
Validation loss = 0.48559215664863586
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 10.1     |
| Iteration     | 31       |
| MaximumReturn | 20       |
| MinimumReturn | -19.4    |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4788769483566284
Validation loss = 0.4790155291557312
Validation loss = 0.4810006618499756
Validation loss = 0.4818214178085327
Validation loss = 0.48296552896499634
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.47400185465812683
Validation loss = 0.48000568151474
Validation loss = 0.48066601157188416
Validation loss = 0.4839673638343811
Validation loss = 0.4831506907939911
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.47883331775665283
Validation loss = 0.48276615142822266
Validation loss = 0.4851026237010956
Validation loss = 0.4854712188243866
Validation loss = 0.48763570189476013
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.47539380192756653
Validation loss = 0.4799986183643341
Validation loss = 0.4820955991744995
Validation loss = 0.48196303844451904
Validation loss = 0.4838699400424957
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4779227674007416
Validation loss = 0.4812512993812561
Validation loss = 0.4836810827255249
Validation loss = 0.484817773103714
Validation loss = 0.4866732358932495
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -18      |
| Iteration     | 32       |
| MaximumReturn | -10.4    |
| MinimumReturn | -22.2    |
| TotalSamples  | 136000   |
----------------------------
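Across iterations 18-32 the external AverageReturn oscillates between roughly -18 and +10 with no sustained trend, so the learning curve is easier to judge plotted than read inline. A small post-hoc helper that recovers (Iteration, AverageReturn) pairs from a log file in this exact format; this is analysis-side code, not part of the training run, and the file name below is hypothetical.

    import re

    def parse_returns(log_path):
        rows, avg = [], None
        pat = re.compile(r"\|\s*(\w+)\s*\|\s*(-?[\d.]+)\s*\|")
        with open(log_path) as fh:
            for line in fh:
                m = pat.match(line.strip())
                if not m:
                    continue
                key, val = m.group(1), float(m.group(2))
                if key == "AverageReturn":
                    avg = val
                elif key == "Iteration":   # printed after AverageReturn
                    rows.append((int(val), avg))
        return rows

    # e.g. parse_returns("train.log") -> [(18, -1.07), (19, 0.776), ...]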
