Logging to experiments/gym_fswimmer/SO01/Wed-02-Nov-2022-04-25-22-PM-CDT_gym_fswimmer_trpo_iteration_20_seed1231
Printing configuration ...
{'env_name': 'gym_fswimmer',
 'random_seeds': [2312, 1231, 2631, 5543],
 'save_variables': False,
 'model_save_dir': '/tmp/gym_fswimmer_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 33,
 'num_path_random': 6,
 'num_path_onpol': 6,
 'env_horizon': 1000,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'intrinsic_reward_only': False,
              'external_reward_evaluation_interval': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False},
 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95},
 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95},
 'algo': 'trpo'}
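Note: the 'dynamics' block above implies an ensemble of 5 feed-forward models with 4 hidden layers of 1000 ReLU units each, trained with learning rate 0.001. A minimal sketch of how such a config could be turned into an ensemble is shown below; the actual run uses model.dynamics.NNDynamicsModel, whose interface is not visible in this log, so the build_mlp helper, the PyTorch dependency, and the observation/action sizes are assumptions for illustration only.

import torch
import torch.nn as nn

def build_mlp(in_dim, out_dim, n_layers=4, hidden_size=1000, activation=nn.ReLU):
    # 4 hidden layers x 1000 ReLU units, as in the 'dynamics' config above.
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden_size), activation()]
        d = hidden_size
    layers.append(nn.Linear(d, out_dim))  # predicts the next-state delta
    return nn.Sequential(*layers)

obs_dim, act_dim = 8, 2  # placeholders; the true gym_fswimmer dimensions are not in this log
ensemble = [build_mlp(obs_dim + act_dim, obs_dim) for _ in range(5)]        # ensemble_model_count = 5
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in ensemble]  # learning_rate = 0.001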
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
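The six lines above report the cumulative timestep count before each random path (num_path_random = 6, env_horizon = 1000). A sketch of that collection loop is shown below; it assumes the classic Gym API (reset returns an observation, step returns a 4-tuple) and the function name is illustrative, not taken from the codebase.

def sample_random_rollouts(env, num_paths=6, horizon=1000):
    paths, total_timesteps = [], 0
    for i in range(num_paths):
        print(f"Path {i} | total_timesteps {total_timesteps}.")
        obs = env.reset()
        path = {"obs": [], "act": [], "next_obs": []}
        for _ in range(horizon):
            act = env.action_space.sample()          # uniform random actions
            next_obs, _, done, _ = env.step(act)
            path["obs"].append(obs)
            path["act"].append(act)
            path["next_obs"].append(next_obs)
            obs = next_obs
            total_timesteps += 1
            if done:
                break
        paths.append(path)
    return paths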
Creating normalization for training data.
Done creating normalization for training data.
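The normalization statistics are most likely per-dimension mean/std over the collected transitions; a plausible sketch is below (names and the exact set of statistics are assumptions, since the log does not show them).

import numpy as np

def compute_normalization(obs, acts, next_obs):
    # Per-dimension mean/std over states, actions, and state deltas; the small
    # epsilon guards against zero-variance dimensions when normalizing later.
    deltas = next_obs - obs
    stats = {}
    for name, x in (("obs", obs), ("act", acts), ("delta", deltas)):
        stats[name] = (x.mean(axis=0), x.std(axis=0) + 1e-8)
    return stats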
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized.
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5887596011161804
Validation loss = 0.403493732213974
Validation loss = 0.3532167971134186
Validation loss = 0.3404453992843628
Validation loss = 0.33795398473739624
Validation loss = 0.3443545997142792
Validation loss = 0.33789291977882385
Validation loss = 0.3430492579936981
Validation loss = 0.3432161211967468
Validation loss = 0.35086268186569214
Validation loss = 0.3631913661956787
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9396911859512329
Validation loss = 0.43119674921035767
Validation loss = 0.370923787355423
Validation loss = 0.3435843586921692
Validation loss = 0.33709508180618286
Validation loss = 0.3369966149330139
Validation loss = 0.3418010175228119
Validation loss = 0.3339908719062805
Validation loss = 0.3402746319770813
Validation loss = 0.34172946214675903
Validation loss = 0.3459014296531677
Validation loss = 0.3456784784793854
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8729269504547119
Validation loss = 0.4263031482696533
Validation loss = 0.36462539434432983
Validation loss = 0.34332916140556335
Validation loss = 0.3441840708255768
Validation loss = 0.3388245403766632
Validation loss = 0.33553239703178406
Validation loss = 0.3414592146873474
Validation loss = 0.34341490268707275
Validation loss = 0.3452115058898926
Validation loss = 0.35205376148223877
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6595143675804138
Validation loss = 0.417088121175766
Validation loss = 0.3647352457046509
Validation loss = 0.34127628803253174
Validation loss = 0.33826446533203125
Validation loss = 0.33638590574264526
Validation loss = 0.33632197976112366
Validation loss = 0.33509761095046997
Validation loss = 0.3412436246871948
Validation loss = 0.3421925902366638
Validation loss = 0.3454378843307495
Validation loss = 0.3469700217247009
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6074050068855286
Validation loss = 0.4116279184818268
Validation loss = 0.35492271184921265
Validation loss = 0.3430268168449402
Validation loss = 0.341585636138916
Validation loss = 0.33738839626312256
Validation loss = 0.34987419843673706
Validation loss = 0.3475320041179657
Validation loss = 0.35260009765625
Validation loss = 0.35069364309310913
Done fitting dynamics.
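Each "Fitting model k" block above prints one validation loss per epoch, and every model stops well before the configured 200 epochs, which suggests validation-based early stopping. The sketch below captures that shape; the patience value, the training-batch format, and the compute_val_loss callable are assumptions for illustration, not code from the repository.

import torch

def fit_one_model(model, optimizer, train_batches, compute_val_loss,
                  max_epochs=200, patience=5):
    # Train on minibatches of (state, action, next-state delta) and print the
    # held-out validation loss each epoch, stopping once it stops improving.
    best, epochs_since_best = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for s, a, delta in train_batches:
            pred = model(torch.cat([s, a], dim=-1))
            loss = ((pred - delta) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        val_loss = float(compute_val_loss(model))
        print(f"Validation loss = {val_loss}")
        if val_loss < best:
            best, epochs_since_best = val_loss, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break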
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
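The 20 "Obtaining samples..." lines above correspond to the 'trpo' config (iterations = 20, batch_size = 50000, horizon = 200), with samples drawn from the learned dynamics ensemble rather than the real environment. A sketch of that outer loop follows; collect_model_samples and trpo_update are assumed callables used only to show the control flow, not real APIs from this codebase.

def train_policy(policy, dynamics_ensemble, collect_model_samples, trpo_update, trpo_cfg):
    # One TRPO update per outer iteration on model-generated rollouts.
    for it in range(trpo_cfg["iterations"]):                  # 20
        print(f"Obtaining samples for iteration {it}...")
        batch = collect_model_samples(policy, dynamics_ensemble,
                                      batch_size=trpo_cfg["batch_size"],  # 50000
                                      horizon=trpo_cfg["horizon"])        # 200
        trpo_update(policy, batch,
                    step_size=trpo_cfg["step_size"],          # max KL per step = 0.01
                    gamma=trpo_cfg["gamma"],                  # 0.99
                    gae_lambda=trpo_cfg["gae"])               # 0.95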
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -49.3    |
| Iteration     | 0        |
| MaximumReturn | -42.8    |
| MinimumReturn | -54.8    |
| TotalSamples  | 8000     |
----------------------------
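The table above summarizes the six on-policy evaluation rollouts for this outer iteration. A sketch of how such a table could be assembled from the per-path returns is below; the logger actually used is not shown in this output, so the function and its formatting are illustrative.

import numpy as np

def log_iteration(itr, path_returns, total_samples):
    rows = [("AverageReturn", float(np.mean(path_returns))),
            ("Iteration", itr),
            ("MaximumReturn", float(np.max(path_returns))),
            ("MinimumReturn", float(np.min(path_returns))),
            ("TotalSamples", total_samples)]
    print("-" * 28)
    for key, val in rows:
        text = f"{val:.3g}" if isinstance(val, float) else str(val)
        print(f"| {key:<13} | {text:<8} |")
    print("-" * 28)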
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3560047447681427
Validation loss = 0.3001718819141388
Validation loss = 0.2926309108734131
Validation loss = 0.2920674681663513
Validation loss = 0.28594863414764404
Validation loss = 0.2880210876464844
Validation loss = 0.29676926136016846
Validation loss = 0.28876793384552
Validation loss = 0.3012373447418213
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.35520684719085693
Validation loss = 0.28977829217910767
Validation loss = 0.2928429841995239
Validation loss = 0.2951570451259613
Validation loss = 0.29141679406166077
Validation loss = 0.28905197978019714
Validation loss = 0.2930249571800232
Validation loss = 0.2925231456756592
Validation loss = 0.3049175441265106
Validation loss = 0.30524006485939026
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3598332405090332
Validation loss = 0.29568764567375183
Validation loss = 0.292243629693985
Validation loss = 0.28835418820381165
Validation loss = 0.28768181800842285
Validation loss = 0.2889319062232971
Validation loss = 0.2940638065338135
Validation loss = 0.3024734854698181
Validation loss = 0.3021585941314697
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3522460162639618
Validation loss = 0.29502421617507935
Validation loss = 0.2916041910648346
Validation loss = 0.29176950454711914
Validation loss = 0.2916569411754608
Validation loss = 0.28947967290878296
Validation loss = 0.3078695237636566
Validation loss = 0.29877614974975586
Validation loss = 0.29810625314712524
Validation loss = 0.315594881772995
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3547571897506714
Validation loss = 0.30028098821640015
Validation loss = 0.30210769176483154
Validation loss = 0.28827860951423645
Validation loss = 0.2928954064846039
Validation loss = 0.2949606776237488
Validation loss = 0.2920854985713959
Validation loss = 0.2957398295402527
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.85     |
| Iteration     | 1        |
| MaximumReturn | 7.1      |
| MinimumReturn | 0.702    |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2911403179168701
Validation loss = 0.28322961926460266
Validation loss = 0.27736905217170715
Validation loss = 0.2913007438182831
Validation loss = 0.28120720386505127
Validation loss = 0.29883235692977905
Validation loss = 0.29807889461517334
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2992415428161621
Validation loss = 0.28691062331199646
Validation loss = 0.2890923321247101
Validation loss = 0.29230716824531555
Validation loss = 0.3010641634464264
Validation loss = 0.298735111951828
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2979409992694855
Validation loss = 0.2807719111442566
Validation loss = 0.2732261121273041
Validation loss = 0.28111472725868225
Validation loss = 0.2896076440811157
Validation loss = 0.28561073541641235
Validation loss = 0.30059850215911865
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29973679780960083
Validation loss = 0.28863635659217834
Validation loss = 0.2893607020378113
Validation loss = 0.2910595238208771
Validation loss = 0.2944043278694153
Validation loss = 0.2979161739349365
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29281461238861084
Validation loss = 0.28637179732322693
Validation loss = 0.2741446793079376
Validation loss = 0.28565940260887146
Validation loss = 0.2798275649547577
Validation loss = 0.2881602942943573
Validation loss = 0.29253822565078735
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -11.4    |
| Iteration     | 2        |
| MaximumReturn | -7.19    |
| MinimumReturn | -14.8    |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.25653934478759766
Validation loss = 0.26730531454086304
Validation loss = 0.26384446024894714
Validation loss = 0.2721506953239441
Validation loss = 0.2697070837020874
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2546042203903198
Validation loss = 0.26422184705734253
Validation loss = 0.2640707492828369
Validation loss = 0.26643872261047363
Validation loss = 0.2635469436645508
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2563170790672302
Validation loss = 0.26027411222457886
Validation loss = 0.266396164894104
Validation loss = 0.2664143443107605
Validation loss = 0.279883474111557
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2652270197868347
Validation loss = 0.2631678283214569
Validation loss = 0.27196475863456726
Validation loss = 0.27110639214515686
Validation loss = 0.26736506819725037
Validation loss = 0.2747139632701874
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.259380042552948
Validation loss = 0.2592039108276367
Validation loss = 0.26074641942977905
Validation loss = 0.25836867094039917
Validation loss = 0.2642693519592285
Validation loss = 0.27078530192375183
Validation loss = 0.27149397134780884
Validation loss = 0.2749248147010803
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 4.19     |
| Iteration     | 3        |
| MaximumReturn | 15.7     |
| MinimumReturn | -8.9     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24512183666229248
Validation loss = 0.24433331191539764
Validation loss = 0.2487235814332962
Validation loss = 0.2498229742050171
Validation loss = 0.253372460603714
Validation loss = 0.2604522407054901
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2394426167011261
Validation loss = 0.23875832557678223
Validation loss = 0.2443006932735443
Validation loss = 0.2459188997745514
Validation loss = 0.24978478252887726
Validation loss = 0.2511135935783386
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24064597487449646
Validation loss = 0.24055472016334534
Validation loss = 0.24498483538627625
Validation loss = 0.2509806752204895
Validation loss = 0.24817101657390594
Validation loss = 0.24737294018268585
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24177834391593933
Validation loss = 0.25081416964530945
Validation loss = 0.24613694846630096
Validation loss = 0.2518298625946045
Validation loss = 0.2554275691509247
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24267229437828064
Validation loss = 0.251209557056427
Validation loss = 0.25143396854400635
Validation loss = 0.25124335289001465
Validation loss = 0.2506829798221588
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.7    |
| Iteration     | 4        |
| MaximumReturn | -5.48    |
| MinimumReturn | -15.2    |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23371650278568268
Validation loss = 0.2431882619857788
Validation loss = 0.23740153014659882
Validation loss = 0.23925961554050446
Validation loss = 0.24434077739715576
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23428499698638916
Validation loss = 0.23323112726211548
Validation loss = 0.2380334734916687
Validation loss = 0.23800045251846313
Validation loss = 0.2433023452758789
Validation loss = 0.2431376725435257
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.23535184562206268
Validation loss = 0.23488140106201172
Validation loss = 0.23604393005371094
Validation loss = 0.23768194019794464
Validation loss = 0.24136339128017426
Validation loss = 0.24413686990737915
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2336971014738083
Validation loss = 0.23899465799331665
Validation loss = 0.23933984339237213
Validation loss = 0.24615506827831268
Validation loss = 0.2397976517677307
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23324386775493622
Validation loss = 0.23314422369003296
Validation loss = 0.24119305610656738
Validation loss = 0.24284011125564575
Validation loss = 0.24618297815322876
Validation loss = 0.2426394373178482
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.1     |
| Iteration     | 5        |
| MaximumReturn | -6.41    |
| MinimumReturn | -9.19    |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21829618513584137
Validation loss = 0.21699003875255585
Validation loss = 0.2168819159269333
Validation loss = 0.22220619022846222
Validation loss = 0.21903598308563232
Validation loss = 0.22000528872013092
Validation loss = 0.225281223654747
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21821574866771698
Validation loss = 0.21805576980113983
Validation loss = 0.22092732787132263
Validation loss = 0.21838368475437164
Validation loss = 0.22341570258140564
Validation loss = 0.22609959542751312
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21380193531513214
Validation loss = 0.2208404242992401
Validation loss = 0.21992948651313782
Validation loss = 0.21526741981506348
Validation loss = 0.2201661318540573
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21390439569950104
Validation loss = 0.21744027733802795
Validation loss = 0.21951577067375183
Validation loss = 0.22599248588085175
Validation loss = 0.21952487528324127
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21914133429527283
Validation loss = 0.21984395384788513
Validation loss = 0.22170694172382355
Validation loss = 0.21980206668376923
Validation loss = 0.233409121632576
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.15    |
| Iteration     | 6        |
| MaximumReturn | -4.43    |
| MinimumReturn | -9.73    |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2070179581642151
Validation loss = 0.20388150215148926
Validation loss = 0.20591729879379272
Validation loss = 0.20563223958015442
Validation loss = 0.21131888031959534
Validation loss = 0.20930002629756927
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20653891563415527
Validation loss = 0.2073572278022766
Validation loss = 0.2061818540096283
Validation loss = 0.20617777109146118
Validation loss = 0.20729228854179382
Validation loss = 0.21330994367599487
Validation loss = 0.21340347826480865
Validation loss = 0.2101454734802246
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20033863186836243
Validation loss = 0.19957321882247925
Validation loss = 0.20414161682128906
Validation loss = 0.2054150551557541
Validation loss = 0.20218881964683533
Validation loss = 0.20715966820716858
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.20141705870628357
Validation loss = 0.19986848533153534
Validation loss = 0.2046554982662201
Validation loss = 0.2031872570514679
Validation loss = 0.20476555824279785
Validation loss = 0.2062579095363617
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20541854202747345
Validation loss = 0.20206835865974426
Validation loss = 0.20599845051765442
Validation loss = 0.20496758818626404
Validation loss = 0.20639339089393616
Validation loss = 0.2062530517578125
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.14    |
| Iteration     | 7        |
| MaximumReturn | -4.49    |
| MinimumReturn | -10.7    |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19107042253017426
Validation loss = 0.19243425130844116
Validation loss = 0.1969272643327713
Validation loss = 0.1975860744714737
Validation loss = 0.20008999109268188
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19722455739974976
Validation loss = 0.19716496765613556
Validation loss = 0.19642646610736847
Validation loss = 0.19721218943595886
Validation loss = 0.19786113500595093
Validation loss = 0.1994103044271469
Validation loss = 0.2035641372203827
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19218388199806213
Validation loss = 0.1899670660495758
Validation loss = 0.19158805906772614
Validation loss = 0.19465075433254242
Validation loss = 0.19343248009681702
Validation loss = 0.19608056545257568
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19276952743530273
Validation loss = 0.1928916722536087
Validation loss = 0.19124726951122284
Validation loss = 0.1942637711763382
Validation loss = 0.19546948373317719
Validation loss = 0.19910603761672974
Validation loss = 0.19698838889598846
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19770246744155884
Validation loss = 0.19310124218463898
Validation loss = 0.1944870501756668
Validation loss = 0.1978122442960739
Validation loss = 0.19915486872196198
Validation loss = 0.19657187163829803
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.67    |
| Iteration     | 8        |
| MaximumReturn | -7.44    |
| MinimumReturn | -11.1    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18683744966983795
Validation loss = 0.18631687760353088
Validation loss = 0.18770185112953186
Validation loss = 0.18857918679714203
Validation loss = 0.1886540949344635
Validation loss = 0.19197259843349457
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19095580279827118
Validation loss = 0.19086584448814392
Validation loss = 0.19218392670154572
Validation loss = 0.1928977370262146
Validation loss = 0.19467803835868835
Validation loss = 0.19738535583019257
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1872687041759491
Validation loss = 0.18658706545829773
Validation loss = 0.19106127321720123
Validation loss = 0.18874309957027435
Validation loss = 0.19105663895606995
Validation loss = 0.19085431098937988
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18782930076122284
Validation loss = 0.1886250525712967
Validation loss = 0.1886557638645172
Validation loss = 0.18961593508720398
Validation loss = 0.19086624681949615
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.187794491648674
Validation loss = 0.18794173002243042
Validation loss = 0.19040317833423615
Validation loss = 0.18918968737125397
Validation loss = 0.1968625783920288
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.8    |
| Iteration     | 9        |
| MaximumReturn | -9.82    |
| MinimumReturn | -11.9    |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1830860674381256
Validation loss = 0.18177279829978943
Validation loss = 0.18496227264404297
Validation loss = 0.18268875777721405
Validation loss = 0.18468023836612701
Validation loss = 0.18919964134693146
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18828855454921722
Validation loss = 0.18558014929294586
Validation loss = 0.1889665126800537
Validation loss = 0.1859971135854721
Validation loss = 0.19202204048633575
Validation loss = 0.1909816712141037
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18313854932785034
Validation loss = 0.18321767449378967
Validation loss = 0.1840393990278244
Validation loss = 0.18606337904930115
Validation loss = 0.18436193466186523
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17939843237400055
Validation loss = 0.18128839135169983
Validation loss = 0.18195365369319916
Validation loss = 0.18499521911144257
Validation loss = 0.183979332447052
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18198157846927643
Validation loss = 0.18190664052963257
Validation loss = 0.18207621574401855
Validation loss = 0.1831062287092209
Validation loss = 0.1827874779701233
Validation loss = 0.18501955270767212
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.73    |
| Iteration     | 10       |
| MaximumReturn | -7.03    |
| MinimumReturn | -11.3    |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17749850451946259
Validation loss = 0.1807384043931961
Validation loss = 0.1812305450439453
Validation loss = 0.18013282120227814
Validation loss = 0.1838199943304062
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1829937845468521
Validation loss = 0.183194100856781
Validation loss = 0.1837952733039856
Validation loss = 0.18846531212329865
Validation loss = 0.18646977841854095
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17676186561584473
Validation loss = 0.17879177629947662
Validation loss = 0.1802731156349182
Validation loss = 0.18293391168117523
Validation loss = 0.18136616051197052
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17797644436359406
Validation loss = 0.17924003303050995
Validation loss = 0.18011081218719482
Validation loss = 0.17811870574951172
Validation loss = 0.18128138780593872
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18035905063152313
Validation loss = 0.1799824982881546
Validation loss = 0.18330097198486328
Validation loss = 0.18315504491329193
Validation loss = 0.1822681874036789
Validation loss = 0.1845455914735794
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.07    |
| Iteration     | 11       |
| MaximumReturn | -7.65    |
| MinimumReturn | -11.6    |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17557589709758759
Validation loss = 0.17707563936710358
Validation loss = 0.17744125425815582
Validation loss = 0.18021684885025024
Validation loss = 0.17981313169002533
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17913618683815002
Validation loss = 0.1807321459054947
Validation loss = 0.18277686834335327
Validation loss = 0.18179605901241302
Validation loss = 0.1842227727174759
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17497263848781586
Validation loss = 0.1777271330356598
Validation loss = 0.17748667299747467
Validation loss = 0.18246927857398987
Validation loss = 0.18201400339603424
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1755075305700302
Validation loss = 0.1777123361825943
Validation loss = 0.17698881030082703
Validation loss = 0.1789456009864807
Validation loss = 0.17865203320980072
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1768708974123001
Validation loss = 0.17745430767536163
Validation loss = 0.17819854617118835
Validation loss = 0.18035411834716797
Validation loss = 0.1810855269432068
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.25    |
| Iteration     | 12       |
| MaximumReturn | -5.82    |
| MinimumReturn | -8.74    |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17448939383029938
Validation loss = 0.17589601874351501
Validation loss = 0.17721426486968994
Validation loss = 0.1786704957485199
Validation loss = 0.18100611865520477
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17941176891326904
Validation loss = 0.1779492199420929
Validation loss = 0.1791076809167862
Validation loss = 0.18174752593040466
Validation loss = 0.1823822557926178
Validation loss = 0.1852060854434967
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17586709558963776
Validation loss = 0.1746874749660492
Validation loss = 0.17983902990818024
Validation loss = 0.17834408581256866
Validation loss = 0.17883193492889404
Validation loss = 0.18217633664608002
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17522357404232025
Validation loss = 0.1777874082326889
Validation loss = 0.18017050623893738
Validation loss = 0.1799745112657547
Validation loss = 0.1765236109495163
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17875388264656067
Validation loss = 0.17723293602466583
Validation loss = 0.18071484565734863
Validation loss = 0.1797778308391571
Validation loss = 0.18157005310058594
Validation loss = 0.1818203628063202
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.8    |
| Iteration     | 13       |
| MaximumReturn | -6.34    |
| MinimumReturn | -14.9    |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17784738540649414
Validation loss = 0.17746314406394958
Validation loss = 0.17930705845355988
Validation loss = 0.18229429423809052
Validation loss = 0.18281300365924835
Validation loss = 0.18488530814647675
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18005819618701935
Validation loss = 0.18194609880447388
Validation loss = 0.1817634552717209
Validation loss = 0.18669405579566956
Validation loss = 0.18589535355567932
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17673423886299133
Validation loss = 0.17698092758655548
Validation loss = 0.18011489510536194
Validation loss = 0.18462124466896057
Validation loss = 0.1850678026676178
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17714527249336243
Validation loss = 0.17341412603855133
Validation loss = 0.17753306031227112
Validation loss = 0.17915934324264526
Validation loss = 0.17944025993347168
Validation loss = 0.1816764920949936
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17743664979934692
Validation loss = 0.17967261373996735
Validation loss = 0.18071264028549194
Validation loss = 0.18254347145557404
Validation loss = 0.18627315759658813
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.35    |
| Iteration     | 14       |
| MaximumReturn | -0.65    |
| MinimumReturn | -5.97    |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18213269114494324
Validation loss = 0.18212780356407166
Validation loss = 0.18677093088626862
Validation loss = 0.18439464271068573
Validation loss = 0.1854076385498047
Validation loss = 0.18999671936035156
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1819217950105667
Validation loss = 0.18410252034664154
Validation loss = 0.18487195670604706
Validation loss = 0.19016122817993164
Validation loss = 0.18929630517959595
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18365496397018433
Validation loss = 0.18455857038497925
Validation loss = 0.18436940014362335
Validation loss = 0.1876334249973297
Validation loss = 0.1868363469839096
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1808934360742569
Validation loss = 0.18091730773448944
Validation loss = 0.18450172245502472
Validation loss = 0.18486067652702332
Validation loss = 0.18564429879188538
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18380720913410187
Validation loss = 0.18270504474639893
Validation loss = 0.18518662452697754
Validation loss = 0.1871274709701538
Validation loss = 0.1879737824201584
Validation loss = 0.19007734954357147
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.71    |
| Iteration     | 15       |
| MaximumReturn | 2.6      |
| MinimumReturn | -3.9     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1906280368566513
Validation loss = 0.1924278438091278
Validation loss = 0.19431088864803314
Validation loss = 0.19691899418830872
Validation loss = 0.19733978807926178
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19393151998519897
Validation loss = 0.19233550131320953
Validation loss = 0.19345854222774506
Validation loss = 0.1954951137304306
Validation loss = 0.19822460412979126
Validation loss = 0.1969977617263794
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19012275338172913
Validation loss = 0.1906794309616089
Validation loss = 0.19287627935409546
Validation loss = 0.1930423080921173
Validation loss = 0.19615255296230316
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18838554620742798
Validation loss = 0.18944135308265686
Validation loss = 0.18949638307094574
Validation loss = 0.19396862387657166
Validation loss = 0.1959252655506134
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1914205104112625
Validation loss = 0.194996178150177
Validation loss = 0.19543766975402832
Validation loss = 0.19709548354148865
Validation loss = 0.1987992376089096
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.23    |
| Iteration     | 16       |
| MaximumReturn | 3.63     |
| MinimumReturn | -7.19    |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1973525881767273
Validation loss = 0.19967855513095856
Validation loss = 0.20207682251930237
Validation loss = 0.2029431313276291
Validation loss = 0.2041323035955429
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20028983056545258
Validation loss = 0.20276300609111786
Validation loss = 0.20414231717586517
Validation loss = 0.20564110577106476
Validation loss = 0.2039889544248581
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19825702905654907
Validation loss = 0.19890636205673218
Validation loss = 0.2051929533481598
Validation loss = 0.20257218182086945
Validation loss = 0.20286522805690765
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19772924482822418
Validation loss = 0.19749771058559418
Validation loss = 0.2005760818719864
Validation loss = 0.20127394795417786
Validation loss = 0.20550279319286346
Validation loss = 0.20403234660625458
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2006506472826004
Validation loss = 0.2013351172208786
Validation loss = 0.20292508602142334
Validation loss = 0.20444713532924652
Validation loss = 0.20473667979240417
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.75    |
| Iteration     | 17       |
| MaximumReturn | 1.03     |
| MinimumReturn | -8.31    |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20782658457756042
Validation loss = 0.20569272339344025
Validation loss = 0.208156555891037
Validation loss = 0.20994770526885986
Validation loss = 0.21206115186214447
Validation loss = 0.21315757930278778
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.20668821036815643
Validation loss = 0.20900140702724457
Validation loss = 0.20839305222034454
Validation loss = 0.21167629957199097
Validation loss = 0.21390144526958466
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20693761110305786
Validation loss = 0.20505373179912567
Validation loss = 0.20683696866035461
Validation loss = 0.20990309119224548
Validation loss = 0.21043190360069275
Validation loss = 0.21270810067653656
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2055884301662445
Validation loss = 0.20709334313869476
Validation loss = 0.20854048430919647
Validation loss = 0.21112944185733795
Validation loss = 0.2110396921634674
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2062518149614334
Validation loss = 0.20958347618579865
Validation loss = 0.20875367522239685
Validation loss = 0.21151450276374817
Validation loss = 0.2122456431388855
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -6.54    |
| Iteration     | 18       |
| MaximumReturn | 4.49     |
| MinimumReturn | -11.8    |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21097323298454285
Validation loss = 0.21504005789756775
Validation loss = 0.21589498221874237
Validation loss = 0.21594412624835968
Validation loss = 0.21875214576721191
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21319589018821716
Validation loss = 0.21530714631080627
Validation loss = 0.2166452407836914
Validation loss = 0.21685214340686798
Validation loss = 0.21703645586967468
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.211862713098526
Validation loss = 0.21465769410133362
Validation loss = 0.2155809849500656
Validation loss = 0.21831540763378143
Validation loss = 0.21709994971752167
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21159930527210236
Validation loss = 0.21344685554504395
Validation loss = 0.21770605444908142
Validation loss = 0.2156001627445221
Validation loss = 0.218712717294693
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21163952350616455
Validation loss = 0.2123805284500122
Validation loss = 0.2157859355211258
Validation loss = 0.21562442183494568
Validation loss = 0.2188277244567871
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.49     |
| Iteration     | 19       |
| MaximumReturn | 10.3     |
| MinimumReturn | -3.99    |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21931380033493042
Validation loss = 0.22102554142475128
Validation loss = 0.2236849069595337
Validation loss = 0.22449849545955658
Validation loss = 0.22556200623512268
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22040864825248718
Validation loss = 0.22181475162506104
Validation loss = 0.2213403880596161
Validation loss = 0.22347865998744965
Validation loss = 0.2252328097820282
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22106263041496277
Validation loss = 0.21995557844638824
Validation loss = 0.22319857776165009
Validation loss = 0.22589102387428284
Validation loss = 0.22563588619232178
Validation loss = 0.226466566324234
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2180277705192566
Validation loss = 0.21973665058612823
Validation loss = 0.22125577926635742
Validation loss = 0.22306111454963684
Validation loss = 0.22455675899982452
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21888510882854462
Validation loss = 0.22119174897670746
Validation loss = 0.22294767200946808
Validation loss = 0.2228340208530426
Validation loss = 0.225211501121521
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.22    |
| Iteration     | 20       |
| MaximumReturn | -2.95    |
| MinimumReturn | -9.09    |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22567720711231232
Validation loss = 0.226549431681633
Validation loss = 0.22778667509555817
Validation loss = 0.22811342775821686
Validation loss = 0.23220190405845642
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22524327039718628
Validation loss = 0.2258516252040863
Validation loss = 0.2295723557472229
Validation loss = 0.23028777539730072
Validation loss = 0.23178277909755707
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.22829532623291016
Validation loss = 0.22869357466697693
Validation loss = 0.22873638570308685
Validation loss = 0.2310010939836502
Validation loss = 0.23281726241111755
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22441396117210388
Validation loss = 0.22598686814308167
Validation loss = 0.2286270707845688
Validation loss = 0.2286389321088791
Validation loss = 0.23041820526123047
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2278601974248886
Validation loss = 0.22719302773475647
Validation loss = 0.22939752042293549
Validation loss = 0.2297862470149994
Validation loss = 0.23087430000305176
Validation loss = 0.23190441727638245
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.71    |
| Iteration     | 21       |
| MaximumReturn | -3.95    |
| MinimumReturn | -15.4    |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23094384372234344
Validation loss = 0.23092471063137054
Validation loss = 0.23471422493457794
Validation loss = 0.2338821142911911
Validation loss = 0.23805943131446838
Validation loss = 0.23633554577827454
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23011617362499237
Validation loss = 0.23210987448692322
Validation loss = 0.23469579219818115
Validation loss = 0.2369859367609024
Validation loss = 0.23617027699947357
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2303478866815567
Validation loss = 0.23466002941131592
Validation loss = 0.23584668338298798
Validation loss = 0.23758195340633392
Validation loss = 0.23796628415584564
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23062165081501007
Validation loss = 0.23340155184268951
Validation loss = 0.23374228179454803
Validation loss = 0.235015869140625
Validation loss = 0.2355916053056717
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23240259289741516
Validation loss = 0.23358556628227234
Validation loss = 0.23424270749092102
Validation loss = 0.2369147539138794
Validation loss = 0.2374531775712967
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -13      |
| Iteration     | 22       |
| MaximumReturn | -3.67    |
| MinimumReturn | -20.1    |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23543556034564972
Validation loss = 0.2368098497390747
Validation loss = 0.24005226790905
Validation loss = 0.2391163557767868
Validation loss = 0.24075035750865936
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23613084852695465
Validation loss = 0.23653411865234375
Validation loss = 0.2390255481004715
Validation loss = 0.24146908521652222
Validation loss = 0.2408076524734497
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2373889684677124
Validation loss = 0.2387792021036148
Validation loss = 0.2390613555908203
Validation loss = 0.2423854023218155
Validation loss = 0.24193696677684784
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23591621220111847
Validation loss = 0.23613469302654266
Validation loss = 0.2382577806711197
Validation loss = 0.23787598311901093
Validation loss = 0.23890750110149384
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23643435537815094
Validation loss = 0.2372548133134842
Validation loss = 0.2390543669462204
Validation loss = 0.24090509116649628
Validation loss = 0.24218763411045074
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -11.3    |
| Iteration     | 23       |
| MaximumReturn | -3.05    |
| MinimumReturn | -17.4    |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24077525734901428
Validation loss = 0.2386976182460785
Validation loss = 0.24316802620887756
Validation loss = 0.24316047132015228
Validation loss = 0.24627237021923065
Validation loss = 0.24463024735450745
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23750191926956177
Validation loss = 0.24040912091732025
Validation loss = 0.24232593178749084
Validation loss = 0.24379810690879822
Validation loss = 0.2452373057603836
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24068884551525116
Validation loss = 0.2410232126712799
Validation loss = 0.24305923283100128
Validation loss = 0.24363569915294647
Validation loss = 0.2442851960659027
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23788917064666748
Validation loss = 0.23859623074531555
Validation loss = 0.24205856025218964
Validation loss = 0.24192923307418823
Validation loss = 0.24408641457557678
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23913201689720154
Validation loss = 0.24167324602603912
Validation loss = 0.24183368682861328
Validation loss = 0.24425944685935974
Validation loss = 0.24459660053253174
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.23    |
| Iteration     | 24       |
| MaximumReturn | 3.56     |
| MinimumReturn | -12.4    |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24647702276706696
Validation loss = 0.2445797324180603
Validation loss = 0.24560891091823578
Validation loss = 0.24772998690605164
Validation loss = 0.24851912260055542
Validation loss = 0.24753762781620026
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24405287206172943
Validation loss = 0.24472282826900482
Validation loss = 0.24779875576496124
Validation loss = 0.24706053733825684
Validation loss = 0.24908433854579926
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24451270699501038
Validation loss = 0.2434125691652298
Validation loss = 0.24801921844482422
Validation loss = 0.24842938780784607
Validation loss = 0.24720285832881927
Validation loss = 0.24885770678520203
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24217557907104492
Validation loss = 0.24239887297153473
Validation loss = 0.24625369906425476
Validation loss = 0.2472764551639557
Validation loss = 0.24667085707187653
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.24607661366462708
Validation loss = 0.2443796843290329
Validation loss = 0.24698346853256226
Validation loss = 0.2476430982351303
Validation loss = 0.2480112910270691
Validation loss = 0.2493070513010025
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 0.178    |
| Iteration     | 25       |
| MaximumReturn | 3.85     |
| MinimumReturn | -5.37    |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24656756222248077
Validation loss = 0.24959009885787964
Validation loss = 0.2501744031906128
Validation loss = 0.2508261203765869
Validation loss = 0.25447148084640503
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24609172344207764
Validation loss = 0.2475741058588028
Validation loss = 0.25057700276374817
Validation loss = 0.25060856342315674
Validation loss = 0.2507952153682709
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.249207004904747
Validation loss = 0.25149598717689514
Validation loss = 0.2485528588294983
Validation loss = 0.2522144913673401
Validation loss = 0.25166550278663635
Validation loss = 0.25300660729408264
Validation loss = 0.25326114892959595
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24705688655376434
Validation loss = 0.24798153340816498
Validation loss = 0.24930965900421143
Validation loss = 0.2500121593475342
Validation loss = 0.25008293986320496
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2474067658185959
Validation loss = 0.24821403622627258
Validation loss = 0.2500807046890259
Validation loss = 0.25141578912734985
Validation loss = 0.2516655921936035
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.8     |
| Iteration     | 26       |
| MaximumReturn | 4.79     |
| MinimumReturn | -11.3    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2507622539997101
Validation loss = 0.25130102038383484
Validation loss = 0.2525251507759094
Validation loss = 0.2555338740348816
Validation loss = 0.25603458285331726
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2517918646335602
Validation loss = 0.2508475184440613
Validation loss = 0.25300779938697815
Validation loss = 0.2541625499725342
Validation loss = 0.25498780608177185
Validation loss = 0.2552568018436432
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25280702114105225
Validation loss = 0.25330570340156555
Validation loss = 0.2546873688697815
Validation loss = 0.25512638688087463
Validation loss = 0.25618600845336914
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24906979501247406
Validation loss = 0.251238077878952
Validation loss = 0.2515801191329956
Validation loss = 0.2534240186214447
Validation loss = 0.25438159704208374
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.25016021728515625
Validation loss = 0.2520434558391571
Validation loss = 0.2522093653678894
Validation loss = 0.253761351108551
Validation loss = 0.2540567219257355
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.49    |
| Iteration     | 27       |
| MaximumReturn | 4.54     |
| MinimumReturn | -15.5    |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2549983263015747
Validation loss = 0.25513067841529846
Validation loss = 0.2553551197052002
Validation loss = 0.2571876049041748
Validation loss = 0.2571302354335785
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2554119825363159
Validation loss = 0.2545807659626007
Validation loss = 0.25548699498176575
Validation loss = 0.2570044696331024
Validation loss = 0.25831037759780884
Validation loss = 0.25887373089790344
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25430235266685486
Validation loss = 0.2551586925983429
Validation loss = 0.25718969106674194
Validation loss = 0.25744491815567017
Validation loss = 0.2584879398345947
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2533036172389984
Validation loss = 0.25604361295700073
Validation loss = 0.2558233439922333
Validation loss = 0.25817009806632996
Validation loss = 0.2583533227443695
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2547680735588074
Validation loss = 0.2548901438713074
Validation loss = 0.25744301080703735
Validation loss = 0.25734469294548035
Validation loss = 0.25829195976257324
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.37    |
| Iteration     | 28       |
| MaximumReturn | -0.153   |
| MinimumReturn | -13.4    |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2582084834575653
Validation loss = 0.2587480843067169
Validation loss = 0.25931644439697266
Validation loss = 0.2616690993309021
Validation loss = 0.2610991299152374
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2589503526687622
Validation loss = 0.25873690843582153
Validation loss = 0.25974783301353455
Validation loss = 0.26206013560295105
Validation loss = 0.26169267296791077
Validation loss = 0.2616862952709198
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25985586643218994
Validation loss = 0.2604031562805176
Validation loss = 0.2597212791442871
Validation loss = 0.26026469469070435
Validation loss = 0.2612851560115814
Validation loss = 0.2620692551136017
Validation loss = 0.26326116919517517
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2561928629875183
Validation loss = 0.2591101825237274
Validation loss = 0.2592954635620117
Validation loss = 0.26010602712631226
Validation loss = 0.2616792321205139
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.257595956325531
Validation loss = 0.25868919491767883
Validation loss = 0.25848647952079773
Validation loss = 0.2604231536388397
Validation loss = 0.26090091466903687
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -6.9     |
| Iteration     | 29       |
| MaximumReturn | 8.61     |
| MinimumReturn | -15      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.26072075963020325
Validation loss = 0.26172658801078796
Validation loss = 0.2624328136444092
Validation loss = 0.26413366198539734
Validation loss = 0.2635153830051422
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2627923786640167
Validation loss = 0.26284223794937134
Validation loss = 0.2630281150341034
Validation loss = 0.2659202814102173
Validation loss = 0.2646659016609192
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2640637457370758
Validation loss = 0.2640029489994049
Validation loss = 0.26442909240722656
Validation loss = 0.26525428891181946
Validation loss = 0.26669421792030334
Validation loss = 0.2673245966434479
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2599368691444397
Validation loss = 0.261586457490921
Validation loss = 0.26341623067855835
Validation loss = 0.2632976770401001
Validation loss = 0.263506680727005
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.25917837023735046
Validation loss = 0.25986334681510925
Validation loss = 0.2632003724575043
Validation loss = 0.2628324031829834
Validation loss = 0.26619723439216614
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.78    |
| Iteration     | 30       |
| MaximumReturn | 10.1     |
| MinimumReturn | -14.2    |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2643822431564331
Validation loss = 0.263979434967041
Validation loss = 0.26591917872428894
Validation loss = 0.26765453815460205
Validation loss = 0.2688807249069214
Validation loss = 0.26750385761260986
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.26303350925445557
Validation loss = 0.26464754343032837
Validation loss = 0.26808884739875793
Validation loss = 0.2662200927734375
Validation loss = 0.26774099469184875
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2665674090385437
Validation loss = 0.26799553632736206
Validation loss = 0.26894497871398926
Validation loss = 0.26987314224243164
Validation loss = 0.2685137391090393
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.26277393102645874
Validation loss = 0.2629462480545044
Validation loss = 0.26516246795654297
Validation loss = 0.26677098870277405
Validation loss = 0.268499493598938
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.26419270038604736
Validation loss = 0.26497185230255127
Validation loss = 0.2649220824241638
Validation loss = 0.2671252489089966
Validation loss = 0.2683667242527008
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -6.05    |
| Iteration     | 31       |
| MaximumReturn | 11.2     |
| MinimumReturn | -16.3    |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.26808732748031616
Validation loss = 0.2688274681568146
Validation loss = 0.27056610584259033
Validation loss = 0.2705118656158447
Validation loss = 0.27136051654815674
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.26817014813423157
Validation loss = 0.26895102858543396
Validation loss = 0.27165770530700684
Validation loss = 0.2709694802761078
Validation loss = 0.2730623483657837
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.27002763748168945
Validation loss = 0.26952844858169556
Validation loss = 0.27060461044311523
Validation loss = 0.2733929753303528
Validation loss = 0.27278026938438416
Validation loss = 0.27359539270401
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2695412039756775
Validation loss = 0.26817816495895386
Validation loss = 0.268678218126297
Validation loss = 0.2704494595527649
Validation loss = 0.2727072238922119
Validation loss = 0.27230146527290344
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2659945785999298
Validation loss = 0.2678597569465637
Validation loss = 0.2688247859477997
Validation loss = 0.27057862281799316
Validation loss = 0.2715780735015869
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.19    |
| Iteration     | 32       |
| MaximumReturn | 4.32     |
| MinimumReturn | -4.86    |
| TotalSamples  | 136000   |
----------------------------
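Iteration 32 is the last one recorded in this log. Putting the phases illustrated above together, an outer loop of roughly the following shape would reproduce the structure of the log; every name below is a hypothetical stand-in rather than the actual class or function used by this codebase:

    def run_outer_loop(env, policy, models, dataset, sampler, trpo, n_iterations):
        """Model-based outer loop mirrored by this log (hedged sketch)."""
        total_samples = dataset.num_samples()                      # hypothetical running counter
        for itr in range(n_iterations):
            print(f"itr #{itr} | ")
            fit_ensemble(models, dataset.train, dataset.val)       # see earlier sketch
            update_randomness(models)                              # assumed: re-draw ensemble/particle noise
            train_policy_trpo(policy, models, sampler, trpo)       # see earlier sketch
            paths, norm = collect_rollouts_and_update_norm(env, policy)
            dataset.add(paths, norm)                               # hypothetical: fold in the new data
            total_samples += sum(len(p["rewards"]) for p in paths) # accounting may differ from the log's counter
            log_iteration_stats(paths, itr, total_samples)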
