Logging to experiments/half_cheetah/control-affine/halfcheetah_seed3421
Print configuration .....
{'env_name': 'half_cheetah', 'random_seeds': [4321, 2314, 2341, 3421], 'save_variables': False, 'model_save_dir': '/tmp/half_cheetah_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 40, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5384238958358765
Validation loss = 0.12127114832401276
Validation loss = 0.08998733013868332
Validation loss = 0.07839559018611908
Validation loss = 0.07020488381385803
Validation loss = 0.06908833980560303
Validation loss = 0.06368362903594971
Validation loss = 0.0659913644194603
Validation loss = 0.05960965156555176
Validation loss = 0.06067124009132385
Validation loss = 0.057052504271268845
Validation loss = 0.056219786405563354
Validation loss = 0.057370223104953766
Validation loss = 0.056063588708639145
Validation loss = 0.05377804487943649
Validation loss = 0.059620946645736694
Validation loss = 0.05265740305185318
Validation loss = 0.057357512414455414
Validation loss = 0.05125664919614792
Validation loss = 0.050137296319007874
Validation loss = 0.05326562002301216
Validation loss = 0.05126228928565979
Validation loss = 0.05088181048631668
Validation loss = 0.05842822045087814
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5244925022125244
Validation loss = 0.12327680736780167
Validation loss = 0.08973183482885361
Validation loss = 0.07752570509910583
Validation loss = 0.0730392336845398
Validation loss = 0.06712313741445541
Validation loss = 0.06458528339862823
Validation loss = 0.06541861593723297
Validation loss = 0.062421783804893494
Validation loss = 0.05873575061559677
Validation loss = 0.062052205204963684
Validation loss = 0.05907855182886124
Validation loss = 0.05615721642971039
Validation loss = 0.05820305645465851
Validation loss = 0.05494542419910431
Validation loss = 0.055831290781497955
Validation loss = 0.05514468625187874
Validation loss = 0.05519167706370354
Validation loss = 0.053286921232938766
Validation loss = 0.05425960570573807
Validation loss = 0.05342520773410797
Validation loss = 0.051195379346609116
Validation loss = 0.05298919975757599
Validation loss = 0.0514800101518631
Validation loss = 0.05030830204486847
Validation loss = 0.05045667663216591
Validation loss = 0.05008791387081146
Validation loss = 0.04956427961587906
Validation loss = 0.05080608278512955
Validation loss = 0.04866679385304451
Validation loss = 0.048006877303123474
Validation loss = 0.04790017008781433
Validation loss = 0.050103139132261276
Validation loss = 0.055964987725019455
Validation loss = 0.04737468063831329
Validation loss = 0.04982416331768036
Validation loss = 0.046546440571546555
Validation loss = 0.048200421035289764
Validation loss = 0.05024757236242294
Validation loss = 0.04678075388073921
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5180220603942871
Validation loss = 0.12174953520298004
Validation loss = 0.08874967694282532
Validation loss = 0.07689949870109558
Validation loss = 0.0769546777009964
Validation loss = 0.06672133505344391
Validation loss = 0.0644323080778122
Validation loss = 0.06357239186763763
Validation loss = 0.06324837356805801
Validation loss = 0.06001925468444824
Validation loss = 0.058843500912189484
Validation loss = 0.05795741081237793
Validation loss = 0.056085050106048584
Validation loss = 0.05461960658431053
Validation loss = 0.05334824323654175
Validation loss = 0.060304656624794006
Validation loss = 0.052825383841991425
Validation loss = 0.0535535104572773
Validation loss = 0.052298177033662796
Validation loss = 0.05322255939245224
Validation loss = 0.0518377348780632
Validation loss = 0.05069389194250107
Validation loss = 0.05277201160788536
Validation loss = 0.049894437193870544
Validation loss = 0.04984453320503235
Validation loss = 0.051016971468925476
Validation loss = 0.05380234122276306
Validation loss = 0.050383202731609344
Validation loss = 0.04931534826755524
Validation loss = 0.04977741837501526
Validation loss = 0.04965996742248535
Validation loss = 0.050250813364982605
Validation loss = 0.0472896471619606
Validation loss = 0.047074973583221436
Validation loss = 0.051637765020132065
Validation loss = 0.04749758914113045
Validation loss = 0.04666610062122345
Validation loss = 0.04823162406682968
Validation loss = 0.046868808567523956
Validation loss = 0.04857093095779419
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5240345597267151
Validation loss = 0.12402687966823578
Validation loss = 0.09045533835887909
Validation loss = 0.07783475518226624
Validation loss = 0.07161208242177963
Validation loss = 0.0707048624753952
Validation loss = 0.06793852150440216
Validation loss = 0.06283709406852722
Validation loss = 0.06390444934368134
Validation loss = 0.06149828061461449
Validation loss = 0.06035839021205902
Validation loss = 0.05711164325475693
Validation loss = 0.058805517852306366
Validation loss = 0.05568541958928108
Validation loss = 0.055748194456100464
Validation loss = 0.056603528559207916
Validation loss = 0.05377810820937157
Validation loss = 0.06113206595182419
Validation loss = 0.05458743870258331
Validation loss = 0.057035624980926514
Validation loss = 0.05352815240621567
Validation loss = 0.05194941908121109
Validation loss = 0.060358092188835144
Validation loss = 0.05200935900211334
Validation loss = 0.05227743461728096
Validation loss = 0.05381672456860542
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5482892394065857
Validation loss = 0.12518814206123352
Validation loss = 0.09170421957969666
Validation loss = 0.07887206971645355
Validation loss = 0.07343585044145584
Validation loss = 0.06992265582084656
Validation loss = 0.0645020380616188
Validation loss = 0.06529152393341064
Validation loss = 0.06275580823421478
Validation loss = 0.06154666468501091
Validation loss = 0.058732688426971436
Validation loss = 0.05794491618871689
Validation loss = 0.05744094401597977
Validation loss = 0.05599137023091316
Validation loss = 0.056370049715042114
Validation loss = 0.05427389591932297
Validation loss = 0.05469002574682236
Validation loss = 0.05482650548219681
Validation loss = 0.053605955094099045
Validation loss = 0.052033714950084686
Validation loss = 0.05306239426136017
Validation loss = 0.05359157547354698
Validation loss = 0.05743686109781265
Validation loss = 0.052425045520067215
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -371     |
| Iteration     | 0        |
| MaximumReturn | -311     |
| MinimumReturn | -494     |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12932521104812622
Validation loss = 0.09299300611019135
Validation loss = 0.09457477927207947
Validation loss = 0.0978911966085434
Validation loss = 0.09301014244556427
Validation loss = 0.09028215706348419
Validation loss = 0.09268979728221893
Validation loss = 0.09213274717330933
Validation loss = 0.09106706082820892
Validation loss = 0.09509649872779846
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16379810869693756
Validation loss = 0.09427347779273987
Validation loss = 0.08991517126560211
Validation loss = 0.08789917826652527
Validation loss = 0.0889868289232254
Validation loss = 0.08792318403720856
Validation loss = 0.08809082210063934
Validation loss = 0.08772150427103043
Validation loss = 0.0882950872182846
Validation loss = 0.08949622511863708
Validation loss = 0.08735182881355286
Validation loss = 0.08889642357826233
Validation loss = 0.09201142191886902
Validation loss = 0.08808676898479462
Validation loss = 0.0913914144039154
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.14703401923179626
Validation loss = 0.09298461675643921
Validation loss = 0.08872602880001068
Validation loss = 0.08776964992284775
Validation loss = 0.0897054672241211
Validation loss = 0.0884668231010437
Validation loss = 0.08793052285909653
Validation loss = 0.09366795420646667
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12930545210838318
Validation loss = 0.09711955487728119
Validation loss = 0.09344363212585449
Validation loss = 0.0923347920179367
Validation loss = 0.09106393158435822
Validation loss = 0.08962506800889969
Validation loss = 0.09217578917741776
Validation loss = 0.09376713633537292
Validation loss = 0.09264461696147919
Validation loss = 0.09414037317037582
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13671034574508667
Validation loss = 0.0960872545838356
Validation loss = 0.09128990769386292
Validation loss = 0.09222549200057983
Validation loss = 0.09142769873142242
Validation loss = 0.09193280339241028
Validation loss = 0.09092442691326141
Validation loss = 0.09451661258935928
Validation loss = 0.09452153742313385
Validation loss = 0.09931708872318268
Validation loss = 0.09566834568977356
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -378     |
| Iteration     | 1        |
| MaximumReturn | -300     |
| MinimumReturn | -493     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1115594133734703
Validation loss = 0.09439117461442947
Validation loss = 0.09749936312437057
Validation loss = 0.09503070265054703
Validation loss = 0.09595455974340439
Validation loss = 0.09552749991416931
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10691124945878983
Validation loss = 0.09436529874801636
Validation loss = 0.0951373353600502
Validation loss = 0.09545650333166122
Validation loss = 0.09413793683052063
Validation loss = 0.09876564890146255
Validation loss = 0.09989287704229355
Validation loss = 0.09673365950584412
Validation loss = 0.09520941227674484
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10160472244024277
Validation loss = 0.09528089314699173
Validation loss = 0.09500246495008469
Validation loss = 0.09380567818880081
Validation loss = 0.09662642329931259
Validation loss = 0.09558125585317612
Validation loss = 0.09932214766740799
Validation loss = 0.10041534900665283
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10786936432123184
Validation loss = 0.09602826833724976
Validation loss = 0.09517887979745865
Validation loss = 0.09594908356666565
Validation loss = 0.09834596514701843
Validation loss = 0.09900841861963272
Validation loss = 0.09760403633117676
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10613087564706802
Validation loss = 0.09621777385473251
Validation loss = 0.09761128574609756
Validation loss = 0.0979653000831604
Validation loss = 0.09863891452550888
Validation loss = 0.09967618435621262
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 173      |
| Iteration     | 2        |
| MaximumReturn | 247      |
| MinimumReturn | 55.6     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09629024565219879
Validation loss = 0.0900530219078064
Validation loss = 0.09103954583406448
Validation loss = 0.09298645704984665
Validation loss = 0.09046602249145508
Validation loss = 0.09423872828483582
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10090699791908264
Validation loss = 0.09060543775558472
Validation loss = 0.08975161612033844
Validation loss = 0.08875218033790588
Validation loss = 0.09050597995519638
Validation loss = 0.09074284136295319
Validation loss = 0.08991257846355438
Validation loss = 0.09325207024812698
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09903042018413544
Validation loss = 0.0889708623290062
Validation loss = 0.09140977263450623
Validation loss = 0.09113949537277222
Validation loss = 0.09117384254932404
Validation loss = 0.0897560715675354
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09686534851789474
Validation loss = 0.0906900092959404
Validation loss = 0.08996505290269852
Validation loss = 0.09179982542991638
Validation loss = 0.09223217517137527
Validation loss = 0.09332971274852753
Validation loss = 0.0955277532339096
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09784159064292908
Validation loss = 0.09212213009595871
Validation loss = 0.0901511162519455
Validation loss = 0.09136833250522614
Validation loss = 0.09049524366855621
Validation loss = 0.09022054076194763
Validation loss = 0.09152694791555405
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -157     |
| Iteration     | 3        |
| MaximumReturn | 18       |
| MinimumReturn | -345     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09778053313493729
Validation loss = 0.08882094919681549
Validation loss = 0.08962376415729523
Validation loss = 0.09055207669734955
Validation loss = 0.09152310341596603
Validation loss = 0.09145572036504745
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09837289154529572
Validation loss = 0.08845347166061401
Validation loss = 0.08875761926174164
Validation loss = 0.08735541999340057
Validation loss = 0.08761195838451385
Validation loss = 0.08944927155971527
Validation loss = 0.08973054587841034
Validation loss = 0.08909909427165985
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09850963950157166
Validation loss = 0.0895952507853508
Validation loss = 0.08864898979663849
Validation loss = 0.08927293121814728
Validation loss = 0.0912063866853714
Validation loss = 0.09104765206575394
Validation loss = 0.09115941822528839
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10018420219421387
Validation loss = 0.09022067487239838
Validation loss = 0.09145193547010422
Validation loss = 0.09352143108844757
Validation loss = 0.09002671390771866
Validation loss = 0.08951364457607269
Validation loss = 0.0940367802977562
Validation loss = 0.09043220430612564
Validation loss = 0.08915070444345474
Validation loss = 0.09265851229429245
Validation loss = 0.09338691830635071
Validation loss = 0.09100489318370819
Validation loss = 0.09293131530284882
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10173879563808441
Validation loss = 0.09113918989896774
Validation loss = 0.09012575447559357
Validation loss = 0.08898651599884033
Validation loss = 0.09086000174283981
Validation loss = 0.09169648587703705
Validation loss = 0.08931012451648712
Validation loss = 0.09071038663387299
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -65.5    |
| Iteration     | 4        |
| MaximumReturn | 182      |
| MinimumReturn | -339     |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09275603294372559
Validation loss = 0.08415860682725906
Validation loss = 0.08698078244924545
Validation loss = 0.08511549979448318
Validation loss = 0.08634250611066818
Validation loss = 0.08432986587285995
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0903310477733612
Validation loss = 0.08790261298418045
Validation loss = 0.0841461643576622
Validation loss = 0.08513942360877991
Validation loss = 0.08598067611455917
Validation loss = 0.0828934833407402
Validation loss = 0.08459962159395218
Validation loss = 0.0840526819229126
Validation loss = 0.08564379811286926
Validation loss = 0.08370789885520935
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09440699219703674
Validation loss = 0.08567773550748825
Validation loss = 0.08402984589338303
Validation loss = 0.08794500678777695
Validation loss = 0.08384126424789429
Validation loss = 0.0850343331694603
Validation loss = 0.08984419703483582
Validation loss = 0.08589013665914536
Validation loss = 0.08542690426111221
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09128233790397644
Validation loss = 0.08716877549886703
Validation loss = 0.08586255460977554
Validation loss = 0.08771761506795883
Validation loss = 0.08861692994832993
Validation loss = 0.08493220061063766
Validation loss = 0.08760741353034973
Validation loss = 0.08591341972351074
Validation loss = 0.08516623824834824
Validation loss = 0.0879615768790245
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09296072274446487
Validation loss = 0.08665508031845093
Validation loss = 0.08603844046592712
Validation loss = 0.08454996347427368
Validation loss = 0.08378240466117859
Validation loss = 0.08481738716363907
Validation loss = 0.08641325682401657
Validation loss = 0.0882430449128151
Validation loss = 0.08514532446861267
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -89.3    |
| Iteration     | 5        |
| MaximumReturn | 105      |
| MinimumReturn | -244     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09638380259275436
Validation loss = 0.08503538370132446
Validation loss = 0.08657244592905045
Validation loss = 0.08528464287519455
Validation loss = 0.08464410156011581
Validation loss = 0.08703500032424927
Validation loss = 0.08588260412216187
Validation loss = 0.08546411246061325
Validation loss = 0.08641739934682846
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09394795447587967
Validation loss = 0.0856703668832779
Validation loss = 0.08410556614398956
Validation loss = 0.08688418567180634
Validation loss = 0.08372210711240768
Validation loss = 0.08600030094385147
Validation loss = 0.08840499073266983
Validation loss = 0.08543573319911957
Validation loss = 0.08473258465528488
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09459511190652847
Validation loss = 0.08605407178401947
Validation loss = 0.08550982922315598
Validation loss = 0.0855218842625618
Validation loss = 0.08558301627635956
Validation loss = 0.08621228486299515
Validation loss = 0.08852804452180862
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09662275016307831
Validation loss = 0.08566409349441528
Validation loss = 0.0855942815542221
Validation loss = 0.0850413367152214
Validation loss = 0.08730632811784744
Validation loss = 0.08442027866840363
Validation loss = 0.08721853792667389
Validation loss = 0.08536700159311295
Validation loss = 0.08713878691196442
Validation loss = 0.08513877540826797
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09692370146512985
Validation loss = 0.08508239686489105
Validation loss = 0.08686430007219315
Validation loss = 0.08431972563266754
Validation loss = 0.08597632497549057
Validation loss = 0.08530281484127045
Validation loss = 0.08616011589765549
Validation loss = 0.0847175270318985
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -192     |
| Iteration     | 6        |
| MaximumReturn | -78.5    |
| MinimumReturn | -339     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09765591472387314
Validation loss = 0.08678065240383148
Validation loss = 0.0877603068947792
Validation loss = 0.0877196341753006
Validation loss = 0.09059543907642365
Validation loss = 0.08719054609537125
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09491002559661865
Validation loss = 0.08811094611883163
Validation loss = 0.08785437047481537
Validation loss = 0.08669528365135193
Validation loss = 0.08747399598360062
Validation loss = 0.08913733065128326
Validation loss = 0.08853951096534729
Validation loss = 0.08810418844223022
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09450231492519379
Validation loss = 0.08691440522670746
Validation loss = 0.08713433891534805
Validation loss = 0.08810499310493469
Validation loss = 0.08857469260692596
Validation loss = 0.09047766029834747
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09869788587093353
Validation loss = 0.08703025430440903
Validation loss = 0.08687761425971985
Validation loss = 0.08914162218570709
Validation loss = 0.08888886868953705
Validation loss = 0.08767952024936676
Validation loss = 0.08770094811916351
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09640209376811981
Validation loss = 0.08888116478919983
Validation loss = 0.08932730555534363
Validation loss = 0.08788350969552994
Validation loss = 0.08828642964363098
Validation loss = 0.08950526267290115
Validation loss = 0.08942131698131561
Validation loss = 0.08987252414226532
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -193     |
| Iteration     | 7        |
| MaximumReturn | 95.7     |
| MinimumReturn | -325     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11346452683210373
Validation loss = 0.09220018982887268
Validation loss = 0.09167216718196869
Validation loss = 0.09020520746707916
Validation loss = 0.09139558672904968
Validation loss = 0.0920683741569519
Validation loss = 0.0910637155175209
Validation loss = 0.09212111681699753
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11595787107944489
Validation loss = 0.09292607754468918
Validation loss = 0.09268173575401306
Validation loss = 0.09324745833873749
Validation loss = 0.09552528709173203
Validation loss = 0.09196749329566956
Validation loss = 0.09163637459278107
Validation loss = 0.0945180207490921
Validation loss = 0.09130728244781494
Validation loss = 0.09185773879289627
Validation loss = 0.09291534125804901
Validation loss = 0.09246622025966644
Validation loss = 0.09262064844369888
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1112256720662117
Validation loss = 0.09224504977464676
Validation loss = 0.09257476776838303
Validation loss = 0.0898425504565239
Validation loss = 0.09215312451124191
Validation loss = 0.09287241846323013
Validation loss = 0.09254565834999084
Validation loss = 0.09152872115373611
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11602000147104263
Validation loss = 0.09272153675556183
Validation loss = 0.09193893522024155
Validation loss = 0.09066203236579895
Validation loss = 0.0912337601184845
Validation loss = 0.09098164737224579
Validation loss = 0.09079310297966003
Validation loss = 0.09096987545490265
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1170184314250946
Validation loss = 0.09153273701667786
Validation loss = 0.09063854813575745
Validation loss = 0.0928475558757782
Validation loss = 0.08987287431955338
Validation loss = 0.09268707782030106
Validation loss = 0.092029869556427
Validation loss = 0.09132079035043716
Validation loss = 0.09287695586681366
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -14.3    |
| Iteration     | 8        |
| MaximumReturn | 288      |
| MinimumReturn | -284     |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0945422574877739
Validation loss = 0.09046919643878937
Validation loss = 0.08911798894405365
Validation loss = 0.09146133810281754
Validation loss = 0.09113501757383347
Validation loss = 0.09019318968057632
Validation loss = 0.09076310694217682
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0977984070777893
Validation loss = 0.09007208794355392
Validation loss = 0.08945312350988388
Validation loss = 0.0898270383477211
Validation loss = 0.09010878950357437
Validation loss = 0.08978058397769928
Validation loss = 0.08963771164417267
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09607090055942535
Validation loss = 0.0902639627456665
Validation loss = 0.09001185745000839
Validation loss = 0.09191293269395828
Validation loss = 0.08953247219324112
Validation loss = 0.09068576991558075
Validation loss = 0.09096291661262512
Validation loss = 0.09106127917766571
Validation loss = 0.09089888632297516
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0965438187122345
Validation loss = 0.09050267934799194
Validation loss = 0.08873452246189117
Validation loss = 0.08912943303585052
Validation loss = 0.08918595314025879
Validation loss = 0.08952010422945023
Validation loss = 0.09115735441446304
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09580495208501816
Validation loss = 0.08965553343296051
Validation loss = 0.0883326381444931
Validation loss = 0.08822032064199448
Validation loss = 0.08946369588375092
Validation loss = 0.08915530145168304
Validation loss = 0.09090206772089005
Validation loss = 0.08950352668762207
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 256      |
| Iteration     | 9        |
| MaximumReturn | 664      |
| MinimumReturn | -344     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09169542789459229
Validation loss = 0.0855625718832016
Validation loss = 0.08515003323554993
Validation loss = 0.08654112368822098
Validation loss = 0.08563899248838425
Validation loss = 0.08650332689285278
Validation loss = 0.08606969565153122
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09165056049823761
Validation loss = 0.08562301099300385
Validation loss = 0.0843057706952095
Validation loss = 0.08574387431144714
Validation loss = 0.0862799659371376
Validation loss = 0.08578138053417206
Validation loss = 0.08679063618183136
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09316173195838928
Validation loss = 0.08655229955911636
Validation loss = 0.08619452267885208
Validation loss = 0.08514633029699326
Validation loss = 0.0873270332813263
Validation loss = 0.08701106160879135
Validation loss = 0.08571096509695053
Validation loss = 0.0853934958577156
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0914691910147667
Validation loss = 0.08497295528650284
Validation loss = 0.08407390862703323
Validation loss = 0.08637363463640213
Validation loss = 0.0847039520740509
Validation loss = 0.08500134944915771
Validation loss = 0.08428525179624557
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.0915370061993599
Validation loss = 0.08492004126310349
Validation loss = 0.08495155721902847
Validation loss = 0.0839388519525528
Validation loss = 0.08637499809265137
Validation loss = 0.08560257405042648
Validation loss = 0.08391472697257996
Validation loss = 0.08559156209230423
Validation loss = 0.08515316247940063
Validation loss = 0.08492089062929153
Validation loss = 0.08576596528291702
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 534      |
| Iteration     | 10       |
| MaximumReturn | 1.45e+03 |
| MinimumReturn | -224     |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09067829698324203
Validation loss = 0.08201711624860764
Validation loss = 0.08369013667106628
Validation loss = 0.08212091028690338
Validation loss = 0.08448503166437149
Validation loss = 0.08475949615240097
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08751630783081055
Validation loss = 0.08344143629074097
Validation loss = 0.08305100351572037
Validation loss = 0.0821448341012001
Validation loss = 0.08226912468671799
Validation loss = 0.08182353526353836
Validation loss = 0.08219785243272781
Validation loss = 0.08089008182287216
Validation loss = 0.08174597471952438
Validation loss = 0.0823342502117157
Validation loss = 0.08080177009105682
Validation loss = 0.08262323588132858
Validation loss = 0.08114371448755264
Validation loss = 0.08083868771791458
Validation loss = 0.08140429854393005
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09061065316200256
Validation loss = 0.08197607845067978
Validation loss = 0.08398180454969406
Validation loss = 0.08237648010253906
Validation loss = 0.08206450194120407
Validation loss = 0.08379101753234863
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09225672483444214
Validation loss = 0.08064108341932297
Validation loss = 0.08353370428085327
Validation loss = 0.08181603997945786
Validation loss = 0.08096721023321152
Validation loss = 0.0809304416179657
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08863361924886703
Validation loss = 0.08226839452981949
Validation loss = 0.08098766952753067
Validation loss = 0.08206573873758316
Validation loss = 0.08124777674674988
Validation loss = 0.08170058578252792
Validation loss = 0.08163977414369583
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 351      |
| Iteration     | 11       |
| MaximumReturn | 994      |
| MinimumReturn | -325     |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08584196865558624
Validation loss = 0.07938750833272934
Validation loss = 0.07897646725177765
Validation loss = 0.08214842528104782
Validation loss = 0.07946862280368805
Validation loss = 0.0796445980668068
Validation loss = 0.07972689718008041
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09081495553255081
Validation loss = 0.07732589542865753
Validation loss = 0.0781497061252594
Validation loss = 0.07872582226991653
Validation loss = 0.078711599111557
Validation loss = 0.07885567843914032
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08935077488422394
Validation loss = 0.0801057294011116
Validation loss = 0.07891898602247238
Validation loss = 0.07970292866230011
Validation loss = 0.08054281026124954
Validation loss = 0.08069132268428802
Validation loss = 0.08012247085571289
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08760663121938705
Validation loss = 0.07710270583629608
Validation loss = 0.07681074738502502
Validation loss = 0.07776129245758057
Validation loss = 0.07860532402992249
Validation loss = 0.07899557054042816
Validation loss = 0.07789845764636993
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.085782490670681
Validation loss = 0.0770844891667366
Validation loss = 0.07856784760951996
Validation loss = 0.07844198495149612
Validation loss = 0.07827500998973846
Validation loss = 0.0780041515827179
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 113      |
| Iteration     | 12       |
| MaximumReturn | 556      |
| MinimumReturn | -334     |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08420254290103912
Validation loss = 0.07668363302946091
Validation loss = 0.07648076117038727
Validation loss = 0.07770438492298126
Validation loss = 0.0771918073296547
Validation loss = 0.07718253880739212
Validation loss = 0.07641617953777313
Validation loss = 0.07549158483743668
Validation loss = 0.07608407735824585
Validation loss = 0.07550884038209915
Validation loss = 0.0771784633398056
Validation loss = 0.07710601389408112
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08170043677091599
Validation loss = 0.0764794573187828
Validation loss = 0.07579205185174942
Validation loss = 0.07677103579044342
Validation loss = 0.07538646459579468
Validation loss = 0.07611901313066483
Validation loss = 0.07572907209396362
Validation loss = 0.07666196674108505
Validation loss = 0.07546739280223846
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08609368652105331
Validation loss = 0.07662177830934525
Validation loss = 0.07679561525583267
Validation loss = 0.07748478651046753
Validation loss = 0.07716054469347
Validation loss = 0.07740290462970734
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08405398577451706
Validation loss = 0.0751066729426384
Validation loss = 0.07404021173715591
Validation loss = 0.07408293336629868
Validation loss = 0.07529310137033463
Validation loss = 0.07413254678249359
Validation loss = 0.0759676918387413
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08617136627435684
Validation loss = 0.07595552504062653
Validation loss = 0.0759926289319992
Validation loss = 0.07497038692235947
Validation loss = 0.07471160590648651
Validation loss = 0.0752464011311531
Validation loss = 0.07562745362520218
Validation loss = 0.07718807458877563
Validation loss = 0.07541052252054214
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 184      |
| Iteration     | 13       |
| MaximumReturn | 1.19e+03 |
| MinimumReturn | -405     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08202143013477325
Validation loss = 0.07480175793170929
Validation loss = 0.07349655777215958
Validation loss = 0.07461646944284439
Validation loss = 0.0749502032995224
Validation loss = 0.07424597442150116
Validation loss = 0.07481684535741806
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0798216313123703
Validation loss = 0.07435401529073715
Validation loss = 0.07527851313352585
Validation loss = 0.07398729026317596
Validation loss = 0.0754917711019516
Validation loss = 0.07473277300596237
Validation loss = 0.07407572120428085
Validation loss = 0.07437656819820404
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08214575797319412
Validation loss = 0.07578963786363602
Validation loss = 0.07574760168790817
Validation loss = 0.07651243358850479
Validation loss = 0.07515929639339447
Validation loss = 0.07669217884540558
Validation loss = 0.07617314159870148
Validation loss = 0.07607425004243851
Validation loss = 0.07475362718105316
Validation loss = 0.07695899158716202
Validation loss = 0.07530904561281204
Validation loss = 0.07654359191656113
Validation loss = 0.07397253066301346
Validation loss = 0.07579370588064194
Validation loss = 0.07423210144042969
Validation loss = 0.07415693253278732
Validation loss = 0.074360691010952
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08278588205575943
Validation loss = 0.07335244119167328
Validation loss = 0.07394473254680634
Validation loss = 0.07358639687299728
Validation loss = 0.07389534264802933
Validation loss = 0.0745026022195816
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08011320978403091
Validation loss = 0.07498633861541748
Validation loss = 0.0743349939584732
Validation loss = 0.0736231729388237
Validation loss = 0.07540113478899002
Validation loss = 0.07553721219301224
Validation loss = 0.07395707815885544
Validation loss = 0.07606492936611176
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 837      |
| Iteration     | 14       |
| MaximumReturn | 1.63e+03 |
| MinimumReturn | 38.5     |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07881373912096024
Validation loss = 0.07406191527843475
Validation loss = 0.07462188601493835
Validation loss = 0.07265433669090271
Validation loss = 0.07344722002744675
Validation loss = 0.07210367918014526
Validation loss = 0.07373155653476715
Validation loss = 0.07375992089509964
Validation loss = 0.07428896427154541
Validation loss = 0.0724923312664032
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07807125151157379
Validation loss = 0.07345301657915115
Validation loss = 0.07436077296733856
Validation loss = 0.07498227804899216
Validation loss = 0.07324893772602081
Validation loss = 0.07329340279102325
Validation loss = 0.07294706255197525
Validation loss = 0.07318953424692154
Validation loss = 0.07431148737668991
Validation loss = 0.07359203696250916
Validation loss = 0.07325921952724457
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0807366669178009
Validation loss = 0.07317909598350525
Validation loss = 0.07354980707168579
Validation loss = 0.07370725274085999
Validation loss = 0.07367777079343796
Validation loss = 0.07492899894714355
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07701431959867477
Validation loss = 0.07374414801597595
Validation loss = 0.07289756834506989
Validation loss = 0.07189056277275085
Validation loss = 0.07221804559230804
Validation loss = 0.07205520570278168
Validation loss = 0.07213569432497025
Validation loss = 0.0726967453956604
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07822942733764648
Validation loss = 0.07301077991724014
Validation loss = 0.07346468418836594
Validation loss = 0.0737161934375763
Validation loss = 0.07366669923067093
Validation loss = 0.07348334789276123
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 42.3     |
| Iteration     | 15       |
| MaximumReturn | 787      |
| MinimumReturn | -333     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07760463654994965
Validation loss = 0.07191259413957596
Validation loss = 0.07273174822330475
Validation loss = 0.0719716027379036
Validation loss = 0.07238838076591492
Validation loss = 0.07120560854673386
Validation loss = 0.07212253659963608
Validation loss = 0.07421809434890747
Validation loss = 0.07304824143648148
Validation loss = 0.07250526547431946
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0794106051325798
Validation loss = 0.07299329340457916
Validation loss = 0.07230933010578156
Validation loss = 0.07269195467233658
Validation loss = 0.07293898612260818
Validation loss = 0.0732327476143837
Validation loss = 0.07292019575834274
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07852381467819214
Validation loss = 0.07279792428016663
Validation loss = 0.07349998503923416
Validation loss = 0.0728941410779953
Validation loss = 0.07450857758522034
Validation loss = 0.07349733263254166
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0779600590467453
Validation loss = 0.0714723989367485
Validation loss = 0.0725158154964447
Validation loss = 0.07217317819595337
Validation loss = 0.07358469814062119
Validation loss = 0.07242251932621002
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07888299971818924
Validation loss = 0.07170175015926361
Validation loss = 0.07532387971878052
Validation loss = 0.07322188466787338
Validation loss = 0.07327187061309814
Validation loss = 0.07385876029729843
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 165      |
| Iteration     | 16       |
| MaximumReturn | 694      |
| MinimumReturn | -426     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07665271311998367
Validation loss = 0.07108955830335617
Validation loss = 0.07273716479539871
Validation loss = 0.07228110730648041
Validation loss = 0.07341895252466202
Validation loss = 0.0731944888830185
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07790958881378174
Validation loss = 0.07257585972547531
Validation loss = 0.07382941246032715
Validation loss = 0.07412949204444885
Validation loss = 0.07310304045677185
Validation loss = 0.07474643737077713
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.077033132314682
Validation loss = 0.07323554158210754
Validation loss = 0.07376289367675781
Validation loss = 0.07426823675632477
Validation loss = 0.07358153164386749
Validation loss = 0.07365373522043228
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07873866707086563
Validation loss = 0.07222814112901688
Validation loss = 0.07263480871915817
Validation loss = 0.07185609638690948
Validation loss = 0.07236321270465851
Validation loss = 0.07239162921905518
Validation loss = 0.07236819714307785
Validation loss = 0.07272052764892578
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07931914925575256
Validation loss = 0.07341088354587555
Validation loss = 0.07328727841377258
Validation loss = 0.07447774708271027
Validation loss = 0.07472953200340271
Validation loss = 0.0743226483464241
Validation loss = 0.07354094833135605
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 73.9     |
| Iteration     | 17       |
| MaximumReturn | 578      |
| MinimumReturn | -232     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07806611806154251
Validation loss = 0.07254569232463837
Validation loss = 0.07217869907617569
Validation loss = 0.0728902518749237
Validation loss = 0.07405531406402588
Validation loss = 0.07426265627145767
Validation loss = 0.07310715317726135
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08141312748193741
Validation loss = 0.07259784638881683
Validation loss = 0.07370829582214355
Validation loss = 0.07466325908899307
Validation loss = 0.07449092715978622
Validation loss = 0.07282797247171402
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07759787887334824
Validation loss = 0.0731772854924202
Validation loss = 0.07277105003595352
Validation loss = 0.07495305687189102
Validation loss = 0.07275128364562988
Validation loss = 0.07332191616296768
Validation loss = 0.07150494307279587
Validation loss = 0.0734490379691124
Validation loss = 0.07645658403635025
Validation loss = 0.07385150343179703
Validation loss = 0.07327485084533691
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0778939500451088
Validation loss = 0.07153226435184479
Validation loss = 0.07314122468233109
Validation loss = 0.0722903236746788
Validation loss = 0.0718492865562439
Validation loss = 0.07208123058080673
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07786785811185837
Validation loss = 0.07361367344856262
Validation loss = 0.07357913255691528
Validation loss = 0.07402648776769638
Validation loss = 0.07317842543125153
Validation loss = 0.07410527765750885
Validation loss = 0.07364535331726074
Validation loss = 0.07275280356407166
Validation loss = 0.07295797020196915
Validation loss = 0.07288217544555664
Validation loss = 0.07430309057235718
Validation loss = 0.07377289980649948
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -156     |
| Iteration     | 18       |
| MaximumReturn | 69.1     |
| MinimumReturn | -330     |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07857118546962738
Validation loss = 0.07322412729263306
Validation loss = 0.07357548177242279
Validation loss = 0.0731423869729042
Validation loss = 0.07290208339691162
Validation loss = 0.07371976226568222
Validation loss = 0.07331329584121704
Validation loss = 0.07319710403680801
Validation loss = 0.07351437956094742
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07847096771001816
Validation loss = 0.07273927330970764
Validation loss = 0.07298724353313446
Validation loss = 0.07419027388095856
Validation loss = 0.0736144557595253
Validation loss = 0.0750647485256195
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07972876727581024
Validation loss = 0.07325064390897751
Validation loss = 0.07315413653850555
Validation loss = 0.07426450401544571
Validation loss = 0.07327505946159363
Validation loss = 0.07416381686925888
Validation loss = 0.07340589910745621
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0796273797750473
Validation loss = 0.07276823371648788
Validation loss = 0.07283537834882736
Validation loss = 0.0727870911359787
Validation loss = 0.07289513200521469
Validation loss = 0.07554741948843002
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07857788354158401
Validation loss = 0.07325190305709839
Validation loss = 0.07417192310094833
Validation loss = 0.07321430742740631
Validation loss = 0.07345116138458252
Validation loss = 0.07298070192337036
Validation loss = 0.07436288893222809
Validation loss = 0.0726989358663559
Validation loss = 0.07496997714042664
Validation loss = 0.07511777430772781
Validation loss = 0.07239512354135513
Validation loss = 0.07394369691610336
Validation loss = 0.07296781241893768
Validation loss = 0.0735698938369751
Validation loss = 0.07459133863449097
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -313     |
| Iteration     | 19       |
| MaximumReturn | -164     |
| MinimumReturn | -521     |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07711376249790192
Validation loss = 0.07276063412427902
Validation loss = 0.07356583327054977
Validation loss = 0.07391393929719925
Validation loss = 0.07421474158763885
Validation loss = 0.0743267759680748
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0778617113828659
Validation loss = 0.07267244905233383
Validation loss = 0.07328728586435318
Validation loss = 0.07382708787918091
Validation loss = 0.07450089603662491
Validation loss = 0.07371160387992859
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07682456076145172
Validation loss = 0.07376762479543686
Validation loss = 0.07399322092533112
Validation loss = 0.07369784265756607
Validation loss = 0.07323829084634781
Validation loss = 0.0746627002954483
Validation loss = 0.07354503870010376
Validation loss = 0.07342474907636642
Validation loss = 0.07571035623550415
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07778853178024292
Validation loss = 0.07245049625635147
Validation loss = 0.07346029579639435
Validation loss = 0.07353894412517548
Validation loss = 0.07346788048744202
Validation loss = 0.07432946562767029
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07772713154554367
Validation loss = 0.07335557043552399
Validation loss = 0.07322333753108978
Validation loss = 0.07421083003282547
Validation loss = 0.07521101832389832
Validation loss = 0.07368012517690659
Validation loss = 0.07392764836549759
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -281     |
| Iteration     | 20       |
| MaximumReturn | -11.1    |
| MinimumReturn | -474     |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07742311805486679
Validation loss = 0.07437204569578171
Validation loss = 0.07415464520454407
Validation loss = 0.0742209404706955
Validation loss = 0.07384047657251358
Validation loss = 0.0751434862613678
Validation loss = 0.07445827126502991
Validation loss = 0.07503685355186462
Validation loss = 0.07384984940290451
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07948429882526398
Validation loss = 0.0734490305185318
Validation loss = 0.07372719049453735
Validation loss = 0.07549163699150085
Validation loss = 0.0741780698299408
Validation loss = 0.07444819062948227
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07718754559755325
Validation loss = 0.07291296124458313
Validation loss = 0.0730690211057663
Validation loss = 0.0746692344546318
Validation loss = 0.07373085618019104
Validation loss = 0.07578472048044205
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07528482377529144
Validation loss = 0.0729401707649231
Validation loss = 0.0727369487285614
Validation loss = 0.07361484318971634
Validation loss = 0.07358840107917786
Validation loss = 0.07323602586984634
Validation loss = 0.07412414997816086
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07728556543588638
Validation loss = 0.0733303502202034
Validation loss = 0.0741230696439743
Validation loss = 0.07465872168540955
Validation loss = 0.07331952452659607
Validation loss = 0.07408562302589417
Validation loss = 0.07380000501871109
Validation loss = 0.07403240352869034
Validation loss = 0.07407915592193604
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -364     |
| Iteration     | 21       |
| MaximumReturn | -243     |
| MinimumReturn | -574     |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07774978131055832
Validation loss = 0.07391320914030075
Validation loss = 0.07554399222135544
Validation loss = 0.07651820033788681
Validation loss = 0.0752686858177185
Validation loss = 0.07699994742870331
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07763917744159698
Validation loss = 0.07551787048578262
Validation loss = 0.07509852945804596
Validation loss = 0.07699651271104813
Validation loss = 0.07603135704994202
Validation loss = 0.07534544169902802
Validation loss = 0.07566212862730026
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07772162556648254
Validation loss = 0.07567031681537628
Validation loss = 0.07433351874351501
Validation loss = 0.07522553205490112
Validation loss = 0.07533737272024155
Validation loss = 0.07537759095430374
Validation loss = 0.07544763386249542
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07787404954433441
Validation loss = 0.07382330298423767
Validation loss = 0.07511057704687119
Validation loss = 0.07536521553993225
Validation loss = 0.07476697862148285
Validation loss = 0.0755116194486618
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07998762279748917
Validation loss = 0.07508533447980881
Validation loss = 0.07486654072999954
Validation loss = 0.07538631558418274
Validation loss = 0.07514374703168869
Validation loss = 0.07560095191001892
Validation loss = 0.0747879147529602
Validation loss = 0.07501745969057083
Validation loss = 0.07439125329256058
Validation loss = 0.07645083963871002
Validation loss = 0.07537733018398285
Validation loss = 0.07494580000638962
Validation loss = 0.07525195926427841
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 335      |
| Iteration     | 22       |
| MaximumReturn | 2.35e+03 |
| MinimumReturn | -500     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07824195176362991
Validation loss = 0.07568634301424026
Validation loss = 0.07577956467866898
Validation loss = 0.077631875872612
Validation loss = 0.07568260282278061
Validation loss = 0.07650095969438553
Validation loss = 0.076508529484272
Validation loss = 0.07794474810361862
Validation loss = 0.0769500657916069
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07853088527917862
Validation loss = 0.07610281556844711
Validation loss = 0.07790931314229965
Validation loss = 0.07666108757257462
Validation loss = 0.0786198303103447
Validation loss = 0.07713458687067032
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07894950360059738
Validation loss = 0.075799860060215
Validation loss = 0.07711592316627502
Validation loss = 0.07664410024881363
Validation loss = 0.07716742157936096
Validation loss = 0.0772557482123375
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07862459123134613
Validation loss = 0.07586697489023209
Validation loss = 0.07563629001379013
Validation loss = 0.07465536147356033
Validation loss = 0.0759853720664978
Validation loss = 0.07563117891550064
Validation loss = 0.07659956812858582
Validation loss = 0.07564040273427963
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07948243618011475
Validation loss = 0.07573618739843369
Validation loss = 0.07585934549570084
Validation loss = 0.07639763504266739
Validation loss = 0.07723203301429749
Validation loss = 0.07601744681596756
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -167     |
| Iteration     | 23       |
| MaximumReturn | 260      |
| MinimumReturn | -470     |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07995408028364182
Validation loss = 0.07574906945228577
Validation loss = 0.0763971358537674
Validation loss = 0.07645025104284286
Validation loss = 0.07693567126989365
Validation loss = 0.0771978348493576
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07961644977331161
Validation loss = 0.07692930102348328
Validation loss = 0.07819870859384537
Validation loss = 0.07774100452661514
Validation loss = 0.07674764096736908
Validation loss = 0.07769562304019928
Validation loss = 0.0787917822599411
Validation loss = 0.07770256698131561
Validation loss = 0.07766109704971313
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07937003672122955
Validation loss = 0.07618726044893265
Validation loss = 0.07753899693489075
Validation loss = 0.07747185975313187
Validation loss = 0.07656174898147583
Validation loss = 0.07744782418012619
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07959679514169693
Validation loss = 0.07577574998140335
Validation loss = 0.07546192407608032
Validation loss = 0.07575686275959015
Validation loss = 0.07652857899665833
Validation loss = 0.07584628462791443
Validation loss = 0.07589055597782135
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08015318959951401
Validation loss = 0.07443296164274216
Validation loss = 0.07579517364501953
Validation loss = 0.07624553889036179
Validation loss = 0.07828889042139053
Validation loss = 0.0766427293419838
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -97.2    |
| Iteration     | 24       |
| MaximumReturn | 769      |
| MinimumReturn | -484     |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19674107432365417
Validation loss = 0.2478126585483551
Validation loss = 0.26738837361335754
Validation loss = 0.3080602288246155
Validation loss = 0.2167455554008484
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24257172644138336
Validation loss = 0.26826515793800354
Validation loss = 0.23246979713439941
Validation loss = 0.2517381012439728
Validation loss = 0.247597798705101
Validation loss = 0.289678156375885
Validation loss = 0.2363436073064804
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17893263697624207
Validation loss = 0.19062568247318268
Validation loss = 0.20218269526958466
Validation loss = 0.18115878105163574
Validation loss = 0.16857601702213287
Validation loss = 0.17510178685188293
Validation loss = 0.19257093966007233
Validation loss = 0.2014131397008896
Validation loss = 0.19257010519504547
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22257757186889648
Validation loss = 0.19906704127788544
Validation loss = 0.20283110439777374
Validation loss = 0.22094887495040894
Validation loss = 0.24839118123054504
Validation loss = 0.2252468764781952
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2093116194009781
Validation loss = 0.206412211060524
Validation loss = 0.21617203950881958
Validation loss = 0.20888768136501312
Validation loss = 0.21842846274375916
Validation loss = 0.22383509576320648
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 282      |
| Iteration     | 25       |
| MaximumReturn | 1.53e+03 |
| MinimumReturn | -344     |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32717812061309814
Validation loss = 0.3582758605480194
Validation loss = 0.34168741106987
Validation loss = 0.43795180320739746
Validation loss = 0.3756912648677826
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41847649216651917
Validation loss = 0.3895491063594818
Validation loss = 0.40983688831329346
Validation loss = 0.40613195300102234
Validation loss = 0.4202345013618469
Validation loss = 0.40676620602607727
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3036152422428131
Validation loss = 0.3147181272506714
Validation loss = 0.2930038273334503
Validation loss = 0.3249818682670593
Validation loss = 0.3146165907382965
Validation loss = 0.26237955689430237
Validation loss = 0.3116551637649536
Validation loss = 0.30664101243019104
Validation loss = 0.291588693857193
Validation loss = 0.30616843700408936
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3255630433559418
Validation loss = 0.35309043526649475
Validation loss = 0.35246700048446655
Validation loss = 0.35190120339393616
Validation loss = 0.3555331528186798
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3351409137248993
Validation loss = 0.3115232586860657
Validation loss = 0.3341454863548279
Validation loss = 0.3471645414829254
Validation loss = 0.2881641089916229
Validation loss = 0.33083412051200867
Validation loss = 0.293277382850647
Validation loss = 0.34689992666244507
Validation loss = 0.3093903362751007
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 33.3     |
| Iteration     | 26       |
| MaximumReturn | 536      |
| MinimumReturn | -313     |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3802650272846222
Validation loss = 0.42864423990249634
Validation loss = 0.400397926568985
Validation loss = 0.49352970719337463
Validation loss = 0.41602692008018494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41476964950561523
Validation loss = 0.4090065658092499
Validation loss = 0.4426541328430176
Validation loss = 0.3956589698791504
Validation loss = 0.33937785029411316
Validation loss = 0.35310882329940796
Validation loss = 0.34471049904823303
Validation loss = 0.32036349177360535
Validation loss = 0.41166988015174866
Validation loss = 0.3945837616920471
Validation loss = 0.3830769658088684
Validation loss = 0.3373720943927765
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31767454743385315
Validation loss = 0.2920823097229004
Validation loss = 0.30569028854370117
Validation loss = 0.3381461203098297
Validation loss = 0.2836509644985199
Validation loss = 0.3143014907836914
Validation loss = 0.32600364089012146
Validation loss = 0.3137211799621582
Validation loss = 0.3285233676433563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3689115047454834
Validation loss = 0.3211580216884613
Validation loss = 0.36053794622421265
Validation loss = 0.360108345746994
Validation loss = 0.3217306137084961
Validation loss = 0.34166601300239563
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34962958097457886
Validation loss = 0.2910691201686859
Validation loss = 0.31352710723876953
Validation loss = 0.2683770954608917
Validation loss = 0.29532313346862793
Validation loss = 0.2898339629173279
Validation loss = 0.2974722981452942
Validation loss = 0.31659767031669617
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 185      |
| Iteration     | 27       |
| MaximumReturn | 823      |
| MinimumReturn | -224     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42455482482910156
Validation loss = 0.3347072899341583
Validation loss = 0.3578566312789917
Validation loss = 0.314863920211792
Validation loss = 0.34605973958969116
Validation loss = 0.41410356760025024
Validation loss = 0.3718857169151306
Validation loss = 0.3881744146347046
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42330339550971985
Validation loss = 0.3592950105667114
Validation loss = 0.3460266590118408
Validation loss = 0.34951040148735046
Validation loss = 0.3364957273006439
Validation loss = 0.38888514041900635
Validation loss = 0.36185529828071594
Validation loss = 0.309611439704895
Validation loss = 0.37061741948127747
Validation loss = 0.3314554691314697
Validation loss = 0.31911057233810425
Validation loss = 0.2853195369243622
Validation loss = 0.3190572261810303
Validation loss = 0.3040977418422699
Validation loss = 0.34018293023109436
Validation loss = 0.3134048283100128
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.28719401359558105
Validation loss = 0.3628165125846863
Validation loss = 0.38674527406692505
Validation loss = 0.3783060312271118
Validation loss = 0.3517220616340637
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33692994713783264
Validation loss = 0.34879887104034424
Validation loss = 0.37580451369285583
Validation loss = 0.3260672688484192
Validation loss = 0.3692634105682373
Validation loss = 0.3567427694797516
Validation loss = 0.33858582377433777
Validation loss = 0.3560820519924164
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.280320405960083
Validation loss = 0.34910550713539124
Validation loss = 0.3498040437698364
Validation loss = 0.30676382780075073
Validation loss = 0.27784937620162964
Validation loss = 0.30140548944473267
Validation loss = 0.2971208691596985
Validation loss = 0.2843593955039978
Validation loss = 0.3115977346897125
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -136     |
| Iteration     | 28       |
| MaximumReturn | 733      |
| MinimumReturn | -345     |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.0187841653823853
Validation loss = 1.3569477796554565
Validation loss = 1.2683237791061401
Validation loss = 1.137540340423584
Validation loss = 1.164770483970642
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 1.1098780632019043
Validation loss = 1.0104172229766846
Validation loss = 1.1847453117370605
Validation loss = 0.9127537608146667
Validation loss = 0.9519796967506409
Validation loss = 0.9734078645706177
Validation loss = 0.9869815707206726
Validation loss = 1.0599008798599243
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.9358164668083191
Validation loss = 0.9345169067382812
Validation loss = 1.1468473672866821
Validation loss = 0.7960761785507202
Validation loss = 0.9885228276252747
Validation loss = 0.9350776076316833
Validation loss = 0.9459695219993591
Validation loss = 0.8476409912109375
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.273375391960144
Validation loss = 1.048500418663025
Validation loss = 1.1687318086624146
Validation loss = 0.9869139194488525
Validation loss = 1.0091369152069092
Validation loss = 1.1022930145263672
Validation loss = 0.9786837697029114
Validation loss = 0.8091761469841003
Validation loss = 0.9679979681968689
Validation loss = 1.0356792211532593
Validation loss = 0.9586054682731628
Validation loss = 0.89113849401474
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6192758083343506
Validation loss = 0.7387923002243042
Validation loss = 0.7916167974472046
Validation loss = 0.7754087448120117
Validation loss = 0.7892515063285828
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 422      |
| Iteration     | 29       |
| MaximumReturn | 1.79e+03 |
| MinimumReturn | -601     |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.1961452960968018
Validation loss = 1.0817961692810059
Validation loss = 1.4321599006652832
Validation loss = 1.015294075012207
Validation loss = 1.0303022861480713
Validation loss = 0.970357358455658
Validation loss = 0.9845036864280701
Validation loss = 1.0902518033981323
Validation loss = 0.8240804076194763
Validation loss = 1.085761547088623
Validation loss = 0.9077001214027405
Validation loss = 1.0780701637268066
Validation loss = 1.1131181716918945
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9392364621162415
Validation loss = 0.9462934136390686
Validation loss = 0.983606219291687
Validation loss = 1.0101717710494995
Validation loss = 0.8907625675201416
Validation loss = 0.9368287920951843
Validation loss = 0.967434823513031
Validation loss = 1.0445277690887451
Validation loss = 1.0388461351394653
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8500168323516846
Validation loss = 0.8249132633209229
Validation loss = 0.846370279788971
Validation loss = 0.9329278469085693
Validation loss = 0.9026942849159241
Validation loss = 0.9887844920158386
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.9370409846305847
Validation loss = 0.9319495558738708
Validation loss = 1.085554838180542
Validation loss = 0.9843136072158813
Validation loss = 0.9554541110992432
Validation loss = 0.8954575061798096
Validation loss = 1.024042010307312
Validation loss = 0.8329963684082031
Validation loss = 0.9570057988166809
Validation loss = 0.791671872138977
Validation loss = 0.9033695459365845
Validation loss = 1.0120971202850342
Validation loss = 0.8613669276237488
Validation loss = 0.8345367312431335
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.778113603591919
Validation loss = 0.8241655826568604
Validation loss = 0.7093165516853333
Validation loss = 0.743570864200592
Validation loss = 0.7474300861358643
Validation loss = 0.7135193347930908
Validation loss = 0.811890721321106
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 169      |
| Iteration     | 30       |
| MaximumReturn | 515      |
| MinimumReturn | -131     |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.9287685751914978
Validation loss = 1.1170868873596191
Validation loss = 1.12665855884552
Validation loss = 0.9741562008857727
Validation loss = 0.9716371297836304
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9663434624671936
Validation loss = 1.0138429403305054
Validation loss = 0.9103513956069946
Validation loss = 0.92686527967453
Validation loss = 0.8382868766784668
Validation loss = 0.7595720887184143
Validation loss = 0.8271040916442871
Validation loss = 0.7752953767776489
Validation loss = 0.9156520962715149
Validation loss = 0.8508877158164978
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8537744283676147
Validation loss = 0.9538377523422241
Validation loss = 0.8221355676651001
Validation loss = 0.7352519035339355
Validation loss = 0.7675169110298157
Validation loss = 0.7603943347930908
Validation loss = 0.8171564340591431
Validation loss = 0.7992549538612366
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8761202096939087
Validation loss = 0.8156064748764038
Validation loss = 0.87582927942276
Validation loss = 0.8287808895111084
Validation loss = 0.8092207908630371
Validation loss = 0.7331485152244568
Validation loss = 0.7934489250183105
Validation loss = 0.7250522375106812
Validation loss = 0.9768993854522705
Validation loss = 0.6976116895675659
Validation loss = 0.7752859592437744
Validation loss = 0.8016228675842285
Validation loss = 0.8057280778884888
Validation loss = 0.6982851624488831
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6446163058280945
Validation loss = 0.6434407830238342
Validation loss = 0.757153332233429
Validation loss = 0.6925497651100159
Validation loss = 0.7954553365707397
Validation loss = 0.7438235282897949
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 149      |
| Iteration     | 31       |
| MaximumReturn | 1.49e+03 |
| MinimumReturn | -296     |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7733984589576721
Validation loss = 1.0491445064544678
Validation loss = 1.0430668592453003
Validation loss = 0.8188300132751465
Validation loss = 0.8646441102027893
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.8156052231788635
Validation loss = 0.695037305355072
Validation loss = 0.8263463973999023
Validation loss = 0.744194507598877
Validation loss = 0.7741745114326477
Validation loss = 0.7472533583641052
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7560486793518066
Validation loss = 0.7369709610939026
Validation loss = 0.9983973503112793
Validation loss = 0.8934766054153442
Validation loss = 0.9556320309638977
Validation loss = 0.7038134932518005
Validation loss = 0.7571874260902405
Validation loss = 0.7179439663887024
Validation loss = 0.9358580112457275
Validation loss = 0.7708702683448792
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5684965252876282
Validation loss = 0.7361681461334229
Validation loss = 0.823074221611023
Validation loss = 0.6745849847793579
Validation loss = 0.6773937940597534
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.600671112537384
Validation loss = 0.636212944984436
Validation loss = 0.6813568472862244
Validation loss = 0.7448086142539978
Validation loss = 0.7308581471443176
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Obtaining samples for iteration 20...
Obtaining samples for iteration 21...
Obtaining samples for iteration 22...
Obtaining samples for iteration 23...
Obtaining samples for iteration 24...
Obtaining samples for iteration 25...
Obtaining samples for iteration 26...
Obtaining samples for iteration 27...
Obtaining samples for iteration 28...
Obtaining samples for iteration 29...
Obtaining samples for iteration 30...
Obtaining samples for iteration 31...
Obtaining samples for iteration 32...
Obtaining samples for iteration 33...
Obtaining samples for iteration 34...
Obtaining samples for iteration 35...
Obtaining samples for iteration 36...
Obtaining samples for iteration 37...
Obtaining samples for iteration 38...
Obtaining samples for iteration 39...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 1000.
Path 2 | total_timesteps 2000.
Path 3 | total_timesteps 3000.
Path 4 | total_timesteps 4000.
Path 5 | total_timesteps 5000.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 212      |
| Iteration     | 32       |
| MaximumReturn | 1.32e+03 |
| MinimumReturn | -386     |
| TotalSamples  | 136000   |
----------------------------
