Logging to experiments/invertedPendulum/IPO01/Tue-01-Nov-2022-09-49-35-PM-CDT_invertedPendulum_trpo_iteration_20_seed2531
Print configuration .....
{'env_name': 'invertedPendulum',
 'random_seeds': [3214, 2431, 2531, 2231],
 'save_variables': False,
 'model_save_dir': '/tmp/invertedPendulum_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 80,
 'num_path_random': 25,
 'num_path_onpol': 25,
 'env_horizon': 100,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'reg_coeff': 0.0,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [64, 64],
            'init_logstd': 0.0,
            'activation': 'tanh',
            'reinitialize_every_itr': False},
 'trpo': {'horizon': 100,
          'gamma': 0.99,
          'step_size': 0.01,
          'iterations': 20,
          'batch_size': 50000,
          'gae': 0.95,
          'visualization': False,
          'visualize_iterations': [0]},
 'algo': 'trpo'}
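The configuration above is a single nested Python dict. A minimal sketch of how such a config could be loaded and read with dotted keys is shown below; the file name, `load_config`, and `cfg_get` are illustrative helpers and are not part of the original code.

```python
import json

def load_config(path):
    """Load the experiment configuration from a JSON file (hypothetical location)."""
    with open(path) as f:
        return json.load(f)

def cfg_get(cfg, dotted_key, default=None):
    """Fetch a nested value, e.g. cfg_get(cfg, 'dynamics.hidden_size') -> 1000."""
    node = cfg
    for key in dotted_key.split('.'):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# Example usage with values from the printed configuration:
# hidden = cfg_get(cfg, 'dynamics.hidden_size')   # 1000
# gamma  = cfg_get(cfg, 'trpo.gamma')             # 0.99
```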
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating random rollouts.
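Each of the 25 random paths runs for the configured env_horizon of 100 steps, so the counter printed at the start of each path advances in increments of 100 (2,500 random timesteps in total). A minimal sketch of such a random-rollout loop, assuming an (old) Gym-style environment API; the function and field names are illustrative, not the project's actual sampler.

```python
import numpy as np

def sample_random_paths(env, num_paths=25, horizon=100):
    """Collect paths by sampling uniformly random actions (Gym-style API assumed)."""
    paths, total_timesteps = [], 0
    for i in range(num_paths):
        print(f"Path {i} | total_timesteps {total_timesteps}.")
        obs = env.reset()
        observations, actions, rewards = [], [], []
        for _ in range(horizon):
            act = env.action_space.sample()            # uniform random action
            next_obs, rew, done, _ = env.step(act)
            observations.append(obs)
            actions.append(act)
            rewards.append(rew)
            obs = next_obs
            total_timesteps += 1
            if done:
                break
        paths.append({"observations": np.array(observations),
                      "actions": np.array(actions),
                      "rewards": np.array(rewards)})
    return paths
```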
Creating normalization for training data.
Done creating normalization for training data.
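"Creating normalization" presumably computes per-dimension statistics of the collected data so the dynamics model trains on standardized inputs. A small sketch of that idea follows; exactly which quantities are normalized (observations, actions, one-step state deltas) is an assumption here, not taken from the code.

```python
import numpy as np

def compute_normalization(paths, eps=1e-8):
    """Per-dimension mean/std over observations, actions, and one-step state deltas."""
    obs = np.concatenate([p["observations"][:-1] for p in paths])
    next_obs = np.concatenate([p["observations"][1:] for p in paths])
    acts = np.concatenate([p["actions"][:-1] for p in paths])
    deltas = next_obs - obs
    stats = {}
    for name, data in [("obs", obs), ("act", acts), ("delta", deltas)]:
        stats[name] = (data.mean(axis=0), data.std(axis=0) + eps)
    return stats

def normalize(x, mean, std):
    """Standardize data with previously computed statistics."""
    return (x - mean) / std
```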
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized.
Train dynamics model with intrinsic reward only? False
Pre-training enabled: the pre-training phase uses intrinsic reward only.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
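Per the configuration, each ensemble member is a 4-layer, 1000-unit ReLU network, and 5 members are initialized independently. The NumPy sketch below only illustrates that configured shape, under the assumption that each model maps a concatenated [observation, action] input to a (normalized) next-state delta; it is not the actual NNDynamicsModel.

```python
import numpy as np

def init_mlp(in_dim, out_dim, n_layers=4, hidden_size=1000, seed=0):
    """Randomly initialize an MLP with the configured shape (ReLU hidden layers)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden_size] * n_layers + [out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Predict the (normalized) next-state delta from a [obs, action] input."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU on hidden layers only
    return h

# An ensemble is simply 5 such models with different random initializations:
# ensemble = [init_mlp(obs_dim + act_dim, obs_dim, seed=s) for s in range(5)]
```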
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7815340757369995
Validation loss = 0.669297993183136
Validation loss = 0.6700093746185303
Validation loss = 0.6333994269371033
Validation loss = 0.6233624219894409
Validation loss = 0.6087600588798523
Validation loss = 0.582490086555481
Validation loss = 0.5795150995254517
Validation loss = 0.579343318939209
Validation loss = 0.5621255040168762
Validation loss = 0.5406492948532104
Validation loss = 0.5581356883049011
Validation loss = 0.5568869709968567
Validation loss = 0.5310177206993103
Validation loss = 0.5373954772949219
Validation loss = 0.5422204732894897
Validation loss = 0.5265740752220154
Validation loss = 0.5097392797470093
Validation loss = 0.5057187676429749
Validation loss = 0.5108961462974548
Validation loss = 0.4955434501171112
Validation loss = 0.49269217252731323
Validation loss = 0.4946732521057129
Validation loss = 0.4847513735294342
Validation loss = 0.4941599369049072
Validation loss = 0.48786064982414246
Validation loss = 0.493030309677124
Validation loss = 0.49390220642089844
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7926673889160156
Validation loss = 0.6706017255783081
Validation loss = 0.653425395488739
Validation loss = 0.642903745174408
Validation loss = 0.6289808750152588
Validation loss = 0.5956524014472961
Validation loss = 0.5835779309272766
Validation loss = 0.5711221098899841
Validation loss = 0.5523700714111328
Validation loss = 0.5516834259033203
Validation loss = 0.53696209192276
Validation loss = 0.5499619841575623
Validation loss = 0.5310871005058289
Validation loss = 0.5358145236968994
Validation loss = 0.5642130970954895
Validation loss = 0.5265483260154724
Validation loss = 0.5112556219100952
Validation loss = 0.5284712314605713
Validation loss = 0.5024002194404602
Validation loss = 0.5079091191291809
Validation loss = 0.49956682324409485
Validation loss = 0.5007246732711792
Validation loss = 0.5067882537841797
Validation loss = 0.49935469031333923
Validation loss = 0.4864952862262726
Validation loss = 0.4899927079677582
Validation loss = 0.4972699284553528
Validation loss = 0.4882802665233612
Validation loss = 0.48230138421058655
Validation loss = 0.478507936000824
Validation loss = 0.5167037844657898
Validation loss = 0.497918963432312
Validation loss = 0.4774326980113983
Validation loss = 0.47954699397087097
Validation loss = 0.482976496219635
Validation loss = 0.4838026762008667
Validation loss = 0.4863215982913971
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7857430577278137
Validation loss = 0.6874276995658875
Validation loss = 0.6565315127372742
Validation loss = 0.6391744017601013
Validation loss = 0.6114532351493835
Validation loss = 0.5955147743225098
Validation loss = 0.5773739814758301
Validation loss = 0.5642900466918945
Validation loss = 0.5534350275993347
Validation loss = 0.5514144897460938
Validation loss = 0.547514021396637
Validation loss = 0.5515011548995972
Validation loss = 0.5362029671669006
Validation loss = 0.5277618169784546
Validation loss = 0.5290741324424744
Validation loss = 0.5129565596580505
Validation loss = 0.5277835726737976
Validation loss = 0.5106580853462219
Validation loss = 0.506759524345398
Validation loss = 0.5194932222366333
Validation loss = 0.5231485962867737
Validation loss = 0.5016638040542603
Validation loss = 0.4778786301612854
Validation loss = 0.4880337715148926
Validation loss = 0.49771708250045776
Validation loss = 0.47924599051475525
Validation loss = 0.4901863634586334
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7916157245635986
Validation loss = 0.6810358166694641
Validation loss = 0.6629029512405396
Validation loss = 0.670437216758728
Validation loss = 0.6452195644378662
Validation loss = 0.6049264073371887
Validation loss = 0.5932837724685669
Validation loss = 0.5785854458808899
Validation loss = 0.5666413307189941
Validation loss = 0.566689670085907
Validation loss = 0.5447025895118713
Validation loss = 0.553727388381958
Validation loss = 0.5449034571647644
Validation loss = 0.5344929099082947
Validation loss = 0.568343997001648
Validation loss = 0.5239472389221191
Validation loss = 0.5343448519706726
Validation loss = 0.5141977071762085
Validation loss = 0.5289149880409241
Validation loss = 0.5170512199401855
Validation loss = 0.5166527032852173
Validation loss = 0.5108944177627563
Validation loss = 0.511958122253418
Validation loss = 0.5117257833480835
Validation loss = 0.5001881718635559
Validation loss = 0.5044747591018677
Validation loss = 0.5006065964698792
Validation loss = 0.5011460185050964
Validation loss = 0.49794766306877136
Validation loss = 0.5078670978546143
Validation loss = 0.4960666596889496
Validation loss = 0.4729134142398834
Validation loss = 0.4915994703769684
Validation loss = 0.48486629128456116
Validation loss = 0.479861319065094
Validation loss = 0.47578099370002747
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7964408993721008
Validation loss = 0.6893855929374695
Validation loss = 0.6650180816650391
Validation loss = 0.6371585726737976
Validation loss = 0.6220595836639404
Validation loss = 0.5923371315002441
Validation loss = 0.5758240222930908
Validation loss = 0.5551583170890808
Validation loss = 0.5404914021492004
Validation loss = 0.5395047664642334
Validation loss = 0.5537908673286438
Validation loss = 0.523439347743988
Validation loss = 0.5216366648674011
Validation loss = 0.527530312538147
Validation loss = 0.5121862292289734
Validation loss = 0.5189074873924255
Validation loss = 0.5143255591392517
Validation loss = 0.5089734196662903
Validation loss = 0.5065643787384033
Validation loss = 0.5148844718933105
Validation loss = 0.5094330310821533
Validation loss = 0.49905726313591003
Validation loss = 0.5093100070953369
Validation loss = 0.5077794194221497
Validation loss = 0.5026921629905701
Validation loss = 0.4997245967388153
Done fitting dynamics.
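Each model above prints one validation loss per training epoch, and the number of printed epochs varies from model to model even though the configured maximum is 200 epochs. This pattern suggests training stops once the held-out loss stops improving; the exact stopping rule is not visible in the log, so the patience-based criterion below is an assumption, and `train_epoch`/`validate` are placeholders for the real training and validation steps.

```python
def fit_with_early_stopping(train_epoch, validate, max_epochs=200, patience=5):
    """Train for up to max_epochs, stopping after `patience` epochs without improvement.

    train_epoch(): runs one pass over the training minibatches.
    validate():    returns the current scalar validation loss.
    """
    best, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        val_loss = validate()
        print(f"Validation loss = {val_loss}")
        if val_loss < best:
            best, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best
```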
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
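The TRPO configuration specifies gamma = 0.99 and gae = 0.95, so the 20 inner policy iterations presumably estimate advantages with Generalized Advantage Estimation. The sketch below is the standard GAE recursion on a single trajectory, assuming a value-function baseline `values`; it is not the project's exact implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` must contain one extra entry for the state after the final step
    (use 0.0 when the trajectory ends in a terminal state).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# Value-function regression targets are then advantages + values[:-1].
```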
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -84.8    |
| Iteration     | 0        |
| MaximumReturn | -34.1    |
| MinimumReturn | -111     |
| TotalSamples  | 3332     |
----------------------------
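The tabular block above summarizes the 25 evaluation rollouts (undiscounted return per path) together with the cumulative sample count. A minimal sketch of how such statistics could be computed and printed from a list of paths is given below; the formatting helper is illustrative only, and TotalSamples is passed through from the caller rather than recomputed.

```python
import numpy as np

def log_return_stats(paths, iteration, total_samples):
    """Print a small summary table of per-path returns (layout mimics the log above)."""
    returns = np.array([np.sum(p["rewards"]) for p in paths])
    rows = [("AverageReturn", f"{returns.mean():.3g}"),
            ("Iteration", str(iteration)),
            ("MaximumReturn", f"{returns.max():.3g}"),
            ("MinimumReturn", f"{returns.min():.3g}"),
            ("TotalSamples", str(total_samples))]
    print("-" * 28)
    for key, val in rows:
        print(f"| {key:<13} | {val:<8} |")
    print("-" * 28)
```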
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6076717972755432
Validation loss = 0.501759946346283
Validation loss = 0.4793491065502167
Validation loss = 0.474861741065979
Validation loss = 0.45817187428474426
Validation loss = 0.46122270822525024
Validation loss = 0.453140527009964
Validation loss = 0.4574255347251892
Validation loss = 0.453139066696167
Validation loss = 0.4542882442474365
Validation loss = 0.4494694173336029
Validation loss = 0.45839667320251465
Validation loss = 0.45748651027679443
Validation loss = 0.4484185576438904
Validation loss = 0.44379544258117676
Validation loss = 0.44185400009155273
Validation loss = 0.43703460693359375
Validation loss = 0.45201122760772705
Validation loss = 0.4477655589580536
Validation loss = 0.4461155831813812
Validation loss = 0.43891555070877075
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6783204078674316
Validation loss = 0.5251312851905823
Validation loss = 0.501554012298584
Validation loss = 0.47224828600883484
Validation loss = 0.46837499737739563
Validation loss = 0.45981326699256897
Validation loss = 0.45135563611984253
Validation loss = 0.4489199221134186
Validation loss = 0.4468705654144287
Validation loss = 0.4532071650028229
Validation loss = 0.45830410718917847
Validation loss = 0.44256365299224854
Validation loss = 0.44290339946746826
Validation loss = 0.44327837228775024
Validation loss = 0.4340880215167999
Validation loss = 0.4312273859977722
Validation loss = 0.44318506121635437
Validation loss = 0.4344213604927063
Validation loss = 0.44277846813201904
Validation loss = 0.4298180639743805
Validation loss = 0.43430644273757935
Validation loss = 0.42583906650543213
Validation loss = 0.42485880851745605
Validation loss = 0.4288839101791382
Validation loss = 0.4330829083919525
Validation loss = 0.4283619821071625
Validation loss = 0.4279178977012634
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.642192542552948
Validation loss = 0.5248562693595886
Validation loss = 0.4888920783996582
Validation loss = 0.46860283613204956
Validation loss = 0.4548133909702301
Validation loss = 0.4530676603317261
Validation loss = 0.4511455297470093
Validation loss = 0.44919538497924805
Validation loss = 0.4458485245704651
Validation loss = 0.4459421634674072
Validation loss = 0.447638601064682
Validation loss = 0.4357457458972931
Validation loss = 0.4341343343257904
Validation loss = 0.4469616115093231
Validation loss = 0.43436098098754883
Validation loss = 0.4381859004497528
Validation loss = 0.4244460463523865
Validation loss = 0.4299282431602478
Validation loss = 0.4221790134906769
Validation loss = 0.42558664083480835
Validation loss = 0.4306603968143463
Validation loss = 0.4291303753852844
Validation loss = 0.41830652952194214
Validation loss = 0.4243348240852356
Validation loss = 0.4469158351421356
Validation loss = 0.4090644419193268
Validation loss = 0.4095320403575897
Validation loss = 0.40351998805999756
Validation loss = 0.4189954400062561
Validation loss = 0.4314885437488556
Validation loss = 0.41834789514541626
Validation loss = 0.4249090254306793
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6493261456489563
Validation loss = 0.5212435126304626
Validation loss = 0.4827207624912262
Validation loss = 0.46609488129615784
Validation loss = 0.4598134458065033
Validation loss = 0.4474914073944092
Validation loss = 0.4537654221057892
Validation loss = 0.4622759521007538
Validation loss = 0.45602476596832275
Validation loss = 0.4464755952358246
Validation loss = 0.4601190686225891
Validation loss = 0.4511784017086029
Validation loss = 0.4356198012828827
Validation loss = 0.46281376481056213
Validation loss = 0.45103058218955994
Validation loss = 0.4394577741622925
Validation loss = 0.4416702389717102
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6373680830001831
Validation loss = 0.5191048383712769
Validation loss = 0.48880866169929504
Validation loss = 0.46291863918304443
Validation loss = 0.4619208872318268
Validation loss = 0.44809308648109436
Validation loss = 0.44543787837028503
Validation loss = 0.44926366209983826
Validation loss = 0.4484163820743561
Validation loss = 0.44273871183395386
Validation loss = 0.4291363060474396
Validation loss = 0.43390771746635437
Validation loss = 0.4338134527206421
Validation loss = 0.4324224293231964
Validation loss = 0.4304017722606659
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -40.8    |
| Iteration     | 1        |
| MaximumReturn | -0.119   |
| MinimumReturn | -99.3    |
| TotalSamples  | 4998     |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6091864109039307
Validation loss = 0.45661455392837524
Validation loss = 0.42386651039123535
Validation loss = 0.40509071946144104
Validation loss = 0.3921370804309845
Validation loss = 0.38660740852355957
Validation loss = 0.3740707039833069
Validation loss = 0.3673360347747803
Validation loss = 0.3615410327911377
Validation loss = 0.36022263765335083
Validation loss = 0.36480170488357544
Validation loss = 0.36815330386161804
Validation loss = 0.34930360317230225
Validation loss = 0.35125744342803955
Validation loss = 0.35701650381088257
Validation loss = 0.3505352735519409
Validation loss = 0.3542993664741516
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6032956838607788
Validation loss = 0.46428313851356506
Validation loss = 0.4311649799346924
Validation loss = 0.4122992753982544
Validation loss = 0.40078288316726685
Validation loss = 0.38691842555999756
Validation loss = 0.3811721205711365
Validation loss = 0.37220442295074463
Validation loss = 0.36563485860824585
Validation loss = 0.36887142062187195
Validation loss = 0.36709141731262207
Validation loss = 0.34654292464256287
Validation loss = 0.3584441542625427
Validation loss = 0.34624457359313965
Validation loss = 0.3527679145336151
Validation loss = 0.3638836145401001
Validation loss = 0.3614066541194916
Validation loss = 0.3512302041053772
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6580938100814819
Validation loss = 0.45784541964530945
Validation loss = 0.4202946424484253
Validation loss = 0.3966429531574249
Validation loss = 0.3834150433540344
Validation loss = 0.3820301592350006
Validation loss = 0.37244725227355957
Validation loss = 0.36542415618896484
Validation loss = 0.35873061418533325
Validation loss = 0.34998226165771484
Validation loss = 0.34936970472335815
Validation loss = 0.3671060800552368
Validation loss = 0.3539047837257385
Validation loss = 0.34474077820777893
Validation loss = 0.34826385974884033
Validation loss = 0.349040150642395
Validation loss = 0.34172531962394714
Validation loss = 0.33880186080932617
Validation loss = 0.35436391830444336
Validation loss = 0.34175026416778564
Validation loss = 0.3399254083633423
Validation loss = 0.3453301787376404
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5941752791404724
Validation loss = 0.47311973571777344
Validation loss = 0.43737947940826416
Validation loss = 0.4153115153312683
Validation loss = 0.4003024697303772
Validation loss = 0.39364093542099
Validation loss = 0.4025138020515442
Validation loss = 0.38146233558654785
Validation loss = 0.3759457468986511
Validation loss = 0.3699952960014343
Validation loss = 0.36461296677589417
Validation loss = 0.36445045471191406
Validation loss = 0.3602977693080902
Validation loss = 0.3571891486644745
Validation loss = 0.3546803295612335
Validation loss = 0.3724309504032135
Validation loss = 0.3567395806312561
Validation loss = 0.358875572681427
Validation loss = 0.3495456278324127
Validation loss = 0.35445791482925415
Validation loss = 0.33881381154060364
Validation loss = 0.3392271399497986
Validation loss = 0.3410032391548157
Validation loss = 0.34369152784347534
Validation loss = 0.3427143096923828
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5713770985603333
Validation loss = 0.44631630182266235
Validation loss = 0.4116765558719635
Validation loss = 0.3973926901817322
Validation loss = 0.3790872097015381
Validation loss = 0.3870321214199066
Validation loss = 0.36756858229637146
Validation loss = 0.36747658252716064
Validation loss = 0.37958449125289917
Validation loss = 0.3578111231327057
Validation loss = 0.37553220987319946
Validation loss = 0.3568553924560547
Validation loss = 0.3528926968574524
Validation loss = 0.34827184677124023
Validation loss = 0.3528164327144623
Validation loss = 0.34465640783309937
Validation loss = 0.3532183766365051
Validation loss = 0.35816895961761475
Validation loss = 0.33642682433128357
Validation loss = 0.3420584201812744
Validation loss = 0.3461493253707886
Validation loss = 0.3467922508716583
Validation loss = 0.34085720777511597
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -21.6    |
| Iteration     | 2        |
| MaximumReturn | -0.168   |
| MinimumReturn | -68.6    |
| TotalSamples  | 6664     |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4437815248966217
Validation loss = 0.34877315163612366
Validation loss = 0.34260591864585876
Validation loss = 0.3319522440433502
Validation loss = 0.3402552604675293
Validation loss = 0.3335408866405487
Validation loss = 0.33567914366722107
Validation loss = 0.3247818648815155
Validation loss = 0.3299216032028198
Validation loss = 0.31871578097343445
Validation loss = 0.3370843231678009
Validation loss = 0.3297983407974243
Validation loss = 0.3257439434528351
Validation loss = 0.3275565207004547
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3929724395275116
Validation loss = 0.33816108107566833
Validation loss = 0.3260650634765625
Validation loss = 0.32001203298568726
Validation loss = 0.3261328935623169
Validation loss = 0.33759281039237976
Validation loss = 0.33292967081069946
Validation loss = 0.32591167092323303
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.39170360565185547
Validation loss = 0.3398877680301666
Validation loss = 0.3224444091320038
Validation loss = 0.33434638381004333
Validation loss = 0.33067336678504944
Validation loss = 0.32386091351509094
Validation loss = 0.3267362415790558
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.40503156185150146
Validation loss = 0.34331297874450684
Validation loss = 0.329569011926651
Validation loss = 0.3220101296901703
Validation loss = 0.32512447237968445
Validation loss = 0.32243913412094116
Validation loss = 0.3234971761703491
Validation loss = 0.32511916756629944
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4022747576236725
Validation loss = 0.3466252088546753
Validation loss = 0.341219425201416
Validation loss = 0.3272289037704468
Validation loss = 0.3284217119216919
Validation loss = 0.3254915773868561
Validation loss = 0.3232465088367462
Validation loss = 0.32616302371025085
Validation loss = 0.3362340033054352
Validation loss = 0.3302280306816101
Validation loss = 0.3283192217350006
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.0688  |
| Iteration     | 3        |
| MaximumReturn | -0.0358  |
| MinimumReturn | -0.126   |
| TotalSamples  | 8330     |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38686880469322205
Validation loss = 0.3391594886779785
Validation loss = 0.3350820541381836
Validation loss = 0.34123682975769043
Validation loss = 0.3333718478679657
Validation loss = 0.3405172526836395
Validation loss = 0.33685147762298584
Validation loss = 0.3405846953392029
Validation loss = 0.3344590663909912
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3777482211589813
Validation loss = 0.337632954120636
Validation loss = 0.3284352421760559
Validation loss = 0.32743749022483826
Validation loss = 0.3302247226238251
Validation loss = 0.32860493659973145
Validation loss = 0.3385406732559204
Validation loss = 0.3376912474632263
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3699529469013214
Validation loss = 0.3354191482067108
Validation loss = 0.3280009627342224
Validation loss = 0.3264892101287842
Validation loss = 0.3368048071861267
Validation loss = 0.3370998799800873
Validation loss = 0.3397449254989624
Validation loss = 0.3334711194038391
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3833387792110443
Validation loss = 0.3350397050380707
Validation loss = 0.33168020844459534
Validation loss = 0.3296758830547333
Validation loss = 0.3310917317867279
Validation loss = 0.3308255672454834
Validation loss = 0.3317379951477051
Validation loss = 0.334841251373291
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.36502063274383545
Validation loss = 0.3308684229850769
Validation loss = 0.33954501152038574
Validation loss = 0.33039402961730957
Validation loss = 0.33395570516586304
Validation loss = 0.33572718501091003
Validation loss = 0.32974371314048767
Validation loss = 0.33648818731307983
Validation loss = 0.3388890326023102
Validation loss = 0.33891239762306213
Validation loss = 0.33784154057502747
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.0785  |
| Iteration     | 4        |
| MaximumReturn | -0.0257  |
| MinimumReturn | -0.208   |
| TotalSamples  | 9996     |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35896602272987366
Validation loss = 0.3344138562679291
Validation loss = 0.3344723582267761
Validation loss = 0.333983838558197
Validation loss = 0.3341667056083679
Validation loss = 0.3356909453868866
Validation loss = 0.3375043272972107
Validation loss = 0.3453132212162018
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.35793107748031616
Validation loss = 0.3290742039680481
Validation loss = 0.33315232396125793
Validation loss = 0.33184438943862915
Validation loss = 0.3358933925628662
Validation loss = 0.33791908621788025
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.35834723711013794
Validation loss = 0.3362967371940613
Validation loss = 0.33485373854637146
Validation loss = 0.33069056272506714
Validation loss = 0.3346622586250305
Validation loss = 0.3424733281135559
Validation loss = 0.3379881680011749
Validation loss = 0.34183865785598755
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35193705558776855
Validation loss = 0.32923632860183716
Validation loss = 0.33203762769699097
Validation loss = 0.3329784870147705
Validation loss = 0.33003178238868713
Validation loss = 0.3320203423500061
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34529489278793335
Validation loss = 0.3355289101600647
Validation loss = 0.33443188667297363
Validation loss = 0.3371373414993286
Validation loss = 0.33617082238197327
Validation loss = 0.3363652527332306
Validation loss = 0.3391786217689514
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.6    |
| Iteration     | 5        |
| MaximumReturn | -0.0308  |
| MinimumReturn | -56      |
| TotalSamples  | 11662    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3611486256122589
Validation loss = 0.3517143726348877
Validation loss = 0.34219467639923096
Validation loss = 0.34013625979423523
Validation loss = 0.3521268963813782
Validation loss = 0.3438590168952942
Validation loss = 0.35607367753982544
Validation loss = 0.3501790761947632
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3551245927810669
Validation loss = 0.33840638399124146
Validation loss = 0.33816465735435486
Validation loss = 0.34320032596588135
Validation loss = 0.34713485836982727
Validation loss = 0.3585767447948456
Validation loss = 0.3419654071331024
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3690814673900604
Validation loss = 0.33764296770095825
Validation loss = 0.3465406000614166
Validation loss = 0.3426167964935303
Validation loss = 0.3422148823738098
Validation loss = 0.3474159836769104
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3476014733314514
Validation loss = 0.34191858768463135
Validation loss = 0.3446138799190521
Validation loss = 0.3436930775642395
Validation loss = 0.3407209515571594
Validation loss = 0.340701699256897
Validation loss = 0.3506135046482086
Validation loss = 0.34516599774360657
Validation loss = 0.3569455146789551
Validation loss = 0.3471895456314087
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3596240282058716
Validation loss = 0.34757381677627563
Validation loss = 0.3470165729522705
Validation loss = 0.34463265538215637
Validation loss = 0.3448426425457001
Validation loss = 0.3474477231502533
Validation loss = 0.3471537232398987
Validation loss = 0.3449457287788391
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.1     |
| Iteration     | 6        |
| MaximumReturn | -0.0362  |
| MinimumReturn | -0.298   |
| TotalSamples  | 13328    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3462270200252533
Validation loss = 0.3430529832839966
Validation loss = 0.34749627113342285
Validation loss = 0.3472626507282257
Validation loss = 0.34048667550086975
Validation loss = 0.35016393661499023
Validation loss = 0.34251415729522705
Validation loss = 0.3435203731060028
Validation loss = 0.3490823209285736
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3364533483982086
Validation loss = 0.33831241726875305
Validation loss = 0.3336489200592041
Validation loss = 0.3374805748462677
Validation loss = 0.34241625666618347
Validation loss = 0.3415060043334961
Validation loss = 0.33974799513816833
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3601226508617401
Validation loss = 0.33635929226875305
Validation loss = 0.3519711196422577
Validation loss = 0.3390934467315674
Validation loss = 0.3441774547100067
Validation loss = 0.3412647545337677
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35098835825920105
Validation loss = 0.3336961269378662
Validation loss = 0.3386337459087372
Validation loss = 0.338031142950058
Validation loss = 0.3514370024204254
Validation loss = 0.3371369540691376
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.341586709022522
Validation loss = 0.33793720602989197
Validation loss = 0.3440147638320923
Validation loss = 0.33574244379997253
Validation loss = 0.34739208221435547
Validation loss = 0.3425808846950531
Validation loss = 0.34352871775627136
Validation loss = 0.3451291024684906
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -15.1    |
| Iteration     | 7        |
| MaximumReturn | -0.0401  |
| MinimumReturn | -66.8    |
| TotalSamples  | 14994    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.34563398361206055
Validation loss = 0.3353937268257141
Validation loss = 0.33831316232681274
Validation loss = 0.3423871397972107
Validation loss = 0.3381299376487732
Validation loss = 0.34298983216285706
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33346104621887207
Validation loss = 0.3342640697956085
Validation loss = 0.3286724388599396
Validation loss = 0.33901020884513855
Validation loss = 0.33194664120674133
Validation loss = 0.33298614621162415
Validation loss = 0.3346923291683197
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3334219455718994
Validation loss = 0.3327915370464325
Validation loss = 0.33898159861564636
Validation loss = 0.3407207131385803
Validation loss = 0.3373848497867584
Validation loss = 0.33546900749206543
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.335120290517807
Validation loss = 0.33233144879341125
Validation loss = 0.3343246281147003
Validation loss = 0.3366805911064148
Validation loss = 0.3331584632396698
Validation loss = 0.33594825863838196
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33408692479133606
Validation loss = 0.33758148550987244
Validation loss = 0.3411038815975189
Validation loss = 0.3359329402446747
Validation loss = 0.33939164876937866
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.61    |
| Iteration     | 8        |
| MaximumReturn | -0.0258  |
| MinimumReturn | -36.5    |
| TotalSamples  | 16660    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3407618999481201
Validation loss = 0.3491348326206207
Validation loss = 0.34079280495643616
Validation loss = 0.3439350724220276
Validation loss = 0.3387743830680847
Validation loss = 0.3468100428581238
Validation loss = 0.3517625331878662
Validation loss = 0.34465092420578003
Validation loss = 0.34581148624420166
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3484889566898346
Validation loss = 0.3291485011577606
Validation loss = 0.3321657180786133
Validation loss = 0.32993048429489136
Validation loss = 0.3300868272781372
Validation loss = 0.333803653717041
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33823227882385254
Validation loss = 0.3292540907859802
Validation loss = 0.34676048159599304
Validation loss = 0.33077144622802734
Validation loss = 0.33599886298179626
Validation loss = 0.3366941809654236
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.338064968585968
Validation loss = 0.3338908553123474
Validation loss = 0.3362123966217041
Validation loss = 0.3377078175544739
Validation loss = 0.3421308994293213
Validation loss = 0.3392832279205322
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34026193618774414
Validation loss = 0.3423333764076233
Validation loss = 0.3375314176082611
Validation loss = 0.3413149118423462
Validation loss = 0.34362998604774475
Validation loss = 0.3417397141456604
Validation loss = 0.3421068489551544
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -11.4    |
| Iteration     | 9        |
| MaximumReturn | -0.0242  |
| MinimumReturn | -67.9    |
| TotalSamples  | 18326    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.34322068095207214
Validation loss = 0.3454228341579437
Validation loss = 0.3475138545036316
Validation loss = 0.34759411215782166
Validation loss = 0.3449307680130005
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33457306027412415
Validation loss = 0.33308935165405273
Validation loss = 0.33191531896591187
Validation loss = 0.3359304964542389
Validation loss = 0.342950701713562
Validation loss = 0.3337292969226837
Validation loss = 0.3331843912601471
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3399549126625061
Validation loss = 0.3351867198944092
Validation loss = 0.3353242874145508
Validation loss = 0.3424636721611023
Validation loss = 0.3334091901779175
Validation loss = 0.3548012971878052
Validation loss = 0.33866024017333984
Validation loss = 0.3432384729385376
Validation loss = 0.3403548300266266
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33183836936950684
Validation loss = 0.337587833404541
Validation loss = 0.3376251757144928
Validation loss = 0.33594101667404175
Validation loss = 0.3347231149673462
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3414570093154907
Validation loss = 0.34185177087783813
Validation loss = 0.33987483382225037
Validation loss = 0.33520257472991943
Validation loss = 0.3422413170337677
Validation loss = 0.33916717767715454
Validation loss = 0.34959161281585693
Validation loss = 0.3463760316371918
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -121     |
| Iteration     | 10       |
| MaximumReturn | -49.8    |
| MinimumReturn | -158     |
| TotalSamples  | 19992    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33783289790153503
Validation loss = 0.34114012122154236
Validation loss = 0.3379131853580475
Validation loss = 0.3352527618408203
Validation loss = 0.34926965832710266
Validation loss = 0.3398173451423645
Validation loss = 0.3374280631542206
Validation loss = 0.34163719415664673
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3328455090522766
Validation loss = 0.3346089720726013
Validation loss = 0.3248694837093353
Validation loss = 0.329393208026886
Validation loss = 0.3364538550376892
Validation loss = 0.3281123638153076
Validation loss = 0.3302319347858429
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3418746590614319
Validation loss = 0.335019052028656
Validation loss = 0.33200183510780334
Validation loss = 0.33156079053878784
Validation loss = 0.3399161696434021
Validation loss = 0.342518150806427
Validation loss = 0.33557185530662537
Validation loss = 0.3418612480163574
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3410264849662781
Validation loss = 0.33212152123451233
Validation loss = 0.3276238441467285
Validation loss = 0.3303076922893524
Validation loss = 0.3319886326789856
Validation loss = 0.33812761306762695
Validation loss = 0.3314310312271118
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3407144546508789
Validation loss = 0.33741992712020874
Validation loss = 0.33731192350387573
Validation loss = 0.338796466588974
Validation loss = 0.3426627516746521
Validation loss = 0.33891087770462036
Validation loss = 0.3427659273147583
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -81.9    |
| Iteration     | 11       |
| MaximumReturn | -0.336   |
| MinimumReturn | -159     |
| TotalSamples  | 21658    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33876097202301025
Validation loss = 0.3391515612602234
Validation loss = 0.3351971507072449
Validation loss = 0.34671980142593384
Validation loss = 0.34460243582725525
Validation loss = 0.3391692042350769
Validation loss = 0.34694766998291016
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3241158127784729
Validation loss = 0.32693055272102356
Validation loss = 0.3337034583091736
Validation loss = 0.3337465524673462
Validation loss = 0.3313371539115906
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33284008502960205
Validation loss = 0.3274711072444916
Validation loss = 0.33777084946632385
Validation loss = 0.33359915018081665
Validation loss = 0.3352421224117279
Validation loss = 0.3354565501213074
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33171775937080383
Validation loss = 0.33535268902778625
Validation loss = 0.32862454652786255
Validation loss = 0.3278816342353821
Validation loss = 0.33132147789001465
Validation loss = 0.3355106711387634
Validation loss = 0.3334203362464905
Validation loss = 0.33525505661964417
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3354790210723877
Validation loss = 0.3357667922973633
Validation loss = 0.33764809370040894
Validation loss = 0.33770471811294556
Validation loss = 0.3423392176628113
Done fitting dynamics.
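Editor's note: each ensemble member prints a different number of "Validation loss = ..." lines (five to ten in the blocks above), which is consistent with, but not confirmed to be, validation-based early stopping. Under that assumption, the stopping rule would look roughly like this (the callables are placeholders):

```python
# Speculative sketch: stop fitting a model once the held-out loss has not
# improved for `patience` consecutive epochs, which would explain why each
# ensemble member logs a different number of validation-loss lines.
def train_with_early_stopping(fit_epoch, validation_loss, max_epochs, patience=3):
    best, since_best = float("inf"), 0
    for _ in range(max_epochs):
        fit_epoch()                  # one pass over the training data
        val = validation_loss()      # held-out loss after this epoch
        print(f"Validation loss = {val}")
        if val < best:
            best, since_best = val, 0
        else:
            since_best += 1
        if since_best >= patience:
            break
    return best
```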
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -127     |
| Iteration     | 12       |
| MaximumReturn | -2.3     |
| MinimumReturn | -165     |
| TotalSamples  | 23324    |
----------------------------
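Editor's note: since the remainder of this log repeats the same fit/train/rollout/table cycle once per iteration, the most useful thing to do with it is usually to extract the learning curve. The small parser below (not part of the training code) pulls the AverageReturn value for each Iteration out of the summary tables in a log file formatted like this one, e.g. for plotting.

```python
# Helper for post-processing this log format: collect (Iteration, AverageReturn).
import re

def read_average_returns(log_path):
    iters, rets = [], []
    cur_ret = None
    with open(log_path) as f:
        for line in f:
            m = re.match(r"\|\s*AverageReturn\s*\|\s*(-?[\d.eE+-]+)\s*\|", line)
            if m:
                cur_ret = float(m.group(1))
                continue
            m = re.match(r"\|\s*Iteration\s*\|\s*(\d+)\s*\|", line)
            if m and cur_ret is not None:
                iters.append(int(m.group(1)))
                rets.append(cur_ret)
                cur_ret = None
    return iters, rets

# Usage: it, ret = read_average_returns("training.log")
```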
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32890284061431885
Validation loss = 0.33563730120658875
Validation loss = 0.33817142248153687
Validation loss = 0.3390294313430786
Validation loss = 0.3397655189037323
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3223899006843567
Validation loss = 0.32603079080581665
Validation loss = 0.32468482851982117
Validation loss = 0.32628104090690613
Validation loss = 0.32717952132225037
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33249589800834656
Validation loss = 0.32883256673812866
Validation loss = 0.3283405900001526
Validation loss = 0.33116939663887024
Validation loss = 0.33433815836906433
Validation loss = 0.3321611285209656
Validation loss = 0.33482789993286133
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3275021016597748
Validation loss = 0.32654523849487305
Validation loss = 0.3406026065349579
Validation loss = 0.33204999566078186
Validation loss = 0.33031007647514343
Validation loss = 0.32818278670310974
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32719886302948
Validation loss = 0.3348746597766876
Validation loss = 0.33702972531318665
Validation loss = 0.3370168209075928
Validation loss = 0.3379006087779999
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -156     |
| Iteration     | 13       |
| MaximumReturn | -110     |
| MinimumReturn | -176     |
| TotalSamples  | 24990    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32375046610832214
Validation loss = 0.3268756568431854
Validation loss = 0.32667502760887146
Validation loss = 0.3287518322467804
Validation loss = 0.33553028106689453
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3167332112789154
Validation loss = 0.3204168379306793
Validation loss = 0.31844601035118103
Validation loss = 0.32339873909950256
Validation loss = 0.3231932818889618
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32373014092445374
Validation loss = 0.3214386999607086
Validation loss = 0.32441937923431396
Validation loss = 0.32800063490867615
Validation loss = 0.32654666900634766
Validation loss = 0.32981303334236145
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32021042704582214
Validation loss = 0.3234005868434906
Validation loss = 0.3195945620536804
Validation loss = 0.32404181361198425
Validation loss = 0.324972540140152
Validation loss = 0.32283297181129456
Validation loss = 0.32628461718559265
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3225872218608856
Validation loss = 0.32924050092697144
Validation loss = 0.32999736070632935
Validation loss = 0.33129724860191345
Validation loss = 0.32819291949272156
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -146     |
| Iteration     | 14       |
| MaximumReturn | -58.3    |
| MinimumReturn | -176     |
| TotalSamples  | 26656    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3195013403892517
Validation loss = 0.3227201998233795
Validation loss = 0.3251096308231354
Validation loss = 0.321590781211853
Validation loss = 0.3213234543800354
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3092590570449829
Validation loss = 0.311654269695282
Validation loss = 0.32137855887413025
Validation loss = 0.3142898380756378
Validation loss = 0.3165798783302307
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31790509819984436
Validation loss = 0.32628291845321655
Validation loss = 0.32233017683029175
Validation loss = 0.3214547336101532
Validation loss = 0.3246810734272003
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3169439435005188
Validation loss = 0.3216669261455536
Validation loss = 0.3183453679084778
Validation loss = 0.3205551505088806
Validation loss = 0.32420745491981506
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31550362706184387
Validation loss = 0.32548972964286804
Validation loss = 0.3237137496471405
Validation loss = 0.3204583525657654
Validation loss = 0.3223908543586731
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -138     |
| Iteration     | 15       |
| MaximumReturn | -75      |
| MinimumReturn | -170     |
| TotalSamples  | 28322    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3210025727748871
Validation loss = 0.3161154091358185
Validation loss = 0.32055923342704773
Validation loss = 0.3248869776725769
Validation loss = 0.32161474227905273
Validation loss = 0.32097339630126953
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31445327401161194
Validation loss = 0.3090421259403229
Validation loss = 0.3103031814098358
Validation loss = 0.30732399225234985
Validation loss = 0.31584033370018005
Validation loss = 0.3100435137748718
Validation loss = 0.31038591265678406
Validation loss = 0.31250739097595215
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3101675808429718
Validation loss = 0.316692054271698
Validation loss = 0.3125995993614197
Validation loss = 0.31863516569137573
Validation loss = 0.3160270154476166
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.312296062707901
Validation loss = 0.312132865190506
Validation loss = 0.3164978623390198
Validation loss = 0.31192830204963684
Validation loss = 0.31570887565612793
Validation loss = 0.31643494963645935
Validation loss = 0.31555935740470886
Validation loss = 0.31781792640686035
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3188638687133789
Validation loss = 0.3140265643596649
Validation loss = 0.31205111742019653
Validation loss = 0.31521075963974
Validation loss = 0.3251672089099884
Validation loss = 0.3167596161365509
Validation loss = 0.32568201422691345
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -131     |
| Iteration     | 16       |
| MaximumReturn | -0.599   |
| MinimumReturn | -166     |
| TotalSamples  | 29988    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30917513370513916
Validation loss = 0.31629154086112976
Validation loss = 0.31351882219314575
Validation loss = 0.3141135573387146
Validation loss = 0.31596750020980835
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30702894926071167
Validation loss = 0.30515727400779724
Validation loss = 0.3065244257450104
Validation loss = 0.3050527274608612
Validation loss = 0.3107823133468628
Validation loss = 0.30852213501930237
Validation loss = 0.31060001254081726
Validation loss = 0.3105936646461487
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31252628564834595
Validation loss = 0.3115655779838562
Validation loss = 0.31112363934516907
Validation loss = 0.31733766198158264
Validation loss = 0.31067925691604614
Validation loss = 0.3126257359981537
Validation loss = 0.31562483310699463
Validation loss = 0.3132888376712799
Validation loss = 0.3154739737510681
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31299981474876404
Validation loss = 0.30782073736190796
Validation loss = 0.3153272271156311
Validation loss = 0.31354135274887085
Validation loss = 0.31354770064353943
Validation loss = 0.3131270706653595
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31682106852531433
Validation loss = 0.31077253818511963
Validation loss = 0.3161013424396515
Validation loss = 0.31323307752609253
Validation loss = 0.31539350748062134
Validation loss = 0.3164130449295044
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -133     |
| Iteration     | 17       |
| MaximumReturn | -1.91    |
| MinimumReturn | -175     |
| TotalSamples  | 31654    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31198808550834656
Validation loss = 0.3186838924884796
Validation loss = 0.3149867653846741
Validation loss = 0.31756845116615295
Validation loss = 0.3198249936103821
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3079528212547302
Validation loss = 0.30850544571876526
Validation loss = 0.30861058831214905
Validation loss = 0.31474924087524414
Validation loss = 0.3095466196537018
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31544923782348633
Validation loss = 0.3136567771434784
Validation loss = 0.31379616260528564
Validation loss = 0.31871020793914795
Validation loss = 0.32218775153160095
Validation loss = 0.3192993700504303
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31471994519233704
Validation loss = 0.30949530005455017
Validation loss = 0.31048262119293213
Validation loss = 0.31347379088401794
Validation loss = 0.31210431456565857
Validation loss = 0.3168547451496124
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3095689117908478
Validation loss = 0.31884872913360596
Validation loss = 0.3161006271839142
Validation loss = 0.31783026456832886
Validation loss = 0.3151111602783203
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -155     |
| Iteration     | 18       |
| MaximumReturn | -91.1    |
| MinimumReturn | -176     |
| TotalSamples  | 33320    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3202664852142334
Validation loss = 0.3144124150276184
Validation loss = 0.3168981671333313
Validation loss = 0.30980974435806274
Validation loss = 0.31339067220687866
Validation loss = 0.3102245032787323
Validation loss = 0.31770893931388855
Validation loss = 0.3152009844779968
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30384331941604614
Validation loss = 0.305203378200531
Validation loss = 0.30720973014831543
Validation loss = 0.30810070037841797
Validation loss = 0.3075041174888611
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30995064973831177
Validation loss = 0.3125472664833069
Validation loss = 0.3134607672691345
Validation loss = 0.30850911140441895
Validation loss = 0.3155611455440521
Validation loss = 0.3185092508792877
Validation loss = 0.31389564275741577
Validation loss = 0.3180811405181885
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30707499384880066
Validation loss = 0.31091880798339844
Validation loss = 0.30789846181869507
Validation loss = 0.3081173896789551
Validation loss = 0.3138298988342285
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3087656497955322
Validation loss = 0.31108972430229187
Validation loss = 0.31311821937561035
Validation loss = 0.31124669313430786
Validation loss = 0.31253623962402344
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -127     |
| Iteration     | 19       |
| MaximumReturn | -0.767   |
| MinimumReturn | -167     |
| TotalSamples  | 34986    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31021854281425476
Validation loss = 0.3154396712779999
Validation loss = 0.31150510907173157
Validation loss = 0.31434759497642517
Validation loss = 0.3163182735443115
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30288881063461304
Validation loss = 0.3058159649372101
Validation loss = 0.30328866839408875
Validation loss = 0.3067161738872528
Validation loss = 0.31064632534980774
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3106023669242859
Validation loss = 0.31309065222740173
Validation loss = 0.3174073398113251
Validation loss = 0.31370219588279724
Validation loss = 0.3167663812637329
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3084076941013336
Validation loss = 0.3051851689815521
Validation loss = 0.3066428601741791
Validation loss = 0.3092957139015198
Validation loss = 0.31132012605667114
Validation loss = 0.3088446259498596
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3096044063568115
Validation loss = 0.30841416120529175
Validation loss = 0.3093096911907196
Validation loss = 0.30971840023994446
Validation loss = 0.30847159028053284
Validation loss = 0.3087224066257477
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -149     |
| Iteration     | 20       |
| MaximumReturn | -107     |
| MinimumReturn | -173     |
| TotalSamples  | 36652    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3074701130390167
Validation loss = 0.31048837304115295
Validation loss = 0.30975452065467834
Validation loss = 0.3101186156272888
Validation loss = 0.31387969851493835
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30543801188468933
Validation loss = 0.30374059081077576
Validation loss = 0.3039988577365875
Validation loss = 0.30234721302986145
Validation loss = 0.30263686180114746
Validation loss = 0.30575457215309143
Validation loss = 0.30451640486717224
Validation loss = 0.30634230375289917
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30871620774269104
Validation loss = 0.3089543879032135
Validation loss = 0.3104115128517151
Validation loss = 0.3190044164657593
Validation loss = 0.31296634674072266
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3043297529220581
Validation loss = 0.3074027895927429
Validation loss = 0.31046661734580994
Validation loss = 0.30958688259124756
Validation loss = 0.3058604300022125
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31224098801612854
Validation loss = 0.30977267026901245
Validation loss = 0.30957168340682983
Validation loss = 0.3071627914905548
Validation loss = 0.3117169737815857
Validation loss = 0.31010955572128296
Validation loss = 0.3102053105831146
Validation loss = 0.3103224039077759
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -149     |
| Iteration     | 21       |
| MaximumReturn | -22.2    |
| MinimumReturn | -181     |
| TotalSamples  | 38318    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30282899737358093
Validation loss = 0.30564209818840027
Validation loss = 0.3056272566318512
Validation loss = 0.30643072724342346
Validation loss = 0.3080616891384125
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3015057444572449
Validation loss = 0.3027477264404297
Validation loss = 0.30275192856788635
Validation loss = 0.30409863591194153
Validation loss = 0.3059310019016266
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30272728204727173
Validation loss = 0.3060952126979828
Validation loss = 0.30609846115112305
Validation loss = 0.3089120388031006
Validation loss = 0.30780884623527527
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30133792757987976
Validation loss = 0.30350205302238464
Validation loss = 0.30456337332725525
Validation loss = 0.30239391326904297
Validation loss = 0.303062379360199
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3074920177459717
Validation loss = 0.306147962808609
Validation loss = 0.30541467666625977
Validation loss = 0.3068288266658783
Validation loss = 0.3080529272556305
Validation loss = 0.31326213479042053
Validation loss = 0.30628249049186707
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -147     |
| Iteration     | 22       |
| MaximumReturn | -61.2    |
| MinimumReturn | -176     |
| TotalSamples  | 39984    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30084121227264404
Validation loss = 0.3047282099723816
Validation loss = 0.30292633175849915
Validation loss = 0.3013008236885071
Validation loss = 0.30832815170288086
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29837745428085327
Validation loss = 0.2987460494041443
Validation loss = 0.30087369680404663
Validation loss = 0.2986198961734772
Validation loss = 0.30479520559310913
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30637770891189575
Validation loss = 0.3053514361381531
Validation loss = 0.303621768951416
Validation loss = 0.30512535572052
Validation loss = 0.30654019117355347
Validation loss = 0.3078656792640686
Validation loss = 0.3155456781387329
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2996588349342346
Validation loss = 0.30155709385871887
Validation loss = 0.30554819107055664
Validation loss = 0.3017421364784241
Validation loss = 0.29993242025375366
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30458229780197144
Validation loss = 0.3052104711532593
Validation loss = 0.3050994277000427
Validation loss = 0.30451110005378723
Validation loss = 0.30825552344322205
Validation loss = 0.30872419476509094
Validation loss = 0.30838900804519653
Validation loss = 0.3045148551464081
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -138     |
| Iteration     | 23       |
| MaximumReturn | -39.7    |
| MinimumReturn | -170     |
| TotalSamples  | 41650    |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30346211791038513
Validation loss = 0.3066522777080536
Validation loss = 0.30571281909942627
Validation loss = 0.30706483125686646
Validation loss = 0.30423736572265625
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29844847321510315
Validation loss = 0.3063192665576935
Validation loss = 0.3013889491558075
Validation loss = 0.3075908422470093
Validation loss = 0.30250316858291626
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31000596284866333
Validation loss = 0.3092098832130432
Validation loss = 0.3044809401035309
Validation loss = 0.3051941990852356
Validation loss = 0.30544671416282654
Validation loss = 0.3070600628852844
Validation loss = 0.3080427646636963
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29806166887283325
Validation loss = 0.2995288074016571
Validation loss = 0.30537110567092896
Validation loss = 0.303533673286438
Validation loss = 0.30517712235450745
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30599379539489746
Validation loss = 0.3069903254508972
Validation loss = 0.30851051211357117
Validation loss = 0.3115656077861786
Validation loss = 0.31170886754989624
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -138     |
| Iteration     | 24       |
| MaximumReturn | -1.47    |
| MinimumReturn | -171     |
| TotalSamples  | 43316    |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3035274147987366
Validation loss = 0.3041246831417084
Validation loss = 0.3036009967327118
Validation loss = 0.30231714248657227
Validation loss = 0.30376380681991577
Validation loss = 0.30657362937927246
Validation loss = 0.3060298264026642
Validation loss = 0.3076528012752533
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2977255880832672
Validation loss = 0.30043038725852966
Validation loss = 0.2999083995819092
Validation loss = 0.3009360730648041
Validation loss = 0.2998875379562378
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3050328493118286
Validation loss = 0.3069170117378235
Validation loss = 0.3062411844730377
Validation loss = 0.3088923394680023
Validation loss = 0.3115067481994629
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2973109781742096
Validation loss = 0.2983807921409607
Validation loss = 0.30188897252082825
Validation loss = 0.3016182780265808
Validation loss = 0.30508849024772644
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3050023317337036
Validation loss = 0.3084261417388916
Validation loss = 0.3126744329929352
Validation loss = 0.30696094036102295
Validation loss = 0.30853161215782166
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -146     |
| Iteration     | 25       |
| MaximumReturn | -43.1    |
| MinimumReturn | -168     |
| TotalSamples  | 44982    |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30090025067329407
Validation loss = 0.301817923784256
Validation loss = 0.3014479875564575
Validation loss = 0.3004246652126312
Validation loss = 0.30371469259262085
Validation loss = 0.3024919927120209
Validation loss = 0.3023383319377899
Validation loss = 0.3031434118747711
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29496094584465027
Validation loss = 0.29765585064888
Validation loss = 0.29563531279563904
Validation loss = 0.2950611412525177
Validation loss = 0.2950914204120636
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29860007762908936
Validation loss = 0.29860857129096985
Validation loss = 0.3009648025035858
Validation loss = 0.3020685911178589
Validation loss = 0.3002474009990692
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2952705919742584
Validation loss = 0.3055855631828308
Validation loss = 0.2960999011993408
Validation loss = 0.30046409368515015
Validation loss = 0.3021217882633209
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.296438068151474
Validation loss = 0.3037738502025604
Validation loss = 0.30644023418426514
Validation loss = 0.30202576518058777
Validation loss = 0.3084089457988739
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -114     |
| Iteration     | 26       |
| MaximumReturn | -7.67    |
| MinimumReturn | -155     |
| TotalSamples  | 46648    |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3053152859210968
Validation loss = 0.30438467860221863
Validation loss = 0.3047176003456116
Validation loss = 0.30628126859664917
Validation loss = 0.3078034818172455
Validation loss = 0.307560533285141
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29704368114471436
Validation loss = 0.30212610960006714
Validation loss = 0.30010753870010376
Validation loss = 0.29933834075927734
Validation loss = 0.2967616319656372
Validation loss = 0.30063655972480774
Validation loss = 0.3013366162776947
Validation loss = 0.3009044826030731
Validation loss = 0.30162346363067627
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30332282185554504
Validation loss = 0.30073046684265137
Validation loss = 0.3046637177467346
Validation loss = 0.30458754301071167
Validation loss = 0.30577632784843445
Validation loss = 0.30307891964912415
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30145126581192017
Validation loss = 0.2993221580982208
Validation loss = 0.29974034428596497
Validation loss = 0.3017171621322632
Validation loss = 0.29962626099586487
Validation loss = 0.30315926671028137
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.302837610244751
Validation loss = 0.3035459816455841
Validation loss = 0.3042830228805542
Validation loss = 0.305886447429657
Validation loss = 0.3106320798397064
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -143     |
| Iteration     | 27       |
| MaximumReturn | -0.246   |
| MinimumReturn | -168     |
| TotalSamples  | 48314    |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3018258810043335
Validation loss = 0.3083474338054657
Validation loss = 0.3120467960834503
Validation loss = 0.3058486580848694
Validation loss = 0.30814477801322937
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2985381484031677
Validation loss = 0.2996087074279785
Validation loss = 0.3021129071712494
Validation loss = 0.30126848816871643
Validation loss = 0.30607932806015015
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30698612332344055
Validation loss = 0.303140789270401
Validation loss = 0.3024456799030304
Validation loss = 0.3054322898387909
Validation loss = 0.3078306019306183
Validation loss = 0.30518731474876404
Validation loss = 0.31310826539993286
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29726555943489075
Validation loss = 0.29755112528800964
Validation loss = 0.2993558943271637
Validation loss = 0.2999196946620941
Validation loss = 0.30170369148254395
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30651772022247314
Validation loss = 0.3037443161010742
Validation loss = 0.30430981516838074
Validation loss = 0.3121010363101959
Validation loss = 0.3057589828968048
Validation loss = 0.30374372005462646
Validation loss = 0.3102867305278778
Validation loss = 0.30961868166923523
Validation loss = 0.3096497356891632
Validation loss = 0.30689728260040283
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
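
Each on-policy rollout block consists of 25 paths of exactly 100 environment steps (2,500 timesteps per iteration), with total_timesteps printed before each path is collected. The sketch below reproduces that collection pattern against a stand-in environment; ToyEnv and the random policy are placeholders, not the project's interfaces.

    import numpy as np


    class ToyEnv:
        """Stand-in environment with a 4-dim observation and 1-dim action."""
        def reset(self):
            return np.zeros(4)

        def step(self, action):
            next_obs = np.random.normal(size=4)
            reward = -float(np.abs(action).sum())
            return next_obs, reward


    def collect_rollouts(env, policy, num_paths=25, horizon=100):
        paths, total_timesteps = [], 0
        for i in range(num_paths):
            print(f"Path {i} | total_timesteps {total_timesteps}.")
            obs, traj = env.reset(), {"obs": [], "act": [], "rew": []}
            for _ in range(horizon):
                act = policy(obs)
                next_obs, rew = env.step(act)
                traj["obs"].append(obs)
                traj["act"].append(act)
                traj["rew"].append(rew)
                obs = next_obs
            total_timesteps += horizon
            paths.append(traj)
        return paths


    paths = collect_rollouts(ToyEnv(), policy=lambda obs: np.random.uniform(-1, 1, size=1))
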
Updating normalization.
Done updating normalization.
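
"Updating normalization." refreshes the input statistics used by the dynamics models after each new batch of data. One plausible implementation is an incremental running mean/variance update, sketched below with Welford's algorithm; this is an assumption about the mechanism, not the project's actual code, which may simply recompute the statistics from scratch.

    import numpy as np


    class RunningNormalizer:
        def __init__(self, dim):
            self.count = 0
            self.mean = np.zeros(dim)
            self.m2 = np.zeros(dim)          # running sum of squared deviations

        def update(self, batch):
            for x in batch:                  # Welford's online update, one sample at a time
                self.count += 1
                delta = x - self.mean
                self.mean += delta / self.count
                self.m2 += delta * (x - self.mean)

        def normalize(self, x, eps=1e-8):
            std = np.sqrt(self.m2 / max(self.count, 1)) + eps
            return (x - self.mean) / std


    norm = RunningNormalizer(dim=4)
    norm.update(np.random.normal(loc=2.0, scale=3.0, size=(2500, 4)))
    print(norm.normalize(np.ones(4)))
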
----------------------------
| AverageReturn | -131     |
| Iteration     | 28       |
| MaximumReturn | -48.7    |
| MinimumReturn | -165     |
| TotalSamples  | 49980    |
----------------------------
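
Because the summary tables repeat with the same layout every iteration, the learning curve can be recovered directly from this log. The snippet below extracts (Iteration, AverageReturn) pairs from text in that format; the embedded sample string stands in for reading the real log file.

    import re

    sample = """\
    | AverageReturn | -143     |
    | Iteration     | 27       |
    | AverageReturn | -131     |
    | Iteration     | 28       |
    """

    ret_re = re.compile(r"\|\s*AverageReturn\s*\|\s*(-?[\d.]+)")
    iter_re = re.compile(r"\|\s*Iteration\s*\|\s*(-?\d+)")

    curve, avg = [], None
    for line in sample.splitlines():   # swap sample.splitlines() for open(path) on a real log
        m = ret_re.search(line)
        if m:
            avg = float(m.group(1))
            continue
        m = iter_re.search(line)
        if m and avg is not None:
            curve.append((int(m.group(1)), avg))
            avg = None

    print(curve)   # [(27, -143.0), (28, -131.0)]
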
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3029645085334778
Validation loss = 0.30538856983184814
Validation loss = 0.30648383498191833
Validation loss = 0.31088685989379883
Validation loss = 0.3075319528579712
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29798901081085205
Validation loss = 0.30007168650627136
Validation loss = 0.2994030714035034
Validation loss = 0.3052835762500763
Validation loss = 0.3006359040737152
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30124324560165405
Validation loss = 0.3047609329223633
Validation loss = 0.3057478070259094
Validation loss = 0.304991215467453
Validation loss = 0.3067449629306793
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2996439039707184
Validation loss = 0.30152684450149536
Validation loss = 0.30249327421188354
Validation loss = 0.300523579120636
Validation loss = 0.29944878816604614
Validation loss = 0.3044935464859009
Validation loss = 0.30330443382263184
Validation loss = 0.30245718359947205
Validation loss = 0.3026593327522278
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30820339918136597
Validation loss = 0.3081991672515869
Validation loss = 0.3080054223537445
Validation loss = 0.30746734142303467
Validation loss = 0.3108288645744324
Validation loss = 0.3084072172641754
Validation loss = 0.3086096942424774
Validation loss = 0.3110656440258026
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -140     |
| Iteration     | 29       |
| MaximumReturn | -4.51    |
| MinimumReturn | -164     |
| TotalSamples  | 51646    |
----------------------------
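
Taken together, every "itr #N" block follows the same outer loop: refit the dynamics ensemble on all data collected so far, train the policy with TRPO against the learned model, collect fresh on-policy rollouts with the updated policy, then update normalization and print the summary table. The stub sketch below shows only that control flow; all function names are invented for illustration.

    def fit_dynamics_ensemble(dataset):
        # Stub for the "Fitting dynamics." ... "Done fitting dynamics." block.
        print("Fitting dynamics.")
        print("Done fitting dynamics.")


    def train_policy_trpo(num_trpo_iters=20):
        # Stub for the TRPO phase with its 20 inner sampling/update iterations.
        print("Training policy using TRPO.")
        for _ in range(num_trpo_iters):
            pass
        print("Done training policy.")


    def collect_onpolicy_rollouts(num_paths=25):
        # Stub for the 25-path on-policy rollout block.
        print("Generating on-policy rollouts.")
        return [None] * num_paths


    def outer_loop(start_itr, end_itr):
        dataset = []
        for itr in range(start_itr, end_itr):
            print(f"itr #{itr} | ")
            fit_dynamics_ensemble(dataset)
            train_policy_trpo()
            dataset += collect_onpolicy_rollouts()
            # ...then update normalization and print the summary table


    outer_loop(28, 30)
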
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30993083119392395
Validation loss = 0.306515097618103
Validation loss = 0.31180068850517273
Validation loss = 0.30732694268226624
Validation loss = 0.3080514073371887
Validation loss = 0.30762311816215515
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30015885829925537
Validation loss = 0.3024102449417114
Validation loss = 0.30026108026504517
Validation loss = 0.3043796420097351
Validation loss = 0.30409109592437744
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.306308776140213
Validation loss = 0.30864471197128296
Validation loss = 0.30811232328414917
Validation loss = 0.31049424409866333
Validation loss = 0.30911991000175476
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30260220170021057
Validation loss = 0.3029363751411438
Validation loss = 0.30385100841522217
Validation loss = 0.3038633465766907
Validation loss = 0.30700838565826416
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3125942051410675
Validation loss = 0.31311747431755066
Validation loss = 0.3118540644645691
Validation loss = 0.31170645356178284
Validation loss = 0.3153766691684723
Validation loss = 0.313917338848114
Validation loss = 0.31495875120162964
Validation loss = 0.3147568702697754
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -140     |
| Iteration     | 30       |
| MaximumReturn | -32.9    |
| MinimumReturn | -160     |
| TotalSamples  | 53312    |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3065304160118103
Validation loss = 0.30782073736190796
Validation loss = 0.30708470940589905
Validation loss = 0.31007564067840576
Validation loss = 0.3091462254524231
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30136698484420776
Validation loss = 0.2982630431652069
Validation loss = 0.30245643854141235
Validation loss = 0.30599096417427063
Validation loss = 0.30078843235969543
Validation loss = 0.3038180470466614
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.308112770318985
Validation loss = 0.3049902319908142
Validation loss = 0.3111198842525482
Validation loss = 0.3063156306743622
Validation loss = 0.3080398440361023
Validation loss = 0.3081636130809784
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3093475103378296
Validation loss = 0.3019384741783142
Validation loss = 0.30127057433128357
Validation loss = 0.30629682540893555
Validation loss = 0.30509263277053833
Validation loss = 0.3037945628166199
Validation loss = 0.3064562976360321
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31252798438072205
Validation loss = 0.3129791021347046
Validation loss = 0.3126475512981415
Validation loss = 0.31438079476356506
Validation loss = 0.31571564078330994
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -131     |
| Iteration     | 31       |
| MaximumReturn | -22      |
| MinimumReturn | -156     |
| TotalSamples  | 54978    |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3054506182670593
Validation loss = 0.3061787784099579
Validation loss = 0.3088524341583252
Validation loss = 0.30991166830062866
Validation loss = 0.3110383450984955
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3021165728569031
Validation loss = 0.30126649141311646
Validation loss = 0.30432677268981934
Validation loss = 0.3038184642791748
Validation loss = 0.3043177127838135
Validation loss = 0.3026455342769623
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3059139549732208
Validation loss = 0.307576984167099
Validation loss = 0.3056837022304535
Validation loss = 0.31307438015937805
Validation loss = 0.3055047392845154
Validation loss = 0.3087071180343628
Validation loss = 0.30947402119636536
Validation loss = 0.3101176917552948
Validation loss = 0.3089444637298584
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3086339831352234
Validation loss = 0.3030781149864197
Validation loss = 0.3057098686695099
Validation loss = 0.3046543598175049
Validation loss = 0.30568814277648926
Validation loss = 0.30337294936180115
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31192517280578613
Validation loss = 0.3095656633377075
Validation loss = 0.31048014760017395
Validation loss = 0.3151523768901825
Validation loss = 0.31218740344047546
Validation loss = 0.31775543093681335
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -161     |
| Iteration     | 32       |
| MaximumReturn | -140     |
| MinimumReturn | -170     |
| TotalSamples  | 56644    |
----------------------------
itr #33 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3076231777667999
Validation loss = 0.3081417381763458
Validation loss = 0.3112945556640625
Validation loss = 0.3100244402885437
Validation loss = 0.3105959892272949
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3023720681667328
Validation loss = 0.3007057309150696
Validation loss = 0.3019890785217285
Validation loss = 0.3039872646331787
Validation loss = 0.3018401563167572
Validation loss = 0.30641883611679077
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30907177925109863
Validation loss = 0.3095718026161194
Validation loss = 0.3080752193927765
Validation loss = 0.31186389923095703
Validation loss = 0.3102598190307617
Validation loss = 0.31381815671920776
Validation loss = 0.3098941743373871
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30773091316223145
Validation loss = 0.3063944876194
Validation loss = 0.3031437397003174
Validation loss = 0.3059597313404083
Validation loss = 0.30812788009643555
Validation loss = 0.3089047968387604
Validation loss = 0.310347318649292
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31573179364204407
Validation loss = 0.31232163310050964
Validation loss = 0.3145109713077545
Validation loss = 0.31579089164733887
Validation loss = 0.3130165934562683
Validation loss = 0.3133421838283539
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -134     |
| Iteration     | 33       |
| MaximumReturn | -69.2    |
| MinimumReturn | -166     |
| TotalSamples  | 58310    |
----------------------------
itr #34 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3073689043521881
Validation loss = 0.3081347644329071
Validation loss = 0.307657927274704
Validation loss = 0.3100694417953491
Validation loss = 0.30927571654319763
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30099454522132874
Validation loss = 0.30095627903938293
Validation loss = 0.30457356572151184
Validation loss = 0.3041485846042633
Validation loss = 0.30383312702178955
Validation loss = 0.30488139390945435
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3103715181350708
Validation loss = 0.30915647745132446
Validation loss = 0.31122756004333496
Validation loss = 0.3113732933998108
Validation loss = 0.31287530064582825
Validation loss = 0.31375858187675476
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30711913108825684
Validation loss = 0.3045918345451355
Validation loss = 0.30401337146759033
Validation loss = 0.30918046832084656
Validation loss = 0.30790480971336365
Validation loss = 0.30874118208885193
Validation loss = 0.30961647629737854
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.311959832906723
Validation loss = 0.31140729784965515
Validation loss = 0.31430479884147644
Validation loss = 0.3135639727115631
Validation loss = 0.31450560688972473
Validation loss = 0.314637154340744
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -151     |
| Iteration     | 34       |
| MaximumReturn | -109     |
| MinimumReturn | -169     |
| TotalSamples  | 59976    |
----------------------------
itr #35 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3062998056411743
Validation loss = 0.3087417185306549
Validation loss = 0.3073473870754242
Validation loss = 0.308380663394928
Validation loss = 0.30823880434036255
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30461585521698
Validation loss = 0.3075612485408783
Validation loss = 0.3056907653808594
Validation loss = 0.303458034992218
Validation loss = 0.30793681740760803
Validation loss = 0.3070659935474396
Validation loss = 0.3064548075199127
Validation loss = 0.310356467962265
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3091012239456177
Validation loss = 0.31201639771461487
Validation loss = 0.31163090467453003
Validation loss = 0.3103545606136322
Validation loss = 0.3083561956882477
Validation loss = 0.3144860863685608
Validation loss = 0.31312116980552673
Validation loss = 0.31439924240112305
Validation loss = 0.3162684440612793
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3070034980773926
Validation loss = 0.306092232465744
Validation loss = 0.3087632954120636
Validation loss = 0.3065641522407532
Validation loss = 0.3104132413864136
Validation loss = 0.30858907103538513
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31224220991134644
Validation loss = 0.3134547472000122
Validation loss = 0.31584444642066956
Validation loss = 0.3146200180053711
Validation loss = 0.3119625747203827
Validation loss = 0.31387636065483093
Validation loss = 0.313900887966156
Validation loss = 0.31381046772003174
Validation loss = 0.32072970271110535
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -151     |
| Iteration     | 35       |
| MaximumReturn | -4.52    |
| MinimumReturn | -184     |
| TotalSamples  | 61642    |
----------------------------
itr #36 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30584946274757385
Validation loss = 0.3063775599002838
Validation loss = 0.3072385787963867
Validation loss = 0.3069426417350769
Validation loss = 0.30831629037857056
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30499497056007385
Validation loss = 0.3068960905075073
Validation loss = 0.3083205819129944
Validation loss = 0.30791032314300537
Validation loss = 0.30890944600105286
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30959099531173706
Validation loss = 0.3102624714374542
Validation loss = 0.31455862522125244
Validation loss = 0.31668707728385925
Validation loss = 0.31340405344963074
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31104394793510437
Validation loss = 0.30793967843055725
Validation loss = 0.3078062832355499
Validation loss = 0.30831003189086914
Validation loss = 0.312041699886322
Validation loss = 0.31086665391921997
Validation loss = 0.31614914536476135
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3171582520008087
Validation loss = 0.3149145841598511
Validation loss = 0.3152664005756378
Validation loss = 0.3175334334373474
Validation loss = 0.31953999400138855
Validation loss = 0.31771034002304077
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -138     |
| Iteration     | 36       |
| MaximumReturn | -67.9    |
| MinimumReturn | -181     |
| TotalSamples  | 63308    |
----------------------------
itr #37 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31116998195648193
Validation loss = 0.31334546208381653
Validation loss = 0.31157830357551575
Validation loss = 0.3083914518356323
Validation loss = 0.31029659509658813
Validation loss = 0.3101535439491272
Validation loss = 0.31297898292541504
Validation loss = 0.31367242336273193
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3067169487476349
Validation loss = 0.30658847093582153
Validation loss = 0.3090461492538452
Validation loss = 0.314200758934021
Validation loss = 0.3125928044319153
Validation loss = 0.3125613033771515
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3134891092777252
Validation loss = 0.3109425902366638
Validation loss = 0.31697311997413635
Validation loss = 0.3157974183559418
Validation loss = 0.31669822335243225
Validation loss = 0.31810471415519714
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3110492527484894
Validation loss = 0.31498482823371887
Validation loss = 0.3141610026359558
Validation loss = 0.3105800151824951
Validation loss = 0.3126436173915863
Validation loss = 0.3134956657886505
Validation loss = 0.31318479776382446
Validation loss = 0.31523439288139343
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31772318482398987
Validation loss = 0.3173562288284302
Validation loss = 0.3191796541213989
Validation loss = 0.32489654421806335
Validation loss = 0.32026317715644836
Validation loss = 0.31925541162490845
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -158     |
| Iteration     | 37       |
| MaximumReturn | -115     |
| MinimumReturn | -178     |
| TotalSamples  | 64974    |
----------------------------
itr #38 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3108662962913513
Validation loss = 0.3122117519378662
Validation loss = 0.3152029514312744
Validation loss = 0.3136349618434906
Validation loss = 0.31451255083084106
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31038010120391846
Validation loss = 0.31003835797309875
Validation loss = 0.3091850280761719
Validation loss = 0.3104373812675476
Validation loss = 0.31420326232910156
Validation loss = 0.313789427280426
Validation loss = 0.31395700573921204
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3180645704269409
Validation loss = 0.3169806897640228
Validation loss = 0.3146895170211792
Validation loss = 0.3169892430305481
Validation loss = 0.3201825022697449
Validation loss = 0.3183230459690094
Validation loss = 0.3192017674446106
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3155655264854431
Validation loss = 0.31505289673805237
Validation loss = 0.31983643770217896
Validation loss = 0.31340640783309937
Validation loss = 0.313750296831131
Validation loss = 0.31844642758369446
Validation loss = 0.322905033826828
Validation loss = 0.3177376389503479
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3158569633960724
Validation loss = 0.3197079300880432
Validation loss = 0.3192363381385803
Validation loss = 0.3218288719654083
Validation loss = 0.3218684494495392
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -144     |
| Iteration     | 38       |
| MaximumReturn | -67.3    |
| MinimumReturn | -185     |
| TotalSamples  | 66640    |
----------------------------
itr #39 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31043338775634766
Validation loss = 0.3106846809387207
Validation loss = 0.31013453006744385
Validation loss = 0.3116736114025116
Validation loss = 0.3106471598148346
Validation loss = 0.31260964274406433
Validation loss = 0.31234121322631836
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3107677400112152
Validation loss = 0.3100115954875946
Validation loss = 0.3131183087825775
Validation loss = 0.3150957226753235
Validation loss = 0.3132486641407013
Validation loss = 0.3118666410446167
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3152488172054291
Validation loss = 0.3177558481693268
Validation loss = 0.3212723433971405
Validation loss = 0.31795474886894226
Validation loss = 0.3211134076118469
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31679776310920715
Validation loss = 0.3141433894634247
Validation loss = 0.31809356808662415
Validation loss = 0.31630128622055054
Validation loss = 0.3156399726867676
Validation loss = 0.3171713054180145
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3207300007343292
Validation loss = 0.32317957282066345
Validation loss = 0.31958678364753723
Validation loss = 0.3209693729877472
Validation loss = 0.3224300146102905
Validation loss = 0.3241901099681854
Validation loss = 0.3216254413127899
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -111     |
| Iteration     | 39       |
| MaximumReturn | -27.4    |
| MinimumReturn | -178     |
| TotalSamples  | 68306    |
----------------------------
itr #40 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3181941509246826
Validation loss = 0.3111454248428345
Validation loss = 0.31405648589134216
Validation loss = 0.31536608934402466
Validation loss = 0.31466811895370483
Validation loss = 0.3152356445789337
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3125801980495453
Validation loss = 0.3127136826515198
Validation loss = 0.3140254318714142
Validation loss = 0.31562358140945435
Validation loss = 0.3154285252094269
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31689491868019104
Validation loss = 0.3176032304763794
Validation loss = 0.3186054825782776
Validation loss = 0.32607612013816833
Validation loss = 0.32255658507347107
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3157147765159607
Validation loss = 0.3176286518573761
Validation loss = 0.31632721424102783
Validation loss = 0.31953513622283936
Validation loss = 0.3207157850265503
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3204629421234131
Validation loss = 0.3200550079345703
Validation loss = 0.3215610086917877
Validation loss = 0.32362261414527893
Validation loss = 0.32374003529548645
Validation loss = 0.32525789737701416
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -153     |
| Iteration     | 40       |
| MaximumReturn | -57.8    |
| MinimumReturn | -184     |
| TotalSamples  | 69972    |
----------------------------
itr #41 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3129415214061737
Validation loss = 0.31528961658477783
Validation loss = 0.3146686851978302
Validation loss = 0.31912729144096375
Validation loss = 0.31613242626190186
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3166937232017517
Validation loss = 0.3118056654930115
Validation loss = 0.31097033619880676
Validation loss = 0.3153392970561981
Validation loss = 0.3148961365222931
Validation loss = 0.3155723214149475
Validation loss = 0.31983646750450134
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3190038502216339
Validation loss = 0.3203459680080414
Validation loss = 0.32204365730285645
Validation loss = 0.32267633080482483
Validation loss = 0.32081180810928345
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31922224164009094
Validation loss = 0.32682904601097107
Validation loss = 0.31884464621543884
Validation loss = 0.31892427802085876
Validation loss = 0.32226380705833435
Validation loss = 0.32349202036857605
Validation loss = 0.32341673970222473
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3219583034515381
Validation loss = 0.32516807317733765
Validation loss = 0.32541680335998535
Validation loss = 0.32491374015808105
Validation loss = 0.3251749277114868
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -158     |
| Iteration     | 41       |
| MaximumReturn | -69.1    |
| MinimumReturn | -186     |
| TotalSamples  | 71638    |
----------------------------
itr #42 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3154206871986389
Validation loss = 0.31867045164108276
Validation loss = 0.316586434841156
Validation loss = 0.31640395522117615
Validation loss = 0.3179956376552582
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.316096693277359
Validation loss = 0.3165261745452881
Validation loss = 0.3179212510585785
Validation loss = 0.3162219822406769
Validation loss = 0.3177497386932373
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32176369428634644
Validation loss = 0.32269078493118286
Validation loss = 0.32591915130615234
Validation loss = 0.3275856375694275
Validation loss = 0.32523366808891296
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32026180624961853
Validation loss = 0.32163846492767334
Validation loss = 0.3244767189025879
Validation loss = 0.32486358284950256
Validation loss = 0.32227596640586853
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.325003057718277
Validation loss = 0.3258163332939148
Validation loss = 0.32522282004356384
Validation loss = 0.325222909450531
Validation loss = 0.32946059107780457
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -114     |
| Iteration     | 42       |
| MaximumReturn | -32.5    |
| MinimumReturn | -167     |
| TotalSamples  | 73304    |
----------------------------
itr #43 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3179994821548462
Validation loss = 0.31936147809028625
Validation loss = 0.3202419877052307
Validation loss = 0.3178718686103821
Validation loss = 0.3212934136390686
Validation loss = 0.3211210370063782
Validation loss = 0.3238782286643982
Validation loss = 0.32213735580444336
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31783077120780945
Validation loss = 0.3177795112133026
Validation loss = 0.3206663429737091
Validation loss = 0.318229615688324
Validation loss = 0.3180743455886841
Validation loss = 0.32076165080070496
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3245319724082947
Validation loss = 0.3208581507205963
Validation loss = 0.3240516483783722
Validation loss = 0.3239452540874481
Validation loss = 0.32641592621803284
Validation loss = 0.3256089389324188
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3225051164627075
Validation loss = 0.3222670555114746
Validation loss = 0.32295411825180054
Validation loss = 0.3237062394618988
Validation loss = 0.3250085115432739
Validation loss = 0.3236730992794037
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32768386602401733
Validation loss = 0.3258278965950012
Validation loss = 0.325527548789978
Validation loss = 0.32864564657211304
Validation loss = 0.32746538519859314
Validation loss = 0.32854345440864563
Validation loss = 0.3307400941848755
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -145     |
| Iteration     | 43       |
| MaximumReturn | -77      |
| MinimumReturn | -178     |
| TotalSamples  | 74970    |
----------------------------
itr #44 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31982994079589844
Validation loss = 0.3179672360420227
Validation loss = 0.31987491250038147
Validation loss = 0.3191629946231842
Validation loss = 0.32271870970726013
Validation loss = 0.3213028609752655
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31679898500442505
Validation loss = 0.316455215215683
Validation loss = 0.3212060034275055
Validation loss = 0.3187400996685028
Validation loss = 0.3202589452266693
Validation loss = 0.3222663402557373
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3188075125217438
Validation loss = 0.32508134841918945
Validation loss = 0.3241826891899109
Validation loss = 0.32711607217788696
Validation loss = 0.3246394991874695
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3208847939968109
Validation loss = 0.3243086338043213
Validation loss = 0.3254740834236145
Validation loss = 0.3236158490180969
Validation loss = 0.3274906575679779
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32520824670791626
Validation loss = 0.32997506856918335
Validation loss = 0.3273184597492218
Validation loss = 0.3254906237125397
Validation loss = 0.33001431822776794
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -52.8    |
| Iteration     | 44       |
| MaximumReturn | -2.42    |
| MinimumReturn | -135     |
| TotalSamples  | 76636    |
----------------------------
itr #45 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32475924491882324
Validation loss = 0.31919124722480774
Validation loss = 0.32148486375808716
Validation loss = 0.3213043212890625
Validation loss = 0.32281050086021423
Validation loss = 0.32101523876190186
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3177703619003296
Validation loss = 0.32126355171203613
Validation loss = 0.3197318911552429
Validation loss = 0.31928926706314087
Validation loss = 0.3218822777271271
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3280365467071533
Validation loss = 0.3206484019756317
Validation loss = 0.32531294226646423
Validation loss = 0.32774969935417175
Validation loss = 0.3250616788864136
Validation loss = 0.3286440074443817
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32315847277641296
Validation loss = 0.3230668008327484
Validation loss = 0.3250499963760376
Validation loss = 0.32494786381721497
Validation loss = 0.327863872051239
Validation loss = 0.32625725865364075
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32862311601638794
Validation loss = 0.3273620009422302
Validation loss = 0.3291417360305786
Validation loss = 0.330040842294693
Validation loss = 0.33203956484794617
Validation loss = 0.3309554159641266
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -148     |
| Iteration     | 45       |
| MaximumReturn | -103     |
| MinimumReturn | -170     |
| TotalSamples  | 78302    |
----------------------------
itr #46 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31990644335746765
Validation loss = 0.3209696412086487
Validation loss = 0.32353049516677856
Validation loss = 0.32260239124298096
Validation loss = 0.3245760500431061
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.317245215177536
Validation loss = 0.3177346885204315
Validation loss = 0.3211829960346222
Validation loss = 0.321093887090683
Validation loss = 0.3183431327342987
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3239586055278778
Validation loss = 0.32763907313346863
Validation loss = 0.32824182510375977
Validation loss = 0.32883718609809875
Validation loss = 0.3320586383342743
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.320330947637558
Validation loss = 0.3250385820865631
Validation loss = 0.323821485042572
Validation loss = 0.324724406003952
Validation loss = 0.32523611187934875
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3277488946914673
Validation loss = 0.3316045105457306
Validation loss = 0.3321285843849182
Validation loss = 0.332267701625824
Validation loss = 0.33347833156585693
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -95.1    |
| Iteration     | 46       |
| MaximumReturn | -3.27    |
| MinimumReturn | -170     |
| TotalSamples  | 79968    |
----------------------------
itr #47 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32265570759773254
Validation loss = 0.32250335812568665
Validation loss = 0.32963910698890686
Validation loss = 0.3302992880344391
Validation loss = 0.32563167810440063
Validation loss = 0.324558824300766
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32114559412002563
Validation loss = 0.32345688343048096
Validation loss = 0.32485419511795044
Validation loss = 0.3199882507324219
Validation loss = 0.3243260085582733
Validation loss = 0.3242782950401306
Validation loss = 0.3218538761138916
Validation loss = 0.32492271065711975
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3295063078403473
Validation loss = 0.32637959718704224
Validation loss = 0.3256224989891052
Validation loss = 0.32623207569122314
Validation loss = 0.3294099271297455
Validation loss = 0.330242782831192
Validation loss = 0.3332429528236389
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3233277201652527
Validation loss = 0.32565435767173767
Validation loss = 0.323732852935791
Validation loss = 0.32655149698257446
Validation loss = 0.32975417375564575
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3288343846797943
Validation loss = 0.3304615318775177
Validation loss = 0.3286078870296478
Validation loss = 0.33493107557296753
Validation loss = 0.3340494930744171
Validation loss = 0.3344438076019287
Validation loss = 0.33546990156173706
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -175     |
| Iteration     | 47       |
| MaximumReturn | -154     |
| MinimumReturn | -189     |
| TotalSamples  | 81634    |
----------------------------
itr #48 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32640352845191956
Validation loss = 0.32818761467933655
Validation loss = 0.32923761010169983
Validation loss = 0.32653042674064636
Validation loss = 0.32831254601478577
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3204559087753296
Validation loss = 0.3239380121231079
Validation loss = 0.32449251413345337
Validation loss = 0.325877845287323
Validation loss = 0.32491302490234375
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3294820189476013
Validation loss = 0.3316022753715515
Validation loss = 0.33340293169021606
Validation loss = 0.33136147260665894
Validation loss = 0.3340187668800354
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3234800696372986
Validation loss = 0.326190710067749
Validation loss = 0.33107051253318787
Validation loss = 0.32664334774017334
Validation loss = 0.3293588161468506
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33136439323425293
Validation loss = 0.3351801931858063
Validation loss = 0.33716341853141785
Validation loss = 0.3375755250453949
Validation loss = 0.3373236656188965
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -168     |
| Iteration     | 48       |
| MaximumReturn | -144     |
| MinimumReturn | -178     |
| TotalSamples  | 83300    |
----------------------------
itr #49 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32453322410583496
Validation loss = 0.32659047842025757
Validation loss = 0.3283963203430176
Validation loss = 0.3276548981666565
Validation loss = 0.3350693881511688
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32440805435180664
Validation loss = 0.32652682065963745
Validation loss = 0.3260987102985382
Validation loss = 0.32621610164642334
Validation loss = 0.327297568321228
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3334132730960846
Validation loss = 0.33451753854751587
Validation loss = 0.33472320437431335
Validation loss = 0.3361492156982422
Validation loss = 0.33452969789505005
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32499271631240845
Validation loss = 0.3302062153816223
Validation loss = 0.3330274224281311
Validation loss = 0.33173322677612305
Validation loss = 0.3307456970214844
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33281034231185913
Validation loss = 0.3326554298400879
Validation loss = 0.3364853858947754
Validation loss = 0.33642688393592834
Validation loss = 0.3386615514755249
Validation loss = 0.3376149535179138
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -119     |
| Iteration     | 49       |
| MaximumReturn | -78.6    |
| MinimumReturn | -147     |
| TotalSamples  | 84966    |
----------------------------
itr #50 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3301265835762024
Validation loss = 0.3299853801727295
Validation loss = 0.32721805572509766
Validation loss = 0.32865989208221436
Validation loss = 0.32908716797828674
Validation loss = 0.33491942286491394
Validation loss = 0.329242080450058
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32792335748672485
Validation loss = 0.32458916306495667
Validation loss = 0.32734525203704834
Validation loss = 0.32757142186164856
Validation loss = 0.32663774490356445
Validation loss = 0.32840466499328613
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33334070444107056
Validation loss = 0.3327329158782959
Validation loss = 0.33447393774986267
Validation loss = 0.33815547823905945
Validation loss = 0.3348473012447357
Validation loss = 0.3366914689540863
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3257725238800049
Validation loss = 0.3297950327396393
Validation loss = 0.32875609397888184
Validation loss = 0.32851195335388184
Validation loss = 0.33056923747062683
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3356882631778717
Validation loss = 0.33855965733528137
Validation loss = 0.3358328342437744
Validation loss = 0.3378630578517914
Validation loss = 0.33438703417778015
Validation loss = 0.3404920697212219
Validation loss = 0.33932390809059143
Validation loss = 0.33974123001098633
Validation loss = 0.33910441398620605
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -65.7    |
| Iteration     | 50       |
| MaximumReturn | -0.742   |
| MinimumReturn | -161     |
| TotalSamples  | 86632    |
----------------------------
itr #51 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32790082693099976
Validation loss = 0.3282983899116516
Validation loss = 0.3340945541858673
Validation loss = 0.32966750860214233
Validation loss = 0.3360033333301544
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32763001322746277
Validation loss = 0.3252595067024231
Validation loss = 0.32615232467651367
Validation loss = 0.3275105357170105
Validation loss = 0.3290548324584961
Validation loss = 0.3287618160247803
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3311765193939209
Validation loss = 0.3329070508480072
Validation loss = 0.3362255096435547
Validation loss = 0.33597299456596375
Validation loss = 0.33567941188812256
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3303936719894409
Validation loss = 0.3291112184524536
Validation loss = 0.32831352949142456
Validation loss = 0.3288533687591553
Validation loss = 0.3322981894016266
Validation loss = 0.32964980602264404
Validation loss = 0.3315230906009674
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34317559003829956
Validation loss = 0.337316632270813
Validation loss = 0.3399718701839447
Validation loss = 0.33976075053215027
Validation loss = 0.3416788876056671
Validation loss = 0.34179678559303284
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -170     |
| Iteration     | 51       |
| MaximumReturn | -101     |
| MinimumReturn | -186     |
| TotalSamples  | 88298    |
----------------------------
itr #52 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3297598958015442
Validation loss = 0.33093273639678955
Validation loss = 0.3331413269042969
Validation loss = 0.33264636993408203
Validation loss = 0.3327583074569702
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32782402634620667
Validation loss = 0.3283413350582123
Validation loss = 0.32971805334091187
Validation loss = 0.32743898034095764
Validation loss = 0.3299218416213989
Validation loss = 0.3321276307106018
Validation loss = 0.32901662588119507
Validation loss = 0.33073991537094116
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3346538841724396
Validation loss = 0.3354523479938507
Validation loss = 0.3363657295703888
Validation loss = 0.340016633272171
Validation loss = 0.33822867274284363
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33040568232536316
Validation loss = 0.3317885994911194
Validation loss = 0.3315565884113312
Validation loss = 0.33498355746269226
Validation loss = 0.33385810256004333
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3389562666416168
Validation loss = 0.3408650755882263
Validation loss = 0.3421885669231415
Validation loss = 0.34204843640327454
Validation loss = 0.3423907458782196
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -183     |
| Iteration     | 52       |
| MaximumReturn | -165     |
| MinimumReturn | -192     |
| TotalSamples  | 89964    |
----------------------------
itr #53 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33394521474838257
Validation loss = 0.3300773799419403
Validation loss = 0.3312082886695862
Validation loss = 0.3335990011692047
Validation loss = 0.33484870195388794
Validation loss = 0.33770301938056946
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32827508449554443
Validation loss = 0.3316633105278015
Validation loss = 0.3339962363243103
Validation loss = 0.3324471712112427
Validation loss = 0.3290453553199768
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3370383679866791
Validation loss = 0.3350270986557007
Validation loss = 0.3388746678829193
Validation loss = 0.3367280662059784
Validation loss = 0.33866024017333984
Validation loss = 0.3415409326553345
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33011382818222046
Validation loss = 0.331264853477478
Validation loss = 0.33165037631988525
Validation loss = 0.3351953625679016
Validation loss = 0.33379608392715454
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33710137009620667
Validation loss = 0.34059497714042664
Validation loss = 0.34370386600494385
Validation loss = 0.33995321393013
Validation loss = 0.34343576431274414
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -186     |
| Iteration     | 53       |
| MaximumReturn | -153     |
| MinimumReturn | -201     |
| TotalSamples  | 91630    |
----------------------------
itr #54 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33050084114074707
Validation loss = 0.3322692811489105
Validation loss = 0.333339661359787
Validation loss = 0.3378262519836426
Validation loss = 0.33327436447143555
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33207377791404724
Validation loss = 0.3318752646446228
Validation loss = 0.3310540020465851
Validation loss = 0.3330739438533783
Validation loss = 0.33382976055145264
Validation loss = 0.33461788296699524
Validation loss = 0.33347970247268677
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33795061707496643
Validation loss = 0.33827686309814453
Validation loss = 0.3392234444618225
Validation loss = 0.33954474329948425
Validation loss = 0.3431406319141388
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33514833450317383
Validation loss = 0.33397766947746277
Validation loss = 0.33570417761802673
Validation loss = 0.3345484733581543
Validation loss = 0.3377848267555237
Validation loss = 0.33335161209106445
Validation loss = 0.3357481360435486
Validation loss = 0.3380931615829468
Validation loss = 0.33655044436454773
Validation loss = 0.33911800384521484
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3422464430332184
Validation loss = 0.3390984535217285
Validation loss = 0.34065982699394226
Validation loss = 0.34213048219680786
Validation loss = 0.342357337474823
Validation loss = 0.3419434130191803
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -191     |
| Iteration     | 54       |
| MaximumReturn | -169     |
| MinimumReturn | -203     |
| TotalSamples  | 93296    |
----------------------------
itr #55 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33279016613960266
Validation loss = 0.33362454175949097
Validation loss = 0.33304110169410706
Validation loss = 0.33539921045303345
Validation loss = 0.33559125661849976
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33087047934532166
Validation loss = 0.33401918411254883
Validation loss = 0.3318338990211487
Validation loss = 0.3374156057834625
Validation loss = 0.3312072455883026
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3388005495071411
Validation loss = 0.3420872092247009
Validation loss = 0.34080299735069275
Validation loss = 0.3409746289253235
Validation loss = 0.342060387134552
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33228808641433716
Validation loss = 0.3360176384449005
Validation loss = 0.33773332834243774
Validation loss = 0.33659303188323975
Validation loss = 0.3376644551753998
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3418680727481842
Validation loss = 0.34225112199783325
Validation loss = 0.34051626920700073
Validation loss = 0.34284070134162903
Validation loss = 0.33959534764289856
Validation loss = 0.3434601426124573
Validation loss = 0.34556737542152405
Validation loss = 0.34336569905281067
Validation loss = 0.3441113531589508
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -192     |
| Iteration     | 55       |
| MaximumReturn | -171     |
| MinimumReturn | -208     |
| TotalSamples  | 94962    |
----------------------------
itr #56 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.332294762134552
Validation loss = 0.33573073148727417
Validation loss = 0.33554112911224365
Validation loss = 0.335808128118515
Validation loss = 0.3352963626384735
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3303259611129761
Validation loss = 0.33087703585624695
Validation loss = 0.3346874415874481
Validation loss = 0.3341321647167206
Validation loss = 0.3372761309146881
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33829668164253235
Validation loss = 0.33889976143836975
Validation loss = 0.33665531873703003
Validation loss = 0.3416493237018585
Validation loss = 0.3423141837120056
Validation loss = 0.34077632427215576
Validation loss = 0.3435947597026825
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33365482091903687
Validation loss = 0.3351774513721466
Validation loss = 0.33880048990249634
Validation loss = 0.3349186182022095
Validation loss = 0.3392378091812134
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33939602971076965
Validation loss = 0.3454875648021698
Validation loss = 0.3436526358127594
Validation loss = 0.3443257212638855
Validation loss = 0.34302854537963867
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -180     |
| Iteration     | 56       |
| MaximumReturn | -124     |
| MinimumReturn | -199     |
| TotalSamples  | 96628    |
----------------------------
itr #57 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33144277334213257
Validation loss = 0.335274338722229
Validation loss = 0.33641231060028076
Validation loss = 0.333985298871994
Validation loss = 0.34030580520629883
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3318484127521515
Validation loss = 0.3351798951625824
Validation loss = 0.3360520899295807
Validation loss = 0.33393946290016174
Validation loss = 0.3347630798816681
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3424054682254791
Validation loss = 0.34150269627571106
Validation loss = 0.34687867760658264
Validation loss = 0.3419433534145355
Validation loss = 0.3419903814792633
Validation loss = 0.3474966287612915
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33861300349235535
Validation loss = 0.33812835812568665
Validation loss = 0.33647215366363525
Validation loss = 0.3412395417690277
Validation loss = 0.3399285078048706
Validation loss = 0.34052935242652893
Validation loss = 0.338738352060318
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3458012640476227
Validation loss = 0.3457557260990143
Validation loss = 0.3463754653930664
Validation loss = 0.34224462509155273
Validation loss = 0.35119619965553284
Validation loss = 0.3461640179157257
Validation loss = 0.3460775911808014
Validation loss = 0.34733378887176514
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -190     |
| Iteration     | 57       |
| MaximumReturn | -166     |
| MinimumReturn | -206     |
| TotalSamples  | 98294    |
----------------------------
itr #58 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33492812514305115
Validation loss = 0.33750787377357483
Validation loss = 0.33960238099098206
Validation loss = 0.33624067902565
Validation loss = 0.34117811918258667
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3337234556674957
Validation loss = 0.33690962195396423
Validation loss = 0.3379650413990021
Validation loss = 0.33824780583381653
Validation loss = 0.3362458646297455
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3457116186618805
Validation loss = 0.346621036529541
Validation loss = 0.3459882438182831
Validation loss = 0.3423347473144531
Validation loss = 0.34633418917655945
Validation loss = 0.3476589322090149
Validation loss = 0.34665346145629883
Validation loss = 0.3490442931652069
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34145915508270264
Validation loss = 0.3387453258037567
Validation loss = 0.34134596586227417
Validation loss = 0.3406214416027069
Validation loss = 0.3409467041492462
Validation loss = 0.343417763710022
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34637683629989624
Validation loss = 0.3493330180644989
Validation loss = 0.34846505522727966
Validation loss = 0.3461456298828125
Validation loss = 0.34715476632118225
Validation loss = 0.34955617785453796
Validation loss = 0.35163477063179016
Validation loss = 0.35176515579223633
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -182     |
| Iteration     | 58       |
| MaximumReturn | -151     |
| MinimumReturn | -195     |
| TotalSamples  | 99960    |
----------------------------
itr #59 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33352214097976685
Validation loss = 0.33807867765426636
Validation loss = 0.33549463748931885
Validation loss = 0.3344833254814148
Validation loss = 0.33790919184684753
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33451351523399353
Validation loss = 0.33123016357421875
Validation loss = 0.33181628584861755
Validation loss = 0.33147522807121277
Validation loss = 0.33599427342414856
Validation loss = 0.3375850319862366
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34127676486968994
Validation loss = 0.3444575369358063
Validation loss = 0.3423125445842743
Validation loss = 0.34317776560783386
Validation loss = 0.34294041991233826
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33710217475891113
Validation loss = 0.33623337745666504
Validation loss = 0.3366973102092743
Validation loss = 0.3384464681148529
Validation loss = 0.34028422832489014
Validation loss = 0.3438495397567749
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3457789123058319
Validation loss = 0.3418753445148468
Validation loss = 0.3447553217411041
Validation loss = 0.3464287221431732
Validation loss = 0.3451799750328064
Validation loss = 0.34674981236457825
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -185     |
| Iteration     | 59       |
| MaximumReturn | -111     |
| MinimumReturn | -202     |
| TotalSamples  | 101626   |
----------------------------
itr #60 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3332868218421936
Validation loss = 0.3311541676521301
Validation loss = 0.331917405128479
Validation loss = 0.3304482400417328
Validation loss = 0.3320521414279938
Validation loss = 0.33184272050857544
Validation loss = 0.3330729007720947
Validation loss = 0.3361380696296692
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32848426699638367
Validation loss = 0.33163338899612427
Validation loss = 0.33261990547180176
Validation loss = 0.3323598802089691
Validation loss = 0.3343880772590637
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33838459849357605
Validation loss = 0.3385571539402008
Validation loss = 0.3426835238933563
Validation loss = 0.33975911140441895
Validation loss = 0.34208089113235474
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33415862917900085
Validation loss = 0.33354872465133667
Validation loss = 0.3339952826499939
Validation loss = 0.334250807762146
Validation loss = 0.33734050393104553
Validation loss = 0.3360309302806854
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34015074372291565
Validation loss = 0.34349241852760315
Validation loss = 0.34536170959472656
Validation loss = 0.3441126346588135
Validation loss = 0.34286221861839294
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -181     |
| Iteration     | 60       |
| MaximumReturn | -155     |
| MinimumReturn | -199     |
| TotalSamples  | 103292   |
----------------------------
itr #61 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33113348484039307
Validation loss = 0.3290296494960785
Validation loss = 0.33364877104759216
Validation loss = 0.3344202935695648
Validation loss = 0.33342212438583374
Validation loss = 0.33427849411964417
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3295481204986572
Validation loss = 0.3303893506526947
Validation loss = 0.3320844769477844
Validation loss = 0.3293379247188568
Validation loss = 0.3303346037864685
Validation loss = 0.3333475887775421
Validation loss = 0.33054763078689575
Validation loss = 0.33184903860092163
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3374517560005188
Validation loss = 0.3401538133621216
Validation loss = 0.33915063738822937
Validation loss = 0.3385859727859497
Validation loss = 0.3392605483531952
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33224618434906006
Validation loss = 0.33407503366470337
Validation loss = 0.33651721477508545
Validation loss = 0.3329777717590332
Validation loss = 0.3341568112373352
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3412371277809143
Validation loss = 0.3398180603981018
Validation loss = 0.33946019411087036
Validation loss = 0.34070634841918945
Validation loss = 0.3406313359737396
Validation loss = 0.34333527088165283
Validation loss = 0.34455639123916626
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -154     |
| Iteration     | 61       |
| MaximumReturn | -104     |
| MinimumReturn | -179     |
| TotalSamples  | 104958   |
----------------------------
itr #62 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.28899112343788147
Validation loss = 0.29713451862335205
Validation loss = 0.30088287591934204
Validation loss = 0.29694902896881104
Validation loss = 0.2988966107368469
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29165318608283997
Validation loss = 0.2952156662940979
Validation loss = 0.29800307750701904
Validation loss = 0.2967190444469452
Validation loss = 0.2976336181163788
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29738524556159973
Validation loss = 0.30064985156059265
Validation loss = 0.30942246317863464
Validation loss = 0.30570828914642334
Validation loss = 0.3057613670825958
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29164406657218933
Validation loss = 0.2977962791919708
Validation loss = 0.30231496691703796
Validation loss = 0.3018099069595337
Validation loss = 0.301055371761322
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30361074209213257
Validation loss = 0.30617648363113403
Validation loss = 0.3071517050266266
Validation loss = 0.30740439891815186
Validation loss = 0.30636072158813477
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -106     |
| Iteration     | 62       |
| MaximumReturn | -10.5    |
| MinimumReturn | -164     |
| TotalSamples  | 106624   |
----------------------------
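
The summary table printed at the end of each iteration can be reproduced from the per-path returns of the 25 on-policy rollouts. A small sketch of that bookkeeping is below; it assumes each path dict carries a "rewards" list and that total_samples is a running count of environment steps, which is an assumption about how the counter is maintained.

    import numpy as np

    def log_returns(itr, paths, total_samples):
        """Reproduce the per-iteration summary table from per-path returns."""
        returns = [float(np.sum(p["rewards"])) for p in paths]
        total_samples += sum(len(p["rewards"]) for p in paths)
        print("----------------------------")
        print(f"| AverageReturn | {np.mean(returns):<8.4g} |")
        print(f"| Iteration     | {itr:<8d} |")
        print(f"| MaximumReturn | {np.max(returns):<8.4g} |")
        print(f"| MinimumReturn | {np.min(returns):<8.4g} |")
        print(f"| TotalSamples  | {total_samples:<8d} |")
        print("----------------------------")
        return total_samples
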
itr #63 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.28057122230529785
Validation loss = 0.27796006202697754
Validation loss = 0.28010401129722595
Validation loss = 0.27892786264419556
Validation loss = 0.2816539704799652
Validation loss = 0.2792099714279175
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2784232795238495
Validation loss = 0.27572354674339294
Validation loss = 0.2792900800704956
Validation loss = 0.2794804573059082
Validation loss = 0.27894720435142517
Validation loss = 0.2787855267524719
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2839982211589813
Validation loss = 0.2817525565624237
Validation loss = 0.28390419483184814
Validation loss = 0.2833150029182434
Validation loss = 0.28492575883865356
Validation loss = 0.2863157093524933
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28014248609542847
Validation loss = 0.2792881429195404
Validation loss = 0.2785165309906006
Validation loss = 0.2837354838848114
Validation loss = 0.28093430399894714
Validation loss = 0.2836376428604126
Validation loss = 0.28216832876205444
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28481554985046387
Validation loss = 0.2871154546737671
Validation loss = 0.28765448927879333
Validation loss = 0.2880575358867645
Validation loss = 0.2876715660095215
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -75.2    |
| Iteration     | 63       |
| MaximumReturn | -11.5    |
| MinimumReturn | -177     |
| TotalSamples  | 108290   |
----------------------------
itr #64 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.27678224444389343
Validation loss = 0.27358362078666687
Validation loss = 0.27731049060821533
Validation loss = 0.2752525210380554
Validation loss = 0.2748570740222931
Validation loss = 0.2761996388435364
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2730439007282257
Validation loss = 0.2737919092178345
Validation loss = 0.27418288588523865
Validation loss = 0.27331337332725525
Validation loss = 0.2750568091869354
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.27901166677474976
Validation loss = 0.27890175580978394
Validation loss = 0.27642810344696045
Validation loss = 0.2796880304813385
Validation loss = 0.27932941913604736
Validation loss = 0.27960655093193054
Validation loss = 0.2795417010784149
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.27597346901893616
Validation loss = 0.27485397458076477
Validation loss = 0.27496257424354553
Validation loss = 0.27582409977912903
Validation loss = 0.27452918887138367
Validation loss = 0.276974618434906
Validation loss = 0.277770072221756
Validation loss = 0.27744030952453613
Validation loss = 0.27736860513687134
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2823825180530548
Validation loss = 0.28171420097351074
Validation loss = 0.27846240997314453
Validation loss = 0.2800101041793823
Validation loss = 0.28157898783683777
Validation loss = 0.27908727526664734
Validation loss = 0.2829633951187134
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -75.9    |
| Iteration     | 64       |
| MaximumReturn | -2.63    |
| MinimumReturn | -130     |
| TotalSamples  | 109956   |
----------------------------
itr #65 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.268019437789917
Validation loss = 0.26685991883277893
Validation loss = 0.26614123582839966
Validation loss = 0.27007704973220825
Validation loss = 0.2674299478530884
Validation loss = 0.2739587724208832
Validation loss = 0.27003157138824463
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.265255331993103
Validation loss = 0.2656388580799103
Validation loss = 0.26852673292160034
Validation loss = 0.26696115732192993
Validation loss = 0.26710426807403564
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.27447018027305603
Validation loss = 0.27126553654670715
Validation loss = 0.2737531065940857
Validation loss = 0.27192753553390503
Validation loss = 0.272294819355011
Validation loss = 0.27497074007987976
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.267792284488678
Validation loss = 0.2678357660770416
Validation loss = 0.2689747214317322
Validation loss = 0.26991787552833557
Validation loss = 0.2700059413909912
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2793700098991394
Validation loss = 0.27589428424835205
Validation loss = 0.2733248174190521
Validation loss = 0.27341559529304504
Validation loss = 0.2748155891895294
Validation loss = 0.2756868898868561
Validation loss = 0.2759130299091339
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -46.9    |
| Iteration     | 65       |
| MaximumReturn | -1.05    |
| MinimumReturn | -106     |
| TotalSamples  | 111622   |
----------------------------
itr #66 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2655006945133209
Validation loss = 0.2629176080226898
Validation loss = 0.26667720079421997
Validation loss = 0.26918596029281616
Validation loss = 0.267953097820282
Validation loss = 0.2680347263813019
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2635082006454468
Validation loss = 0.2616085112094879
Validation loss = 0.2631385922431946
Validation loss = 0.2643425762653351
Validation loss = 0.2627886235713959
Validation loss = 0.2661871016025543
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2696835696697235
Validation loss = 0.2690199613571167
Validation loss = 0.27014434337615967
Validation loss = 0.27036765217781067
Validation loss = 0.2734749913215637
Validation loss = 0.27099913358688354
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2692645490169525
Validation loss = 0.26489347219467163
Validation loss = 0.2649060785770416
Validation loss = 0.2654852867126465
Validation loss = 0.2665705382823944
Validation loss = 0.26606485247612
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27032801508903503
Validation loss = 0.2712448239326477
Validation loss = 0.27035507559776306
Validation loss = 0.2736816704273224
Validation loss = 0.27246224880218506
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -103     |
| Iteration     | 66       |
| MaximumReturn | -73.1    |
| MinimumReturn | -130     |
| TotalSamples  | 113288   |
----------------------------
itr #67 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.26752349734306335
Validation loss = 0.2655808627605438
Validation loss = 0.2708381712436676
Validation loss = 0.2693091928958893
Validation loss = 0.2700969874858856
Validation loss = 0.27103477716445923
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.26525193452835083
Validation loss = 0.26676321029663086
Validation loss = 0.2669944763183594
Validation loss = 0.26632001996040344
Validation loss = 0.2677880525588989
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.271517276763916
Validation loss = 0.26797252893447876
Validation loss = 0.271406888961792
Validation loss = 0.2721911370754242
Validation loss = 0.27255722880363464
Validation loss = 0.2720809578895569
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2687578499317169
Validation loss = 0.26746782660484314
Validation loss = 0.26845452189445496
Validation loss = 0.27118855714797974
Validation loss = 0.2710144519805908
Validation loss = 0.2696639597415924
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27357491850852966
Validation loss = 0.27241453528404236
Validation loss = 0.27804869413375854
Validation loss = 0.2757028043270111
Validation loss = 0.2756795585155487
Validation loss = 0.2757892310619354
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -121     |
| Iteration     | 67       |
| MaximumReturn | -81      |
| MinimumReturn | -151     |
| TotalSamples  | 114954   |
----------------------------
itr #68 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.27166715264320374
Validation loss = 0.2723180651664734
Validation loss = 0.2717907130718231
Validation loss = 0.2724207639694214
Validation loss = 0.2724120616912842
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.269382506608963
Validation loss = 0.26943132281303406
Validation loss = 0.26813074946403503
Validation loss = 0.26892125606536865
Validation loss = 0.27166464924812317
Validation loss = 0.2686542272567749
Validation loss = 0.2697080373764038
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2735476493835449
Validation loss = 0.2737311124801636
Validation loss = 0.2751864790916443
Validation loss = 0.27457454800605774
Validation loss = 0.27578485012054443
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.27128922939300537
Validation loss = 0.26759228110313416
Validation loss = 0.26922881603240967
Validation loss = 0.27044668793678284
Validation loss = 0.2712722718715668
Validation loss = 0.2729765474796295
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27489814162254333
Validation loss = 0.2760733664035797
Validation loss = 0.2758665978908539
Validation loss = 0.27768635749816895
Validation loss = 0.2768270671367645
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -135     |
| Iteration     | 68       |
| MaximumReturn | -77.7    |
| MinimumReturn | -175     |
| TotalSamples  | 116620   |
----------------------------
itr #69 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2727462351322174
Validation loss = 0.2708948850631714
Validation loss = 0.27242833375930786
Validation loss = 0.27348941564559937
Validation loss = 0.27353277802467346
Validation loss = 0.2745774984359741
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2720208466053009
Validation loss = 0.27204030752182007
Validation loss = 0.2702180743217468
Validation loss = 0.2708977162837982
Validation loss = 0.2709866464138031
Validation loss = 0.2728361487388611
Validation loss = 0.27401116490364075
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.27710098028182983
Validation loss = 0.2753587067127228
Validation loss = 0.2770457863807678
Validation loss = 0.27813246846199036
Validation loss = 0.2767854332923889
Validation loss = 0.27580440044403076
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.27882352471351624
Validation loss = 0.27286988496780396
Validation loss = 0.27403634786605835
Validation loss = 0.2726174592971802
Validation loss = 0.273173451423645
Validation loss = 0.2749628722667694
Validation loss = 0.27292487025260925
Validation loss = 0.27606266736984253
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28200575709342957
Validation loss = 0.27612748742103577
Validation loss = 0.27821481227874756
Validation loss = 0.27884218096733093
Validation loss = 0.27888816595077515
Validation loss = 0.27959051728248596
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -88.4    |
| Iteration     | 69       |
| MaximumReturn | -48.9    |
| MinimumReturn | -127     |
| TotalSamples  | 118286   |
----------------------------
itr #70 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.27323415875434875
Validation loss = 0.2754074037075043
Validation loss = 0.2745305597782135
Validation loss = 0.27619215846061707
Validation loss = 0.2727922201156616
Validation loss = 0.2737654745578766
Validation loss = 0.27615445852279663
Validation loss = 0.27551406621932983
Validation loss = 0.27661630511283875
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2732091546058655
Validation loss = 0.2706502079963684
Validation loss = 0.2733265161514282
Validation loss = 0.27453964948654175
Validation loss = 0.2740422785282135
Validation loss = 0.27321794629096985
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.280187726020813
Validation loss = 0.278848260641098
Validation loss = 0.2762458026409149
Validation loss = 0.2804386019706726
Validation loss = 0.2798191010951996
Validation loss = 0.28190797567367554
Validation loss = 0.2817348837852478
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.275257408618927
Validation loss = 0.27682730555534363
Validation loss = 0.27638107538223267
Validation loss = 0.2755580246448517
Validation loss = 0.2757515013217926
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2822035253047943
Validation loss = 0.2811332643032074
Validation loss = 0.27895426750183105
Validation loss = 0.28052762150764465
Validation loss = 0.27904725074768066
Validation loss = 0.28103968501091003
Validation loss = 0.2831364870071411
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -74.9    |
| Iteration     | 70       |
| MaximumReturn | -1.11    |
| MinimumReturn | -141     |
| TotalSamples  | 119952   |
----------------------------
itr #71 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2788006067276001
Validation loss = 0.2774006426334381
Validation loss = 0.27936235070228577
Validation loss = 0.27788296341896057
Validation loss = 0.2782543897628784
Validation loss = 0.2803756892681122
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.27444732189178467
Validation loss = 0.2713226079940796
Validation loss = 0.27549442648887634
Validation loss = 0.27796274423599243
Validation loss = 0.27501407265663147
Validation loss = 0.27661874890327454
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2813645601272583
Validation loss = 0.28228121995925903
Validation loss = 0.28364303708076477
Validation loss = 0.2804452180862427
Validation loss = 0.28244656324386597
Validation loss = 0.28065380454063416
Validation loss = 0.2827294170856476
Validation loss = 0.28451573848724365
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2763056457042694
Validation loss = 0.2780444324016571
Validation loss = 0.2781808078289032
Validation loss = 0.27598756551742554
Validation loss = 0.2798815369606018
Validation loss = 0.27683451771736145
Validation loss = 0.27765709161758423
Validation loss = 0.28159964084625244
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28089627623558044
Validation loss = 0.28130754828453064
Validation loss = 0.2811517119407654
Validation loss = 0.28660470247268677
Validation loss = 0.2830347716808319
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -83.7    |
| Iteration     | 71       |
| MaximumReturn | -0.354   |
| MinimumReturn | -157     |
| TotalSamples  | 121618   |
----------------------------
itr #72 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2766389846801758
Validation loss = 0.27742767333984375
Validation loss = 0.2768051326274872
Validation loss = 0.27698764204978943
Validation loss = 0.2788572311401367
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2774701416492462
Validation loss = 0.2737176716327667
Validation loss = 0.27622491121292114
Validation loss = 0.2756270170211792
Validation loss = 0.27393007278442383
Validation loss = 0.2770799994468689
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2829255759716034
Validation loss = 0.28148019313812256
Validation loss = 0.2824951410293579
Validation loss = 0.28498998284339905
Validation loss = 0.2855152189731598
Validation loss = 0.28360259532928467
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2798055410385132
Validation loss = 0.27648019790649414
Validation loss = 0.2782413363456726
Validation loss = 0.2803319990634918
Validation loss = 0.27876684069633484
Validation loss = 0.2811545729637146
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2835274636745453
Validation loss = 0.28387823700904846
Validation loss = 0.2846338152885437
Validation loss = 0.2842943072319031
Validation loss = 0.28319892287254333
Validation loss = 0.2844330668449402
Validation loss = 0.286770224571228
Validation loss = 0.28656336665153503
Validation loss = 0.28702566027641296
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -63.5    |
| Iteration     | 72       |
| MaximumReturn | -0.38    |
| MinimumReturn | -164     |
| TotalSamples  | 123284   |
----------------------------
itr #73 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2802515923976898
Validation loss = 0.2780095338821411
Validation loss = 0.28058192133903503
Validation loss = 0.27874866127967834
Validation loss = 0.28195324540138245
Validation loss = 0.28386953473091125
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2784009575843811
Validation loss = 0.27742883563041687
Validation loss = 0.277558296918869
Validation loss = 0.2757374048233032
Validation loss = 0.28000304102897644
Validation loss = 0.27827516198158264
Validation loss = 0.2786829471588135
Validation loss = 0.2807598114013672
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.28729701042175293
Validation loss = 0.28454285860061646
Validation loss = 0.28437286615371704
Validation loss = 0.28712013363838196
Validation loss = 0.2875261902809143
Validation loss = 0.2877444624900818
Validation loss = 0.28826674818992615
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28252291679382324
Validation loss = 0.28377264738082886
Validation loss = 0.28015732765197754
Validation loss = 0.28156572580337524
Validation loss = 0.2819575369358063
Validation loss = 0.2835139334201813
Validation loss = 0.2846094071865082
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2878718674182892
Validation loss = 0.2857159674167633
Validation loss = 0.2880183756351471
Validation loss = 0.288578599691391
Validation loss = 0.2892117202281952
Validation loss = 0.2860724925994873
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -103     |
| Iteration     | 73       |
| MaximumReturn | -36.9    |
| MinimumReturn | -154     |
| TotalSamples  | 124950   |
----------------------------
itr #74 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2836401164531708
Validation loss = 0.2832367718219757
Validation loss = 0.28004148602485657
Validation loss = 0.2829204797744751
Validation loss = 0.28286778926849365
Validation loss = 0.2857421040534973
Validation loss = 0.28406408429145813
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2810911238193512
Validation loss = 0.2814321219921112
Validation loss = 0.28169193863868713
Validation loss = 0.2799551784992218
Validation loss = 0.28265175223350525
Validation loss = 0.28417572379112244
Validation loss = 0.2828867435455322
Validation loss = 0.28245440125465393
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2884531021118164
Validation loss = 0.285936564207077
Validation loss = 0.28887587785720825
Validation loss = 0.28885048627853394
Validation loss = 0.2888178527355194
Validation loss = 0.2890644967556
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28486278653144836
Validation loss = 0.28451773524284363
Validation loss = 0.28538036346435547
Validation loss = 0.2855623960494995
Validation loss = 0.28659337759017944
Validation loss = 0.2858528792858124
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.28894251585006714
Validation loss = 0.28674495220184326
Validation loss = 0.2898341715335846
Validation loss = 0.2904817461967468
Validation loss = 0.28957822918891907
Validation loss = 0.2893873453140259
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -82.1    |
| Iteration     | 74       |
| MaximumReturn | -5.59    |
| MinimumReturn | -157     |
| TotalSamples  | 126616   |
----------------------------
itr #75 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2865612506866455
Validation loss = 0.2850637137889862
Validation loss = 0.28480395674705505
Validation loss = 0.28459787368774414
Validation loss = 0.2854680120944977
Validation loss = 0.28803128004074097
Validation loss = 0.28599780797958374
Validation loss = 0.2880346477031708
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2842858135700226
Validation loss = 0.281999796628952
Validation loss = 0.2831333875656128
Validation loss = 0.2847358286380768
Validation loss = 0.28296521306037903
Validation loss = 0.2881239950656891
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2908239960670471
Validation loss = 0.2905253767967224
Validation loss = 0.2897656559944153
Validation loss = 0.2892361283302307
Validation loss = 0.2914935350418091
Validation loss = 0.2908794581890106
Validation loss = 0.2923548221588135
Validation loss = 0.29250162839889526
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2889046370983124
Validation loss = 0.28498411178588867
Validation loss = 0.28507697582244873
Validation loss = 0.2867971360683441
Validation loss = 0.28673040866851807
Validation loss = 0.2857499122619629
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2920941412448883
Validation loss = 0.290065199136734
Validation loss = 0.2901349365711212
Validation loss = 0.29050180315971375
Validation loss = 0.29035571217536926
Validation loss = 0.29071786999702454
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -69.6    |
| Iteration     | 75       |
| MaximumReturn | -0.555   |
| MinimumReturn | -138     |
| TotalSamples  | 128282   |
----------------------------
itr #76 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.28712737560272217
Validation loss = 0.2881731688976288
Validation loss = 0.28761744499206543
Validation loss = 0.28831127285957336
Validation loss = 0.292124480009079
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.28694260120391846
Validation loss = 0.28643935918807983
Validation loss = 0.2868179976940155
Validation loss = 0.28475290536880493
Validation loss = 0.28731513023376465
Validation loss = 0.289378821849823
Validation loss = 0.2896852493286133
Validation loss = 0.2872913181781769
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2932360768318176
Validation loss = 0.2936912178993225
Validation loss = 0.2928440570831299
Validation loss = 0.29415857791900635
Validation loss = 0.2953025698661804
Validation loss = 0.2939736843109131
Validation loss = 0.2940272092819214
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2896001935005188
Validation loss = 0.2877733111381531
Validation loss = 0.28838199377059937
Validation loss = 0.2895682752132416
Validation loss = 0.28952229022979736
Validation loss = 0.2890082001686096
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29376858472824097
Validation loss = 0.2924070954322815
Validation loss = 0.2938036024570465
Validation loss = 0.29314911365509033
Validation loss = 0.29360756278038025
Validation loss = 0.2961428165435791
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -57.2    |
| Iteration     | 76       |
| MaximumReturn | -1.99    |
| MinimumReturn | -117     |
| TotalSamples  | 129948   |
----------------------------
itr #77 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29090821743011475
Validation loss = 0.29036253690719604
Validation loss = 0.2896029055118561
Validation loss = 0.29106953740119934
Validation loss = 0.2897503972053528
Validation loss = 0.2937329113483429
Validation loss = 0.29370278120040894
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.28999200463294983
Validation loss = 0.2901977002620697
Validation loss = 0.28934362530708313
Validation loss = 0.2923504710197449
Validation loss = 0.29225248098373413
Validation loss = 0.2906770408153534
Validation loss = 0.2913912832736969
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29486390948295593
Validation loss = 0.2968507409095764
Validation loss = 0.29825010895729065
Validation loss = 0.295352578163147
Validation loss = 0.2972038984298706
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29206985235214233
Validation loss = 0.29386886954307556
Validation loss = 0.2910143733024597
Validation loss = 0.2910574972629547
Validation loss = 0.292495459318161
Validation loss = 0.29141607880592346
Validation loss = 0.29347503185272217
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2970113456249237
Validation loss = 0.29611390829086304
Validation loss = 0.29573777318000793
Validation loss = 0.2949463427066803
Validation loss = 0.2953883409500122
Validation loss = 0.2957487106323242
Validation loss = 0.29631099104881287
Validation loss = 0.2960852086544037
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -33.9    |
| Iteration     | 77       |
| MaximumReturn | -0.92    |
| MinimumReturn | -88.6    |
| TotalSamples  | 131614   |
----------------------------
itr #78 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2915709614753723
Validation loss = 0.29067450761795044
Validation loss = 0.2912071943283081
Validation loss = 0.2927943170070648
Validation loss = 0.29144027829170227
Validation loss = 0.29304948449134827
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2929062843322754
Validation loss = 0.28920018672943115
Validation loss = 0.2882979214191437
Validation loss = 0.2911330759525299
Validation loss = 0.2913953363895416
Validation loss = 0.29065197706222534
Validation loss = 0.29053977131843567
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29550090432167053
Validation loss = 0.2974688708782196
Validation loss = 0.29690489172935486
Validation loss = 0.2950538396835327
Validation loss = 0.2972536087036133
Validation loss = 0.29982301592826843
Validation loss = 0.2989928722381592
Validation loss = 0.29733872413635254
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29357075691223145
Validation loss = 0.291044145822525
Validation loss = 0.29268062114715576
Validation loss = 0.29185420274734497
Validation loss = 0.2941350042819977
Validation loss = 0.2930784821510315
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2972095310688019
Validation loss = 0.2973949611186981
Validation loss = 0.2972964644432068
Validation loss = 0.2968485653400421
Validation loss = 0.29564735293388367
Validation loss = 0.29648593068122864
Validation loss = 0.29689186811447144
Validation loss = 0.29984429478645325
Validation loss = 0.2991342842578888
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
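The "Obtaining samples for iteration N..." lines are the inner policy-optimization loop: TRPO is run for 20 inner iterations per outer iteration, each collecting a fresh batch of samples (presumably rollouts through the learned dynamics ensemble, since no real-environment path counters appear here) before a trust-region policy update. A schematic of that loop with the sampling and update steps passed in as callables; both names are placeholders, not the project's API:

```python
def train_policy(sample_fn, update_fn, n_iterations=20):
    """Inner TRPO loop mirroring the 'Obtaining samples for iteration N...' lines.

    sample_fn(it) is assumed to collect one batch of rollouts and
    update_fn(batch) to apply one trust-region policy step; both callables
    are illustrative placeholders supplied by the caller.
    """
    print("Training policy using TRPO.")
    for it in range(n_iterations):
        print(f"Obtaining samples for iteration {it}...")
        batch = sample_fn(it)
        update_fn(batch)
    print("Done training policy.")
```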
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
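"Updating normalization." most likely refreshes the mean/std statistics used to whiten the dynamics model's inputs and targets now that new on-policy data has been added. A minimal sketch of recomputing such statistics over the aggregated dataset; the dict-of-arrays layout is an assumption:

```python
import numpy as np

def compute_normalization(data, eps=1e-6):
    """Recompute per-dimension mean/std over everything gathered so far.

    `data` is assumed to map names (e.g. observations, actions, state deltas)
    to 2-D arrays; eps guards near-constant dimensions against division by zero.
    """
    return {key: (arr.mean(axis=0), arr.std(axis=0) + eps)
            for key, arr in data.items()}

def normalize(x, stats):
    mean, std = stats
    return (x - mean) / std
```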
----------------------------
| AverageReturn | -51      |
| Iteration     | 78       |
| MaximumReturn | -4.14    |
| MinimumReturn | -150     |
| TotalSamples  | 133280   |
----------------------------
itr #79 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29427072405815125
Validation loss = 0.2923441529273987
Validation loss = 0.2928364872932434
Validation loss = 0.29385530948638916
Validation loss = 0.29229575395584106
Validation loss = 0.2918844223022461
Validation loss = 0.2937813699245453
Validation loss = 0.29201167821884155
Validation loss = 0.29466113448143005
Validation loss = 0.2953779399394989
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2926535904407501
Validation loss = 0.2906232178211212
Validation loss = 0.29260504245758057
Validation loss = 0.29180508852005005
Validation loss = 0.29414597153663635
Validation loss = 0.2939211428165436
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2997880280017853
Validation loss = 0.29979246854782104
Validation loss = 0.2982444763183594
Validation loss = 0.29836219549179077
Validation loss = 0.29868441820144653
Validation loss = 0.30064070224761963
Validation loss = 0.30360451340675354
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.295157253742218
Validation loss = 0.29303258657455444
Validation loss = 0.2932548224925995
Validation loss = 0.2942551374435425
Validation loss = 0.29547107219696045
Validation loss = 0.29493752121925354
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30012810230255127
Validation loss = 0.2973368167877197
Validation loss = 0.2991631031036377
Validation loss = 0.3010963201522827
Validation loss = 0.2989768981933594
Validation loss = 0.2987898290157318
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -46.2    |
| Iteration     | 79       |
| MaximumReturn | -0.779   |
| MinimumReturn | -101     |
| TotalSamples  | 134946   |
----------------------------

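For post-hoc analysis, the per-iteration summaries can be scraped straight from a log like this one. A small parser sketch that collects (Iteration, AverageReturn) pairs; the file name in the usage comment is a placeholder:

```python
import re

def parse_returns(log_path):
    """Extract (iteration, average_return) pairs from the tabular summaries
    printed at the end of each outer iteration of the log."""
    avg_pat = re.compile(r"\|\s*AverageReturn\s*\|\s*(-?[\d.]+)")
    itr_pat = re.compile(r"\|\s*Iteration\s*\|\s*(\d+)")
    results, pending = [], None
    with open(log_path) as f:
        for line in f:
            m = avg_pat.search(line)
            if m:
                pending = float(m.group(1))  # AverageReturn row precedes Iteration row
                continue
            m = itr_pat.search(line)
            if m and pending is not None:
                results.append((int(m.group(1)), pending))
                pending = None
    return results

# Example usage (placeholder path, not the actual experiment directory):
# print(parse_returns("training.log"))
```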