Logging to experiments/invertedPendulum/IPO01/Tue-01-Nov-2022-09-49-35-PM-CDT_invertedPendulum_trpo_iteration_20_seed2231
Printing configuration...
{'env_name': 'invertedPendulum',
 'random_seeds': [3214, 2431, 2531, 2231],
 'save_variables': False,
 'model_save_dir': '/tmp/invertedPendulum_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 80,
 'num_path_random': 25,
 'num_path_onpol': 25,
 'env_horizon': 100,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward',
                               'itr': 0,
                               'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'reg_coeff': 0.0,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1,
                              'damping': 0.001,
                              'momentum': 0.9,
                              'kl_clip': 0.0001,
                              'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [64, 64],
            'init_logstd': 0.0,
            'activation': 'tanh',
            'reinitialize_every_itr': False},
 'trpo': {'horizon': 100,
          'gamma': 0.99,
          'step_size': 0.01,
          'iterations': 20,
          'batch_size': 50000,
          'gae': 0.95,
          'visualization': False,
          'visualize_iterations': [0]},
 'algo': 'trpo'}
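A quick sanity check on how these settings map onto the log that follows; a minimal sketch, with `config` hand-copied from the dict above rather than produced by the repository's own loader:

config = {'num_path_random': 25, 'env_horizon': 100,
          'dynamics': {'ensemble_model_count': 5, 'epochs': 200},
          'trpo': {'iterations': 20}}

# 25 random paths x 100-step horizon = 2500 timesteps per collection phase.
# The per-path counter below is printed *before* each path is collected,
# so the final line of the phase reads "Path 24 | total_timesteps 2400".
steps_per_phase = config['num_path_random'] * config['env_horizon']
assert steps_per_phase == 2500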
Generating random rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating random rollouts.
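The phase above collects initial training data for the dynamics model by rolling a random policy. A minimal sketch of such a collection loop, assuming a gym-style environment (pre-0.26 step API); the path-dict layout is illustrative, not the repository's exact format:

import numpy as np

def sample_random_paths(env, num_paths=25, horizon=100):
    # Roll a uniform random policy for num_paths episodes of at most
    # `horizon` steps each, mirroring num_path_random/env_horizon above.
    paths, total_timesteps = [], 0
    for i in range(num_paths):
        print(f"Path {i} | total_timesteps {total_timesteps}.")
        obs, acts, rews = [], [], []
        ob = env.reset()
        for _ in range(horizon):
            ac = env.action_space.sample()          # uniform random action
            next_ob, rew, done, _ = env.step(ac)
            obs.append(ob)
            acts.append(ac)
            rews.append(rew)
            ob = next_ob
            total_timesteps += 1
            if done:
                break
        paths.append({'observations': np.array(obs),
                      'actions': np.array(acts),
                      'rewards': np.array(rews)})
    return paths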
Creating normalization for training data.
Done creating normalization for training data.
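The normalization step typically amounts to per-dimension mean/std statistics used to whiten the dynamics model's inputs and targets; a minimal sketch, with the exact quantities normalized (states, actions, state deltas) left as an assumption:

import numpy as np

def compute_normalization(data, eps=1e-8):
    # Per-dimension mean and standard deviation over the training set;
    # eps keeps the later division safe when a dimension has zero variance.
    mean = data.mean(axis=0)
    std = data.std(axis=0) + eps
    return mean, std

# usage: whitened = (data - mean) / std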
Particle ensemble enabled? True
An ensemble of 5 dynamics models (<class 'model.dynamics.NNDynamicsModel'>) initialized.
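Per the config (n_layers=4, hidden_size=1000, relu), each ensemble member is plausibly an MLP from (state, action) to a next-state prediction. A minimal sketch of initializing five such networks; the original code may well be TensorFlow-based, and the PyTorch form and the pendulum dimensions here are assumptions:

import torch.nn as nn

def make_dynamics_net(obs_dim, act_dim, n_layers=4, hidden=1000):
    # One ensemble member: a 4-layer ReLU MLP mapping a concatenated
    # (observation, action) vector to a predicted next observation
    # (or observation delta; which target the repo uses is assumed).
    layers, in_dim = [], obs_dim + act_dim
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, obs_dim))
    return nn.Sequential(*layers)

# ensemble_model_count = 5; obs_dim/act_dim are placeholder pendulum sizes
ensemble = [make_dynamics_net(obs_dim=4, act_dim=1) for _ in range(5)]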
Train dynamics model with intrinsic reward only? False
Pre-training enabled; the pre-training phase uses intrinsic reward only.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
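The intrinsic_reward_coeff and ita settings suggest an exploration bonus derived from the model ensemble. One common choice, sketched below, is disagreement (predictive variance) across ensemble members; whether this matches the repository's exact definition is an assumption:

import numpy as np

def intrinsic_reward(ensemble_preds, coeff=1.0):
    # ensemble_preds: list of next-state predictions, one per model.
    # High variance across the 5 models flags poorly-modeled states,
    # so the bonus steers exploration toward them.
    preds = np.stack(ensemble_preds)            # (n_models, obs_dim)
    return coeff * preds.var(axis=0).mean()     # scalar disagreement bonus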
itr #0
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.8447191715240479
Validation loss = 0.6687535047531128
Validation loss = 0.6499295830726624
Validation loss = 0.624647855758667
Validation loss = 0.6129534840583801
Validation loss = 0.6030133366584778
Validation loss = 0.6142218112945557
Validation loss = 0.5916587114334106
Validation loss = 0.5622785687446594
Validation loss = 0.5724918842315674
Validation loss = 0.5546044111251831
Validation loss = 0.5503038167953491
Validation loss = 0.5728588104248047
Validation loss = 0.5447510480880737
Validation loss = 0.5521541237831116
Validation loss = 0.5592381358146667
Validation loss = 0.5833870768547058
Validation loss = 0.5415529012680054
Validation loss = 0.5434002876281738
Validation loss = 0.5375659465789795
Validation loss = 0.558007538318634
Validation loss = 0.5448025465011597
Validation loss = 0.5360581874847412
Validation loss = 0.5270528793334961
Validation loss = 0.5268961787223816
Validation loss = 0.5247061848640442
Validation loss = 0.5185284614562988
Validation loss = 0.5062139630317688
Validation loss = 0.5122077465057373
Validation loss = 0.5289949774742126
Validation loss = 0.531217634677887
Validation loss = 0.5083186030387878
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.8245090246200562
Validation loss = 0.6848828792572021
Validation loss = 0.6837112307548523
Validation loss = 0.6708071231842041
Validation loss = 0.6247708797454834
Validation loss = 0.6165009140968323
Validation loss = 0.5996282696723938
Validation loss = 0.5981976389884949
Validation loss = 0.5734545588493347
Validation loss = 0.5649545788764954
Validation loss = 0.5657212734222412
Validation loss = 0.5524088740348816
Validation loss = 0.553097128868103
Validation loss = 0.540779173374176
Validation loss = 0.549608051776886
Validation loss = 0.5408240556716919
Validation loss = 0.5454971194267273
Validation loss = 0.5241151452064514
Validation loss = 0.5330374240875244
Validation loss = 0.5317519903182983
Validation loss = 0.5382811427116394
Validation loss = 0.5275160074234009
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8285409808158875
Validation loss = 0.7030138969421387
Validation loss = 0.6438477039337158
Validation loss = 0.6438790559768677
Validation loss = 0.6176291704177856
Validation loss = 0.6055388450622559
Validation loss = 0.5928071737289429
Validation loss = 0.5995524525642395
Validation loss = 0.5686476230621338
Validation loss = 0.5757919549942017
Validation loss = 0.5525128245353699
Validation loss = 0.5769079923629761
Validation loss = 0.561787486076355
Validation loss = 0.5539518594741821
Validation loss = 0.5341269969940186
Validation loss = 0.5588244795799255
Validation loss = 0.5458762645721436
Validation loss = 0.5475490689277649
Validation loss = 0.5502467751502991
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8247571587562561
Validation loss = 0.705652117729187
Validation loss = 0.6617929339408875
Validation loss = 0.6420865654945374
Validation loss = 0.6202266812324524
Validation loss = 0.6166902780532837
Validation loss = 0.5995770692825317
Validation loss = 0.5950444340705872
Validation loss = 0.5907937288284302
Validation loss = 0.576928973197937
Validation loss = 0.5628129243850708
Validation loss = 0.5512192249298096
Validation loss = 0.5507011413574219
Validation loss = 0.5603674054145813
Validation loss = 0.5431795716285706
Validation loss = 0.5552309155464172
Validation loss = 0.5559220910072327
Validation loss = 0.5576342940330505
Validation loss = 0.529213547706604
Validation loss = 0.5251595973968506
Validation loss = 0.5247935652732849
Validation loss = 0.5275325179100037
Validation loss = 0.5389300584793091
Validation loss = 0.5295605659484863
Validation loss = 0.5202992558479309
Validation loss = 0.5196225047111511
Validation loss = 0.5192089676856995
Validation loss = 0.529137909412384
Validation loss = 0.5264381766319275
Validation loss = 0.5263117551803589
Validation loss = 0.5285235643386841
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8354986310005188
Validation loss = 0.6839473843574524
Validation loss = 0.6489758491516113
Validation loss = 0.6370353698730469
Validation loss = 0.621570885181427
Validation loss = 0.6242642998695374
Validation loss = 0.5916622877120972
Validation loss = 0.5791723132133484
Validation loss = 0.5933065414428711
Validation loss = 0.5547686815261841
Validation loss = 0.5706554651260376
Validation loss = 0.5477202534675598
Validation loss = 0.5627102851867676
Validation loss = 0.5392554998397827
Validation loss = 0.5462188124656677
Validation loss = 0.5260832905769348
Validation loss = 0.5407929420471191
Validation loss = 0.5236541628837585
Validation loss = 0.5303522348403931
Validation loss = 0.5377265214920044
Validation loss = 0.531325101852417
Validation loss = 0.5241707563400269
Done fitting dynamics.
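Each model above prints a different number of "Validation loss" lines, far fewer than the configured 200 epochs, which is consistent with validation-based early stopping. A minimal sketch of that loop; the patience rule and the callable interface are assumptions:

def fit_with_early_stopping(train_epoch, val_loss, max_epochs=200, patience=5):
    # train_epoch(): one pass over the training set (supplied by caller).
    # val_loss(): current validation loss (supplied by caller).
    # Stop once validation loss has not improved for `patience` epochs.
    best, since_best = float('inf'), 0
    for _ in range(max_epochs):
        train_epoch()
        loss = val_loss()
        print(f"Validation loss = {loss}")
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best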
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
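The 20 "Obtaining samples" lines match trpo.iterations=20 in the config: the policy is trained entirely inside the learned dynamics model before each real-environment evaluation. A structural sketch, with `sample_model_paths` and `trpo_update` standing in for the real implementations:

def train_policy_on_model(policy, ensemble, sample_model_paths, trpo_update,
                          iterations=20):
    # Each iteration: sample imagined trajectories from the learned
    # ensemble (horizon=100, batch_size=50000 per the trpo config),
    # then take one TRPO step (step_size=0.01, gamma=0.99, gae=0.95).
    for it in range(iterations):
        print(f"Obtaining samples for iteration {it}...")
        paths = sample_model_paths(policy, ensemble,
                                   horizon=100, batch_size=50000)
        trpo_update(policy, paths, step_size=0.01, gamma=0.99, gae=0.95)
    print("Done training policy.")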
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -201     |
| Iteration     | 0        |
| MaximumReturn | -189     |
| MinimumReturn | -213     |
| TotalSamples  | 3332     |
----------------------------
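The table's return statistics are plausibly computed from the 25 evaluation rollouts just collected: per-path undiscounted return, then mean/max/min across paths. A minimal sketch (single-line output; the boxed table layout is the repository logger's own):

import numpy as np

def log_return_stats(paths, iteration, total_samples):
    # Sum rewards within each path, then aggregate across the 25 rollouts.
    returns = [p['rewards'].sum() for p in paths]
    print(f"AverageReturn {np.mean(returns):.3g} | Iteration {iteration} | "
          f"MaximumReturn {np.max(returns):.3g} | "
          f"MinimumReturn {np.min(returns):.3g} | "
          f"TotalSamples {total_samples}")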
itr #1
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.8214653730392456
Validation loss = 0.6344581842422485
Validation loss = 0.6067807078361511
Validation loss = 0.591841459274292
Validation loss = 0.556523859500885
Validation loss = 0.533157229423523
Validation loss = 0.5316387414932251
Validation loss = 0.5538725256919861
Validation loss = 0.535015881061554
Validation loss = 0.5117304921150208
Validation loss = 0.5083585977554321
Validation loss = 0.4896857440471649
Validation loss = 0.507803201675415
Validation loss = 0.47902435064315796
Validation loss = 0.47800523042678833
Validation loss = 0.47826632857322693
Validation loss = 0.474308043718338
Validation loss = 0.4791143834590912
Validation loss = 0.4624235928058624
Validation loss = 0.4794616401195526
Validation loss = 0.4967174828052521
Validation loss = 0.460961252450943
Validation loss = 0.4902704060077667
Validation loss = 0.4682517945766449
Validation loss = 0.450292706489563
Validation loss = 0.45239248871803284
Validation loss = 0.46285220980644226
Validation loss = 0.4691106379032135
Validation loss = 0.4512239992618561
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.763841450214386
Validation loss = 0.6465626955032349
Validation loss = 0.6036360263824463
Validation loss = 0.5701339244842529
Validation loss = 0.5668137073516846
Validation loss = 0.5524507761001587
Validation loss = 0.5525175333023071
Validation loss = 0.5599849224090576
Validation loss = 0.5477275252342224
Validation loss = 0.5318930745124817
Validation loss = 0.5036168694496155
Validation loss = 0.5007174611091614
Validation loss = 0.49233585596084595
Validation loss = 0.5056105256080627
Validation loss = 0.5130829811096191
Validation loss = 0.4887557923793793
Validation loss = 0.48445719480514526
Validation loss = 0.48860064148902893
Validation loss = 0.4806217849254608
Validation loss = 0.46098291873931885
Validation loss = 0.48360398411750793
Validation loss = 0.49472591280937195
Validation loss = 0.46713191270828247
Validation loss = 0.4627011716365814
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7145459055900574
Validation loss = 0.6180599331855774
Validation loss = 0.5792033672332764
Validation loss = 0.5638266801834106
Validation loss = 0.5527635216712952
Validation loss = 0.5648757815361023
Validation loss = 0.5244637131690979
Validation loss = 0.5246370434761047
Validation loss = 0.5151149034500122
Validation loss = 0.519989013671875
Validation loss = 0.5000741481781006
Validation loss = 0.5013129711151123
Validation loss = 0.5063315629959106
Validation loss = 0.48685237765312195
Validation loss = 0.49117225408554077
Validation loss = 0.49132946133613586
Validation loss = 0.5055450797080994
Validation loss = 0.494116872549057
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8425868153572083
Validation loss = 0.6518011689186096
Validation loss = 0.614592432975769
Validation loss = 0.5954888463020325
Validation loss = 0.557272732257843
Validation loss = 0.5633211135864258
Validation loss = 0.547681987285614
Validation loss = 0.5256214737892151
Validation loss = 0.5055409073829651
Validation loss = 0.49943646788597107
Validation loss = 0.4918361306190491
Validation loss = 0.5209115147590637
Validation loss = 0.4836803376674652
Validation loss = 0.47757765650749207
Validation loss = 0.4627566933631897
Validation loss = 0.47499680519104004
Validation loss = 0.46909528970718384
Validation loss = 0.4674743115901947
Validation loss = 0.4603489637374878
Validation loss = 0.46277153491973877
Validation loss = 0.46795910596847534
Validation loss = 0.455331414937973
Validation loss = 0.48056572675704956
Validation loss = 0.4571821391582489
Validation loss = 0.45547521114349365
Validation loss = 0.43240806460380554
Validation loss = 0.47629833221435547
Validation loss = 0.4615245759487152
Validation loss = 0.44395703077316284
Validation loss = 0.458168625831604
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7276850342750549
Validation loss = 0.6251760125160217
Validation loss = 0.5833620429039001
Validation loss = 0.56350177526474
Validation loss = 0.5399130582809448
Validation loss = 0.5509966015815735
Validation loss = 0.5295529961585999
Validation loss = 0.5089488625526428
Validation loss = 0.5153143405914307
Validation loss = 0.5039758086204529
Validation loss = 0.4796357750892639
Validation loss = 0.5094206929206848
Validation loss = 0.4942738115787506
Validation loss = 0.4689978063106537
Validation loss = 0.4712570607662201
Validation loss = 0.4667527973651886
Validation loss = 0.48898711800575256
Validation loss = 0.4728885591030121
Validation loss = 0.4695899188518524
Validation loss = 0.47545674443244934
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.315   |
| Iteration     | 1        |
| MaximumReturn | -0.181   |
| MinimumReturn | -0.509   |
| TotalSamples  | 4998     |
----------------------------
itr #2
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5854362845420837
Validation loss = 0.4459778666496277
Validation loss = 0.42837077379226685
Validation loss = 0.42115718126296997
Validation loss = 0.410030722618103
Validation loss = 0.40203991532325745
Validation loss = 0.39375004172325134
Validation loss = 0.37886786460876465
Validation loss = 0.3746432662010193
Validation loss = 0.3732927441596985
Validation loss = 0.37674039602279663
Validation loss = 0.3568357527256012
Validation loss = 0.3614652752876282
Validation loss = 0.36140313744544983
Validation loss = 0.35164082050323486
Validation loss = 0.35629796981811523
Validation loss = 0.3519825041294098
Validation loss = 0.34563103318214417
Validation loss = 0.3544350862503052
Validation loss = 0.3585459589958191
Validation loss = 0.34875357151031494
Validation loss = 0.3582099676132202
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5538130402565002
Validation loss = 0.44596898555755615
Validation loss = 0.4306527376174927
Validation loss = 0.41766679286956787
Validation loss = 0.4058855175971985
Validation loss = 0.40009671449661255
Validation loss = 0.38998377323150635
Validation loss = 0.37815558910369873
Validation loss = 0.37282702326774597
Validation loss = 0.3684964179992676
Validation loss = 0.36096763610839844
Validation loss = 0.3695840835571289
Validation loss = 0.35329508781433105
Validation loss = 0.3629274368286133
Validation loss = 0.3666311800479889
Validation loss = 0.3676949441432953
Validation loss = 0.3607184588909149
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5616200566291809
Validation loss = 0.4431707262992859
Validation loss = 0.4198790192604065
Validation loss = 0.4126686453819275
Validation loss = 0.40485674142837524
Validation loss = 0.3861977756023407
Validation loss = 0.4249742031097412
Validation loss = 0.38204240798950195
Validation loss = 0.38246023654937744
Validation loss = 0.3643571734428406
Validation loss = 0.36210572719573975
Validation loss = 0.3686160445213318
Validation loss = 0.3740065395832062
Validation loss = 0.36414894461631775
Validation loss = 0.35323429107666016
Validation loss = 0.3713369071483612
Validation loss = 0.3536252975463867
Validation loss = 0.3628973364830017
Validation loss = 0.365745484828949
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5843475461006165
Validation loss = 0.449695885181427
Validation loss = 0.42795413732528687
Validation loss = 0.41568630933761597
Validation loss = 0.40630048513412476
Validation loss = 0.4045065641403198
Validation loss = 0.39457380771636963
Validation loss = 0.38440603017807007
Validation loss = 0.3738253116607666
Validation loss = 0.37163057923316956
Validation loss = 0.3632745146751404
Validation loss = 0.35745346546173096
Validation loss = 0.35397636890411377
Validation loss = 0.35855090618133545
Validation loss = 0.3559620976448059
Validation loss = 0.3508946895599365
Validation loss = 0.3541785478591919
Validation loss = 0.3685309886932373
Validation loss = 0.35625943541526794
Validation loss = 0.3581565022468567
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5611311197280884
Validation loss = 0.44191956520080566
Validation loss = 0.42550456523895264
Validation loss = 0.4165610671043396
Validation loss = 0.4105794429779053
Validation loss = 0.401752769947052
Validation loss = 0.39186376333236694
Validation loss = 0.3781527876853943
Validation loss = 0.40385690331459045
Validation loss = 0.3686613440513611
Validation loss = 0.36444348096847534
Validation loss = 0.3596040904521942
Validation loss = 0.3599403500556946
Validation loss = 0.35802167654037476
Validation loss = 0.35906144976615906
Validation loss = 0.347522497177124
Validation loss = 0.3517003059387207
Validation loss = 0.3501701056957245
Validation loss = 0.3475699722766876
Validation loss = 0.3474208414554596
Validation loss = 0.34900912642478943
Validation loss = 0.3520733714103699
Validation loss = 0.36731642484664917
Validation loss = 0.3482878804206848
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -32      |
| Iteration     | 2        |
| MaximumReturn | -0.107   |
| MinimumReturn | -80.1    |
| TotalSamples  | 6664     |
----------------------------
itr #3
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4224133789539337
Validation loss = 0.36293014883995056
Validation loss = 0.34373071789741516
Validation loss = 0.3309608995914459
Validation loss = 0.33228135108947754
Validation loss = 0.334263414144516
Validation loss = 0.3287031948566437
Validation loss = 0.32730868458747864
Validation loss = 0.3344871699810028
Validation loss = 0.328193336725235
Validation loss = 0.33331504464149475
Validation loss = 0.3320070207118988
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4037221372127533
Validation loss = 0.363344669342041
Validation loss = 0.3454783260822296
Validation loss = 0.3376162052154541
Validation loss = 0.3317258358001709
Validation loss = 0.33739712834358215
Validation loss = 0.330287903547287
Validation loss = 0.3311748504638672
Validation loss = 0.3310856819152832
Validation loss = 0.3302393853664398
Validation loss = 0.32961222529411316
Validation loss = 0.3272881805896759
Validation loss = 0.3302684724330902
Validation loss = 0.3312559127807617
Validation loss = 0.33159127831459045
Validation loss = 0.33413389325141907
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4237813949584961
Validation loss = 0.35943254828453064
Validation loss = 0.3388758599758148
Validation loss = 0.33414915204048157
Validation loss = 0.33138975501060486
Validation loss = 0.33508428931236267
Validation loss = 0.33110931515693665
Validation loss = 0.3431273400783539
Validation loss = 0.33107444643974304
Validation loss = 0.3422866761684418
Validation loss = 0.3354967534542084
Validation loss = 0.3391582667827606
Validation loss = 0.33847320079803467
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4252438545227051
Validation loss = 0.3585338294506073
Validation loss = 0.3379303514957428
Validation loss = 0.3378707468509674
Validation loss = 0.33655500411987305
Validation loss = 0.3372964859008789
Validation loss = 0.3379237949848175
Validation loss = 0.32778143882751465
Validation loss = 0.34437474608421326
Validation loss = 0.32995104789733887
Validation loss = 0.3361356258392334
Validation loss = 0.3321796953678131
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4256567656993866
Validation loss = 0.35587847232818604
Validation loss = 0.33168038725852966
Validation loss = 0.33462369441986084
Validation loss = 0.33645525574684143
Validation loss = 0.3330312669277191
Validation loss = 0.33500710129737854
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -48.7    |
| Iteration     | 3        |
| MaximumReturn | -0.0829  |
| MinimumReturn | -115     |
| TotalSamples  | 8330     |
----------------------------
itr #4
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35027894377708435
Validation loss = 0.32663631439208984
Validation loss = 0.327034056186676
Validation loss = 0.33061593770980835
Validation loss = 0.3295167088508606
Validation loss = 0.32853198051452637
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3426334261894226
Validation loss = 0.326974093914032
Validation loss = 0.32448339462280273
Validation loss = 0.3259906768798828
Validation loss = 0.3257278800010681
Validation loss = 0.32775822281837463
Validation loss = 0.32670050859451294
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3539874851703644
Validation loss = 0.3399645686149597
Validation loss = 0.33023595809936523
Validation loss = 0.33278846740722656
Validation loss = 0.3356250822544098
Validation loss = 0.3292348384857178
Validation loss = 0.3315616250038147
Validation loss = 0.33046260476112366
Validation loss = 0.33075785636901855
Validation loss = 0.33389315009117126
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35190922021865845
Validation loss = 0.3257596492767334
Validation loss = 0.3275315761566162
Validation loss = 0.32632017135620117
Validation loss = 0.33122435212135315
Validation loss = 0.327182799577713
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3472353219985962
Validation loss = 0.3261141777038574
Validation loss = 0.3262695372104645
Validation loss = 0.3261398375034332
Validation loss = 0.32782846689224243
Validation loss = 0.32333534955978394
Validation loss = 0.32325300574302673
Validation loss = 0.3272723853588104
Validation loss = 0.3307298719882965
Validation loss = 0.3303399085998535
Validation loss = 0.3233896791934967
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -31.7    |
| Iteration     | 4        |
| MaximumReturn | -0.0386  |
| MinimumReturn | -88.7    |
| TotalSamples  | 9996     |
----------------------------
itr #5
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3376534879207611
Validation loss = 0.31938233971595764
Validation loss = 0.3180135190486908
Validation loss = 0.32000645995140076
Validation loss = 0.3207290768623352
Validation loss = 0.3232637345790863
Validation loss = 0.31925275921821594
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3424527049064636
Validation loss = 0.3258017599582672
Validation loss = 0.32317855954170227
Validation loss = 0.3179923892021179
Validation loss = 0.3239431381225586
Validation loss = 0.32418063282966614
Validation loss = 0.32249027490615845
Validation loss = 0.3228594660758972
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3340020775794983
Validation loss = 0.3265930116176605
Validation loss = 0.323867529630661
Validation loss = 0.3197297155857086
Validation loss = 0.32772764563560486
Validation loss = 0.32454800605773926
Validation loss = 0.32813581824302673
Validation loss = 0.3186953365802765
Validation loss = 0.3252040147781372
Validation loss = 0.321661114692688
Validation loss = 0.32879194617271423
Validation loss = 0.33050668239593506
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33592039346694946
Validation loss = 0.3191946744918823
Validation loss = 0.31967273354530334
Validation loss = 0.3221026062965393
Validation loss = 0.3205467164516449
Validation loss = 0.31948134303092957
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33903253078460693
Validation loss = 0.3171352446079254
Validation loss = 0.3230842053890228
Validation loss = 0.32182085514068604
Validation loss = 0.3204634189605713
Validation loss = 0.32252246141433716
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -26.5    |
| Iteration     | 5        |
| MaximumReturn | -0.0374  |
| MinimumReturn | -107     |
| TotalSamples  | 11662    |
----------------------------
itr #6
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3372611403465271
Validation loss = 0.3254237473011017
Validation loss = 0.32416749000549316
Validation loss = 0.32727593183517456
Validation loss = 0.3225499987602234
Validation loss = 0.3318292200565338
Validation loss = 0.326471209526062
Validation loss = 0.3297995626926422
Validation loss = 0.3255082964897156
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33596450090408325
Validation loss = 0.3296232223510742
Validation loss = 0.3278577923774719
Validation loss = 0.325921893119812
Validation loss = 0.32695096731185913
Validation loss = 0.3211975693702698
Validation loss = 0.329987108707428
Validation loss = 0.33227330446243286
Validation loss = 0.33632639050483704
Validation loss = 0.32962530851364136
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3375368118286133
Validation loss = 0.32991519570350647
Validation loss = 0.3318813145160675
Validation loss = 0.33020779490470886
Validation loss = 0.33633777499198914
Validation loss = 0.3291521966457367
Validation loss = 0.33336374163627625
Validation loss = 0.34526491165161133
Validation loss = 0.33506739139556885
Validation loss = 0.3344050645828247
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3470272123813629
Validation loss = 0.32383522391319275
Validation loss = 0.32402199506759644
Validation loss = 0.3205721378326416
Validation loss = 0.33696743845939636
Validation loss = 0.32417765259742737
Validation loss = 0.32652488350868225
Validation loss = 0.3286423087120056
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3416983187198639
Validation loss = 0.32421618700027466
Validation loss = 0.3310742974281311
Validation loss = 0.3256235718727112
Validation loss = 0.32365864515304565
Validation loss = 0.325672447681427
Validation loss = 0.32725441455841064
Validation loss = 0.322564035654068
Validation loss = 0.32363301515579224
Validation loss = 0.3225591480731964
Validation loss = 0.3287973999977112
Validation loss = 0.329842209815979
Validation loss = 0.34351009130477905
Validation loss = 0.3335579037666321
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -12.1    |
| Iteration     | 6        |
| MaximumReturn | -0.0456  |
| MinimumReturn | -60      |
| TotalSamples  | 13328    |
----------------------------
itr #7
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3479417860507965
Validation loss = 0.3245084881782532
Validation loss = 0.32430145144462585
Validation loss = 0.3293320834636688
Validation loss = 0.3306412696838379
Validation loss = 0.3244146704673767
Validation loss = 0.33247122168540955
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33013448119163513
Validation loss = 0.32467353343963623
Validation loss = 0.3251132369041443
Validation loss = 0.3257335424423218
Validation loss = 0.331328421831131
Validation loss = 0.32618042826652527
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3467360734939575
Validation loss = 0.32979726791381836
Validation loss = 0.33477911353111267
Validation loss = 0.3317265510559082
Validation loss = 0.3371637761592865
Validation loss = 0.3344712555408478
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34500423073768616
Validation loss = 0.3256668150424957
Validation loss = 0.32291874289512634
Validation loss = 0.3211897611618042
Validation loss = 0.33001193404197693
Validation loss = 0.32684215903282166
Validation loss = 0.326368123292923
Validation loss = 0.3245215117931366
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3363074064254761
Validation loss = 0.330549031496048
Validation loss = 0.325417697429657
Validation loss = 0.3299989700317383
Validation loss = 0.3264836072921753
Validation loss = 0.3312794864177704
Validation loss = 0.32822856307029724
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -30.3    |
| Iteration     | 7        |
| MaximumReturn | -0.0278  |
| MinimumReturn | -112     |
| TotalSamples  | 14994    |
----------------------------
itr #8
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32440075278282166
Validation loss = 0.3207266926765442
Validation loss = 0.3248228430747986
Validation loss = 0.32207542657852173
Validation loss = 0.32561108469963074
Validation loss = 0.3216642439365387
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3291706144809723
Validation loss = 0.32887449860572815
Validation loss = 0.32733842730522156
Validation loss = 0.3234565258026123
Validation loss = 0.32201990485191345
Validation loss = 0.33049842715263367
Validation loss = 0.32448646426200867
Validation loss = 0.3265807032585144
Validation loss = 0.33045610785484314
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3312342166900635
Validation loss = 0.32654356956481934
Validation loss = 0.3238455355167389
Validation loss = 0.3302122950553894
Validation loss = 0.3326743543148041
Validation loss = 0.333615243434906
Validation loss = 0.3310621380805969
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3277854323387146
Validation loss = 0.3327038884162903
Validation loss = 0.3199024796485901
Validation loss = 0.3216191828250885
Validation loss = 0.32315441966056824
Validation loss = 0.3235154151916504
Validation loss = 0.32506415247917175
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3325255513191223
Validation loss = 0.32295793294906616
Validation loss = 0.32823115587234497
Validation loss = 0.32889899611473083
Validation loss = 0.3279576301574707
Validation loss = 0.32882142066955566
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.84    |
| Iteration     | 8        |
| MaximumReturn | -0.0166  |
| MinimumReturn | -66.7    |
| TotalSamples  | 16660    |
----------------------------
itr #9
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33506542444229126
Validation loss = 0.31935346126556396
Validation loss = 0.3239624500274658
Validation loss = 0.327788382768631
Validation loss = 0.3276573419570923
Validation loss = 0.3188019394874573
Validation loss = 0.32050710916519165
Validation loss = 0.32354170083999634
Validation loss = 0.32356321811676025
Validation loss = 0.32123875617980957
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3290983736515045
Validation loss = 0.32069265842437744
Validation loss = 0.32449325919151306
Validation loss = 0.3236527144908905
Validation loss = 0.32700878381729126
Validation loss = 0.3232539892196655
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3283039331436157
Validation loss = 0.3226223587989807
Validation loss = 0.3294209837913513
Validation loss = 0.3269859552383423
Validation loss = 0.3247257471084595
Validation loss = 0.33206406235694885
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3258770704269409
Validation loss = 0.320301353931427
Validation loss = 0.3191379904747009
Validation loss = 0.3197396993637085
Validation loss = 0.3209131956100464
Validation loss = 0.31948328018188477
Validation loss = 0.323416143655777
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3281232416629791
Validation loss = 0.32198143005371094
Validation loss = 0.3219083845615387
Validation loss = 0.3266115188598633
Validation loss = 0.32382509112358093
Validation loss = 0.32957714796066284
Validation loss = 0.32762610912323
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -9.05    |
| Iteration     | 9        |
| MaximumReturn | -0.0718  |
| MinimumReturn | -61.1    |
| TotalSamples  | 18326    |
----------------------------
itr #10
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32168325781822205
Validation loss = 0.31663835048675537
Validation loss = 0.32214629650115967
Validation loss = 0.3202136158943176
Validation loss = 0.3202971816062927
Validation loss = 0.3193768560886383
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3173157870769501
Validation loss = 0.32491233944892883
Validation loss = 0.32017841935157776
Validation loss = 0.3177843689918518
Validation loss = 0.32329124212265015
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32063013315200806
Validation loss = 0.3219822347164154
Validation loss = 0.32436686754226685
Validation loss = 0.3281196355819702
Validation loss = 0.3336877226829529
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31869757175445557
Validation loss = 0.32448679208755493
Validation loss = 0.3250858783721924
Validation loss = 0.32100027799606323
Validation loss = 0.32081326842308044
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3213956654071808
Validation loss = 0.3327813148498535
Validation loss = 0.32469260692596436
Validation loss = 0.3248820900917053
Validation loss = 0.3242197334766388
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -135     |
| Iteration     | 10       |
| MaximumReturn | -90      |
| MinimumReturn | -162     |
| TotalSamples  | 19992    |
----------------------------
itr #11
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32819265127182007
Validation loss = 0.32131820917129517
Validation loss = 0.3190799653530121
Validation loss = 0.32142192125320435
Validation loss = 0.3186364769935608
Validation loss = 0.3209995627403259
Validation loss = 0.32132935523986816
Validation loss = 0.3238530457019806
Validation loss = 0.32199209928512573
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.33345726132392883
Validation loss = 0.3188789188861847
Validation loss = 0.32102710008621216
Validation loss = 0.31967976689338684
Validation loss = 0.3222509026527405
Validation loss = 0.32180261611938477
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32840317487716675
Validation loss = 0.3214647173881531
Validation loss = 0.32321804761886597
Validation loss = 0.3258489966392517
Validation loss = 0.32589811086654663
Validation loss = 0.32933494448661804
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3245229125022888
Validation loss = 0.3178406357765198
Validation loss = 0.3184768259525299
Validation loss = 0.3222491145133972
Validation loss = 0.3203999400138855
Validation loss = 0.32192397117614746
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3276893198490143
Validation loss = 0.3238261342048645
Validation loss = 0.32023271918296814
Validation loss = 0.32169657945632935
Validation loss = 0.3214513063430786
Validation loss = 0.32397228479385376
Validation loss = 0.32310932874679565
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -132     |
| Iteration     | 11       |
| MaximumReturn | -68.9    |
| MinimumReturn | -171     |
| TotalSamples  | 21658    |
----------------------------
itr #12
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3174455165863037
Validation loss = 0.31650739908218384
Validation loss = 0.3164023160934448
Validation loss = 0.3189515173435211
Validation loss = 0.3223262131214142
Validation loss = 0.3228839337825775
Validation loss = 0.32611632347106934
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3158895671367645
Validation loss = 0.317121297121048
Validation loss = 0.3176723122596741
Validation loss = 0.317451536655426
Validation loss = 0.3201119005680084
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32163113355636597
Validation loss = 0.3225638270378113
Validation loss = 0.3300909996032715
Validation loss = 0.3216671943664551
Validation loss = 0.3241097927093506
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32383283972740173
Validation loss = 0.3160567879676819
Validation loss = 0.3197227418422699
Validation loss = 0.32156315445899963
Validation loss = 0.31929710507392883
Validation loss = 0.3166622519493103
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3192877769470215
Validation loss = 0.32110702991485596
Validation loss = 0.3206464648246765
Validation loss = 0.3238450288772583
Validation loss = 0.3232547342777252
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -160     |
| Iteration     | 12       |
| MaximumReturn | -120     |
| MinimumReturn | -181     |
| TotalSamples  | 23324    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3189347982406616
Validation loss = 0.3208909332752228
Validation loss = 0.31855469942092896
Validation loss = 0.3194035589694977
Validation loss = 0.3234148621559143
Validation loss = 0.32261836528778076
Validation loss = 0.32192039489746094
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31972774863243103
Validation loss = 0.3131072223186493
Validation loss = 0.31397005915641785
Validation loss = 0.31455227732658386
Validation loss = 0.31336522102355957
Validation loss = 0.3180398643016815
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3184097111225128
Validation loss = 0.32251736521720886
Validation loss = 0.321890652179718
Validation loss = 0.32434847950935364
Validation loss = 0.328784316778183
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31489965319633484
Validation loss = 0.31648728251457214
Validation loss = 0.31405147910118103
Validation loss = 0.3177601099014282
Validation loss = 0.31690895557403564
Validation loss = 0.31816521286964417
Validation loss = 0.3198047876358032
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3167515993118286
Validation loss = 0.32017216086387634
Validation loss = 0.31740978360176086
Validation loss = 0.3197013735771179
Validation loss = 0.31868550181388855
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -174     |
| Iteration     | 13       |
| MaximumReturn | -96.7    |
| MinimumReturn | -192     |
| TotalSamples  | 24990    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31659671664237976
Validation loss = 0.313975065946579
Validation loss = 0.31115731596946716
Validation loss = 0.31280282139778137
Validation loss = 0.31460121273994446
Validation loss = 0.3129216730594635
Validation loss = 0.3235233426094055
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31541958451271057
Validation loss = 0.3130609095096588
Validation loss = 0.31361499428749084
Validation loss = 0.3171185553073883
Validation loss = 0.3091108798980713
Validation loss = 0.31227952241897583
Validation loss = 0.3126119077205658
Validation loss = 0.31704744696617126
Validation loss = 0.3126889765262604
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31685471534729004
Validation loss = 0.31518593430519104
Validation loss = 0.3181745707988739
Validation loss = 0.31882908940315247
Validation loss = 0.31775209307670593
Validation loss = 0.3214346766471863
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3106556832790375
Validation loss = 0.3133377730846405
Validation loss = 0.31097880005836487
Validation loss = 0.310254842042923
Validation loss = 0.3139680325984955
Validation loss = 0.31399601697921753
Validation loss = 0.31697264313697815
Validation loss = 0.31324684619903564
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31556710600852966
Validation loss = 0.3187122046947479
Validation loss = 0.3187622129917145
Validation loss = 0.31448960304260254
Validation loss = 0.3154497444629669
Validation loss = 0.3178894519805908
Validation loss = 0.31890806555747986
Validation loss = 0.317778617143631
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -179     |
| Iteration     | 14       |
| MaximumReturn | -144     |
| MinimumReturn | -193     |
| TotalSamples  | 26656    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31107813119888306
Validation loss = 0.3116414248943329
Validation loss = 0.31301504373550415
Validation loss = 0.3144504129886627
Validation loss = 0.31553757190704346
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31748583912849426
Validation loss = 0.3141166567802429
Validation loss = 0.3107931911945343
Validation loss = 0.310284823179245
Validation loss = 0.31834059953689575
Validation loss = 0.31663796305656433
Validation loss = 0.3139788806438446
Validation loss = 0.3117491602897644
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3164803683757782
Validation loss = 0.3135085105895996
Validation loss = 0.31613975763320923
Validation loss = 0.3159463107585907
Validation loss = 0.3186643421649933
Validation loss = 0.3162609934806824
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3096955418586731
Validation loss = 0.30654022097587585
Validation loss = 0.3092963993549347
Validation loss = 0.31359827518463135
Validation loss = 0.3150354027748108
Validation loss = 0.31103020906448364
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3118434250354767
Validation loss = 0.31505265831947327
Validation loss = 0.3114180266857147
Validation loss = 0.3191673159599304
Validation loss = 0.3145409822463989
Validation loss = 0.3160688281059265
Validation loss = 0.31587836146354675
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -187     |
| Iteration     | 15       |
| MaximumReturn | -132     |
| MinimumReturn | -203     |
| TotalSamples  | 28322    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31352129578590393
Validation loss = 0.3151475489139557
Validation loss = 0.31099042296409607
Validation loss = 0.3154701888561249
Validation loss = 0.3155536651611328
Validation loss = 0.31499141454696655
Validation loss = 0.31378453969955444
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31026455760002136
Validation loss = 0.31040337681770325
Validation loss = 0.3134055733680725
Validation loss = 0.31719347834587097
Validation loss = 0.3157356083393097
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3211265206336975
Validation loss = 0.32392916083335876
Validation loss = 0.32384103536605835
Validation loss = 0.31848448514938354
Validation loss = 0.3185318410396576
Validation loss = 0.3206978142261505
Validation loss = 0.3201436698436737
Validation loss = 0.32078853249549866
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3103041648864746
Validation loss = 0.3097771108150482
Validation loss = 0.3108390271663666
Validation loss = 0.30886173248291016
Validation loss = 0.31053808331489563
Validation loss = 0.31131380796432495
Validation loss = 0.313059002161026
Validation loss = 0.3132602274417877
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31360509991645813
Validation loss = 0.312130868434906
Validation loss = 0.3196561336517334
Validation loss = 0.31763166189193726
Validation loss = 0.31609490513801575
Validation loss = 0.31971433758735657
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -188     |
| Iteration     | 16       |
| MaximumReturn | -164     |
| MinimumReturn | -199     |
| TotalSamples  | 29988    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31576666235923767
Validation loss = 0.31013232469558716
Validation loss = 0.3151557147502899
Validation loss = 0.31435316801071167
Validation loss = 0.3119555115699768
Validation loss = 0.3153095543384552
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3082248270511627
Validation loss = 0.313569039106369
Validation loss = 0.31567221879959106
Validation loss = 0.3119122385978699
Validation loss = 0.31349268555641174
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3222423791885376
Validation loss = 0.31477734446525574
Validation loss = 0.31962648034095764
Validation loss = 0.31747886538505554
Validation loss = 0.3187764585018158
Validation loss = 0.3248365819454193
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31429538130760193
Validation loss = 0.3101634681224823
Validation loss = 0.3138059079647064
Validation loss = 0.3111022412776947
Validation loss = 0.3103996217250824
Validation loss = 0.31406113505363464
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3113056719303131
Validation loss = 0.3143674433231354
Validation loss = 0.31543177366256714
Validation loss = 0.316238135099411
Validation loss = 0.31568634510040283
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -196     |
| Iteration     | 17       |
| MaximumReturn | -177     |
| MinimumReturn | -205     |
| TotalSamples  | 31654    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3160024881362915
Validation loss = 0.31154918670654297
Validation loss = 0.3077755570411682
Validation loss = 0.31955787539482117
Validation loss = 0.31578686833381653
Validation loss = 0.31405121088027954
Validation loss = 0.31917205452919006
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3117532730102539
Validation loss = 0.3139781951904297
Validation loss = 0.3097056448459625
Validation loss = 0.3163415491580963
Validation loss = 0.3099516034126282
Validation loss = 0.3120252788066864
Validation loss = 0.3117508292198181
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31757938861846924
Validation loss = 0.3196881413459778
Validation loss = 0.31926482915878296
Validation loss = 0.3234966993331909
Validation loss = 0.320478618144989
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.311208575963974
Validation loss = 0.31617480516433716
Validation loss = 0.3152250051498413
Validation loss = 0.31027984619140625
Validation loss = 0.31332510709762573
Validation loss = 0.3167741000652313
Validation loss = 0.3138331472873688
Validation loss = 0.3131616413593292
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31412383913993835
Validation loss = 0.3149152100086212
Validation loss = 0.31741803884506226
Validation loss = 0.31293585896492004
Validation loss = 0.31767570972442627
Validation loss = 0.3221721947193146
Validation loss = 0.317842572927475
Validation loss = 0.325580894947052
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -183     |
| Iteration     | 18       |
| MaximumReturn | -147     |
| MinimumReturn | -200     |
| TotalSamples  | 33320    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31852591037750244
Validation loss = 0.31023910641670227
Validation loss = 0.314558207988739
Validation loss = 0.31281334161758423
Validation loss = 0.3156592547893524
Validation loss = 0.31880101561546326
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31384432315826416
Validation loss = 0.30985015630722046
Validation loss = 0.31170040369033813
Validation loss = 0.3103622794151306
Validation loss = 0.3147050738334656
Validation loss = 0.3137124180793762
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31351086497306824
Validation loss = 0.31550201773643494
Validation loss = 0.3189905285835266
Validation loss = 0.3215339183807373
Validation loss = 0.31845152378082275
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3112863302230835
Validation loss = 0.3097235858440399
Validation loss = 0.31308111548423767
Validation loss = 0.31059783697128296
Validation loss = 0.316562294960022
Validation loss = 0.3152227997779846
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3162323832511902
Validation loss = 0.3136129379272461
Validation loss = 0.31401655077934265
Validation loss = 0.31784841418266296
Validation loss = 0.31564807891845703
Validation loss = 0.3204708695411682
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -179     |
| Iteration     | 19       |
| MaximumReturn | -132     |
| MinimumReturn | -202     |
| TotalSamples  | 34986    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31282028555870056
Validation loss = 0.310173362493515
Validation loss = 0.3101154565811157
Validation loss = 0.3128465712070465
Validation loss = 0.3107132017612457
Validation loss = 0.3110349774360657
Validation loss = 0.31364768743515015
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.310109943151474
Validation loss = 0.3067336678504944
Validation loss = 0.3118128776550293
Validation loss = 0.3065304458141327
Validation loss = 0.30728697776794434
Validation loss = 0.3095465302467346
Validation loss = 0.31463178992271423
Validation loss = 0.312054842710495
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3106173574924469
Validation loss = 0.31416964530944824
Validation loss = 0.3114086985588074
Validation loss = 0.3144596517086029
Validation loss = 0.3170543611049652
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3101511299610138
Validation loss = 0.3103047311306
Validation loss = 0.308016300201416
Validation loss = 0.31170082092285156
Validation loss = 0.31404879689216614
Validation loss = 0.3092970550060272
Validation loss = 0.3097669780254364
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31291887164115906
Validation loss = 0.3162534832954407
Validation loss = 0.31732067465782166
Validation loss = 0.3139946162700653
Validation loss = 0.3182390332221985
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -185     |
| Iteration     | 20       |
| MaximumReturn | -164     |
| MinimumReturn | -198     |
| TotalSamples  | 36652    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3051622807979584
Validation loss = 0.3053905963897705
Validation loss = 0.30623263120651245
Validation loss = 0.31476399302482605
Validation loss = 0.3046148121356964
Validation loss = 0.3091537654399872
Validation loss = 0.309256374835968
Validation loss = 0.3074948787689209
Validation loss = 0.3149474859237671
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3062211275100708
Validation loss = 0.3041936755180359
Validation loss = 0.30624571442604065
Validation loss = 0.3094312250614166
Validation loss = 0.3070855736732483
Validation loss = 0.31246301531791687
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31473636627197266
Validation loss = 0.31376928091049194
Validation loss = 0.3113313317298889
Validation loss = 0.31308314204216003
Validation loss = 0.3137555420398712
Validation loss = 0.31284528970718384
Validation loss = 0.31661689281463623
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3043583929538727
Validation loss = 0.30738839507102966
Validation loss = 0.30525511503219604
Validation loss = 0.30743807554244995
Validation loss = 0.3070053458213806
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3047164976596832
Validation loss = 0.3096301853656769
Validation loss = 0.30886316299438477
Validation loss = 0.30876943469047546
Validation loss = 0.31195998191833496
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -179     |
| Iteration     | 21       |
| MaximumReturn | -158     |
| MinimumReturn | -190     |
| TotalSamples  | 38318    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30428624153137207
Validation loss = 0.3030586242675781
Validation loss = 0.3040657341480255
Validation loss = 0.30639150738716125
Validation loss = 0.30923396348953247
Validation loss = 0.30451297760009766
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30135577917099
Validation loss = 0.299190878868103
Validation loss = 0.30217522382736206
Validation loss = 0.3039694130420685
Validation loss = 0.30570024251937866
Validation loss = 0.3046443462371826
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3081839680671692
Validation loss = 0.3079828917980194
Validation loss = 0.3087123930454254
Validation loss = 0.3067866265773773
Validation loss = 0.31018757820129395
Validation loss = 0.31280142068862915
Validation loss = 0.31184709072113037
Validation loss = 0.315631628036499
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30267369747161865
Validation loss = 0.30570438504219055
Validation loss = 0.3034060299396515
Validation loss = 0.3043700158596039
Validation loss = 0.30342522263526917
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3039452135562897
Validation loss = 0.3061821162700653
Validation loss = 0.3064138889312744
Validation loss = 0.3067183792591095
Validation loss = 0.3043353259563446
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -183     |
| Iteration     | 22       |
| MaximumReturn | -169     |
| MinimumReturn | -194     |
| TotalSamples  | 39984    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2989426255226135
Validation loss = 0.30178558826446533
Validation loss = 0.3012135624885559
Validation loss = 0.3058839440345764
Validation loss = 0.3020978569984436
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2996538281440735
Validation loss = 0.2992415428161621
Validation loss = 0.2994954586029053
Validation loss = 0.3044299781322479
Validation loss = 0.30511099100112915
Validation loss = 0.3032463490962982
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3048570454120636
Validation loss = 0.3084860146045685
Validation loss = 0.3106361925601959
Validation loss = 0.31115418672561646
Validation loss = 0.30904993414878845
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30027884244918823
Validation loss = 0.30074572563171387
Validation loss = 0.30257153511047363
Validation loss = 0.303640216588974
Validation loss = 0.3019406199455261
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30122822523117065
Validation loss = 0.30303946137428284
Validation loss = 0.30610641837120056
Validation loss = 0.3046336770057678
Validation loss = 0.30729663372039795
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -182     |
| Iteration     | 23       |
| MaximumReturn | -164     |
| MinimumReturn | -193     |
| TotalSamples  | 41650    |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.304823100566864
Validation loss = 0.3031086027622223
Validation loss = 0.3011612892150879
Validation loss = 0.3045639395713806
Validation loss = 0.30529579520225525
Validation loss = 0.309013307094574
Validation loss = 0.30683809518814087
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.30371376872062683
Validation loss = 0.30222612619400024
Validation loss = 0.30132898688316345
Validation loss = 0.30384010076522827
Validation loss = 0.3056386113166809
Validation loss = 0.3032419979572296
Validation loss = 0.30593496561050415
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.309562087059021
Validation loss = 0.30705830454826355
Validation loss = 0.3147006630897522
Validation loss = 0.31099286675453186
Validation loss = 0.3131803870201111
Validation loss = 0.30961185693740845
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2986430525779724
Validation loss = 0.3039185404777527
Validation loss = 0.3003470301628113
Validation loss = 0.3000861406326294
Validation loss = 0.3036438822746277
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30150777101516724
Validation loss = 0.30344343185424805
Validation loss = 0.3048226237297058
Validation loss = 0.3034850060939789
Validation loss = 0.31162622570991516
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -187     |
| Iteration     | 24       |
| MaximumReturn | -167     |
| MinimumReturn | -195     |
| TotalSamples  | 43316    |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.301643043756485
Validation loss = 0.30245915055274963
Validation loss = 0.3009346127510071
Validation loss = 0.2994251847267151
Validation loss = 0.30175817012786865
Validation loss = 0.30482354760169983
Validation loss = 0.3044760227203369
Validation loss = 0.3047717213630676
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29990968108177185
Validation loss = 0.3001854717731476
Validation loss = 0.3042775094509125
Validation loss = 0.3029533922672272
Validation loss = 0.301576167345047
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30623021721839905
Validation loss = 0.3034798800945282
Validation loss = 0.3067365288734436
Validation loss = 0.30776816606521606
Validation loss = 0.3117471933364868
Validation loss = 0.3088291883468628
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29497334361076355
Validation loss = 0.29808178544044495
Validation loss = 0.3006569445133209
Validation loss = 0.3014200031757355
Validation loss = 0.2998269498348236
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3010334074497223
Validation loss = 0.3022424578666687
Validation loss = 0.30290886759757996
Validation loss = 0.3038612902164459
Validation loss = 0.30632373690605164
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -183     |
| Iteration     | 25       |
| MaximumReturn | -170     |
| MinimumReturn | -191     |
| TotalSamples  | 44982    |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2980406880378723
Validation loss = 0.2989044487476349
Validation loss = 0.297490656375885
Validation loss = 0.3008441925048828
Validation loss = 0.30038371682167053
Validation loss = 0.3021649718284607
Validation loss = 0.3042387068271637
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29751265048980713
Validation loss = 0.2986377477645874
Validation loss = 0.30168190598487854
Validation loss = 0.2970038652420044
Validation loss = 0.2996034026145935
Validation loss = 0.30457770824432373
Validation loss = 0.3009844422340393
Validation loss = 0.30070599913597107
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30544954538345337
Validation loss = 0.30343782901763916
Validation loss = 0.30622953176498413
Validation loss = 0.3043194115161896
Validation loss = 0.30765601992607117
Validation loss = 0.3072696030139923
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2954306900501251
Validation loss = 0.2977270781993866
Validation loss = 0.29657191038131714
Validation loss = 0.2976997196674347
Validation loss = 0.30001065135002136
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2970513105392456
Validation loss = 0.297299861907959
Validation loss = 0.2968906760215759
Validation loss = 0.30007049441337585
Validation loss = 0.3016650676727295
Validation loss = 0.3013051748275757
Validation loss = 0.3033159673213959
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -175     |
| Iteration     | 26       |
| MaximumReturn | -139     |
| MinimumReturn | -184     |
| TotalSamples  | 46648    |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2966862916946411
Validation loss = 0.29894953966140747
Validation loss = 0.3010248839855194
Validation loss = 0.29939374327659607
Validation loss = 0.30074968934059143
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2976325750350952
Validation loss = 0.2984437942504883
Validation loss = 0.2973097860813141
Validation loss = 0.29936468601226807
Validation loss = 0.29814428091049194
Validation loss = 0.3018890917301178
Validation loss = 0.2994753122329712
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2997816801071167
Validation loss = 0.30336692929267883
Validation loss = 0.3073839843273163
Validation loss = 0.3036479651927948
Validation loss = 0.3063340187072754
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2942568361759186
Validation loss = 0.29685306549072266
Validation loss = 0.2942858040332794
Validation loss = 0.29421350359916687
Validation loss = 0.30041733384132385
Validation loss = 0.2954045236110687
Validation loss = 0.2996864318847656
Validation loss = 0.2967076599597931
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2947603464126587
Validation loss = 0.2984601855278015
Validation loss = 0.3010881841182709
Validation loss = 0.3005376160144806
Validation loss = 0.30113619565963745
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -180     |
| Iteration     | 27       |
| MaximumReturn | -157     |
| MinimumReturn | -190     |
| TotalSamples  | 48314    |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2959310710430145
Validation loss = 0.2970528304576874
Validation loss = 0.29843828082084656
Validation loss = 0.29476845264434814
Validation loss = 0.29773440957069397
Validation loss = 0.29994913935661316
Validation loss = 0.30072882771492004
Validation loss = 0.2971327602863312
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29740452766418457
Validation loss = 0.29813671112060547
Validation loss = 0.2970418930053711
Validation loss = 0.29783856868743896
Validation loss = 0.29868409037590027
Validation loss = 0.2998248338699341
Validation loss = 0.3003571927547455
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2962009012699127
Validation loss = 0.2988891303539276
Validation loss = 0.3042140007019043
Validation loss = 0.3019755184650421
Validation loss = 0.30540233850479126
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.294931024312973
Validation loss = 0.2917848825454712
Validation loss = 0.29278305172920227
Validation loss = 0.29423201084136963
Validation loss = 0.29931387305259705
Validation loss = 0.29550740122795105
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2953161299228668
Validation loss = 0.2987562119960785
Validation loss = 0.29705753922462463
Validation loss = 0.2981291115283966
Validation loss = 0.30097058415412903
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
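
The "Training policy using TRPO" phase runs 20 inner iterations, and the real-environment sample counter does not advance during it, so the samples "obtained" here are presumably generated from the learned dynamics models rather than the real environment. The "Re-initialize init_std." line indicates the policy's exploration log-std is reset before each phase. A sketch of the outer structure, with `rollout_in_model` and `trpo_step` as placeholder names:

    def train_policy(policy, dynamics_models, num_iterations=20):
        policy.reset_logstd()  # assumed method
        print("Re-initialize init_std.")
        for it in range(num_iterations):
            print("Obtaining samples for iteration %d..." % it)
            samples = rollout_in_model(policy, dynamics_models)  # placeholder
            trpo_step(policy, samples)                           # placeholder
        print("Done training policy.")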
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
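
The rollout phase collects 25 on-policy paths of 100 steps each from the real environment, printing the cumulative step count before each path (hence the increments of exactly 100: no path terminates early). A minimal gym-style sketch; the policy interface is an assumption:

    def sample_paths(env, policy, num_paths=25, horizon=100):
        paths, total = [], 0
        for p in range(num_paths):
            print("Path %d | total_timesteps %d." % (p, total))
            obs = env.reset()
            path = {"observations": [], "actions": [], "rewards": []}
            for _ in range(horizon):
                act = policy.act(obs)  # assumed method
                next_obs, rew, done, _ = env.step(act)
                path["observations"].append(obs)
                path["actions"].append(act)
                path["rewards"].append(rew)
                obs, total = next_obs, total + 1
                if done:
                    break
            paths.append(path)
        return paths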
Updating normalization.
Done updating normalization.
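
After each batch of real rollouts, the normalization statistics used by the dynamics models are recomputed over the aggregated dataset. Which quantities are normalized (observations, actions, state deltas) is standard model-based-RL practice rather than something the log specifies, so the sketch below is an assumption:

    import numpy as np

    def compute_normalization(data):
        # `data` is assumed to map names (e.g. 'observations', 'actions',
        # 'deltas') to arrays of shape [N, dim].
        stats = {}
        for key, arr in data.items():
            arr = np.asarray(arr, dtype=np.float64)
            stats[key] = (arr.mean(axis=0), arr.std(axis=0) + 1e-8)
        return stats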
----------------------------
| AverageReturn | -174     |
| Iteration     | 28       |
| MaximumReturn | -154     |
| MinimumReturn | -181     |
| TotalSamples  | 49980    |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2957395613193512
Validation loss = 0.29819267988204956
Validation loss = 0.30500662326812744
Validation loss = 0.2967623770236969
Validation loss = 0.29770079255104065
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29520294070243835
Validation loss = 0.29661476612091064
Validation loss = 0.30158600211143494
Validation loss = 0.3007257580757141
Validation loss = 0.30265891551971436
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29831886291503906
Validation loss = 0.30136412382125854
Validation loss = 0.2999933063983917
Validation loss = 0.301371693611145
Validation loss = 0.3073805272579193
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29428374767303467
Validation loss = 0.2972927987575531
Validation loss = 0.29395851492881775
Validation loss = 0.29474547505378723
Validation loss = 0.29755425453186035
Validation loss = 0.2950378954410553
Validation loss = 0.298736572265625
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2953489124774933
Validation loss = 0.2964977025985718
Validation loss = 0.29505348205566406
Validation loss = 0.29781603813171387
Validation loss = 0.30022770166397095
Validation loss = 0.2985006868839264
Validation loss = 0.3003458082675934
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -167     |
| Iteration     | 29       |
| MaximumReturn | -150     |
| MinimumReturn | -180     |
| TotalSamples  | 51646    |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.29789912700653076
Validation loss = 0.2982744872570038
Validation loss = 0.30190274119377136
Validation loss = 0.29700717329978943
Validation loss = 0.2995273470878601
Validation loss = 0.3026113212108612
Validation loss = 0.3006834089756012
Validation loss = 0.2999594509601593
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2991887629032135
Validation loss = 0.29706379771232605
Validation loss = 0.2972501516342163
Validation loss = 0.30487170815467834
Validation loss = 0.2981455326080322
Validation loss = 0.29888060688972473
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2995140552520752
Validation loss = 0.3017149865627289
Validation loss = 0.3030734956264496
Validation loss = 0.30233004689216614
Validation loss = 0.3058048188686371
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2942073345184326
Validation loss = 0.2971123456954956
Validation loss = 0.2981666624546051
Validation loss = 0.29776322841644287
Validation loss = 0.2971126139163971
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2994799315929413
Validation loss = 0.29866331815719604
Validation loss = 0.2989073097705841
Validation loss = 0.29864564538002014
Validation loss = 0.3066830337047577
Validation loss = 0.30201035737991333
Validation loss = 0.3017837703227997
Validation loss = 0.3031916916370392
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -150     |
| Iteration     | 30       |
| MaximumReturn | -137     |
| MinimumReturn | -159     |
| TotalSamples  | 53312    |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30055564641952515
Validation loss = 0.29907068610191345
Validation loss = 0.3018347918987274
Validation loss = 0.3030119836330414
Validation loss = 0.2993676960468292
Validation loss = 0.30165717005729675
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3004131019115448
Validation loss = 0.29908570647239685
Validation loss = 0.2988927960395813
Validation loss = 0.3010491728782654
Validation loss = 0.30225870013237
Validation loss = 0.30165237188339233
Validation loss = 0.30357497930526733
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3003123104572296
Validation loss = 0.30166250467300415
Validation loss = 0.30363646149635315
Validation loss = 0.30209770798683167
Validation loss = 0.3024809658527374
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2966352701187134
Validation loss = 0.29782283306121826
Validation loss = 0.3000790476799011
Validation loss = 0.3008546531200409
Validation loss = 0.30013740062713623
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2996228039264679
Validation loss = 0.29913991689682007
Validation loss = 0.30236709117889404
Validation loss = 0.30301597714424133
Validation loss = 0.303347110748291
Validation loss = 0.3004312217235565
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -51.6    |
| Iteration     | 31       |
| MaximumReturn | -2.85    |
| MinimumReturn | -86.7    |
| TotalSamples  | 54978    |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.297927588224411
Validation loss = 0.3010692894458771
Validation loss = 0.29900673031806946
Validation loss = 0.3011045753955841
Validation loss = 0.29980769753456116
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29819658398628235
Validation loss = 0.29797419905662537
Validation loss = 0.2997529208660126
Validation loss = 0.3035217523574829
Validation loss = 0.30231738090515137
Validation loss = 0.3021637201309204
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2984992563724518
Validation loss = 0.30330389738082886
Validation loss = 0.30512893199920654
Validation loss = 0.30250951647758484
Validation loss = 0.3040269613265991
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.29481977224349976
Validation loss = 0.2935746908187866
Validation loss = 0.2995055019855499
Validation loss = 0.29715946316719055
Validation loss = 0.29669034481048584
Validation loss = 0.30056628584861755
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29996544122695923
Validation loss = 0.301445335149765
Validation loss = 0.29931017756462097
Validation loss = 0.3011137545108795
Validation loss = 0.3032187223434448
Validation loss = 0.30618077516555786
Validation loss = 0.3051414489746094
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -43.3    |
| Iteration     | 32       |
| MaximumReturn | -1.14    |
| MinimumReturn | -98.6    |
| TotalSamples  | 56644    |
----------------------------
itr #33 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.30283689498901367
Validation loss = 0.300493061542511
Validation loss = 0.303971529006958
Validation loss = 0.3049306273460388
Validation loss = 0.30685561895370483
Validation loss = 0.30557742714881897
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3076881766319275
Validation loss = 0.30264678597450256
Validation loss = 0.30481845140457153
Validation loss = 0.3039267361164093
Validation loss = 0.3059515953063965
Validation loss = 0.3114197552204132
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30481791496276855
Validation loss = 0.30841371417045593
Validation loss = 0.30499038100242615
Validation loss = 0.30962663888931274
Validation loss = 0.3089396357536316
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.30348995327949524
Validation loss = 0.3014196455478668
Validation loss = 0.3026639521121979
Validation loss = 0.29962673783302307
Validation loss = 0.3037674129009247
Validation loss = 0.30394870042800903
Validation loss = 0.30682650208473206
Validation loss = 0.3080212473869324
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30992716550827026
Validation loss = 0.3065246045589447
Validation loss = 0.30673494935035706
Validation loss = 0.3088096082210541
Validation loss = 0.30833539366722107
Validation loss = 0.30875203013420105
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -79.3    |
| Iteration     | 33       |
| MaximumReturn | -2.12    |
| MinimumReturn | -144     |
| TotalSamples  | 58310    |
----------------------------
itr #34 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3077884912490845
Validation loss = 0.3056418299674988
Validation loss = 0.30842146277427673
Validation loss = 0.3089097738265991
Validation loss = 0.3104167878627777
Validation loss = 0.311896413564682
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3112315535545349
Validation loss = 0.312349408864975
Validation loss = 0.3138681948184967
Validation loss = 0.31331774592399597
Validation loss = 0.31475794315338135
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30876436829566956
Validation loss = 0.31194618344306946
Validation loss = 0.3145841360092163
Validation loss = 0.313289612531662
Validation loss = 0.3150343596935272
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31042712926864624
Validation loss = 0.3066737949848175
Validation loss = 0.31096789240837097
Validation loss = 0.3058071434497833
Validation loss = 0.31314921379089355
Validation loss = 0.31198975443840027
Validation loss = 0.31446367502212524
Validation loss = 0.3138675093650818
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31203019618988037
Validation loss = 0.3161921203136444
Validation loss = 0.3131469786167145
Validation loss = 0.31492844223976135
Validation loss = 0.31882229447364807
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -64.1    |
| Iteration     | 34       |
| MaximumReturn | -1.05    |
| MinimumReturn | -159     |
| TotalSamples  | 59976    |
----------------------------
itr #35 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31019434332847595
Validation loss = 0.310029536485672
Validation loss = 0.31575194001197815
Validation loss = 0.314714252948761
Validation loss = 0.3143693506717682
Validation loss = 0.3152042627334595
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3135717511177063
Validation loss = 0.3122028112411499
Validation loss = 0.31395086646080017
Validation loss = 0.3158160448074341
Validation loss = 0.31614139676094055
Validation loss = 0.31878888607025146
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.31421715021133423
Validation loss = 0.3126654326915741
Validation loss = 0.3161589801311493
Validation loss = 0.31892842054367065
Validation loss = 0.32171547412872314
Validation loss = 0.32491549849510193
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31933650374412537
Validation loss = 0.3179813027381897
Validation loss = 0.3164571523666382
Validation loss = 0.3187442719936371
Validation loss = 0.31858909130096436
Validation loss = 0.31651729345321655
Validation loss = 0.3225841224193573
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.31358879804611206
Validation loss = 0.3135548233985901
Validation loss = 0.31663909554481506
Validation loss = 0.31863901019096375
Validation loss = 0.3176012933254242
Validation loss = 0.31959396600723267
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -1.73    |
| Iteration     | 35       |
| MaximumReturn | -0.627   |
| MinimumReturn | -3.7     |
| TotalSamples  | 61642    |
----------------------------
itr #36 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31669119000434875
Validation loss = 0.3167109191417694
Validation loss = 0.3151615858078003
Validation loss = 0.31836846470832825
Validation loss = 0.3186357617378235
Validation loss = 0.31945833563804626
Validation loss = 0.32113274931907654
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3193007707595825
Validation loss = 0.32052740454673767
Validation loss = 0.31805142760276794
Validation loss = 0.31887441873550415
Validation loss = 0.3211403489112854
Validation loss = 0.3226414918899536
Validation loss = 0.3223394751548767
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32618725299835205
Validation loss = 0.3213929235935211
Validation loss = 0.32130515575408936
Validation loss = 0.32190191745758057
Validation loss = 0.3238511383533478
Validation loss = 0.32954317331314087
Validation loss = 0.32767465710639954
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3213648498058319
Validation loss = 0.3221249282360077
Validation loss = 0.3223170340061188
Validation loss = 0.32327526807785034
Validation loss = 0.327799916267395
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3204957842826843
Validation loss = 0.32138773798942566
Validation loss = 0.32298389077186584
Validation loss = 0.3237553536891937
Validation loss = 0.3273594081401825
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -10.2    |
| Iteration     | 36       |
| MaximumReturn | -2.11    |
| MinimumReturn | -72.5    |
| TotalSamples  | 63308    |
----------------------------
itr #37 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3278999626636505
Validation loss = 0.32474881410598755
Validation loss = 0.3256694972515106
Validation loss = 0.3234233558177948
Validation loss = 0.32261455059051514
Validation loss = 0.33086079359054565
Validation loss = 0.32534560561180115
Validation loss = 0.3272492587566376
Validation loss = 0.3289259076118469
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3284681737422943
Validation loss = 0.3245313763618469
Validation loss = 0.32573840022087097
Validation loss = 0.3268983066082001
Validation loss = 0.3259754478931427
Validation loss = 0.3300333023071289
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3295069634914398
Validation loss = 0.33038732409477234
Validation loss = 0.3280175030231476
Validation loss = 0.3308458924293518
Validation loss = 0.33713141083717346
Validation loss = 0.3326988220214844
Validation loss = 0.331890344619751
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3265947997570038
Validation loss = 0.326381653547287
Validation loss = 0.3273802399635315
Validation loss = 0.32521072030067444
Validation loss = 0.3279714584350586
Validation loss = 0.3250989317893982
Validation loss = 0.3311883807182312
Validation loss = 0.33188554644584656
Validation loss = 0.332783579826355
Validation loss = 0.3342326283454895
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32373273372650146
Validation loss = 0.3245556056499481
Validation loss = 0.32630476355552673
Validation loss = 0.3327188193798065
Validation loss = 0.3277672231197357
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.72    |
| Iteration     | 37       |
| MaximumReturn | -2.35    |
| MinimumReturn | -10.8    |
| TotalSamples  | 64974    |
----------------------------
itr #38 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3318060636520386
Validation loss = 0.33309459686279297
Validation loss = 0.3371521830558777
Validation loss = 0.3345351219177246
Validation loss = 0.33695703744888306
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3341415524482727
Validation loss = 0.32920730113983154
Validation loss = 0.3338918685913086
Validation loss = 0.33674436807632446
Validation loss = 0.33783161640167236
Validation loss = 0.3361698389053345
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33768412470817566
Validation loss = 0.33731335401535034
Validation loss = 0.34078341722488403
Validation loss = 0.340505450963974
Validation loss = 0.34130194783210754
Validation loss = 0.3432368338108063
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34042036533355713
Validation loss = 0.3342621326446533
Validation loss = 0.3355349898338318
Validation loss = 0.33658552169799805
Validation loss = 0.3394826054573059
Validation loss = 0.34293854236602783
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3308541178703308
Validation loss = 0.33225229382514954
Validation loss = 0.33214521408081055
Validation loss = 0.3397960364818573
Validation loss = 0.33326566219329834
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -23.7    |
| Iteration     | 38       |
| MaximumReturn | -3.33    |
| MinimumReturn | -85.8    |
| TotalSamples  | 66640    |
----------------------------
itr #39 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33201926946640015
Validation loss = 0.328921377658844
Validation loss = 0.332671195268631
Validation loss = 0.3347914516925812
Validation loss = 0.33127978444099426
Validation loss = 0.3350532054901123
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3333873152732849
Validation loss = 0.32739022374153137
Validation loss = 0.3324185013771057
Validation loss = 0.3291674852371216
Validation loss = 0.3328068256378174
Validation loss = 0.335556298494339
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3394162356853485
Validation loss = 0.33504149317741394
Validation loss = 0.3364339768886566
Validation loss = 0.3389195501804352
Validation loss = 0.3428093492984772
Validation loss = 0.3413422405719757
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3331904411315918
Validation loss = 0.33056822419166565
Validation loss = 0.3337557017803192
Validation loss = 0.3376450836658478
Validation loss = 0.3374701738357544
Validation loss = 0.3353991210460663
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3332078754901886
Validation loss = 0.32981517910957336
Validation loss = 0.33027976751327515
Validation loss = 0.332820326089859
Validation loss = 0.33488133549690247
Validation loss = 0.33478373289108276
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -18.5    |
| Iteration     | 39       |
| MaximumReturn | -2.91    |
| MinimumReturn | -66      |
| TotalSamples  | 68306    |
----------------------------
itr #40 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3302103877067566
Validation loss = 0.32638946175575256
Validation loss = 0.3310328423976898
Validation loss = 0.33127570152282715
Validation loss = 0.33092013001441956
Validation loss = 0.3342846930027008
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3254128098487854
Validation loss = 0.32864752411842346
Validation loss = 0.32759326696395874
Validation loss = 0.33081433176994324
Validation loss = 0.3337579667568207
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34127914905548096
Validation loss = 0.33431506156921387
Validation loss = 0.339022159576416
Validation loss = 0.33992695808410645
Validation loss = 0.33889585733413696
Validation loss = 0.3398762345314026
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3358987867832184
Validation loss = 0.33193010091781616
Validation loss = 0.33779236674308777
Validation loss = 0.3344627022743225
Validation loss = 0.335448294878006
Validation loss = 0.34042030572891235
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3429596424102783
Validation loss = 0.32699888944625854
Validation loss = 0.32894372940063477
Validation loss = 0.3340691328048706
Validation loss = 0.34072330594062805
Validation loss = 0.3354032337665558
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -39.6    |
| Iteration     | 40       |
| MaximumReturn | -9.53    |
| MinimumReturn | -86.6    |
| TotalSamples  | 69972    |
----------------------------
itr #41 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3295791745185852
Validation loss = 0.32968389987945557
Validation loss = 0.3299146890640259
Validation loss = 0.33124226331710815
Validation loss = 0.3334793150424957
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32938897609710693
Validation loss = 0.32801687717437744
Validation loss = 0.32891327142715454
Validation loss = 0.3302038311958313
Validation loss = 0.3301367461681366
Validation loss = 0.3313383162021637
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33714038133621216
Validation loss = 0.3342652916908264
Validation loss = 0.3382362723350525
Validation loss = 0.34040066599845886
Validation loss = 0.34241148829460144
Validation loss = 0.3414419889450073
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3384845554828644
Validation loss = 0.33522188663482666
Validation loss = 0.33571067452430725
Validation loss = 0.3355783224105835
Validation loss = 0.3369022309780121
Validation loss = 0.3359927833080292
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3336351215839386
Validation loss = 0.33081984519958496
Validation loss = 0.33400964736938477
Validation loss = 0.33625349402427673
Validation loss = 0.3354266583919525
Validation loss = 0.334257572889328
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -117     |
| Iteration     | 41       |
| MaximumReturn | -2.64    |
| MinimumReturn | -151     |
| TotalSamples  | 71638    |
----------------------------
itr #42 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3327137231826782
Validation loss = 0.33123722672462463
Validation loss = 0.33239543437957764
Validation loss = 0.33429181575775146
Validation loss = 0.33509111404418945
Validation loss = 0.3360576927661896
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3303743302822113
Validation loss = 0.3365575075149536
Validation loss = 0.330483078956604
Validation loss = 0.33603307604789734
Validation loss = 0.33401617407798767
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.341383695602417
Validation loss = 0.3398406505584717
Validation loss = 0.3415204584598541
Validation loss = 0.3447766602039337
Validation loss = 0.34633591771125793
Validation loss = 0.3442028760910034
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33715009689331055
Validation loss = 0.33656492829322815
Validation loss = 0.3385758101940155
Validation loss = 0.34142351150512695
Validation loss = 0.3387391269207001
Validation loss = 0.34076690673828125
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33707523345947266
Validation loss = 0.3357871174812317
Validation loss = 0.340195894241333
Validation loss = 0.3369206190109253
Validation loss = 0.3363092839717865
Validation loss = 0.3373521864414215
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -127     |
| Iteration     | 42       |
| MaximumReturn | -1.97    |
| MinimumReturn | -160     |
| TotalSamples  | 73304    |
----------------------------
itr #43 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33555901050567627
Validation loss = 0.3351631760597229
Validation loss = 0.33819517493247986
Validation loss = 0.33582618832588196
Validation loss = 0.33660194277763367
Validation loss = 0.3401744067668915
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3331736624240875
Validation loss = 0.3330029249191284
Validation loss = 0.3355323076248169
Validation loss = 0.33196744322776794
Validation loss = 0.33765923976898193
Validation loss = 0.33861786127090454
Validation loss = 0.3381555676460266
Validation loss = 0.33581459522247314
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3400636911392212
Validation loss = 0.34483376145362854
Validation loss = 0.3461493253707886
Validation loss = 0.34539318084716797
Validation loss = 0.3477752208709717
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34088796377182007
Validation loss = 0.33936071395874023
Validation loss = 0.3410201966762543
Validation loss = 0.3395521938800812
Validation loss = 0.3420202136039734
Validation loss = 0.34266945719718933
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3383876085281372
Validation loss = 0.33786383271217346
Validation loss = 0.3385026454925537
Validation loss = 0.34106388688087463
Validation loss = 0.3397950530052185
Validation loss = 0.34482818841934204
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -26.2    |
| Iteration     | 43       |
| MaximumReturn | -1.69    |
| MinimumReturn | -87.1    |
| TotalSamples  | 74970    |
----------------------------
itr #44 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33547651767730713
Validation loss = 0.3331657648086548
Validation loss = 0.33795666694641113
Validation loss = 0.3338410258293152
Validation loss = 0.33420366048812866
Validation loss = 0.3444872796535492
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3393266797065735
Validation loss = 0.33804386854171753
Validation loss = 0.3346596658229828
Validation loss = 0.3354829251766205
Validation loss = 0.33757588267326355
Validation loss = 0.3394995927810669
Validation loss = 0.34396982192993164
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34712284803390503
Validation loss = 0.34485098719596863
Validation loss = 0.34272661805152893
Validation loss = 0.3471677303314209
Validation loss = 0.35405388474464417
Validation loss = 0.3469655513763428
Validation loss = 0.3482047915458679
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.33794793486595154
Validation loss = 0.33691316843032837
Validation loss = 0.339987576007843
Validation loss = 0.33940067887306213
Validation loss = 0.3412143290042877
Validation loss = 0.34491413831710815
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3403133749961853
Validation loss = 0.34017622470855713
Validation loss = 0.3429325520992279
Validation loss = 0.34182503819465637
Validation loss = 0.34397605061531067
Validation loss = 0.34458646178245544
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19 (20 inner iterations)...
Done training policy.
Generating on-policy rollouts.
Paths 0-24 | total_timesteps 0-2400 (25 paths, 100 steps each).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.922   |
| Iteration     | 44       |
| MaximumReturn | -0.307   |
| MinimumReturn | -2.57    |
| TotalSamples  | 76636    |
----------------------------
itr #45 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.33985403180122375
Validation loss = 0.3376493752002716
Validation loss = 0.34092816710472107
Validation loss = 0.33944547176361084
Validation loss = 0.3397708833217621
Validation loss = 0.3442588150501251
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3453141152858734
Validation loss = 0.3417792320251465
Validation loss = 0.34243398904800415
Validation loss = 0.3437396287918091
Validation loss = 0.3442535996437073
Validation loss = 0.3438985347747803
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.354119211435318
Validation loss = 0.3508003056049347
Validation loss = 0.35005277395248413
Validation loss = 0.35130950808525085
Validation loss = 0.35208019614219666
Validation loss = 0.35664036870002747
Validation loss = 0.3517356216907501
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34335222840309143
Validation loss = 0.3423156142234802
Validation loss = 0.34509727358818054
Validation loss = 0.3500864505767822
Validation loss = 0.3478602170944214
Validation loss = 0.3482539653778076
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3443789780139923
Validation loss = 0.3462049961090088
Validation loss = 0.34985169768333435
Validation loss = 0.3488958775997162
Validation loss = 0.3495190143585205
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -78.5    |
| Iteration     | 45       |
| MaximumReturn | -57.7    |
| MinimumReturn | -106     |
| TotalSamples  | 78302    |
----------------------------
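
Editor's note, a small consistency check on the counters: each rollout phase collects 25 paths of 100 steps each, i.e. 2500 transitions, yet TotalSamples advances by only 78302 - 76636 = 1666 between iterations 44 and 45 (and by the same 1666 in the iterations that follow). Since 1666 is about 2/3 of 2500, the counter appears to track just the transitions assigned to the dynamics model's training split, with the remaining third held out for the validation losses printed above; the log never states this, so read it as an inference from the numbers, not a documented fact.
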
itr #46 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.34222131967544556
Validation loss = 0.34160059690475464
Validation loss = 0.3453713059425354
Validation loss = 0.3453892171382904
Validation loss = 0.34731197357177734
Validation loss = 0.34348005056381226
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.34869977831840515
Validation loss = 0.3474978506565094
Validation loss = 0.34556490182876587
Validation loss = 0.3465762138366699
Validation loss = 0.3476533591747284
Validation loss = 0.3502890169620514
Validation loss = 0.34788569808006287
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.35547181963920593
Validation loss = 0.35673871636390686
Validation loss = 0.35462507605552673
Validation loss = 0.3540953993797302
Validation loss = 0.3591623902320862
Validation loss = 0.3625762164592743
Validation loss = 0.35963043570518494
Validation loss = 0.3633202910423279
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3503861129283905
Validation loss = 0.34772366285324097
Validation loss = 0.3484441637992859
Validation loss = 0.3507360517978668
Validation loss = 0.3517240285873413
Validation loss = 0.3518677055835724
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3490992486476898
Validation loss = 0.3472591042518616
Validation loss = 0.3510289490222931
Validation loss = 0.34967583417892456
Validation loss = 0.3548632264137268
Validation loss = 0.35080376267433167
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -6.48    |
| Iteration     | 46       |
| MaximumReturn | -0.273   |
| MinimumReturn | -49.9    |
| TotalSamples  | 79968    |
----------------------------
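
Editor's note: from here onward every iteration repeats the same phase sequence verbatim: fit the dynamics ensemble, update randomness, train the policy for 20 TRPO iterations, collect 25 on-policy paths, update normalization, and print the summary table. For orientation, the outer loop producing this log plausibly has the following shape (the phase callables stand for the hypothetical sketches given earlier):

    def outer_loop(phases, start_itr, n_itrs):
        # phases: dict of callables keyed by phase name (assumed layout).
        for itr in range(start_itr, n_itrs):
            print(f"itr #{itr} | ")
            phases["fit_dynamics"]()
            phases["update_randomness"]()
            phases["train_policy"]()
            paths = phases["generate_rollouts"]()
            phases["update_normalization"](paths)
            phases["log_stats"](itr, paths)

    noop = lambda *args: None
    outer_loop({name: noop for name in
                ["fit_dynamics", "update_randomness", "train_policy",
                 "generate_rollouts", "update_normalization", "log_stats"]},
               start_itr=46, n_itrs=48)
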
itr #47 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.34632760286331177
Validation loss = 0.3475497364997864
Validation loss = 0.3471967279911041
Validation loss = 0.3468811511993408
Validation loss = 0.34982770681381226
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.35116058588027954
Validation loss = 0.3530885577201843
Validation loss = 0.3488787114620209
Validation loss = 0.3526471257209778
Validation loss = 0.35190054774284363
Validation loss = 0.35104435682296753
Validation loss = 0.35467490553855896
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.359806627035141
Validation loss = 0.36045390367507935
Validation loss = 0.36029380559921265
Validation loss = 0.3667246401309967
Validation loss = 0.36423927545547485
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35368794202804565
Validation loss = 0.3538277745246887
Validation loss = 0.3541315197944641
Validation loss = 0.3538772463798523
Validation loss = 0.35816994309425354
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.35763803124427795
Validation loss = 0.3555946350097656
Validation loss = 0.3546259105205536
Validation loss = 0.36103978753089905
Validation loss = 0.35643455386161804
Validation loss = 0.3580149710178375
Validation loss = 0.3580678105354309
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.571   |
| Iteration     | 47       |
| MaximumReturn | -0.133   |
| MinimumReturn | -1.1     |
| TotalSamples  | 81634    |
----------------------------
itr #48 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3517930209636688
Validation loss = 0.34797024726867676
Validation loss = 0.3500494658946991
Validation loss = 0.3479699492454529
Validation loss = 0.3493899405002594
Validation loss = 0.35136106610298157
Validation loss = 0.35333871841430664
Validation loss = 0.34973838925361633
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.35467877984046936
Validation loss = 0.3531072437763214
Validation loss = 0.3548434376716614
Validation loss = 0.356604665517807
Validation loss = 0.3558562397956848
Validation loss = 0.35409075021743774
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3647422194480896
Validation loss = 0.3636283278465271
Validation loss = 0.3651234805583954
Validation loss = 0.36433833837509155
Validation loss = 0.3641415536403656
Validation loss = 0.3693349361419678
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3542192578315735
Validation loss = 0.3553643822669983
Validation loss = 0.3572438061237335
Validation loss = 0.3571321964263916
Validation loss = 0.3575224280357361
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.35758599638938904
Validation loss = 0.35744839906692505
Validation loss = 0.35738715529441833
Validation loss = 0.35883083939552307
Validation loss = 0.3633485436439514
Validation loss = 0.3613974452018738
Validation loss = 0.362414687871933
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.596   |
| Iteration     | 48       |
| MaximumReturn | -0.308   |
| MinimumReturn | -0.932   |
| TotalSamples  | 83300    |
----------------------------
itr #49 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35667258501052856
Validation loss = 0.353433221578598
Validation loss = 0.35281574726104736
Validation loss = 0.35144153237342834
Validation loss = 0.3540472984313965
Validation loss = 0.354300320148468
Validation loss = 0.35830968618392944
Validation loss = 0.35854238271713257
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3570765256881714
Validation loss = 0.3569667637348175
Validation loss = 0.3548130989074707
Validation loss = 0.35967230796813965
Validation loss = 0.35794517397880554
Validation loss = 0.3617204427719116
Validation loss = 0.3610957860946655
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3683012127876282
Validation loss = 0.36701253056526184
Validation loss = 0.37226003408432007
Validation loss = 0.3665660321712494
Validation loss = 0.3690488040447235
Validation loss = 0.36947324872016907
Validation loss = 0.36727476119995117
Validation loss = 0.3713335692882538
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3586135506629944
Validation loss = 0.3584553003311157
Validation loss = 0.35772979259490967
Validation loss = 0.3608989119529724
Validation loss = 0.35983943939208984
Validation loss = 0.36525219678878784
Validation loss = 0.36034485697746277
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3649826943874359
Validation loss = 0.3632745146751404
Validation loss = 0.3625154495239258
Validation loss = 0.36175358295440674
Validation loss = 0.36235761642456055
Validation loss = 0.3668379783630371
Validation loss = 0.3652845323085785
Validation loss = 0.36582082509994507
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.586   |
| Iteration     | 49       |
| MaximumReturn | -0.269   |
| MinimumReturn | -1.54    |
| TotalSamples  | 84966    |
----------------------------
itr #50 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3586695194244385
Validation loss = 0.3580515384674072
Validation loss = 0.35987669229507446
Validation loss = 0.3659719228744507
Validation loss = 0.35968825221061707
Validation loss = 0.35884180665016174
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.36074376106262207
Validation loss = 0.35845255851745605
Validation loss = 0.3639542758464813
Validation loss = 0.36458244919776917
Validation loss = 0.3637692928314209
Validation loss = 0.36118316650390625
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.37332987785339355
Validation loss = 0.3725256323814392
Validation loss = 0.3736158013343811
Validation loss = 0.37222176790237427
Validation loss = 0.3710537850856781
Validation loss = 0.37404075264930725
Validation loss = 0.3751850426197052
Validation loss = 0.37296125292778015
Validation loss = 0.3756701946258545
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.36535412073135376
Validation loss = 0.3659358620643616
Validation loss = 0.36490827798843384
Validation loss = 0.365817129611969
Validation loss = 0.36572590470314026
Validation loss = 0.36684414744377136
Validation loss = 0.3703053593635559
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37103521823883057
Validation loss = 0.3693297505378723
Validation loss = 0.3677802085876465
Validation loss = 0.3691483736038208
Validation loss = 0.37051892280578613
Validation loss = 0.36937472224235535
Validation loss = 0.37294629216194153
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.765   |
| Iteration     | 50       |
| MaximumReturn | -0.326   |
| MinimumReturn | -1.63    |
| TotalSamples  | 86632    |
----------------------------
itr #51 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3604171872138977
Validation loss = 0.3628484308719635
Validation loss = 0.3629046082496643
Validation loss = 0.3636603355407715
Validation loss = 0.36702626943588257
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3646252453327179
Validation loss = 0.36454030871391296
Validation loss = 0.36616626381874084
Validation loss = 0.3670975863933563
Validation loss = 0.36986449360847473
Validation loss = 0.3710018992424011
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.37713801860809326
Validation loss = 0.3807711899280548
Validation loss = 0.37803158164024353
Validation loss = 0.3768519461154938
Validation loss = 0.3809419274330139
Validation loss = 0.3818347752094269
Validation loss = 0.38176074624061584
Validation loss = 0.3812190592288971
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3695744574069977
Validation loss = 0.36896762251853943
Validation loss = 0.36800283193588257
Validation loss = 0.3726859390735626
Validation loss = 0.3694409132003784
Validation loss = 0.3697194457054138
Validation loss = 0.3722347915172577
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37390393018722534
Validation loss = 0.3719583749771118
Validation loss = 0.3750535845756531
Validation loss = 0.37263259291648865
Validation loss = 0.374545693397522
Validation loss = 0.37532463669776917
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.68    |
| Iteration     | 51       |
| MaximumReturn | -0.387   |
| MinimumReturn | -1.39    |
| TotalSamples  | 88298    |
----------------------------
itr #52 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3663971722126007
Validation loss = 0.36492690443992615
Validation loss = 0.3654031455516815
Validation loss = 0.36477306485176086
Validation loss = 0.36445292830467224
Validation loss = 0.364524245262146
Validation loss = 0.3692435324192047
Validation loss = 0.36915120482444763
Validation loss = 0.37159690260887146
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3687509596347809
Validation loss = 0.36718979477882385
Validation loss = 0.3713570833206177
Validation loss = 0.36946117877960205
Validation loss = 0.3703674376010895
Validation loss = 0.37000200152397156
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.38473865389823914
Validation loss = 0.3822420835494995
Validation loss = 0.3812144100666046
Validation loss = 0.385006308555603
Validation loss = 0.3850124180316925
Validation loss = 0.384603887796402
Validation loss = 0.38549548387527466
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.37466517090797424
Validation loss = 0.3718750774860382
Validation loss = 0.37363293766975403
Validation loss = 0.3724379241466522
Validation loss = 0.37547576427459717
Validation loss = 0.37547826766967773
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37920090556144714
Validation loss = 0.37279748916625977
Validation loss = 0.374359130859375
Validation loss = 0.3761598467826843
Validation loss = 0.3779147267341614
Validation loss = 0.3803640604019165
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.793   |
| Iteration     | 52       |
| MaximumReturn | -0.156   |
| MinimumReturn | -1.66    |
| TotalSamples  | 89964    |
----------------------------
itr #53 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3737733066082001
Validation loss = 0.3723656237125397
Validation loss = 0.36983823776245117
Validation loss = 0.37386465072631836
Validation loss = 0.37360233068466187
Validation loss = 0.3724844753742218
Validation loss = 0.3706207573413849
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3718622922897339
Validation loss = 0.37070244550704956
Validation loss = 0.37479373812675476
Validation loss = 0.3750416338443756
Validation loss = 0.3746449053287506
Validation loss = 0.37358829379081726
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.38778597116470337
Validation loss = 0.3861748278141022
Validation loss = 0.3882208466529846
Validation loss = 0.3902643322944641
Validation loss = 0.3912723660469055
Validation loss = 0.38942477107048035
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3814622759819031
Validation loss = 0.37712445855140686
Validation loss = 0.377843976020813
Validation loss = 0.3834936022758484
Validation loss = 0.37580350041389465
Validation loss = 0.3765297830104828
Validation loss = 0.37982189655303955
Validation loss = 0.3802095949649811
Validation loss = 0.38019344210624695
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37916818261146545
Validation loss = 0.377808153629303
Validation loss = 0.37785258889198303
Validation loss = 0.38100865483283997
Validation loss = 0.3825344443321228
Validation loss = 0.38077542185783386
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -12.3    |
| Iteration     | 53       |
| MaximumReturn | -0.2     |
| MinimumReturn | -52.5    |
| TotalSamples  | 91630    |
----------------------------
itr #54 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3757379353046417
Validation loss = 0.37262627482414246
Validation loss = 0.37231042981147766
Validation loss = 0.3748313784599304
Validation loss = 0.376198947429657
Validation loss = 0.3774934411048889
Validation loss = 0.3748406767845154
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.37309643626213074
Validation loss = 0.3718952536582947
Validation loss = 0.3725346326828003
Validation loss = 0.37100574374198914
Validation loss = 0.37554800510406494
Validation loss = 0.3720601499080658
Validation loss = 0.3765318691730499
Validation loss = 0.37618744373321533
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3923211693763733
Validation loss = 0.39127272367477417
Validation loss = 0.3899988532066345
Validation loss = 0.3932158946990967
Validation loss = 0.3920426666736603
Validation loss = 0.3887037932872772
Validation loss = 0.3924171030521393
Validation loss = 0.39236101508140564
Validation loss = 0.39347073435783386
Validation loss = 0.395643413066864
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38167089223861694
Validation loss = 0.37691858410835266
Validation loss = 0.3795095980167389
Validation loss = 0.37863630056381226
Validation loss = 0.3793857991695404
Validation loss = 0.3788819909095764
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38033023476600647
Validation loss = 0.37963584065437317
Validation loss = 0.38034430146217346
Validation loss = 0.3791544735431671
Validation loss = 0.3849197328090668
Validation loss = 0.3833993375301361
Validation loss = 0.38463279604911804
Validation loss = 0.38595160841941833
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -18.4    |
| Iteration     | 54       |
| MaximumReturn | -0.365   |
| MinimumReturn | -51.8    |
| TotalSamples  | 93296    |
----------------------------
itr #55 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3747839033603668
Validation loss = 0.37434664368629456
Validation loss = 0.37466442584991455
Validation loss = 0.37879055738449097
Validation loss = 0.3790394365787506
Validation loss = 0.38034650683403015
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.37886568903923035
Validation loss = 0.37678349018096924
Validation loss = 0.37550055980682373
Validation loss = 0.3775978982448578
Validation loss = 0.3776267468929291
Validation loss = 0.3793738782405853
Validation loss = 0.37806838750839233
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3964044451713562
Validation loss = 0.3934314250946045
Validation loss = 0.39497247338294983
Validation loss = 0.39454618096351624
Validation loss = 0.39871513843536377
Validation loss = 0.3971972167491913
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3861769437789917
Validation loss = 0.38427460193634033
Validation loss = 0.38422733545303345
Validation loss = 0.385545939207077
Validation loss = 0.38331669569015503
Validation loss = 0.3873218894004822
Validation loss = 0.3880026340484619
Validation loss = 0.3887767493724823
Validation loss = 0.38652095198631287
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3848809003829956
Validation loss = 0.38792553544044495
Validation loss = 0.383350133895874
Validation loss = 0.38718655705451965
Validation loss = 0.3876447379589081
Validation loss = 0.38616275787353516
Validation loss = 0.38739192485809326
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -13.9    |
| Iteration     | 55       |
| MaximumReturn | -0.406   |
| MinimumReturn | -57.8    |
| TotalSamples  | 94962    |
----------------------------
itr #56 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.378170907497406
Validation loss = 0.3783305585384369
Validation loss = 0.3798254430294037
Validation loss = 0.378348171710968
Validation loss = 0.37668702006340027
Validation loss = 0.3818660080432892
Validation loss = 0.3817303478717804
Validation loss = 0.3844356834888458
Validation loss = 0.3842894434928894
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.380881130695343
Validation loss = 0.37910911440849304
Validation loss = 0.3815441131591797
Validation loss = 0.3817239999771118
Validation loss = 0.3801654577255249
Validation loss = 0.38164418935775757
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3982193171977997
Validation loss = 0.39823490381240845
Validation loss = 0.39958077669143677
Validation loss = 0.39730095863342285
Validation loss = 0.3941328525543213
Validation loss = 0.40079620480537415
Validation loss = 0.3994600474834442
Validation loss = 0.4023993909358978
Validation loss = 0.40194690227508545
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38767722249031067
Validation loss = 0.382960706949234
Validation loss = 0.3836666941642761
Validation loss = 0.38883936405181885
Validation loss = 0.38844308257102966
Validation loss = 0.393430233001709
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3828747570514679
Validation loss = 0.38680580258369446
Validation loss = 0.3848261833190918
Validation loss = 0.39012232422828674
Validation loss = 0.38983234763145447
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -34.9    |
| Iteration     | 56       |
| MaximumReturn | -0.447   |
| MinimumReturn | -68.9    |
| TotalSamples  | 96628    |
----------------------------
itr #57 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38375547528266907
Validation loss = 0.3830193281173706
Validation loss = 0.3832291066646576
Validation loss = 0.38675394654273987
Validation loss = 0.38554635643959045
Validation loss = 0.38551679253578186
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38144543766975403
Validation loss = 0.3851292133331299
Validation loss = 0.3779943883419037
Validation loss = 0.381307989358902
Validation loss = 0.3850351870059967
Validation loss = 0.3839808404445648
Validation loss = 0.38613638281822205
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40259456634521484
Validation loss = 0.4000028669834137
Validation loss = 0.40257027745246887
Validation loss = 0.40177974104881287
Validation loss = 0.4029325544834137
Validation loss = 0.4037923514842987
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3904336392879486
Validation loss = 0.3875282108783722
Validation loss = 0.3887046277523041
Validation loss = 0.3890761137008667
Validation loss = 0.39076903462409973
Validation loss = 0.39311861991882324
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.388895183801651
Validation loss = 0.3892393410205841
Validation loss = 0.38905850052833557
Validation loss = 0.3893207013607025
Validation loss = 0.3891463279724121
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -103     |
| Iteration     | 57       |
| MaximumReturn | -54.3    |
| MinimumReturn | -123     |
| TotalSamples  | 98294    |
----------------------------
itr #58 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3821503818035126
Validation loss = 0.38297539949417114
Validation loss = 0.3846011161804199
Validation loss = 0.3867625892162323
Validation loss = 0.38504770398139954
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3830936551094055
Validation loss = 0.38176214694976807
Validation loss = 0.3815799653530121
Validation loss = 0.38301584124565125
Validation loss = 0.38454166054725647
Validation loss = 0.3863396644592285
Validation loss = 0.38633817434310913
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40219151973724365
Validation loss = 0.3999372720718384
Validation loss = 0.39927974343299866
Validation loss = 0.40394148230552673
Validation loss = 0.4046034812927246
Validation loss = 0.40353190898895264
Validation loss = 0.40600359439849854
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3859451413154602
Validation loss = 0.3895223140716553
Validation loss = 0.3901287317276001
Validation loss = 0.39125746488571167
Validation loss = 0.39144811034202576
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38568341732025146
Validation loss = 0.3889508843421936
Validation loss = 0.3917819857597351
Validation loss = 0.38682109117507935
Validation loss = 0.39185473322868347
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -156     |
| Iteration     | 58       |
| MaximumReturn | -115     |
| MinimumReturn | -171     |
| TotalSamples  | 99960    |
----------------------------
itr #59 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3804522454738617
Validation loss = 0.3826100826263428
Validation loss = 0.3824617266654968
Validation loss = 0.3817897439002991
Validation loss = 0.3852315843105316
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.37740376591682434
Validation loss = 0.38068991899490356
Validation loss = 0.3845544755458832
Validation loss = 0.3869294822216034
Validation loss = 0.3846001923084259
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3996289372444153
Validation loss = 0.39973342418670654
Validation loss = 0.40222564339637756
Validation loss = 0.40341809391975403
Validation loss = 0.4033569097518921
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.385538250207901
Validation loss = 0.3874731957912445
Validation loss = 0.38815972208976746
Validation loss = 0.38810375332832336
Validation loss = 0.3905889093875885
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3874366283416748
Validation loss = 0.3865257501602173
Validation loss = 0.39060965180397034
Validation loss = 0.39015161991119385
Validation loss = 0.3933067321777344
Validation loss = 0.39175137877464294
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -75.3    |
| Iteration     | 59       |
| MaximumReturn | -32.5    |
| MinimumReturn | -107     |
| TotalSamples  | 101626   |
----------------------------
itr #60 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3831270933151245
Validation loss = 0.3875548541545868
Validation loss = 0.38752347230911255
Validation loss = 0.3847649395465851
Validation loss = 0.39033180475234985
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3819974660873413
Validation loss = 0.3814574182033539
Validation loss = 0.38557589054107666
Validation loss = 0.38782718777656555
Validation loss = 0.3846224546432495
Validation loss = 0.38616040349006653
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4067096412181854
Validation loss = 0.4064786434173584
Validation loss = 0.4033844769001007
Validation loss = 0.40462031960487366
Validation loss = 0.40525925159454346
Validation loss = 0.40709948539733887
Validation loss = 0.4059726297855377
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38853636384010315
Validation loss = 0.3863999545574188
Validation loss = 0.3889724314212799
Validation loss = 0.39095252752304077
Validation loss = 0.3924001455307007
Validation loss = 0.3936445116996765
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3880455493927002
Validation loss = 0.3913590610027313
Validation loss = 0.39144352078437805
Validation loss = 0.39201751351356506
Validation loss = 0.3939588665962219
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -47.3    |
| Iteration     | 60       |
| MaximumReturn | -0.52    |
| MinimumReturn | -137     |
| TotalSamples  | 103292   |
----------------------------
itr #61 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3818713128566742
Validation loss = 0.38902488350868225
Validation loss = 0.3817358613014221
Validation loss = 0.38396409153938293
Validation loss = 0.3827454447746277
Validation loss = 0.3856847286224365
Validation loss = 0.3853195011615753
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38669753074645996
Validation loss = 0.38368499279022217
Validation loss = 0.3836759924888611
Validation loss = 0.3879040777683258
Validation loss = 0.38934212923049927
Validation loss = 0.38600778579711914
Validation loss = 0.3885497748851776
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40541353821754456
Validation loss = 0.4028843939304352
Validation loss = 0.4042435586452484
Validation loss = 0.40628960728645325
Validation loss = 0.40422195196151733
Validation loss = 0.408056378364563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3903730511665344
Validation loss = 0.3907906413078308
Validation loss = 0.39496874809265137
Validation loss = 0.39407676458358765
Validation loss = 0.39300537109375
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3917240500450134
Validation loss = 0.3878634572029114
Validation loss = 0.389472097158432
Validation loss = 0.39393511414527893
Validation loss = 0.39237236976623535
Validation loss = 0.3937574028968811
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initializing init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -21.6    |
| Iteration     | 61       |
| MaximumReturn | -0.918   |
| MinimumReturn | -81.9    |
| TotalSamples  | 104958   |
----------------------------
itr #62 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38500672578811646
Validation loss = 0.38586533069610596
Validation loss = 0.38583213090896606
Validation loss = 0.38504424691200256
Validation loss = 0.3903248608112335
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.387652188539505
Validation loss = 0.3857678771018982
Validation loss = 0.38846355676651
Validation loss = 0.38752683997154236
Validation loss = 0.3884138762950897
Validation loss = 0.38752979040145874
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4061470031738281
Validation loss = 0.40328410267829895
Validation loss = 0.4052741229534149
Validation loss = 0.40764206647872925
Validation loss = 0.40763935446739197
Validation loss = 0.41181331872940063
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3941059112548828
Validation loss = 0.39250072836875916
Validation loss = 0.3925020396709442
Validation loss = 0.39452841877937317
Validation loss = 0.3942152261734009
Validation loss = 0.39352652430534363
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39201024174690247
Validation loss = 0.3921831548213959
Validation loss = 0.3940029740333557
Validation loss = 0.3998255431652069
Validation loss = 0.39463937282562256
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -52.8    |
| Iteration     | 62       |
| MaximumReturn | -0.167   |
| MinimumReturn | -95.7    |
| TotalSamples  | 106624   |
----------------------------
itr #63 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3884700536727905
Validation loss = 0.3868344724178314
Validation loss = 0.3877353370189667
Validation loss = 0.3881266117095947
Validation loss = 0.3873325288295746
Validation loss = 0.38958603143692017
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3870837092399597
Validation loss = 0.3883872926235199
Validation loss = 0.38606971502304077
Validation loss = 0.39117851853370667
Validation loss = 0.3903266191482544
Validation loss = 0.3892790973186493
Validation loss = 0.39257165789604187
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4101030230522156
Validation loss = 0.4092963933944702
Validation loss = 0.4067804217338562
Validation loss = 0.41087329387664795
Validation loss = 0.4093600809574127
Validation loss = 0.41052713990211487
Validation loss = 0.4139305651187897
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39420971274375916
Validation loss = 0.3918408155441284
Validation loss = 0.3944929540157318
Validation loss = 0.39575129747390747
Validation loss = 0.3954695463180542
Validation loss = 0.39524608850479126
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.39138397574424744
Validation loss = 0.39370474219322205
Validation loss = 0.39089807868003845
Validation loss = 0.3935244083404541
Validation loss = 0.3948601186275482
Validation loss = 0.3958401679992676
Validation loss = 0.3997761607170105
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -11.9    |
| Iteration     | 63       |
| MaximumReturn | -0.448   |
| MinimumReturn | -66      |
| TotalSamples  | 108290   |
----------------------------
itr #64 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3892821669578552
Validation loss = 0.38930609822273254
Validation loss = 0.38728904724121094
Validation loss = 0.3895891308784485
Validation loss = 0.39312833547592163
Validation loss = 0.3910228908061981
Validation loss = 0.39421698451042175
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.390811562538147
Validation loss = 0.3889901340007782
Validation loss = 0.3938021957874298
Validation loss = 0.39308661222457886
Validation loss = 0.38892561197280884
Validation loss = 0.3948359191417694
Validation loss = 0.3980982005596161
Validation loss = 0.4016498029232025
Validation loss = 0.3937976658344269
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4077562093734741
Validation loss = 0.4095063805580139
Validation loss = 0.4112106263637543
Validation loss = 0.41220569610595703
Validation loss = 0.41099607944488525
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39484307169914246
Validation loss = 0.3939626216888428
Validation loss = 0.3985649049282074
Validation loss = 0.39500707387924194
Validation loss = 0.39467155933380127
Validation loss = 0.3968433141708374
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3986777663230896
Validation loss = 0.3997257947921753
Validation loss = 0.39778706431388855
Validation loss = 0.39559587836265564
Validation loss = 0.39862552285194397
Validation loss = 0.39764541387557983
Validation loss = 0.39947617053985596
Validation loss = 0.39882346987724304
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -14.4    |
| Iteration     | 64       |
| MaximumReturn | -0.151   |
| MinimumReturn | -82.1    |
| TotalSamples  | 109956   |
----------------------------
itr #65 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3948417007923126
Validation loss = 0.3946135640144348
Validation loss = 0.3922792971134186
Validation loss = 0.3923121988773346
Validation loss = 0.3945923447608948
Validation loss = 0.3933063745498657
Validation loss = 0.39783918857574463
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3935506343841553
Validation loss = 0.39573368430137634
Validation loss = 0.39249926805496216
Validation loss = 0.3929431140422821
Validation loss = 0.39534714818000793
Validation loss = 0.3959479331970215
Validation loss = 0.39732983708381653
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4117675721645355
Validation loss = 0.4129677414894104
Validation loss = 0.41188284754753113
Validation loss = 0.41299545764923096
Validation loss = 0.41476309299468994
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3982422947883606
Validation loss = 0.39886483550071716
Validation loss = 0.39796513319015503
Validation loss = 0.3996058702468872
Validation loss = 0.3978762924671173
Validation loss = 0.400629460811615
Validation loss = 0.3998987376689911
Validation loss = 0.4004587233066559
Validation loss = 0.4016687273979187
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4022381603717804
Validation loss = 0.4007459580898285
Validation loss = 0.3997543454170227
Validation loss = 0.3997354209423065
Validation loss = 0.4021525979042053
Validation loss = 0.40618178248405457
Validation loss = 0.40139856934547424
Validation loss = 0.4070485234260559
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.68    |
| Iteration     | 65       |
| MaximumReturn | -0.163   |
| MinimumReturn | -24.6    |
| TotalSamples  | 111622   |
----------------------------
itr #66 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3969900608062744
Validation loss = 0.3943869173526764
Validation loss = 0.39638862013816833
Validation loss = 0.3931121528148651
Validation loss = 0.3957803547382355
Validation loss = 0.3973754048347473
Validation loss = 0.3961350917816162
Validation loss = 0.3976893424987793
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.404361754655838
Validation loss = 0.39767587184906006
Validation loss = 0.3973318934440613
Validation loss = 0.39764440059661865
Validation loss = 0.39519500732421875
Validation loss = 0.399170845746994
Validation loss = 0.40050020813941956
Validation loss = 0.39797544479370117
Validation loss = 0.399281769990921
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.41520652174949646
Validation loss = 0.4146702289581299
Validation loss = 0.4153272807598114
Validation loss = 0.4109227657318115
Validation loss = 0.4116404354572296
Validation loss = 0.4125881493091583
Validation loss = 0.41521260142326355
Validation loss = 0.41761788725852966
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.400162011384964
Validation loss = 0.40082502365112305
Validation loss = 0.4034018814563751
Validation loss = 0.4026867747306824
Validation loss = 0.40460655093193054
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40653377771377563
Validation loss = 0.4044835567474365
Validation loss = 0.4014720320701599
Validation loss = 0.4042852520942688
Validation loss = 0.4101521968841553
Validation loss = 0.4056397080421448
Validation loss = 0.40343010425567627
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -25.4    |
| Iteration     | 66       |
| MaximumReturn | -0.264   |
| MinimumReturn | -101     |
| TotalSamples  | 113288   |
----------------------------
itr #67 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3956209123134613
Validation loss = 0.3962041437625885
Validation loss = 0.3994455635547638
Validation loss = 0.3990793526172638
Validation loss = 0.3980409801006317
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.40085190534591675
Validation loss = 0.3963477909564972
Validation loss = 0.3976894021034241
Validation loss = 0.3986578583717346
Validation loss = 0.4021385610103607
Validation loss = 0.4009343683719635
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4157600700855255
Validation loss = 0.41335374116897583
Validation loss = 0.4136740565299988
Validation loss = 0.4146561622619629
Validation loss = 0.4173359274864197
Validation loss = 0.41912078857421875
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4035322964191437
Validation loss = 0.4000529646873474
Validation loss = 0.4036804735660553
Validation loss = 0.40179678797721863
Validation loss = 0.4009663760662079
Validation loss = 0.4010169804096222
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4052742123603821
Validation loss = 0.40368303656578064
Validation loss = 0.4052777886390686
Validation loss = 0.40613850951194763
Validation loss = 0.4047287404537201
Validation loss = 0.40596696734428406
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -176     |
| Iteration     | 67       |
| MaximumReturn | -145     |
| MinimumReturn | -188     |
| TotalSamples  | 114954   |
----------------------------
itr #68 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3923579752445221
Validation loss = 0.39379966259002686
Validation loss = 0.3948606550693512
Validation loss = 0.3965928852558136
Validation loss = 0.3950258195400238
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3927791714668274
Validation loss = 0.39760923385620117
Validation loss = 0.3979535698890686
Validation loss = 0.4000842869281769
Validation loss = 0.4011822044849396
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40767231583595276
Validation loss = 0.41205668449401855
Validation loss = 0.41111046075820923
Validation loss = 0.4119223654270172
Validation loss = 0.41811615228652954
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.396034300327301
Validation loss = 0.40129804611206055
Validation loss = 0.4029216170310974
Validation loss = 0.4018811881542206
Validation loss = 0.4037061333656311
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3989918828010559
Validation loss = 0.40259650349617004
Validation loss = 0.40531548857688904
Validation loss = 0.40308916568756104
Validation loss = 0.40366408228874207
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -15      |
| Iteration     | 68       |
| MaximumReturn | -6.12    |
| MinimumReturn | -162     |
| TotalSamples  | 116620   |
----------------------------
itr #69 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38997882604599
Validation loss = 0.3899180293083191
Validation loss = 0.39091479778289795
Validation loss = 0.39348095655441284
Validation loss = 0.3909546136856079
Validation loss = 0.39361733198165894
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39204955101013184
Validation loss = 0.39023852348327637
Validation loss = 0.3963106870651245
Validation loss = 0.39571481943130493
Validation loss = 0.39380431175231934
Validation loss = 0.39668208360671997
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4104897677898407
Validation loss = 0.4122123718261719
Validation loss = 0.40948113799095154
Validation loss = 0.4097186326980591
Validation loss = 0.4112958610057831
Validation loss = 0.4121057987213135
Validation loss = 0.4141274690628052
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3973309397697449
Validation loss = 0.39476779103279114
Validation loss = 0.39914852380752563
Validation loss = 0.3988021910190582
Validation loss = 0.3961121141910553
Validation loss = 0.4029747545719147
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4013674259185791
Validation loss = 0.3993614912033081
Validation loss = 0.3991507589817047
Validation loss = 0.40322330594062805
Validation loss = 0.40289536118507385
Validation loss = 0.40160441398620605
Validation loss = 0.4047562777996063
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.4     |
| Iteration     | 69       |
| MaximumReturn | -4.84    |
| MinimumReturn | -13.5    |
| TotalSamples  | 118286   |
----------------------------
itr #70 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39117124676704407
Validation loss = 0.39137694239616394
Validation loss = 0.38940489292144775
Validation loss = 0.39406099915504456
Validation loss = 0.3908179700374603
Validation loss = 0.3926658034324646
Validation loss = 0.391504168510437
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39007797837257385
Validation loss = 0.39020153880119324
Validation loss = 0.3951878547668457
Validation loss = 0.392232209444046
Validation loss = 0.3928920030593872
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4085771143436432
Validation loss = 0.40939608216285706
Validation loss = 0.4090207517147064
Validation loss = 0.4106093645095825
Validation loss = 0.4130275845527649
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39644089341163635
Validation loss = 0.394766241312027
Validation loss = 0.39821723103523254
Validation loss = 0.39969152212142944
Validation loss = 0.3989654779434204
Validation loss = 0.40111345052719116
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4008725881576538
Validation loss = 0.4022136330604553
Validation loss = 0.39939844608306885
Validation loss = 0.4075574576854706
Validation loss = 0.40070757269859314
Validation loss = 0.40318727493286133
Validation loss = 0.40443870425224304
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -55      |
| Iteration     | 70       |
| MaximumReturn | -2.85    |
| MinimumReturn | -119     |
| TotalSamples  | 119952   |
----------------------------
itr #71 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39754608273506165
Validation loss = 0.392647385597229
Validation loss = 0.39295804500579834
Validation loss = 0.3948606550693512
Validation loss = 0.39376720786094666
Validation loss = 0.3930133879184723
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39348310232162476
Validation loss = 0.3920954465866089
Validation loss = 0.39435380697250366
Validation loss = 0.3924068510532379
Validation loss = 0.39476728439331055
Validation loss = 0.3945501744747162
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4117628037929535
Validation loss = 0.4127114415168762
Validation loss = 0.4108048677444458
Validation loss = 0.415871262550354
Validation loss = 0.41426071524620056
Validation loss = 0.413595587015152
Validation loss = 0.41432973742485046
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3962429463863373
Validation loss = 0.398012638092041
Validation loss = 0.39870598912239075
Validation loss = 0.3986433446407318
Validation loss = 0.3986547291278839
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40284252166748047
Validation loss = 0.4014683663845062
Validation loss = 0.40572744607925415
Validation loss = 0.40468069911003113
Validation loss = 0.4086049497127533
Validation loss = 0.406526118516922
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -122     |
| Iteration     | 71       |
| MaximumReturn | -55.8    |
| MinimumReturn | -178     |
| TotalSamples  | 121618   |
----------------------------
itr #72 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3916032910346985
Validation loss = 0.3896388113498688
Validation loss = 0.3912748396396637
Validation loss = 0.39210933446884155
Validation loss = 0.3937793970108032
Validation loss = 0.3930168151855469
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39740341901779175
Validation loss = 0.3910669684410095
Validation loss = 0.39311450719833374
Validation loss = 0.3954402208328247
Validation loss = 0.39455538988113403
Validation loss = 0.3939270079135895
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40710827708244324
Validation loss = 0.41129234433174133
Validation loss = 0.4087309241294861
Validation loss = 0.4096437096595764
Validation loss = 0.4131891429424286
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.393378347158432
Validation loss = 0.395114004611969
Validation loss = 0.401295006275177
Validation loss = 0.39812949299812317
Validation loss = 0.4006109833717346
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4003334939479828
Validation loss = 0.39973798394203186
Validation loss = 0.39996543526649475
Validation loss = 0.4025135636329651
Validation loss = 0.40407034754753113
Validation loss = 0.4055565595626831
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -170     |
| Iteration     | 72       |
| MaximumReturn | -95.3    |
| MinimumReturn | -199     |
| TotalSamples  | 123284   |
----------------------------
itr #73 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38984015583992004
Validation loss = 0.3957310616970062
Validation loss = 0.393927663564682
Validation loss = 0.39375004172325134
Validation loss = 0.39397165179252625
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3900301456451416
Validation loss = 0.39404046535491943
Validation loss = 0.39242175221443176
Validation loss = 0.392556369304657
Validation loss = 0.3918965458869934
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4104382395744324
Validation loss = 0.41024380922317505
Validation loss = 0.41038593649864197
Validation loss = 0.4116269052028656
Validation loss = 0.41223809123039246
Validation loss = 0.4112606644630432
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3938758373260498
Validation loss = 0.3977445662021637
Validation loss = 0.4015031158924103
Validation loss = 0.39963382482528687
Validation loss = 0.3989849090576172
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40030983090400696
Validation loss = 0.40231817960739136
Validation loss = 0.4033130705356598
Validation loss = 0.4031224846839905
Validation loss = 0.4064314365386963
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -171     |
| Iteration     | 73       |
| MaximumReturn | -125     |
| MinimumReturn | -198     |
| TotalSamples  | 124950   |
----------------------------
itr #74 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39149725437164307
Validation loss = 0.39282113313674927
Validation loss = 0.3896186947822571
Validation loss = 0.3932287096977234
Validation loss = 0.39382022619247437
Validation loss = 0.3976438045501709
Validation loss = 0.3948001563549042
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39007797837257385
Validation loss = 0.3890651762485504
Validation loss = 0.39051735401153564
Validation loss = 0.3902606964111328
Validation loss = 0.3944058120250702
Validation loss = 0.3906952738761902
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4077160954475403
Validation loss = 0.4085642695426941
Validation loss = 0.4109325408935547
Validation loss = 0.4104371964931488
Validation loss = 0.41054826974868774
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39470553398132324
Validation loss = 0.3971790075302124
Validation loss = 0.39467522501945496
Validation loss = 0.3977713882923126
Validation loss = 0.39700618386268616
Validation loss = 0.3977856934070587
Validation loss = 0.3986950218677521
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3977299630641937
Validation loss = 0.39943939447402954
Validation loss = 0.3970892131328583
Validation loss = 0.3994329869747162
Validation loss = 0.40435194969177246
Validation loss = 0.404767781496048
Validation loss = 0.4029445946216583
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -186     |
| Iteration     | 74       |
| MaximumReturn | -111     |
| MinimumReturn | -220     |
| TotalSamples  | 126616   |
----------------------------
itr #75 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40211373567581177
Validation loss = 0.39269325137138367
Validation loss = 0.3938494622707367
Validation loss = 0.39192482829093933
Validation loss = 0.39642003178596497
Validation loss = 0.39926546812057495
Validation loss = 0.398415207862854
Validation loss = 0.40236398577690125
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39181914925575256
Validation loss = 0.38646674156188965
Validation loss = 0.3886636793613434
Validation loss = 0.3922547399997711
Validation loss = 0.3886057138442993
Validation loss = 0.3955899477005005
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4458235502243042
Validation loss = 0.4104909300804138
Validation loss = 0.4075852930545807
Validation loss = 0.4090035557746887
Validation loss = 0.4123619794845581
Validation loss = 0.4120669960975647
Validation loss = 0.41204550862312317
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3960268199443817
Validation loss = 0.39751285314559937
Validation loss = 0.4119711220264435
Validation loss = 0.4286952614784241
Validation loss = 0.4000386893749237
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40941324830055237
Validation loss = 0.39877331256866455
Validation loss = 0.4019225239753723
Validation loss = 0.4035258889198303
Validation loss = 0.4026500880718231
Validation loss = 0.4039750099182129
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -191     |
| Iteration     | 75       |
| MaximumReturn | -158     |
| MinimumReturn | -208     |
| TotalSamples  | 128282   |
----------------------------
itr #76 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39322543144226074
Validation loss = 0.39267370104789734
Validation loss = 0.39436858892440796
Validation loss = 0.39037224650382996
Validation loss = 0.39211779832839966
Validation loss = 0.3944751024246216
Validation loss = 0.3922842741012573
Validation loss = 0.39497923851013184
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3863306939601898
Validation loss = 0.3877946734428406
Validation loss = 0.39127233624458313
Validation loss = 0.38843783736228943
Validation loss = 0.38695430755615234
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40579652786254883
Validation loss = 0.4074770212173462
Validation loss = 0.40963804721832275
Validation loss = 0.405611127614975
Validation loss = 0.4073820114135742
Validation loss = 0.40594252943992615
Validation loss = 0.4089270830154419
Validation loss = 0.41491609811782837
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.40275874733924866
Validation loss = 0.39673203229904175
Validation loss = 0.39961665868759155
Validation loss = 0.4023734927177429
Validation loss = 0.4043925404548645
Validation loss = 0.41178029775619507
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3967140316963196
Validation loss = 0.3963741958141327
Validation loss = 0.3974410891532898
Validation loss = 0.39813315868377686
Validation loss = 0.39778539538383484
Validation loss = 0.40013161301612854
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -191     |
| Iteration     | 76       |
| MaximumReturn | -155     |
| MinimumReturn | -206     |
| TotalSamples  | 129948   |
----------------------------
itr #77 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39165160059928894
Validation loss = 0.38940197229385376
Validation loss = 0.3897552788257599
Validation loss = 0.3919108510017395
Validation loss = 0.3883962631225586
Validation loss = 0.3909689784049988
Validation loss = 0.3923117518424988
Validation loss = 0.3950202167034149
Validation loss = 0.3940340280532837
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3851895332336426
Validation loss = 0.38309913873672485
Validation loss = 0.3845200538635254
Validation loss = 0.3874399662017822
Validation loss = 0.38704603910446167
Validation loss = 0.3887164294719696
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.402868390083313
Validation loss = 0.40662896633148193
Validation loss = 0.4021008610725403
Validation loss = 0.4037589430809021
Validation loss = 0.4026973843574524
Validation loss = 0.4043848216533661
Validation loss = 0.40637460350990295
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39542293548583984
Validation loss = 0.4002813398838043
Validation loss = 0.40082061290740967
Validation loss = 0.3969358801841736
Validation loss = 0.4039081633090973
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3936782777309418
Validation loss = 0.39496004581451416
Validation loss = 0.3974682688713074
Validation loss = 0.3950986862182617
Validation loss = 0.39606282114982605
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iterations 0-19...
Done training policy.
Generating on-policy rollouts.
Path 0-24 | total_timesteps 0-2400 (+100 per path).
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -203     |
| Iteration     | 77       |
| MaximumReturn | -176     |
| MinimumReturn | -212     |
| TotalSamples  | 131614   |
----------------------------
itr #78 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4528161883354187
Validation loss = 0.37815672159194946
Validation loss = 0.3797202408313751
Validation loss = 0.3798545300960541
Validation loss = 0.37953442335128784
Validation loss = 0.3808337152004242
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38636675477027893
Validation loss = 0.3769662082195282
Validation loss = 0.37524741888046265
Validation loss = 0.37733450531959534
Validation loss = 0.37926527857780457
Validation loss = 0.38002341985702515
Validation loss = 0.37863689661026
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3995233476161957
Validation loss = 0.397983193397522
Validation loss = 0.39642441272735596
Validation loss = 0.3961898684501648
Validation loss = 0.3980882465839386
Validation loss = 0.398974746465683
Validation loss = 0.3981516361236572
Validation loss = 0.39717498421669006
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38656681776046753
Validation loss = 0.38127264380455017
Validation loss = 0.3818240165710449
Validation loss = 0.3855503797531128
Validation loss = 0.383480042219162
Validation loss = 0.38411945104599
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3974919021129608
Validation loss = 0.37989941239356995
Validation loss = 0.3835521638393402
Validation loss = 0.38518378138542175
Validation loss = 0.38733813166618347
Validation loss = 0.38627415895462036
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -201     |
| Iteration     | 78       |
| MaximumReturn | -187     |
| MinimumReturn | -213     |
| TotalSamples  | 133280   |
----------------------------
itr #79 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.38036707043647766
Validation loss = 0.37875935435295105
Validation loss = 0.3792683184146881
Validation loss = 0.3783615529537201
Validation loss = 0.3760230541229248
Validation loss = 0.3782588243484497
Validation loss = 0.3780949115753174
Validation loss = 0.38158097863197327
Validation loss = 0.37640658020973206
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3737192749977112
Validation loss = 0.37273767590522766
Validation loss = 0.3725208640098572
Validation loss = 0.37336626648902893
Validation loss = 0.3756687045097351
Validation loss = 0.3788756728172302
Validation loss = 0.3750085234642029
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.392145037651062
Validation loss = 0.39711710810661316
Validation loss = 0.3946083188056946
Validation loss = 0.3934435546398163
Validation loss = 0.3942418694496155
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3778347671031952
Validation loss = 0.3803582787513733
Validation loss = 0.3781711161136627
Validation loss = 0.38296547532081604
Validation loss = 0.3815086781978607
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38349756598472595
Validation loss = 0.38219380378723145
Validation loss = 0.383908748626709
Validation loss = 0.3810817003250122
Validation loss = 0.38232555985450745
Validation loss = 0.3861769437789917
Validation loss = 0.3862852454185486
Validation loss = 0.3851480185985565
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
Path 1 | total_timesteps 100.
Path 2 | total_timesteps 200.
Path 3 | total_timesteps 300.
Path 4 | total_timesteps 400.
Path 5 | total_timesteps 500.
Path 6 | total_timesteps 600.
Path 7 | total_timesteps 700.
Path 8 | total_timesteps 800.
Path 9 | total_timesteps 900.
Path 10 | total_timesteps 1000.
Path 11 | total_timesteps 1100.
Path 12 | total_timesteps 1200.
Path 13 | total_timesteps 1300.
Path 14 | total_timesteps 1400.
Path 15 | total_timesteps 1500.
Path 16 | total_timesteps 1600.
Path 17 | total_timesteps 1700.
Path 18 | total_timesteps 1800.
Path 19 | total_timesteps 1900.
Path 20 | total_timesteps 2000.
Path 21 | total_timesteps 2100.
Path 22 | total_timesteps 2200.
Path 23 | total_timesteps 2300.
Path 24 | total_timesteps 2400.
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -198     |
| Iteration     | 79       |
| MaximumReturn | -177     |
| MinimumReturn | -209     |
| TotalSamples  | 134946   |
----------------------------
