Logging to experiments/gym_cheetahA01/oct29/w350e3_seed2341
Print configuration .....
{'env_name': 'gym_cheetahA01', 'random_seeds': [4321, 2314, 2341, 3421], 'save_variables': False, 'model_save_dir': '/tmp/gym_cheetahA01_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5491533875465393
Validation loss = 0.15748900175094604
Validation loss = 0.11371127516031265
Validation loss = 0.09409484267234802
Validation loss = 0.08423462510108948
Validation loss = 0.07576920092105865
Validation loss = 0.07503031939268112
Validation loss = 0.06994400918483734
Validation loss = 0.06776947528123856
Validation loss = 0.07260580360889435
Validation loss = 0.06978974491357803
Validation loss = 0.06962908059358597
Validation loss = 0.07020577788352966
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6275701522827148
Validation loss = 0.15554559230804443
Validation loss = 0.11331028491258621
Validation loss = 0.09132735431194305
Validation loss = 0.08121457695960999
Validation loss = 0.07450047880411148
Validation loss = 0.07495857775211334
Validation loss = 0.07508570700883865
Validation loss = 0.08398059010505676
Validation loss = 0.08041173964738846
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.869869589805603
Validation loss = 0.1517207771539688
Validation loss = 0.11028105765581131
Validation loss = 0.09051796793937683
Validation loss = 0.07616376876831055
Validation loss = 0.07232753932476044
Validation loss = 0.0707937702536583
Validation loss = 0.0700024887919426
Validation loss = 0.06767526268959045
Validation loss = 0.06329493224620819
Validation loss = 0.06341594457626343
Validation loss = 0.06793174147605896
Validation loss = 0.0626421570777893
Validation loss = 0.06306257843971252
Validation loss = 0.05916569381952286
Validation loss = 0.05655543506145477
Validation loss = 0.0682876780629158
Validation loss = 0.06112448871135712
Validation loss = 0.08046113699674606
Validation loss = 0.06525440514087677
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5249120593070984
Validation loss = 0.15822534263134003
Validation loss = 0.11228084564208984
Validation loss = 0.08966707438230515
Validation loss = 0.07617848366498947
Validation loss = 0.07511941343545914
Validation loss = 0.07035605609416962
Validation loss = 0.07362134754657745
Validation loss = 0.07888521254062653
Validation loss = 0.06979715824127197
Validation loss = 0.0770159512758255
Validation loss = 0.07507456839084625
Validation loss = 0.07179231196641922
Validation loss = 0.07216597348451614
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8635494709014893
Validation loss = 0.15848679840564728
Validation loss = 0.11440984904766083
Validation loss = 0.09441371262073517
Validation loss = 0.08314790576696396
Validation loss = 0.07675942778587341
Validation loss = 0.07572680711746216
Validation loss = 0.07213132828474045
Validation loss = 0.07145816832780838
Validation loss = 0.07106658816337585
Validation loss = 0.07330693304538727
Validation loss = 0.07303225994110107
Validation loss = 0.06943532824516296
Validation loss = 0.07136270403862
Validation loss = 0.071512870490551
Validation loss = 0.07486649602651596
Validation loss = 0.0694333165884018
Validation loss = 0.08314501494169235
Validation loss = 0.0889439508318901
Validation loss = 0.0823807343840599
Validation loss = 0.07352551072835922
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 52
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 161
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 250
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 100
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 144
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 238
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -399     |
| Iteration     | 0        |
| MaximumReturn | -341     |
| MinimumReturn | -448     |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12404616177082062
Validation loss = 0.07696709781885147
Validation loss = 0.07053939998149872
Validation loss = 0.07226784527301788
Validation loss = 0.06775901466608047
Validation loss = 0.06493813544511795
Validation loss = 0.06696787476539612
Validation loss = 0.06566134095191956
Validation loss = 0.06683614104986191
Validation loss = 0.06882423162460327
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12117922306060791
Validation loss = 0.07932747900485992
Validation loss = 0.07900908589363098
Validation loss = 0.07074260711669922
Validation loss = 0.07113383710384369
Validation loss = 0.07023663818836212
Validation loss = 0.067361980676651
Validation loss = 0.08249654620885849
Validation loss = 0.07604888081550598
Validation loss = 0.06321550905704498
Validation loss = 0.0637836828827858
Validation loss = 0.06302250921726227
Validation loss = 0.06511005759239197
Validation loss = 0.0638573169708252
Validation loss = 0.062133342027664185
Validation loss = 0.06321460008621216
Validation loss = 0.06293909251689911
Validation loss = 0.06444379687309265
Validation loss = 0.062149859964847565
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1215168684720993
Validation loss = 0.07576169073581696
Validation loss = 0.06923701614141464
Validation loss = 0.06742052733898163
Validation loss = 0.06767432391643524
Validation loss = 0.0647648349404335
Validation loss = 0.06439390778541565
Validation loss = 0.06274215877056122
Validation loss = 0.062351688742637634
Validation loss = 0.06158392131328583
Validation loss = 0.061880551278591156
Validation loss = 0.0667576715350151
Validation loss = 0.06064389646053314
Validation loss = 0.06247923895716667
Validation loss = 0.06137096881866455
Validation loss = 0.06144574284553528
Validation loss = 0.06335712969303131
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12115918099880219
Validation loss = 0.07715202122926712
Validation loss = 0.07124428451061249
Validation loss = 0.07415726780891418
Validation loss = 0.06797459721565247
Validation loss = 0.06822393834590912
Validation loss = 0.0829324722290039
Validation loss = 0.06393204629421234
Validation loss = 0.06935685873031616
Validation loss = 0.07042406499385834
Validation loss = 0.062201354652643204
Validation loss = 0.06517153978347778
Validation loss = 0.07117092609405518
Validation loss = 0.06449484080076218
Validation loss = 0.06472477316856384
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.12916716933250427
Validation loss = 0.07546312361955643
Validation loss = 0.08132138103246689
Validation loss = 0.06967222690582275
Validation loss = 0.0675601214170456
Validation loss = 0.06624913960695267
Validation loss = 0.06870478391647339
Validation loss = 0.0651279091835022
Validation loss = 0.06542985141277313
Validation loss = 0.07066188007593155
Validation loss = 0.06301520019769669
Validation loss = 0.06482722610235214
Validation loss = 0.06210258603096008
Validation loss = 0.06294651329517365
Validation loss = 0.06306156516075134
Validation loss = 0.06572292745113373
Validation loss = 0.06464555114507675
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 526
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 601
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 617
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 671
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 640
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 685
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -349     |
| Iteration     | 1        |
| MaximumReturn | -320     |
| MinimumReturn | -423     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10456884652376175
Validation loss = 0.0795753002166748
Validation loss = 0.07938892394304276
Validation loss = 0.0798870176076889
Validation loss = 0.07560241967439651
Validation loss = 0.07868360728025436
Validation loss = 0.07485643774271011
Validation loss = 0.07346784323453903
Validation loss = 0.07466816902160645
Validation loss = 0.07686672359704971
Validation loss = 0.07517736405134201
Validation loss = 0.07398238033056259
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10189530998468399
Validation loss = 0.07838155329227448
Validation loss = 0.07742246240377426
Validation loss = 0.07700920850038528
Validation loss = 0.08508649468421936
Validation loss = 0.07497275620698929
Validation loss = 0.07492934912443161
Validation loss = 0.07366544753313065
Validation loss = 0.07330217212438583
Validation loss = 0.07758373022079468
Validation loss = 0.07731098681688309
Validation loss = 0.08632448315620422
Validation loss = 0.07406545430421829
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09670094400644302
Validation loss = 0.07722660899162292
Validation loss = 0.07739118486642838
Validation loss = 0.07486183196306229
Validation loss = 0.07529792934656143
Validation loss = 0.07515724748373032
Validation loss = 0.07379834353923798
Validation loss = 0.07345516234636307
Validation loss = 0.07613727450370789
Validation loss = 0.07557369023561478
Validation loss = 0.07567044347524643
Validation loss = 0.07563743740320206
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09744143486022949
Validation loss = 0.07774978131055832
Validation loss = 0.08000557869672775
Validation loss = 0.07503839582204819
Validation loss = 0.07634185254573822
Validation loss = 0.07385117560625076
Validation loss = 0.07714703679084778
Validation loss = 0.07268158346414566
Validation loss = 0.07414349168539047
Validation loss = 0.07524222135543823
Validation loss = 0.07532461732625961
Validation loss = 0.07294142991304398
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09840366244316101
Validation loss = 0.07886257767677307
Validation loss = 0.07411884516477585
Validation loss = 0.07788222283124924
Validation loss = 0.075799860060215
Validation loss = 0.0750417485833168
Validation loss = 0.07505649328231812
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 332
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 674
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 662
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 642
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 664
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 634
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -372     |
| Iteration     | 2        |
| MaximumReturn | -332     |
| MinimumReturn | -436     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10322688519954681
Validation loss = 0.08017205446958542
Validation loss = 0.07765626907348633
Validation loss = 0.07714317739009857
Validation loss = 0.07852694392204285
Validation loss = 0.07683047652244568
Validation loss = 0.07569631934165955
Validation loss = 0.07443387061357498
Validation loss = 0.07998797297477722
Validation loss = 0.07442870736122131
Validation loss = 0.0739465206861496
Validation loss = 0.07820369303226471
Validation loss = 0.07548925280570984
Validation loss = 0.07727011293172836
Validation loss = 0.07629214972257614
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10712389647960663
Validation loss = 0.08424147963523865
Validation loss = 0.07925761491060257
Validation loss = 0.08010652661323547
Validation loss = 0.0758877843618393
Validation loss = 0.07837450504302979
Validation loss = 0.07454264163970947
Validation loss = 0.07668560743331909
Validation loss = 0.08008342236280441
Validation loss = 0.0789971798658371
Validation loss = 0.07733357697725296
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10655808448791504
Validation loss = 0.08015484362840652
Validation loss = 0.07756785303354263
Validation loss = 0.07598014920949936
Validation loss = 0.07655013352632523
Validation loss = 0.07661320269107819
Validation loss = 0.07502032816410065
Validation loss = 0.07698407769203186
Validation loss = 0.07639160007238388
Validation loss = 0.07471808791160583
Validation loss = 0.07532615959644318
Validation loss = 0.09221336245536804
Validation loss = 0.07418127357959747
Validation loss = 0.07446953654289246
Validation loss = 0.07559079676866531
Validation loss = 0.07583548873662949
Validation loss = 0.07657875120639801
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10648404061794281
Validation loss = 0.07954961061477661
Validation loss = 0.07610902190208435
Validation loss = 0.07784076035022736
Validation loss = 0.07515580207109451
Validation loss = 0.07485032081604004
Validation loss = 0.0765869989991188
Validation loss = 0.07467059046030045
Validation loss = 0.07627187669277191
Validation loss = 0.07896736264228821
Validation loss = 0.07452637702226639
Validation loss = 0.07530477643013
Validation loss = 0.08901616930961609
Validation loss = 0.07532911002635956
Validation loss = 0.07410450279712677
Validation loss = 0.07271071523427963
Validation loss = 0.07370442152023315
Validation loss = 0.07758748531341553
Validation loss = 0.07817497104406357
Validation loss = 0.0753263533115387
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10080479085445404
Validation loss = 0.07836790382862091
Validation loss = 0.08100949227809906
Validation loss = 0.08382686227560043
Validation loss = 0.07738898694515228
Validation loss = 0.07460588216781616
Validation loss = 0.07696105539798737
Validation loss = 0.07583371549844742
Validation loss = 0.07567369937896729
Validation loss = 0.07638701796531677
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 487
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 500
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 456
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 516
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 471
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 459
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -268     |
| Iteration     | 3        |
| MaximumReturn | -229     |
| MinimumReturn | -315     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07512035965919495
Validation loss = 0.07000118494033813
Validation loss = 0.07199369370937347
Validation loss = 0.07258439064025879
Validation loss = 0.06819910556077957
Validation loss = 0.06963024288415909
Validation loss = 0.0709424614906311
Validation loss = 0.07079018652439117
Validation loss = 0.0676647201180458
Validation loss = 0.07070498168468475
Validation loss = 0.07013145089149475
Validation loss = 0.06818760931491852
Validation loss = 0.07048983126878738
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07394177466630936
Validation loss = 0.0703795924782753
Validation loss = 0.07037296891212463
Validation loss = 0.07087575644254684
Validation loss = 0.07000589370727539
Validation loss = 0.07007404416799545
Validation loss = 0.07135780900716782
Validation loss = 0.07040093839168549
Validation loss = 0.08200719207525253
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07519741356372833
Validation loss = 0.06973566114902496
Validation loss = 0.0690377950668335
Validation loss = 0.06640531122684479
Validation loss = 0.06730664521455765
Validation loss = 0.07306917011737823
Validation loss = 0.06933628022670746
Validation loss = 0.067076176404953
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07399140298366547
Validation loss = 0.07122780382633209
Validation loss = 0.06759394705295563
Validation loss = 0.06773190200328827
Validation loss = 0.06826680153608322
Validation loss = 0.06716841459274292
Validation loss = 0.0702926367521286
Validation loss = 0.0688239336013794
Validation loss = 0.06664702296257019
Validation loss = 0.06738125532865524
Validation loss = 0.0686030387878418
Validation loss = 0.07360018044710159
Validation loss = 0.06943812966346741
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07248900830745697
Validation loss = 0.07190460711717606
Validation loss = 0.06955576688051224
Validation loss = 0.06880117207765579
Validation loss = 0.07354423403739929
Validation loss = 0.06927818059921265
Validation loss = 0.06722918897867203
Validation loss = 0.06774621456861496
Validation loss = 0.06846402585506439
Validation loss = 0.0685267224907875
Validation loss = 0.0673309937119484
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 430
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 510
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 480
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 486
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 451
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 483
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 64.4     |
| Iteration     | 4        |
| MaximumReturn | 89.9     |
| MinimumReturn | 45.2     |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06850100308656693
Validation loss = 0.06782428920269012
Validation loss = 0.0675094872713089
Validation loss = 0.06716597825288773
Validation loss = 0.06538514792919159
Validation loss = 0.06481321901082993
Validation loss = 0.06518816202878952
Validation loss = 0.06380986422300339
Validation loss = 0.06800394505262375
Validation loss = 0.06888912618160248
Validation loss = 0.07539516687393188
Validation loss = 0.06598446518182755
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06860296428203583
Validation loss = 0.06544111669063568
Validation loss = 0.06510491669178009
Validation loss = 0.06784567981958389
Validation loss = 0.06450998038053513
Validation loss = 0.06588760018348694
Validation loss = 0.06447184830904007
Validation loss = 0.06342709064483643
Validation loss = 0.06608237326145172
Validation loss = 0.06314026564359665
Validation loss = 0.06412293016910553
Validation loss = 0.063250832259655
Validation loss = 0.06581829488277435
Validation loss = 0.06597293168306351
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0674326941370964
Validation loss = 0.06318844854831696
Validation loss = 0.06374150514602661
Validation loss = 0.06299354881048203
Validation loss = 0.06281766295433044
Validation loss = 0.06258975714445114
Validation loss = 0.06353908032178879
Validation loss = 0.06558460742235184
Validation loss = 0.06090298295021057
Validation loss = 0.06972747296094894
Validation loss = 0.06187598034739494
Validation loss = 0.06141970679163933
Validation loss = 0.061572521924972534
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06692966818809509
Validation loss = 0.0628759115934372
Validation loss = 0.06259570270776749
Validation loss = 0.06479281932115555
Validation loss = 0.06329815834760666
Validation loss = 0.06098796799778938
Validation loss = 0.06374417990446091
Validation loss = 0.06211080774664879
Validation loss = 0.06668145209550858
Validation loss = 0.06112192943692207
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06897171586751938
Validation loss = 0.067203089594841
Validation loss = 0.06716285645961761
Validation loss = 0.0664847269654274
Validation loss = 0.06553235650062561
Validation loss = 0.06588079780340195
Validation loss = 0.06478560715913773
Validation loss = 0.06353176385164261
Validation loss = 0.06354662030935287
Validation loss = 0.06493960320949554
Validation loss = 0.0641956701874733
Validation loss = 0.06560026854276657
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 584
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 619
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 601
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 595
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 630
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 626
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 381      |
| Iteration     | 5        |
| MaximumReturn | 671      |
| MinimumReturn | -58.1    |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0676438957452774
Validation loss = 0.0655236765742302
Validation loss = 0.06037721037864685
Validation loss = 0.06136197969317436
Validation loss = 0.06117298826575279
Validation loss = 0.0622902438044548
Validation loss = 0.06147027388215065
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06495974212884903
Validation loss = 0.06072259321808815
Validation loss = 0.06482154875993729
Validation loss = 0.06171734258532524
Validation loss = 0.06157645955681801
Validation loss = 0.06001672521233559
Validation loss = 0.060794610530138016
Validation loss = 0.06099255010485649
Validation loss = 0.061858002096414566
Validation loss = 0.06129610538482666
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0627889335155487
Validation loss = 0.06177765130996704
Validation loss = 0.06018529087305069
Validation loss = 0.05859687179327011
Validation loss = 0.05913902074098587
Validation loss = 0.061484675854444504
Validation loss = 0.05758192390203476
Validation loss = 0.060825977474451065
Validation loss = 0.05921432375907898
Validation loss = 0.05813220143318176
Validation loss = 0.05872016400098801
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06558467447757721
Validation loss = 0.061331961303949356
Validation loss = 0.06129872053861618
Validation loss = 0.06689588725566864
Validation loss = 0.059503402560949326
Validation loss = 0.058416787534952164
Validation loss = 0.06323261559009552
Validation loss = 0.059535980224609375
Validation loss = 0.06071673706173897
Validation loss = 0.05693701654672623
Validation loss = 0.05669458582997322
Validation loss = 0.059378623962402344
Validation loss = 0.0650123730301857
Validation loss = 0.060502130538225174
Validation loss = 0.05929187312722206
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06301634758710861
Validation loss = 0.06012508645653725
Validation loss = 0.061048176139593124
Validation loss = 0.0616716705262661
Validation loss = 0.06287398189306259
Validation loss = 0.06135009974241257
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 677
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 640
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 653
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 627
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 614
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 659
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 583      |
| Iteration     | 6        |
| MaximumReturn | 1.01e+03 |
| MinimumReturn | -167     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.060734424740076065
Validation loss = 0.05569329112768173
Validation loss = 0.0554339662194252
Validation loss = 0.05400888994336128
Validation loss = 0.05380275472998619
Validation loss = 0.05413875728845596
Validation loss = 0.05420919880270958
Validation loss = 0.054302867501974106
Validation loss = 0.05365390703082085
Validation loss = 0.05513880401849747
Validation loss = 0.054338764399290085
Validation loss = 0.05371106415987015
Validation loss = 0.05533357709646225
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.061839696019887924
Validation loss = 0.05571013689041138
Validation loss = 0.05829304829239845
Validation loss = 0.054441697895526886
Validation loss = 0.054528966546058655
Validation loss = 0.05322285369038582
Validation loss = 0.05297442898154259
Validation loss = 0.05493318289518356
Validation loss = 0.05221746861934662
Validation loss = 0.0543777234852314
Validation loss = 0.05267143249511719
Validation loss = 0.051126107573509216
Validation loss = 0.05390177294611931
Validation loss = 0.0527348667383194
Validation loss = 0.056253716349601746
Validation loss = 0.05098408833146095
Validation loss = 0.05184531956911087
Validation loss = 0.05476567894220352
Validation loss = 0.052337389439344406
Validation loss = 0.05491027981042862
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.059627190232276917
Validation loss = 0.05260554328560829
Validation loss = 0.05153745040297508
Validation loss = 0.05276329815387726
Validation loss = 0.052985429763793945
Validation loss = 0.051786623895168304
Validation loss = 0.05084896832704544
Validation loss = 0.05477817729115486
Validation loss = 0.05073770880699158
Validation loss = 0.05331549048423767
Validation loss = 0.05084322392940521
Validation loss = 0.05110947787761688
Validation loss = 0.05016990005970001
Validation loss = 0.051110513508319855
Validation loss = 0.05289845913648605
Validation loss = 0.05287063121795654
Validation loss = 0.050972480326890945
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05917700380086899
Validation loss = 0.05215943604707718
Validation loss = 0.05111358314752579
Validation loss = 0.052480299025774
Validation loss = 0.051673512905836105
Validation loss = 0.05368904024362564
Validation loss = 0.0523165762424469
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06190863251686096
Validation loss = 0.0546228289604187
Validation loss = 0.05535034090280533
Validation loss = 0.05421770364046097
Validation loss = 0.054409049451351166
Validation loss = 0.05348256230354309
Validation loss = 0.05242730677127838
Validation loss = 0.05303032323718071
Validation loss = 0.05363975092768669
Validation loss = 0.05263936519622803
Validation loss = 0.05318880081176758
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 668
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 680
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 674
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 652
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 690
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 658
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 815      |
| Iteration     | 7        |
| MaximumReturn | 1.42e+03 |
| MinimumReturn | -553     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05765930935740471
Validation loss = 0.05186542868614197
Validation loss = 0.0516161285340786
Validation loss = 0.05319952964782715
Validation loss = 0.05096481740474701
Validation loss = 0.05128402262926102
Validation loss = 0.052885740995407104
Validation loss = 0.05124790593981743
Validation loss = 0.05361953005194664
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06111238896846771
Validation loss = 0.05152459815144539
Validation loss = 0.05076722800731659
Validation loss = 0.05044498294591904
Validation loss = 0.051670730113983154
Validation loss = 0.04996475577354431
Validation loss = 0.05031951516866684
Validation loss = 0.05053580552339554
Validation loss = 0.051684487611055374
Validation loss = 0.057476334273815155
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0567312054336071
Validation loss = 0.049893975257873535
Validation loss = 0.05099859833717346
Validation loss = 0.051096267998218536
Validation loss = 0.055745355784893036
Validation loss = 0.05344274267554283
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05789429694414139
Validation loss = 0.05007672682404518
Validation loss = 0.050452087074518204
Validation loss = 0.05102001503109932
Validation loss = 0.05154123902320862
Validation loss = 0.050362005829811096
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.055057477205991745
Validation loss = 0.05311059206724167
Validation loss = 0.052829790860414505
Validation loss = 0.05504785105586052
Validation loss = 0.053051695227622986
Validation loss = 0.051890406757593155
Validation loss = 0.051905397325754166
Validation loss = 0.052141886204481125
Validation loss = 0.052618831396102905
Validation loss = 0.05106762796640396
Validation loss = 0.05238182842731476
Validation loss = 0.05181539058685303
Validation loss = 0.050189029425382614
Validation loss = 0.051194727420806885
Validation loss = 0.05138257145881653
Validation loss = 0.050240207463502884
Validation loss = 0.04995600879192352
Validation loss = 0.050218481570482254
Validation loss = 0.0522666834294796
Validation loss = 0.05046496540307999
Validation loss = 0.052430495619773865
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 719
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 708
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 711
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 604
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 705
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 719
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.25e+03 |
| Iteration     | 8        |
| MaximumReturn | 1.7e+03  |
| MinimumReturn | -549     |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.053420424461364746
Validation loss = 0.04633255675435066
Validation loss = 0.046240027993917465
Validation loss = 0.04603597894310951
Validation loss = 0.04628344625234604
Validation loss = 0.04678751900792122
Validation loss = 0.047586794942617416
Validation loss = 0.04595806077122688
Validation loss = 0.046078067272901535
Validation loss = 0.04530933499336243
Validation loss = 0.0462864488363266
Validation loss = 0.04674018174409866
Validation loss = 0.0457661934196949
Validation loss = 0.04592583328485489
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.049633849412202835
Validation loss = 0.04591875523328781
Validation loss = 0.04791942238807678
Validation loss = 0.04669047147035599
Validation loss = 0.04715288430452347
Validation loss = 0.044972363859415054
Validation loss = 0.04606439918279648
Validation loss = 0.0465412512421608
Validation loss = 0.047229938209056854
Validation loss = 0.04593675211071968
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04909178242087364
Validation loss = 0.04569575935602188
Validation loss = 0.04643593356013298
Validation loss = 0.046002037823200226
Validation loss = 0.04537254944443703
Validation loss = 0.045340195298194885
Validation loss = 0.04540877789258957
Validation loss = 0.047596439719200134
Validation loss = 0.04663106054067612
Validation loss = 0.04563993215560913
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.051278553903102875
Validation loss = 0.04658106341958046
Validation loss = 0.049754925072193146
Validation loss = 0.04652424901723862
Validation loss = 0.04596080631017685
Validation loss = 0.04518696293234825
Validation loss = 0.046282462775707245
Validation loss = 0.044632863253355026
Validation loss = 0.04619507119059563
Validation loss = 0.045092903077602386
Validation loss = 0.043930474668741226
Validation loss = 0.04519972577691078
Validation loss = 0.0444142185151577
Validation loss = 0.04659859463572502
Validation loss = 0.043906070291996
Validation loss = 0.046332087367773056
Validation loss = 0.04538606107234955
Validation loss = 0.04420509189367294
Validation loss = 0.04549679905176163
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.0482698492705822
Validation loss = 0.047936178743839264
Validation loss = 0.04515162482857704
Validation loss = 0.04566100612282753
Validation loss = 0.04531922936439514
Validation loss = 0.04778867959976196
Validation loss = 0.0457705482840538
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 687
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 712
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 751
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 757
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 761
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 662
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 718      |
| Iteration     | 9        |
| MaximumReturn | 1.74e+03 |
| MinimumReturn | -589     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.051714763045310974
Validation loss = 0.04474107176065445
Validation loss = 0.0432671494781971
Validation loss = 0.04402042180299759
Validation loss = 0.04202098026871681
Validation loss = 0.04373307153582573
Validation loss = 0.04185846820473671
Validation loss = 0.042338091880083084
Validation loss = 0.04254690185189247
Validation loss = 0.04222837835550308
Validation loss = 0.042580559849739075
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04911131039261818
Validation loss = 0.044549360871315
Validation loss = 0.04305880889296532
Validation loss = 0.04400743544101715
Validation loss = 0.043391820043325424
Validation loss = 0.04365278780460358
Validation loss = 0.043696049600839615
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04765540361404419
Validation loss = 0.04249274358153343
Validation loss = 0.045051209628582
Validation loss = 0.0414583794772625
Validation loss = 0.0427427664399147
Validation loss = 0.04281404986977577
Validation loss = 0.04288431257009506
Validation loss = 0.042269084602594376
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.048043206334114075
Validation loss = 0.04245257005095482
Validation loss = 0.043489325791597366
Validation loss = 0.04253871738910675
Validation loss = 0.041580602526664734
Validation loss = 0.04405119642615318
Validation loss = 0.04218289256095886
Validation loss = 0.042930591851472855
Validation loss = 0.042789772152900696
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04906065762042999
Validation loss = 0.044198743999004364
Validation loss = 0.04334781691431999
Validation loss = 0.04317149892449379
Validation loss = 0.042968615889549255
Validation loss = 0.042745884507894516
Validation loss = 0.04249277710914612
Validation loss = 0.042319077998399734
Validation loss = 0.043171435594558716
Validation loss = 0.04303475841879845
Validation loss = 0.04348008707165718
Validation loss = 0.04187578335404396
Validation loss = 0.04250064119696617
Validation loss = 0.04387999325990677
Validation loss = 0.04337761923670769
Validation loss = 0.045308470726013184
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 754
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 765
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 753
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 714
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 786
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 766
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.67e+03 |
| Iteration     | 10       |
| MaximumReturn | 2.16e+03 |
| MinimumReturn | -164     |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04142080992460251
Validation loss = 0.0392044335603714
Validation loss = 0.03733489289879799
Validation loss = 0.03834147751331329
Validation loss = 0.03799297288060188
Validation loss = 0.04042740538716316
Validation loss = 0.03869069740176201
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.044225651770830154
Validation loss = 0.040570493787527084
Validation loss = 0.03960416093468666
Validation loss = 0.03913288190960884
Validation loss = 0.03938772901892662
Validation loss = 0.03854026272892952
Validation loss = 0.03949696943163872
Validation loss = 0.03928215429186821
Validation loss = 0.03783024847507477
Validation loss = 0.03832755237817764
Validation loss = 0.03826778009533882
Validation loss = 0.03791440650820732
Validation loss = 0.03831460699439049
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.044601377099752426
Validation loss = 0.03900139406323433
Validation loss = 0.03837192803621292
Validation loss = 0.03821038082242012
Validation loss = 0.03822396695613861
Validation loss = 0.038357701152563095
Validation loss = 0.03879193589091301
Validation loss = 0.03753438964486122
Validation loss = 0.038132525980472565
Validation loss = 0.03801332041621208
Validation loss = 0.03855525329709053
Validation loss = 0.03737659379839897
Validation loss = 0.036813318729400635
Validation loss = 0.037024255841970444
Validation loss = 0.037049539387226105
Validation loss = 0.037243619561195374
Validation loss = 0.03803931176662445
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.043112438172101974
Validation loss = 0.03834646940231323
Validation loss = 0.03930828347802162
Validation loss = 0.03745449706912041
Validation loss = 0.03725948929786682
Validation loss = 0.0391772985458374
Validation loss = 0.03800719976425171
Validation loss = 0.03902125358581543
Validation loss = 0.03826359659433365
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04465942457318306
Validation loss = 0.03875599429011345
Validation loss = 0.0386386401951313
Validation loss = 0.0401531420648098
Validation loss = 0.0386803075671196
Validation loss = 0.037574540823698044
Validation loss = 0.03779354318976402
Validation loss = 0.03852536901831627
Validation loss = 0.03897320106625557
Validation loss = 0.037516769021749496
Validation loss = 0.037711698561906815
Validation loss = 0.04147560894489288
Validation loss = 0.03791715204715729
Validation loss = 0.0387570858001709
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 758
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 735
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 713
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 760
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 768
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 743
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.71e+03 |
| Iteration     | 11       |
| MaximumReturn | 2.24e+03 |
| MinimumReturn | 385      |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.041445568203926086
Validation loss = 0.03493756800889969
Validation loss = 0.03495579957962036
Validation loss = 0.035239044576883316
Validation loss = 0.0352029874920845
Validation loss = 0.03586151823401451
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.038895133882761
Validation loss = 0.035593122243881226
Validation loss = 0.03444148600101471
Validation loss = 0.03524179756641388
Validation loss = 0.0346953310072422
Validation loss = 0.03603283315896988
Validation loss = 0.03582926094532013
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03885994479060173
Validation loss = 0.03562256321310997
Validation loss = 0.034835297614336014
Validation loss = 0.03534062206745148
Validation loss = 0.03422053903341293
Validation loss = 0.034966498613357544
Validation loss = 0.03617588430643082
Validation loss = 0.034555938094854355
Validation loss = 0.033668432384729385
Validation loss = 0.0356888622045517
Validation loss = 0.03310319408774376
Validation loss = 0.03369592875242233
Validation loss = 0.035328056663274765
Validation loss = 0.03486663103103638
Validation loss = 0.03375914320349693
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04208860173821449
Validation loss = 0.03508391231298447
Validation loss = 0.03509578853845596
Validation loss = 0.034884534776210785
Validation loss = 0.03478206321597099
Validation loss = 0.034626755863428116
Validation loss = 0.03405220806598663
Validation loss = 0.034914422780275345
Validation loss = 0.03581956401467323
Validation loss = 0.03487155959010124
Validation loss = 0.03523079305887222
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03948327526450157
Validation loss = 0.03594127297401428
Validation loss = 0.03689848631620407
Validation loss = 0.03645574674010277
Validation loss = 0.0354955829679966
Validation loss = 0.03406796604394913
Validation loss = 0.035756271332502365
Validation loss = 0.034825365990400314
Validation loss = 0.03434018790721893
Validation loss = 0.036238107830286026
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 766
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 757
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 781
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 759
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 766
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 774
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.83e+03 |
| Iteration     | 12       |
| MaximumReturn | 2.06e+03 |
| MinimumReturn | 1.62e+03 |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.03495272621512413
Validation loss = 0.03173143044114113
Validation loss = 0.033262405544519424
Validation loss = 0.03239981457591057
Validation loss = 0.032779958099126816
Validation loss = 0.03125293180346489
Validation loss = 0.032102398574352264
Validation loss = 0.0318852998316288
Validation loss = 0.03248161822557449
Validation loss = 0.03224373236298561
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03642856702208519
Validation loss = 0.03260635584592819
Validation loss = 0.03193589672446251
Validation loss = 0.03261082246899605
Validation loss = 0.0315207839012146
Validation loss = 0.03394859656691551
Validation loss = 0.031260740011930466
Validation loss = 0.033939749002456665
Validation loss = 0.03196992725133896
Validation loss = 0.03204445168375969
Validation loss = 0.03178323432803154
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03476135432720184
Validation loss = 0.0312659852206707
Validation loss = 0.03200891241431236
Validation loss = 0.03285781294107437
Validation loss = 0.03390958160161972
Validation loss = 0.03120388649404049
Validation loss = 0.030980588868260384
Validation loss = 0.03202527016401291
Validation loss = 0.03079581819474697
Validation loss = 0.03199489787220955
Validation loss = 0.03123355284333229
Validation loss = 0.03245548531413078
Validation loss = 0.030006423592567444
Validation loss = 0.03269437700510025
Validation loss = 0.030380571261048317
Validation loss = 0.030800601467490196
Validation loss = 0.03186166658997536
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.035723306238651276
Validation loss = 0.0317700132727623
Validation loss = 0.030843209475278854
Validation loss = 0.03127875179052353
Validation loss = 0.03256272152066231
Validation loss = 0.032143063843250275
Validation loss = 0.031228477135300636
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03593193367123604
Validation loss = 0.03215884417295456
Validation loss = 0.03379737585783005
Validation loss = 0.03174147382378578
Validation loss = 0.03203090280294418
Validation loss = 0.032576173543930054
Validation loss = 0.03241952881217003
Validation loss = 0.03192662447690964
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 726
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 779
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 787
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 750
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 752
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 748
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 976      |
| Iteration     | 13       |
| MaximumReturn | 2.25e+03 |
| MinimumReturn | -608     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0354146771132946
Validation loss = 0.031482964754104614
Validation loss = 0.030761724337935448
Validation loss = 0.031536493450403214
Validation loss = 0.032208625227212906
Validation loss = 0.0317077673971653
Validation loss = 0.031232429668307304
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03559597209095955
Validation loss = 0.030825037509202957
Validation loss = 0.03185247629880905
Validation loss = 0.03234530985355377
Validation loss = 0.03108142502605915
Validation loss = 0.03145575150847435
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03500194102525711
Validation loss = 0.031032627448439598
Validation loss = 0.029919790104031563
Validation loss = 0.03136005252599716
Validation loss = 0.03080226108431816
Validation loss = 0.03052150271832943
Validation loss = 0.030244192108511925
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.03457625210285187
Validation loss = 0.03246716037392616
Validation loss = 0.03027195669710636
Validation loss = 0.03124201111495495
Validation loss = 0.030583469197154045
Validation loss = 0.03324933722615242
Validation loss = 0.03155110031366348
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03592697158455849
Validation loss = 0.030940311029553413
Validation loss = 0.031018417328596115
Validation loss = 0.031338635832071304
Validation loss = 0.03147616609930992
Validation loss = 0.03171785548329353
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 735
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 766
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 775
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 773
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 767
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 737
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.63e+03 |
| Iteration     | 14       |
| MaximumReturn | 2.24e+03 |
| MinimumReturn | 699      |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.03174970671534538
Validation loss = 0.029327988624572754
Validation loss = 0.031090904027223587
Validation loss = 0.029391810297966003
Validation loss = 0.0312960185110569
Validation loss = 0.029207440093159676
Validation loss = 0.03055168315768242
Validation loss = 0.0292664747685194
Validation loss = 0.030299987643957138
Validation loss = 0.030460625886917114
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03455515205860138
Validation loss = 0.030273040756583214
Validation loss = 0.030286526307463646
Validation loss = 0.03158940002322197
Validation loss = 0.03021378628909588
Validation loss = 0.031251460313797
Validation loss = 0.029127011075615883
Validation loss = 0.03098311275243759
Validation loss = 0.029009485617280006
Validation loss = 0.02969317138195038
Validation loss = 0.030514873564243317
Validation loss = 0.030053729191422462
Validation loss = 0.03283679485321045
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.032298579812049866
Validation loss = 0.029815375804901123
Validation loss = 0.029177986085414886
Validation loss = 0.02987820841372013
Validation loss = 0.029845507815480232
Validation loss = 0.029381129890680313
Validation loss = 0.029130985960364342
Validation loss = 0.028852902352809906
Validation loss = 0.029596824198961258
Validation loss = 0.02977178618311882
Validation loss = 0.02956560254096985
Validation loss = 0.02904510498046875
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.03174731135368347
Validation loss = 0.029362443834543228
Validation loss = 0.029417142271995544
Validation loss = 0.02959442138671875
Validation loss = 0.029945008456707
Validation loss = 0.028513872995972633
Validation loss = 0.03089749440550804
Validation loss = 0.0282549187541008
Validation loss = 0.030296269804239273
Validation loss = 0.028841808438301086
Validation loss = 0.028824981302022934
Validation loss = 0.028667040169239044
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03423961624503136
Validation loss = 0.03027377650141716
Validation loss = 0.02975754253566265
Validation loss = 0.030628589913249016
Validation loss = 0.029271721839904785
Validation loss = 0.03040349669754505
Validation loss = 0.030307605862617493
Validation loss = 0.029262900352478027
Validation loss = 0.02950582467019558
Validation loss = 0.030101042240858078
Validation loss = 0.03147069364786148
Validation loss = 0.028864271938800812
Validation loss = 0.029601607471704483
Validation loss = 0.03192303329706192
Validation loss = 0.030376218259334564
Validation loss = 0.028961654752492905
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 797
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 779
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 785
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 778
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 769
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 808
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.85e+03 |
| Iteration     | 15       |
| MaximumReturn | 2.15e+03 |
| MinimumReturn | 1.17e+03 |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.03054985962808132
Validation loss = 0.027506133541464806
Validation loss = 0.02800910919904709
Validation loss = 0.028112519532442093
Validation loss = 0.028304148465394974
Validation loss = 0.029549669474363327
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03122716397047043
Validation loss = 0.02883980982005596
Validation loss = 0.028484653681516647
Validation loss = 0.028296787291765213
Validation loss = 0.029751455411314964
Validation loss = 0.027650877833366394
Validation loss = 0.027773037552833557
Validation loss = 0.028483660891652107
Validation loss = 0.02784300595521927
Validation loss = 0.028401218354701996
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03141925483942032
Validation loss = 0.027685202658176422
Validation loss = 0.02792847901582718
Validation loss = 0.02825784683227539
Validation loss = 0.02735344134271145
Validation loss = 0.028030453249812126
Validation loss = 0.026968739926815033
Validation loss = 0.02876131981611252
Validation loss = 0.026575488969683647
Validation loss = 0.027811845764517784
Validation loss = 0.029361439868807793
Validation loss = 0.026051392778754234
Validation loss = 0.0281169805675745
Validation loss = 0.027961784973740578
Validation loss = 0.02692100778222084
Validation loss = 0.028301486745476723
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.030462419614195824
Validation loss = 0.027952605858445168
Validation loss = 0.027486994862556458
Validation loss = 0.028688959777355194
Validation loss = 0.02764577977359295
Validation loss = 0.02694529853761196
Validation loss = 0.02810286357998848
Validation loss = 0.02778315357863903
Validation loss = 0.02711578458547592
Validation loss = 0.026927225291728973
Validation loss = 0.025934383273124695
Validation loss = 0.029366863891482353
Validation loss = 0.026124121621251106
Validation loss = 0.026710396632552147
Validation loss = 0.025982335209846497
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.031233208253979683
Validation loss = 0.02746100351214409
Validation loss = 0.02969885990023613
Validation loss = 0.029547467827796936
Validation loss = 0.02715262770652771
Validation loss = 0.02895141951739788
Validation loss = 0.028312288224697113
Validation loss = 0.027243580669164658
Validation loss = 0.028108390048146248
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 804
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 703
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 776
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 745
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 757
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 743
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.36e+03 |
| Iteration     | 16       |
| MaximumReturn | 2.24e+03 |
| MinimumReturn | 48       |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.03097786195576191
Validation loss = 0.02778620272874832
Validation loss = 0.026941563934087753
Validation loss = 0.027995547279715538
Validation loss = 0.029303167015314102
Validation loss = 0.02729036472737789
Validation loss = 0.027657363563776016
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.032256320118904114
Validation loss = 0.027318743988871574
Validation loss = 0.027256516739726067
Validation loss = 0.027580171823501587
Validation loss = 0.02753346972167492
Validation loss = 0.028137508779764175
Validation loss = 0.02656368725001812
Validation loss = 0.02824087254703045
Validation loss = 0.02583199180662632
Validation loss = 0.027145057916641235
Validation loss = 0.030973156914114952
Validation loss = 0.026063673198223114
Validation loss = 0.027178116142749786
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.029104385524988174
Validation loss = 0.02609977312386036
Validation loss = 0.0265312772244215
Validation loss = 0.026411456987261772
Validation loss = 0.026563726365566254
Validation loss = 0.02678731456398964
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.028480172157287598
Validation loss = 0.02621808648109436
Validation loss = 0.02571318857371807
Validation loss = 0.027261443436145782
Validation loss = 0.025900734588503838
Validation loss = 0.026126157492399216
Validation loss = 0.02647501975297928
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02978573925793171
Validation loss = 0.027011489495635033
Validation loss = 0.027876999229192734
Validation loss = 0.02802225574851036
Validation loss = 0.02802879363298416
Validation loss = 0.0267268605530262
Validation loss = 0.027807006612420082
Validation loss = 0.026414159685373306
Validation loss = 0.027736417949199677
Validation loss = 0.028633899986743927
Validation loss = 0.026102468371391296
Validation loss = 0.026847682893276215
Validation loss = 0.02667609229683876
Validation loss = 0.026113752275705338
Validation loss = 0.027154425159096718
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 742
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 770
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 816
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 796
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 706
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 752
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.4e+03  |
| Iteration     | 17       |
| MaximumReturn | 2.35e+03 |
| MinimumReturn | -145     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0297507643699646
Validation loss = 0.02651238441467285
Validation loss = 0.02669193595647812
Validation loss = 0.027492092922329903
Validation loss = 0.02655171975493431
Validation loss = 0.02625342272222042
Validation loss = 0.027208691462874413
Validation loss = 0.026133209466934204
Validation loss = 0.025635097175836563
Validation loss = 0.028373993933200836
Validation loss = 0.02650008164346218
Validation loss = 0.02581072598695755
Validation loss = 0.027137702330946922
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02974468097090721
Validation loss = 0.026600545272231102
Validation loss = 0.026954032480716705
Validation loss = 0.027879122644662857
Validation loss = 0.02660813368856907
Validation loss = 0.026919864118099213
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.028130576014518738
Validation loss = 0.026928002014756203
Validation loss = 0.025567227974534035
Validation loss = 0.02621384523808956
Validation loss = 0.026334404945373535
Validation loss = 0.026756977662444115
Validation loss = 0.02554829977452755
Validation loss = 0.02548101730644703
Validation loss = 0.026044582948088646
Validation loss = 0.02589697204530239
Validation loss = 0.026255663484334946
Validation loss = 0.025969497859477997
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.028829261660575867
Validation loss = 0.025377096608281136
Validation loss = 0.027226990088820457
Validation loss = 0.024942178279161453
Validation loss = 0.02484072558581829
Validation loss = 0.025795631110668182
Validation loss = 0.024759238585829735
Validation loss = 0.024953359737992287
Validation loss = 0.0261392742395401
Validation loss = 0.024637741968035698
Validation loss = 0.02665381319820881
Validation loss = 0.02408287301659584
Validation loss = 0.024884197860956192
Validation loss = 0.025034483522176743
Validation loss = 0.024062076583504677
Validation loss = 0.025973549112677574
Validation loss = 0.024950817227363586
Validation loss = 0.024129027500748634
Validation loss = 0.026510581374168396
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.029380857944488525
Validation loss = 0.026060784235596657
Validation loss = 0.026091616600751877
Validation loss = 0.026582960039377213
Validation loss = 0.025896579027175903
Validation loss = 0.026339765638113022
Validation loss = 0.026957668364048004
Validation loss = 0.026202792301774025
Validation loss = 0.026468519121408463
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 818
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 771
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 828
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 741
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 717
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 805
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.41e+03 |
| Iteration     | 18       |
| MaximumReturn | 2.38e+03 |
| MinimumReturn | 279      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.028567442670464516
Validation loss = 0.02542366459965706
Validation loss = 0.025461947545409203
Validation loss = 0.025832295417785645
Validation loss = 0.024939557537436485
Validation loss = 0.026930350810289383
Validation loss = 0.026653656736016273
Validation loss = 0.025235146284103394
Validation loss = 0.027098853141069412
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02801317349076271
Validation loss = 0.025945216417312622
Validation loss = 0.026235288009047508
Validation loss = 0.026013106107711792
Validation loss = 0.02590596117079258
Validation loss = 0.026989927515387535
Validation loss = 0.026249319314956665
Validation loss = 0.026909267529845238
Validation loss = 0.02474590763449669
Validation loss = 0.026346269994974136
Validation loss = 0.025266826152801514
Validation loss = 0.0258330050855875
Validation loss = 0.025780802592635155
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.02776091732084751
Validation loss = 0.02499164268374443
Validation loss = 0.025943756103515625
Validation loss = 0.025087619200348854
Validation loss = 0.02515377663075924
Validation loss = 0.024803152307868004
Validation loss = 0.024197768419981003
Validation loss = 0.025501716881990433
Validation loss = 0.024750348180532455
Validation loss = 0.024080898612737656
Validation loss = 0.025378376245498657
Validation loss = 0.024015840142965317
Validation loss = 0.025162208825349808
Validation loss = 0.024789011105895042
Validation loss = 0.024117715656757355
Validation loss = 0.02503359317779541
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.02590283378958702
Validation loss = 0.023906102403998375
Validation loss = 0.02409270778298378
Validation loss = 0.02496602013707161
Validation loss = 0.023631533607840538
Validation loss = 0.02469802275300026
Validation loss = 0.023142797872424126
Validation loss = 0.02349742315709591
Validation loss = 0.024486929178237915
Validation loss = 0.023038571700453758
Validation loss = 0.023971324786543846
Validation loss = 0.024347975850105286
Validation loss = 0.02379370667040348
Validation loss = 0.02323436550796032
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.028376728296279907
Validation loss = 0.025382081046700478
Validation loss = 0.026210183277726173
Validation loss = 0.025113096460700035
Validation loss = 0.0255698524415493
Validation loss = 0.02495793253183365
Validation loss = 0.025241801515221596
Validation loss = 0.02481805719435215
Validation loss = 0.02530588209629059
Validation loss = 0.02527971938252449
Validation loss = 0.025177139788866043
Validation loss = 0.025942036882042885
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 807
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 801
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 801
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 802
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 816
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 839
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.3e+03  |
| Iteration     | 19       |
| MaximumReturn | 2.51e+03 |
| MinimumReturn | 2.03e+03 |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.025457505136728287
Validation loss = 0.02362942136824131
Validation loss = 0.024070588871836662
Validation loss = 0.023463835939764977
Validation loss = 0.02434374950826168
Validation loss = 0.024173861369490623
Validation loss = 0.024958882480859756
Validation loss = 0.02438453584909439
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02660371921956539
Validation loss = 0.02401839755475521
Validation loss = 0.02424299716949463
Validation loss = 0.02523217350244522
Validation loss = 0.02471780776977539
Validation loss = 0.024146340787410736
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.026561180129647255
Validation loss = 0.02317730523645878
Validation loss = 0.023731663823127747
Validation loss = 0.02455604076385498
Validation loss = 0.023334123194217682
Validation loss = 0.023539606481790543
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.023893287405371666
Validation loss = 0.022214777767658234
Validation loss = 0.022879185155034065
Validation loss = 0.021746212616562843
Validation loss = 0.02265986241400242
Validation loss = 0.022104045376181602
Validation loss = 0.022350206971168518
Validation loss = 0.02168305218219757
Validation loss = 0.023157110437750816
Validation loss = 0.021341849118471146
Validation loss = 0.022183502092957497
Validation loss = 0.021731534972786903
Validation loss = 0.021751834079623222
Validation loss = 0.02142127975821495
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02505195513367653
Validation loss = 0.023903099820017815
Validation loss = 0.025023290887475014
Validation loss = 0.024635039269924164
Validation loss = 0.023610340431332588
Validation loss = 0.024873968213796616
Validation loss = 0.023380717262625694
Validation loss = 0.024449706077575684
Validation loss = 0.025147534906864166
Validation loss = 0.02272316999733448
Validation loss = 0.023806165903806686
Validation loss = 0.024027971550822258
Validation loss = 0.023068737238645554
Validation loss = 0.02587946504354477
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 809
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 825
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 824
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 748
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 786
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 840
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.42e+03 |
| Iteration     | 20       |
| MaximumReturn | 2.41e+03 |
| MinimumReturn | -11.5    |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.025471186265349388
Validation loss = 0.023260153830051422
Validation loss = 0.023572029545903206
Validation loss = 0.02345440350472927
Validation loss = 0.024464525282382965
Validation loss = 0.023132018744945526
Validation loss = 0.024302827194333076
Validation loss = 0.023466581478714943
Validation loss = 0.023199504241347313
Validation loss = 0.023134680464863777
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.025973260402679443
Validation loss = 0.02444264106452465
Validation loss = 0.02380150370299816
Validation loss = 0.02458406612277031
Validation loss = 0.023286860436201096
Validation loss = 0.023690106347203255
Validation loss = 0.022963982075452805
Validation loss = 0.023329155519604683
Validation loss = 0.02274436503648758
Validation loss = 0.02381278946995735
Validation loss = 0.023218072950839996
Validation loss = 0.02316557615995407
Validation loss = 0.023142904043197632
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.02539949305355549
Validation loss = 0.022302864119410515
Validation loss = 0.02282479591667652
Validation loss = 0.022921986877918243
Validation loss = 0.02239830605685711
Validation loss = 0.02227613888680935
Validation loss = 0.02319498173892498
Validation loss = 0.02150183543562889
Validation loss = 0.023392824456095695
Validation loss = 0.02185378223657608
Validation loss = 0.02182804048061371
Validation loss = 0.02256758324801922
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.02303491346538067
Validation loss = 0.021050700917840004
Validation loss = 0.021052075549960136
Validation loss = 0.021116768941283226
Validation loss = 0.021046539768576622
Validation loss = 0.019716652110219002
Validation loss = 0.02252228744328022
Validation loss = 0.019972799345850945
Validation loss = 0.020063914358615875
Validation loss = 0.02050555869936943
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02572488971054554
Validation loss = 0.022780369967222214
Validation loss = 0.02304539829492569
Validation loss = 0.022990042343735695
Validation loss = 0.02274268865585327
Validation loss = 0.023067297413945198
Validation loss = 0.022680768743157387
Validation loss = 0.023055849596858025
Validation loss = 0.023696791380643845
Validation loss = 0.022009801119565964
Validation loss = 0.021934660151600838
Validation loss = 0.023493368178606033
Validation loss = 0.02147655189037323
Validation loss = 0.024217594414949417
Validation loss = 0.021563820540905
Validation loss = 0.023019427433609962
Validation loss = 0.0224534310400486
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 833
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 810
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 847
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 829
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 830
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 846
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.23e+03 |
| Iteration     | 21       |
| MaximumReturn | 2.34e+03 |
| MinimumReturn | 2.02e+03 |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.02262219600379467
Validation loss = 0.021727459505200386
Validation loss = 0.02199782244861126
Validation loss = 0.021865852177143097
Validation loss = 0.021870216354727745
Validation loss = 0.022156117483973503
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.022626861929893494
Validation loss = 0.021678995341062546
Validation loss = 0.023188240826129913
Validation loss = 0.022273359820246696
Validation loss = 0.02154957689344883
Validation loss = 0.021938743069767952
Validation loss = 0.02186950109899044
Validation loss = 0.02135188691318035
Validation loss = 0.021424314007163048
Validation loss = 0.023012537509202957
Validation loss = 0.021108847111463547
Validation loss = 0.022119581699371338
Validation loss = 0.020889168605208397
Validation loss = 0.02114645577967167
Validation loss = 0.02264757826924324
Validation loss = 0.020399022847414017
Validation loss = 0.02128218114376068
Validation loss = 0.02164396084845066
Validation loss = 0.020190594717860222
Validation loss = 0.021908123046159744
Validation loss = 0.020439404994249344
Validation loss = 0.021336765959858894
Validation loss = 0.02068466506898403
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.02182886376976967
Validation loss = 0.021284593269228935
Validation loss = 0.021183837205171585
Validation loss = 0.02159426361322403
Validation loss = 0.021029042080044746
Validation loss = 0.02206479012966156
Validation loss = 0.02080550603568554
Validation loss = 0.021415656432509422
Validation loss = 0.020367726683616638
Validation loss = 0.01995200105011463
Validation loss = 0.021287476643919945
Validation loss = 0.020131176337599754
Validation loss = 0.022355621680617332
Validation loss = 0.019779857248067856
Validation loss = 0.02149813063442707
Validation loss = 0.020181970670819283
Validation loss = 0.020958993583917618
Validation loss = 0.01967449113726616
Validation loss = 0.021194810047745705
Validation loss = 0.020318327471613884
Validation loss = 0.020140979439020157
Validation loss = 0.020014885812997818
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.020494429394602776
Validation loss = 0.01982240192592144
Validation loss = 0.020565902814269066
Validation loss = 0.019430851563811302
Validation loss = 0.019602227956056595
Validation loss = 0.02075517736375332
Validation loss = 0.01924680545926094
Validation loss = 0.019489634782075882
Validation loss = 0.019607091322541237
Validation loss = 0.01900402270257473
Validation loss = 0.019020618870854378
Validation loss = 0.020231464877724648
Validation loss = 0.0191524438560009
Validation loss = 0.01944054663181305
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.022035997360944748
Validation loss = 0.02128649316728115
Validation loss = 0.02192990481853485
Validation loss = 0.020603559911251068
Validation loss = 0.021781371906399727
Validation loss = 0.0207161083817482
Validation loss = 0.021618379279971123
Validation loss = 0.021406566724181175
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 840
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 862
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 852
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 838
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 864
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 788
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.1e+03  |
| Iteration     | 22       |
| MaximumReturn | 2.43e+03 |
| MinimumReturn | 1.02e+03 |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.021771131083369255
Validation loss = 0.021084100008010864
Validation loss = 0.0211744774132967
Validation loss = 0.02113204449415207
Validation loss = 0.022277196869254112
Validation loss = 0.02073768340051174
Validation loss = 0.022354617714881897
Validation loss = 0.019990744069218636
Validation loss = 0.02124302089214325
Validation loss = 0.020937731489539146
Validation loss = 0.020344967022538185
Validation loss = 0.02089834026992321
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.021283505484461784
Validation loss = 0.020498134195804596
Validation loss = 0.019949421286582947
Validation loss = 0.020130889490246773
Validation loss = 0.020233770832419395
Validation loss = 0.019830411300063133
Validation loss = 0.020533611997961998
Validation loss = 0.019642425701022148
Validation loss = 0.02037452533841133
Validation loss = 0.0203129630535841
Validation loss = 0.01922653615474701
Validation loss = 0.02031625621020794
Validation loss = 0.020570162683725357
Validation loss = 0.019143959507346153
Validation loss = 0.020647943019866943
Validation loss = 0.020070748403668404
Validation loss = 0.01885770820081234
Validation loss = 0.01984376087784767
Validation loss = 0.019373023882508278
Validation loss = 0.018962115049362183
Validation loss = 0.020277440547943115
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.019859785214066505
Validation loss = 0.01968708075582981
Validation loss = 0.01957228220999241
Validation loss = 0.018462710082530975
Validation loss = 0.018913520500063896
Validation loss = 0.019817227497696877
Validation loss = 0.0186736099421978
Validation loss = 0.01897457428276539
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.019283918663859367
Validation loss = 0.018443554639816284
Validation loss = 0.018532896414399147
Validation loss = 0.01801259256899357
Validation loss = 0.019152609631419182
Validation loss = 0.017922418192029
Validation loss = 0.018091635778546333
Validation loss = 0.018050961196422577
Validation loss = 0.0191876869648695
Validation loss = 0.017874740064144135
Validation loss = 0.01844007335603237
Validation loss = 0.01834438182413578
Validation loss = 0.01757788471877575
Validation loss = 0.018365414813160896
Validation loss = 0.017439140006899834
Validation loss = 0.017871888354420662
Validation loss = 0.017780432477593422
Validation loss = 0.01821785978972912
Validation loss = 0.017075225710868835
Validation loss = 0.017132194712758064
Validation loss = 0.01760106533765793
Validation loss = 0.016558324918150902
Validation loss = 0.017671620473265648
Validation loss = 0.016810452565550804
Validation loss = 0.018078230321407318
Validation loss = 0.016447575762867928
Validation loss = 0.018243934959173203
Validation loss = 0.017164409160614014
Validation loss = 0.01684151403605938
Validation loss = 0.01709180884063244
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02123893052339554
Validation loss = 0.02013750560581684
Validation loss = 0.02049064449965954
Validation loss = 0.020402757450938225
Validation loss = 0.021945318207144737
Validation loss = 0.020467931404709816
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 861
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 837
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 849
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 847
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 850
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 839
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.4e+03  |
| Iteration     | 23       |
| MaximumReturn | 2.57e+03 |
| MinimumReturn | 2.22e+03 |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.021516775712370872
Validation loss = 0.019419219344854355
Validation loss = 0.020476853474974632
Validation loss = 0.021325450390577316
Validation loss = 0.01926323212683201
Validation loss = 0.02034612186253071
Validation loss = 0.019769888371229172
Validation loss = 0.019377650693058968
Validation loss = 0.01973911002278328
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02059089206159115
Validation loss = 0.018197499215602875
Validation loss = 0.019287673756480217
Validation loss = 0.018144872039556503
Validation loss = 0.017702152952551842
Validation loss = 0.01879914663732052
Validation loss = 0.018065834417939186
Validation loss = 0.018028946593403816
Validation loss = 0.018543865531682968
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.020235992968082428
Validation loss = 0.018237585201859474
Validation loss = 0.01828118972480297
Validation loss = 0.018040407449007034
Validation loss = 0.018522178754210472
Validation loss = 0.018049318343400955
Validation loss = 0.017648423090577126
Validation loss = 0.018786698579788208
Validation loss = 0.01708277501165867
Validation loss = 0.0180471483618021
Validation loss = 0.01707957684993744
Validation loss = 0.017433052882552147
Validation loss = 0.01733432710170746
Validation loss = 0.018924089148640633
Validation loss = 0.016557645052671432
Validation loss = 0.0167901199311018
Validation loss = 0.01687225140631199
Validation loss = 0.016363363713026047
Validation loss = 0.01766873151063919
Validation loss = 0.01642613485455513
Validation loss = 0.01727534644305706
Validation loss = 0.01607586443424225
Validation loss = 0.016885405406355858
Validation loss = 0.01626637578010559
Validation loss = 0.015678877010941505
Validation loss = 0.01615433394908905
Validation loss = 0.016503680497407913
Validation loss = 0.01614396646618843
Validation loss = 0.01641552709043026
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.01649126224219799
Validation loss = 0.01566799357533455
Validation loss = 0.017174774780869484
Validation loss = 0.016005929559469223
Validation loss = 0.015812955796718597
Validation loss = 0.016240645200014114
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.019940603524446487
Validation loss = 0.019635256379842758
Validation loss = 0.020209677517414093
Validation loss = 0.01924648880958557
Validation loss = 0.01928391121327877
Validation loss = 0.01914796605706215
Validation loss = 0.01973544992506504
Validation loss = 0.019771676510572433
Validation loss = 0.018855683505535126
Validation loss = 0.019741619005799294
Validation loss = 0.018844394013285637
Validation loss = 0.019150398671627045
Validation loss = 0.018488997593522072
Validation loss = 0.019406309351325035
Validation loss = 0.018688153475522995
Validation loss = 0.019806426018476486
Validation loss = 0.01845586486160755
Validation loss = 0.020483478903770447
Validation loss = 0.01862119324505329
Validation loss = 0.01921122521162033
Validation loss = 0.017889827489852905
Validation loss = 0.021110806614160538
Validation loss = 0.0178742203861475
Validation loss = 0.018615856766700745
Validation loss = 0.01827141083776951
Validation loss = 0.018396729603409767
Validation loss = 0.018294712528586388
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 824
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 770
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 818
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 847
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 854
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 840
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2e+03    |
| Iteration     | 24       |
| MaximumReturn | 2.52e+03 |
| MinimumReturn | 240      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.02022760733962059
Validation loss = 0.01843087188899517
Validation loss = 0.019790930673480034
Validation loss = 0.01882314495742321
Validation loss = 0.018582167103886604
Validation loss = 0.018842410296201706
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.018237359821796417
Validation loss = 0.017504408955574036
Validation loss = 0.01833268813788891
Validation loss = 0.017126524820923805
Validation loss = 0.01723742112517357
Validation loss = 0.01679084822535515
Validation loss = 0.01769906096160412
Validation loss = 0.01683110184967518
Validation loss = 0.0173307117074728
Validation loss = 0.01649441011250019
Validation loss = 0.01677601970732212
Validation loss = 0.017861781641840935
Validation loss = 0.016647133976221085
Validation loss = 0.016837330535054207
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.016220249235630035
Validation loss = 0.01561901904642582
Validation loss = 0.015402302145957947
Validation loss = 0.016370711848139763
Validation loss = 0.01548904087394476
Validation loss = 0.015821753069758415
Validation loss = 0.015223052352666855
Validation loss = 0.014829849824309349
Validation loss = 0.015371228568255901
Validation loss = 0.015151968225836754
Validation loss = 0.015112771652638912
Validation loss = 0.01504472829401493
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.01707790046930313
Validation loss = 0.015547111630439758
Validation loss = 0.015887008979916573
Validation loss = 0.01570281945168972
Validation loss = 0.01602679118514061
Validation loss = 0.016016976907849312
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.018732532858848572
Validation loss = 0.016946827992796898
Validation loss = 0.01845725066959858
Validation loss = 0.01761961355805397
Validation loss = 0.018701253458857536
Validation loss = 0.017494821920990944
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 831
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 760
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 777
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 818
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 788
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 846
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.51e+03 |
| Iteration     | 25       |
| MaximumReturn | 2.43e+03 |
| MinimumReturn | 125      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.020792802795767784
Validation loss = 0.018160592764616013
Validation loss = 0.018408939242362976
Validation loss = 0.01837913878262043
Validation loss = 0.01800691895186901
Validation loss = 0.018127115443348885
Validation loss = 0.017910193651914597
Validation loss = 0.017988717183470726
Validation loss = 0.0181124284863472
Validation loss = 0.018167417496442795
Validation loss = 0.017900072038173676
Validation loss = 0.017617767676711082
Validation loss = 0.018093327060341835
Validation loss = 0.01778348535299301
Validation loss = 0.017773959785699844
Validation loss = 0.017467690631747246
Validation loss = 0.01783469319343567
Validation loss = 0.017649592831730843
Validation loss = 0.016972428187727928
Validation loss = 0.017978547140955925
Validation loss = 0.017304809764027596
Validation loss = 0.019185125827789307
Validation loss = 0.016792362555861473
Validation loss = 0.017093626782298088
Validation loss = 0.016743427142500877
Validation loss = 0.01757967472076416
Validation loss = 0.018112322315573692
Validation loss = 0.016612902283668518
Validation loss = 0.01715025119483471
Validation loss = 0.01750415563583374
Validation loss = 0.016617467626929283
Validation loss = 0.016834374517202377
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.01799289882183075
Validation loss = 0.01609431393444538
Validation loss = 0.01747817173600197
Validation loss = 0.016069890931248665
Validation loss = 0.016534537076950073
Validation loss = 0.016402747482061386
Validation loss = 0.015984999015927315
Validation loss = 0.01621508039534092
Validation loss = 0.01756824180483818
Validation loss = 0.015989067032933235
Validation loss = 0.015630314126610756
Validation loss = 0.015676409006118774
Validation loss = 0.015566416084766388
Validation loss = 0.016102828085422516
Validation loss = 0.01606871373951435
Validation loss = 0.015424628742039204
Validation loss = 0.01556786336004734
Validation loss = 0.015786709263920784
Validation loss = 0.015536701306700706
Validation loss = 0.015781382098793983
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.014838602393865585
Validation loss = 0.014432575553655624
Validation loss = 0.015128628350794315
Validation loss = 0.01406383141875267
Validation loss = 0.014195294119417667
Validation loss = 0.014420051127672195
Validation loss = 0.014486520551145077
Validation loss = 0.014160512015223503
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.01625235192477703
Validation loss = 0.014889007434248924
Validation loss = 0.015028689987957478
Validation loss = 0.01507669035345316
Validation loss = 0.015007332898676395
Validation loss = 0.014872037805616856
Validation loss = 0.014383869245648384
Validation loss = 0.015340081416070461
Validation loss = 0.014594335108995438
Validation loss = 0.014198677614331245
Validation loss = 0.01453044917434454
Validation loss = 0.014657204039394855
Validation loss = 0.014788460917770863
Validation loss = 0.014149404130876064
Validation loss = 0.014944701455533504
Validation loss = 0.013901016674935818
Validation loss = 0.014126262627542019
Validation loss = 0.014679664745926857
Validation loss = 0.014371822588145733
Validation loss = 0.015491647645831108
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.018307676538825035
Validation loss = 0.016867931932210922
Validation loss = 0.01690550707280636
Validation loss = 0.01723744161427021
Validation loss = 0.017380913719534874
Validation loss = 0.016998540610074997
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 820
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 819
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 839
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 761
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 841
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 850
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.05e+03 |
| Iteration     | 26       |
| MaximumReturn | 2.48e+03 |
| MinimumReturn | 329      |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.01667342707514763
Validation loss = 0.0159893985837698
Validation loss = 0.016306495293974876
Validation loss = 0.01701890490949154
Validation loss = 0.016382765024900436
Validation loss = 0.016597723588347435
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.01686117611825466
Validation loss = 0.014773407019674778
Validation loss = 0.015441293828189373
Validation loss = 0.015236944891512394
Validation loss = 0.015165625140070915
Validation loss = 0.014912022277712822
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.014262658543884754
Validation loss = 0.013680198229849339
Validation loss = 0.013716391287744045
Validation loss = 0.013961689546704292
Validation loss = 0.013751246966421604
Validation loss = 0.01347456406801939
Validation loss = 0.013768724165856838
Validation loss = 0.013586742803454399
Validation loss = 0.013594004325568676
Validation loss = 0.013458439148962498
Validation loss = 0.013346227817237377
Validation loss = 0.013491509482264519
Validation loss = 0.01340035442262888
Validation loss = 0.013111740350723267
Validation loss = 0.012863409705460072
Validation loss = 0.013212056830525398
Validation loss = 0.012627224437892437
Validation loss = 0.013493947684764862
Validation loss = 0.013918468728661537
Validation loss = 0.012694580480456352
Validation loss = 0.013148726895451546
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.015024915337562561
Validation loss = 0.01338449027389288
Validation loss = 0.013388697989284992
Validation loss = 0.01319217961281538
Validation loss = 0.01351025141775608
Validation loss = 0.013322283513844013
Validation loss = 0.013994233682751656
Validation loss = 0.013725143857300282
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.017346937209367752
Validation loss = 0.01616467349231243
Validation loss = 0.016238074749708176
Validation loss = 0.015727395191788673
Validation loss = 0.016243701800704002
Validation loss = 0.016773920506238937
Validation loss = 0.016588589176535606
Validation loss = 0.01563236117362976
Validation loss = 0.016171373426914215
Validation loss = 0.01585942506790161
Validation loss = 0.015418999828398228
Validation loss = 0.01627202518284321
Validation loss = 0.015339651145040989
Validation loss = 0.015879856422543526
Validation loss = 0.015091704204678535
Validation loss = 0.01605323515832424
Validation loss = 0.015395351685583591
Validation loss = 0.015329350717365742
Validation loss = 0.01586191914975643
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 827
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 842
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 830
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 842
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 848
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 831
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.4e+03  |
| Iteration     | 27       |
| MaximumReturn | 2.77e+03 |
| MinimumReturn | 2.1e+03  |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.01658526435494423
Validation loss = 0.016110826283693314
Validation loss = 0.015505490824580193
Validation loss = 0.016253264620900154
Validation loss = 0.01507579255849123
Validation loss = 0.015562581829726696
Validation loss = 0.015361637808382511
Validation loss = 0.015442527830600739
Validation loss = 0.014978722669184208
Validation loss = 0.015267162583768368
Validation loss = 0.014839868992567062
Validation loss = 0.015270734205842018
Validation loss = 0.015329993329942226
Validation loss = 0.015854718163609505
Validation loss = 0.015589531511068344
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.015416218899190426
Validation loss = 0.015343275852501392
Validation loss = 0.014176161959767342
Validation loss = 0.015167947858572006
Validation loss = 0.01523230317980051
Validation loss = 0.01429396029561758
Validation loss = 0.0148811936378479
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.013392833061516285
Validation loss = 0.012004644609987736
Validation loss = 0.013196097686886787
Validation loss = 0.012954706326127052
Validation loss = 0.013587409630417824
Validation loss = 0.012840422801673412
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.01331110019236803
Validation loss = 0.013053559698164463
Validation loss = 0.012705299071967602
Validation loss = 0.0135072972625494
Validation loss = 0.012696345336735249
Validation loss = 0.012427828274667263
Validation loss = 0.013035396113991737
Validation loss = 0.013244838453829288
Validation loss = 0.012368202209472656
Validation loss = 0.012751933187246323
Validation loss = 0.012430252507328987
Validation loss = 0.01326748076826334
Validation loss = 0.012168675661087036
Validation loss = 0.012585443444550037
Validation loss = 0.012492623180150986
Validation loss = 0.012470858171582222
Validation loss = 0.0124402130022645
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.01601250097155571
Validation loss = 0.01485704816877842
Validation loss = 0.014522613026201725
Validation loss = 0.015261958353221416
Validation loss = 0.014898134395480156
Validation loss = 0.014924787916243076
Validation loss = 0.014770050533115864
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 837
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 843
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 818
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 850
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 767
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 769
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.59e+03 |
| Iteration     | 28       |
| MaximumReturn | 2.52e+03 |
| MinimumReturn | -311     |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.016132112592458725
Validation loss = 0.015570865012705326
Validation loss = 0.014655736275017262
Validation loss = 0.015330085530877113
Validation loss = 0.0149147417396307
Validation loss = 0.014667227864265442
Validation loss = 0.015242921188473701
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.01544768176972866
Validation loss = 0.014423373155295849
Validation loss = 0.014195344410836697
Validation loss = 0.014644605107605457
Validation loss = 0.014729281887412071
Validation loss = 0.014489413239061832
Validation loss = 0.01373082585632801
Validation loss = 0.014135334640741348
Validation loss = 0.01402306742966175
Validation loss = 0.014206701889634132
Validation loss = 0.014170226640999317
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.012632401660084724
Validation loss = 0.01269399281591177
Validation loss = 0.012790936976671219
Validation loss = 0.011908191256225109
Validation loss = 0.012516115792095661
Validation loss = 0.012494232505559921
Validation loss = 0.012213313020765781
Validation loss = 0.012893636710941792
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.013204660266637802
Validation loss = 0.01271780300885439
Validation loss = 0.012514210306107998
Validation loss = 0.012441219761967659
Validation loss = 0.012180071324110031
Validation loss = 0.01197433564811945
Validation loss = 0.012616662308573723
Validation loss = 0.012469091452658176
Validation loss = 0.012082146480679512
Validation loss = 0.01233643852174282
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.015203162096440792
Validation loss = 0.014428703114390373
Validation loss = 0.01500630285590887
Validation loss = 0.014480686746537685
Validation loss = 0.015287782065570354
Validation loss = 0.01444321684539318
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 830
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 812
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 815
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 816
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 812
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 764
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.27e+03 |
| Iteration     | 29       |
| MaximumReturn | 2.7e+03  |
| MinimumReturn | 858      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.014644152484834194
Validation loss = 0.014351309277117252
Validation loss = 0.015125070698559284
Validation loss = 0.014085953123867512
Validation loss = 0.013986441306769848
Validation loss = 0.015061963349580765
Validation loss = 0.013987619429826736
Validation loss = 0.014332559891045094
Validation loss = 0.014336387626826763
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.014636397361755371
Validation loss = 0.013538619503378868
Validation loss = 0.013661100529134274
Validation loss = 0.01375816110521555
Validation loss = 0.013664930127561092
Validation loss = 0.01340087503194809
Validation loss = 0.01338022667914629
Validation loss = 0.013531172648072243
Validation loss = 0.013339999131858349
Validation loss = 0.01360657624900341
Validation loss = 0.013386751525104046
Validation loss = 0.013964264653623104
Validation loss = 0.012886782176792622
Validation loss = 0.013509511016309261
Validation loss = 0.012925955466926098
Validation loss = 0.013597032986581326
Validation loss = 0.012818536721169949
Validation loss = 0.013305537402629852
Validation loss = 0.013749091885983944
Validation loss = 0.01277461089193821
Validation loss = 0.01337113045156002
Validation loss = 0.01314974669367075
Validation loss = 0.013073746114969254
Validation loss = 0.013031207025051117
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.012496139854192734
Validation loss = 0.011877352371811867
Validation loss = 0.012346059083938599
Validation loss = 0.012447578832507133
Validation loss = 0.01235537976026535
Validation loss = 0.012390469200909138
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.011974497698247433
Validation loss = 0.012623300775885582
Validation loss = 0.0120881712064147
Validation loss = 0.011730819009244442
Validation loss = 0.012162119150161743
Validation loss = 0.011708635836839676
Validation loss = 0.011936627328395844
Validation loss = 0.012271123006939888
Validation loss = 0.011586497537791729
Validation loss = 0.011509444564580917
Validation loss = 0.012766274623572826
Validation loss = 0.011518783867359161
Validation loss = 0.011756354942917824
Validation loss = 0.011544364504516125
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.015409995801746845
Validation loss = 0.014824125915765762
Validation loss = 0.014680052176117897
Validation loss = 0.01390822697430849
Validation loss = 0.015893056988716125
Validation loss = 0.013874148018658161
Validation loss = 0.01418667659163475
Validation loss = 0.013875013217329979
Validation loss = 0.01416027918457985
Validation loss = 0.013893954455852509
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 790
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 797
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 830
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 811
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 834
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.85e+03 |
| Iteration     | 30       |
| MaximumReturn | 2.47e+03 |
| MinimumReturn | 450      |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.013820323161780834
Validation loss = 0.013898871839046478
Validation loss = 0.014113113284111023
Validation loss = 0.013337032869458199
Validation loss = 0.014124264940619469
Validation loss = 0.013523070141673088
Validation loss = 0.013708051294088364
Validation loss = 0.012867633253335953
Validation loss = 0.013190928846597672
Validation loss = 0.013081943616271019
Validation loss = 0.013555368408560753
Validation loss = 0.013535984791815281
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.013625407591462135
Validation loss = 0.012478138320147991
Validation loss = 0.012166503816843033
Validation loss = 0.012998290359973907
Validation loss = 0.012548552826046944
Validation loss = 0.012108602561056614
Validation loss = 0.012519745156168938
Validation loss = 0.012355701997876167
Validation loss = 0.012366244569420815
Validation loss = 0.012654904276132584
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.013024212792515755
Validation loss = 0.012068957090377808
Validation loss = 0.012832320295274258
Validation loss = 0.011522701010107994
Validation loss = 0.011684220284223557
Validation loss = 0.011680839583277702
Validation loss = 0.011738467961549759
Validation loss = 0.011571662500500679
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.012453779578208923
Validation loss = 0.011543529108166695
Validation loss = 0.011653252877295017
Validation loss = 0.011477446183562279
Validation loss = 0.011496827937662601
Validation loss = 0.011335376650094986
Validation loss = 0.011237191036343575
Validation loss = 0.011571574956178665
Validation loss = 0.011516181752085686
Validation loss = 0.01107010617852211
Validation loss = 0.011267025955021381
Validation loss = 0.011740073561668396
Validation loss = 0.011382183991372585
Validation loss = 0.011357037350535393
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.01434691995382309
Validation loss = 0.013680404052138329
Validation loss = 0.013725575990974903
Validation loss = 0.013755040243268013
Validation loss = 0.01427772082388401
Validation loss = 0.013155677355825901
Validation loss = 0.013219956308603287
Validation loss = 0.013428697362542152
Validation loss = 0.013250462710857391
Validation loss = 0.013447960838675499
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 812
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 766
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 819
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 798
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 809
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 802
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.22e+03 |
| Iteration     | 31       |
| MaximumReturn | 2.84e+03 |
| MinimumReturn | 365      |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.013574229553341866
Validation loss = 0.013149380683898926
Validation loss = 0.013154422864317894
Validation loss = 0.012931594625115395
Validation loss = 0.012851574458181858
Validation loss = 0.013126656413078308
Validation loss = 0.012262165546417236
Validation loss = 0.012654172256588936
Validation loss = 0.012127792462706566
Validation loss = 0.01245687622576952
Validation loss = 0.01234912034124136
Validation loss = 0.012836937792599201
Validation loss = 0.01210889220237732
Validation loss = 0.012806673534214497
Validation loss = 0.012232938781380653
Validation loss = 0.011685695499181747
Validation loss = 0.012283803895115852
Validation loss = 0.012058897875249386
Validation loss = 0.012419434264302254
Validation loss = 0.012085667811334133
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.012732256203889847
Validation loss = 0.011830649338662624
Validation loss = 0.01279617939144373
Validation loss = 0.011841008439660072
Validation loss = 0.012155979871749878
Validation loss = 0.011880666948854923
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.011474738828837872
Validation loss = 0.011344022117555141
Validation loss = 0.011263646185398102
Validation loss = 0.01214025728404522
Validation loss = 0.011101876385509968
Validation loss = 0.01111835241317749
Validation loss = 0.011309731751680374
Validation loss = 0.010924394242465496
Validation loss = 0.011370077729225159
Validation loss = 0.011827144771814346
Validation loss = 0.010878629051148891
Validation loss = 0.011540117673575878
Validation loss = 0.010942645370960236
Validation loss = 0.010976603254675865
Validation loss = 0.011122819036245346
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.011197943240404129
Validation loss = 0.011112108826637268
Validation loss = 0.011247266083955765
Validation loss = 0.011203711852431297
Validation loss = 0.010944541543722153
Validation loss = 0.010739983059465885
Validation loss = 0.011081159114837646
Validation loss = 0.010954823344945908
Validation loss = 0.01139692310243845
Validation loss = 0.010678276419639587
Validation loss = 0.011804642155766487
Validation loss = 0.010973251424729824
Validation loss = 0.010819203220307827
Validation loss = 0.011329643428325653
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.01354475412517786
Validation loss = 0.012675866484642029
Validation loss = 0.012983933091163635
Validation loss = 0.013089343905448914
Validation loss = 0.012914281338453293
Validation loss = 0.013149809092283249
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 819
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 809
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 814
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 803
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 802
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 792
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.49e+03 |
| Iteration     | 32       |
| MaximumReturn | 2.65e+03 |
| MinimumReturn | 2.35e+03 |
| TotalSamples  | 136000   |
----------------------------
