Logging to experiments/hopper/oct31/w350e03_Durl_seed2531
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5650959014892578
Validation loss = 0.27895206212997437
Validation loss = 0.2910032868385315
Validation loss = 0.29045403003692627
Validation loss = 0.29237958788871765
Validation loss = 0.31383395195007324
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5342131853103638
Validation loss = 0.29032230377197266
Validation loss = 0.28657597303390503
Validation loss = 0.29373782873153687
Validation loss = 0.2884942889213562
Validation loss = 0.2880483865737915
Validation loss = 0.30159395933151245
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5044245719909668
Validation loss = 0.28544747829437256
Validation loss = 0.29153335094451904
Validation loss = 0.3033505976200104
Validation loss = 0.29001104831695557
Validation loss = 0.28870299458503723
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5004665851593018
Validation loss = 0.2816789150238037
Validation loss = 0.28627219796180725
Validation loss = 0.28817635774612427
Validation loss = 0.2929942309856415
Validation loss = 0.30719059705734253
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6537232398986816
Validation loss = 0.28437674045562744
Validation loss = 0.2826277017593384
Validation loss = 0.2839188575744629
Validation loss = 0.2916809618473053
Validation loss = 0.30053016543388367
Validation loss = 0.2924555242061615
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 258
average number of affinization = 36.857142857142854
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 210
average number of affinization = 58.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 224
average number of affinization = 76.88888888888889
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 230
average number of affinization = 92.2
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 228
average number of affinization = 104.54545454545455
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 214
average number of affinization = 113.66666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.34e+03 |
| Iteration     | 0         |
| MaximumReturn | -595      |
| MinimumReturn | -2.01e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.25809696316719055
Validation loss = 0.2337484210729599
Validation loss = 0.2280811369419098
Validation loss = 0.2365286946296692
Validation loss = 0.255401074886322
Validation loss = 0.254364550113678
Validation loss = 0.26214709877967834
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.25397276878356934
Validation loss = 0.23928338289260864
Validation loss = 0.24468834698200226
Validation loss = 0.2330484390258789
Validation loss = 0.23511312901973724
Validation loss = 0.25258710980415344
Validation loss = 0.23940196633338928
Validation loss = 0.2553677558898926
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.25495028495788574
Validation loss = 0.23137545585632324
Validation loss = 0.22329960763454437
Validation loss = 0.224415123462677
Validation loss = 0.22904521226882935
Validation loss = 0.22976695001125336
Validation loss = 0.23065248131752014
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.26075273752212524
Validation loss = 0.2368319034576416
Validation loss = 0.23780786991119385
Validation loss = 0.2301962971687317
Validation loss = 0.23535341024398804
Validation loss = 0.24100404977798462
Validation loss = 0.2558046579360962
Validation loss = 0.24736356735229492
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.26168808341026306
Validation loss = 0.23103243112564087
Validation loss = 0.24116581678390503
Validation loss = 0.22322142124176025
Validation loss = 0.24611836671829224
Validation loss = 0.2598186135292053
Validation loss = 0.2581139802932739
Validation loss = 0.26277998089790344
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 582
average number of affinization = 149.69230769230768
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 518
average number of affinization = 176.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 526
average number of affinization = 199.33333333333334
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 218.0625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 506
average number of affinization = 235.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 561
average number of affinization = 253.11111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.97e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.72e+03 |
| MinimumReturn | -2.35e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22489647567272186
Validation loss = 0.2108568400144577
Validation loss = 0.2092599719762802
Validation loss = 0.21648456156253815
Validation loss = 0.2199820727109909
Validation loss = 0.2058320790529251
Validation loss = 0.21093446016311646
Validation loss = 0.21063752472400665
Validation loss = 0.2064640074968338
Validation loss = 0.20370453596115112
Validation loss = 0.20271103084087372
Validation loss = 0.21570153534412384
Validation loss = 0.2051820158958435
Validation loss = 0.20689332485198975
Validation loss = 0.20722176134586334
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22498071193695068
Validation loss = 0.2117164582014084
Validation loss = 0.21407891809940338
Validation loss = 0.22866876423358917
Validation loss = 0.23033057153224945
Validation loss = 0.21525944769382477
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20874916017055511
Validation loss = 0.1936509609222412
Validation loss = 0.1869080811738968
Validation loss = 0.19703392684459686
Validation loss = 0.18758852779865265
Validation loss = 0.19261497259140015
Validation loss = 0.18580132722854614
Validation loss = 0.18869535624980927
Validation loss = 0.19173114001750946
Validation loss = 0.18544530868530273
Validation loss = 0.18934603035449982
Validation loss = 0.186160609126091
Validation loss = 0.19305704534053802
Validation loss = 0.18581724166870117
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21862749755382538
Validation loss = 0.19984392821788788
Validation loss = 0.2087908238172531
Validation loss = 0.20391738414764404
Validation loss = 0.20663301646709442
Validation loss = 0.2079440951347351
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2108052521944046
Validation loss = 0.22450490295886993
Validation loss = 0.2115635871887207
Validation loss = 0.20851773023605347
Validation loss = 0.2219373732805252
Validation loss = 0.21400904655456543
Validation loss = 0.21945172548294067
Validation loss = 0.21241630613803864
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 261.1578947368421
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 359
average number of affinization = 266.05
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 273.3333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 279.77272727272725
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 285.6521739130435
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 302
average number of affinization = 286.3333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -872      |
| Iteration     | 2         |
| MaximumReturn | 732       |
| MinimumReturn | -1.58e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20502300560474396
Validation loss = 0.19825777411460876
Validation loss = 0.1936214566230774
Validation loss = 0.19314314424991608
Validation loss = 0.19431817531585693
Validation loss = 0.20126637816429138
Validation loss = 0.19692584872245789
Validation loss = 0.19710183143615723
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22210995852947235
Validation loss = 0.20782741904258728
Validation loss = 0.19313271343708038
Validation loss = 0.1981363296508789
Validation loss = 0.2044115513563156
Validation loss = 0.20238053798675537
Validation loss = 0.1944618970155716
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.19843032956123352
Validation loss = 0.18042874336242676
Validation loss = 0.19098517298698425
Validation loss = 0.17592501640319824
Validation loss = 0.18247312307357788
Validation loss = 0.18344631791114807
Validation loss = 0.18077746033668518
Validation loss = 0.18450307846069336
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2091345638036728
Validation loss = 0.19807744026184082
Validation loss = 0.19867247343063354
Validation loss = 0.20147964358329773
Validation loss = 0.19994652271270752
Validation loss = 0.20932021737098694
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20517316460609436
Validation loss = 0.19229985773563385
Validation loss = 0.1947154402732849
Validation loss = 0.1947135031223297
Validation loss = 0.18811959028244019
Validation loss = 0.19354361295700073
Validation loss = 0.19184711575508118
Validation loss = 0.19105267524719238
Validation loss = 0.19835534691810608
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 366
average number of affinization = 289.52
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 293.96153846153845
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 386
average number of affinization = 297.3703703703704
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 300.57142857142856
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 303.7931034482759
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 306.8666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 204       |
| Iteration     | 3         |
| MaximumReturn | 906       |
| MinimumReturn | -2.03e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22391434013843536
Validation loss = 0.23474068939685822
Validation loss = 0.22472162544727325
Validation loss = 0.24501129984855652
Validation loss = 0.2481069564819336
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.22889837622642517
Validation loss = 0.23927518725395203
Validation loss = 0.26625603437423706
Validation loss = 0.23984691500663757
Validation loss = 0.2487909346818924
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21212002635002136
Validation loss = 0.21932777762413025
Validation loss = 0.27704617381095886
Validation loss = 0.23232798278331757
Validation loss = 0.24024388194084167
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23266008496284485
Validation loss = 0.25506365299224854
Validation loss = 0.24634654819965363
Validation loss = 0.2548152208328247
Validation loss = 0.2500268816947937
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.22054210305213928
Validation loss = 0.22936753928661346
Validation loss = 0.22582419216632843
Validation loss = 0.22736239433288574
Validation loss = 0.23141565918922424
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 308.6774193548387
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 408
average number of affinization = 311.78125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 313.6666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 315.55882352941177
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 283
average number of affinization = 314.62857142857143
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 317.0
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -574      |
| Iteration     | 4         |
| MaximumReturn | 1.39e+03  |
| MinimumReturn | -2.59e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20516328513622284
Validation loss = 0.1962505578994751
Validation loss = 0.20184649527072906
Validation loss = 0.226432204246521
Validation loss = 0.21015942096710205
Validation loss = 0.20930272340774536
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2097119837999344
Validation loss = 0.21138179302215576
Validation loss = 0.18811620771884918
Validation loss = 0.20369207859039307
Validation loss = 0.1920342594385147
Validation loss = 0.20042647421360016
Validation loss = 0.19402168691158295
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20888389647006989
Validation loss = 0.20229290425777435
Validation loss = 0.2179383784532547
Validation loss = 0.21498467028141022
Validation loss = 0.2170182466506958
Validation loss = 0.22534245252609253
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2264062613248825
Validation loss = 0.20455874502658844
Validation loss = 0.21342594921588898
Validation loss = 0.2086748331785202
Validation loss = 0.21089279651641846
Validation loss = 0.20094998180866241
Validation loss = 0.21478714048862457
Validation loss = 0.2081959843635559
Validation loss = 0.21955911815166473
Validation loss = 0.19781170785427094
Validation loss = 0.1967269331216812
Validation loss = 0.2106803059577942
Validation loss = 0.20811940729618073
Validation loss = 0.1950550675392151
Validation loss = 0.20312856137752533
Validation loss = 0.19547319412231445
Validation loss = 0.20171116292476654
Validation loss = 0.19394290447235107
Validation loss = 0.19838498532772064
Validation loss = 0.20032984018325806
Validation loss = 0.19189584255218506
Validation loss = 0.19004637002944946
Validation loss = 0.19452603161334991
Validation loss = 0.19700370728969574
Validation loss = 0.18280571699142456
Validation loss = 0.21484588086605072
Validation loss = 0.18655140697956085
Validation loss = 0.18881408870220184
Validation loss = 0.17923597991466522
Validation loss = 0.1811169981956482
Validation loss = 0.1843637228012085
Validation loss = 0.1791345477104187
Validation loss = 0.19389458000659943
Validation loss = 0.17865251004695892
Validation loss = 0.17600692808628082
Validation loss = 0.1775774359703064
Validation loss = 0.19075268507003784
Validation loss = 0.18729449808597565
Validation loss = 0.18739815056324005
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20198474824428558
Validation loss = 0.2041991949081421
Validation loss = 0.19297415018081665
Validation loss = 0.1980670839548111
Validation loss = 0.191679909825325
Validation loss = 0.1990555077791214
Validation loss = 0.1910257339477539
Validation loss = 0.19237232208251953
Validation loss = 0.20271944999694824
Validation loss = 0.19938404858112335
Validation loss = 0.20546980202198029
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 339
average number of affinization = 317.5945945945946
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 423
average number of affinization = 320.36842105263156
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 322.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 412
average number of affinization = 324.25
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 326.0975609756098
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 328.1904761904762
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -165     |
| Iteration     | 5        |
| MaximumReturn | 1.03e+03 |
| MinimumReturn | -786     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1797778606414795
Validation loss = 0.17319808900356293
Validation loss = 0.16440889239311218
Validation loss = 0.1797211468219757
Validation loss = 0.17814533412456512
Validation loss = 0.18777014315128326
Validation loss = 0.17959049344062805
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16953788697719574
Validation loss = 0.17306171357631683
Validation loss = 0.16562940180301666
Validation loss = 0.16704070568084717
Validation loss = 0.16663730144500732
Validation loss = 0.16392548382282257
Validation loss = 0.16634903848171234
Validation loss = 0.17462371289730072
Validation loss = 0.16778385639190674
Validation loss = 0.17231614887714386
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20527182519435883
Validation loss = 0.1999736875295639
Validation loss = 0.21686652302742004
Validation loss = 0.18734309077262878
Validation loss = 0.21103310585021973
Validation loss = 0.19122926890850067
Validation loss = 0.190949484705925
Validation loss = 0.2076716423034668
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17561200261116028
Validation loss = 0.16040894389152527
Validation loss = 0.1568296104669571
Validation loss = 0.15408246219158173
Validation loss = 0.15502728521823883
Validation loss = 0.15253908932209015
Validation loss = 0.15326818823814392
Validation loss = 0.1565534919500351
Validation loss = 0.1536346971988678
Validation loss = 0.1539435088634491
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1763218194246292
Validation loss = 0.19097527861595154
Validation loss = 0.17774440348148346
Validation loss = 0.16781459748744965
Validation loss = 0.1703425496816635
Validation loss = 0.17887234687805176
Validation loss = 0.1711173802614212
Validation loss = 0.16848230361938477
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 329.9767441860465
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 333.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 353
average number of affinization = 333.44444444444446
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 336.4347826086956
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 338.1063829787234
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 330
average number of affinization = 337.9375
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 464      |
| Iteration     | 6        |
| MaximumReturn | 1.35e+03 |
| MinimumReturn | -429     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17204107344150543
Validation loss = 0.1538364291191101
Validation loss = 0.14619800448417664
Validation loss = 0.152031809091568
Validation loss = 0.15429559350013733
Validation loss = 0.15143775939941406
Validation loss = 0.1484794318675995
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16699814796447754
Validation loss = 0.14943884313106537
Validation loss = 0.1495523452758789
Validation loss = 0.15742316842079163
Validation loss = 0.15363553166389465
Validation loss = 0.15740056335926056
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17537513375282288
Validation loss = 0.19451534748077393
Validation loss = 0.1628759652376175
Validation loss = 0.17766636610031128
Validation loss = 0.17331595718860626
Validation loss = 0.16221724450588226
Validation loss = 0.17161250114440918
Validation loss = 0.16211727261543274
Validation loss = 0.15984183549880981
Validation loss = 0.17370370030403137
Validation loss = 0.1605755090713501
Validation loss = 0.15687815845012665
Validation loss = 0.1537318080663681
Validation loss = 0.16678625345230103
Validation loss = 0.17808480560779572
Validation loss = 0.17494243383407593
Validation loss = 0.1632416695356369
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.17301860451698303
Validation loss = 0.15239448845386505
Validation loss = 0.14709576964378357
Validation loss = 0.13909463584423065
Validation loss = 0.14170455932617188
Validation loss = 0.14365337789058685
Validation loss = 0.14999745786190033
Validation loss = 0.1422787606716156
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.19065922498703003
Validation loss = 0.16280575096607208
Validation loss = 0.1479850709438324
Validation loss = 0.16471095383167267
Validation loss = 0.1550440639257431
Validation loss = 0.15877646207809448
Validation loss = 0.15579108893871307
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 339.14285714285717
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 182
average number of affinization = 336.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 336.88235294117646
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 338.03846153846155
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 339.20754716981133
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 386
average number of affinization = 340.0740740740741
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 1.02e+03  |
| Iteration     | 7         |
| MaximumReturn | 1.93e+03  |
| MinimumReturn | -1.24e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16112856566905975
Validation loss = 0.1416720747947693
Validation loss = 0.13195112347602844
Validation loss = 0.1312825232744217
Validation loss = 0.13453428447246552
Validation loss = 0.13780266046524048
Validation loss = 0.12203618139028549
Validation loss = 0.12765832245349884
Validation loss = 0.125394806265831
Validation loss = 0.12126123905181885
Validation loss = 0.1265943944454193
Validation loss = 0.11958948522806168
Validation loss = 0.12102922052145004
Validation loss = 0.12948039174079895
Validation loss = 0.12467855960130692
Validation loss = 0.11728639155626297
Validation loss = 0.12237396836280823
Validation loss = 0.11801854521036148
Validation loss = 0.1327676773071289
Validation loss = 0.12298587709665298
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14461307227611542
Validation loss = 0.12914633750915527
Validation loss = 0.12435303628444672
Validation loss = 0.12633845210075378
Validation loss = 0.12039069086313248
Validation loss = 0.12273553013801575
Validation loss = 0.12577062845230103
Validation loss = 0.1337546557188034
Validation loss = 0.12050452828407288
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.14834684133529663
Validation loss = 0.12587852776050568
Validation loss = 0.12244289368391037
Validation loss = 0.12292417138814926
Validation loss = 0.1351117044687271
Validation loss = 0.13161736726760864
Validation loss = 0.1346428245306015
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.15153342485427856
Validation loss = 0.1254858821630478
Validation loss = 0.12377127259969711
Validation loss = 0.12920254468917847
Validation loss = 0.12285272032022476
Validation loss = 0.12918080389499664
Validation loss = 0.13606955111026764
Validation loss = 0.13090690970420837
Validation loss = 0.12913304567337036
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.15245015919208527
Validation loss = 0.12787798047065735
Validation loss = 0.12239424139261246
Validation loss = 0.12349280714988708
Validation loss = 0.11976524442434311
Validation loss = 0.13285914063453674
Validation loss = 0.12275245785713196
Validation loss = 0.13001614809036255
Validation loss = 0.1342417448759079
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 386
average number of affinization = 340.90909090909093
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 417
average number of affinization = 342.26785714285717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 343.7543859649123
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 314
average number of affinization = 343.2413793103448
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 390
average number of affinization = 344.03389830508473
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 345.21666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 868      |
| Iteration     | 8        |
| MaximumReturn | 1.53e+03 |
| MinimumReturn | -34.5    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13841357827186584
Validation loss = 0.11860199272632599
Validation loss = 0.10636202991008759
Validation loss = 0.1073697954416275
Validation loss = 0.10609928518533707
Validation loss = 0.10804905742406845
Validation loss = 0.10874912887811661
Validation loss = 0.11245711147785187
Validation loss = 0.1036512702703476
Validation loss = 0.10620679706335068
Validation loss = 0.10333476215600967
Validation loss = 0.10515809059143066
Validation loss = 0.11707083880901337
Validation loss = 0.10702934116125107
Validation loss = 0.09899778664112091
Validation loss = 0.10116682946681976
Validation loss = 0.10217952728271484
Validation loss = 0.10395529121160507
Validation loss = 0.11311353743076324
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1418343484401703
Validation loss = 0.1165439710021019
Validation loss = 0.1126759871840477
Validation loss = 0.11711243540048599
Validation loss = 0.11337421089410782
Validation loss = 0.10890118032693863
Validation loss = 0.11111684143543243
Validation loss = 0.12868599593639374
Validation loss = 0.11132873594760895
Validation loss = 0.10683542490005493
Validation loss = 0.10431840270757675
Validation loss = 0.10755070298910141
Validation loss = 0.10346285998821259
Validation loss = 0.10578566789627075
Validation loss = 0.10826486349105835
Validation loss = 0.11388297379016876
Validation loss = 0.10524275153875351
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13444273173809052
Validation loss = 0.12171069532632828
Validation loss = 0.12128086388111115
Validation loss = 0.11280570924282074
Validation loss = 0.1167990192770958
Validation loss = 0.11807046085596085
Validation loss = 0.11630801111459732
Validation loss = 0.12363342940807343
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12182489782571793
Validation loss = 0.1261918544769287
Validation loss = 0.1155373603105545
Validation loss = 0.11173602193593979
Validation loss = 0.11209896951913834
Validation loss = 0.11482478678226471
Validation loss = 0.12535321712493896
Validation loss = 0.12085839360952377
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13863080739974976
Validation loss = 0.11458225548267365
Validation loss = 0.11793694645166397
Validation loss = 0.1150638610124588
Validation loss = 0.11191464960575104
Validation loss = 0.11839596182107925
Validation loss = 0.11152269691228867
Validation loss = 0.10826144367456436
Validation loss = 0.10661928355693817
Validation loss = 0.12374953925609589
Validation loss = 0.10752119868993759
Validation loss = 0.10978861898183823
Validation loss = 0.10689061880111694
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 291
average number of affinization = 344.327868852459
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 308
average number of affinization = 343.741935483871
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 344.3809523809524
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 349
average number of affinization = 344.453125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 351
average number of affinization = 344.55384615384617
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 382
average number of affinization = 345.1212121212121
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.05e+03 |
| Iteration     | 9        |
| MaximumReturn | 2.27e+03 |
| MinimumReturn | 1.81e+03 |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12053921818733215
Validation loss = 0.10186386108398438
Validation loss = 0.09548034518957138
Validation loss = 0.0936182290315628
Validation loss = 0.0974176824092865
Validation loss = 0.09419818222522736
Validation loss = 0.1178751215338707
Validation loss = 0.09774952381849289
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11945075541734695
Validation loss = 0.1107860580086708
Validation loss = 0.10261349380016327
Validation loss = 0.10101378709077835
Validation loss = 0.10161226242780685
Validation loss = 0.11151731014251709
Validation loss = 0.10401137173175812
Validation loss = 0.10212040692567825
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1326809972524643
Validation loss = 0.11725308746099472
Validation loss = 0.11872882395982742
Validation loss = 0.10944440960884094
Validation loss = 0.11291725188493729
Validation loss = 0.11572963744401932
Validation loss = 0.11168838292360306
Validation loss = 0.12322366237640381
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11438818275928497
Validation loss = 0.11613916605710983
Validation loss = 0.10875137150287628
Validation loss = 0.11126293987035751
Validation loss = 0.10567517578601837
Validation loss = 0.11001286655664444
Validation loss = 0.11504188179969788
Validation loss = 0.10914530605077744
Validation loss = 0.1060275062918663
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.12127924710512161
Validation loss = 0.11060375720262527
Validation loss = 0.10684297233819962
Validation loss = 0.10648760944604874
Validation loss = 0.1084328442811966
Validation loss = 0.11364138871431351
Validation loss = 0.10785091668367386
Validation loss = 0.10879690945148468
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 345.089552238806
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 268
average number of affinization = 343.95588235294116
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 303
average number of affinization = 343.3623188405797
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 344.14285714285717
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 344.9718309859155
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 345.47222222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.5e+03  |
| Iteration     | 10       |
| MaximumReturn | 2.37e+03 |
| MinimumReturn | 200      |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11082256585359573
Validation loss = 0.10245867818593979
Validation loss = 0.09206906706094742
Validation loss = 0.08855080604553223
Validation loss = 0.08899310976266861
Validation loss = 0.08667502552270889
Validation loss = 0.09761103242635727
Validation loss = 0.08733785152435303
Validation loss = 0.09002687782049179
Validation loss = 0.0917302742600441
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10657878965139389
Validation loss = 0.10585925728082657
Validation loss = 0.09827146679162979
Validation loss = 0.09292415529489517
Validation loss = 0.09311523288488388
Validation loss = 0.10296809673309326
Validation loss = 0.09856199473142624
Validation loss = 0.09233454614877701
Validation loss = 0.09293099492788315
Validation loss = 0.0936284065246582
Validation loss = 0.10366463661193848
Validation loss = 0.09744397550821304
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13228435814380646
Validation loss = 0.10425246506929398
Validation loss = 0.10850018262863159
Validation loss = 0.09992886334657669
Validation loss = 0.10210955142974854
Validation loss = 0.10170817375183105
Validation loss = 0.09747949242591858
Validation loss = 0.10801342874765396
Validation loss = 0.10006826370954514
Validation loss = 0.09416501969099045
Validation loss = 0.09330013394355774
Validation loss = 0.09538823366165161
Validation loss = 0.11565031856298447
Validation loss = 0.11854948848485947
Validation loss = 0.09474829584360123
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1295834630727768
Validation loss = 0.09695980697870255
Validation loss = 0.09584220498800278
Validation loss = 0.09790434688329697
Validation loss = 0.10236912965774536
Validation loss = 0.09967353940010071
Validation loss = 0.0985596552491188
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11539182811975479
Validation loss = 0.09877262264490128
Validation loss = 0.10222754627466202
Validation loss = 0.1188802719116211
Validation loss = 0.10000761598348618
Validation loss = 0.10480397194623947
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 346.52054794520546
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 347.64864864864865
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 348.8666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 349.3157894736842
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 349.72727272727275
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 376
average number of affinization = 350.06410256410254
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.83e+03 |
| Iteration     | 11       |
| MaximumReturn | 2.29e+03 |
| MinimumReturn | 1.37e+03 |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09293440729379654
Validation loss = 0.09139823168516159
Validation loss = 0.08106300234794617
Validation loss = 0.08309096843004227
Validation loss = 0.08286435157060623
Validation loss = 0.09104737639427185
Validation loss = 0.07991944253444672
Validation loss = 0.0828232392668724
Validation loss = 0.0836741030216217
Validation loss = 0.09691484272480011
Validation loss = 0.09559161216020584
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09724516421556473
Validation loss = 0.09184630215167999
Validation loss = 0.088271364569664
Validation loss = 0.09162722527980804
Validation loss = 0.0879322737455368
Validation loss = 0.1095670536160469
Validation loss = 0.08875873684883118
Validation loss = 0.08568663150072098
Validation loss = 0.08629271388053894
Validation loss = 0.1003764346241951
Validation loss = 0.08779799193143845
Validation loss = 0.0874280035495758
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11145469546318054
Validation loss = 0.09488208591938019
Validation loss = 0.08863858133554459
Validation loss = 0.10047374665737152
Validation loss = 0.10109266638755798
Validation loss = 0.098109669983387
Validation loss = 0.09545257687568665
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10775242745876312
Validation loss = 0.0976082906126976
Validation loss = 0.09213928878307343
Validation loss = 0.08711646497249603
Validation loss = 0.08914607018232346
Validation loss = 0.1136971190571785
Validation loss = 0.09085202217102051
Validation loss = 0.09288401901721954
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.12695671617984772
Validation loss = 0.10638152062892914
Validation loss = 0.09961995482444763
Validation loss = 0.09127548336982727
Validation loss = 0.0911223366856575
Validation loss = 0.09134001284837723
Validation loss = 0.09124335646629333
Validation loss = 0.0893053263425827
Validation loss = 0.08749055862426758
Validation loss = 0.09425747394561768
Validation loss = 0.1005634069442749
Validation loss = 0.09172748029232025
Validation loss = 0.0926498994231224
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 386
average number of affinization = 350.5189873417722
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 351.1625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 435
average number of affinization = 352.1975308641975
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 352.7560975609756
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 353.28915662650604
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 345
average number of affinization = 353.1904761904762
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.5e+03  |
| Iteration     | 12       |
| MaximumReturn | 2.23e+03 |
| MinimumReturn | 1.17e+03 |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08992324024438858
Validation loss = 0.08143333345651627
Validation loss = 0.07573256641626358
Validation loss = 0.08240633457899094
Validation loss = 0.07768522202968597
Validation loss = 0.08070564270019531
Validation loss = 0.0856308788061142
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09833938628435135
Validation loss = 0.09063280373811722
Validation loss = 0.08651383966207504
Validation loss = 0.08602162450551987
Validation loss = 0.08210340887308121
Validation loss = 0.09941700845956802
Validation loss = 0.07743579894304276
Validation loss = 0.08106904476881027
Validation loss = 0.08080340921878815
Validation loss = 0.08229583501815796
Validation loss = 0.08801981061697006
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10644174367189407
Validation loss = 0.08620447665452957
Validation loss = 0.08485829085111618
Validation loss = 0.09258445352315903
Validation loss = 0.08625547587871552
Validation loss = 0.08713233470916748
Validation loss = 0.08783090859651566
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09229902923107147
Validation loss = 0.09522425383329391
Validation loss = 0.0885704755783081
Validation loss = 0.09144525974988937
Validation loss = 0.0900842696428299
Validation loss = 0.08908682316541672
Validation loss = 0.08562974631786346
Validation loss = 0.08566081523895264
Validation loss = 0.08363235741853714
Validation loss = 0.0853472352027893
Validation loss = 0.09624029695987701
Validation loss = 0.08596684038639069
Validation loss = 0.08448154479265213
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09617968648672104
Validation loss = 0.08704490214586258
Validation loss = 0.09534751623868942
Validation loss = 0.09981364011764526
Validation loss = 0.08435887098312378
Validation loss = 0.08232015371322632
Validation loss = 0.0949089303612709
Validation loss = 0.08339287340641022
Validation loss = 0.08421950787305832
Validation loss = 0.0860600471496582
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 340
average number of affinization = 353.0352941176471
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 306
average number of affinization = 352.48837209302326
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 352.735632183908
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 353.34090909090907
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 353.4943820224719
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 337
average number of affinization = 353.31111111111113
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.52e+03 |
| Iteration     | 13       |
| MaximumReturn | 1.9e+03  |
| MinimumReturn | 977      |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09491333365440369
Validation loss = 0.07520069926977158
Validation loss = 0.07474463433027267
Validation loss = 0.07266063243150711
Validation loss = 0.08513064682483673
Validation loss = 0.07659831643104553
Validation loss = 0.07311245054006577
Validation loss = 0.0771031454205513
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09441716223955154
Validation loss = 0.08513674885034561
Validation loss = 0.0807027593255043
Validation loss = 0.07952642440795898
Validation loss = 0.07804490625858307
Validation loss = 0.09255826473236084
Validation loss = 0.07893065363168716
Validation loss = 0.07715962082147598
Validation loss = 0.09668286889791489
Validation loss = 0.0763665959239006
Validation loss = 0.07838468998670578
Validation loss = 0.07995564490556717
Validation loss = 0.08808405697345734
Validation loss = 0.07856485992670059
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09463665634393692
Validation loss = 0.08824099600315094
Validation loss = 0.08366701751947403
Validation loss = 0.08286676555871964
Validation loss = 0.10124251246452332
Validation loss = 0.08276833593845367
Validation loss = 0.0807708278298378
Validation loss = 0.08147363364696503
Validation loss = 0.08210495859384537
Validation loss = 0.09677925705909729
Validation loss = 0.08144626021385193
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09856303781270981
Validation loss = 0.08507764339447021
Validation loss = 0.08987191319465637
Validation loss = 0.07934261858463287
Validation loss = 0.08791722357273102
Validation loss = 0.08075711876153946
Validation loss = 0.08323414623737335
Validation loss = 0.08114912360906601
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09034906327724457
Validation loss = 0.08841542899608612
Validation loss = 0.08390399068593979
Validation loss = 0.10034123063087463
Validation loss = 0.08346640318632126
Validation loss = 0.08106108009815216
Validation loss = 0.09282825887203217
Validation loss = 0.08526841551065445
Validation loss = 0.0913916751742363
Validation loss = 0.08059081435203552
Validation loss = 0.07996509969234467
Validation loss = 0.08800684660673141
Validation loss = 0.08529483526945114
Validation loss = 0.0777219757437706
Validation loss = 0.08447034657001495
Validation loss = 0.09059007465839386
Validation loss = 0.07691485434770584
Validation loss = 0.09771852940320969
Validation loss = 0.07904554158449173
Validation loss = 0.08267436921596527
Validation loss = 0.08208664506673813
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 181
average number of affinization = 351.4175824175824
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 370
average number of affinization = 351.6195652173913
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 375
average number of affinization = 351.8709677419355
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 348
average number of affinization = 351.82978723404256
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 294
average number of affinization = 351.22105263157897
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 351.9166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.43e+03 |
| Iteration     | 14       |
| MaximumReturn | 2.13e+03 |
| MinimumReturn | -454     |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0892074704170227
Validation loss = 0.07104035466909409
Validation loss = 0.07081949710845947
Validation loss = 0.0765773355960846
Validation loss = 0.07111933082342148
Validation loss = 0.07095392048358917
Validation loss = 0.07718007266521454
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0873502790927887
Validation loss = 0.072137251496315
Validation loss = 0.07026514410972595
Validation loss = 0.08142025023698807
Validation loss = 0.09767856448888779
Validation loss = 0.07285261154174805
Validation loss = 0.078407883644104
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08422056585550308
Validation loss = 0.07920314371585846
Validation loss = 0.07693957537412643
Validation loss = 0.07885369658470154
Validation loss = 0.07951012253761292
Validation loss = 0.07829301804304123
Validation loss = 0.07203122228384018
Validation loss = 0.07605613768100739
Validation loss = 0.07451210916042328
Validation loss = 0.08466649055480957
Validation loss = 0.07412425428628922
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10934799909591675
Validation loss = 0.07840392738580704
Validation loss = 0.07622207701206207
Validation loss = 0.0721050500869751
Validation loss = 0.07708661258220673
Validation loss = 0.07629664242267609
Validation loss = 0.07506155222654343
Validation loss = 0.07467957586050034
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09161543846130371
Validation loss = 0.08018150925636292
Validation loss = 0.07310909777879715
Validation loss = 0.07136595249176025
Validation loss = 0.08197882771492004
Validation loss = 0.07914674282073975
Validation loss = 0.07170024514198303
Validation loss = 0.07213280349969864
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 328
average number of affinization = 351.6701030927835
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 352.05102040816325
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 409
average number of affinization = 352.62626262626264
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 332
average number of affinization = 352.42
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 337
average number of affinization = 352.26732673267327
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 352.6078431372549
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.84e+03 |
| Iteration     | 15       |
| MaximumReturn | 2.42e+03 |
| MinimumReturn | 1.5e+03  |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08559700846672058
Validation loss = 0.07237180322408676
Validation loss = 0.0756785050034523
Validation loss = 0.06818397343158722
Validation loss = 0.0707780197262764
Validation loss = 0.06804070621728897
Validation loss = 0.0690477043390274
Validation loss = 0.06732659786939621
Validation loss = 0.07005664706230164
Validation loss = 0.0709892213344574
Validation loss = 0.06649716198444366
Validation loss = 0.06711366027593613
Validation loss = 0.07037440687417984
Validation loss = 0.07049620151519775
Validation loss = 0.06782155483961105
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07831840217113495
Validation loss = 0.07574532926082611
Validation loss = 0.07790496200323105
Validation loss = 0.07205963879823685
Validation loss = 0.07066339254379272
Validation loss = 0.07624547183513641
Validation loss = 0.0826951265335083
Validation loss = 0.07150091230869293
Validation loss = 0.06944525986909866
Validation loss = 0.0795540139079094
Validation loss = 0.07170919328927994
Validation loss = 0.07195786386728287
Validation loss = 0.06679970026016235
Validation loss = 0.07090640068054199
Validation loss = 0.06896930187940598
Validation loss = 0.06853974610567093
Validation loss = 0.07262282818555832
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08732227981090546
Validation loss = 0.07324856519699097
Validation loss = 0.07079952955245972
Validation loss = 0.07405796647071838
Validation loss = 0.07300866395235062
Validation loss = 0.07792167365550995
Validation loss = 0.07051485031843185
Validation loss = 0.07063973695039749
Validation loss = 0.0717848390340805
Validation loss = 0.08584120124578476
Validation loss = 0.07557133585214615
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09211385995149612
Validation loss = 0.0794704407453537
Validation loss = 0.0756310224533081
Validation loss = 0.0698956847190857
Validation loss = 0.0709061548113823
Validation loss = 0.08045489341020584
Validation loss = 0.07241234928369522
Validation loss = 0.07374804466962814
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08182636648416519
Validation loss = 0.0734023004770279
Validation loss = 0.06969621032476425
Validation loss = 0.06868883967399597
Validation loss = 0.07035116851329803
Validation loss = 0.0737789198756218
Validation loss = 0.07671374082565308
Validation loss = 0.06910646706819534
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 353.00970873786406
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 353.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 342
average number of affinization = 353.14285714285717
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 366
average number of affinization = 353.2641509433962
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 390
average number of affinization = 353.60747663551405
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 301
average number of affinization = 353.1203703703704
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.46e+03 |
| Iteration     | 16       |
| MaximumReturn | 1.67e+03 |
| MinimumReturn | 1.31e+03 |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07418013364076614
Validation loss = 0.06484074890613556
Validation loss = 0.06409652531147003
Validation loss = 0.06674578785896301
Validation loss = 0.06297311931848526
Validation loss = 0.06704342365264893
Validation loss = 0.06532252579927444
Validation loss = 0.06915868073701859
Validation loss = 0.06338077038526535
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08547982573509216
Validation loss = 0.06590192019939423
Validation loss = 0.06508743762969971
Validation loss = 0.0965522825717926
Validation loss = 0.07173217833042145
Validation loss = 0.0669018030166626
Validation loss = 0.06651904433965683
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0771641805768013
Validation loss = 0.06921811401844025
Validation loss = 0.06814548373222351
Validation loss = 0.0702010914683342
Validation loss = 0.0769587978720665
Validation loss = 0.06804683804512024
Validation loss = 0.06888091564178467
Validation loss = 0.07104850560426712
Validation loss = 0.06826261430978775
Validation loss = 0.07571056485176086
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07136781513690948
Validation loss = 0.06855727732181549
Validation loss = 0.06995157897472382
Validation loss = 0.07032552361488342
Validation loss = 0.06659049540758133
Validation loss = 0.07662811130285263
Validation loss = 0.07003994286060333
Validation loss = 0.06583938002586365
Validation loss = 0.07123763114213943
Validation loss = 0.06892713904380798
Validation loss = 0.06686777621507645
Validation loss = 0.06973998248577118
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09708461165428162
Validation loss = 0.07104701548814774
Validation loss = 0.0683978870511055
Validation loss = 0.06816407293081284
Validation loss = 0.06829511374235153
Validation loss = 0.06589372456073761
Validation loss = 0.07045819610357285
Validation loss = 0.08177635073661804
Validation loss = 0.06859001517295837
Validation loss = 0.06526196002960205
Validation loss = 0.07679212093353271
Validation loss = 0.08355715125799179
Validation loss = 0.0649707242846489
Validation loss = 0.06945332139730453
Validation loss = 0.07296160608530045
Validation loss = 0.06743121147155762
Validation loss = 0.06909087300300598
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 319
average number of affinization = 352.8073394495413
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 426
average number of affinization = 353.4727272727273
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 322
average number of affinization = 353.18918918918916
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 382
average number of affinization = 353.44642857142856
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 358
average number of affinization = 353.4867256637168
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 370
average number of affinization = 353.63157894736844
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.69e+03 |
| Iteration     | 17       |
| MaximumReturn | 2.56e+03 |
| MinimumReturn | 82.9     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07084200531244278
Validation loss = 0.06361288577318192
Validation loss = 0.06455718725919724
Validation loss = 0.06286267936229706
Validation loss = 0.07319306582212448
Validation loss = 0.059056952595710754
Validation loss = 0.059777211397886276
Validation loss = 0.06415233761072159
Validation loss = 0.06468244642019272
Validation loss = 0.05909737944602966
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0730644091963768
Validation loss = 0.06354426592588425
Validation loss = 0.06230898201465607
Validation loss = 0.06657581031322479
Validation loss = 0.06950033456087112
Validation loss = 0.06320194154977798
Validation loss = 0.06261172145605087
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07764533907175064
Validation loss = 0.06476900726556778
Validation loss = 0.06369702517986298
Validation loss = 0.08180338144302368
Validation loss = 0.06525757908821106
Validation loss = 0.06475316733121872
Validation loss = 0.06673217564821243
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07452689856290817
Validation loss = 0.06752115488052368
Validation loss = 0.06264206022024155
Validation loss = 0.06247140094637871
Validation loss = 0.06435097008943558
Validation loss = 0.06894014030694962
Validation loss = 0.06944658607244492
Validation loss = 0.06377582997083664
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08576101064682007
Validation loss = 0.06837731599807739
Validation loss = 0.06454140692949295
Validation loss = 0.06722614914178848
Validation loss = 0.0682448148727417
Validation loss = 0.06224794685840607
Validation loss = 0.06152981147170067
Validation loss = 0.06553252041339874
Validation loss = 0.06681709736585617
Validation loss = 0.06442424654960632
Validation loss = 0.0679287388920784
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 390
average number of affinization = 353.9478260869565
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 435
average number of affinization = 354.6465517241379
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 354.9316239316239
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 355.03389830508473
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 355.2689075630252
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 437
average number of affinization = 355.95
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.88e+03 |
| Iteration     | 18       |
| MaximumReturn | 2.23e+03 |
| MinimumReturn | 1.44e+03 |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06670472025871277
Validation loss = 0.05956654995679855
Validation loss = 0.06430657207965851
Validation loss = 0.057797063142061234
Validation loss = 0.06028778478503227
Validation loss = 0.059138037264347076
Validation loss = 0.06646616756916046
Validation loss = 0.06986989080905914
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10340654850006104
Validation loss = 0.06361861526966095
Validation loss = 0.06469234079122543
Validation loss = 0.07631387561559677
Validation loss = 0.061952970921993256
Validation loss = 0.06064262241125107
Validation loss = 0.06205970048904419
Validation loss = 0.06312673538923264
Validation loss = 0.06192482262849808
Validation loss = 0.0609058253467083
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08067803084850311
Validation loss = 0.06251639872789383
Validation loss = 0.0675969198346138
Validation loss = 0.06192915514111519
Validation loss = 0.06647449731826782
Validation loss = 0.06568185240030289
Validation loss = 0.07583324611186981
Validation loss = 0.06137629598379135
Validation loss = 0.06404127925634384
Validation loss = 0.06485471874475479
Validation loss = 0.06597144901752472
Validation loss = 0.06666137278079987
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07520431280136108
Validation loss = 0.06329882144927979
Validation loss = 0.06064885854721069
Validation loss = 0.07343193143606186
Validation loss = 0.06344030797481537
Validation loss = 0.06886854767799377
Validation loss = 0.062026239931583405
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.0734882578253746
Validation loss = 0.0670844092965126
Validation loss = 0.06321867555379868
Validation loss = 0.0625673308968544
Validation loss = 0.06259816139936447
Validation loss = 0.06668083369731903
Validation loss = 0.0623929426074028
Validation loss = 0.06990344077348709
Validation loss = 0.06523986160755157
Validation loss = 0.06116172671318054
Validation loss = 0.06345353275537491
Validation loss = 0.061658523976802826
Validation loss = 0.06305430829524994
Validation loss = 0.06303026527166367
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 356.3057851239669
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 356.6885245901639
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 423
average number of affinization = 357.2276422764228
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 357.741935483871
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 326
average number of affinization = 357.488
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 357.93650793650795
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.01e+03 |
| Iteration     | 19       |
| MaximumReturn | 2.33e+03 |
| MinimumReturn | 1.64e+03 |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06455832719802856
Validation loss = 0.05727621540427208
Validation loss = 0.057393841445446014
Validation loss = 0.05907391756772995
Validation loss = 0.05704209953546524
Validation loss = 0.060928184539079666
Validation loss = 0.05804689601063728
Validation loss = 0.05546118691563606
Validation loss = 0.057021044194698334
Validation loss = 0.08603483438491821
Validation loss = 0.054363347589969635
Validation loss = 0.056676626205444336
Validation loss = 0.06413538008928299
Validation loss = 0.05659659206867218
Validation loss = 0.05334246903657913
Validation loss = 0.05558082461357117
Validation loss = 0.06234350800514221
Validation loss = 0.053456708788871765
Validation loss = 0.05384758487343788
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06818439066410065
Validation loss = 0.05964169651269913
Validation loss = 0.058262698352336884
Validation loss = 0.06332007795572281
Validation loss = 0.061832245439291
Validation loss = 0.06108611449599266
Validation loss = 0.06102839484810829
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06699951738119125
Validation loss = 0.06029556319117546
Validation loss = 0.059224843978881836
Validation loss = 0.06248093768954277
Validation loss = 0.06145564094185829
Validation loss = 0.05932950973510742
Validation loss = 0.061212971806526184
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07025780528783798
Validation loss = 0.0681685134768486
Validation loss = 0.05979670211672783
Validation loss = 0.06044536083936691
Validation loss = 0.07294964045286179
Validation loss = 0.06239704415202141
Validation loss = 0.061075739562511444
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07515696436166763
Validation loss = 0.06002265587449074
Validation loss = 0.0622580349445343
Validation loss = 0.05681801959872246
Validation loss = 0.05904208868741989
Validation loss = 0.06500531733036041
Validation loss = 0.056944340467453
Validation loss = 0.05680606886744499
Validation loss = 0.05536207556724548
Validation loss = 0.06350580602884293
Validation loss = 0.05623145401477814
Validation loss = 0.05943317711353302
Validation loss = 0.056871045380830765
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 371
average number of affinization = 358.03937007874015
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 358.5390625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 358.8837209302326
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 359.0692307692308
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 359.2977099236641
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 359.47727272727275
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.98e+03 |
| Iteration     | 20       |
| MaximumReturn | 2.42e+03 |
| MinimumReturn | 1.06e+03 |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06163323298096657
Validation loss = 0.05532747134566307
Validation loss = 0.05365370213985443
Validation loss = 0.056809280067682266
Validation loss = 0.06136513128876686
Validation loss = 0.05374482646584511
Validation loss = 0.05467561259865761
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06621741503477097
Validation loss = 0.06043194234371185
Validation loss = 0.06091799959540367
Validation loss = 0.05659809336066246
Validation loss = 0.06285995990037918
Validation loss = 0.059508420526981354
Validation loss = 0.05714896693825722
Validation loss = 0.06170424446463585
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0801725909113884
Validation loss = 0.06942783296108246
Validation loss = 0.0599842295050621
Validation loss = 0.06101500988006592
Validation loss = 0.07034698128700256
Validation loss = 0.059273578226566315
Validation loss = 0.06019219383597374
Validation loss = 0.06050962209701538
Validation loss = 0.061847154051065445
Validation loss = 0.06472212076187134
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06966438889503479
Validation loss = 0.06083785369992256
Validation loss = 0.05912238731980324
Validation loss = 0.06141318380832672
Validation loss = 0.059003207832574844
Validation loss = 0.06435311585664749
Validation loss = 0.06362304091453552
Validation loss = 0.060625217854976654
Validation loss = 0.06549609452486038
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07238492369651794
Validation loss = 0.05497486889362335
Validation loss = 0.05765772983431816
Validation loss = 0.06242179870605469
Validation loss = 0.05859696492552757
Validation loss = 0.05605268105864525
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 359.82706766917295
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 360.2089552238806
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 417
average number of affinization = 360.6296296296296
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 364
average number of affinization = 360.65441176470586
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 361.0729927007299
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 361.3478260869565
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.89e+03 |
| Iteration     | 21       |
| MaximumReturn | 2.43e+03 |
| MinimumReturn | 1.53e+03 |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06106986105442047
Validation loss = 0.05261408910155296
Validation loss = 0.05536720156669617
Validation loss = 0.053442224860191345
Validation loss = 0.05236499384045601
Validation loss = 0.054060712456703186
Validation loss = 0.05742530897259712
Validation loss = 0.05078902468085289
Validation loss = 0.056889086961746216
Validation loss = 0.0529630072414875
Validation loss = 0.05032142251729965
Validation loss = 0.0563209168612957
Validation loss = 0.05126732960343361
Validation loss = 0.0513586699962616
Validation loss = 0.050458695739507675
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05983889102935791
Validation loss = 0.0570121705532074
Validation loss = 0.05640901252627373
Validation loss = 0.06430269032716751
Validation loss = 0.05534357577562332
Validation loss = 0.05654614418745041
Validation loss = 0.07180558890104294
Validation loss = 0.05809316784143448
Validation loss = 0.05681074783205986
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06894057989120483
Validation loss = 0.056077953428030014
Validation loss = 0.05634988471865654
Validation loss = 0.06559650599956512
Validation loss = 0.05445168912410736
Validation loss = 0.05448737367987633
Validation loss = 0.056731294840574265
Validation loss = 0.05616510286927223
Validation loss = 0.05393550172448158
Validation loss = 0.05895702913403511
Validation loss = 0.06263475120067596
Validation loss = 0.05811226740479469
Validation loss = 0.053689081221818924
Validation loss = 0.05401197820901871
Validation loss = 0.06059180200099945
Validation loss = 0.056473508477211
Validation loss = 0.05520904064178467
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06488534063100815
Validation loss = 0.057989902794361115
Validation loss = 0.0588596947491169
Validation loss = 0.05422221124172211
Validation loss = 0.05591112747788429
Validation loss = 0.05756213515996933
Validation loss = 0.05674036219716072
Validation loss = 0.0563814640045166
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06286755204200745
Validation loss = 0.05982357636094093
Validation loss = 0.055495355278253555
Validation loss = 0.054737869650125504
Validation loss = 0.06185828149318695
Validation loss = 0.061182089149951935
Validation loss = 0.05423450469970703
Validation loss = 0.05507343262434006
Validation loss = 0.06413943320512772
Validation loss = 0.05384610965847969
Validation loss = 0.055586930364370346
Validation loss = 0.05573835223913193
Validation loss = 0.07005448639392853
Validation loss = 0.05439268425107002
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 355
average number of affinization = 361.3021582733813
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 361.59285714285716
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 413
average number of affinization = 361.9574468085106
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 362.40140845070425
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 362.65734265734267
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 420
average number of affinization = 363.05555555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.92e+03 |
| Iteration     | 22       |
| MaximumReturn | 2.66e+03 |
| MinimumReturn | 795      |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0612574927508831
Validation loss = 0.052185703068971634
Validation loss = 0.05124059319496155
Validation loss = 0.054349496960639954
Validation loss = 0.05209682881832123
Validation loss = 0.05675005912780762
Validation loss = 0.050684500485658646
Validation loss = 0.050713181495666504
Validation loss = 0.05688519403338432
Validation loss = 0.050186920911073685
Validation loss = 0.05295247212052345
Validation loss = 0.05131106078624725
Validation loss = 0.06904395669698715
Validation loss = 0.05103705823421478
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06316038221120834
Validation loss = 0.05446087196469307
Validation loss = 0.057362962514162064
Validation loss = 0.05504859611392021
Validation loss = 0.06243164837360382
Validation loss = 0.05768013000488281
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06212780252099037
Validation loss = 0.053119808435440063
Validation loss = 0.055830370634794235
Validation loss = 0.05403662100434303
Validation loss = 0.052492279559373856
Validation loss = 0.0529116690158844
Validation loss = 0.05574159696698189
Validation loss = 0.05183413624763489
Validation loss = 0.054031167179346085
Validation loss = 0.054276298731565475
Validation loss = 0.05431198701262474
Validation loss = 0.054548412561416626
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.058423686772584915
Validation loss = 0.05548788979649544
Validation loss = 0.056269049644470215
Validation loss = 0.06664499640464783
Validation loss = 0.05630980804562569
Validation loss = 0.05405217781662941
Validation loss = 0.057360440492630005
Validation loss = 0.05695425346493721
Validation loss = 0.05744262412190437
Validation loss = 0.05518806353211403
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05494849011301994
Validation loss = 0.052905578166246414
Validation loss = 0.057289008051157
Validation loss = 0.054617270827293396
Validation loss = 0.05170939490199089
Validation loss = 0.061657678335905075
Validation loss = 0.05543288215994835
Validation loss = 0.05871725082397461
Validation loss = 0.05669068172574043
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 356
average number of affinization = 363.00689655172414
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 363.2054794520548
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 363.4421768707483
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 363.6621621621622
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 363.86577181208054
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 363.86
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.49e+03 |
| Iteration     | 23       |
| MaximumReturn | 2.64e+03 |
| MinimumReturn | 2.35e+03 |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05971992015838623
Validation loss = 0.05038325861096382
Validation loss = 0.05146530643105507
Validation loss = 0.05070807784795761
Validation loss = 0.050194740295410156
Validation loss = 0.05005941540002823
Validation loss = 0.05214698240160942
Validation loss = 0.05242203548550606
Validation loss = 0.048908721655607224
Validation loss = 0.05131169408559799
Validation loss = 0.049636583775281906
Validation loss = 0.04872564226388931
Validation loss = 0.04819769412279129
Validation loss = 0.05041305720806122
Validation loss = 0.0513259693980217
Validation loss = 0.051545556634664536
Validation loss = 0.05131276696920395
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.060845404863357544
Validation loss = 0.061064478009939194
Validation loss = 0.05343393236398697
Validation loss = 0.0649310052394867
Validation loss = 0.05559011921286583
Validation loss = 0.05211621895432472
Validation loss = 0.055059995502233505
Validation loss = 0.05269060283899307
Validation loss = 0.055611394345760345
Validation loss = 0.08948534727096558
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05923737213015556
Validation loss = 0.06031506508588791
Validation loss = 0.05064795911312103
Validation loss = 0.05107772350311279
Validation loss = 0.05328869819641113
Validation loss = 0.05904320254921913
Validation loss = 0.050769317895174026
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06907227635383606
Validation loss = 0.05299877002835274
Validation loss = 0.06404373049736023
Validation loss = 0.05983348190784454
Validation loss = 0.05388575419783592
Validation loss = 0.059164222329854965
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05568575859069824
Validation loss = 0.057057905942201614
Validation loss = 0.05151723325252533
Validation loss = 0.05158882960677147
Validation loss = 0.05767786502838135
Validation loss = 0.055097803473472595
Validation loss = 0.05025368183851242
Validation loss = 0.05890050530433655
Validation loss = 0.04967086389660835
Validation loss = 0.06648989766836166
Validation loss = 0.05093374103307724
Validation loss = 0.051420778036117554
Validation loss = 0.055716048926115036
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 364.26490066225165
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 364.7828947368421
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 444
average number of affinization = 365.30065359477123
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 365.7792207792208
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 366.18709677419355
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 385
average number of affinization = 366.3076923076923
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.13e+03 |
| Iteration     | 24       |
| MaximumReturn | 2.66e+03 |
| MinimumReturn | 1.42e+03 |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05052546411752701
Validation loss = 0.0496249757707119
Validation loss = 0.049326855689287186
Validation loss = 0.0531330369412899
Validation loss = 0.05060264840722084
Validation loss = 0.053283948451280594
Validation loss = 0.04888131469488144
Validation loss = 0.05298145115375519
Validation loss = 0.053466685116291046
Validation loss = 0.04855183884501457
Validation loss = 0.050831399857997894
Validation loss = 0.047732237726449966
Validation loss = 0.04761860519647598
Validation loss = 0.048547450453042984
Validation loss = 0.046623893082141876
Validation loss = 0.0484132319688797
Validation loss = 0.047001950442790985
Validation loss = 0.04846397787332535
Validation loss = 0.047596752643585205
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0617680586874485
Validation loss = 0.05363864079117775
Validation loss = 0.052572254091501236
Validation loss = 0.053801681846380234
Validation loss = 0.054546017199754715
Validation loss = 0.059180375188589096
Validation loss = 0.05869591236114502
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06101568043231964
Validation loss = 0.0533427968621254
Validation loss = 0.05059412866830826
Validation loss = 0.05118034780025482
Validation loss = 0.053891025483608246
Validation loss = 0.048465386033058167
Validation loss = 0.052291788160800934
Validation loss = 0.05116100609302521
Validation loss = 0.05274927243590355
Validation loss = 0.050980325788259506
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06268313527107239
Validation loss = 0.05315180867910385
Validation loss = 0.05112738162279129
Validation loss = 0.05791143327951431
Validation loss = 0.05504743382334709
Validation loss = 0.05513013154268265
Validation loss = 0.06562259793281555
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05228978767991066
Validation loss = 0.05065605789422989
Validation loss = 0.04997701197862625
Validation loss = 0.05364863574504852
Validation loss = 0.05139869451522827
Validation loss = 0.04988496005535126
Validation loss = 0.050213318318128586
Validation loss = 0.06229391321539879
Validation loss = 0.05011669546365738
Validation loss = 0.051040444523096085
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 366.45222929936307
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 366.45569620253167
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 359
average number of affinization = 366.40880503144655
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 366.48125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 346
average number of affinization = 366.35403726708074
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 366.75308641975306
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.38e+03 |
| Iteration     | 25       |
| MaximumReturn | 2.64e+03 |
| MinimumReturn | 1.87e+03 |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0562414675951004
Validation loss = 0.04693658649921417
Validation loss = 0.04864659160375595
Validation loss = 0.04775996506214142
Validation loss = 0.04981973022222519
Validation loss = 0.04920180141925812
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07017609477043152
Validation loss = 0.052621278911828995
Validation loss = 0.05448617786169052
Validation loss = 0.056696899235248566
Validation loss = 0.05287547409534454
Validation loss = 0.053975220769643784
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05865629017353058
Validation loss = 0.05092884972691536
Validation loss = 0.049155354499816895
Validation loss = 0.05107033625245094
Validation loss = 0.050374679267406464
Validation loss = 0.05496677756309509
Validation loss = 0.049123670905828476
Validation loss = 0.04955063760280609
Validation loss = 0.05428552255034447
Validation loss = 0.04872732236981392
Validation loss = 0.05133018270134926
Validation loss = 0.05034786835312843
Validation loss = 0.04996271803975105
Validation loss = 0.05250568315386772
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05854528397321701
Validation loss = 0.048283644020557404
Validation loss = 0.05432116240262985
Validation loss = 0.05280708521604538
Validation loss = 0.05327225476503372
Validation loss = 0.0501052625477314
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.057879481464624405
Validation loss = 0.049802105873823166
Validation loss = 0.052437905222177505
Validation loss = 0.051897723227739334
Validation loss = 0.048871953040361404
Validation loss = 0.04922078549861908
Validation loss = 0.05429631844162941
Validation loss = 0.04975821077823639
Validation loss = 0.049043696373701096
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 366.8834355828221
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 367.1829268292683
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 390
average number of affinization = 367.3212121212121
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 373
average number of affinization = 367.355421686747
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 367.6467065868263
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 308
average number of affinization = 367.2916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.14e+03 |
| Iteration     | 26       |
| MaximumReturn | 2.76e+03 |
| MinimumReturn | 985      |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05510880425572395
Validation loss = 0.0493028499186039
Validation loss = 0.046919193118810654
Validation loss = 0.049772895872592926
Validation loss = 0.05019064620137215
Validation loss = 0.0465693399310112
Validation loss = 0.046793870627880096
Validation loss = 0.04849349707365036
Validation loss = 0.04739709571003914
Validation loss = 0.05510980635881424
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.060283124446868896
Validation loss = 0.05160274729132652
Validation loss = 0.0506088063120842
Validation loss = 0.05360262840986252
Validation loss = 0.05273938924074173
Validation loss = 0.05363145098090172
Validation loss = 0.051631439477205276
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.053853947669267654
Validation loss = 0.04799247533082962
Validation loss = 0.05264971777796745
Validation loss = 0.051390767097473145
Validation loss = 0.04742930456995964
Validation loss = 0.05443120002746582
Validation loss = 0.0505097396671772
Validation loss = 0.054255690425634384
Validation loss = 0.0496300533413887
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05950099229812622
Validation loss = 0.050748564302921295
Validation loss = 0.05200716480612755
Validation loss = 0.04953858256340027
Validation loss = 0.053822170943021774
Validation loss = 0.049943070858716965
Validation loss = 0.05391773581504822
Validation loss = 0.049920838326215744
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05448078364133835
Validation loss = 0.05737050250172615
Validation loss = 0.04717938229441643
Validation loss = 0.05262557789683342
Validation loss = 0.04732273146510124
Validation loss = 0.04913371056318283
Validation loss = 0.04876730963587761
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 367.6094674556213
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 394
average number of affinization = 367.7647058823529
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 367.906432748538
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 339
average number of affinization = 367.73837209302326
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 367.77456647398844
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 367.97701149425285
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.94e+03 |
| Iteration     | 27       |
| MaximumReturn | 2.69e+03 |
| MinimumReturn | 1.06e+03 |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.053055088967084885
Validation loss = 0.04698309674859047
Validation loss = 0.04863214120268822
Validation loss = 0.04646854102611542
Validation loss = 0.04869425296783447
Validation loss = 0.04714401811361313
Validation loss = 0.0466235987842083
Validation loss = 0.045624446123838425
Validation loss = 0.04840730503201485
Validation loss = 0.04721095412969589
Validation loss = 0.0469944030046463
Validation loss = 0.05413338541984558
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0515822134912014
Validation loss = 0.05164102092385292
Validation loss = 0.04765789583325386
Validation loss = 0.05076019465923309
Validation loss = 0.05093497410416603
Validation loss = 0.05215199291706085
Validation loss = 0.04987882450222969
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05383303388953209
Validation loss = 0.04823373630642891
Validation loss = 0.04901420697569847
Validation loss = 0.04927017539739609
Validation loss = 0.053029801696538925
Validation loss = 0.055092524737119675
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.052360761910676956
Validation loss = 0.04803885519504547
Validation loss = 0.05175415426492691
Validation loss = 0.0496637299656868
Validation loss = 0.0506766103208065
Validation loss = 0.05232340842485428
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.050965163856744766
Validation loss = 0.046568114310503006
Validation loss = 0.0492841936647892
Validation loss = 0.04810936376452446
Validation loss = 0.04764506593346596
Validation loss = 0.04659947752952576
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 368.1542857142857
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 368.0113636363636
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 391
average number of affinization = 368.1412429378531
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 368.2977528089888
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 348
average number of affinization = 368.18435754189943
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 368.0444444444444
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.88e+03 |
| Iteration     | 28       |
| MaximumReturn | 2.55e+03 |
| MinimumReturn | 1.36e+03 |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.054878704249858856
Validation loss = 0.04644226282835007
Validation loss = 0.04734719172120094
Validation loss = 0.04659498855471611
Validation loss = 0.04613494873046875
Validation loss = 0.04717685282230377
Validation loss = 0.04812666028738022
Validation loss = 0.0471959263086319
Validation loss = 0.044125691056251526
Validation loss = 0.0797581747174263
Validation loss = 0.045442547649145126
Validation loss = 0.043924953788518906
Validation loss = 0.04826388135552406
Validation loss = 0.04516962543129921
Validation loss = 0.04449823871254921
Validation loss = 0.05695944279432297
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05795777961611748
Validation loss = 0.05179138481616974
Validation loss = 0.0474502332508564
Validation loss = 0.04913990944623947
Validation loss = 0.04950852319598198
Validation loss = 0.05137592926621437
Validation loss = 0.05968208611011505
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05985641852021217
Validation loss = 0.04746643453836441
Validation loss = 0.05027603358030319
Validation loss = 0.0463608056306839
Validation loss = 0.046501271426677704
Validation loss = 0.046777524054050446
Validation loss = 0.045660898089408875
Validation loss = 0.048900965601205826
Validation loss = 0.04800014942884445
Validation loss = 0.05057412385940552
Validation loss = 0.04971824213862419
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0536181665956974
Validation loss = 0.05315995216369629
Validation loss = 0.04918297752737999
Validation loss = 0.047863055020570755
Validation loss = 0.050567876547575
Validation loss = 0.05042267590761185
Validation loss = 0.05148946866393089
Validation loss = 0.05031583830714226
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04834398999810219
Validation loss = 0.04533035680651665
Validation loss = 0.04576531797647476
Validation loss = 0.0448073148727417
Validation loss = 0.048910658806562424
Validation loss = 0.044797930866479874
Validation loss = 0.05061328783631325
Validation loss = 0.04576772078871727
Validation loss = 0.04945925623178482
Validation loss = 0.04428150877356529
Validation loss = 0.04699111357331276
Validation loss = 0.04674150422215462
Validation loss = 0.04642946273088455
Validation loss = 0.04472384229302406
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 380
average number of affinization = 368.11049723756906
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 360
average number of affinization = 368.0659340659341
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 368.23497267759564
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 368.55434782608694
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 372
average number of affinization = 368.572972972973
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 368.88709677419354
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.33e+03 |
| Iteration     | 29       |
| MaximumReturn | 2.49e+03 |
| MinimumReturn | 2.04e+03 |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04886208474636078
Validation loss = 0.04287716746330261
Validation loss = 0.04628271237015724
Validation loss = 0.04627547413110733
Validation loss = 0.04264409840106964
Validation loss = 0.04399086534976959
Validation loss = 0.05249647796154022
Validation loss = 0.04620252549648285
Validation loss = 0.046554259955883026
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05969822034239769
Validation loss = 0.04685957357287407
Validation loss = 0.05107301101088524
Validation loss = 0.049196574836969376
Validation loss = 0.04689980670809746
Validation loss = 0.04909063130617142
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.048823703080415726
Validation loss = 0.04640324041247368
Validation loss = 0.045260507613420486
Validation loss = 0.045091744512319565
Validation loss = 0.04507243260741234
Validation loss = 0.046198006719350815
Validation loss = 0.0467643216252327
Validation loss = 0.04678373038768768
Validation loss = 0.04661766439676285
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05055024102330208
Validation loss = 0.04668998718261719
Validation loss = 0.04799206554889679
Validation loss = 0.04814692586660385
Validation loss = 0.053195640444755554
Validation loss = 0.05021626129746437
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04559297487139702
Validation loss = 0.04459433630108833
Validation loss = 0.04328081011772156
Validation loss = 0.04765691980719566
Validation loss = 0.0429222472012043
Validation loss = 0.044232625514268875
Validation loss = 0.04757095128297806
Validation loss = 0.04378994554281235
Validation loss = 0.047309860587120056
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 368.96791443850265
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 369.0904255319149
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 369.3968253968254
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 398
average number of affinization = 369.5473684210526
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 369.6387434554974
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 369.8541666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.23e+03 |
| Iteration     | 30       |
| MaximumReturn | 2.58e+03 |
| MinimumReturn | 1.59e+03 |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.044357724487781525
Validation loss = 0.043568722903728485
Validation loss = 0.04396665096282959
Validation loss = 0.04453820735216141
Validation loss = 0.042963773012161255
Validation loss = 0.04278554394841194
Validation loss = 0.04264858365058899
Validation loss = 0.04190750792622566
Validation loss = 0.04409494996070862
Validation loss = 0.04289022088050842
Validation loss = 0.04575537517666817
Validation loss = 0.04414075240492821
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06038307398557663
Validation loss = 0.04973328113555908
Validation loss = 0.04632076621055603
Validation loss = 0.048021622002124786
Validation loss = 0.0546126514673233
Validation loss = 0.04759813845157623
Validation loss = 0.04442468285560608
Validation loss = 0.04674261808395386
Validation loss = 0.05117763578891754
Validation loss = 0.04576287418603897
Validation loss = 0.04827796667814255
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.047428280115127563
Validation loss = 0.046225354075431824
Validation loss = 0.04998830333352089
Validation loss = 0.051937226206064224
Validation loss = 0.046041011810302734
Validation loss = 0.04799865186214447
Validation loss = 0.046252090483903885
Validation loss = 0.04369521141052246
Validation loss = 0.050836630165576935
Validation loss = 0.04681722819805145
Validation loss = 0.04562258720397949
Validation loss = 0.04638754948973656
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05359308421611786
Validation loss = 0.04940679669380188
Validation loss = 0.047288887202739716
Validation loss = 0.04912184923887253
Validation loss = 0.04724118858575821
Validation loss = 0.04551467299461365
Validation loss = 0.046162791550159454
Validation loss = 0.047705233097076416
Validation loss = 0.0467916876077652
Validation loss = 0.04687434062361717
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05100291222333908
Validation loss = 0.04195655137300491
Validation loss = 0.04222216457128525
Validation loss = 0.04747629165649414
Validation loss = 0.04338664188981056
Validation loss = 0.04331366717815399
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 370.0362694300518
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 449
average number of affinization = 370.44329896907215
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 79
average number of affinization = 368.94871794871796
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 369.2040816326531
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 369.51776649746193
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 369.64646464646466
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 1.38e+03  |
| Iteration     | 31        |
| MaximumReturn | 2.46e+03  |
| MinimumReturn | -2.76e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05040690675377846
Validation loss = 0.04367633908987045
Validation loss = 0.04298241809010506
Validation loss = 0.04775046929717064
Validation loss = 0.04165448248386383
Validation loss = 0.04156704619526863
Validation loss = 0.04772740229964256
Validation loss = 0.04266373813152313
Validation loss = 0.04375109821557999
Validation loss = 0.0415104515850544
Validation loss = 0.05422232300043106
Validation loss = 0.046306442469358444
Validation loss = 0.04524847865104675
Validation loss = 0.04218000918626785
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04748152941465378
Validation loss = 0.04703082889318466
Validation loss = 0.046156831085681915
Validation loss = 0.04525969550013542
Validation loss = 0.05075840651988983
Validation loss = 0.043482642620801926
Validation loss = 0.04366642236709595
Validation loss = 0.04544537514448166
Validation loss = 0.04562059044837952
Validation loss = 0.04707586392760277
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04697652906179428
Validation loss = 0.04724632948637009
Validation loss = 0.04750865325331688
Validation loss = 0.04641587287187576
Validation loss = 0.047054097056388855
Validation loss = 0.04531891271471977
Validation loss = 0.046284258365631104
Validation loss = 0.04596581682562828
Validation loss = 0.04403727129101753
Validation loss = 0.046959225088357925
Validation loss = 0.04621054604649544
Validation loss = 0.04578855633735657
Validation loss = 0.048150915652513504
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.049548156559467316
Validation loss = 0.04715511202812195
Validation loss = 0.05158902704715729
Validation loss = 0.04426310956478119
Validation loss = 0.048875659704208374
Validation loss = 0.048718489706516266
Validation loss = 0.05611131340265274
Validation loss = 0.04646783322095871
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05013415962457657
Validation loss = 0.04243471845984459
Validation loss = 0.04663282632827759
Validation loss = 0.043900977820158005
Validation loss = 0.04382263123989105
Validation loss = 0.04426411911845207
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 344
average number of affinization = 369.5175879396985
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 369.795
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 369.8358208955224
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 361
average number of affinization = 369.7920792079208
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 369.92118226600985
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 370.1372549019608
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.15e+03 |
| Iteration     | 32       |
| MaximumReturn | 2.63e+03 |
| MinimumReturn | 1.11e+03 |
| TotalSamples  | 136000   |
----------------------------
