Logging to experiments/hopper/oct31/w350e03_Durl_seed1234
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5691556930541992
Validation loss = 0.26161640882492065
Validation loss = 0.21620044112205505
Validation loss = 0.2052093744277954
Validation loss = 0.20411548018455505
Validation loss = 0.227137953042984
Validation loss = 0.2113637626171112
Validation loss = 0.213888019323349
Validation loss = 0.2221056967973709
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.547812819480896
Validation loss = 0.26787954568862915
Validation loss = 0.22037267684936523
Validation loss = 0.20323359966278076
Validation loss = 0.20476490259170532
Validation loss = 0.20493543148040771
Validation loss = 0.21168574690818787
Validation loss = 0.22275574505329132
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.45246100425720215
Validation loss = 0.27340561151504517
Validation loss = 0.22546404600143433
Validation loss = 0.20666569471359253
Validation loss = 0.20238085091114044
Validation loss = 0.2088562548160553
Validation loss = 0.20900550484657288
Validation loss = 0.20966428518295288
Validation loss = 0.214692622423172
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5585249662399292
Validation loss = 0.26129353046417236
Validation loss = 0.22166189551353455
Validation loss = 0.20703798532485962
Validation loss = 0.20157402753829956
Validation loss = 0.21046128869056702
Validation loss = 0.20992353558540344
Validation loss = 0.2262895405292511
Validation loss = 0.22370865941047668
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.582796037197113
Validation loss = 0.25870344042778015
Validation loss = 0.21573595702648163
Validation loss = 0.20450815558433533
Validation loss = 0.21033845841884613
Validation loss = 0.2109692543745041
Validation loss = 0.2212391048669815
Validation loss = 0.21624711155891418
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 480
average number of affinization = 68.57142857142857
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 530
average number of affinization = 126.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 483
average number of affinization = 165.88888888888889
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 552
average number of affinization = 204.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 551
average number of affinization = 236.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 254.33333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.82e+03 |
| Iteration     | 0         |
| MaximumReturn | -2.22e+03 |
| MinimumReturn | -3.28e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21919268369674683
Validation loss = 0.2106890082359314
Validation loss = 0.20580095052719116
Validation loss = 0.21750687062740326
Validation loss = 0.21997016668319702
Validation loss = 0.21646258234977722
Validation loss = 0.21323131024837494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.21235640347003937
Validation loss = 0.20326921343803406
Validation loss = 0.20889359712600708
Validation loss = 0.2183457314968109
Validation loss = 0.21787726879119873
Validation loss = 0.22039246559143066
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.23181970417499542
Validation loss = 0.20329341292381287
Validation loss = 0.20550087094306946
Validation loss = 0.2083524763584137
Validation loss = 0.2171928882598877
Validation loss = 0.21466651558876038
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21848498284816742
Validation loss = 0.20445013046264648
Validation loss = 0.22350090742111206
Validation loss = 0.2084648609161377
Validation loss = 0.20832282304763794
Validation loss = 0.21878919005393982
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.23662489652633667
Validation loss = 0.21161365509033203
Validation loss = 0.21281158924102783
Validation loss = 0.21819142997264862
Validation loss = 0.21861597895622253
Validation loss = 0.2068173885345459
Validation loss = 0.2104119211435318
Validation loss = 0.22780552506446838
Validation loss = 0.21525804698467255
Validation loss = 0.21692052483558655
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 680
average number of affinization = 287.0769230769231
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 741
average number of affinization = 319.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 662
average number of affinization = 342.3333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 749
average number of affinization = 367.75
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 690
average number of affinization = 386.70588235294116
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 589
average number of affinization = 397.94444444444446
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.86e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.28e+03 |
| MinimumReturn | -2.98e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.20586906373500824
Validation loss = 0.1894230842590332
Validation loss = 0.1960490345954895
Validation loss = 0.19478584825992584
Validation loss = 0.19333213567733765
Validation loss = 0.20400112867355347
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2049267441034317
Validation loss = 0.1953553557395935
Validation loss = 0.19211339950561523
Validation loss = 0.20315389335155487
Validation loss = 0.20003674924373627
Validation loss = 0.20305126905441284
Validation loss = 0.19711071252822876
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.21278071403503418
Validation loss = 0.194694384932518
Validation loss = 0.19661612808704376
Validation loss = 0.19767265021800995
Validation loss = 0.21358902752399445
Validation loss = 0.19783665239810944
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.19910837709903717
Validation loss = 0.18500979244709015
Validation loss = 0.19281665980815887
Validation loss = 0.19305486977100372
Validation loss = 0.19722449779510498
Validation loss = 0.20506925880908966
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21023307740688324
Validation loss = 0.19701819121837616
Validation loss = 0.19988274574279785
Validation loss = 0.19410647451877594
Validation loss = 0.19966544210910797
Validation loss = 0.20112358033657074
Validation loss = 0.2089611440896988
Validation loss = 0.2071063071489334
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 575
average number of affinization = 407.2631578947368
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 538
average number of affinization = 413.8
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 514
average number of affinization = 418.57142857142856
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 523
average number of affinization = 423.3181818181818
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 487
average number of affinization = 426.0869565217391
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 551
average number of affinization = 431.2916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.14e+03 |
| Iteration     | 2         |
| MaximumReturn | -1.66e+03 |
| MinimumReturn | -2.67e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.21060258150100708
Validation loss = 0.18736481666564941
Validation loss = 0.18955114483833313
Validation loss = 0.18704098463058472
Validation loss = 0.18322217464447021
Validation loss = 0.18934352695941925
Validation loss = 0.18212541937828064
Validation loss = 0.18680405616760254
Validation loss = 0.17856650054454803
Validation loss = 0.18500292301177979
Validation loss = 0.1815704107284546
Validation loss = 0.18438619375228882
Validation loss = 0.18674175441265106
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2153225541114807
Validation loss = 0.19061800837516785
Validation loss = 0.18803654611110687
Validation loss = 0.1838565319776535
Validation loss = 0.18697085976600647
Validation loss = 0.1914128065109253
Validation loss = 0.1885209083557129
Validation loss = 0.1956489384174347
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20643563568592072
Validation loss = 0.18643775582313538
Validation loss = 0.18888498842716217
Validation loss = 0.19599518179893494
Validation loss = 0.18973514437675476
Validation loss = 0.1902710348367691
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.20025743544101715
Validation loss = 0.1908584088087082
Validation loss = 0.1878582239151001
Validation loss = 0.18905094265937805
Validation loss = 0.2003791630268097
Validation loss = 0.1917031854391098
Validation loss = 0.1924009919166565
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21825574338436127
Validation loss = 0.18918529152870178
Validation loss = 0.19353491067886353
Validation loss = 0.19524694979190826
Validation loss = 0.1970548778772354
Validation loss = 0.19043362140655518
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 360
average number of affinization = 428.44
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 426.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 440
average number of affinization = 427.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 444
average number of affinization = 427.60714285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 362
average number of affinization = 425.3448275862069
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 423.93333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.96e+03 |
| Iteration     | 3         |
| MaximumReturn | -1.02e+03 |
| MinimumReturn | -3.16e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19667495787143707
Validation loss = 0.1801634281873703
Validation loss = 0.17642727494239807
Validation loss = 0.17540888488292694
Validation loss = 0.1774735301733017
Validation loss = 0.18203991651535034
Validation loss = 0.18010248243808746
Validation loss = 0.17882609367370605
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.19164815545082092
Validation loss = 0.1779545694589615
Validation loss = 0.18024106323719025
Validation loss = 0.1792578250169754
Validation loss = 0.17990908026695251
Validation loss = 0.17641787230968475
Validation loss = 0.179183691740036
Validation loss = 0.17910489439964294
Validation loss = 0.18175533413887024
Validation loss = 0.18137875199317932
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.20285853743553162
Validation loss = 0.17805585265159607
Validation loss = 0.17834444344043732
Validation loss = 0.18214361369609833
Validation loss = 0.18020984530448914
Validation loss = 0.18174904584884644
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18924665451049805
Validation loss = 0.18086636066436768
Validation loss = 0.1761769950389862
Validation loss = 0.17276880145072937
Validation loss = 0.17755740880966187
Validation loss = 0.18642473220825195
Validation loss = 0.17906978726387024
Validation loss = 0.17699892818927765
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.20134899020195007
Validation loss = 0.1930539608001709
Validation loss = 0.1943667083978653
Validation loss = 0.1881103664636612
Validation loss = 0.18429411947727203
Validation loss = 0.1825341284275055
Validation loss = 0.18890298902988434
Validation loss = 0.19066879153251648
Validation loss = 0.1904543936252594
Validation loss = 0.18372157216072083
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 424.0967741935484
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 466
average number of affinization = 425.40625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 488
average number of affinization = 427.3030303030303
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 426.94117647058823
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 479
average number of affinization = 428.42857142857144
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 477
average number of affinization = 429.77777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.21e+03 |
| Iteration     | 4         |
| MaximumReturn | -227      |
| MinimumReturn | -2.08e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18884813785552979
Validation loss = 0.17384213209152222
Validation loss = 0.18143536150455475
Validation loss = 0.17799943685531616
Validation loss = 0.178423210978508
Validation loss = 0.17536580562591553
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1767675280570984
Validation loss = 0.18004293739795685
Validation loss = 0.1806340217590332
Validation loss = 0.17656385898590088
Validation loss = 0.17595930397510529
Validation loss = 0.17641133069992065
Validation loss = 0.17776507139205933
Validation loss = 0.18299464881420135
Validation loss = 0.18486690521240234
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1920403093099594
Validation loss = 0.1794542670249939
Validation loss = 0.1718805581331253
Validation loss = 0.1730695366859436
Validation loss = 0.18079213798046112
Validation loss = 0.1761806607246399
Validation loss = 0.18033467233181
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1790214627981186
Validation loss = 0.1763729304075241
Validation loss = 0.17506062984466553
Validation loss = 0.17745961248874664
Validation loss = 0.17639414966106415
Validation loss = 0.17785705626010895
Validation loss = 0.1829378753900528
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21186105906963348
Validation loss = 0.18684448301792145
Validation loss = 0.18830406665802002
Validation loss = 0.18651175498962402
Validation loss = 0.18458586931228638
Validation loss = 0.1800144761800766
Validation loss = 0.1953035593032837
Validation loss = 0.1823236495256424
Validation loss = 0.1882873773574829
Validation loss = 0.1901804804801941
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 395
average number of affinization = 428.8378378378378
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 403
average number of affinization = 428.1578947368421
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 444
average number of affinization = 428.56410256410254
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 428.3
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 428.3658536585366
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 448
average number of affinization = 428.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.12e+03 |
| Iteration     | 5         |
| MaximumReturn | -833      |
| MinimumReturn | -1.51e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.17172583937644958
Validation loss = 0.16720037162303925
Validation loss = 0.1596650928258896
Validation loss = 0.16027818620204926
Validation loss = 0.1643177568912506
Validation loss = 0.16150955855846405
Validation loss = 0.1578284651041031
Validation loss = 0.16156384348869324
Validation loss = 0.15887406468391418
Validation loss = 0.15481878817081451
Validation loss = 0.15718181431293488
Validation loss = 0.15546706318855286
Validation loss = 0.16104499995708466
Validation loss = 0.15836946666240692
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.17856085300445557
Validation loss = 0.17316024005413055
Validation loss = 0.15876658260822296
Validation loss = 0.1568109095096588
Validation loss = 0.15868064761161804
Validation loss = 0.16013774275779724
Validation loss = 0.15890750288963318
Validation loss = 0.16547857224941254
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16864606738090515
Validation loss = 0.1662784069776535
Validation loss = 0.1713324338197708
Validation loss = 0.16411028802394867
Validation loss = 0.1601192057132721
Validation loss = 0.16027610003948212
Validation loss = 0.15964731574058533
Validation loss = 0.1594291478395462
Validation loss = 0.1610480546951294
Validation loss = 0.1618252992630005
Validation loss = 0.1594013124704361
Validation loss = 0.17152000963687897
Validation loss = 0.15913721919059753
Validation loss = 0.16322410106658936
Validation loss = 0.16157221794128418
Validation loss = 0.15929076075553894
Validation loss = 0.1569317877292633
Validation loss = 0.15845194458961487
Validation loss = 0.16395945847034454
Validation loss = 0.16127999126911163
Validation loss = 0.1646866351366043
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1734975278377533
Validation loss = 0.17224037647247314
Validation loss = 0.17469815909862518
Validation loss = 0.16590800881385803
Validation loss = 0.16421112418174744
Validation loss = 0.16886135935783386
Validation loss = 0.16537995636463165
Validation loss = 0.17158390581607819
Validation loss = 0.16986528038978577
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.17846743762493134
Validation loss = 0.1737980991601944
Validation loss = 0.1706097573041916
Validation loss = 0.16957174241542816
Validation loss = 0.17028044164180756
Validation loss = 0.16525079309940338
Validation loss = 0.16935189068317413
Validation loss = 0.16623103618621826
Validation loss = 0.1687164455652237
Validation loss = 0.17348910868167877
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 429.16279069767444
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 448
average number of affinization = 429.59090909090907
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 446
average number of affinization = 429.9555555555556
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 430.8695652173913
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 502
average number of affinization = 432.3829787234043
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 452
average number of affinization = 432.7916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -214     |
| Iteration     | 6        |
| MaximumReturn | 558      |
| MinimumReturn | -799     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1828102469444275
Validation loss = 0.15074390172958374
Validation loss = 0.15134672820568085
Validation loss = 0.14154042303562164
Validation loss = 0.14268669486045837
Validation loss = 0.14165326952934265
Validation loss = 0.14794984459877014
Validation loss = 0.1452401578426361
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.16278192400932312
Validation loss = 0.15785765647888184
Validation loss = 0.1470341831445694
Validation loss = 0.1486537754535675
Validation loss = 0.1484198123216629
Validation loss = 0.14079484343528748
Validation loss = 0.14491668343544006
Validation loss = 0.14792022109031677
Validation loss = 0.14413213729858398
Validation loss = 0.1433277726173401
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.16585573554039001
Validation loss = 0.1612950563430786
Validation loss = 0.148626446723938
Validation loss = 0.14147520065307617
Validation loss = 0.14398841559886932
Validation loss = 0.14448127150535583
Validation loss = 0.14282524585723877
Validation loss = 0.14316977560520172
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1715785562992096
Validation loss = 0.1606195867061615
Validation loss = 0.15600034594535828
Validation loss = 0.15561923384666443
Validation loss = 0.1485040783882141
Validation loss = 0.15415742993354797
Validation loss = 0.14925441145896912
Validation loss = 0.14930009841918945
Validation loss = 0.15680311620235443
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1773483008146286
Validation loss = 0.15434053540229797
Validation loss = 0.15528708696365356
Validation loss = 0.14973920583724976
Validation loss = 0.15112754702568054
Validation loss = 0.15448209643363953
Validation loss = 0.14892765879631042
Validation loss = 0.15283262729644775
Validation loss = 0.15412239730358124
Validation loss = 0.156654953956604
Validation loss = 0.16609826683998108
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 445
average number of affinization = 433.0408163265306
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 465
average number of affinization = 433.68
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 451
average number of affinization = 434.01960784313724
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 452
average number of affinization = 434.36538461538464
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 486
average number of affinization = 435.33962264150944
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 462
average number of affinization = 435.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 152      |
| Iteration     | 7        |
| MaximumReturn | 775      |
| MinimumReturn | -122     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.15029515326023102
Validation loss = 0.1383240520954132
Validation loss = 0.13211029767990112
Validation loss = 0.12528230249881744
Validation loss = 0.12478898465633392
Validation loss = 0.12514887750148773
Validation loss = 0.1321110874414444
Validation loss = 0.12644706666469574
Validation loss = 0.1278098076581955
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.14175091683864594
Validation loss = 0.13350458443164825
Validation loss = 0.12992611527442932
Validation loss = 0.13271939754486084
Validation loss = 0.12775874137878418
Validation loss = 0.1272560954093933
Validation loss = 0.1262788623571396
Validation loss = 0.12869857251644135
Validation loss = 0.12864655256271362
Validation loss = 0.12486948072910309
Validation loss = 0.13108700513839722
Validation loss = 0.13165175914764404
Validation loss = 0.12528063356876373
Validation loss = 0.12798002362251282
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.15181893110275269
Validation loss = 0.12972047924995422
Validation loss = 0.13346508145332336
Validation loss = 0.12964333593845367
Validation loss = 0.12834946811199188
Validation loss = 0.1276012510061264
Validation loss = 0.12569180130958557
Validation loss = 0.1263129562139511
Validation loss = 0.12801380455493927
Validation loss = 0.12907648086547852
Validation loss = 0.12373115122318268
Validation loss = 0.12611103057861328
Validation loss = 0.12935131788253784
Validation loss = 0.12824831902980804
Validation loss = 0.13174913823604584
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.14718879759311676
Validation loss = 0.13879695534706116
Validation loss = 0.13810843229293823
Validation loss = 0.1349538117647171
Validation loss = 0.1364787220954895
Validation loss = 0.13239014148712158
Validation loss = 0.12975817918777466
Validation loss = 0.1341807246208191
Validation loss = 0.1331188827753067
Validation loss = 0.12592586874961853
Validation loss = 0.1313471496105194
Validation loss = 0.12577605247497559
Validation loss = 0.13003171980381012
Validation loss = 0.13510316610336304
Validation loss = 0.12844179570674896
Validation loss = 0.12940281629562378
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.16602055728435516
Validation loss = 0.13789886236190796
Validation loss = 0.13356444239616394
Validation loss = 0.13389848172664642
Validation loss = 0.1384945809841156
Validation loss = 0.13283738493919373
Validation loss = 0.1298282891511917
Validation loss = 0.13328121602535248
Validation loss = 0.14104050397872925
Validation loss = 0.13168007135391235
Validation loss = 0.13005824387073517
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 495
average number of affinization = 436.90909090909093
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 437.26785714285717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 488
average number of affinization = 438.1578947368421
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 455
average number of affinization = 438.44827586206895
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 461
average number of affinization = 438.8305084745763
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 477
average number of affinization = 439.46666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 178      |
| Iteration     | 8        |
| MaximumReturn | 386      |
| MinimumReturn | -208     |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1334632933139801
Validation loss = 0.11778923124074936
Validation loss = 0.11448290199041367
Validation loss = 0.11749862134456635
Validation loss = 0.11541011184453964
Validation loss = 0.11402662843465805
Validation loss = 0.11404488980770111
Validation loss = 0.1095966249704361
Validation loss = 0.11298984289169312
Validation loss = 0.11162283271551132
Validation loss = 0.11273974180221558
Validation loss = 0.11335957050323486
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13679377734661102
Validation loss = 0.11913653463125229
Validation loss = 0.11668825149536133
Validation loss = 0.1139201894402504
Validation loss = 0.11822893470525742
Validation loss = 0.11571472883224487
Validation loss = 0.11613503843545914
Validation loss = 0.1152113676071167
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12922516465187073
Validation loss = 0.11749334633350372
Validation loss = 0.11368664354085922
Validation loss = 0.11748691648244858
Validation loss = 0.11456398665904999
Validation loss = 0.1099090576171875
Validation loss = 0.11957714706659317
Validation loss = 0.11126838624477386
Validation loss = 0.12075555324554443
Validation loss = 0.10776839405298233
Validation loss = 0.10974270105361938
Validation loss = 0.11136656999588013
Validation loss = 0.1142985001206398
Validation loss = 0.1089438647031784
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1285153329372406
Validation loss = 0.12217722088098526
Validation loss = 0.11479209363460541
Validation loss = 0.11294307559728622
Validation loss = 0.11113803088665009
Validation loss = 0.10804162919521332
Validation loss = 0.11802877485752106
Validation loss = 0.11309592425823212
Validation loss = 0.11447946727275848
Validation loss = 0.11137886345386505
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13647103309631348
Validation loss = 0.119644895195961
Validation loss = 0.11954434216022491
Validation loss = 0.11451182514429092
Validation loss = 0.1194617971777916
Validation loss = 0.12074120342731476
Validation loss = 0.1148148626089096
Validation loss = 0.11804090440273285
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 439.75409836065575
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 490
average number of affinization = 440.56451612903226
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 441.04761904761904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 440.640625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 440.33846153846156
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 509
average number of affinization = 441.3787878787879
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 335      |
| Iteration     | 9        |
| MaximumReturn | 925      |
| MinimumReturn | -130     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12606237828731537
Validation loss = 0.1044459119439125
Validation loss = 0.1123189926147461
Validation loss = 0.10126236081123352
Validation loss = 0.10072501003742218
Validation loss = 0.10404887795448303
Validation loss = 0.10128967463970184
Validation loss = 0.09991008788347244
Validation loss = 0.10483837127685547
Validation loss = 0.09704670310020447
Validation loss = 0.09697587788105011
Validation loss = 0.09878766536712646
Validation loss = 0.09718611091375351
Validation loss = 0.09950900822877884
Validation loss = 0.10408375412225723
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11756284534931183
Validation loss = 0.1102689728140831
Validation loss = 0.10417161136865616
Validation loss = 0.10244905203580856
Validation loss = 0.10539121925830841
Validation loss = 0.10281965881586075
Validation loss = 0.10282272845506668
Validation loss = 0.10367786884307861
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11643987149000168
Validation loss = 0.09845948219299316
Validation loss = 0.10052221268415451
Validation loss = 0.10545864701271057
Validation loss = 0.09981746971607208
Validation loss = 0.09897570312023163
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11446480453014374
Validation loss = 0.1020149290561676
Validation loss = 0.1002158671617508
Validation loss = 0.10225717723369598
Validation loss = 0.09711870551109314
Validation loss = 0.09921589493751526
Validation loss = 0.09849341213703156
Validation loss = 0.10249041765928268
Validation loss = 0.09865539520978928
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11777292937040329
Validation loss = 0.11841275542974472
Validation loss = 0.10293221473693848
Validation loss = 0.10383209586143494
Validation loss = 0.10347256809473038
Validation loss = 0.10411336272954941
Validation loss = 0.10218068957328796
Validation loss = 0.10443215072154999
Validation loss = 0.10366019606590271
Validation loss = 0.10408011078834534
Validation loss = 0.10187988728284836
Validation loss = 0.1073765829205513
Validation loss = 0.10340151190757751
Validation loss = 0.09983640164136887
Validation loss = 0.10932184010744095
Validation loss = 0.10105106979608536
Validation loss = 0.10066359490156174
Validation loss = 0.10744082182645798
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 450
average number of affinization = 441.5074626865672
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 469
average number of affinization = 441.9117647058824
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 412
average number of affinization = 441.4782608695652
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 476
average number of affinization = 441.9714285714286
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 474
average number of affinization = 442.4225352112676
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 486
average number of affinization = 443.02777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -707      |
| Iteration     | 10        |
| MaximumReturn | 179       |
| MinimumReturn | -2.31e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10669072717428207
Validation loss = 0.09418503195047379
Validation loss = 0.09670275449752808
Validation loss = 0.09156323224306107
Validation loss = 0.09470965713262558
Validation loss = 0.09700343757867813
Validation loss = 0.09389618784189224
Validation loss = 0.09316744655370712
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1218692883849144
Validation loss = 0.09805279225111008
Validation loss = 0.10051824897527695
Validation loss = 0.09524355083703995
Validation loss = 0.09694870561361313
Validation loss = 0.09869292378425598
Validation loss = 0.09774912148714066
Validation loss = 0.0988195613026619
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10500574111938477
Validation loss = 0.09564965218305588
Validation loss = 0.09396713227033615
Validation loss = 0.09375161677598953
Validation loss = 0.09681567549705505
Validation loss = 0.09428747743368149
Validation loss = 0.09468314796686172
Validation loss = 0.1004059910774231
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.10795251280069351
Validation loss = 0.09768351912498474
Validation loss = 0.09434902667999268
Validation loss = 0.09550344944000244
Validation loss = 0.0942489504814148
Validation loss = 0.09665710479021072
Validation loss = 0.09829976409673691
Validation loss = 0.10002138465642929
Validation loss = 0.09160998463630676
Validation loss = 0.0918305516242981
Validation loss = 0.09931222349405289
Validation loss = 0.10441750288009644
Validation loss = 0.09523990005254745
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.108934186398983
Validation loss = 0.0996350422501564
Validation loss = 0.09831195324659348
Validation loss = 0.09797350317239761
Validation loss = 0.09731439501047134
Validation loss = 0.10072010010480881
Validation loss = 0.09631776064634323
Validation loss = 0.1009296178817749
Validation loss = 0.10519880801439285
Validation loss = 0.09273307770490646
Validation loss = 0.09262009710073471
Validation loss = 0.09619931131601334
Validation loss = 0.0967346653342247
Validation loss = 0.0947609543800354
Validation loss = 0.09630142897367477
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 442.8493150684931
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 485
average number of affinization = 443.4189189189189
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 443.6
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 443.7631578947368
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 503
average number of affinization = 444.53246753246754
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 516
average number of affinization = 445.44871794871796
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -110      |
| Iteration     | 11        |
| MaximumReturn | 1.06e+03  |
| MinimumReturn | -3.07e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11176454275846481
Validation loss = 0.09363691508769989
Validation loss = 0.08560159802436829
Validation loss = 0.08606354147195816
Validation loss = 0.08874551951885223
Validation loss = 0.08949879556894302
Validation loss = 0.09133142977952957
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11031421273946762
Validation loss = 0.09177172183990479
Validation loss = 0.08969052135944366
Validation loss = 0.0884711742401123
Validation loss = 0.08984827995300293
Validation loss = 0.0892833024263382
Validation loss = 0.0945172980427742
Validation loss = 0.08872172236442566
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.10454387217760086
Validation loss = 0.09154682606458664
Validation loss = 0.0880088359117508
Validation loss = 0.08869468420743942
Validation loss = 0.08896203339099884
Validation loss = 0.09040383249521255
Validation loss = 0.08556588739156723
Validation loss = 0.08405343443155289
Validation loss = 0.08962150663137436
Validation loss = 0.08855307847261429
Validation loss = 0.08826406300067902
Validation loss = 0.08404969424009323
Validation loss = 0.08452476561069489
Validation loss = 0.08764629065990448
Validation loss = 0.08840341120958328
Validation loss = 0.08862432092428207
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09811144322156906
Validation loss = 0.09168840199708939
Validation loss = 0.08958836644887924
Validation loss = 0.09158743917942047
Validation loss = 0.09025312215089798
Validation loss = 0.08868645876646042
Validation loss = 0.08883946388959885
Validation loss = 0.09011668711900711
Validation loss = 0.08707864582538605
Validation loss = 0.08929671347141266
Validation loss = 0.09009440243244171
Validation loss = 0.08509840071201324
Validation loss = 0.085772804915905
Validation loss = 0.09183977544307709
Validation loss = 0.09048997610807419
Validation loss = 0.09613673388957977
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10910617560148239
Validation loss = 0.09334176033735275
Validation loss = 0.08704348653554916
Validation loss = 0.08830134570598602
Validation loss = 0.08987267315387726
Validation loss = 0.10523232072591782
Validation loss = 0.08634379506111145
Validation loss = 0.0844189003109932
Validation loss = 0.0854162722826004
Validation loss = 0.08837354928255081
Validation loss = 0.08947793394327164
Validation loss = 0.08612410724163055
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 485
average number of affinization = 445.9493670886076
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 396
average number of affinization = 445.325
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 393
average number of affinization = 444.679012345679
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 444.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 443.9277108433735
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 490
average number of affinization = 444.4761904761905
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -617      |
| Iteration     | 12        |
| MaximumReturn | 1.03e+03  |
| MinimumReturn | -2.55e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09875199943780899
Validation loss = 0.08580110222101212
Validation loss = 0.0853593498468399
Validation loss = 0.08610516786575317
Validation loss = 0.08935045450925827
Validation loss = 0.08689559996128082
Validation loss = 0.08474307507276535
Validation loss = 0.08348236232995987
Validation loss = 0.08652395755052567
Validation loss = 0.08420567959547043
Validation loss = 0.08420498669147491
Validation loss = 0.08362853527069092
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0949435830116272
Validation loss = 0.08938593417406082
Validation loss = 0.08929241448640823
Validation loss = 0.08866821229457855
Validation loss = 0.08869560062885284
Validation loss = 0.09068065881729126
Validation loss = 0.08571124076843262
Validation loss = 0.08728277683258057
Validation loss = 0.08528761565685272
Validation loss = 0.08574055135250092
Validation loss = 0.08635963499546051
Validation loss = 0.08383535593748093
Validation loss = 0.0871962457895279
Validation loss = 0.08534877747297287
Validation loss = 0.08372284471988678
Validation loss = 0.0856662169098854
Validation loss = 0.08807558566331863
Validation loss = 0.0878247395157814
Validation loss = 0.08194766938686371
Validation loss = 0.0817413404583931
Validation loss = 0.08785492181777954
Validation loss = 0.08274271339178085
Validation loss = 0.08296201378107071
Validation loss = 0.08196210861206055
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08703631162643433
Validation loss = 0.08322601020336151
Validation loss = 0.0818089097738266
Validation loss = 0.08583356440067291
Validation loss = 0.08356411755084991
Validation loss = 0.09438935667276382
Validation loss = 0.08098781108856201
Validation loss = 0.0848553404211998
Validation loss = 0.08201060444116592
Validation loss = 0.08067933470010757
Validation loss = 0.08167202025651932
Validation loss = 0.08185946196317673
Validation loss = 0.08464739471673965
Validation loss = 0.07947777956724167
Validation loss = 0.08722434937953949
Validation loss = 0.07940825074911118
Validation loss = 0.07715803384780884
Validation loss = 0.08157803863286972
Validation loss = 0.08158309012651443
Validation loss = 0.07812520116567612
Validation loss = 0.07943376153707504
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09812901169061661
Validation loss = 0.08373336493968964
Validation loss = 0.08802881091833115
Validation loss = 0.08975394070148468
Validation loss = 0.08396877348423004
Validation loss = 0.08596730977296829
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09291044622659683
Validation loss = 0.0858905091881752
Validation loss = 0.0843585878610611
Validation loss = 0.08489622175693512
Validation loss = 0.09048417955636978
Validation loss = 0.09020944684743881
Validation loss = 0.08043748885393143
Validation loss = 0.08269108086824417
Validation loss = 0.09132816642522812
Validation loss = 0.08364643156528473
Validation loss = 0.08248991519212723
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 513
average number of affinization = 445.2823529411765
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 492
average number of affinization = 445.8255813953488
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 478
average number of affinization = 446.1954022988506
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 446.32954545454544
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 469
average number of affinization = 446.5842696629214
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 466
average number of affinization = 446.8
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 845      |
| Iteration     | 13       |
| MaximumReturn | 1.28e+03 |
| MinimumReturn | 236      |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09679289907217026
Validation loss = 0.0819573849439621
Validation loss = 0.08136414736509323
Validation loss = 0.07645127922296524
Validation loss = 0.07769928872585297
Validation loss = 0.07797438651323318
Validation loss = 0.07628383487462997
Validation loss = 0.0789664015173912
Validation loss = 0.07836168259382248
Validation loss = 0.07531968504190445
Validation loss = 0.07881652563810349
Validation loss = 0.07488477975130081
Validation loss = 0.07725232094526291
Validation loss = 0.07671753317117691
Validation loss = 0.07678982615470886
Validation loss = 0.07802038639783859
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0968879908323288
Validation loss = 0.0790991485118866
Validation loss = 0.0796654224395752
Validation loss = 0.07840941846370697
Validation loss = 0.07419324666261673
Validation loss = 0.07699171453714371
Validation loss = 0.0758359357714653
Validation loss = 0.07813727110624313
Validation loss = 0.07622016221284866
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0865839421749115
Validation loss = 0.0736657902598381
Validation loss = 0.07322629541158676
Validation loss = 0.07449476420879364
Validation loss = 0.08307218551635742
Validation loss = 0.07414636760950089
Validation loss = 0.07310989499092102
Validation loss = 0.07820435613393784
Validation loss = 0.07159686088562012
Validation loss = 0.07128267735242844
Validation loss = 0.07276316732168198
Validation loss = 0.07487194240093231
Validation loss = 0.07138436287641525
Validation loss = 0.07285252213478088
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08941194415092468
Validation loss = 0.08167987316846848
Validation loss = 0.07898373156785965
Validation loss = 0.07669908553361893
Validation loss = 0.07701568305492401
Validation loss = 0.07875842601060867
Validation loss = 0.07751190662384033
Validation loss = 0.078426793217659
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.09220147877931595
Validation loss = 0.07969629019498825
Validation loss = 0.07715488225221634
Validation loss = 0.07685182243585587
Validation loss = 0.07805974781513214
Validation loss = 0.07513047754764557
Validation loss = 0.07642805576324463
Validation loss = 0.07785823941230774
Validation loss = 0.07406140118837357
Validation loss = 0.0800706073641777
Validation loss = 0.07652182877063751
Validation loss = 0.07429241389036179
Validation loss = 0.08002019673585892
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 475
average number of affinization = 447.1098901098901
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 480
average number of affinization = 447.4673913043478
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 474
average number of affinization = 447.752688172043
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 467
average number of affinization = 447.9574468085106
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 495
average number of affinization = 448.4526315789474
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 448.3854166666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 825      |
| Iteration     | 14       |
| MaximumReturn | 1.27e+03 |
| MinimumReturn | 446      |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.08289724588394165
Validation loss = 0.0746178925037384
Validation loss = 0.07235649228096008
Validation loss = 0.07257469743490219
Validation loss = 0.07224945724010468
Validation loss = 0.0738145112991333
Validation loss = 0.07279542088508606
Validation loss = 0.07532094419002533
Validation loss = 0.06991235166788101
Validation loss = 0.06913776695728302
Validation loss = 0.07291370630264282
Validation loss = 0.07295727729797363
Validation loss = 0.06946224719285965
Validation loss = 0.07090431451797485
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08535979688167572
Validation loss = 0.07674067467451096
Validation loss = 0.07108823955059052
Validation loss = 0.07478633522987366
Validation loss = 0.07692615687847137
Validation loss = 0.0727439820766449
Validation loss = 0.07109439373016357
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07836531102657318
Validation loss = 0.0709138959646225
Validation loss = 0.06703586131334305
Validation loss = 0.0689956545829773
Validation loss = 0.07212957739830017
Validation loss = 0.06959059834480286
Validation loss = 0.06769929826259613
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08617574721574783
Validation loss = 0.07292962074279785
Validation loss = 0.07316118478775024
Validation loss = 0.07295078039169312
Validation loss = 0.07280346751213074
Validation loss = 0.07746841758489609
Validation loss = 0.07471072673797607
Validation loss = 0.07438872754573822
Validation loss = 0.07418408989906311
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08258508890867233
Validation loss = 0.0727519690990448
Validation loss = 0.07146745920181274
Validation loss = 0.07489900290966034
Validation loss = 0.06957855820655823
Validation loss = 0.0775669664144516
Validation loss = 0.07272573560476303
Validation loss = 0.07247161865234375
Validation loss = 0.07039324194192886
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 448.9072164948454
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 528
average number of affinization = 449.7142857142857
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 478
average number of affinization = 450.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 486
average number of affinization = 450.36
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 492
average number of affinization = 450.7722772277228
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 476
average number of affinization = 451.01960784313724
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 944      |
| Iteration     | 15       |
| MaximumReturn | 1.31e+03 |
| MinimumReturn | 548      |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07592989504337311
Validation loss = 0.06977598369121552
Validation loss = 0.06749983876943588
Validation loss = 0.06721848994493484
Validation loss = 0.07070225477218628
Validation loss = 0.07181383669376373
Validation loss = 0.06835656613111496
Validation loss = 0.06771669536828995
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07938777655363083
Validation loss = 0.0715046152472496
Validation loss = 0.07137437164783478
Validation loss = 0.07206390053033829
Validation loss = 0.06976539641618729
Validation loss = 0.06993437558412552
Validation loss = 0.06770583987236023
Validation loss = 0.07052072882652283
Validation loss = 0.06886744499206543
Validation loss = 0.06765013188123703
Validation loss = 0.06644900888204575
Validation loss = 0.07023140043020248
Validation loss = 0.06708664447069168
Validation loss = 0.06885888427495956
Validation loss = 0.0671495795249939
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07819423079490662
Validation loss = 0.07123032957315445
Validation loss = 0.06582667678594589
Validation loss = 0.0651109367609024
Validation loss = 0.06637392193078995
Validation loss = 0.06572823226451874
Validation loss = 0.06739702820777893
Validation loss = 0.06536136567592621
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08997207880020142
Validation loss = 0.06806306540966034
Validation loss = 0.06974329799413681
Validation loss = 0.07344862818717957
Validation loss = 0.0716637447476387
Validation loss = 0.06739326566457748
Validation loss = 0.06822337210178375
Validation loss = 0.07184582203626633
Validation loss = 0.06707984209060669
Validation loss = 0.06673779338598251
Validation loss = 0.0667877048254013
Validation loss = 0.0675337091088295
Validation loss = 0.06778144836425781
Validation loss = 0.06948965787887573
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07804925739765167
Validation loss = 0.0690947100520134
Validation loss = 0.0691826194524765
Validation loss = 0.06863821297883987
Validation loss = 0.07052844762802124
Validation loss = 0.06789420545101166
Validation loss = 0.06745439767837524
Validation loss = 0.06756211817264557
Validation loss = 0.07194432616233826
Validation loss = 0.06945013999938965
Validation loss = 0.065899558365345
Validation loss = 0.06445171684026718
Validation loss = 0.06964757293462753
Validation loss = 0.06477583199739456
Validation loss = 0.06355005502700806
Validation loss = 0.06504926830530167
Validation loss = 0.06984661519527435
Validation loss = 0.0646100640296936
Validation loss = 0.06380821764469147
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 399
average number of affinization = 450.5145631067961
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 502
average number of affinization = 451.00961538461536
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 469
average number of affinization = 451.18095238095236
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 450.87735849056605
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 451.32710280373834
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 516
average number of affinization = 451.9259259259259
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 335       |
| Iteration     | 16        |
| MaximumReturn | 1.42e+03  |
| MinimumReturn | -2.96e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07869504392147064
Validation loss = 0.0676765888929367
Validation loss = 0.06663506478071213
Validation loss = 0.06696409732103348
Validation loss = 0.06614390015602112
Validation loss = 0.07022176682949066
Validation loss = 0.06749498844146729
Validation loss = 0.06763701140880585
Validation loss = 0.06398069113492966
Validation loss = 0.07210370153188705
Validation loss = 0.06897146999835968
Validation loss = 0.06276169419288635
Validation loss = 0.0657593235373497
Validation loss = 0.06699881702661514
Validation loss = 0.07012159377336502
Validation loss = 0.06283285468816757
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07452213764190674
Validation loss = 0.06735852360725403
Validation loss = 0.06559517979621887
Validation loss = 0.06659974157810211
Validation loss = 0.06940993666648865
Validation loss = 0.06753300875425339
Validation loss = 0.06671223044395447
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06902846693992615
Validation loss = 0.06845424324274063
Validation loss = 0.06284230202436447
Validation loss = 0.06599801778793335
Validation loss = 0.06416881829500198
Validation loss = 0.06571939587593079
Validation loss = 0.06461266428232193
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08116786926984787
Validation loss = 0.07113102078437805
Validation loss = 0.06666985154151917
Validation loss = 0.06669750064611435
Validation loss = 0.06878577172756195
Validation loss = 0.06713151186704636
Validation loss = 0.06873992085456848
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07691089063882828
Validation loss = 0.06753788143396378
Validation loss = 0.06478215754032135
Validation loss = 0.06537290662527084
Validation loss = 0.07818702608346939
Validation loss = 0.06372508406639099
Validation loss = 0.06254228204488754
Validation loss = 0.06897549331188202
Validation loss = 0.06367801874876022
Validation loss = 0.06463643908500671
Validation loss = 0.06506074219942093
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 441
average number of affinization = 451.8256880733945
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 459
average number of affinization = 451.8909090909091
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 481
average number of affinization = 452.15315315315314
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 490
average number of affinization = 452.49107142857144
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 441
average number of affinization = 452.3893805309734
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 452.3070175438597
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.25e+03 |
| Iteration     | 17       |
| MaximumReturn | 1.7e+03  |
| MinimumReturn | 1.04e+03 |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06898194551467896
Validation loss = 0.06293976306915283
Validation loss = 0.06211121380329132
Validation loss = 0.06520271301269531
Validation loss = 0.06068592146039009
Validation loss = 0.06432399898767471
Validation loss = 0.061937130987644196
Validation loss = 0.07683344930410385
Validation loss = 0.061188727617263794
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07760384678840637
Validation loss = 0.06241640821099281
Validation loss = 0.06329290568828583
Validation loss = 0.06548228114843369
Validation loss = 0.06647209823131561
Validation loss = 0.06445100903511047
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07033498585224152
Validation loss = 0.06139930337667465
Validation loss = 0.06238905340433121
Validation loss = 0.062227122485637665
Validation loss = 0.06242804601788521
Validation loss = 0.06085396930575371
Validation loss = 0.07065460830926895
Validation loss = 0.06074000149965286
Validation loss = 0.05949319154024124
Validation loss = 0.06014850735664368
Validation loss = 0.06262585520744324
Validation loss = 0.06697595119476318
Validation loss = 0.06063926964998245
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0716152936220169
Validation loss = 0.06598151475191116
Validation loss = 0.06699191778898239
Validation loss = 0.06407580524682999
Validation loss = 0.06434755027294159
Validation loss = 0.06525515764951706
Validation loss = 0.06509632617235184
Validation loss = 0.07602646946907043
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07449076324701309
Validation loss = 0.061578430235385895
Validation loss = 0.06122832000255585
Validation loss = 0.0706973448395729
Validation loss = 0.061058513820171356
Validation loss = 0.06189228221774101
Validation loss = 0.06224045157432556
Validation loss = 0.06259555369615555
Validation loss = 0.06070440635085106
Validation loss = 0.061402250081300735
Validation loss = 0.07205132395029068
Validation loss = 0.05942510813474655
Validation loss = 0.06170285865664482
Validation loss = 0.061797142028808594
Validation loss = 0.060565441846847534
Validation loss = 0.05945894122123718
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 506
average number of affinization = 452.7739130434783
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 452.6551724137931
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 452.71794871794873
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 452.8813559322034
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 452.9159663865546
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 452.65833333333336
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.09e+03 |
| Iteration     | 18       |
| MaximumReturn | 1.58e+03 |
| MinimumReturn | 525      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07508071511983871
Validation loss = 0.060542576014995575
Validation loss = 0.060261569917201996
Validation loss = 0.05957301706075668
Validation loss = 0.060576193034648895
Validation loss = 0.05920899659395218
Validation loss = 0.05926458165049553
Validation loss = 0.05968155711889267
Validation loss = 0.06016973406076431
Validation loss = 0.05782274156808853
Validation loss = 0.057687900960445404
Validation loss = 0.05924967676401138
Validation loss = 0.05938296392560005
Validation loss = 0.05635697394609451
Validation loss = 0.05766286700963974
Validation loss = 0.06012529134750366
Validation loss = 0.0549754723906517
Validation loss = 0.057626426219940186
Validation loss = 0.06312467157840729
Validation loss = 0.05786191299557686
Validation loss = 0.05599856376647949
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07126269489526749
Validation loss = 0.0609595961868763
Validation loss = 0.05920860171318054
Validation loss = 0.060811616480350494
Validation loss = 0.06081972271203995
Validation loss = 0.05774429440498352
Validation loss = 0.05789146572351456
Validation loss = 0.06337400525808334
Validation loss = 0.058118294924497604
Validation loss = 0.057844750583171844
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06881888210773468
Validation loss = 0.060298871248960495
Validation loss = 0.05894564837217331
Validation loss = 0.060419440269470215
Validation loss = 0.060219742357730865
Validation loss = 0.05668562650680542
Validation loss = 0.0579792745411396
Validation loss = 0.06018554046750069
Validation loss = 0.056127697229385376
Validation loss = 0.05975337699055672
Validation loss = 0.05854790285229683
Validation loss = 0.058846212923526764
Validation loss = 0.05720409005880356
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07321008294820786
Validation loss = 0.06160203367471695
Validation loss = 0.060370463877916336
Validation loss = 0.060298167169094086
Validation loss = 0.06198909133672714
Validation loss = 0.06319983303546906
Validation loss = 0.05836727097630501
Validation loss = 0.06102187559008598
Validation loss = 0.0595749132335186
Validation loss = 0.060604680329561234
Validation loss = 0.06287568807601929
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06544721126556396
Validation loss = 0.058379124850034714
Validation loss = 0.057930827140808105
Validation loss = 0.05690736696124077
Validation loss = 0.0598551444709301
Validation loss = 0.05557515099644661
Validation loss = 0.05692432075738907
Validation loss = 0.05860119313001633
Validation loss = 0.058493100106716156
Validation loss = 0.056780628859996796
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 452.44628099173553
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 452.57377049180326
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 452.3983739837398
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 445
average number of affinization = 452.33870967741933
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 448
average number of affinization = 452.304
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 428
average number of affinization = 452.1111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.03e+03 |
| Iteration     | 19       |
| MaximumReturn | 1.65e+03 |
| MinimumReturn | 552      |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0646715760231018
Validation loss = 0.05678150802850723
Validation loss = 0.05805211886763573
Validation loss = 0.058937810361385345
Validation loss = 0.057871755212545395
Validation loss = 0.05537370219826698
Validation loss = 0.05663526803255081
Validation loss = 0.057306624948978424
Validation loss = 0.059233807027339935
Validation loss = 0.05549865588545799
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06845681369304657
Validation loss = 0.05965448543429375
Validation loss = 0.05420662835240364
Validation loss = 0.05834916606545448
Validation loss = 0.05675243213772774
Validation loss = 0.0573115348815918
Validation loss = 0.05524240806698799
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06547234952449799
Validation loss = 0.055222347378730774
Validation loss = 0.05779097601771355
Validation loss = 0.05954044684767723
Validation loss = 0.05898658558726311
Validation loss = 0.05454085022211075
Validation loss = 0.056724417954683304
Validation loss = 0.05776000767946243
Validation loss = 0.05735906958580017
Validation loss = 0.057082898914813995
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06684776395559311
Validation loss = 0.05871101841330528
Validation loss = 0.05728064104914665
Validation loss = 0.057194191962480545
Validation loss = 0.057099536061286926
Validation loss = 0.0610325001180172
Validation loss = 0.05988457053899765
Validation loss = 0.05563752353191376
Validation loss = 0.061820972710847855
Validation loss = 0.05542184039950371
Validation loss = 0.05651313439011574
Validation loss = 0.06097166985273361
Validation loss = 0.05636272951960564
Validation loss = 0.058646026998758316
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06535349786281586
Validation loss = 0.05621009320020676
Validation loss = 0.05997655913233757
Validation loss = 0.05458647757768631
Validation loss = 0.06331104040145874
Validation loss = 0.05684198811650276
Validation loss = 0.05798900127410889
Validation loss = 0.061119187623262405
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 438
average number of affinization = 452.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 451.703125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 369
average number of affinization = 451.06201550387595
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 282
average number of affinization = 449.76153846153846
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 450.1374045801527
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 437
average number of affinization = 450.0378787878788
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 636      |
| Iteration     | 20       |
| MaximumReturn | 1.51e+03 |
| MinimumReturn | -2.6e+03 |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05978360399603844
Validation loss = 0.054053451865911484
Validation loss = 0.05552159249782562
Validation loss = 0.05848491191864014
Validation loss = 0.05778668820858002
Validation loss = 0.053984496742486954
Validation loss = 0.057708024978637695
Validation loss = 0.05557561293244362
Validation loss = 0.053149688988924026
Validation loss = 0.06283832341432571
Validation loss = 0.05517688766121864
Validation loss = 0.054383765906095505
Validation loss = 0.05650833621621132
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06470703333616257
Validation loss = 0.05671536549925804
Validation loss = 0.05475948750972748
Validation loss = 0.05742044374346733
Validation loss = 0.0585116408765316
Validation loss = 0.05525243282318115
Validation loss = 0.05661287531256676
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06315784901380539
Validation loss = 0.0581723190844059
Validation loss = 0.05327915772795677
Validation loss = 0.055777356028556824
Validation loss = 0.054956186562776566
Validation loss = 0.05572374165058136
Validation loss = 0.05463444069027901
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0691620334982872
Validation loss = 0.05588700622320175
Validation loss = 0.05692268908023834
Validation loss = 0.05541922152042389
Validation loss = 0.05724109336733818
Validation loss = 0.05414031818509102
Validation loss = 0.053344834595918655
Validation loss = 0.058941975235939026
Validation loss = 0.054952479898929596
Validation loss = 0.05488032102584839
Validation loss = 0.05665387958288193
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06509038805961609
Validation loss = 0.054463647305965424
Validation loss = 0.05556662380695343
Validation loss = 0.06134367361664772
Validation loss = 0.05386766046285629
Validation loss = 0.060855064541101456
Validation loss = 0.05471843481063843
Validation loss = 0.05594586208462715
Validation loss = 0.060405924916267395
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 449.97744360902254
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 449.73880597014926
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 449.8740740740741
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 354
average number of affinization = 449.16911764705884
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 449.1240875912409
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 448.8840579710145
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.24e+03 |
| Iteration     | 21       |
| MaximumReturn | 1.64e+03 |
| MinimumReturn | 998      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0576920360326767
Validation loss = 0.053652483969926834
Validation loss = 0.054997656494379044
Validation loss = 0.05334684997797012
Validation loss = 0.06074078753590584
Validation loss = 0.05188203975558281
Validation loss = 0.05277738720178604
Validation loss = 0.057289253920316696
Validation loss = 0.05078115314245224
Validation loss = 0.05224459245800972
Validation loss = 0.054690439254045486
Validation loss = 0.051296085119247437
Validation loss = 0.05364571884274483
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0653781071305275
Validation loss = 0.0540759339928627
Validation loss = 0.05437285453081131
Validation loss = 0.05461660772562027
Validation loss = 0.056287992745637894
Validation loss = 0.05319288372993469
Validation loss = 0.055382903665304184
Validation loss = 0.05427612364292145
Validation loss = 0.05286874622106552
Validation loss = 0.05484689399600029
Validation loss = 0.051257453858852386
Validation loss = 0.05058686062693596
Validation loss = 0.053470462560653687
Validation loss = 0.052235838025808334
Validation loss = 0.060521870851516724
Validation loss = 0.0504610650241375
Validation loss = 0.052343159914016724
Validation loss = 0.054353512823581696
Validation loss = 0.05352193862199783
Validation loss = 0.055582884699106216
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06887420266866684
Validation loss = 0.05360671877861023
Validation loss = 0.052750363945961
Validation loss = 0.057749416679143906
Validation loss = 0.05190384015440941
Validation loss = 0.052569061517715454
Validation loss = 0.05758093670010567
Validation loss = 0.053832024335861206
Validation loss = 0.05285375565290451
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0667438730597496
Validation loss = 0.05946289002895355
Validation loss = 0.05278485640883446
Validation loss = 0.05273812636733055
Validation loss = 0.05421134829521179
Validation loss = 0.05482842028141022
Validation loss = 0.05274750292301178
Validation loss = 0.05929648503661156
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.060501862317323685
Validation loss = 0.05298883467912674
Validation loss = 0.05273101106286049
Validation loss = 0.054552000015974045
Validation loss = 0.059562258422374725
Validation loss = 0.053209222853183746
Validation loss = 0.05753396824002266
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 448.54676258992805
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 304
average number of affinization = 447.51428571428573
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 365
average number of affinization = 446.92907801418437
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 446.51408450704224
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 446.3776223776224
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 341
average number of affinization = 445.6458333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.59e+03 |
| Iteration     | 22       |
| MaximumReturn | 1.69e+03 |
| MinimumReturn | 1.51e+03 |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06520115584135056
Validation loss = 0.05203248932957649
Validation loss = 0.05196850374341011
Validation loss = 0.05823181942105293
Validation loss = 0.051893699914216995
Validation loss = 0.05067398026585579
Validation loss = 0.05423419550061226
Validation loss = 0.05186168849468231
Validation loss = 0.05051359534263611
Validation loss = 0.05115221440792084
Validation loss = 0.050009071826934814
Validation loss = 0.052885185927152634
Validation loss = 0.05194902420043945
Validation loss = 0.04990006610751152
Validation loss = 0.05032019689679146
Validation loss = 0.051557064056396484
Validation loss = 0.05014945566654205
Validation loss = 0.05565480515360832
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06029362604022026
Validation loss = 0.050493184477090836
Validation loss = 0.05183678865432739
Validation loss = 0.05318986251950264
Validation loss = 0.05214284732937813
Validation loss = 0.05088081955909729
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06503649801015854
Validation loss = 0.05101259425282478
Validation loss = 0.05247851833701134
Validation loss = 0.06406929343938828
Validation loss = 0.05555357411503792
Validation loss = 0.050900429487228394
Validation loss = 0.05504262074828148
Validation loss = 0.05332827940583229
Validation loss = 0.053082600235939026
Validation loss = 0.055238500237464905
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06497900933027267
Validation loss = 0.05336074158549309
Validation loss = 0.05309130623936653
Validation loss = 0.052565429359674454
Validation loss = 0.05173805356025696
Validation loss = 0.052705761045217514
Validation loss = 0.054116129875183105
Validation loss = 0.05264778435230255
Validation loss = 0.05946313962340355
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06591901928186417
Validation loss = 0.0514783151447773
Validation loss = 0.05295032262802124
Validation loss = 0.057105425745248795
Validation loss = 0.052877333015203476
Validation loss = 0.0525386817753315
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 445.24827586206897
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 445.013698630137
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 444.7482993197279
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 444.47972972972974
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 444.2751677852349
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 405
average number of affinization = 444.0133333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.57e+03 |
| Iteration     | 23       |
| MaximumReturn | 1.83e+03 |
| MinimumReturn | 770      |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06282702833414078
Validation loss = 0.05021609365940094
Validation loss = 0.04988038167357445
Validation loss = 0.05198175460100174
Validation loss = 0.05228246748447418
Validation loss = 0.05499684438109398
Validation loss = 0.054160814732313156
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06002318859100342
Validation loss = 0.051399655640125275
Validation loss = 0.05107109993696213
Validation loss = 0.054144199937582016
Validation loss = 0.049818091094493866
Validation loss = 0.050208333879709244
Validation loss = 0.06209436058998108
Validation loss = 0.04897717386484146
Validation loss = 0.04990163445472717
Validation loss = 0.052040062844753265
Validation loss = 0.049996983259916306
Validation loss = 0.049694422632455826
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07023889571428299
Validation loss = 0.05092334747314453
Validation loss = 0.05256087705492973
Validation loss = 0.05575675889849663
Validation loss = 0.05053562670946121
Validation loss = 0.050852857530117035
Validation loss = 0.05149756371974945
Validation loss = 0.0515998937189579
Validation loss = 0.056113794445991516
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06604510545730591
Validation loss = 0.051132235676050186
Validation loss = 0.05112867057323456
Validation loss = 0.051156047731637955
Validation loss = 0.051492877304553986
Validation loss = 0.04982103034853935
Validation loss = 0.05363106355071068
Validation loss = 0.05094633251428604
Validation loss = 0.05477011203765869
Validation loss = 0.050873443484306335
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06802914291620255
Validation loss = 0.052368201315402985
Validation loss = 0.05012832209467888
Validation loss = 0.05305064842104912
Validation loss = 0.05144641920924187
Validation loss = 0.05139634758234024
Validation loss = 0.055106792598962784
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 357
average number of affinization = 443.4370860927152
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 442.7763157894737
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 442.640522875817
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 350
average number of affinization = 442.038961038961
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 447
average number of affinization = 442.0709677419355
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 390
average number of affinization = 441.7371794871795
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.59e+03 |
| Iteration     | 24       |
| MaximumReturn | 1.96e+03 |
| MinimumReturn | 862      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05732186138629913
Validation loss = 0.04973727464675903
Validation loss = 0.05212579295039177
Validation loss = 0.0512414425611496
Validation loss = 0.0511871837079525
Validation loss = 0.050275079905986786
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05830218642950058
Validation loss = 0.04935409501194954
Validation loss = 0.0523703508079052
Validation loss = 0.0496642030775547
Validation loss = 0.04968142509460449
Validation loss = 0.05155409127473831
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06338362395763397
Validation loss = 0.0507708378136158
Validation loss = 0.05088049918413162
Validation loss = 0.05358709767460823
Validation loss = 0.051415152847766876
Validation loss = 0.05094626918435097
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05988682806491852
Validation loss = 0.0530179925262928
Validation loss = 0.05043787509202957
Validation loss = 0.05665818601846695
Validation loss = 0.05141856521368027
Validation loss = 0.051758505403995514
Validation loss = 0.052808281034231186
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06256408244371414
Validation loss = 0.05086468160152435
Validation loss = 0.0550423189997673
Validation loss = 0.05174422636628151
Validation loss = 0.05214521661400795
Validation loss = 0.05356786400079727
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 371
average number of affinization = 441.28662420382165
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 420
average number of affinization = 441.1518987341772
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 440.82389937106916
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 364
average number of affinization = 440.34375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 440.27950310559004
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 440.1666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.36e+03 |
| Iteration     | 25       |
| MaximumReturn | 1.83e+03 |
| MinimumReturn | 569      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06135249882936478
Validation loss = 0.04954148083925247
Validation loss = 0.05304625630378723
Validation loss = 0.05445760115981102
Validation loss = 0.04866969957947731
Validation loss = 0.05563459172844887
Validation loss = 0.05044594034552574
Validation loss = 0.05020750313997269
Validation loss = 0.05196237564086914
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06774909049272537
Validation loss = 0.05017685145139694
Validation loss = 0.04994343966245651
Validation loss = 0.05175292491912842
Validation loss = 0.04879753664135933
Validation loss = 0.05396755039691925
Validation loss = 0.04827906936407089
Validation loss = 0.04807573929429054
Validation loss = 0.05386408790946007
Validation loss = 0.047814540565013885
Validation loss = 0.0486924946308136
Validation loss = 0.050724275410175323
Validation loss = 0.04771297425031662
Validation loss = 0.048789504915475845
Validation loss = 0.0487014539539814
Validation loss = 0.048482272773981094
Validation loss = 0.04906179755926132
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.060614991933107376
Validation loss = 0.05186370015144348
Validation loss = 0.051428694278001785
Validation loss = 0.05026793107390404
Validation loss = 0.0502786822617054
Validation loss = 0.051091089844703674
Validation loss = 0.04942013695836067
Validation loss = 0.051345665007829666
Validation loss = 0.050631385296583176
Validation loss = 0.05226787552237511
Validation loss = 0.050751492381095886
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06571713835000992
Validation loss = 0.04989108443260193
Validation loss = 0.050685055553913116
Validation loss = 0.05690672993659973
Validation loss = 0.05049967020750046
Validation loss = 0.05373438075184822
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05745730549097061
Validation loss = 0.05164835602045059
Validation loss = 0.051524631679058075
Validation loss = 0.05056310072541237
Validation loss = 0.05642540380358696
Validation loss = 0.051264580339193344
Validation loss = 0.05461423099040985
Validation loss = 0.05052376538515091
Validation loss = 0.05663003772497177
Validation loss = 0.05256958305835724
Validation loss = 0.052391648292541504
Validation loss = 0.05394727736711502
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 369
average number of affinization = 439.73006134969324
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 347
average number of affinization = 439.1646341463415
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 332
average number of affinization = 438.5151515151515
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 438.4096385542169
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 382
average number of affinization = 438.07185628742513
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 353
average number of affinization = 437.5654761904762
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.9e+03  |
| Iteration     | 26       |
| MaximumReturn | 2.22e+03 |
| MinimumReturn | 1.26e+03 |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06239965930581093
Validation loss = 0.049605853855609894
Validation loss = 0.0497550442814827
Validation loss = 0.04990953952074051
Validation loss = 0.05290530249476433
Validation loss = 0.04948677495121956
Validation loss = 0.06847300380468369
Validation loss = 0.049805063754320145
Validation loss = 0.04918210953474045
Validation loss = 0.05271042510867119
Validation loss = 0.048865992575883865
Validation loss = 0.060366928577423096
Validation loss = 0.04842118173837662
Validation loss = 0.05176660045981407
Validation loss = 0.050614532083272934
Validation loss = 0.05236639454960823
Validation loss = 0.0512559674680233
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06273657828569412
Validation loss = 0.04765874147415161
Validation loss = 0.053280316293239594
Validation loss = 0.04705801233649254
Validation loss = 0.04772258177399635
Validation loss = 0.052517957985401154
Validation loss = 0.04650171473622322
Validation loss = 0.049658358097076416
Validation loss = 0.04805784672498703
Validation loss = 0.04891025647521019
Validation loss = 0.05050428584218025
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06293617188930511
Validation loss = 0.04938289523124695
Validation loss = 0.051378410309553146
Validation loss = 0.05213044211268425
Validation loss = 0.04919222369790077
Validation loss = 0.0492500439286232
Validation loss = 0.05018392577767372
Validation loss = 0.05816609412431717
Validation loss = 0.04828206077218056
Validation loss = 0.04869632050395012
Validation loss = 0.06124823912978172
Validation loss = 0.048155851662158966
Validation loss = 0.05041778087615967
Validation loss = 0.05403737351298332
Validation loss = 0.04704898223280907
Validation loss = 0.04883774742484093
Validation loss = 0.05118802934885025
Validation loss = 0.04824924096465111
Validation loss = 0.04900234565138817
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06004077568650246
Validation loss = 0.04981869459152222
Validation loss = 0.04892611503601074
Validation loss = 0.04991823062300682
Validation loss = 0.05014343932271004
Validation loss = 0.05172242596745491
Validation loss = 0.051385726779699326
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06043222174048424
Validation loss = 0.04918694496154785
Validation loss = 0.052716340869665146
Validation loss = 0.05118569731712341
Validation loss = 0.051021069288253784
Validation loss = 0.0547180138528347
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 361
average number of affinization = 437.11242603550295
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 291
average number of affinization = 436.2529411764706
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 436.12280701754383
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 340
average number of affinization = 435.5639534883721
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 346
average number of affinization = 435.04624277456645
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 434.87931034482756
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.55e+03 |
| Iteration     | 27       |
| MaximumReturn | 2.37e+03 |
| MinimumReturn | -951     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05206412449479103
Validation loss = 0.05129065737128258
Validation loss = 0.04851881042122841
Validation loss = 0.0557468980550766
Validation loss = 0.04922325164079666
Validation loss = 0.05221061035990715
Validation loss = 0.04821845516562462
Validation loss = 0.050949521362781525
Validation loss = 0.0544089637696743
Validation loss = 0.04838354513049126
Validation loss = 0.0486961230635643
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05242856219410896
Validation loss = 0.047399137169122696
Validation loss = 0.05058060958981514
Validation loss = 0.047157205641269684
Validation loss = 0.0491362027823925
Validation loss = 0.0536368191242218
Validation loss = 0.04683307185769081
Validation loss = 0.049076054245233536
Validation loss = 0.047735437750816345
Validation loss = 0.04609573632478714
Validation loss = 0.04926536977291107
Validation loss = 0.04682066664099693
Validation loss = 0.045980293303728104
Validation loss = 0.05149121955037117
Validation loss = 0.04621443524956703
Validation loss = 0.046092696487903595
Validation loss = 0.046861402690410614
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05535389110445976
Validation loss = 0.047171205282211304
Validation loss = 0.048589758574962616
Validation loss = 0.047513727098703384
Validation loss = 0.04775981605052948
Validation loss = 0.049565982073545456
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.056116849184036255
Validation loss = 0.04808150604367256
Validation loss = 0.054091788828372955
Validation loss = 0.05481835827231407
Validation loss = 0.04853475093841553
Validation loss = 0.04934873804450035
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05711955204606056
Validation loss = 0.04944391921162605
Validation loss = 0.05005599185824394
Validation loss = 0.06340595334768295
Validation loss = 0.050216998904943466
Validation loss = 0.052223723381757736
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 434.7885714285714
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 434.4431818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 434.6045197740113
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 369
average number of affinization = 434.23595505617976
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 338
average number of affinization = 433.69832402234636
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 340
average number of affinization = 433.1777777777778
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.17e+03 |
| Iteration     | 28       |
| MaximumReturn | 2.43e+03 |
| MinimumReturn | 2e+03    |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05647259205579758
Validation loss = 0.04846087470650673
Validation loss = 0.04755367711186409
Validation loss = 0.04842479154467583
Validation loss = 0.051582083106040955
Validation loss = 0.0467720702290535
Validation loss = 0.04998987913131714
Validation loss = 0.0485210083425045
Validation loss = 0.050170231610536575
Validation loss = 0.04962420091032982
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05108834430575371
Validation loss = 0.04616823419928551
Validation loss = 0.04794815555214882
Validation loss = 0.04547671228647232
Validation loss = 0.05039132013916969
Validation loss = 0.04594863951206207
Validation loss = 0.04851776361465454
Validation loss = 0.04595733433961868
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.056560445576906204
Validation loss = 0.04706764593720436
Validation loss = 0.04686480015516281
Validation loss = 0.04707493260502815
Validation loss = 0.04728963226079941
Validation loss = 0.05873090401291847
Validation loss = 0.048202596604824066
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05827971547842026
Validation loss = 0.04905395582318306
Validation loss = 0.048031218349933624
Validation loss = 0.05007470026612282
Validation loss = 0.052190106362104416
Validation loss = 0.05310506001114845
Validation loss = 0.04630758985877037
Validation loss = 0.0481596365571022
Validation loss = 0.059064969420433044
Validation loss = 0.04614581912755966
Validation loss = 0.050142545253038406
Validation loss = 0.047978390008211136
Validation loss = 0.04715780168771744
Validation loss = 0.051413800567388535
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05445624142885208
Validation loss = 0.04849587380886078
Validation loss = 0.06074221059679985
Validation loss = 0.05003988370299339
Validation loss = 0.04819415882229805
Validation loss = 0.05248292163014412
Validation loss = 0.048858750611543655
Validation loss = 0.05256200209259987
Validation loss = 0.04913434013724327
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 391
average number of affinization = 432.9447513812155
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 393
average number of affinization = 432.72527472527474
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 432.40437158469945
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 361
average number of affinization = 432.01630434782606
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 431.772972972973
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 413
average number of affinization = 431.6720430107527
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.79e+03 |
| Iteration     | 29       |
| MaximumReturn | 1.89e+03 |
| MinimumReturn | 1.72e+03 |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07085934281349182
Validation loss = 0.04705093428492546
Validation loss = 0.04982243850827217
Validation loss = 0.05065034329891205
Validation loss = 0.047851063311100006
Validation loss = 0.04889373481273651
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05165786296129227
Validation loss = 0.04588698595762253
Validation loss = 0.0484200045466423
Validation loss = 0.045545633882284164
Validation loss = 0.05156799405813217
Validation loss = 0.045322347432374954
Validation loss = 0.04629477486014366
Validation loss = 0.04613538086414337
Validation loss = 0.047428835183382034
Validation loss = 0.04716303572058678
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05330824851989746
Validation loss = 0.05027107894420624
Validation loss = 0.049553513526916504
Validation loss = 0.04740925878286362
Validation loss = 0.05525602027773857
Validation loss = 0.04771868884563446
Validation loss = 0.05609729513525963
Validation loss = 0.04668092727661133
Validation loss = 0.05316179245710373
Validation loss = 0.04731173813343048
Validation loss = 0.045965854078531265
Validation loss = 0.05114208534359932
Validation loss = 0.048538971692323685
Validation loss = 0.046444427222013474
Validation loss = 0.049742113798856735
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05732712522149086
Validation loss = 0.04691372811794281
Validation loss = 0.047518059611320496
Validation loss = 0.05528058484196663
Validation loss = 0.04625464603304863
Validation loss = 0.05354667082428932
Validation loss = 0.045561615377664566
Validation loss = 0.04758892208337784
Validation loss = 0.05081675201654434
Validation loss = 0.04607127979397774
Validation loss = 0.051149144768714905
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.052305348217487335
Validation loss = 0.05362774431705475
Validation loss = 0.050940439105033875
Validation loss = 0.05398162454366684
Validation loss = 0.04973043128848076
Validation loss = 0.04840338975191116
Validation loss = 0.04903154447674751
Validation loss = 0.04956439882516861
Validation loss = 0.04741063714027405
Validation loss = 0.05130496248602867
Validation loss = 0.05611128732562065
Validation loss = 0.048264361917972565
Validation loss = 0.0514882393181324
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 434
average number of affinization = 431.6844919786096
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 431.4574468085106
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 431.0952380952381
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 431.0157894736842
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 366
average number of affinization = 430.67539267015707
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 381
average number of affinization = 430.4166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.83e+03 |
| Iteration     | 30       |
| MaximumReturn | 1.87e+03 |
| MinimumReturn | 1.78e+03 |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05256994813680649
Validation loss = 0.04732280969619751
Validation loss = 0.04726947098970413
Validation loss = 0.05035218596458435
Validation loss = 0.047479912638664246
Validation loss = 0.05647451430559158
Validation loss = 0.046868957579135895
Validation loss = 0.05294734239578247
Validation loss = 0.046843692660331726
Validation loss = 0.046678327023983
Validation loss = 0.04795192927122116
Validation loss = 0.047074902802705765
Validation loss = 0.04800998419523239
Validation loss = 0.045899491757154465
Validation loss = 0.05003665015101433
Validation loss = 0.04642825946211815
Validation loss = 0.046564556658267975
Validation loss = 0.05281221866607666
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04870295524597168
Validation loss = 0.044453639537096024
Validation loss = 0.05056719481945038
Validation loss = 0.04582434520125389
Validation loss = 0.045947227627038956
Validation loss = 0.0503823421895504
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.055245544761419296
Validation loss = 0.04604732617735863
Validation loss = 0.046661607921123505
Validation loss = 0.04945225268602371
Validation loss = 0.04942280799150467
Validation loss = 0.047136932611465454
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0497683584690094
Validation loss = 0.04562458395957947
Validation loss = 0.04954783618450165
Validation loss = 0.0554833710193634
Validation loss = 0.0458759069442749
Validation loss = 0.04862906038761139
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05875158682465553
Validation loss = 0.04744835942983627
Validation loss = 0.04741606488823891
Validation loss = 0.049939192831516266
Validation loss = 0.04810120910406113
Validation loss = 0.04935498535633087
Validation loss = 0.04729441553354263
Validation loss = 0.0474383644759655
Validation loss = 0.050275567919015884
Validation loss = 0.04857052117586136
Validation loss = 0.0514911413192749
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 404
average number of affinization = 430.27979274611397
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 430.1340206185567
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 376
average number of affinization = 429.85641025641024
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 429.7857142857143
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 429.74619289340103
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 429.55555555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.92e+03 |
| Iteration     | 31       |
| MaximumReturn | 2e+03    |
| MinimumReturn | 1.84e+03 |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.053955256938934326
Validation loss = 0.047646909952163696
Validation loss = 0.04627681151032448
Validation loss = 0.049843207001686096
Validation loss = 0.045872677117586136
Validation loss = 0.04743664711713791
Validation loss = 0.048360325396060944
Validation loss = 0.04713231325149536
Validation loss = 0.046184126287698746
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04896477609872818
Validation loss = 0.04685273766517639
Validation loss = 0.04528122395277023
Validation loss = 0.04782445356249809
Validation loss = 0.0447494275867939
Validation loss = 0.04837637394666672
Validation loss = 0.04492087662220001
Validation loss = 0.04673900827765465
Validation loss = 0.045819640159606934
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05240055173635483
Validation loss = 0.04738185927271843
Validation loss = 0.05160123482346535
Validation loss = 0.047918785363435745
Validation loss = 0.04872963950037956
Validation loss = 0.04820317402482033
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.051513586193323135
Validation loss = 0.04604954272508621
Validation loss = 0.06140642613172531
Validation loss = 0.04620620608329773
Validation loss = 0.054339390248060226
Validation loss = 0.0472663938999176
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.051387809216976166
Validation loss = 0.04716156795620918
Validation loss = 0.050466496497392654
Validation loss = 0.0500328503549099
Validation loss = 0.04799605906009674
Validation loss = 0.05065589025616646
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 429.24120603015075
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 348
average number of affinization = 428.835
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 428.636815920398
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 352
average number of affinization = 428.25742574257424
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 428.45320197044333
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 386
average number of affinization = 428.2450980392157
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.04e+03 |
| Iteration     | 32       |
| MaximumReturn | 2.51e+03 |
| MinimumReturn | 1.23e+03 |
| TotalSamples  | 136000   |
----------------------------
