Logging to experiments/hopper/nov1/w350e3_seed2431
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7614381313323975
Validation loss = 0.6750555634498596
Validation loss = 0.6541881561279297
Validation loss = 0.6755111813545227
Validation loss = 0.6771447658538818
Validation loss = 0.6943888664245605
Validation loss = 0.7124286890029907
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7845127582550049
Validation loss = 0.6708688139915466
Validation loss = 0.6471453905105591
Validation loss = 0.6519099473953247
Validation loss = 0.6701761484146118
Validation loss = 0.68464595079422
Validation loss = 0.6961104273796082
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8110358715057373
Validation loss = 0.6739037036895752
Validation loss = 0.6564823389053345
Validation loss = 0.6618465185165405
Validation loss = 0.6637000441551208
Validation loss = 0.6942976713180542
Validation loss = 0.7039395570755005
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7894020080566406
Validation loss = 0.6760067939758301
Validation loss = 0.6533952951431274
Validation loss = 0.6607834100723267
Validation loss = 0.6737174987792969
Validation loss = 0.6787084341049194
Validation loss = 0.6958729028701782
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8082912564277649
Validation loss = 0.6762019395828247
Validation loss = 0.6570412516593933
Validation loss = 0.6557365655899048
Validation loss = 0.6670486927032471
Validation loss = 0.6736133694648743
Validation loss = 0.697729229927063
Validation loss = 0.7416926622390747
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 56.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 385
average number of affinization = 97.125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 360
average number of affinization = 126.33333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 393
average number of affinization = 153.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 176.8181818181818
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 374
average number of affinization = 193.25
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.25e+03 |
| Iteration     | 0         |
| MaximumReturn | -2.12e+03 |
| MinimumReturn | -2.35e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6623731851577759
Validation loss = 0.5993093848228455
Validation loss = 0.6105935573577881
Validation loss = 0.6067955493927002
Validation loss = 0.627802312374115
Validation loss = 0.6412166357040405
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6653473377227783
Validation loss = 0.597522497177124
Validation loss = 0.5947391390800476
Validation loss = 0.6236976385116577
Validation loss = 0.6241814494132996
Validation loss = 0.647814154624939
Validation loss = 0.6701255440711975
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6613767147064209
Validation loss = 0.6063666939735413
Validation loss = 0.6012438535690308
Validation loss = 0.6062946319580078
Validation loss = 0.6329150795936584
Validation loss = 0.6444884538650513
Validation loss = 0.6656628251075745
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.655164361000061
Validation loss = 0.599045991897583
Validation loss = 0.6181811094284058
Validation loss = 0.6110247373580933
Validation loss = 0.630412757396698
Validation loss = 0.6445974111557007
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6682244539260864
Validation loss = 0.5949070453643799
Validation loss = 0.602692186832428
Validation loss = 0.6301403045654297
Validation loss = 0.6320182681083679
Validation loss = 0.6568459272384644
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 638
average number of affinization = 227.46153846153845
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 609
average number of affinization = 254.71428571428572
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 650
average number of affinization = 281.06666666666666
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 620
average number of affinization = 302.25
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 321.52941176470586
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 665
average number of affinization = 340.6111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.72e+03 |
| Iteration     | 1         |
| MaximumReturn | -2.58e+03 |
| MinimumReturn | -2.9e+03  |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6189222931861877
Validation loss = 0.6224760413169861
Validation loss = 0.648792564868927
Validation loss = 0.6656205654144287
Validation loss = 0.6795278191566467
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6375466585159302
Validation loss = 0.6502547860145569
Validation loss = 0.6642272472381592
Validation loss = 0.6783413290977478
Validation loss = 0.6846309304237366
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.622555673122406
Validation loss = 0.6495018601417542
Validation loss = 0.6636263132095337
Validation loss = 0.6724957823753357
Validation loss = 0.6996312737464905
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6121380925178528
Validation loss = 0.6377245187759399
Validation loss = 0.6528910994529724
Validation loss = 0.6671416759490967
Validation loss = 0.678464412689209
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6299755573272705
Validation loss = 0.6335693001747131
Validation loss = 0.6537616848945618
Validation loss = 0.6818575859069824
Validation loss = 0.6905204653739929
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 325
average number of affinization = 339.7894736842105
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 342.9
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 412
average number of affinization = 346.1904761904762
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 350
average number of affinization = 346.3636363636364
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 353
average number of affinization = 346.6521739130435
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 346.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.91e+03 |
| Iteration     | 2         |
| MaximumReturn | -2.62e+03 |
| MinimumReturn | -3.09e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6368565559387207
Validation loss = 0.6633950471878052
Validation loss = 0.6769331693649292
Validation loss = 0.6986348628997803
Validation loss = 0.7204599380493164
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6482557058334351
Validation loss = 0.6750867366790771
Validation loss = 0.6802754402160645
Validation loss = 0.6920816898345947
Validation loss = 0.7065504789352417
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6427968740463257
Validation loss = 0.6688925623893738
Validation loss = 0.6926146745681763
Validation loss = 0.6979735493659973
Validation loss = 0.7098513841629028
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6404671669006348
Validation loss = 0.6631969213485718
Validation loss = 0.6821371912956238
Validation loss = 0.692886471748352
Validation loss = 0.7100957036018372
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6437612175941467
Validation loss = 0.6682848930358887
Validation loss = 0.6799956560134888
Validation loss = 0.7002012729644775
Validation loss = 0.7010427713394165
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 122
average number of affinization = 337.52
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 347
average number of affinization = 337.88461538461536
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 359
average number of affinization = 338.6666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 368
average number of affinization = 339.7142857142857
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 148
average number of affinization = 333.1034482758621
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 236
average number of affinization = 329.8666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.36e+03 |
| Iteration     | 3         |
| MaximumReturn | -1.25e+03 |
| MinimumReturn | -3.23e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6647417545318604
Validation loss = 0.7022892236709595
Validation loss = 0.7243069410324097
Validation loss = 0.7265925407409668
Validation loss = 0.7365061044692993
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6807910203933716
Validation loss = 0.6930862665176392
Validation loss = 0.713166356086731
Validation loss = 0.716644823551178
Validation loss = 0.7216731309890747
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6928410530090332
Validation loss = 0.7101526856422424
Validation loss = 0.7249599695205688
Validation loss = 0.7262353301048279
Validation loss = 0.7349343299865723
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.681054413318634
Validation loss = 0.7099676728248596
Validation loss = 0.7171013951301575
Validation loss = 0.7304147481918335
Validation loss = 0.7388324737548828
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6713939309120178
Validation loss = 0.7108047008514404
Validation loss = 0.7160599827766418
Validation loss = 0.7200115919113159
Validation loss = 0.7330675721168518
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 255
average number of affinization = 327.4516129032258
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 654
average number of affinization = 337.65625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 550
average number of affinization = 344.09090909090907
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 526
average number of affinization = 349.44117647058823
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 607
average number of affinization = 356.8
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 510
average number of affinization = 361.05555555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.33e+03 |
| Iteration     | 4         |
| MaximumReturn | -1.25e+03 |
| MinimumReturn | -2.92e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7104249596595764
Validation loss = 0.7203044295310974
Validation loss = 0.7338242530822754
Validation loss = 0.7527382969856262
Validation loss = 0.7530391812324524
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6997129917144775
Validation loss = 0.7220659852027893
Validation loss = 0.7377488017082214
Validation loss = 0.7415590882301331
Validation loss = 0.744602620601654
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7119211554527283
Validation loss = 0.7290282249450684
Validation loss = 0.7353963255882263
Validation loss = 0.7436034679412842
Validation loss = 0.7498674392700195
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7128903865814209
Validation loss = 0.7316009998321533
Validation loss = 0.7378926277160645
Validation loss = 0.7541646361351013
Validation loss = 0.7579582333564758
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7171253561973572
Validation loss = 0.7388386726379395
Validation loss = 0.7384073138237
Validation loss = 0.7518966197967529
Validation loss = 0.7633646130561829
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 550
average number of affinization = 366.1621621621622
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 617
average number of affinization = 372.7631578947368
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 475
average number of affinization = 375.38461538461536
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 656
average number of affinization = 382.4
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 493
average number of affinization = 385.0975609756098
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 503
average number of affinization = 387.9047619047619
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.9e+03  |
| Iteration     | 5         |
| MaximumReturn | -2.68e+03 |
| MinimumReturn | -3.03e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6979524493217468
Validation loss = 0.7028759121894836
Validation loss = 0.716280460357666
Validation loss = 0.7135393023490906
Validation loss = 0.7190153002738953
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6900584101676941
Validation loss = 0.6975790858268738
Validation loss = 0.7009596228599548
Validation loss = 0.7106571793556213
Validation loss = 0.7146565318107605
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6873885989189148
Validation loss = 0.6957198977470398
Validation loss = 0.7061576843261719
Validation loss = 0.705741822719574
Validation loss = 0.7147158980369568
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6949625611305237
Validation loss = 0.6967015266418457
Validation loss = 0.7084529995918274
Validation loss = 0.7146041989326477
Validation loss = 0.7214564085006714
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6970335841178894
Validation loss = 0.7148661613464355
Validation loss = 0.7109774351119995
Validation loss = 0.7130249738693237
Validation loss = 0.7296249270439148
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 742
average number of affinization = 396.13953488372096
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 644
average number of affinization = 401.77272727272725
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 507
average number of affinization = 404.1111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 355
average number of affinization = 403.04347826086956
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 764
average number of affinization = 410.72340425531917
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 766
average number of affinization = 418.125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.79e+03 |
| Iteration     | 6         |
| MaximumReturn | -1.24e+03 |
| MinimumReturn | -2.07e+03 |
| TotalSamples  | 32000     |
-----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7067853212356567
Validation loss = 0.7204654812812805
Validation loss = 0.7402267456054688
Validation loss = 0.7364807724952698
Validation loss = 0.7555129528045654
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7162607908248901
Validation loss = 0.7217919230461121
Validation loss = 0.72871994972229
Validation loss = 0.7356157302856445
Validation loss = 0.7463250160217285
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7186149954795837
Validation loss = 0.7220280170440674
Validation loss = 0.7365932464599609
Validation loss = 0.7453373670578003
Validation loss = 0.7425332069396973
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7019776105880737
Validation loss = 0.725794792175293
Validation loss = 0.7365908026695251
Validation loss = 0.7410023808479309
Validation loss = 0.745971143245697
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7093570232391357
Validation loss = 0.7303177118301392
Validation loss = 0.7419939637184143
Validation loss = 0.7455859184265137
Validation loss = 0.7480282783508301
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 669
average number of affinization = 423.2448979591837
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 664
average number of affinization = 428.06
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 615
average number of affinization = 431.72549019607845
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 435.96153846153845
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 633
average number of affinization = 439.6792452830189
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 626
average number of affinization = 443.1296296296296
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.49e+03 |
| Iteration     | 7         |
| MaximumReturn | -1.17e+03 |
| MinimumReturn | -1.87e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7167337536811829
Validation loss = 0.7247595191001892
Validation loss = 0.7262493968009949
Validation loss = 0.7421309351921082
Validation loss = 0.7443748712539673
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7077964544296265
Validation loss = 0.7128587365150452
Validation loss = 0.721899151802063
Validation loss = 0.7305812239646912
Validation loss = 0.7398144602775574
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7046115398406982
Validation loss = 0.7150217294692993
Validation loss = 0.7249882221221924
Validation loss = 0.7301690578460693
Validation loss = 0.7420663833618164
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.704035758972168
Validation loss = 0.7225468754768372
Validation loss = 0.7373647689819336
Validation loss = 0.7314150929450989
Validation loss = 0.7445711493492126
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7085522413253784
Validation loss = 0.7260009050369263
Validation loss = 0.7360661625862122
Validation loss = 0.7442362904548645
Validation loss = 0.7526269555091858
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 531
average number of affinization = 444.72727272727275
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 552
average number of affinization = 446.64285714285717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 563
average number of affinization = 448.6842105263158
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 574
average number of affinization = 450.8448275862069
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 594
average number of affinization = 453.271186440678
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 539
average number of affinization = 454.7
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -586      |
| Iteration     | 8         |
| MaximumReturn | 56.9      |
| MinimumReturn | -1.08e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6823550462722778
Validation loss = 0.6952460408210754
Validation loss = 0.7105351686477661
Validation loss = 0.7084893584251404
Validation loss = 0.7176386713981628
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6714705228805542
Validation loss = 0.6870851516723633
Validation loss = 0.6986898183822632
Validation loss = 0.7057276964187622
Validation loss = 0.7073599100112915
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6855686902999878
Validation loss = 0.6864906549453735
Validation loss = 0.7008395195007324
Validation loss = 0.70292067527771
Validation loss = 0.7097104787826538
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.691473126411438
Validation loss = 0.6916629076004028
Validation loss = 0.6988770365715027
Validation loss = 0.7061353325843811
Validation loss = 0.7170383930206299
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6949561834335327
Validation loss = 0.7032850980758667
Validation loss = 0.7052124738693237
Validation loss = 0.7129117250442505
Validation loss = 0.7222664952278137
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 572
average number of affinization = 456.62295081967216
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 610
average number of affinization = 459.0967741935484
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 665
average number of affinization = 462.36507936507934
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 645
average number of affinization = 465.21875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 576
average number of affinization = 466.9230769230769
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 571
average number of affinization = 468.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -461      |
| Iteration     | 9         |
| MaximumReturn | 327       |
| MinimumReturn | -1.46e+03 |
| TotalSamples  | 44000     |
-----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6815616488456726
Validation loss = 0.6776150465011597
Validation loss = 0.6865651607513428
Validation loss = 0.6909146308898926
Validation loss = 0.6989099979400635
Validation loss = 0.7077220678329468
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6688312888145447
Validation loss = 0.6743811964988708
Validation loss = 0.6826248168945312
Validation loss = 0.6871669292449951
Validation loss = 0.6884806752204895
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6694069504737854
Validation loss = 0.6738364696502686
Validation loss = 0.6801102757453918
Validation loss = 0.6913761496543884
Validation loss = 0.6850566864013672
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6649893522262573
Validation loss = 0.6796316504478455
Validation loss = 0.682557225227356
Validation loss = 0.6907441020011902
Validation loss = 0.6974427103996277
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6835203170776367
Validation loss = 0.6866630911827087
Validation loss = 0.6864011883735657
Validation loss = 0.692711591720581
Validation loss = 0.6973066329956055
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 615
average number of affinization = 470.6865671641791
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 575
average number of affinization = 472.22058823529414
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 586
average number of affinization = 473.8695652173913
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 539
average number of affinization = 474.8
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 590
average number of affinization = 476.4225352112676
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 628
average number of affinization = 478.52777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -811      |
| Iteration     | 10        |
| MaximumReturn | -217      |
| MinimumReturn | -1.72e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6496419310569763
Validation loss = 0.6490841507911682
Validation loss = 0.6508989334106445
Validation loss = 0.6613734364509583
Validation loss = 0.6665733456611633
Validation loss = 0.665213942527771
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6469491720199585
Validation loss = 0.6455596089363098
Validation loss = 0.6583842039108276
Validation loss = 0.6497825980186462
Validation loss = 0.6556100845336914
Validation loss = 0.6622347235679626
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6492576003074646
Validation loss = 0.6512636542320251
Validation loss = 0.6510229706764221
Validation loss = 0.6531884670257568
Validation loss = 0.6637487411499023
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6545999050140381
Validation loss = 0.6506249308586121
Validation loss = 0.658410370349884
Validation loss = 0.6619930863380432
Validation loss = 0.6610437035560608
Validation loss = 0.6681563258171082
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6574907302856445
Validation loss = 0.6562135815620422
Validation loss = 0.6581274271011353
Validation loss = 0.6612024307250977
Validation loss = 0.6646676659584045
Validation loss = 0.6678898930549622
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 617
average number of affinization = 480.4246575342466
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 600
average number of affinization = 482.0405405405405
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 570
average number of affinization = 483.2133333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 602
average number of affinization = 484.7763157894737
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 486.6623376623377
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 647
average number of affinization = 488.71794871794873
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -663      |
| Iteration     | 11        |
| MaximumReturn | -131      |
| MinimumReturn | -1.54e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6584147810935974
Validation loss = 0.6455208659172058
Validation loss = 0.6492636799812317
Validation loss = 0.662856936454773
Validation loss = 0.6628601551055908
Validation loss = 0.6658000946044922
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6485013365745544
Validation loss = 0.6413356065750122
Validation loss = 0.6501041054725647
Validation loss = 0.6542546153068542
Validation loss = 0.6550121903419495
Validation loss = 0.6559250354766846
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6494778990745544
Validation loss = 0.6451572775840759
Validation loss = 0.6469539403915405
Validation loss = 0.6532391905784607
Validation loss = 0.6533385515213013
Validation loss = 0.6571049094200134
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6522591710090637
Validation loss = 0.6481606960296631
Validation loss = 0.6522048115730286
Validation loss = 0.6576438546180725
Validation loss = 0.6616641283035278
Validation loss = 0.6605721712112427
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6548011302947998
Validation loss = 0.6525008082389832
Validation loss = 0.6566126942634583
Validation loss = 0.6585369110107422
Validation loss = 0.6667839288711548
Validation loss = 0.6676909327507019
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 666
average number of affinization = 490.9620253164557
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 658
average number of affinization = 493.05
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 698
average number of affinization = 495.58024691358025
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 605
average number of affinization = 496.9146341463415
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 648
average number of affinization = 498.73493975903614
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 657
average number of affinization = 500.6190476190476
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -645      |
| Iteration     | 12        |
| MaximumReturn | 284       |
| MinimumReturn | -1.31e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6390158534049988
Validation loss = 0.6369993090629578
Validation loss = 0.6382089257240295
Validation loss = 0.6406236886978149
Validation loss = 0.6431654691696167
Validation loss = 0.6496681571006775
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6301803588867188
Validation loss = 0.6288554072380066
Validation loss = 0.6346387267112732
Validation loss = 0.6322535276412964
Validation loss = 0.6373506188392639
Validation loss = 0.6412110328674316
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6359413862228394
Validation loss = 0.6262454390525818
Validation loss = 0.6316098570823669
Validation loss = 0.6359478831291199
Validation loss = 0.6398190259933472
Validation loss = 0.6443461775779724
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6367373466491699
Validation loss = 0.6376661062240601
Validation loss = 0.6352900266647339
Validation loss = 0.6398696899414062
Validation loss = 0.6412414312362671
Validation loss = 0.6419687867164612
Validation loss = 0.6490678787231445
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6359736323356628
Validation loss = 0.6360592246055603
Validation loss = 0.6392998099327087
Validation loss = 0.6515065431594849
Validation loss = 0.6398819088935852
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 655
average number of affinization = 502.43529411764706
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 641
average number of affinization = 504.04651162790697
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 726
average number of affinization = 506.5977011494253
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 615
average number of affinization = 507.82954545454544
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 675
average number of affinization = 509.70786516853934
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 735
average number of affinization = 512.2111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.12e+03 |
| Iteration     | 13        |
| MaximumReturn | -700      |
| MinimumReturn | -1.61e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6371334195137024
Validation loss = 0.6303033232688904
Validation loss = 0.627190351486206
Validation loss = 0.6334534883499146
Validation loss = 0.6352435946464539
Validation loss = 0.632893443107605
Validation loss = 0.6345152258872986
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6240258812904358
Validation loss = 0.6266829371452332
Validation loss = 0.6256147027015686
Validation loss = 0.6295889616012573
Validation loss = 0.6318540573120117
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6267654895782471
Validation loss = 0.6184394359588623
Validation loss = 0.6282225847244263
Validation loss = 0.6312878727912903
Validation loss = 0.6304702162742615
Validation loss = 0.6333041787147522
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.62934809923172
Validation loss = 0.623315155506134
Validation loss = 0.635638415813446
Validation loss = 0.6366048455238342
Validation loss = 0.6400123834609985
Validation loss = 0.6406144499778748
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6315878033638
Validation loss = 0.6266360878944397
Validation loss = 0.6310951709747314
Validation loss = 0.6346014142036438
Validation loss = 0.6397324800491333
Validation loss = 0.639147162437439
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 723
average number of affinization = 514.5274725274726
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 743
average number of affinization = 517.0108695652174
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 702
average number of affinization = 519.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 735
average number of affinization = 521.2978723404256
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 712
average number of affinization = 523.3052631578947
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 716
average number of affinization = 525.3125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -863      |
| Iteration     | 14        |
| MaximumReturn | -316      |
| MinimumReturn | -1.24e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6147653460502625
Validation loss = 0.6039363741874695
Validation loss = 0.6064022183418274
Validation loss = 0.6074426174163818
Validation loss = 0.6123591661453247
Validation loss = 0.6164184212684631
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.601913332939148
Validation loss = 0.5936422348022461
Validation loss = 0.6034276485443115
Validation loss = 0.6024715304374695
Validation loss = 0.6080296039581299
Validation loss = 0.6094545722007751
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6047293543815613
Validation loss = 0.5958792567253113
Validation loss = 0.6041311025619507
Validation loss = 0.6070118546485901
Validation loss = 0.6073120832443237
Validation loss = 0.6080339550971985
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6047800779342651
Validation loss = 0.602303147315979
Validation loss = 0.6047598123550415
Validation loss = 0.6082864999771118
Validation loss = 0.6120449900627136
Validation loss = 0.6103197932243347
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6131638884544373
Validation loss = 0.6067638397216797
Validation loss = 0.6069848537445068
Validation loss = 0.6148719787597656
Validation loss = 0.6186658143997192
Validation loss = 0.6164873838424683
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 746
average number of affinization = 527.5876288659794
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 714
average number of affinization = 529.4897959183673
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 701
average number of affinization = 531.2222222222222
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 743
average number of affinization = 533.34
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 719
average number of affinization = 535.1782178217821
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 707
average number of affinization = 536.8627450980392
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -43.9    |
| Iteration     | 15       |
| MaximumReturn | 344      |
| MinimumReturn | -301     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5903170704841614
Validation loss = 0.5932798981666565
Validation loss = 0.5931785106658936
Validation loss = 0.5968880653381348
Validation loss = 0.5986219644546509
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5974602103233337
Validation loss = 0.582859218120575
Validation loss = 0.59051513671875
Validation loss = 0.5933657288551331
Validation loss = 0.593220591545105
Validation loss = 0.5972505807876587
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5823407173156738
Validation loss = 0.585538387298584
Validation loss = 0.5928733348846436
Validation loss = 0.5945900678634644
Validation loss = 0.5999298691749573
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5982866883277893
Validation loss = 0.5913283228874207
Validation loss = 0.5932772755622864
Validation loss = 0.5970403552055359
Validation loss = 0.597652792930603
Validation loss = 0.6006190180778503
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5966114401817322
Validation loss = 0.5914092659950256
Validation loss = 0.592437744140625
Validation loss = 0.5985435247421265
Validation loss = 0.5991998314857483
Validation loss = 0.6010940670967102
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 743
average number of affinization = 538.8640776699029
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 744
average number of affinization = 540.8365384615385
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 728
average number of affinization = 542.6190476190476
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 747
average number of affinization = 544.5471698113207
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 725
average number of affinization = 546.2336448598131
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 687
average number of affinization = 547.5370370370371
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -225     |
| Iteration     | 16       |
| MaximumReturn | 86.8     |
| MinimumReturn | -433     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5868702530860901
Validation loss = 0.5790696740150452
Validation loss = 0.5880481600761414
Validation loss = 0.5895121693611145
Validation loss = 0.5887259840965271
Validation loss = 0.5892320871353149
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5866380929946899
Validation loss = 0.5734153985977173
Validation loss = 0.5829969048500061
Validation loss = 0.5842580795288086
Validation loss = 0.5863996744155884
Validation loss = 0.5845831632614136
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.578554630279541
Validation loss = 0.5791295170783997
Validation loss = 0.581533670425415
Validation loss = 0.5848476886749268
Validation loss = 0.5888468027114868
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5938578844070435
Validation loss = 0.5826013088226318
Validation loss = 0.5858546495437622
Validation loss = 0.5850894451141357
Validation loss = 0.5877673625946045
Validation loss = 0.5907810926437378
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5884657502174377
Validation loss = 0.5820443034172058
Validation loss = 0.5905025005340576
Validation loss = 0.5914366245269775
Validation loss = 0.5956573486328125
Validation loss = 0.5933848023414612
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 764
average number of affinization = 549.5229357798165
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 747
average number of affinization = 551.3181818181819
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 760
average number of affinization = 553.1981981981982
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 787
average number of affinization = 555.2857142857143
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 775
average number of affinization = 557.2300884955753
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 760
average number of affinization = 559.0087719298245
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -184     |
| Iteration     | 17       |
| MaximumReturn | 206      |
| MinimumReturn | -517     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5781607031822205
Validation loss = 0.5774431228637695
Validation loss = 0.5773453116416931
Validation loss = 0.5782067179679871
Validation loss = 0.5799962282180786
Validation loss = 0.5814195871353149
Validation loss = 0.5850145220756531
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5722851753234863
Validation loss = 0.5673519372940063
Validation loss = 0.5739290118217468
Validation loss = 0.5736629366874695
Validation loss = 0.5835896134376526
Validation loss = 0.5751566290855408
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.575390636920929
Validation loss = 0.5717968344688416
Validation loss = 0.571205735206604
Validation loss = 0.5753079652786255
Validation loss = 0.578301727771759
Validation loss = 0.5768311619758606
Validation loss = 0.5740303993225098
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5745336413383484
Validation loss = 0.5694923996925354
Validation loss = 0.5772731304168701
Validation loss = 0.5782753229141235
Validation loss = 0.5779504179954529
Validation loss = 0.5765690803527832
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5787661671638489
Validation loss = 0.573326587677002
Validation loss = 0.5769177675247192
Validation loss = 0.5806812047958374
Validation loss = 0.5833221077919006
Validation loss = 0.5810254216194153
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 737
average number of affinization = 560.5565217391304
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 744
average number of affinization = 562.1379310344828
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 760
average number of affinization = 563.8290598290598
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 753
average number of affinization = 565.4322033898305
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 725
average number of affinization = 566.7731092436975
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 770
average number of affinization = 568.4666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -189     |
| Iteration     | 18       |
| MaximumReturn | 367      |
| MinimumReturn | -887     |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5707997679710388
Validation loss = 0.5668554306030273
Validation loss = 0.5698297619819641
Validation loss = 0.5732753872871399
Validation loss = 0.5740131139755249
Validation loss = 0.5716215968132019
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5648276209831238
Validation loss = 0.561519980430603
Validation loss = 0.5624640583992004
Validation loss = 0.5684195756912231
Validation loss = 0.5726012587547302
Validation loss = 0.566020667552948
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.568385899066925
Validation loss = 0.5613844990730286
Validation loss = 0.5615742802619934
Validation loss = 0.5711523294448853
Validation loss = 0.5680071115493774
Validation loss = 0.5712302327156067
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5640090107917786
Validation loss = 0.5634622573852539
Validation loss = 0.5680851936340332
Validation loss = 0.5694361925125122
Validation loss = 0.5718496441841125
Validation loss = 0.5719867944717407
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5703359842300415
Validation loss = 0.5633635520935059
Validation loss = 0.5686870813369751
Validation loss = 0.5721614956855774
Validation loss = 0.5740514993667603
Validation loss = 0.5715204477310181
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 732
average number of affinization = 569.8181818181819
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 733
average number of affinization = 571.155737704918
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 776
average number of affinization = 572.8211382113822
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 736
average number of affinization = 574.1370967741935
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 654
average number of affinization = 574.776
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 709
average number of affinization = 575.8412698412699
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -173     |
| Iteration     | 19       |
| MaximumReturn | 360      |
| MinimumReturn | -677     |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.557994544506073
Validation loss = 0.5581462383270264
Validation loss = 0.5602326393127441
Validation loss = 0.5650075078010559
Validation loss = 0.5635119676589966
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5708411335945129
Validation loss = 0.5510485768318176
Validation loss = 0.5581520795822144
Validation loss = 0.5597820281982422
Validation loss = 0.563920795917511
Validation loss = 0.5597137212753296
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5616534948348999
Validation loss = 0.5545924305915833
Validation loss = 0.5584255456924438
Validation loss = 0.5602437257766724
Validation loss = 0.5606752634048462
Validation loss = 0.5611061453819275
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5676966905593872
Validation loss = 0.556708037853241
Validation loss = 0.5623546838760376
Validation loss = 0.5661574602127075
Validation loss = 0.5653347969055176
Validation loss = 0.563748836517334
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5601325035095215
Validation loss = 0.5606167912483215
Validation loss = 0.5621190667152405
Validation loss = 0.5635334849357605
Validation loss = 0.5664700269699097
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 739
average number of affinization = 577.1259842519685
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 745
average number of affinization = 578.4375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 755
average number of affinization = 579.8062015503876
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 729
average number of affinization = 580.9538461538461
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 692
average number of affinization = 581.8015267175573
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 752
average number of affinization = 583.0909090909091
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -399      |
| Iteration     | 20        |
| MaximumReturn | 342       |
| MinimumReturn | -1.11e+03 |
| TotalSamples  | 88000     |
-----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5498375296592712
Validation loss = 0.5511905550956726
Validation loss = 0.5569872260093689
Validation loss = 0.5577952861785889
Validation loss = 0.5558016300201416
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5486641526222229
Validation loss = 0.5456305742263794
Validation loss = 0.5475391745567322
Validation loss = 0.5522791743278503
Validation loss = 0.5534002780914307
Validation loss = 0.5541929006576538
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5529320240020752
Validation loss = 0.5458494424819946
Validation loss = 0.5502502918243408
Validation loss = 0.5524735450744629
Validation loss = 0.5539864897727966
Validation loss = 0.5541514754295349
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5523813962936401
Validation loss = 0.5503588318824768
Validation loss = 0.5509827733039856
Validation loss = 0.5565791130065918
Validation loss = 0.558329164981842
Validation loss = 0.5606979727745056
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5553640723228455
Validation loss = 0.55100017786026
Validation loss = 0.559286892414093
Validation loss = 0.5573081970214844
Validation loss = 0.5585528016090393
Validation loss = 0.5580026507377625
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 766
average number of affinization = 584.4661654135339
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 791
average number of affinization = 586.0074626865671
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 716
average number of affinization = 586.9703703703703
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 772
average number of affinization = 588.3308823529412
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 786
average number of affinization = 589.7737226277372
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 774
average number of affinization = 591.1086956521739
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -837      |
| Iteration     | 21        |
| MaximumReturn | -299      |
| MinimumReturn | -1.61e+03 |
| TotalSamples  | 92000     |
-----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5515684485435486
Validation loss = 0.5464956164360046
Validation loss = 0.5472089648246765
Validation loss = 0.5531829595565796
Validation loss = 0.5508834719657898
Validation loss = 0.5516634583473206
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.545462965965271
Validation loss = 0.5428752303123474
Validation loss = 0.5478647351264954
Validation loss = 0.5485736131668091
Validation loss = 0.5473229289054871
Validation loss = 0.5481513738632202
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5455091595649719
Validation loss = 0.5431666374206543
Validation loss = 0.5470942854881287
Validation loss = 0.5451465845108032
Validation loss = 0.5483905076980591
Validation loss = 0.5461783409118652
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5476837754249573
Validation loss = 0.5453765392303467
Validation loss = 0.5470483899116516
Validation loss = 0.5513139963150024
Validation loss = 0.5494147539138794
Validation loss = 0.5523562431335449
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5531694293022156
Validation loss = 0.5456728935241699
Validation loss = 0.5505321025848389
Validation loss = 0.5522903203964233
Validation loss = 0.5510140061378479
Validation loss = 0.5523185729980469
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 773
average number of affinization = 592.4172661870504
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 750
average number of affinization = 593.5428571428571
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 782
average number of affinization = 594.8794326241135
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 782
average number of affinization = 596.1971830985915
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 800
average number of affinization = 597.6223776223776
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 629
average number of affinization = 597.8402777777778
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -888      |
| Iteration     | 22        |
| MaximumReturn | -48.7     |
| MinimumReturn | -1.96e+03 |
| TotalSamples  | 96000     |
-----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5513067245483398
Validation loss = 0.5439832210540771
Validation loss = 0.5486354231834412
Validation loss = 0.54793781042099
Validation loss = 0.5487723350524902
Validation loss = 0.5515076518058777
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5455706119537354
Validation loss = 0.5372520089149475
Validation loss = 0.5427106022834778
Validation loss = 0.5448451638221741
Validation loss = 0.5445119738578796
Validation loss = 0.5449267029762268
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5416355729103088
Validation loss = 0.5370466113090515
Validation loss = 0.5410155653953552
Validation loss = 0.5438807606697083
Validation loss = 0.5422971844673157
Validation loss = 0.5444729924201965
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5419177412986755
Validation loss = 0.5416784882545471
Validation loss = 0.5439913868904114
Validation loss = 0.5440999269485474
Validation loss = 0.5496705174446106
Validation loss = 0.5501800775527954
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5443845987319946
Validation loss = 0.5434704422950745
Validation loss = 0.5486310124397278
Validation loss = 0.5496465563774109
Validation loss = 0.550804853439331
Validation loss = 0.5491234064102173
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 750
average number of affinization = 598.8896551724138
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 725
average number of affinization = 599.7534246575342
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 807
average number of affinization = 601.1632653061224
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 796
average number of affinization = 602.4797297297297
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 709
average number of affinization = 603.1946308724832
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 712
average number of affinization = 603.92
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.16e+03 |
| Iteration     | 23        |
| MaximumReturn | -831      |
| MinimumReturn | -1.68e+03 |
| TotalSamples  | 100000    |
-----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5455396771430969
Validation loss = 0.5424146056175232
Validation loss = 0.5434731841087341
Validation loss = 0.5461083650588989
Validation loss = 0.5485526919364929
Validation loss = 0.5461114048957825
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5415047407150269
Validation loss = 0.5346240401268005
Validation loss = 0.5402788519859314
Validation loss = 0.5404025912284851
Validation loss = 0.5433902144432068
Validation loss = 0.5407543778419495
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5411701798439026
Validation loss = 0.5362251400947571
Validation loss = 0.5386455655097961
Validation loss = 0.5451371669769287
Validation loss = 0.5415788888931274
Validation loss = 0.5448094010353088
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5389431715011597
Validation loss = 0.5428283214569092
Validation loss = 0.5426486730575562
Validation loss = 0.5439433455467224
Validation loss = 0.5473247766494751
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5439563393592834
Validation loss = 0.5397942662239075
Validation loss = 0.5463922619819641
Validation loss = 0.5483071804046631
Validation loss = 0.5480809211730957
Validation loss = 0.5464749336242676
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 640
average number of affinization = 604.158940397351
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 787
average number of affinization = 605.3618421052631
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 810
average number of affinization = 606.6993464052288
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 811
average number of affinization = 608.025974025974
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 805
average number of affinization = 609.2967741935483
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 759
average number of affinization = 610.2564102564103
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.02e+03 |
| Iteration     | 24        |
| MaximumReturn | 7.07      |
| MinimumReturn | -1.87e+03 |
| TotalSamples  | 104000    |
-----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5397642850875854
Validation loss = 0.5381214618682861
Validation loss = 0.5425999760627747
Validation loss = 0.5454708933830261
Validation loss = 0.5438547134399414
Validation loss = 0.5460352301597595
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5368008017539978
Validation loss = 0.5360056161880493
Validation loss = 0.5404632687568665
Validation loss = 0.5375163555145264
Validation loss = 0.5417487025260925
Validation loss = 0.5412026643753052
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5377401113510132
Validation loss = 0.5362317562103271
Validation loss = 0.5364526510238647
Validation loss = 0.5411157011985779
Validation loss = 0.539885401725769
Validation loss = 0.5426589250564575
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5405345559120178
Validation loss = 0.5367165207862854
Validation loss = 0.5408121943473816
Validation loss = 0.542009711265564
Validation loss = 0.5423414707183838
Validation loss = 0.5425426959991455
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5421950817108154
Validation loss = 0.5405220985412598
Validation loss = 0.5437823534011841
Validation loss = 0.5431577563285828
Validation loss = 0.5409623980522156
Validation loss = 0.5439648628234863
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 768
average number of affinization = 611.2611464968153
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 773
average number of affinization = 612.2848101265823
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 787
average number of affinization = 613.3836477987421
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 742
average number of affinization = 614.1875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 792
average number of affinization = 615.2919254658385
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 779
average number of affinization = 616.3024691358024
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -257     |
| Iteration     | 25       |
| MaximumReturn | 72.1     |
| MinimumReturn | -717     |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5353453755378723
Validation loss = 0.5343518257141113
Validation loss = 0.5362210273742676
Validation loss = 0.541434645652771
Validation loss = 0.5423557758331299
Validation loss = 0.5407763719558716
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.535348653793335
Validation loss = 0.5318006873130798
Validation loss = 0.5359075665473938
Validation loss = 0.5382062196731567
Validation loss = 0.5362181663513184
Validation loss = 0.5367848873138428
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5308911204338074
Validation loss = 0.5317513346672058
Validation loss = 0.5346030592918396
Validation loss = 0.5351186394691467
Validation loss = 0.5346931219100952
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5448284149169922
Validation loss = 0.5332942008972168
Validation loss = 0.5372721552848816
Validation loss = 0.5387841463088989
Validation loss = 0.5422696471214294
Validation loss = 0.5406666398048401
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5396697521209717
Validation loss = 0.5347235202789307
Validation loss = 0.5423941016197205
Validation loss = 0.5380263328552246
Validation loss = 0.5380657315254211
Validation loss = 0.5408490300178528
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 783
average number of affinization = 617.3251533742331
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 772
average number of affinization = 618.2682926829268
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 762
average number of affinization = 619.1393939393939
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 794
average number of affinization = 620.1927710843373
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 785
average number of affinization = 621.1796407185628
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 741
average number of affinization = 621.8928571428571
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -378     |
| Iteration     | 26       |
| MaximumReturn | -126     |
| MinimumReturn | -662     |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5370718240737915
Validation loss = 0.5349439382553101
Validation loss = 0.5353804230690002
Validation loss = 0.5349608659744263
Validation loss = 0.5371443629264832
Validation loss = 0.5361490249633789
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5356763601303101
Validation loss = 0.5301181077957153
Validation loss = 0.531985342502594
Validation loss = 0.5338712334632874
Validation loss = 0.5338707566261292
Validation loss = 0.5362676978111267
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5292614698410034
Validation loss = 0.5283924341201782
Validation loss = 0.5301808714866638
Validation loss = 0.5297971963882446
Validation loss = 0.5337268710136414
Validation loss = 0.5315402150154114
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5332845449447632
Validation loss = 0.5332677960395813
Validation loss = 0.5366029143333435
Validation loss = 0.5375538468360901
Validation loss = 0.5352269411087036
Validation loss = 0.5358373522758484
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5334383845329285
Validation loss = 0.5307735204696655
Validation loss = 0.5347150564193726
Validation loss = 0.5392781496047974
Validation loss = 0.5344213843345642
Validation loss = 0.5385230779647827
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 809
average number of affinization = 623.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 824
average number of affinization = 624.1823529411764
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 793
average number of affinization = 625.1695906432749
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 809
average number of affinization = 626.2383720930233
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 783
average number of affinization = 627.1445086705203
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 803
average number of affinization = 628.1551724137931
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -695      |
| Iteration     | 27        |
| MaximumReturn | 10.9      |
| MinimumReturn | -2.04e+03 |
| TotalSamples  | 116000    |
-----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5321571230888367
Validation loss = 0.5298087000846863
Validation loss = 0.5334743857383728
Validation loss = 0.5364075303077698
Validation loss = 0.5337375402450562
Validation loss = 0.5345625877380371
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.533149778842926
Validation loss = 0.5262231230735779
Validation loss = 0.5322480797767639
Validation loss = 0.533053457736969
Validation loss = 0.5336918234825134
Validation loss = 0.5326645374298096
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5291060209274292
Validation loss = 0.5274778604507446
Validation loss = 0.5289909243583679
Validation loss = 0.5327274799346924
Validation loss = 0.5318884253501892
Validation loss = 0.5325701832771301
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5312913656234741
Validation loss = 0.5300328731536865
Validation loss = 0.5323696732521057
Validation loss = 0.533534586429596
Validation loss = 0.5348348617553711
Validation loss = 0.5337671041488647
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5338389873504639
Validation loss = 0.5311121344566345
Validation loss = 0.5363567471504211
Validation loss = 0.535947322845459
Validation loss = 0.536358654499054
Validation loss = 0.5358981490135193
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 796
average number of affinization = 629.1142857142858
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 768
average number of affinization = 629.9034090909091
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 762
average number of affinization = 630.6497175141243
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 805
average number of affinization = 631.629213483146
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 801
average number of affinization = 632.5754189944134
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 781
average number of affinization = 633.4
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -557      |
| Iteration     | 28        |
| MaximumReturn | -164      |
| MinimumReturn | -1.15e+03 |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5333914160728455
Validation loss = 0.5273361206054688
Validation loss = 0.5297560095787048
Validation loss = 0.530884325504303
Validation loss = 0.5311743021011353
Validation loss = 0.5356101989746094
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5280435681343079
Validation loss = 0.5243236422538757
Validation loss = 0.5285236239433289
Validation loss = 0.5308834314346313
Validation loss = 0.5287049412727356
Validation loss = 0.529168426990509
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5255624651908875
Validation loss = 0.525225818157196
Validation loss = 0.5298267006874084
Validation loss = 0.5303240418434143
Validation loss = 0.5308204889297485
Validation loss = 0.5295391082763672
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5256488919258118
Validation loss = 0.5281420350074768
Validation loss = 0.527897298336029
Validation loss = 0.5327498912811279
Validation loss = 0.5323224067687988
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5343717932701111
Validation loss = 0.5279636979103088
Validation loss = 0.5310654640197754
Validation loss = 0.5338493585586548
Validation loss = 0.532350480556488
Validation loss = 0.5350541472434998
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 795
average number of affinization = 634.292817679558
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 740
average number of affinization = 634.8736263736264
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 776
average number of affinization = 635.6448087431694
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 775
average number of affinization = 636.4021739130435
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 788
average number of affinization = 637.2216216216216
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 546
average number of affinization = 636.7311827956989
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -749      |
| Iteration     | 29        |
| MaximumReturn | -145      |
| MinimumReturn | -1.81e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5254989862442017
Validation loss = 0.5233837366104126
Validation loss = 0.5277532935142517
Validation loss = 0.5295543670654297
Validation loss = 0.5292333364486694
Validation loss = 0.52951580286026
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5223888158798218
Validation loss = 0.521344780921936
Validation loss = 0.5236700177192688
Validation loss = 0.5265187621116638
Validation loss = 0.5250813364982605
Validation loss = 0.5263158082962036
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5218174457550049
Validation loss = 0.5238503217697144
Validation loss = 0.5257787108421326
Validation loss = 0.5261503458023071
Validation loss = 0.5277217626571655
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5266386866569519
Validation loss = 0.5243951678276062
Validation loss = 0.526877760887146
Validation loss = 0.5284879207611084
Validation loss = 0.5284837484359741
Validation loss = 0.5284258723258972
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5254897475242615
Validation loss = 0.5239754319190979
Validation loss = 0.5266789197921753
Validation loss = 0.5275729298591614
Validation loss = 0.5293018221855164
Validation loss = 0.5298216938972473
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 811
average number of affinization = 637.663101604278
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 762
average number of affinization = 638.3244680851063
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 748
average number of affinization = 638.9047619047619
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 742
average number of affinization = 639.4473684210526
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 778
average number of affinization = 640.17277486911
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 742
average number of affinization = 640.703125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -707      |
| Iteration     | 30        |
| MaximumReturn | -188      |
| MinimumReturn | -1.49e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.52251136302948
Validation loss = 0.5221450328826904
Validation loss = 0.525983452796936
Validation loss = 0.5256493091583252
Validation loss = 0.5275373458862305
Validation loss = 0.5275493860244751
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5196341276168823
Validation loss = 0.5183918476104736
Validation loss = 0.5211384892463684
Validation loss = 0.5244238376617432
Validation loss = 0.5249086022377014
Validation loss = 0.5238136649131775
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5199348330497742
Validation loss = 0.5215989351272583
Validation loss = 0.524215817451477
Validation loss = 0.5260269641876221
Validation loss = 0.5241833329200745
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5179975032806396
Validation loss = 0.5222073793411255
Validation loss = 0.5226459503173828
Validation loss = 0.5257576704025269
Validation loss = 0.5256375074386597
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5202044248580933
Validation loss = 0.52204829454422
Validation loss = 0.5251598954200745
Validation loss = 0.5270001888275146
Validation loss = 0.525901734828949
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 807
average number of affinization = 641.5647668393782
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 745
average number of affinization = 642.0979381443299
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 788
average number of affinization = 642.8461538461538
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 809
average number of affinization = 643.6938775510204
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 802
average number of affinization = 644.4974619289341
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 789
average number of affinization = 645.2272727272727
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -871      |
| Iteration     | 31        |
| MaximumReturn | 229       |
| MinimumReturn | -1.89e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5224697589874268
Validation loss = 0.5196790099143982
Validation loss = 0.5228768587112427
Validation loss = 0.5247501730918884
Validation loss = 0.5261877179145813
Validation loss = 0.5238752365112305
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5188999176025391
Validation loss = 0.516470193862915
Validation loss = 0.5189681649208069
Validation loss = 0.5191182494163513
Validation loss = 0.522352397441864
Validation loss = 0.5205138325691223
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5227336287498474
Validation loss = 0.5200884342193604
Validation loss = 0.5232818722724915
Validation loss = 0.5230968594551086
Validation loss = 0.5231702923774719
Validation loss = 0.5229063034057617
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5216968059539795
Validation loss = 0.5194005370140076
Validation loss = 0.5233007073402405
Validation loss = 0.5221740007400513
Validation loss = 0.5229716300964355
Validation loss = 0.525449275970459
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5226284861564636
Validation loss = 0.520537793636322
Validation loss = 0.5226527452468872
Validation loss = 0.5257431864738464
Validation loss = 0.5251980423927307
Validation loss = 0.5243488550186157
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 795
average number of affinization = 645.9798994974874
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 806
average number of affinization = 646.78
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 768
average number of affinization = 647.3830845771145
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 789
average number of affinization = 648.0841584158416
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 717
average number of affinization = 648.423645320197
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 764
average number of affinization = 648.9901960784314
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -931      |
| Iteration     | 32        |
| MaximumReturn | -408      |
| MinimumReturn | -1.74e+03 |
| TotalSamples  | 136000    |
-----------------------------
