Logging to experiments/hopper/nov1/w350e3_seed1234
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.8146185874938965
Validation loss = 0.7245424389839172
Validation loss = 0.700682520866394
Validation loss = 0.7569547891616821
Validation loss = 0.7785621881484985
Validation loss = 0.8330814242362976
Validation loss = 0.9219412207603455
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.89227294921875
Validation loss = 0.7276986837387085
Validation loss = 0.7018197178840637
Validation loss = 0.7548730969429016
Validation loss = 0.7910068035125732
Validation loss = 0.8353781700134277
Validation loss = 1.01863431930542
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7878282070159912
Validation loss = 0.7207399010658264
Validation loss = 0.7080780863761902
Validation loss = 0.7225664854049683
Validation loss = 0.7519243955612183
Validation loss = 0.7853787541389465
Validation loss = 0.8443859815597534
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.8693312406539917
Validation loss = 0.7199414968490601
Validation loss = 0.7059279680252075
Validation loss = 0.7405520677566528
Validation loss = 0.7851951122283936
Validation loss = 0.8641207218170166
Validation loss = 0.9602630734443665
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.9463687539100647
Validation loss = 0.7236467599868774
Validation loss = 0.7102161049842834
Validation loss = 0.7325180768966675
Validation loss = 0.7734001874923706
Validation loss = 0.8332256078720093
Validation loss = 0.9366849064826965
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 440
average number of affinization = 62.857142857142854
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 407
average number of affinization = 105.875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 138.55555555555554
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 382
average number of affinization = 162.9
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 402
average number of affinization = 184.63636363636363
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 388
average number of affinization = 201.58333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.58e+03 |
| Iteration     | 0         |
| MaximumReturn | -2.52e+03 |
| MinimumReturn | -2.65e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7336735129356384
Validation loss = 0.7236121892929077
Validation loss = 0.7351362705230713
Validation loss = 0.7756587266921997
Validation loss = 0.8163411021232605
Validation loss = 0.8346118330955505
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7269763350486755
Validation loss = 0.6842150092124939
Validation loss = 0.7088583707809448
Validation loss = 0.7546641230583191
Validation loss = 0.7469100952148438
Validation loss = 0.815691351890564
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7248324155807495
Validation loss = 0.7150009274482727
Validation loss = 0.7594505548477173
Validation loss = 0.7553234100341797
Validation loss = 0.7903320789337158
Validation loss = 0.811152994632721
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7311266660690308
Validation loss = 0.7286885380744934
Validation loss = 0.7491811513900757
Validation loss = 0.7668546438217163
Validation loss = 0.7854080200195312
Validation loss = 0.8891603946685791
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7173295021057129
Validation loss = 0.7326787114143372
Validation loss = 0.7253867387771606
Validation loss = 0.7329135537147522
Validation loss = 0.7807012796401978
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 525
average number of affinization = 226.46153846153845
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 534
average number of affinization = 248.42857142857142
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 519
average number of affinization = 266.46666666666664
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 550
average number of affinization = 284.1875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 519
average number of affinization = 298.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 552
average number of affinization = 312.1111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.32e+03 |
| Iteration     | 1         |
| MaximumReturn | -2.19e+03 |
| MinimumReturn | -2.43e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7058960795402527
Validation loss = 0.7406263947486877
Validation loss = 0.7789945602416992
Validation loss = 0.8023756146430969
Validation loss = 0.8523495197296143
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6858010292053223
Validation loss = 0.7292242646217346
Validation loss = 0.7417128682136536
Validation loss = 0.7633827328681946
Validation loss = 0.8193585872650146
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6795654296875
Validation loss = 0.734126091003418
Validation loss = 0.743804395198822
Validation loss = 0.7753520011901855
Validation loss = 0.8112285733222961
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6993368268013
Validation loss = 0.7457528114318848
Validation loss = 0.7691319584846497
Validation loss = 0.7653904557228088
Validation loss = 0.8094034790992737
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6943163871765137
Validation loss = 0.7101044058799744
Validation loss = 0.7312402725219727
Validation loss = 0.7710578441619873
Validation loss = 0.7713567614555359
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 698
average number of affinization = 332.42105263157896
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 728
average number of affinization = 352.2
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 712
average number of affinization = 369.3333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 709
average number of affinization = 384.77272727272725
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 733
average number of affinization = 399.9130434782609
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 709
average number of affinization = 412.7916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.38e+03 |
| Iteration     | 2         |
| MaximumReturn | -2.22e+03 |
| MinimumReturn | -2.45e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7289249897003174
Validation loss = 0.7537925243377686
Validation loss = 0.769045352935791
Validation loss = 0.7911285161972046
Validation loss = 0.8049026131629944
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6790478229522705
Validation loss = 0.7616568207740784
Validation loss = 0.7603470087051392
Validation loss = 0.7700619101524353
Validation loss = 0.7869343757629395
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6746267080307007
Validation loss = 0.7531577348709106
Validation loss = 0.7839614152908325
Validation loss = 0.7927483320236206
Validation loss = 0.8108224868774414
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6924371123313904
Validation loss = 0.7298452854156494
Validation loss = 0.7657329440116882
Validation loss = 0.7907621264457703
Validation loss = 0.7788445949554443
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6865905523300171
Validation loss = 0.7342592477798462
Validation loss = 0.7618705034255981
Validation loss = 0.7888580560684204
Validation loss = 0.8082180023193359
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 761
average number of affinization = 426.72
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 770
average number of affinization = 439.9230769230769
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 783
average number of affinization = 452.6296296296296
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 739
average number of affinization = 462.85714285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 777
average number of affinization = 473.6896551724138
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 785
average number of affinization = 484.06666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.53e+03 |
| Iteration     | 3         |
| MaximumReturn | -2.43e+03 |
| MinimumReturn | -2.59e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7191766500473022
Validation loss = 0.7475444078445435
Validation loss = 0.7615841627120972
Validation loss = 0.7895992994308472
Validation loss = 0.7961102724075317
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6921795010566711
Validation loss = 0.7450774908065796
Validation loss = 0.7497018575668335
Validation loss = 0.7649705410003662
Validation loss = 0.7885202169418335
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7264315485954285
Validation loss = 0.7506173849105835
Validation loss = 0.7605741024017334
Validation loss = 0.7788481712341309
Validation loss = 0.7925021648406982
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6975981593132019
Validation loss = 0.7421943545341492
Validation loss = 0.755947470664978
Validation loss = 0.7719208002090454
Validation loss = 0.7719266414642334
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6935948133468628
Validation loss = 0.7331846952438354
Validation loss = 0.7514861822128296
Validation loss = 0.7682713270187378
Validation loss = 0.7822429537773132
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 508
average number of affinization = 484.83870967741933
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 484.4375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 505
average number of affinization = 485.06060606060606
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 505
average number of affinization = 485.6470588235294
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 449
average number of affinization = 484.6
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 500
average number of affinization = 485.02777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.12e+03 |
| Iteration     | 4         |
| MaximumReturn | -1.82e+03 |
| MinimumReturn | -2.72e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6981234550476074
Validation loss = 0.7344772815704346
Validation loss = 0.7367022633552551
Validation loss = 0.7632583975791931
Validation loss = 0.7683860659599304
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7008100152015686
Validation loss = 0.725886344909668
Validation loss = 0.7357267737388611
Validation loss = 0.7306759357452393
Validation loss = 0.7527542114257812
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7185439467430115
Validation loss = 0.7622568011283875
Validation loss = 0.7568259239196777
Validation loss = 0.7702887058258057
Validation loss = 0.7793285250663757
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.698191225528717
Validation loss = 0.7432689070701599
Validation loss = 0.7427436709403992
Validation loss = 0.7407487034797668
Validation loss = 0.7553904056549072
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7208450436592102
Validation loss = 0.7306705117225647
Validation loss = 0.7365627288818359
Validation loss = 0.7482885718345642
Validation loss = 0.7661828994750977
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 478
average number of affinization = 484.8378378378378
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 461
average number of affinization = 484.2105263157895
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 506
average number of affinization = 484.7692307692308
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 515
average number of affinization = 485.525
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 462
average number of affinization = 484.9512195121951
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 435
average number of affinization = 483.76190476190476
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.7e+03  |
| Iteration     | 5         |
| MaximumReturn | -2.58e+03 |
| MinimumReturn | -2.88e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6752768158912659
Validation loss = 0.7151821255683899
Validation loss = 0.7319921851158142
Validation loss = 0.7357028722763062
Validation loss = 0.7580384612083435
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6817554831504822
Validation loss = 0.7145814895629883
Validation loss = 0.7177854776382446
Validation loss = 0.7426635026931763
Validation loss = 0.7488473057746887
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6720343828201294
Validation loss = 0.7175534963607788
Validation loss = 0.7318430542945862
Validation loss = 0.7554548382759094
Validation loss = 0.7603505849838257
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6764559745788574
Validation loss = 0.707349419593811
Validation loss = 0.7212573289871216
Validation loss = 0.7430576682090759
Validation loss = 0.754976212978363
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6740091443061829
Validation loss = 0.7192395925521851
Validation loss = 0.7188453674316406
Validation loss = 0.7484925389289856
Validation loss = 0.7547182440757751
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 462
average number of affinization = 483.25581395348837
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 454
average number of affinization = 482.59090909090907
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 420
average number of affinization = 481.2
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 378
average number of affinization = 478.95652173913044
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 411
average number of affinization = 477.51063829787233
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 516
average number of affinization = 478.3125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.46e+03 |
| Iteration     | 6         |
| MaximumReturn | -810      |
| MinimumReturn | -2.64e+03 |
| TotalSamples  | 32000     |
-----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7227598428726196
Validation loss = 0.7465167045593262
Validation loss = 0.7773762941360474
Validation loss = 0.7819862961769104
Validation loss = 0.7976045608520508
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7393567562103271
Validation loss = 0.7425673007965088
Validation loss = 0.7744770050048828
Validation loss = 0.7807631492614746
Validation loss = 0.7976204752922058
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7097073793411255
Validation loss = 0.7549502849578857
Validation loss = 0.7669764757156372
Validation loss = 0.7866555452346802
Validation loss = 0.8023509979248047
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7327603101730347
Validation loss = 0.7419258952140808
Validation loss = 0.7469940185546875
Validation loss = 0.7781160473823547
Validation loss = 0.8014332056045532
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6969907879829407
Validation loss = 0.7336272597312927
Validation loss = 0.7476586103439331
Validation loss = 0.7670148611068726
Validation loss = 0.7783521413803101
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 444
average number of affinization = 477.61224489795916
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 477.26
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 478
average number of affinization = 477.27450980392155
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 477.15384615384613
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 322
average number of affinization = 474.22641509433964
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 484
average number of affinization = 474.4074074074074
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.35e+03 |
| Iteration     | 7         |
| MaximumReturn | -1.01e+03 |
| MinimumReturn | -2.48e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6777122020721436
Validation loss = 0.7062014937400818
Validation loss = 0.7274038791656494
Validation loss = 0.7346379160881042
Validation loss = 0.7490849494934082
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.67917799949646
Validation loss = 0.7107879519462585
Validation loss = 0.7169751524925232
Validation loss = 0.7277934551239014
Validation loss = 0.7376481294631958
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6975427269935608
Validation loss = 0.6997897028923035
Validation loss = 0.7210977673530579
Validation loss = 0.7441611289978027
Validation loss = 0.7572213411331177
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6761128902435303
Validation loss = 0.6929068565368652
Validation loss = 0.7243820428848267
Validation loss = 0.742462694644928
Validation loss = 0.7494712471961975
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.669197142124176
Validation loss = 0.6950490474700928
Validation loss = 0.7201840877532959
Validation loss = 0.7292868494987488
Validation loss = 0.7400017976760864
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 496
average number of affinization = 474.8
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 473.94642857142856
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 552
average number of affinization = 475.3157894736842
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 507
average number of affinization = 475.86206896551727
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 493
average number of affinization = 476.1525423728813
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 412
average number of affinization = 475.0833333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -808      |
| Iteration     | 8         |
| MaximumReturn | -542      |
| MinimumReturn | -1.51e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6951526999473572
Validation loss = 0.7284124493598938
Validation loss = 0.7329341769218445
Validation loss = 0.7498825192451477
Validation loss = 0.7737194895744324
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.710157036781311
Validation loss = 0.7186915874481201
Validation loss = 0.7291290163993835
Validation loss = 0.7567348480224609
Validation loss = 0.7480862736701965
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6916557550430298
Validation loss = 0.719089686870575
Validation loss = 0.7202065587043762
Validation loss = 0.743586003780365
Validation loss = 0.7514075636863708
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7103147506713867
Validation loss = 0.7282313108444214
Validation loss = 0.7382830381393433
Validation loss = 0.7647134065628052
Validation loss = 0.7748133540153503
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6943594813346863
Validation loss = 0.7220767736434937
Validation loss = 0.7341211438179016
Validation loss = 0.7404137849807739
Validation loss = 0.7516671419143677
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 376
average number of affinization = 473.4590163934426
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 471.6774193548387
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 471.22222222222223
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 527
average number of affinization = 472.09375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 509
average number of affinization = 472.66153846153844
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 506
average number of affinization = 473.1666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.31e+03 |
| Iteration     | 9         |
| MaximumReturn | -820      |
| MinimumReturn | -2.23e+03 |
| TotalSamples  | 44000     |
-----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6674205660820007
Validation loss = 0.711205244064331
Validation loss = 0.711668074131012
Validation loss = 0.7245280742645264
Validation loss = 0.7302549481391907
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6707603931427002
Validation loss = 0.7013038992881775
Validation loss = 0.7163956165313721
Validation loss = 0.7184197902679443
Validation loss = 0.7244458794593811
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6758368611335754
Validation loss = 0.7038319110870361
Validation loss = 0.7151788473129272
Validation loss = 0.7355002760887146
Validation loss = 0.7360813021659851
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6818543076515198
Validation loss = 0.7032971978187561
Validation loss = 0.7276508808135986
Validation loss = 0.731450080871582
Validation loss = 0.7385156154632568
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6679736971855164
Validation loss = 0.7026731967926025
Validation loss = 0.7129498720169067
Validation loss = 0.7202097773551941
Validation loss = 0.7281885147094727
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 472.4776119402985
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 542
average number of affinization = 473.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 354
average number of affinization = 471.768115942029
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 471.04285714285714
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 522
average number of affinization = 471.76056338028167
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 442
average number of affinization = 471.34722222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.34e+03 |
| Iteration     | 10        |
| MaximumReturn | -775      |
| MinimumReturn | -2.04e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6858169436454773
Validation loss = 0.7048152089118958
Validation loss = 0.7116739153862
Validation loss = 0.724721372127533
Validation loss = 0.730177104473114
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6955863833427429
Validation loss = 0.7079266905784607
Validation loss = 0.7147674560546875
Validation loss = 0.7189741134643555
Validation loss = 0.7283406257629395
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6917076110839844
Validation loss = 0.7115947604179382
Validation loss = 0.7139900326728821
Validation loss = 0.7304412722587585
Validation loss = 0.7299439907073975
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7087300419807434
Validation loss = 0.7146008014678955
Validation loss = 0.7109603881835938
Validation loss = 0.721324622631073
Validation loss = 0.72295743227005
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6954228281974792
Validation loss = 0.6972754001617432
Validation loss = 0.7070229649543762
Validation loss = 0.716932475566864
Validation loss = 0.7223146557807922
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 547
average number of affinization = 472.3835616438356
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 569
average number of affinization = 473.68918918918916
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 545
average number of affinization = 474.64
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 514
average number of affinization = 475.1578947368421
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 575
average number of affinization = 476.45454545454544
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 419
average number of affinization = 475.71794871794873
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -954      |
| Iteration     | 11        |
| MaximumReturn | -729      |
| MinimumReturn | -1.17e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6879538297653198
Validation loss = 0.7019758820533752
Validation loss = 0.6989520788192749
Validation loss = 0.7026189565658569
Validation loss = 0.7190484404563904
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6891645789146423
Validation loss = 0.6881190538406372
Validation loss = 0.6995366215705872
Validation loss = 0.7065473794937134
Validation loss = 0.7117923498153687
Validation loss = 0.7109666466712952
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6969339847564697
Validation loss = 0.6945695877075195
Validation loss = 0.707647979259491
Validation loss = 0.7105512619018555
Validation loss = 0.714599609375
Validation loss = 0.7162176966667175
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6938642859458923
Validation loss = 0.7038158774375916
Validation loss = 0.7099000215530396
Validation loss = 0.7130504846572876
Validation loss = 0.7130597829818726
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6805948615074158
Validation loss = 0.6883043646812439
Validation loss = 0.6996482014656067
Validation loss = 0.7012714147567749
Validation loss = 0.7052292227745056
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 481
average number of affinization = 475.7848101265823
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 476.4875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 526
average number of affinization = 477.0987654320988
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 578
average number of affinization = 478.3292682926829
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 504
average number of affinization = 478.6385542168675
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 516
average number of affinization = 479.0833333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.14e+03 |
| Iteration     | 12        |
| MaximumReturn | -431      |
| MinimumReturn | -1.91e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6823719143867493
Validation loss = 0.6939006447792053
Validation loss = 0.6982731819152832
Validation loss = 0.700578510761261
Validation loss = 0.7061373591423035
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6821717619895935
Validation loss = 0.6887392401695251
Validation loss = 0.6952325105667114
Validation loss = 0.6979871988296509
Validation loss = 0.7016437649726868
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6781591176986694
Validation loss = 0.7024999260902405
Validation loss = 0.7061691284179688
Validation loss = 0.705001950263977
Validation loss = 0.7098216414451599
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6863625645637512
Validation loss = 0.6997506022453308
Validation loss = 0.7039574980735779
Validation loss = 0.7128292918205261
Validation loss = 0.7181025147438049
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6836671233177185
Validation loss = 0.6862278580665588
Validation loss = 0.6967538595199585
Validation loss = 0.702660858631134
Validation loss = 0.7047349214553833
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 478.0235294117647
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 585
average number of affinization = 479.2674418604651
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 475
average number of affinization = 479.2183908045977
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 559
average number of affinization = 480.125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 555
average number of affinization = 480.96629213483146
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 577
average number of affinization = 482.03333333333336
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.21e+03 |
| Iteration     | 13        |
| MaximumReturn | -733      |
| MinimumReturn | -1.67e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6798921823501587
Validation loss = 0.6787577271461487
Validation loss = 0.69273841381073
Validation loss = 0.706739068031311
Validation loss = 0.7008212208747864
Validation loss = 0.7050248384475708
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6754568219184875
Validation loss = 0.685653030872345
Validation loss = 0.6921596527099609
Validation loss = 0.6990175247192383
Validation loss = 0.7005687355995178
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6917171478271484
Validation loss = 0.69661545753479
Validation loss = 0.7064802646636963
Validation loss = 0.7032946348190308
Validation loss = 0.7154218554496765
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6815946698188782
Validation loss = 0.690016508102417
Validation loss = 0.6988971829414368
Validation loss = 0.7068674564361572
Validation loss = 0.7151384949684143
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6859062314033508
Validation loss = 0.6808173656463623
Validation loss = 0.6970415115356445
Validation loss = 0.7028271555900574
Validation loss = 0.7049360871315002
Validation loss = 0.7092739343643188
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 505
average number of affinization = 482.2857142857143
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 367
average number of affinization = 481.0326086956522
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 319
average number of affinization = 479.2903225806452
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 558
average number of affinization = 480.1276595744681
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 451
average number of affinization = 479.82105263157894
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 480.3645833333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.19e+03 |
| Iteration     | 14        |
| MaximumReturn | -694      |
| MinimumReturn | -2.06e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6752091646194458
Validation loss = 0.6759166121482849
Validation loss = 0.6845189332962036
Validation loss = 0.6890624165534973
Validation loss = 0.691187858581543
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6617845892906189
Validation loss = 0.6687272787094116
Validation loss = 0.6809454560279846
Validation loss = 0.685907781124115
Validation loss = 0.6902329921722412
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.680418074131012
Validation loss = 0.6854156255722046
Validation loss = 0.689698338508606
Validation loss = 0.6898671388626099
Validation loss = 0.6986114978790283
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6761094331741333
Validation loss = 0.6830191016197205
Validation loss = 0.6871169805526733
Validation loss = 0.6905554533004761
Validation loss = 0.6962436437606812
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6684999465942383
Validation loss = 0.6723437309265137
Validation loss = 0.6728127598762512
Validation loss = 0.684147834777832
Validation loss = 0.6797745823860168
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 555
average number of affinization = 481.1340206185567
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 609
average number of affinization = 482.4387755102041
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 437
average number of affinization = 481.979797979798
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 447
average number of affinization = 481.63
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 486
average number of affinization = 481.6732673267327
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 592
average number of affinization = 482.7549019607843
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.45e+03 |
| Iteration     | 15        |
| MaximumReturn | -290      |
| MinimumReturn | -2.7e+03  |
| TotalSamples  | 68000     |
-----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6763061881065369
Validation loss = 0.672558069229126
Validation loss = 0.6762232184410095
Validation loss = 0.6845807433128357
Validation loss = 0.6872904896736145
Validation loss = 0.6904962658882141
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6672565937042236
Validation loss = 0.6676082015037537
Validation loss = 0.6776747107505798
Validation loss = 0.6777771711349487
Validation loss = 0.6846318244934082
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6720614433288574
Validation loss = 0.6847700476646423
Validation loss = 0.689174234867096
Validation loss = 0.6894774436950684
Validation loss = 0.6893277764320374
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6753398180007935
Validation loss = 0.6757974028587341
Validation loss = 0.6810514330863953
Validation loss = 0.6828053593635559
Validation loss = 0.6895601153373718
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6669006943702698
Validation loss = 0.6661229133605957
Validation loss = 0.6781339049339294
Validation loss = 0.6812167167663574
Validation loss = 0.6834181547164917
Validation loss = 0.6856456995010376
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 589
average number of affinization = 483.7864077669903
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 538
average number of affinization = 484.3076923076923
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 363
average number of affinization = 483.15238095238095
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 608
average number of affinization = 484.3301886792453
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 633
average number of affinization = 485.7196261682243
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 607
average number of affinization = 486.8425925925926
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.42e+03 |
| Iteration     | 16        |
| MaximumReturn | -276      |
| MinimumReturn | -2.26e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6653347611427307
Validation loss = 0.6702476739883423
Validation loss = 0.6758138537406921
Validation loss = 0.6873157620429993
Validation loss = 0.6805505156517029
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6707891225814819
Validation loss = 0.6663148999214172
Validation loss = 0.6748334169387817
Validation loss = 0.6685279607772827
Validation loss = 0.6844433546066284
Validation loss = 0.6821556687355042
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6737017035484314
Validation loss = 0.678134560585022
Validation loss = 0.6894766092300415
Validation loss = 0.6872430443763733
Validation loss = 0.6890532970428467
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6674976348876953
Validation loss = 0.6704251766204834
Validation loss = 0.6819971799850464
Validation loss = 0.6763404011726379
Validation loss = 0.6826599836349487
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6746907234191895
Validation loss = 0.6684661507606506
Validation loss = 0.6763543486595154
Validation loss = 0.6780402660369873
Validation loss = 0.6831401586532593
Validation loss = 0.6857987642288208
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 486
average number of affinization = 486.8348623853211
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 606
average number of affinization = 487.91818181818184
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 527
average number of affinization = 488.27027027027026
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 638
average number of affinization = 489.60714285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 582
average number of affinization = 490.42477876106193
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 591
average number of affinization = 491.3070175438597
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -919      |
| Iteration     | 17        |
| MaximumReturn | 22.9      |
| MinimumReturn | -1.52e+03 |
| TotalSamples  | 76000     |
-----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6794397234916687
Validation loss = 0.6680893301963806
Validation loss = 0.6784721612930298
Validation loss = 0.6865823268890381
Validation loss = 0.6870647072792053
Validation loss = 0.685124397277832
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.667337954044342
Validation loss = 0.6713002920150757
Validation loss = 0.6737479567527771
Validation loss = 0.6783110499382019
Validation loss = 0.6854056715965271
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6795614957809448
Validation loss = 0.6770175099372864
Validation loss = 0.6829260587692261
Validation loss = 0.6916308403015137
Validation loss = 0.6880896091461182
Validation loss = 0.6909393072128296
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6801074147224426
Validation loss = 0.6690592765808105
Validation loss = 0.6781212687492371
Validation loss = 0.682796835899353
Validation loss = 0.6811947822570801
Validation loss = 0.685734748840332
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6733142137527466
Validation loss = 0.6704177856445312
Validation loss = 0.6744116544723511
Validation loss = 0.6793737411499023
Validation loss = 0.6763858199119568
Validation loss = 0.6844086647033691
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 511
average number of affinization = 491.4782608695652
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 566
average number of affinization = 492.12068965517244
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 614
average number of affinization = 493.1623931623932
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 506
average number of affinization = 493.271186440678
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 549
average number of affinization = 493.73949579831935
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 639
average number of affinization = 494.95
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -956      |
| Iteration     | 18        |
| MaximumReturn | -619      |
| MinimumReturn | -1.55e+03 |
| TotalSamples  | 80000     |
-----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6660460233688354
Validation loss = 0.671744704246521
Validation loss = 0.6724894046783447
Validation loss = 0.6773945093154907
Validation loss = 0.6860033869743347
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6672876477241516
Validation loss = 0.6659113764762878
Validation loss = 0.669407308101654
Validation loss = 0.6769899129867554
Validation loss = 0.6760464310646057
Validation loss = 0.6735336184501648
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6704031825065613
Validation loss = 0.6740953922271729
Validation loss = 0.6775487065315247
Validation loss = 0.6795021295547485
Validation loss = 0.6841483116149902
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6704844236373901
Validation loss = 0.6733359098434448
Validation loss = 0.6741867065429688
Validation loss = 0.6814444661140442
Validation loss = 0.6782842874526978
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6655672788619995
Validation loss = 0.6595672369003296
Validation loss = 0.6673885583877563
Validation loss = 0.6756734848022461
Validation loss = 0.6747997999191284
Validation loss = 0.6748124957084656
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 446
average number of affinization = 494.54545454545456
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 494.0081967213115
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 459
average number of affinization = 493.7235772357724
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 493.0967741935484
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 492.816
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 556
average number of affinization = 493.3174603174603
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.09e+03 |
| Iteration     | 19        |
| MaximumReturn | -637      |
| MinimumReturn | -1.56e+03 |
| TotalSamples  | 84000     |
-----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6756296157836914
Validation loss = 0.6686890125274658
Validation loss = 0.6771174073219299
Validation loss = 0.6787847876548767
Validation loss = 0.674136757850647
Validation loss = 0.6764055490493774
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6642116904258728
Validation loss = 0.6613694429397583
Validation loss = 0.6662556529045105
Validation loss = 0.6713591814041138
Validation loss = 0.6707595586776733
Validation loss = 0.6683224439620972
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6698247194290161
Validation loss = 0.6720619201660156
Validation loss = 0.6766008734703064
Validation loss = 0.6771972179412842
Validation loss = 0.6761336326599121
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6677560210227966
Validation loss = 0.6712803840637207
Validation loss = 0.674660861492157
Validation loss = 0.675725519657135
Validation loss = 0.6762660145759583
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6701143980026245
Validation loss = 0.666210949420929
Validation loss = 0.6626812815666199
Validation loss = 0.6740215420722961
Validation loss = 0.6712623238563538
Validation loss = 0.6728767156600952
Validation loss = 0.6697037816047668
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 603
average number of affinization = 494.18110236220474
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 601
average number of affinization = 495.015625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 488
average number of affinization = 494.9612403100775
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 494.7769230769231
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 497
average number of affinization = 494.793893129771
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 508
average number of affinization = 494.8939393939394
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.03e+03 |
| Iteration     | 20        |
| MaximumReturn | -602      |
| MinimumReturn | -1.56e+03 |
| TotalSamples  | 88000     |
-----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6604264974594116
Validation loss = 0.6560966968536377
Validation loss = 0.6679630875587463
Validation loss = 0.6686547994613647
Validation loss = 0.6703845262527466
Validation loss = 0.6702357530593872
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6558152437210083
Validation loss = 0.6538228392601013
Validation loss = 0.6584200263023376
Validation loss = 0.6610996127128601
Validation loss = 0.6660560965538025
Validation loss = 0.6648560762405396
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6538581252098083
Validation loss = 0.6572964191436768
Validation loss = 0.6651386618614197
Validation loss = 0.6665284633636475
Validation loss = 0.6677170395851135
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.653709352016449
Validation loss = 0.6564405560493469
Validation loss = 0.6644501686096191
Validation loss = 0.6657097339630127
Validation loss = 0.6665322184562683
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6527541875839233
Validation loss = 0.6547417044639587
Validation loss = 0.6571281552314758
Validation loss = 0.6582334637641907
Validation loss = 0.6607366800308228
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 582
average number of affinization = 495.54887218045116
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 602
average number of affinization = 496.34328358208955
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 465
average number of affinization = 496.1111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 620
average number of affinization = 497.0220588235294
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 484
average number of affinization = 496.9270072992701
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 508
average number of affinization = 497.0072463768116
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.13e+03 |
| Iteration     | 21        |
| MaximumReturn | -892      |
| MinimumReturn | -1.5e+03  |
| TotalSamples  | 92000     |
-----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6462144255638123
Validation loss = 0.6504263281822205
Validation loss = 0.6559857726097107
Validation loss = 0.6604730486869812
Validation loss = 0.6548416018486023
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6444841623306274
Validation loss = 0.6461970806121826
Validation loss = 0.6468287706375122
Validation loss = 0.6541646122932434
Validation loss = 0.650826632976532
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6529098153114319
Validation loss = 0.651262104511261
Validation loss = 0.6579415202140808
Validation loss = 0.6568965315818787
Validation loss = 0.6598235368728638
Validation loss = 0.6585807800292969
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6466463804244995
Validation loss = 0.6505014300346375
Validation loss = 0.6548025012016296
Validation loss = 0.6578953266143799
Validation loss = 0.6605924367904663
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6448488831520081
Validation loss = 0.6443514823913574
Validation loss = 0.6492365002632141
Validation loss = 0.650400698184967
Validation loss = 0.6513011455535889
Validation loss = 0.6560307741165161
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 383
average number of affinization = 496.1870503597122
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 501
average number of affinization = 496.2214285714286
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 424
average number of affinization = 495.709219858156
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 594
average number of affinization = 496.40140845070425
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 662
average number of affinization = 497.55944055944053
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 668
average number of affinization = 498.74305555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.14e+03 |
| Iteration     | 22        |
| MaximumReturn | -438      |
| MinimumReturn | -2.08e+03 |
| TotalSamples  | 96000     |
-----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6545482277870178
Validation loss = 0.6545552611351013
Validation loss = 0.6551212668418884
Validation loss = 0.6532330513000488
Validation loss = 0.6588250994682312
Validation loss = 0.660199761390686
Validation loss = 0.6610425710678101
Validation loss = 0.655138373374939
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6459022760391235
Validation loss = 0.6440979242324829
Validation loss = 0.646638810634613
Validation loss = 0.6501109600067139
Validation loss = 0.6477702856063843
Validation loss = 0.6543884873390198
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6489292979240417
Validation loss = 0.652948796749115
Validation loss = 0.6545527577400208
Validation loss = 0.6568769812583923
Validation loss = 0.654948890209198
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.655156135559082
Validation loss = 0.6479138135910034
Validation loss = 0.6537700891494751
Validation loss = 0.6564249396324158
Validation loss = 0.6575344204902649
Validation loss = 0.6595234274864197
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6437616944313049
Validation loss = 0.6454402208328247
Validation loss = 0.6454192996025085
Validation loss = 0.6476831436157227
Validation loss = 0.6498549580574036
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 615
average number of affinization = 499.5448275862069
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 336
average number of affinization = 498.4246575342466
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 632
average number of affinization = 499.3333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 488
average number of affinization = 499.2567567567568
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 655
average number of affinization = 500.3020134228188
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 501.31333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.32e+03 |
| Iteration     | 23        |
| MaximumReturn | -810      |
| MinimumReturn | -1.67e+03 |
| TotalSamples  | 100000    |
-----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6467178463935852
Validation loss = 0.6506007313728333
Validation loss = 0.6535914540290833
Validation loss = 0.6530295610427856
Validation loss = 0.6538630723953247
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6428264379501343
Validation loss = 0.641126275062561
Validation loss = 0.6438934803009033
Validation loss = 0.6474738121032715
Validation loss = 0.6507270336151123
Validation loss = 0.6497033834457397
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6428744792938232
Validation loss = 0.6478851437568665
Validation loss = 0.6527911424636841
Validation loss = 0.6543777585029602
Validation loss = 0.6573407649993896
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6507294178009033
Validation loss = 0.6482565402984619
Validation loss = 0.6521553993225098
Validation loss = 0.6571002006530762
Validation loss = 0.6519298553466797
Validation loss = 0.6551371812820435
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6464138031005859
Validation loss = 0.6406732201576233
Validation loss = 0.6426346302032471
Validation loss = 0.6478645205497742
Validation loss = 0.6451190710067749
Validation loss = 0.6486228704452515
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 324
average number of affinization = 500.13907284768214
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 443
average number of affinization = 499.7631578947368
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 499.52287581699346
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 322
average number of affinization = 498.37012987012986
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 482
average number of affinization = 498.26451612903224
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 497.7692307692308
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.57e+03 |
| Iteration     | 24        |
| MaximumReturn | -930      |
| MinimumReturn | -2.64e+03 |
| TotalSamples  | 104000    |
-----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6528915166854858
Validation loss = 0.6422467827796936
Validation loss = 0.6436315178871155
Validation loss = 0.6493340730667114
Validation loss = 0.6463235020637512
Validation loss = 0.6547914147377014
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6454114317893982
Validation loss = 0.6420522332191467
Validation loss = 0.6442776918411255
Validation loss = 0.645355224609375
Validation loss = 0.6478008031845093
Validation loss = 0.64790278673172
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6467733383178711
Validation loss = 0.6476171016693115
Validation loss = 0.654758095741272
Validation loss = 0.6533108949661255
Validation loss = 0.6526564955711365
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6475187540054321
Validation loss = 0.6455600261688232
Validation loss = 0.6521574854850769
Validation loss = 0.6544856429100037
Validation loss = 0.6523666381835938
Validation loss = 0.6529210209846497
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6467123031616211
Validation loss = 0.6421604156494141
Validation loss = 0.6425380706787109
Validation loss = 0.6445739269256592
Validation loss = 0.6452199220657349
Validation loss = 0.6460959911346436
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 426
average number of affinization = 497.312101910828
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 649
average number of affinization = 498.2721518987342
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 498.4842767295597
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 497.76875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 574
average number of affinization = 498.2422360248447
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 505
average number of affinization = 498.28395061728395
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.17e+03 |
| Iteration     | 25        |
| MaximumReturn | -777      |
| MinimumReturn | -2.01e+03 |
| TotalSamples  | 108000    |
-----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6452727317810059
Validation loss = 0.6402539014816284
Validation loss = 0.6474750638008118
Validation loss = 0.648321270942688
Validation loss = 0.6474130153656006
Validation loss = 0.6495675444602966
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6399024724960327
Validation loss = 0.6346326470375061
Validation loss = 0.6400002241134644
Validation loss = 0.6417928338050842
Validation loss = 0.6484534740447998
Validation loss = 0.6456025242805481
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6438475251197815
Validation loss = 0.6444628834724426
Validation loss = 0.6476919651031494
Validation loss = 0.6456285119056702
Validation loss = 0.6534789800643921
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6433918476104736
Validation loss = 0.6404113173484802
Validation loss = 0.6454429626464844
Validation loss = 0.648291826248169
Validation loss = 0.6493856906890869
Validation loss = 0.6468433141708374
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.639962375164032
Validation loss = 0.6359733939170837
Validation loss = 0.6398398876190186
Validation loss = 0.6415164470672607
Validation loss = 0.64028400182724
Validation loss = 0.6420242190361023
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 591
average number of affinization = 498.85276073619633
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 635
average number of affinization = 499.6829268292683
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 628
average number of affinization = 500.46060606060604
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 635
average number of affinization = 501.2710843373494
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 676
average number of affinization = 502.31736526946105
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 648
average number of affinization = 503.1845238095238
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -888      |
| Iteration     | 26        |
| MaximumReturn | -443      |
| MinimumReturn | -1.16e+03 |
| TotalSamples  | 112000    |
-----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6306021809577942
Validation loss = 0.6361525654792786
Validation loss = 0.6425027847290039
Validation loss = 0.6398200988769531
Validation loss = 0.6397146582603455
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6307172775268555
Validation loss = 0.6323990225791931
Validation loss = 0.6349342465400696
Validation loss = 0.6371399164199829
Validation loss = 0.6362056732177734
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6338872909545898
Validation loss = 0.6345921754837036
Validation loss = 0.6395493745803833
Validation loss = 0.6409301161766052
Validation loss = 0.643829882144928
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6349953413009644
Validation loss = 0.63410884141922
Validation loss = 0.6390275955200195
Validation loss = 0.6415101885795593
Validation loss = 0.6424177289009094
Validation loss = 0.642227292060852
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6332907676696777
Validation loss = 0.6302286982536316
Validation loss = 0.6358804702758789
Validation loss = 0.6312137246131897
Validation loss = 0.6368598937988281
Validation loss = 0.6410848498344421
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 663
average number of affinization = 504.1301775147929
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 655
average number of affinization = 505.0176470588235
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 384
average number of affinization = 504.3099415204678
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 640
average number of affinization = 505.0988372093023
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 678
average number of affinization = 506.0982658959538
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 506.9367816091954
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -585     |
| Iteration     | 27       |
| MaximumReturn | -271     |
| MinimumReturn | -865     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6307821869850159
Validation loss = 0.6321835517883301
Validation loss = 0.6372305154800415
Validation loss = 0.6338661313056946
Validation loss = 0.6351962685585022
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6329933404922485
Validation loss = 0.6255630254745483
Validation loss = 0.6301131248474121
Validation loss = 0.6324546933174133
Validation loss = 0.6346870064735413
Validation loss = 0.6342005729675293
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6297599673271179
Validation loss = 0.633821427822113
Validation loss = 0.637563943862915
Validation loss = 0.6398703455924988
Validation loss = 0.6379281282424927
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6388638019561768
Validation loss = 0.6316629648208618
Validation loss = 0.6344774961471558
Validation loss = 0.6381916403770447
Validation loss = 0.6429387927055359
Validation loss = 0.6405081152915955
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6269155740737915
Validation loss = 0.6283158659934998
Validation loss = 0.6266236305236816
Validation loss = 0.6344096064567566
Validation loss = 0.6319261789321899
Validation loss = 0.6346948742866516
Validation loss = 0.6336036920547485
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 507.36
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 577
average number of affinization = 507.7556818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 567
average number of affinization = 508.090395480226
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 564
average number of affinization = 508.40449438202245
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 672
average number of affinization = 509.3184357541899
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 648
average number of affinization = 510.0888888888889
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.08e+03 |
| Iteration     | 28        |
| MaximumReturn | -854      |
| MinimumReturn | -1.55e+03 |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6273258328437805
Validation loss = 0.6294649839401245
Validation loss = 0.6291952133178711
Validation loss = 0.6338904500007629
Validation loss = 0.6334918737411499
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6300567984580994
Validation loss = 0.6270179748535156
Validation loss = 0.6296107769012451
Validation loss = 0.6312231421470642
Validation loss = 0.6288442611694336
Validation loss = 0.6317134499549866
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6302757263183594
Validation loss = 0.6312111020088196
Validation loss = 0.6355810761451721
Validation loss = 0.6353581547737122
Validation loss = 0.63820880651474
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6302000284194946
Validation loss = 0.6294168829917908
Validation loss = 0.6363300681114197
Validation loss = 0.6342490315437317
Validation loss = 0.6364480257034302
Validation loss = 0.6366631984710693
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6235148310661316
Validation loss = 0.6249935030937195
Validation loss = 0.6252623200416565
Validation loss = 0.6273975968360901
Validation loss = 0.6275476217269897
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 565
average number of affinization = 510.3922651933702
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 545
average number of affinization = 510.5824175824176
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 625
average number of affinization = 511.20765027322403
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 651
average number of affinization = 511.9673913043478
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 512.3405405405406
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 513.0913978494624
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.02e+03 |
| Iteration     | 29        |
| MaximumReturn | -370      |
| MinimumReturn | -2.04e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6303035616874695
Validation loss = 0.6297297477722168
Validation loss = 0.6354416012763977
Validation loss = 0.6337520480155945
Validation loss = 0.631102979183197
Validation loss = 0.6369673609733582
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6247040629386902
Validation loss = 0.6239268183708191
Validation loss = 0.6286997199058533
Validation loss = 0.6331799626350403
Validation loss = 0.6317844390869141
Validation loss = 0.6300303936004639
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6323586106300354
Validation loss = 0.6314209699630737
Validation loss = 0.6338672637939453
Validation loss = 0.6365604400634766
Validation loss = 0.6383660435676575
Validation loss = 0.6373238563537598
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6323700547218323
Validation loss = 0.6278461217880249
Validation loss = 0.6336625814437866
Validation loss = 0.6330742835998535
Validation loss = 0.6372131109237671
Validation loss = 0.6330549120903015
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6227445006370544
Validation loss = 0.6299135088920593
Validation loss = 0.6282484531402588
Validation loss = 0.6299201846122742
Validation loss = 0.6314513683319092
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 725
average number of affinization = 514.2245989304813
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 410
average number of affinization = 513.6702127659574
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 297
average number of affinization = 512.5238095238095
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 718
average number of affinization = 513.6052631578947
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 644
average number of affinization = 514.2879581151833
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 590
average number of affinization = 514.6822916666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -998      |
| Iteration     | 30        |
| MaximumReturn | -568      |
| MinimumReturn | -1.46e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6295564770698547
Validation loss = 0.6300600171089172
Validation loss = 0.6341755986213684
Validation loss = 0.6341067552566528
Validation loss = 0.6337794065475464
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6266927719116211
Validation loss = 0.626270592212677
Validation loss = 0.6309738159179688
Validation loss = 0.6305719614028931
Validation loss = 0.6303504109382629
Validation loss = 0.6296623349189758
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6305011510848999
Validation loss = 0.627556324005127
Validation loss = 0.6350955963134766
Validation loss = 0.6368048787117004
Validation loss = 0.6369346380233765
Validation loss = 0.6357191205024719
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6295076608657837
Validation loss = 0.6259950995445251
Validation loss = 0.6311448216438293
Validation loss = 0.6296969652175903
Validation loss = 0.6310296058654785
Validation loss = 0.6337922215461731
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6231871843338013
Validation loss = 0.6240024566650391
Validation loss = 0.6262421011924744
Validation loss = 0.6306434869766235
Validation loss = 0.629414439201355
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 626
average number of affinization = 515.259067357513
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 694
average number of affinization = 516.180412371134
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 516.876923076923
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 587
average number of affinization = 517.234693877551
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 517.9187817258884
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 714
average number of affinization = 518.9090909090909
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -923      |
| Iteration     | 31        |
| MaximumReturn | -317      |
| MinimumReturn | -1.43e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.627797544002533
Validation loss = 0.6304194331169128
Validation loss = 0.6289636492729187
Validation loss = 0.6317981481552124
Validation loss = 0.6330267190933228
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.626440167427063
Validation loss = 0.6267169713973999
Validation loss = 0.6302079558372498
Validation loss = 0.628538191318512
Validation loss = 0.6299106478691101
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.629035472869873
Validation loss = 0.631165623664856
Validation loss = 0.632459282875061
Validation loss = 0.6361717581748962
Validation loss = 0.6356458067893982
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6280192136764526
Validation loss = 0.6298220753669739
Validation loss = 0.6325837969779968
Validation loss = 0.633857250213623
Validation loss = 0.6331354975700378
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6255750060081482
Validation loss = 0.6217585802078247
Validation loss = 0.6298120617866516
Validation loss = 0.6289968490600586
Validation loss = 0.6319409608840942
Validation loss = 0.6308072805404663
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 676
average number of affinization = 519.6984924623116
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 519.21
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 590
average number of affinization = 519.5621890547263
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 518.9059405940594
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 518.5960591133005
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 755
average number of affinization = 519.7549019607843
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.08e+03 |
| Iteration     | 32        |
| MaximumReturn | -352      |
| MinimumReturn | -2.23e+03 |
| TotalSamples  | 136000    |
-----------------------------
