Logging to experiments/gym_cheetahA01/oct29/w350e3_seed2314
Print configuration .....
{'env_name': 'gym_cheetahA01', 'random_seeds': [4321, 2314, 2341, 3421], 'save_variables': False, 'model_save_dir': '/tmp/gym_cheetahA01_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.41626662015914917
Validation loss = 0.15509875118732452
Validation loss = 0.10188835114240646
Validation loss = 0.0828862339258194
Validation loss = 0.07486587762832642
Validation loss = 0.07052556425333023
Validation loss = 0.0707787424325943
Validation loss = 0.06376060098409653
Validation loss = 0.07325275987386703
Validation loss = 0.0639643520116806
Validation loss = 0.06092337518930435
Validation loss = 0.06680887937545776
Validation loss = 0.05894017592072487
Validation loss = 0.05931881442666054
Validation loss = 0.061219312250614166
Validation loss = 0.06023404747247696
Validation loss = 0.06033241003751755
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5270459651947021
Validation loss = 0.1578218638896942
Validation loss = 0.10231120139360428
Validation loss = 0.08332541584968567
Validation loss = 0.0790555402636528
Validation loss = 0.07149282842874527
Validation loss = 0.06579296290874481
Validation loss = 0.07222697883844376
Validation loss = 0.062230080366134644
Validation loss = 0.06514972448348999
Validation loss = 0.06294597685337067
Validation loss = 0.06061825156211853
Validation loss = 0.07252871990203857
Validation loss = 0.06054210662841797
Validation loss = 0.05781339854001999
Validation loss = 0.057095691561698914
Validation loss = 0.0801030695438385
Validation loss = 0.05813115835189819
Validation loss = 0.05636683106422424
Validation loss = 0.06107965484261513
Validation loss = 0.06450890004634857
Validation loss = 0.05919352546334267
Validation loss = 0.06034218147397041
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.7194564342498779
Validation loss = 0.15403011441230774
Validation loss = 0.09534990787506104
Validation loss = 0.0808405727148056
Validation loss = 0.07475271821022034
Validation loss = 0.07955816388130188
Validation loss = 0.0682373195886612
Validation loss = 0.06413424015045166
Validation loss = 0.06875036656856537
Validation loss = 0.06667788326740265
Validation loss = 0.0648099035024643
Validation loss = 0.07026326656341553
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4168115258216858
Validation loss = 0.160335510969162
Validation loss = 0.1024191677570343
Validation loss = 0.08291620761156082
Validation loss = 0.07628598809242249
Validation loss = 0.06898380815982819
Validation loss = 0.06979377567768097
Validation loss = 0.06434393674135208
Validation loss = 0.061984702944755554
Validation loss = 0.066761814057827
Validation loss = 0.0650540441274643
Validation loss = 0.06251318752765656
Validation loss = 0.061903368681669235
Validation loss = 0.057776495814323425
Validation loss = 0.08463446795940399
Validation loss = 0.06438225507736206
Validation loss = 0.05891190096735954
Validation loss = 0.05578720197081566
Validation loss = 0.055825404822826385
Validation loss = 0.059083547443151474
Validation loss = 0.06234246492385864
Validation loss = 0.05655355751514435
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4789770543575287
Validation loss = 0.15313327312469482
Validation loss = 0.09661578387022018
Validation loss = 0.0806705430150032
Validation loss = 0.07437600195407867
Validation loss = 0.06978943943977356
Validation loss = 0.06724195182323456
Validation loss = 0.07072744518518448
Validation loss = 0.06316046416759491
Validation loss = 0.06307576596736908
Validation loss = 0.0713154673576355
Validation loss = 0.06119466572999954
Validation loss = 0.0583486333489418
Validation loss = 0.060092948377132416
Validation loss = 0.058309897780418396
Validation loss = 0.058903101831674576
Validation loss = 0.06381808966398239
Validation loss = 0.057270657271146774
Validation loss = 0.09521748125553131
Validation loss = 0.05907409265637398
Validation loss = 0.06417862325906754
Validation loss = 0.0584954172372818
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 208
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 207
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 194
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 174
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 201
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 196
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -313     |
| Iteration     | 0        |
| MaximumReturn | -252     |
| MinimumReturn | -412     |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07706505805253983
Validation loss = 0.05663612484931946
Validation loss = 0.054163940250873566
Validation loss = 0.05325084924697876
Validation loss = 0.054905690252780914
Validation loss = 0.04992659389972687
Validation loss = 0.04879085347056389
Validation loss = 0.04974175989627838
Validation loss = 0.05025473237037659
Validation loss = 0.047061555087566376
Validation loss = 0.04740848019719124
Validation loss = 0.04697001725435257
Validation loss = 0.0465560108423233
Validation loss = 0.049525462090969086
Validation loss = 0.04749840125441551
Validation loss = 0.0459335520863533
Validation loss = 0.048529334366321564
Validation loss = 0.04851020872592926
Validation loss = 0.048861924558877945
Validation loss = 0.05060083046555519
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0774092972278595
Validation loss = 0.05436067283153534
Validation loss = 0.053333546966314316
Validation loss = 0.0537298247218132
Validation loss = 0.051449958235025406
Validation loss = 0.053752485662698746
Validation loss = 0.04828810691833496
Validation loss = 0.04765767604112625
Validation loss = 0.0485658198595047
Validation loss = 0.05120597034692764
Validation loss = 0.04912910237908363
Validation loss = 0.05200554430484772
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08057626336812973
Validation loss = 0.05715175345540047
Validation loss = 0.05606171861290932
Validation loss = 0.052004486322402954
Validation loss = 0.05211237072944641
Validation loss = 0.05497274920344353
Validation loss = 0.04984743520617485
Validation loss = 0.048499006778001785
Validation loss = 0.055923424661159515
Validation loss = 0.04796069115400314
Validation loss = 0.04675368592143059
Validation loss = 0.04702877998352051
Validation loss = 0.04787711799144745
Validation loss = 0.050637468695640564
Validation loss = 0.045200929045677185
Validation loss = 0.04626400023698807
Validation loss = 0.04610820114612579
Validation loss = 0.04568670317530632
Validation loss = 0.04585546255111694
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07995443046092987
Validation loss = 0.0542806014418602
Validation loss = 0.05038360506296158
Validation loss = 0.0515047125518322
Validation loss = 0.049671050161123276
Validation loss = 0.05038555711507797
Validation loss = 0.04803525656461716
Validation loss = 0.04651029407978058
Validation loss = 0.04812579229474068
Validation loss = 0.046782348304986954
Validation loss = 0.05283161625266075
Validation loss = 0.04837774485349655
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07790130376815796
Validation loss = 0.053308818489313126
Validation loss = 0.052158910781145096
Validation loss = 0.05379582196474075
Validation loss = 0.05000042915344238
Validation loss = 0.053470660001039505
Validation loss = 0.051232680678367615
Validation loss = 0.04871524125337601
Validation loss = 0.047946151345968246
Validation loss = 0.05272839963436127
Validation loss = 0.04630144685506821
Validation loss = 0.046712592244148254
Validation loss = 0.04594063386321068
Validation loss = 0.04674360528588295
Validation loss = 0.04542411118745804
Validation loss = 0.04457790032029152
Validation loss = 0.05223492532968521
Validation loss = 0.0452355220913887
Validation loss = 0.04452193155884743
Validation loss = 0.04617994278669357
Validation loss = 0.04562917351722717
Validation loss = 0.04608222097158432
Validation loss = 0.04966914281249046
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 258
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 447
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 392
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 432
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 202
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 401
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 123      |
| Iteration     | 1        |
| MaximumReturn | 427      |
| MinimumReturn | -295     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06364312022924423
Validation loss = 0.05355098843574524
Validation loss = 0.051978837698698044
Validation loss = 0.052765339612960815
Validation loss = 0.05296297371387482
Validation loss = 0.049532800912857056
Validation loss = 0.04910207912325859
Validation loss = 0.052467893809080124
Validation loss = 0.0592748261988163
Validation loss = 0.04884926974773407
Validation loss = 0.05093522369861603
Validation loss = 0.05044974014163017
Validation loss = 0.048272546380758286
Validation loss = 0.05220210552215576
Validation loss = 0.050177138298749924
Validation loss = 0.04961572587490082
Validation loss = 0.04986982420086861
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06366627663373947
Validation loss = 0.05427065119147301
Validation loss = 0.051801543682813644
Validation loss = 0.052833184599876404
Validation loss = 0.052289072424173355
Validation loss = 0.057992201298475266
Validation loss = 0.054548412561416626
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06086677312850952
Validation loss = 0.054237160831689835
Validation loss = 0.05590757355093956
Validation loss = 0.057153210043907166
Validation loss = 0.05013478174805641
Validation loss = 0.05208368971943855
Validation loss = 0.05439570173621178
Validation loss = 0.05451817438006401
Validation loss = 0.05127681791782379
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06231074407696724
Validation loss = 0.05329950153827667
Validation loss = 0.05376690253615379
Validation loss = 0.05425079166889191
Validation loss = 0.05302979424595833
Validation loss = 0.05468359962105751
Validation loss = 0.05198226496577263
Validation loss = 0.054466452449560165
Validation loss = 0.05138471722602844
Validation loss = 0.05185970664024353
Validation loss = 0.053398165851831436
Validation loss = 0.05039656534790993
Validation loss = 0.053731903433799744
Validation loss = 0.05359038710594177
Validation loss = 0.0502629317343235
Validation loss = 0.051631245762109756
Validation loss = 0.05098065361380577
Validation loss = 0.05213918909430504
Validation loss = 0.05105720832943916
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06432498246431351
Validation loss = 0.051265209913253784
Validation loss = 0.0523335225880146
Validation loss = 0.05150115489959717
Validation loss = 0.054281655699014664
Validation loss = 0.05131829157471657
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 43
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 321
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 354
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 137
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 321
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 372
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -78.2    |
| Iteration     | 2        |
| MaximumReturn | 325      |
| MinimumReturn | -405     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11694641411304474
Validation loss = 0.06888694316148758
Validation loss = 0.06768200546503067
Validation loss = 0.06700123846530914
Validation loss = 0.06290838867425919
Validation loss = 0.0635172575712204
Validation loss = 0.06446179747581482
Validation loss = 0.06613092869520187
Validation loss = 0.061142049729824066
Validation loss = 0.06300009787082672
Validation loss = 0.061832278966903687
Validation loss = 0.06062312051653862
Validation loss = 0.06076858937740326
Validation loss = 0.06265569478273392
Validation loss = 0.06477575749158859
Validation loss = 0.06311604380607605
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11270304024219513
Validation loss = 0.07171239703893661
Validation loss = 0.07059283554553986
Validation loss = 0.0635392963886261
Validation loss = 0.06632544100284576
Validation loss = 0.06272824108600616
Validation loss = 0.06401005387306213
Validation loss = 0.06486055254936218
Validation loss = 0.06462413817644119
Validation loss = 0.06563614308834076
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1366734802722931
Validation loss = 0.07058212161064148
Validation loss = 0.07013419270515442
Validation loss = 0.06540952622890472
Validation loss = 0.06420831382274628
Validation loss = 0.06332794576883316
Validation loss = 0.06238273158669472
Validation loss = 0.0634005218744278
Validation loss = 0.06421957910060883
Validation loss = 0.06783275306224823
Validation loss = 0.06165604293346405
Validation loss = 0.062109991908073425
Validation loss = 0.06281031668186188
Validation loss = 0.0634893849492073
Validation loss = 0.06423496454954147
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11868158727884293
Validation loss = 0.07011903822422028
Validation loss = 0.065546914935112
Validation loss = 0.06343191862106323
Validation loss = 0.06318645179271698
Validation loss = 0.06740526109933853
Validation loss = 0.0636419951915741
Validation loss = 0.06448221951723099
Validation loss = 0.06255653500556946
Validation loss = 0.06372705101966858
Validation loss = 0.0683290883898735
Validation loss = 0.06454373151063919
Validation loss = 0.06264807283878326
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1244734525680542
Validation loss = 0.07043439894914627
Validation loss = 0.06848407536745071
Validation loss = 0.06299475580453873
Validation loss = 0.06320028752088547
Validation loss = 0.06253071129322052
Validation loss = 0.0610400065779686
Validation loss = 0.06253670156002045
Validation loss = 0.06322838366031647
Validation loss = 0.06338441371917725
Validation loss = 0.06190013885498047
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 577
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 423
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 530
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 568
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 432
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 35.8     |
| Iteration     | 3        |
| MaximumReturn | 361      |
| MinimumReturn | -336     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07034976035356522
Validation loss = 0.06182122230529785
Validation loss = 0.06058144569396973
Validation loss = 0.059089045971632004
Validation loss = 0.058422308415174484
Validation loss = 0.058221809566020966
Validation loss = 0.06004276126623154
Validation loss = 0.05981643125414848
Validation loss = 0.057549335062503815
Validation loss = 0.05736198276281357
Validation loss = 0.06166471168398857
Validation loss = 0.059953976422548294
Validation loss = 0.05939602851867676
Validation loss = 0.05788857862353325
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07413935661315918
Validation loss = 0.060652025043964386
Validation loss = 0.061678797006607056
Validation loss = 0.05945547670125961
Validation loss = 0.05830882862210274
Validation loss = 0.0605793297290802
Validation loss = 0.05795581266283989
Validation loss = 0.06018739193677902
Validation loss = 0.06127730756998062
Validation loss = 0.0592794232070446
Validation loss = 0.05995295196771622
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06895381957292557
Validation loss = 0.062445469200611115
Validation loss = 0.06151176244020462
Validation loss = 0.060126565396785736
Validation loss = 0.06143588945269585
Validation loss = 0.059849679470062256
Validation loss = 0.06055820733308792
Validation loss = 0.0603918731212616
Validation loss = 0.05988685041666031
Validation loss = 0.06001877784729004
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07070793211460114
Validation loss = 0.06289514154195786
Validation loss = 0.05960167571902275
Validation loss = 0.06097255274653435
Validation loss = 0.05911312252283096
Validation loss = 0.05852748826146126
Validation loss = 0.058586664497852325
Validation loss = 0.060162533074617386
Validation loss = 0.05968208983540535
Validation loss = 0.05998866632580757
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07190130650997162
Validation loss = 0.05879976600408554
Validation loss = 0.058724235743284225
Validation loss = 0.05982166528701782
Validation loss = 0.05820765346288681
Validation loss = 0.05943599343299866
Validation loss = 0.05921800062060356
Validation loss = 0.06159418821334839
Validation loss = 0.05701380968093872
Validation loss = 0.05979573726654053
Validation loss = 0.05821148306131363
Validation loss = 0.06064876168966293
Validation loss = 0.058691300451755524
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 692
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 653
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 708
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 632
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 730
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 680
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 900      |
| Iteration     | 4        |
| MaximumReturn | 1.45e+03 |
| MinimumReturn | 121      |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05390606448054314
Validation loss = 0.05063621327280998
Validation loss = 0.04872773587703705
Validation loss = 0.04870419204235077
Validation loss = 0.05057865381240845
Validation loss = 0.049376700073480606
Validation loss = 0.0478195957839489
Validation loss = 0.04992501437664032
Validation loss = 0.048790037631988525
Validation loss = 0.04792375862598419
Validation loss = 0.04898926243185997
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.057002753019332886
Validation loss = 0.052971553057432175
Validation loss = 0.050537362694740295
Validation loss = 0.05062955617904663
Validation loss = 0.048120807856321335
Validation loss = 0.048847541213035583
Validation loss = 0.04989922419190407
Validation loss = 0.0500195138156414
Validation loss = 0.048615679144859314
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05988515540957451
Validation loss = 0.05317409336566925
Validation loss = 0.05023014172911644
Validation loss = 0.05225955322384834
Validation loss = 0.0489262230694294
Validation loss = 0.04946242645382881
Validation loss = 0.0510246604681015
Validation loss = 0.05026013031601906
Validation loss = 0.049894124269485474
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.05982936918735504
Validation loss = 0.05050693079829216
Validation loss = 0.050577372312545776
Validation loss = 0.049482595175504684
Validation loss = 0.052032727748155594
Validation loss = 0.05026422068476677
Validation loss = 0.05167306587100029
Validation loss = 0.049694497138261795
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.053508881479501724
Validation loss = 0.05118468776345253
Validation loss = 0.048766013234853745
Validation loss = 0.04853236675262451
Validation loss = 0.048835497349500656
Validation loss = 0.04900473728775978
Validation loss = 0.050223711878061295
Validation loss = 0.04888113960623741
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 724
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 694
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 730
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 779
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 732
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 709
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.49e+03 |
| Iteration     | 5        |
| MaximumReturn | 1.74e+03 |
| MinimumReturn | 812      |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.049223173409700394
Validation loss = 0.04330721125006676
Validation loss = 0.04235494136810303
Validation loss = 0.04165403172373772
Validation loss = 0.04115772619843483
Validation loss = 0.04283822327852249
Validation loss = 0.04264358431100845
Validation loss = 0.041033826768398285
Validation loss = 0.04080859199166298
Validation loss = 0.04346340149641037
Validation loss = 0.04196556657552719
Validation loss = 0.04118359088897705
Validation loss = 0.0410970114171505
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0485537126660347
Validation loss = 0.04508375748991966
Validation loss = 0.04298955202102661
Validation loss = 0.04325488582253456
Validation loss = 0.04315008595585823
Validation loss = 0.044186946004629135
Validation loss = 0.044216904789209366
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04868054762482643
Validation loss = 0.045282404869794846
Validation loss = 0.04433121532201767
Validation loss = 0.04425559565424919
Validation loss = 0.04320913925766945
Validation loss = 0.04272938519716263
Validation loss = 0.04245999827980995
Validation loss = 0.04290485009551048
Validation loss = 0.04425794258713722
Validation loss = 0.041438035666942596
Validation loss = 0.04118456318974495
Validation loss = 0.04248138144612312
Validation loss = 0.042300619184970856
Validation loss = 0.041486505419015884
Validation loss = 0.044019777327775955
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.047412626445293427
Validation loss = 0.04172798991203308
Validation loss = 0.04187409207224846
Validation loss = 0.04298261180520058
Validation loss = 0.045613862574100494
Validation loss = 0.04280281439423561
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.046053458005189896
Validation loss = 0.04398719221353531
Validation loss = 0.044361673295497894
Validation loss = 0.040972571820020676
Validation loss = 0.04287341237068176
Validation loss = 0.0436558797955513
Validation loss = 0.045144207775592804
Validation loss = 0.04074695706367493
Validation loss = 0.04205095395445824
Validation loss = 0.040263693779706955
Validation loss = 0.041309986263513565
Validation loss = 0.0412968285381794
Validation loss = 0.040368907153606415
Validation loss = 0.046273697167634964
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 732
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 810
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 755
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 733
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 732
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 740
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 772      |
| Iteration     | 6        |
| MaximumReturn | 1.3e+03  |
| MinimumReturn | -552     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0421934500336647
Validation loss = 0.03845404088497162
Validation loss = 0.03624800592660904
Validation loss = 0.03916585445404053
Validation loss = 0.03750080242753029
Validation loss = 0.03642572462558746
Validation loss = 0.036709509789943695
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.045596882700920105
Validation loss = 0.0407140851020813
Validation loss = 0.039802148938179016
Validation loss = 0.03788460046052933
Validation loss = 0.038430966436862946
Validation loss = 0.038195978850126266
Validation loss = 0.039063818752765656
Validation loss = 0.03987249732017517
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.043027088046073914
Validation loss = 0.03905617445707321
Validation loss = 0.037069402635097504
Validation loss = 0.03926334157586098
Validation loss = 0.03741326183080673
Validation loss = 0.0369897224009037
Validation loss = 0.036158882081508636
Validation loss = 0.03892280161380768
Validation loss = 0.03662940114736557
Validation loss = 0.03639713674783707
Validation loss = 0.03834650292992592
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04193807393312454
Validation loss = 0.0380210280418396
Validation loss = 0.04069458320736885
Validation loss = 0.038263849914073944
Validation loss = 0.03832904249429703
Validation loss = 0.03843367472290993
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.043034691363573074
Validation loss = 0.03657076507806778
Validation loss = 0.037247732281684875
Validation loss = 0.039196163415908813
Validation loss = 0.03846123069524765
Validation loss = 0.03767291083931923
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 759
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 769
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 768
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 765
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 752
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 726
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.76e+03 |
| Iteration     | 7        |
| MaximumReturn | 2.13e+03 |
| MinimumReturn | 623      |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.03724417835474014
Validation loss = 0.03408508375287056
Validation loss = 0.03470505774021149
Validation loss = 0.03359454497694969
Validation loss = 0.032926615327596664
Validation loss = 0.03554714843630791
Validation loss = 0.035838447511196136
Validation loss = 0.03396766260266304
Validation loss = 0.033211592584848404
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.037384986877441406
Validation loss = 0.03603367507457733
Validation loss = 0.03520308434963226
Validation loss = 0.03405746817588806
Validation loss = 0.03553100302815437
Validation loss = 0.03480624780058861
Validation loss = 0.03568100929260254
Validation loss = 0.03457087650895119
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03665339574217796
Validation loss = 0.03420133888721466
Validation loss = 0.03426707535982132
Validation loss = 0.03606490418314934
Validation loss = 0.03380347043275833
Validation loss = 0.03368017449975014
Validation loss = 0.03432796150445938
Validation loss = 0.03318142518401146
Validation loss = 0.033433809876441956
Validation loss = 0.03413534164428711
Validation loss = 0.033970907330513
Validation loss = 0.03362490236759186
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.03715014457702637
Validation loss = 0.034660112112760544
Validation loss = 0.034000616520643234
Validation loss = 0.034620217978954315
Validation loss = 0.03383680060505867
Validation loss = 0.03527848795056343
Validation loss = 0.03533409535884857
Validation loss = 0.033723313361406326
Validation loss = 0.03357439860701561
Validation loss = 0.037228673696517944
Validation loss = 0.03339800611138344
Validation loss = 0.03312934562563896
Validation loss = 0.03338557481765747
Validation loss = 0.033116672188043594
Validation loss = 0.035310328006744385
Validation loss = 0.03327493742108345
Validation loss = 0.03599730134010315
Validation loss = 0.03316010534763336
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03497102111577988
Validation loss = 0.03602515906095505
Validation loss = 0.03552829474210739
Validation loss = 0.035178638994693756
Validation loss = 0.03448398783802986
Validation loss = 0.034735918045043945
Validation loss = 0.03411471098661423
Validation loss = 0.03503476083278656
Validation loss = 0.03523842617869377
Validation loss = 0.03262331336736679
Validation loss = 0.03318027779459953
Validation loss = 0.03346293047070503
Validation loss = 0.03319239243865013
Validation loss = 0.032782744616270065
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 738
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 792
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 795
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 776
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 787
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 819
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.64e+03 |
| Iteration     | 8        |
| MaximumReturn | 2.42e+03 |
| MinimumReturn | 328      |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.032975807785987854
Validation loss = 0.029899295419454575
Validation loss = 0.029801836237311363
Validation loss = 0.029459640383720398
Validation loss = 0.031096527352929115
Validation loss = 0.02995442971587181
Validation loss = 0.030553560703992844
Validation loss = 0.030749637633562088
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03552975505590439
Validation loss = 0.03276801481842995
Validation loss = 0.03281950205564499
Validation loss = 0.033776093274354935
Validation loss = 0.03200893849134445
Validation loss = 0.0311763696372509
Validation loss = 0.030779540538787842
Validation loss = 0.030777279287576675
Validation loss = 0.031676724553108215
Validation loss = 0.031745795160532
Validation loss = 0.0321943461894989
Validation loss = 0.030719095841050148
Validation loss = 0.0305672325193882
Validation loss = 0.030160272493958473
Validation loss = 0.03048640862107277
Validation loss = 0.030491570010781288
Validation loss = 0.030354926362633705
Validation loss = 0.029955986887216568
Validation loss = 0.030095109716057777
Validation loss = 0.030006956309080124
Validation loss = 0.030510282143950462
Validation loss = 0.029618194326758385
Validation loss = 0.0302006546407938
Validation loss = 0.029841799288988113
Validation loss = 0.029510384425520897
Validation loss = 0.030478160828351974
Validation loss = 0.02989933453500271
Validation loss = 0.02968446910381317
Validation loss = 0.02988075278699398
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03764832392334938
Validation loss = 0.03093675896525383
Validation loss = 0.0293174646794796
Validation loss = 0.030527561902999878
Validation loss = 0.03112201765179634
Validation loss = 0.031218335032463074
Validation loss = 0.029111236333847046
Validation loss = 0.0304926335811615
Validation loss = 0.030376141890883446
Validation loss = 0.031025221571326256
Validation loss = 0.02946881577372551
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.03616803139448166
Validation loss = 0.031556036323308945
Validation loss = 0.030699709430336952
Validation loss = 0.031730376183986664
Validation loss = 0.030906502157449722
Validation loss = 0.0319833979010582
Validation loss = 0.030405912548303604
Validation loss = 0.030573617666959763
Validation loss = 0.03215070813894272
Validation loss = 0.03084319829940796
Validation loss = 0.030669067054986954
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.03360465541481972
Validation loss = 0.030909478664398193
Validation loss = 0.03183082863688469
Validation loss = 0.029393494129180908
Validation loss = 0.02959848940372467
Validation loss = 0.030772661790251732
Validation loss = 0.031277917325496674
Validation loss = 0.030083060264587402
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 774
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 761
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 800
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 774
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 800
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 785
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.49e+03 |
| Iteration     | 9        |
| MaximumReturn | 2.44e+03 |
| MinimumReturn | -201     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.031060798093676567
Validation loss = 0.02909940294921398
Validation loss = 0.027137570083141327
Validation loss = 0.02763058990240097
Validation loss = 0.026577036827802658
Validation loss = 0.026417015120387077
Validation loss = 0.02630055509507656
Validation loss = 0.02621089667081833
Validation loss = 0.025686291977763176
Validation loss = 0.026284294202923775
Validation loss = 0.026903273537755013
Validation loss = 0.026330340653657913
Validation loss = 0.02650805562734604
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.03209618851542473
Validation loss = 0.02742033638060093
Validation loss = 0.026991896331310272
Validation loss = 0.027306359261274338
Validation loss = 0.02710729092359543
Validation loss = 0.025634916499257088
Validation loss = 0.025598444044589996
Validation loss = 0.026309698820114136
Validation loss = 0.02696067839860916
Validation loss = 0.02644534222781658
Validation loss = 0.025184238329529762
Validation loss = 0.025992808863520622
Validation loss = 0.026222364977002144
Validation loss = 0.02898707240819931
Validation loss = 0.026426518335938454
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.03153986483812332
Validation loss = 0.026958880946040154
Validation loss = 0.02620070055127144
Validation loss = 0.026157867163419724
Validation loss = 0.026061205193400383
Validation loss = 0.02903764694929123
Validation loss = 0.0260145366191864
Validation loss = 0.02645888179540634
Validation loss = 0.025122283026576042
Validation loss = 0.028056683018803596
Validation loss = 0.025670744478702545
Validation loss = 0.025605183094739914
Validation loss = 0.025790875777602196
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.03399428352713585
Validation loss = 0.028552250936627388
Validation loss = 0.028086453676223755
Validation loss = 0.02782229334115982
Validation loss = 0.027060795575380325
Validation loss = 0.0272389966994524
Validation loss = 0.027023542672395706
Validation loss = 0.02771756611764431
Validation loss = 0.02784685045480728
Validation loss = 0.026441948488354683
Validation loss = 0.026910193264484406
Validation loss = 0.028160160407423973
Validation loss = 0.0264663714915514
Validation loss = 0.026452310383319855
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.0316760390996933
Validation loss = 0.028127623721957207
Validation loss = 0.026711398735642433
Validation loss = 0.028870178386569023
Validation loss = 0.028018873184919357
Validation loss = 0.027404040098190308
Validation loss = 0.02605205960571766
Validation loss = 0.02642952650785446
Validation loss = 0.026390479877591133
Validation loss = 0.02791064977645874
Validation loss = 0.025236494839191437
Validation loss = 0.025933600962162018
Validation loss = 0.02526872232556343
Validation loss = 0.02606828697025776
Validation loss = 0.025724900886416435
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 825
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 815
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 812
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 819
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 826
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 815
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.15e+03 |
| Iteration     | 10       |
| MaximumReturn | 2.34e+03 |
| MinimumReturn | 1.95e+03 |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.025799060240387917
Validation loss = 0.02411697804927826
Validation loss = 0.024764450266957283
Validation loss = 0.024855514988303185
Validation loss = 0.024444429203867912
Validation loss = 0.022963538765907288
Validation loss = 0.02423112653195858
Validation loss = 0.023394977673888206
Validation loss = 0.023939207196235657
Validation loss = 0.02439952827990055
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.025917358696460724
Validation loss = 0.024302123114466667
Validation loss = 0.02438756264746189
Validation loss = 0.024506544694304466
Validation loss = 0.023230023682117462
Validation loss = 0.025293240323662758
Validation loss = 0.023322680965065956
Validation loss = 0.02347235567867756
Validation loss = 0.022921161726117134
Validation loss = 0.022368131205439568
Validation loss = 0.023997286334633827
Validation loss = 0.02452818863093853
Validation loss = 0.024578282609581947
Validation loss = 0.02335631288588047
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.029096195474267006
Validation loss = 0.023644059896469116
Validation loss = 0.023388022556900978
Validation loss = 0.02335672825574875
Validation loss = 0.023279210552573204
Validation loss = 0.023703673854470253
Validation loss = 0.024846075102686882
Validation loss = 0.023089060559868813
Validation loss = 0.022954434156417847
Validation loss = 0.023401344195008278
Validation loss = 0.023686742410063744
Validation loss = 0.022529372945427895
Validation loss = 0.023121744394302368
Validation loss = 0.023309307172894478
Validation loss = 0.02237231284379959
Validation loss = 0.022121423855423927
Validation loss = 0.022844349965453148
Validation loss = 0.021929031237959862
Validation loss = 0.023495793342590332
Validation loss = 0.022831618785858154
Validation loss = 0.022408245131373405
Validation loss = 0.0242364052683115
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.027811983600258827
Validation loss = 0.02487621270120144
Validation loss = 0.024702047929167747
Validation loss = 0.023993993178009987
Validation loss = 0.02500125765800476
Validation loss = 0.025901632383465767
Validation loss = 0.02478206716477871
Validation loss = 0.023763583973050117
Validation loss = 0.02637125551700592
Validation loss = 0.024394579231739044
Validation loss = 0.024218866601586342
Validation loss = 0.02415415085852146
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.026305770501494408
Validation loss = 0.02332848496735096
Validation loss = 0.024417562410235405
Validation loss = 0.02345195971429348
Validation loss = 0.02300109528005123
Validation loss = 0.02333015203475952
Validation loss = 0.02347663976252079
Validation loss = 0.02341487444937229
Validation loss = 0.022954849526286125
Validation loss = 0.02328653074800968
Validation loss = 0.02284621261060238
Validation loss = 0.02321714162826538
Validation loss = 0.023832282051444054
Validation loss = 0.023588255047798157
Validation loss = 0.024024659767746925
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 832
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 820
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 812
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 817
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 803
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 815
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.33e+03 |
| Iteration     | 11       |
| MaximumReturn | 2.39e+03 |
| MinimumReturn | 2.16e+03 |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.024638976901769638
Validation loss = 0.022563381120562553
Validation loss = 0.021666329354047775
Validation loss = 0.02121928334236145
Validation loss = 0.02264821156859398
Validation loss = 0.021472914144396782
Validation loss = 0.021004293113946915
Validation loss = 0.024741845205426216
Validation loss = 0.020928435027599335
Validation loss = 0.02103368751704693
Validation loss = 0.02177734114229679
Validation loss = 0.02164497971534729
Validation loss = 0.020802201703190804
Validation loss = 0.021405745297670364
Validation loss = 0.024865008890628815
Validation loss = 0.021029016003012657
Validation loss = 0.020053502172231674
Validation loss = 0.021927811205387115
Validation loss = 0.020257102325558662
Validation loss = 0.021140385419130325
Validation loss = 0.020439157262444496
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0239473395049572
Validation loss = 0.021921195089817047
Validation loss = 0.02206418104469776
Validation loss = 0.02220739983022213
Validation loss = 0.02177717536687851
Validation loss = 0.02224479429423809
Validation loss = 0.021470239385962486
Validation loss = 0.021969087421894073
Validation loss = 0.022120565176010132
Validation loss = 0.021122736856341362
Validation loss = 0.022014647722244263
Validation loss = 0.021058380603790283
Validation loss = 0.021464211866259575
Validation loss = 0.021591342985630035
Validation loss = 0.022248463705182076
Validation loss = 0.02178923413157463
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.023314159363508224
Validation loss = 0.02037391997873783
Validation loss = 0.02214968577027321
Validation loss = 0.023145310580730438
Validation loss = 0.02095579355955124
Validation loss = 0.020532114431262016
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0264622550457716
Validation loss = 0.02212904952466488
Validation loss = 0.02319491282105446
Validation loss = 0.02325870282948017
Validation loss = 0.022801995277404785
Validation loss = 0.022911658510565758
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02507546730339527
Validation loss = 0.021556202322244644
Validation loss = 0.021575897932052612
Validation loss = 0.024382181465625763
Validation loss = 0.020937921479344368
Validation loss = 0.021327298134565353
Validation loss = 0.021430104970932007
Validation loss = 0.021787386387586594
Validation loss = 0.021469708532094955
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 825
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 851
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 852
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 855
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 843
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 808
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.17e+03 |
| Iteration     | 12       |
| MaximumReturn | 2.49e+03 |
| MinimumReturn | 1.26e+03 |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.020517436787486076
Validation loss = 0.020146310329437256
Validation loss = 0.020978284999728203
Validation loss = 0.020222922787070274
Validation loss = 0.019446874037384987
Validation loss = 0.021489357575774193
Validation loss = 0.01937752030789852
Validation loss = 0.01953880488872528
Validation loss = 0.01917753554880619
Validation loss = 0.01974886655807495
Validation loss = 0.0202946700155735
Validation loss = 0.021599775180220604
Validation loss = 0.018568670377135277
Validation loss = 0.019027935341000557
Validation loss = 0.019506778568029404
Validation loss = 0.019560039043426514
Validation loss = 0.01834149844944477
Validation loss = 0.019355354830622673
Validation loss = 0.01995019242167473
Validation loss = 0.02052830345928669
Validation loss = 0.018558500334620476
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.021886708214879036
Validation loss = 0.020136242732405663
Validation loss = 0.01961524412035942
Validation loss = 0.019710425287485123
Validation loss = 0.02084425278007984
Validation loss = 0.019299643114209175
Validation loss = 0.02002139389514923
Validation loss = 0.02020598016679287
Validation loss = 0.019344905391335487
Validation loss = 0.019339507445693016
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.021125687286257744
Validation loss = 0.02103005349636078
Validation loss = 0.020287076011300087
Validation loss = 0.020590994507074356
Validation loss = 0.019791321828961372
Validation loss = 0.019722076132893562
Validation loss = 0.019998030737042427
Validation loss = 0.019518490880727768
Validation loss = 0.019198937341570854
Validation loss = 0.019262021407485008
Validation loss = 0.020441096276044846
Validation loss = 0.020847138017416
Validation loss = 0.019701793789863586
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.02352830208837986
Validation loss = 0.021358594298362732
Validation loss = 0.021940719336271286
Validation loss = 0.021247196942567825
Validation loss = 0.021947937086224556
Validation loss = 0.022097725421190262
Validation loss = 0.021667664870619774
Validation loss = 0.020733309909701347
Validation loss = 0.020340414717793465
Validation loss = 0.020979681983590126
Validation loss = 0.02154053933918476
Validation loss = 0.02070353552699089
Validation loss = 0.021420929580926895
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.021585410460829735
Validation loss = 0.019667966291308403
Validation loss = 0.020330751314759254
Validation loss = 0.019889667630195618
Validation loss = 0.022699208930134773
Validation loss = 0.021061109378933907
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 843
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 849
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 863
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 859
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 782
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 850
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.79e+03 |
| Iteration     | 13       |
| MaximumReturn | 2.41e+03 |
| MinimumReturn | -439     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.02006314881145954
Validation loss = 0.019235264509916306
Validation loss = 0.018106065690517426
Validation loss = 0.01805807463824749
Validation loss = 0.017706427723169327
Validation loss = 0.02039160206913948
Validation loss = 0.017885973677039146
Validation loss = 0.01831299439072609
Validation loss = 0.018794791772961617
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02107935957610607
Validation loss = 0.01897212117910385
Validation loss = 0.020459860563278198
Validation loss = 0.01838601380586624
Validation loss = 0.020171701908111572
Validation loss = 0.020054291933774948
Validation loss = 0.020497839897871017
Validation loss = 0.01823093742132187
Validation loss = 0.019446732476353645
Validation loss = 0.018947813659906387
Validation loss = 0.019012557342648506
Validation loss = 0.019232720136642456
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.021047057583928108
Validation loss = 0.01908636838197708
Validation loss = 0.0192942526191473
Validation loss = 0.019216980785131454
Validation loss = 0.019841965287923813
Validation loss = 0.01901174895465374
Validation loss = 0.019817572087049484
Validation loss = 0.02061418630182743
Validation loss = 0.018289053812623024
Validation loss = 0.018978625535964966
Validation loss = 0.0193906519562006
Validation loss = 0.01915731467306614
Validation loss = 0.01965351589024067
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.0219484381377697
Validation loss = 0.020471805706620216
Validation loss = 0.020097365602850914
Validation loss = 0.02036711387336254
Validation loss = 0.020574016496539116
Validation loss = 0.020217329263687134
Validation loss = 0.019938750192523003
Validation loss = 0.019327234476804733
Validation loss = 0.019998567178845406
Validation loss = 0.02038383297622204
Validation loss = 0.019945012405514717
Validation loss = 0.019584227353334427
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.021732844412326813
Validation loss = 0.019493194296956062
Validation loss = 0.020937632769346237
Validation loss = 0.01975252293050289
Validation loss = 0.020203936845064163
Validation loss = 0.019706759601831436
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 863
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 877
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 862
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 780
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 849
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 865
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.15e+03 |
| Iteration     | 14       |
| MaximumReturn | 2.61e+03 |
| MinimumReturn | 328      |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.019274553284049034
Validation loss = 0.01751231774687767
Validation loss = 0.017378535121679306
Validation loss = 0.01770501211285591
Validation loss = 0.017965441569685936
Validation loss = 0.017558462917804718
Validation loss = 0.016562949866056442
Validation loss = 0.016478760167956352
Validation loss = 0.017227698117494583
Validation loss = 0.01767096295952797
Validation loss = 0.016624566167593002
Validation loss = 0.017667140811681747
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.020799046382308006
Validation loss = 0.018093232065439224
Validation loss = 0.01683492213487625
Validation loss = 0.018933037295937538
Validation loss = 0.016990050673484802
Validation loss = 0.016953209415078163
Validation loss = 0.017023373395204544
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.02086659148335457
Validation loss = 0.01758553460240364
Validation loss = 0.016862697899341583
Validation loss = 0.018084293231368065
Validation loss = 0.01772339642047882
Validation loss = 0.018028970807790756
Validation loss = 0.017619416117668152
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.022206710651516914
Validation loss = 0.018191225826740265
Validation loss = 0.01825445145368576
Validation loss = 0.018535278737545013
Validation loss = 0.018208252266049385
Validation loss = 0.017580846324563026
Validation loss = 0.019159667193889618
Validation loss = 0.01793714240193367
Validation loss = 0.017979610711336136
Validation loss = 0.019390536472201347
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.02065381407737732
Validation loss = 0.018783263862133026
Validation loss = 0.018617786467075348
Validation loss = 0.018581299111247063
Validation loss = 0.0184443611651659
Validation loss = 0.018543902784585953
Validation loss = 0.01806022599339485
Validation loss = 0.018323350697755814
Validation loss = 0.018080033361911774
Validation loss = 0.019271627068519592
Validation loss = 0.0171589907258749
Validation loss = 0.017959339544177055
Validation loss = 0.01761842519044876
Validation loss = 0.018734581768512726
Validation loss = 0.01742416061460972
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 907
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 888
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 878
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 871
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 891
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 882
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.77e+03 |
| Iteration     | 15       |
| MaximumReturn | 3.04e+03 |
| MinimumReturn | 2.65e+03 |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.018532980233430862
Validation loss = 0.015782415866851807
Validation loss = 0.015882983803749084
Validation loss = 0.017140991985797882
Validation loss = 0.01592644304037094
Validation loss = 0.016547568142414093
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.02001277729868889
Validation loss = 0.01591198891401291
Validation loss = 0.01616501808166504
Validation loss = 0.016796188428997993
Validation loss = 0.016688404604792595
Validation loss = 0.015948301181197166
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.018426893278956413
Validation loss = 0.016286516562104225
Validation loss = 0.01647617481648922
Validation loss = 0.016807783395051956
Validation loss = 0.01574501395225525
Validation loss = 0.016113193705677986
Validation loss = 0.01602933183312416
Validation loss = 0.016080008819699287
Validation loss = 0.01715168170630932
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.02073546312749386
Validation loss = 0.01690046861767769
Validation loss = 0.01686478778719902
Validation loss = 0.017442364245653152
Validation loss = 0.01667853817343712
Validation loss = 0.017428630962967873
Validation loss = 0.017635179683566093
Validation loss = 0.016690008342266083
Validation loss = 0.017451586201786995
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.018283143639564514
Validation loss = 0.01683463715016842
Validation loss = 0.017002994194626808
Validation loss = 0.01723308116197586
Validation loss = 0.01698453724384308
Validation loss = 0.01614413782954216
Validation loss = 0.01680912636220455
Validation loss = 0.017293842509388924
Validation loss = 0.015989499166607857
Validation loss = 0.015993094071745872
Validation loss = 0.01695231720805168
Validation loss = 0.01559305191040039
Validation loss = 0.01591901294887066
Validation loss = 0.01742631569504738
Validation loss = 0.01565498299896717
Validation loss = 0.015515120700001717
Validation loss = 0.015694884583353996
Validation loss = 0.016865059733390808
Validation loss = 0.016809901222586632
Validation loss = 0.015266411006450653
Validation loss = 0.01620863564312458
Validation loss = 0.016454562544822693
Validation loss = 0.014990913681685925
Validation loss = 0.015480076894164085
Validation loss = 0.016725167632102966
Validation loss = 0.015216669999063015
Validation loss = 0.015532225370407104
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 895
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 879
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 850
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 896
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 882
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 868
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.49e+03 |
| Iteration     | 16       |
| MaximumReturn | 2.96e+03 |
| MinimumReturn | 886      |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.01692310906946659
Validation loss = 0.015681752935051918
Validation loss = 0.015070052817463875
Validation loss = 0.015822267159819603
Validation loss = 0.014287350699305534
Validation loss = 0.015031390823423862
Validation loss = 0.014529243111610413
Validation loss = 0.016090819612145424
Validation loss = 0.014463188126683235
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.01835535652935505
Validation loss = 0.01519740093499422
Validation loss = 0.015187312848865986
Validation loss = 0.015450729057192802
Validation loss = 0.015457924455404282
Validation loss = 0.015462362207472324
Validation loss = 0.01492390688508749
Validation loss = 0.014731090515851974
Validation loss = 0.015593351796269417
Validation loss = 0.015451689250767231
Validation loss = 0.014952310360968113
Validation loss = 0.014307890087366104
Validation loss = 0.01612490601837635
Validation loss = 0.014777155593037605
Validation loss = 0.014893357641994953
Validation loss = 0.014533613808453083
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.017922524362802505
Validation loss = 0.015332751907408237
Validation loss = 0.015526842325925827
Validation loss = 0.015830788761377335
Validation loss = 0.015557203441858292
Validation loss = 0.016890184953808784
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.017376592382788658
Validation loss = 0.015597795136272907
Validation loss = 0.015625037252902985
Validation loss = 0.015891442075371742
Validation loss = 0.015078069642186165
Validation loss = 0.015007459558546543
Validation loss = 0.015305668115615845
Validation loss = 0.016955802217125893
Validation loss = 0.01463104598224163
Validation loss = 0.014986498281359673
Validation loss = 0.015059258788824081
Validation loss = 0.014750037342309952
Validation loss = 0.015364047139883041
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.018103092908859253
Validation loss = 0.01439639925956726
Validation loss = 0.01596640795469284
Validation loss = 0.01500660553574562
Validation loss = 0.014828741550445557
Validation loss = 0.014671670272946358
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 879
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 780
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 863
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 858
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 869
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 869
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.52e+03 |
| Iteration     | 17       |
| MaximumReturn | 3.1e+03  |
| MinimumReturn | 306      |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.016678526997566223
Validation loss = 0.014248275198042393
Validation loss = 0.013987078331410885
Validation loss = 0.013928242027759552
Validation loss = 0.013965953141450882
Validation loss = 0.013430600054562092
Validation loss = 0.01441115327179432
Validation loss = 0.013227909803390503
Validation loss = 0.014216901734471321
Validation loss = 0.012999922968447208
Validation loss = 0.013449044898152351
Validation loss = 0.015109967440366745
Validation loss = 0.01266004890203476
Validation loss = 0.01354227215051651
Validation loss = 0.01284666359424591
Validation loss = 0.013647399842739105
Validation loss = 0.014101517386734486
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.017396902665495872
Validation loss = 0.014419766142964363
Validation loss = 0.013828947208821774
Validation loss = 0.013673119246959686
Validation loss = 0.013688608072698116
Validation loss = 0.013519222848117352
Validation loss = 0.013802088797092438
Validation loss = 0.014317341148853302
Validation loss = 0.013624491170048714
Validation loss = 0.013737848028540611
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.016276899725198746
Validation loss = 0.01468205451965332
Validation loss = 0.014666439965367317
Validation loss = 0.014138838276267052
Validation loss = 0.014784413389861584
Validation loss = 0.014683558605611324
Validation loss = 0.014566942118108273
Validation loss = 0.014184296131134033
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.016845105215907097
Validation loss = 0.014558695256710052
Validation loss = 0.014353204518556595
Validation loss = 0.014926960691809654
Validation loss = 0.013590230606496334
Validation loss = 0.01544180978089571
Validation loss = 0.013426062650978565
Validation loss = 0.014582017436623573
Validation loss = 0.013761259615421295
Validation loss = 0.013973256573081017
Validation loss = 0.013388363644480705
Validation loss = 0.014640274457633495
Validation loss = 0.013362829573452473
Validation loss = 0.014032013714313507
Validation loss = 0.01368545088917017
Validation loss = 0.013055148534476757
Validation loss = 0.013545094057917595
Validation loss = 0.01293083094060421
Validation loss = 0.013433658517897129
Validation loss = 0.013561452738940716
Validation loss = 0.01492749061435461
Validation loss = 0.013436974957585335
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.017122840508818626
Validation loss = 0.014094370417296886
Validation loss = 0.013953828252851963
Validation loss = 0.015160609036684036
Validation loss = 0.013236665166914463
Validation loss = 0.014123949222266674
Validation loss = 0.014655339531600475
Validation loss = 0.013883466832339764
Validation loss = 0.014067466370761395
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 886
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 861
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 887
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 868
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 877
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 882
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.97e+03 |
| Iteration     | 18       |
| MaximumReturn | 3.17e+03 |
| MinimumReturn | 2.8e+03  |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.01423430722206831
Validation loss = 0.01269577443599701
Validation loss = 0.012764583341777325
Validation loss = 0.01267504133284092
Validation loss = 0.012451802380383015
Validation loss = 0.014199984259903431
Validation loss = 0.012519730255007744
Validation loss = 0.012691396288573742
Validation loss = 0.013011852279305458
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.015617890283465385
Validation loss = 0.012745541520416737
Validation loss = 0.013571104034781456
Validation loss = 0.013080453500151634
Validation loss = 0.013018563389778137
Validation loss = 0.01297829020768404
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.014917248860001564
Validation loss = 0.01427055336534977
Validation loss = 0.013848060742020607
Validation loss = 0.013870373368263245
Validation loss = 0.013795977458357811
Validation loss = 0.01428159512579441
Validation loss = 0.013481062836945057
Validation loss = 0.013157298788428307
Validation loss = 0.01398137491196394
Validation loss = 0.014005644246935844
Validation loss = 0.013699996285140514
Validation loss = 0.012886728160083294
Validation loss = 0.013074973598122597
Validation loss = 0.013000911101698875
Validation loss = 0.013086268678307533
Validation loss = 0.014118297025561333
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.013841912150382996
Validation loss = 0.01295943558216095
Validation loss = 0.013037684373557568
Validation loss = 0.012509385123848915
Validation loss = 0.012431157752871513
Validation loss = 0.012508709914982319
Validation loss = 0.016224214807152748
Validation loss = 0.012592938728630543
Validation loss = 0.012705950066447258
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.014252597466111183
Validation loss = 0.013137688860297203
Validation loss = 0.013340036384761333
Validation loss = 0.014379885978996754
Validation loss = 0.013135259039700031
Validation loss = 0.012667262926697731
Validation loss = 0.013570907525718212
Validation loss = 0.012633895501494408
Validation loss = 0.014297249726951122
Validation loss = 0.01315833069384098
Validation loss = 0.012941534630954266
Validation loss = 0.012968423776328564
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 897
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 889
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 881
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 893
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 881
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 899
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.05e+03 |
| Iteration     | 19       |
| MaximumReturn | 3.11e+03 |
| MinimumReturn | 2.95e+03 |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.012255857698619366
Validation loss = 0.012097897939383984
Validation loss = 0.013392546214163303
Validation loss = 0.011674933135509491
Validation loss = 0.011932984925806522
Validation loss = 0.012584826909005642
Validation loss = 0.012166731059551239
Validation loss = 0.012757913209497929
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.014381898567080498
Validation loss = 0.012389288283884525
Validation loss = 0.01267473865300417
Validation loss = 0.015402977354824543
Validation loss = 0.0122701246291399
Validation loss = 0.011831498704850674
Validation loss = 0.013063599355518818
Validation loss = 0.012201083824038506
Validation loss = 0.012394564226269722
Validation loss = 0.011765601113438606
Validation loss = 0.012128032743930817
Validation loss = 0.013976986519992352
Validation loss = 0.012325672432780266
Validation loss = 0.01223622728139162
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.013829518109560013
Validation loss = 0.013693961314857006
Validation loss = 0.012358571402728558
Validation loss = 0.012904803268611431
Validation loss = 0.012823332101106644
Validation loss = 0.01228927168995142
Validation loss = 0.013258939608931541
Validation loss = 0.011904297396540642
Validation loss = 0.012524024583399296
Validation loss = 0.012342404574155807
Validation loss = 0.012564759701490402
Validation loss = 0.011908273212611675
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.012989716604351997
Validation loss = 0.01184809673577547
Validation loss = 0.011984131298959255
Validation loss = 0.012123174965381622
Validation loss = 0.012278762646019459
Validation loss = 0.012254048138856888
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.012750793248414993
Validation loss = 0.012200068682432175
Validation loss = 0.012696092016994953
Validation loss = 0.012025956995785236
Validation loss = 0.012850926257669926
Validation loss = 0.012790040113031864
Validation loss = 0.01225454080849886
Validation loss = 0.013028784655034542
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 885
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 888
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 902
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 901
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 902
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 904
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.06e+03 |
| Iteration     | 20       |
| MaximumReturn | 3.22e+03 |
| MinimumReturn | 2.91e+03 |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.013121203519403934
Validation loss = 0.011202216148376465
Validation loss = 0.011753479018807411
Validation loss = 0.011517363600432873
Validation loss = 0.011849675327539444
Validation loss = 0.0124824782833457
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.01215309463441372
Validation loss = 0.01138360146433115
Validation loss = 0.012283426709473133
Validation loss = 0.011433329433202744
Validation loss = 0.012146142311394215
Validation loss = 0.011954261921346188
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.012143919244408607
Validation loss = 0.011911199428141117
Validation loss = 0.01184022706001997
Validation loss = 0.012717544101178646
Validation loss = 0.011708912439644337
Validation loss = 0.012263055890798569
Validation loss = 0.011877749115228653
Validation loss = 0.011706254445016384
Validation loss = 0.012753316201269627
Validation loss = 0.011559977196156979
Validation loss = 0.012139850296080112
Validation loss = 0.011637518182396889
Validation loss = 0.011340578086674213
Validation loss = 0.01245557889342308
Validation loss = 0.011677224189043045
Validation loss = 0.013731297105550766
Validation loss = 0.011428117752075195
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.012892473489046097
Validation loss = 0.011707469820976257
Validation loss = 0.012447144836187363
Validation loss = 0.011088901199400425
Validation loss = 0.011665335856378078
Validation loss = 0.01144399680197239
Validation loss = 0.011524784378707409
Validation loss = 0.012529459781944752
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.01244147215038538
Validation loss = 0.011487743817269802
Validation loss = 0.011904415674507618
Validation loss = 0.01166571769863367
Validation loss = 0.011792552657425404
Validation loss = 0.01186142023652792
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 906
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 915
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 905
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 910
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 919
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 905
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.17e+03 |
| Iteration     | 21       |
| MaximumReturn | 3.34e+03 |
| MinimumReturn | 3.05e+03 |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.012081080116331577
Validation loss = 0.011023064143955708
Validation loss = 0.010995867662131786
Validation loss = 0.011294947005808353
Validation loss = 0.010892176069319248
Validation loss = 0.010524831712245941
Validation loss = 0.011031175032258034
Validation loss = 0.010821916162967682
Validation loss = 0.01081260945647955
Validation loss = 0.010717451572418213
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.012430972419679165
Validation loss = 0.011172606609761715
Validation loss = 0.011593010276556015
Validation loss = 0.010927902534604073
Validation loss = 0.01105483341962099
Validation loss = 0.011253619566559792
Validation loss = 0.01117987371981144
Validation loss = 0.011614917777478695
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.011705853044986725
Validation loss = 0.012102819979190826
Validation loss = 0.010951051488518715
Validation loss = 0.012479997240006924
Validation loss = 0.010917351581156254
Validation loss = 0.011157670989632607
Validation loss = 0.011127030476927757
Validation loss = 0.010649647563695908
Validation loss = 0.01228759903460741
Validation loss = 0.010471620596945286
Validation loss = 0.011156998574733734
Validation loss = 0.010674221441149712
Validation loss = 0.010504436679184437
Validation loss = 0.011454465799033642
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.011387842707335949
Validation loss = 0.012503073550760746
Validation loss = 0.011546067893505096
Validation loss = 0.010532344691455364
Validation loss = 0.011333758011460304
Validation loss = 0.010857172310352325
Validation loss = 0.01070569921284914
Validation loss = 0.010777699761092663
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.012366040609776974
Validation loss = 0.011799938976764679
Validation loss = 0.01154398825019598
Validation loss = 0.012316657230257988
Validation loss = 0.011513044126331806
Validation loss = 0.011640896089375019
Validation loss = 0.011183680035173893
Validation loss = 0.010831842198967934
Validation loss = 0.011230933479964733
Validation loss = 0.010808626189827919
Validation loss = 0.011615147814154625
Validation loss = 0.011048207990825176
Validation loss = 0.011241084896028042
Validation loss = 0.010704804211854935
Validation loss = 0.010486389510333538
Validation loss = 0.011004064232110977
Validation loss = 0.010696145705878735
Validation loss = 0.010579601861536503
Validation loss = 0.011271133087575436
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 906
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 925
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 911
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 911
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 926
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 912
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.09e+03 |
| Iteration     | 22       |
| MaximumReturn | 3.3e+03  |
| MinimumReturn | 2.9e+03  |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.011392359621822834
Validation loss = 0.011175691150128841
Validation loss = 0.01032642088830471
Validation loss = 0.01016586646437645
Validation loss = 0.010175413452088833
Validation loss = 0.011237941682338715
Validation loss = 0.010180245153605938
Validation loss = 0.011210146360099316
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.011072211898863316
Validation loss = 0.010523554868996143
Validation loss = 0.011780674569308758
Validation loss = 0.010732692666351795
Validation loss = 0.01138538122177124
Validation loss = 0.011080600321292877
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.010974426753818989
Validation loss = 0.010393357835710049
Validation loss = 0.010715335607528687
Validation loss = 0.010523997247219086
Validation loss = 0.011606636457145214
Validation loss = 0.010418839752674103
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.010341468267142773
Validation loss = 0.010747882537543774
Validation loss = 0.009891333989799023
Validation loss = 0.010458995588123798
Validation loss = 0.010566561482846737
Validation loss = 0.010080044157803059
Validation loss = 0.011077627539634705
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.010957091115415096
Validation loss = 0.010759285651147366
Validation loss = 0.010072917677462101
Validation loss = 0.010363210923969746
Validation loss = 0.010420017875730991
Validation loss = 0.010384923778474331
Validation loss = 0.010866604745388031
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 912
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 903
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 920
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 926
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 908
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 897
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.2e+03  |
| Iteration     | 23       |
| MaximumReturn | 3.43e+03 |
| MinimumReturn | 2.95e+03 |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.011503261514008045
Validation loss = 0.009817464277148247
Validation loss = 0.010810421779751778
Validation loss = 0.00968239177018404
Validation loss = 0.011478438042104244
Validation loss = 0.009964842349290848
Validation loss = 0.009893679991364479
Validation loss = 0.010396361351013184
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.010654804296791553
Validation loss = 0.012553033418953419
Validation loss = 0.010133618488907814
Validation loss = 0.010569094680249691
Validation loss = 0.010359866544604301
Validation loss = 0.010131794027984142
Validation loss = 0.009853657335042953
Validation loss = 0.010793131776154041
Validation loss = 0.009813590906560421
Validation loss = 0.010508792474865913
Validation loss = 0.010124445892870426
Validation loss = 0.00984134990721941
Validation loss = 0.009588944725692272
Validation loss = 0.01000006590038538
Validation loss = 0.010096666403114796
Validation loss = 0.009363462217152119
Validation loss = 0.009795451536774635
Validation loss = 0.009789489209651947
Validation loss = 0.010376266203820705
Validation loss = 0.00983551423996687
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.01087311189621687
Validation loss = 0.0104881776496768
Validation loss = 0.00978904590010643
Validation loss = 0.009901609271764755
Validation loss = 0.010830181650817394
Validation loss = 0.010374851524829865
Validation loss = 0.01032811775803566
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.010990994051098824
Validation loss = 0.010212186723947525
Validation loss = 0.009919474832713604
Validation loss = 0.00984257273375988
Validation loss = 0.010738926939666271
Validation loss = 0.009761195629835129
Validation loss = 0.010081790387630463
Validation loss = 0.010008876211941242
Validation loss = 0.009578445926308632
Validation loss = 0.010593457147479057
Validation loss = 0.00935746543109417
Validation loss = 0.009947705082595348
Validation loss = 0.009711015038192272
Validation loss = 0.009913447313010693
Validation loss = 0.009255955927073956
Validation loss = 0.010136493481695652
Validation loss = 0.009373463690280914
Validation loss = 0.00990808755159378
Validation loss = 0.009302926249802113
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.010575694032013416
Validation loss = 0.009853377006947994
Validation loss = 0.010051768273115158
Validation loss = 0.009799279272556305
Validation loss = 0.010274000465869904
Validation loss = 0.010545123368501663
Validation loss = 0.01092225406318903
Validation loss = 0.01010002288967371
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 928
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 909
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 911
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 913
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 897
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 898
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.67e+03 |
| Iteration     | 24       |
| MaximumReturn | 3.39e+03 |
| MinimumReturn | 338      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.01146366074681282
Validation loss = 0.009530050680041313
Validation loss = 0.009474468417465687
Validation loss = 0.009257730096578598
Validation loss = 0.009458201006054878
Validation loss = 0.010377420112490654
Validation loss = 0.009401275776326656
Validation loss = 0.009441222064197063
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.009498379193246365
Validation loss = 0.009418229572474957
Validation loss = 0.009399805217981339
Validation loss = 0.009425096213817596
Validation loss = 0.008950123563408852
Validation loss = 0.009232472628355026
Validation loss = 0.0098593570291996
Validation loss = 0.00957120768725872
Validation loss = 0.009515825659036636
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.009886294603347778
Validation loss = 0.010272839106619358
Validation loss = 0.009994002059102058
Validation loss = 0.009381826967000961
Validation loss = 0.009827748872339725
Validation loss = 0.009903543628752232
Validation loss = 0.010157492011785507
Validation loss = 0.009360709227621555
Validation loss = 0.00974206067621708
Validation loss = 0.009221870452165604
Validation loss = 0.00983388815075159
Validation loss = 0.009367873892188072
Validation loss = 0.010169897228479385
Validation loss = 0.009202438406646252
Validation loss = 0.009945450350642204
Validation loss = 0.00897985976189375
Validation loss = 0.009677932597696781
Validation loss = 0.009960650466382504
Validation loss = 0.009091791696846485
Validation loss = 0.009937765076756477
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.00951451063156128
Validation loss = 0.009880200028419495
Validation loss = 0.009173481725156307
Validation loss = 0.008979545906186104
Validation loss = 0.00925491563975811
Validation loss = 0.008990895003080368
Validation loss = 0.009414761327207088
Validation loss = 0.009994177147746086
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.010474713519215584
Validation loss = 0.010098005644977093
Validation loss = 0.00992067065089941
Validation loss = 0.009380854666233063
Validation loss = 0.00987053569406271
Validation loss = 0.009883973747491837
Validation loss = 0.009279035963118076
Validation loss = 0.009565310552716255
Validation loss = 0.009685845114290714
Validation loss = 0.00977854523807764
Validation loss = 0.00936159212142229
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 931
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 944
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 929
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 928
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 934
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 942
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.3e+03  |
| Iteration     | 25       |
| MaximumReturn | 3.48e+03 |
| MinimumReturn | 3.09e+03 |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.009411988779902458
Validation loss = 0.009472966194152832
Validation loss = 0.008875969797372818
Validation loss = 0.008791886270046234
Validation loss = 0.009233604185283184
Validation loss = 0.009062868542969227
Validation loss = 0.009617379866540432
Validation loss = 0.00922552589327097
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.009393193759024143
Validation loss = 0.008583777584135532
Validation loss = 0.009264612570405006
Validation loss = 0.00909743458032608
Validation loss = 0.008957115933299065
Validation loss = 0.00899500958621502
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.00912034697830677
Validation loss = 0.009266015142202377
Validation loss = 0.009257846511900425
Validation loss = 0.008956224657595158
Validation loss = 0.008722009137272835
Validation loss = 0.008851774036884308
Validation loss = 0.009156312793493271
Validation loss = 0.008641515858471394
Validation loss = 0.009128285571932793
Validation loss = 0.009698215872049332
Validation loss = 0.008991088718175888
Validation loss = 0.008331472054123878
Validation loss = 0.009396740235388279
Validation loss = 0.009835900738835335
Validation loss = 0.008760812692344189
Validation loss = 0.00885190162807703
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.009241761639714241
Validation loss = 0.00896675605326891
Validation loss = 0.00936287734657526
Validation loss = 0.009023445658385754
Validation loss = 0.009283025749027729
Validation loss = 0.008956541307270527
Validation loss = 0.008975856937468052
Validation loss = 0.008833526633679867
Validation loss = 0.009514457546174526
Validation loss = 0.008889170363545418
Validation loss = 0.008896918036043644
Validation loss = 0.009149081073701382
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.010008345358073711
Validation loss = 0.009073485620319843
Validation loss = 0.009266122244298458
Validation loss = 0.009615070186555386
Validation loss = 0.009588158689439297
Validation loss = 0.009235085919499397
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 909
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 908
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 896
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 905
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 903
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.17e+03 |
| Iteration     | 26       |
| MaximumReturn | 3.28e+03 |
| MinimumReturn | 2.98e+03 |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.00990524236112833
Validation loss = 0.009704356081783772
Validation loss = 0.008708148263394833
Validation loss = 0.008659038692712784
Validation loss = 0.008792455308139324
Validation loss = 0.00846883375197649
Validation loss = 0.009017324075102806
Validation loss = 0.008677246049046516
Validation loss = 0.008661377243697643
Validation loss = 0.009100171737372875
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.00946496706455946
Validation loss = 0.00852609146386385
Validation loss = 0.008821248076856136
Validation loss = 0.008858042769134045
Validation loss = 0.009993757121264935
Validation loss = 0.008969263173639774
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.00927005521953106
Validation loss = 0.008320259861648083
Validation loss = 0.009009514935314655
Validation loss = 0.008348783478140831
Validation loss = 0.00874461978673935
Validation loss = 0.008502790704369545
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.009260600432753563
Validation loss = 0.008311871439218521
Validation loss = 0.008448117412626743
Validation loss = 0.009106665849685669
Validation loss = 0.008631008677184582
Validation loss = 0.00830391887575388
Validation loss = 0.009077446535229683
Validation loss = 0.008221181109547615
Validation loss = 0.009419465437531471
Validation loss = 0.00796398427337408
Validation loss = 0.008290688507258892
Validation loss = 0.008280343376100063
Validation loss = 0.007806315086781979
Validation loss = 0.008308134973049164
Validation loss = 0.008506888523697853
Validation loss = 0.00842654425650835
Validation loss = 0.008764640428125858
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.01100193616002798
Validation loss = 0.008628356270492077
Validation loss = 0.0088249696418643
Validation loss = 0.009778165258467197
Validation loss = 0.009507528506219387
Validation loss = 0.009058596566319466
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 912
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 905
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 913
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 915
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 901
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 914
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.09e+03 |
| Iteration     | 27       |
| MaximumReturn | 3.41e+03 |
| MinimumReturn | 2.55e+03 |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.008329588919878006
Validation loss = 0.008788947016000748
Validation loss = 0.008625793270766735
Validation loss = 0.00833991076797247
Validation loss = 0.008637265302240849
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.008708150126039982
Validation loss = 0.0093344422057271
Validation loss = 0.008695575408637524
Validation loss = 0.00916170421987772
Validation loss = 0.00842464342713356
Validation loss = 0.008622572757303715
Validation loss = 0.008353635668754578
Validation loss = 0.008421733975410461
Validation loss = 0.008780213072896004
Validation loss = 0.008314235135912895
Validation loss = 0.008040501736104488
Validation loss = 0.00869445875287056
Validation loss = 0.008332488127052784
Validation loss = 0.008993571624159813
Validation loss = 0.007938195951282978
Validation loss = 0.008012509904801846
Validation loss = 0.008441057987511158
Validation loss = 0.008935568854212761
Validation loss = 0.008236514404416084
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.008535114116966724
Validation loss = 0.008629722520709038
Validation loss = 0.008462922647595406
Validation loss = 0.00790643971413374
Validation loss = 0.008724636398255825
Validation loss = 0.008034772239625454
Validation loss = 0.008325275033712387
Validation loss = 0.008269024081528187
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.00839165598154068
Validation loss = 0.008538513444364071
Validation loss = 0.007950598374009132
Validation loss = 0.0079429280012846
Validation loss = 0.008461949415504932
Validation loss = 0.008450275287032127
Validation loss = 0.008401460945606232
Validation loss = 0.007815704680979252
Validation loss = 0.00872235931456089
Validation loss = 0.007842769846320152
Validation loss = 0.008038290776312351
Validation loss = 0.007974556647241116
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.009243440814316273
Validation loss = 0.008475037291646004
Validation loss = 0.009025150910019875
Validation loss = 0.008482713252305984
Validation loss = 0.008996857330203056
Validation loss = 0.00817987509071827
Validation loss = 0.008846957236528397
Validation loss = 0.009030326269567013
Validation loss = 0.008578892797231674
Validation loss = 0.008484910242259502
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 925
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 918
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 914
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 936
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 918
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 939
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.07e+03 |
| Iteration     | 28       |
| MaximumReturn | 3.38e+03 |
| MinimumReturn | 2.42e+03 |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.008852616883814335
Validation loss = 0.00820103008300066
Validation loss = 0.008652741089463234
Validation loss = 0.007942037656903267
Validation loss = 0.00833174865692854
Validation loss = 0.008401050232350826
Validation loss = 0.008086275309324265
Validation loss = 0.008291439153254032
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.008356498554348946
Validation loss = 0.00793425552546978
Validation loss = 0.008069010451436043
Validation loss = 0.008005368523299694
Validation loss = 0.008145848289132118
Validation loss = 0.007854682393372059
Validation loss = 0.007467569783329964
Validation loss = 0.007981957867741585
Validation loss = 0.007687028963118792
Validation loss = 0.008380116894841194
Validation loss = 0.0077077304013073444
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.008377929218113422
Validation loss = 0.008377144113183022
Validation loss = 0.007857630960643291
Validation loss = 0.00798052828758955
Validation loss = 0.00838620774447918
Validation loss = 0.008929207921028137
Validation loss = 0.007795808371156454
Validation loss = 0.007669425569474697
Validation loss = 0.0079457713291049
Validation loss = 0.007563382387161255
Validation loss = 0.007498166989535093
Validation loss = 0.0076286508701741695
Validation loss = 0.0073701161891222
Validation loss = 0.007759032305330038
Validation loss = 0.007480810396373272
Validation loss = 0.007340807002037764
Validation loss = 0.007796532474458218
Validation loss = 0.007780247833579779
Validation loss = 0.007450382690876722
Validation loss = 0.007712536957114935
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.007832162082195282
Validation loss = 0.00884176604449749
Validation loss = 0.007377141155302525
Validation loss = 0.007915529422461987
Validation loss = 0.007567774970084429
Validation loss = 0.007525420747697353
Validation loss = 0.008263475261628628
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.00867417361587286
Validation loss = 0.007954072207212448
Validation loss = 0.00879775919020176
Validation loss = 0.008775885216891766
Validation loss = 0.00818293821066618
Validation loss = 0.008431141264736652
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 926
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 928
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 934
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 944
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 933
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 926
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.25e+03 |
| Iteration     | 29       |
| MaximumReturn | 3.37e+03 |
| MinimumReturn | 3.16e+03 |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.008388274349272251
Validation loss = 0.007605549413710833
Validation loss = 0.008686079643666744
Validation loss = 0.007552129216492176
Validation loss = 0.007828042842447758
Validation loss = 0.008020504377782345
Validation loss = 0.00831594504415989
Validation loss = 0.007556214462965727
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.007829412817955017
Validation loss = 0.007436410523951054
Validation loss = 0.0073091681115329266
Validation loss = 0.007426813710480928
Validation loss = 0.007321085315197706
Validation loss = 0.0073423078283667564
Validation loss = 0.007925519719719887
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.007807486224919558
Validation loss = 0.007130463607609272
Validation loss = 0.007858862169086933
Validation loss = 0.0076036034151911736
Validation loss = 0.0075242905877530575
Validation loss = 0.008293413557112217
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.008470360189676285
Validation loss = 0.007283723913133144
Validation loss = 0.007776209153234959
Validation loss = 0.0073500508442521095
Validation loss = 0.007901827804744244
Validation loss = 0.007494158577173948
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.008118507452309132
Validation loss = 0.008515167981386185
Validation loss = 0.008333100937306881
Validation loss = 0.00765460729598999
Validation loss = 0.008028902113437653
Validation loss = 0.008061327040195465
Validation loss = 0.008202791213989258
Validation loss = 0.007659436669200659
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 911
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 904
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 911
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 916
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 912
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 908
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.33e+03 |
| Iteration     | 30       |
| MaximumReturn | 3.59e+03 |
| MinimumReturn | 3.08e+03 |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.007712127640843391
Validation loss = 0.007187697105109692
Validation loss = 0.007382456213235855
Validation loss = 0.007633890490978956
Validation loss = 0.007411711383610964
Validation loss = 0.007962675765156746
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.007438211236149073
Validation loss = 0.007781478576362133
Validation loss = 0.0075220875442028046
Validation loss = 0.007328491657972336
Validation loss = 0.006957056932151318
Validation loss = 0.0075124697759747505
Validation loss = 0.007414600811898708
Validation loss = 0.007604366634041071
Validation loss = 0.007111517246812582
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.007295182906091213
Validation loss = 0.007048609666526318
Validation loss = 0.007462562993168831
Validation loss = 0.007439497858285904
Validation loss = 0.007475265767425299
Validation loss = 0.007359462324529886
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.007000948302447796
Validation loss = 0.007416077423840761
Validation loss = 0.007484995760023594
Validation loss = 0.007176554296165705
Validation loss = 0.00795815885066986
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.00828605704009533
Validation loss = 0.007614322006702423
Validation loss = 0.007894184440374374
Validation loss = 0.007588121108710766
Validation loss = 0.007657861802726984
Validation loss = 0.007970910519361496
Validation loss = 0.007858984172344208
Validation loss = 0.007779181934893131
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 924
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 941
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 929
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 928
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 935
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 929
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.19e+03 |
| Iteration     | 31       |
| MaximumReturn | 3.45e+03 |
| MinimumReturn | 2.98e+03 |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.007732351776212454
Validation loss = 0.007563011255115271
Validation loss = 0.007597975432872772
Validation loss = 0.007429252378642559
Validation loss = 0.007520911283791065
Validation loss = 0.006922612898051739
Validation loss = 0.007413792423903942
Validation loss = 0.0072233472019433975
Validation loss = 0.007428579498082399
Validation loss = 0.007506282068789005
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.007597113028168678
Validation loss = 0.007254124153405428
Validation loss = 0.007169661112129688
Validation loss = 0.007062443532049656
Validation loss = 0.00726879108697176
Validation loss = 0.0075143976137042046
Validation loss = 0.007126107811927795
Validation loss = 0.007274940609931946
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.007251540664583445
Validation loss = 0.007108181249350309
Validation loss = 0.007075435481965542
Validation loss = 0.0070793102495372295
Validation loss = 0.007045831065624952
Validation loss = 0.007149384822696447
Validation loss = 0.007047633640468121
Validation loss = 0.006777520291507244
Validation loss = 0.0072807855904102325
Validation loss = 0.007076410576701164
Validation loss = 0.007052023429423571
Validation loss = 0.007323072757571936
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.007534354459494352
Validation loss = 0.00716444430872798
Validation loss = 0.007327083498239517
Validation loss = 0.0071531324647367
Validation loss = 0.007209309842437506
Validation loss = 0.007152490317821503
Validation loss = 0.006737226154655218
Validation loss = 0.007531704381108284
Validation loss = 0.007100430782884359
Validation loss = 0.007153648417443037
Validation loss = 0.007031457498669624
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.007578883320093155
Validation loss = 0.007476992905139923
Validation loss = 0.008207564242184162
Validation loss = 0.007252583745867014
Validation loss = 0.0071858009323477745
Validation loss = 0.008205488324165344
Validation loss = 0.007256324868649244
Validation loss = 0.007467271760106087
Validation loss = 0.007632898632436991
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 925
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 925
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 926
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 943
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 931
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 926
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.32e+03 |
| Iteration     | 32       |
| MaximumReturn | 3.68e+03 |
| MinimumReturn | 3.1e+03  |
| TotalSamples  | 136000   |
----------------------------
