Logging to experiments/gym_fswimmer/nov4/SO01w350e1_seed2631
Print configuration .....
{'env_name': 'gym_fswimmer', 'random_seeds': [2312, 1231, 2631, 5543], 'save_variables': False, 'model_save_dir': '/tmp/gym_fswimmer_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.00160813331604
Validation loss = 0.42019689083099365
Validation loss = 0.36239534616470337
Validation loss = 0.3456764817237854
Validation loss = 0.3402765989303589
Validation loss = 0.32927584648132324
Validation loss = 0.33430856466293335
Validation loss = 0.34775853157043457
Validation loss = 0.3487553894519806
Validation loss = 0.35082894563674927
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6409639716148376
Validation loss = 0.3844696283340454
Validation loss = 0.3508591651916504
Validation loss = 0.34233057498931885
Validation loss = 0.34191131591796875
Validation loss = 0.3474312424659729
Validation loss = 0.3411664366722107
Validation loss = 0.35930997133255005
Validation loss = 0.35085344314575195
Validation loss = 0.3644719123840332
Validation loss = 0.35393786430358887
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8440694808959961
Validation loss = 0.4238305687904358
Validation loss = 0.36267149448394775
Validation loss = 0.34353166818618774
Validation loss = 0.33612510561943054
Validation loss = 0.33421701192855835
Validation loss = 0.3365097641944885
Validation loss = 0.3520289659500122
Validation loss = 0.3510746359825134
Validation loss = 0.3784126043319702
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6482499837875366
Validation loss = 0.3988479971885681
Validation loss = 0.3529735803604126
Validation loss = 0.3379756212234497
Validation loss = 0.33544039726257324
Validation loss = 0.34123098850250244
Validation loss = 0.35914891958236694
Validation loss = 0.3470459580421448
Validation loss = 0.37233656644821167
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6351341009140015
Validation loss = 0.3975647985935211
Validation loss = 0.3504589796066284
Validation loss = 0.33784082531929016
Validation loss = 0.33332228660583496
Validation loss = 0.33550480008125305
Validation loss = 0.34492284059524536
Validation loss = 0.34509870409965515
Validation loss = 0.36007028818130493
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 20
average number of affinization = 2.857142857142857
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 25
average number of affinization = 5.625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 31
average number of affinization = 8.444444444444445
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 27
average number of affinization = 10.3
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 36
average number of affinization = 12.636363636363637
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 28
average number of affinization = 13.916666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.47    |
| Iteration     | 0        |
| MaximumReturn | 2.86     |
| MinimumReturn | -24.5    |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2613229751586914
Validation loss = 0.2379036545753479
Validation loss = 0.23332127928733826
Validation loss = 0.23114144802093506
Validation loss = 0.23052366077899933
Validation loss = 0.23408687114715576
Validation loss = 0.23106642067432404
Validation loss = 0.2350797951221466
Validation loss = 0.23478317260742188
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.27143335342407227
Validation loss = 0.2352788895368576
Validation loss = 0.23249901831150055
Validation loss = 0.23443958163261414
Validation loss = 0.23351749777793884
Validation loss = 0.2383396029472351
Validation loss = 0.2333945333957672
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.26991918683052063
Validation loss = 0.23754723370075226
Validation loss = 0.23724672198295593
Validation loss = 0.2323855608701706
Validation loss = 0.24001476168632507
Validation loss = 0.2347646951675415
Validation loss = 0.23129189014434814
Validation loss = 0.2347489446401596
Validation loss = 0.2388649582862854
Validation loss = 0.24036318063735962
Validation loss = 0.24449044466018677
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.26273420453071594
Validation loss = 0.23438218235969543
Validation loss = 0.23259210586547852
Validation loss = 0.23427873849868774
Validation loss = 0.23342539370059967
Validation loss = 0.23986950516700745
Validation loss = 0.23412929475307465
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.26922422647476196
Validation loss = 0.23582299053668976
Validation loss = 0.23485685884952545
Validation loss = 0.23380298912525177
Validation loss = 0.2315751612186432
Validation loss = 0.2338251769542694
Validation loss = 0.23447221517562866
Validation loss = 0.2332548201084137
Validation loss = 0.23287656903266907
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 76
average number of affinization = 18.692307692307693
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 30
average number of affinization = 19.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 91
average number of affinization = 24.266666666666666
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 65
average number of affinization = 26.8125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 36
average number of affinization = 27.352941176470587
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 60
average number of affinization = 29.166666666666668
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 5.67     |
| Iteration     | 1        |
| MaximumReturn | 10.2     |
| MinimumReturn | -0.249   |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.23512153327465057
Validation loss = 0.23255544900894165
Validation loss = 0.23586182296276093
Validation loss = 0.23895883560180664
Validation loss = 0.23606491088867188
Validation loss = 0.2368062138557434
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.23013795912265778
Validation loss = 0.22977332770824432
Validation loss = 0.23199526965618134
Validation loss = 0.23048673570156097
Validation loss = 0.23066675662994385
Validation loss = 0.23322047293186188
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.23622532188892365
Validation loss = 0.23556025326251984
Validation loss = 0.23373879492282867
Validation loss = 0.24260181188583374
Validation loss = 0.2375956028699875
Validation loss = 0.23872585594654083
Validation loss = 0.24262821674346924
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.23216809332370758
Validation loss = 0.2301512360572815
Validation loss = 0.2246137112379074
Validation loss = 0.23323875665664673
Validation loss = 0.23416072130203247
Validation loss = 0.23179614543914795
Validation loss = 0.23744864761829376
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2322000414133072
Validation loss = 0.22770579159259796
Validation loss = 0.23186756670475006
Validation loss = 0.23059965670108795
Validation loss = 0.2331794649362564
Validation loss = 0.23273484408855438
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 100
average number of affinization = 32.89473684210526
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 93
average number of affinization = 35.9
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 139
average number of affinization = 40.80952380952381
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 120
average number of affinization = 44.40909090909091
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 29
average number of affinization = 43.73913043478261
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 21
average number of affinization = 42.791666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.31    |
| Iteration     | 2        |
| MaximumReturn | 6.81     |
| MinimumReturn | -15.9    |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2507418394088745
Validation loss = 0.24884033203125
Validation loss = 0.24705801904201508
Validation loss = 0.24579346179962158
Validation loss = 0.25250574946403503
Validation loss = 0.25008392333984375
Validation loss = 0.25937190651893616
Validation loss = 0.26687461137771606
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.24124254286289215
Validation loss = 0.24333880841732025
Validation loss = 0.24281129240989685
Validation loss = 0.2473340630531311
Validation loss = 0.24781566858291626
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.24962690472602844
Validation loss = 0.25200557708740234
Validation loss = 0.2525525689125061
Validation loss = 0.25751107931137085
Validation loss = 0.2576182782649994
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24821126461029053
Validation loss = 0.2464403510093689
Validation loss = 0.24464009702205658
Validation loss = 0.2516821622848511
Validation loss = 0.25545310974121094
Validation loss = 0.2538939416408539
Validation loss = 0.26054954528808594
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2420126497745514
Validation loss = 0.24221134185791016
Validation loss = 0.24523329734802246
Validation loss = 0.24951770901679993
Validation loss = 0.2466009110212326
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 116
average number of affinization = 45.72
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 318
average number of affinization = 56.19230769230769
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 420
average number of affinization = 69.66666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 203
average number of affinization = 74.42857142857143
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 305
average number of affinization = 82.37931034482759
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 246
average number of affinization = 87.83333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.32    |
| Iteration     | 3        |
| MaximumReturn | 20.3     |
| MinimumReturn | -24.7    |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2755417823791504
Validation loss = 0.2606857419013977
Validation loss = 0.26126739382743835
Validation loss = 0.2702776789665222
Validation loss = 0.2712724506855011
Validation loss = 0.27658796310424805
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.25934991240501404
Validation loss = 0.2541649341583252
Validation loss = 0.25300127267837524
Validation loss = 0.2579743564128876
Validation loss = 0.26055553555488586
Validation loss = 0.26278597116470337
Validation loss = 0.26828351616859436
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2656921148300171
Validation loss = 0.2593189477920532
Validation loss = 0.26796215772628784
Validation loss = 0.26646989583969116
Validation loss = 0.26993703842163086
Validation loss = 0.2754977345466614
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.2586853504180908
Validation loss = 0.25938549637794495
Validation loss = 0.27751410007476807
Validation loss = 0.26496371626853943
Validation loss = 0.2682870030403137
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2566593289375305
Validation loss = 0.25742578506469727
Validation loss = 0.25739341974258423
Validation loss = 0.26176971197128296
Validation loss = 0.2639680504798889
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 380
average number of affinization = 97.25806451612904
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 197
average number of affinization = 100.375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 179
average number of affinization = 102.75757575757575
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 290
average number of affinization = 108.26470588235294
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 261
average number of affinization = 112.62857142857143
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 265
average number of affinization = 116.86111111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.12    |
| Iteration     | 4        |
| MaximumReturn | 19.1     |
| MinimumReturn | -21.7    |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.26422688364982605
Validation loss = 0.27499380707740784
Validation loss = 0.27473029494285583
Validation loss = 0.2773422300815582
Validation loss = 0.2792821228504181
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.26006296277046204
Validation loss = 0.26984333992004395
Validation loss = 0.2652060091495514
Validation loss = 0.2699967324733734
Validation loss = 0.2732081115245819
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2707657516002655
Validation loss = 0.2734319269657135
Validation loss = 0.27142730355262756
Validation loss = 0.27173152565956116
Validation loss = 0.2778284549713135
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.26566746830940247
Validation loss = 0.26717713475227356
Validation loss = 0.2777644097805023
Validation loss = 0.2725216746330261
Validation loss = 0.2760782241821289
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.26145514845848083
Validation loss = 0.25911155343055725
Validation loss = 0.26505738496780396
Validation loss = 0.2693069577217102
Validation loss = 0.2674088180065155
Validation loss = 0.27045416831970215
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 490
average number of affinization = 126.94594594594595
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 616
average number of affinization = 139.81578947368422
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 654
average number of affinization = 153.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 543
average number of affinization = 162.75
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 666
average number of affinization = 175.02439024390245
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 424
average number of affinization = 180.95238095238096
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.27    |
| Iteration     | 5        |
| MaximumReturn | 14.1     |
| MinimumReturn | -22.7    |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2790047228336334
Validation loss = 0.28187572956085205
Validation loss = 0.2856098413467407
Validation loss = 0.2831602394580841
Validation loss = 0.293597549200058
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2800487279891968
Validation loss = 0.27992376685142517
Validation loss = 0.2784166932106018
Validation loss = 0.28553393483161926
Validation loss = 0.28073927760124207
Validation loss = 0.28654831647872925
Validation loss = 0.2892126142978668
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2825404703617096
Validation loss = 0.2773914039134979
Validation loss = 0.28730788826942444
Validation loss = 0.28680771589279175
Validation loss = 0.2906493544578552
Validation loss = 0.2920900881290436
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28013065457344055
Validation loss = 0.284699410200119
Validation loss = 0.2853250801563263
Validation loss = 0.28714922070503235
Validation loss = 0.2890329957008362
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27804920077323914
Validation loss = 0.2755473256111145
Validation loss = 0.2811760902404785
Validation loss = 0.2844260334968567
Validation loss = 0.28879156708717346
Validation loss = 0.2865760624408722
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 302
average number of affinization = 183.7674418604651
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 98
average number of affinization = 181.8181818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 230
average number of affinization = 182.88888888888889
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 185
average number of affinization = 182.93478260869566
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 240
average number of affinization = 184.14893617021278
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 267
average number of affinization = 185.875
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 4.53     |
| Iteration     | 6        |
| MaximumReturn | 19.1     |
| MinimumReturn | -23.1    |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.2932026982307434
Validation loss = 0.29356804490089417
Validation loss = 0.2932645380496979
Validation loss = 0.29764094948768616
Validation loss = 0.299124538898468
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.29024162888526917
Validation loss = 0.2945965826511383
Validation loss = 0.2960701882839203
Validation loss = 0.2965274453163147
Validation loss = 0.29861289262771606
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.29152411222457886
Validation loss = 0.2910708785057068
Validation loss = 0.2949928939342499
Validation loss = 0.294177770614624
Validation loss = 0.29659274220466614
Validation loss = 0.3031659722328186
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.28955671191215515
Validation loss = 0.2896811366081238
Validation loss = 0.29368430376052856
Validation loss = 0.29766616225242615
Validation loss = 0.29547691345214844
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2877664268016815
Validation loss = 0.2941192388534546
Validation loss = 0.2910451889038086
Validation loss = 0.2982179522514343
Validation loss = 0.3050706386566162
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 221
average number of affinization = 186.59183673469389
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 265
average number of affinization = 188.16
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 300
average number of affinization = 190.35294117647058
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 179
average number of affinization = 190.1346153846154
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 257
average number of affinization = 191.39622641509433
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 194
average number of affinization = 191.44444444444446
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.17    |
| Iteration     | 7        |
| MaximumReturn | 23.1     |
| MinimumReturn | -21      |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3005426228046417
Validation loss = 0.30550968647003174
Validation loss = 0.3038756549358368
Validation loss = 0.3073021173477173
Validation loss = 0.30782991647720337
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3038955330848694
Validation loss = 0.30485209822654724
Validation loss = 0.3024953305721283
Validation loss = 0.30913054943084717
Validation loss = 0.31164929270744324
Validation loss = 0.3130260407924652
Validation loss = 0.3229847550392151
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.30177342891693115
Validation loss = 0.30152514576911926
Validation loss = 0.30933475494384766
Validation loss = 0.31169191002845764
Validation loss = 0.31198909878730774
Validation loss = 0.3099154233932495
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.299109548330307
Validation loss = 0.3047630786895752
Validation loss = 0.3080965280532837
Validation loss = 0.304991751909256
Validation loss = 0.3085819482803345
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.29984238743782043
Validation loss = 0.3045716881752014
Validation loss = 0.30674847960472107
Validation loss = 0.303731769323349
Validation loss = 0.3080776035785675
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 562
average number of affinization = 198.1818181818182
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 607
average number of affinization = 205.48214285714286
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 622
average number of affinization = 212.78947368421052
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 511
average number of affinization = 217.93103448275863
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 565
average number of affinization = 223.8135593220339
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 714
average number of affinization = 231.98333333333332
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.78    |
| Iteration     | 8        |
| MaximumReturn | 19.5     |
| MinimumReturn | -22.1    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3097466826438904
Validation loss = 0.3097303509712219
Validation loss = 0.31086069345474243
Validation loss = 0.315409779548645
Validation loss = 0.31913110613822937
Validation loss = 0.32273977994918823
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.316648006439209
Validation loss = 0.31780725717544556
Validation loss = 0.318532794713974
Validation loss = 0.32264038920402527
Validation loss = 0.33029118180274963
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.314262330532074
Validation loss = 0.31652510166168213
Validation loss = 0.31639420986175537
Validation loss = 0.3163083791732788
Validation loss = 0.3193422853946686
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.31404396891593933
Validation loss = 0.31159693002700806
Validation loss = 0.31713971495628357
Validation loss = 0.32032403349876404
Validation loss = 0.31598836183547974
Validation loss = 0.3202587962150574
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.30623242259025574
Validation loss = 0.30926913022994995
Validation loss = 0.3141436278820038
Validation loss = 0.3124423921108246
Validation loss = 0.3188846707344055
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 949
average number of affinization = 243.7377049180328
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 971
average number of affinization = 255.46774193548387
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 988
average number of affinization = 267.0952380952381
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 983
average number of affinization = 278.28125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 986
average number of affinization = 289.16923076923075
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 991
average number of affinization = 299.8030303030303
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.52     |
| Iteration     | 9        |
| MaximumReturn | 20.1     |
| MinimumReturn | -8.95    |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.31919771432876587
Validation loss = 0.3168776333332062
Validation loss = 0.322799414396286
Validation loss = 0.32000043988227844
Validation loss = 0.3243512809276581
Validation loss = 0.32804635167121887
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3228250741958618
Validation loss = 0.3229742646217346
Validation loss = 0.3300215005874634
Validation loss = 0.3231257200241089
Validation loss = 0.3346269130706787
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3226384222507477
Validation loss = 0.3218957185745239
Validation loss = 0.3255062699317932
Validation loss = 0.325726181268692
Validation loss = 0.33006852865219116
Validation loss = 0.3351944088935852
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32078292965888977
Validation loss = 0.320649117231369
Validation loss = 0.32033267617225647
Validation loss = 0.321515291929245
Validation loss = 0.3326570987701416
Validation loss = 0.33446240425109863
Validation loss = 0.3332214653491974
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3203221559524536
Validation loss = 0.317804753780365
Validation loss = 0.32268545031547546
Validation loss = 0.3295450508594513
Validation loss = 0.3261295258998871
Validation loss = 0.3292607367038727
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 813
average number of affinization = 307.46268656716416
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 858
average number of affinization = 315.55882352941177
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 874
average number of affinization = 323.6521739130435
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 873
average number of affinization = 331.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 863
average number of affinization = 338.98591549295776
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 807
average number of affinization = 345.4861111111111
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 3.21     |
| Iteration     | 10       |
| MaximumReturn | 13.9     |
| MinimumReturn | -8.24    |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3306247889995575
Validation loss = 0.3318157494068146
Validation loss = 0.3334132730960846
Validation loss = 0.33568477630615234
Validation loss = 0.3417114317417145
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3334839642047882
Validation loss = 0.3340364694595337
Validation loss = 0.34297406673431396
Validation loss = 0.3433099687099457
Validation loss = 0.33854377269744873
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3313390910625458
Validation loss = 0.3313731253147125
Validation loss = 0.3407406806945801
Validation loss = 0.34008312225341797
Validation loss = 0.3468005657196045
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3356517255306244
Validation loss = 0.3369327485561371
Validation loss = 0.3381185829639435
Validation loss = 0.34340283274650574
Validation loss = 0.3458247184753418
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3348461091518402
Validation loss = 0.3305145800113678
Validation loss = 0.33536526560783386
Validation loss = 0.3392816483974457
Validation loss = 0.3375185430049896
Validation loss = 0.33877453207969666
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 378
average number of affinization = 345.93150684931504
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 504
average number of affinization = 348.06756756756755
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 250
average number of affinization = 346.76
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 312
average number of affinization = 346.30263157894734
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 349
average number of affinization = 346.3376623376623
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 289
average number of affinization = 345.6025641025641
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -7.34    |
| Iteration     | 11       |
| MaximumReturn | 4.6      |
| MinimumReturn | -13.1    |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3398696184158325
Validation loss = 0.3417145907878876
Validation loss = 0.3454734981060028
Validation loss = 0.34888771176338196
Validation loss = 0.352972149848938
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.34032443165779114
Validation loss = 0.3446509540081024
Validation loss = 0.35118919610977173
Validation loss = 0.34988176822662354
Validation loss = 0.35851213335990906
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3423788249492645
Validation loss = 0.3478224575519562
Validation loss = 0.3528345823287964
Validation loss = 0.34847769141197205
Validation loss = 0.3586941957473755
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3456803858280182
Validation loss = 0.34943997859954834
Validation loss = 0.34792599081993103
Validation loss = 0.3535148501396179
Validation loss = 0.3568461239337921
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3449416160583496
Validation loss = 0.34159594774246216
Validation loss = 0.3444002866744995
Validation loss = 0.35373714566230774
Validation loss = 0.3560127913951874
Validation loss = 0.3591858148574829
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 831
average number of affinization = 351.746835443038
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 893
average number of affinization = 358.5125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 715
average number of affinization = 362.91358024691357
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 892
average number of affinization = 369.3658536585366
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 876
average number of affinization = 375.4698795180723
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 868
average number of affinization = 381.3333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 4.02     |
| Iteration     | 12       |
| MaximumReturn | 21.2     |
| MinimumReturn | -15.7    |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.34862464666366577
Validation loss = 0.3514322340488434
Validation loss = 0.35433855652809143
Validation loss = 0.3576368987560272
Validation loss = 0.361844539642334
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.34986454248428345
Validation loss = 0.35698750615119934
Validation loss = 0.35950350761413574
Validation loss = 0.35866084694862366
Validation loss = 0.3665849566459656
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.35643211007118225
Validation loss = 0.35631081461906433
Validation loss = 0.358454167842865
Validation loss = 0.3642377555370331
Validation loss = 0.3666832149028778
Validation loss = 0.369718074798584
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3594367504119873
Validation loss = 0.3529408276081085
Validation loss = 0.35840892791748047
Validation loss = 0.36254557967185974
Validation loss = 0.3659781813621521
Validation loss = 0.3684774339199066
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.35129064321517944
Validation loss = 0.3566247820854187
Validation loss = 0.3619246482849121
Validation loss = 0.3761778473854065
Validation loss = 0.3725523054599762
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 846
average number of affinization = 386.8
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 836
average number of affinization = 392.0232558139535
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 808
average number of affinization = 396.8045977011494
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 846
average number of affinization = 401.90909090909093
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 540
average number of affinization = 403.46067415730334
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 747
average number of affinization = 407.27777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.54     |
| Iteration     | 13       |
| MaximumReturn | 18       |
| MinimumReturn | -13.3    |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3650480806827545
Validation loss = 0.37142759561538696
Validation loss = 0.3752121031284332
Validation loss = 0.3723330795764923
Validation loss = 0.3789253234863281
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3680007755756378
Validation loss = 0.3682735860347748
Validation loss = 0.3736831843852997
Validation loss = 0.3749507665634155
Validation loss = 0.3760784864425659
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3729715049266815
Validation loss = 0.3723793625831604
Validation loss = 0.3831581771373749
Validation loss = 0.37738415598869324
Validation loss = 0.3846977949142456
Validation loss = 0.39018478989601135
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.37119635939598083
Validation loss = 0.37499749660491943
Validation loss = 0.3723822832107544
Validation loss = 0.38700029253959656
Validation loss = 0.38154470920562744
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3689636290073395
Validation loss = 0.3762589991092682
Validation loss = 0.37822288274765015
Validation loss = 0.37770524621009827
Validation loss = 0.38605234026908875
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 955
average number of affinization = 413.2967032967033
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 969
average number of affinization = 419.3369565217391
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 953
average number of affinization = 425.0752688172043
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 961
average number of affinization = 430.77659574468083
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 965
average number of affinization = 436.4
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 958
average number of affinization = 441.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -12.3    |
| Iteration     | 14       |
| MaximumReturn | 15.8     |
| MinimumReturn | -22.1    |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.37780994176864624
Validation loss = 0.38309091329574585
Validation loss = 0.38027477264404297
Validation loss = 0.3801034688949585
Validation loss = 0.38397157192230225
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3716842234134674
Validation loss = 0.3765835464000702
Validation loss = 0.38123613595962524
Validation loss = 0.390103280544281
Validation loss = 0.3854933977127075
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.383581280708313
Validation loss = 0.38472801446914673
Validation loss = 0.3914511799812317
Validation loss = 0.39274996519088745
Validation loss = 0.39553016424179077
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.38114285469055176
Validation loss = 0.3831241726875305
Validation loss = 0.3824363350868225
Validation loss = 0.3884565532207489
Validation loss = 0.39282673597335815
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3826584219932556
Validation loss = 0.38327622413635254
Validation loss = 0.3882732093334198
Validation loss = 0.3917655348777771
Validation loss = 0.39340776205062866
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 625
average number of affinization = 443.7216494845361
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 607
average number of affinization = 445.38775510204084
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 580
average number of affinization = 446.74747474747477
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 417
average number of affinization = 446.45
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 555
average number of affinization = 447.5247524752475
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 622
average number of affinization = 449.2352941176471
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.74     |
| Iteration     | 15       |
| MaximumReturn | 21.6     |
| MinimumReturn | -15.3    |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3809175193309784
Validation loss = 0.3863888382911682
Validation loss = 0.38519200682640076
Validation loss = 0.39072951674461365
Validation loss = 0.3987923264503479
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38185906410217285
Validation loss = 0.3839125335216522
Validation loss = 0.39084213972091675
Validation loss = 0.39220738410949707
Validation loss = 0.39664918184280396
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3819555640220642
Validation loss = 0.3928922116756439
Validation loss = 0.3962409794330597
Validation loss = 0.39742064476013184
Validation loss = 0.4008472263813019
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3864023983478546
Validation loss = 0.3878357708454132
Validation loss = 0.3909599483013153
Validation loss = 0.3977930247783661
Validation loss = 0.39834076166152954
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3830265700817108
Validation loss = 0.38926541805267334
Validation loss = 0.3912953734397888
Validation loss = 0.3963353931903839
Validation loss = 0.4006296694278717
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 524
average number of affinization = 449.9611650485437
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 571
average number of affinization = 451.125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 566
average number of affinization = 452.2190476190476
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 552
average number of affinization = 453.16037735849056
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 536
average number of affinization = 453.93457943925233
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 521
average number of affinization = 454.55555555555554
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.25     |
| Iteration     | 16       |
| MaximumReturn | 14.1     |
| MinimumReturn | 4.42     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3848154842853546
Validation loss = 0.3945392668247223
Validation loss = 0.3902777433395386
Validation loss = 0.4032241702079773
Validation loss = 0.4065593481063843
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.38787752389907837
Validation loss = 0.3934999406337738
Validation loss = 0.39935213327407837
Validation loss = 0.4043784737586975
Validation loss = 0.40479639172554016
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3985547125339508
Validation loss = 0.39824041724205017
Validation loss = 0.40513911843299866
Validation loss = 0.4113985598087311
Validation loss = 0.4133015275001526
Validation loss = 0.41552862524986267
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39104679226875305
Validation loss = 0.3938642144203186
Validation loss = 0.40016865730285645
Validation loss = 0.4024139642715454
Validation loss = 0.4057350754737854
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3903156518936157
Validation loss = 0.3956093490123749
Validation loss = 0.4023638069629669
Validation loss = 0.4034377336502075
Validation loss = 0.4085689187049866
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 553
average number of affinization = 455.45871559633025
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 481
average number of affinization = 455.6909090909091
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 563
average number of affinization = 456.65765765765764
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 549
average number of affinization = 457.48214285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 606
average number of affinization = 458.79646017699116
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 511
average number of affinization = 459.2543859649123
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 4.89     |
| Iteration     | 17       |
| MaximumReturn | 19.3     |
| MinimumReturn | -12.1    |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39963436126708984
Validation loss = 0.3998008072376251
Validation loss = 0.40772169828414917
Validation loss = 0.415143221616745
Validation loss = 0.4123116731643677
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.398124098777771
Validation loss = 0.4093262851238251
Validation loss = 0.40937966108322144
Validation loss = 0.41002118587493896
Validation loss = 0.4137604236602783
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4051487445831299
Validation loss = 0.4141504466533661
Validation loss = 0.4127272963523865
Validation loss = 0.4190913736820221
Validation loss = 0.4191230833530426
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3971875011920929
Validation loss = 0.406122088432312
Validation loss = 0.40853357315063477
Validation loss = 0.41390663385391235
Validation loss = 0.4151771664619446
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4022163152694702
Validation loss = 0.40396153926849365
Validation loss = 0.4067642390727997
Validation loss = 0.4105478525161743
Validation loss = 0.4180145263671875
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 436
average number of affinization = 459.0521739130435
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 455
average number of affinization = 459.01724137931035
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 495
average number of affinization = 459.3247863247863
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 435
average number of affinization = 459.1186440677966
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 525
average number of affinization = 459.672268907563
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 407
average number of affinization = 459.23333333333335
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.269   |
| Iteration     | 18       |
| MaximumReturn | 19.6     |
| MinimumReturn | -14.9    |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4062758982181549
Validation loss = 0.4143458306789398
Validation loss = 0.4149072766304016
Validation loss = 0.4176614284515381
Validation loss = 0.42209118604660034
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4069996774196625
Validation loss = 0.41419410705566406
Validation loss = 0.4149325489997864
Validation loss = 0.41798949241638184
Validation loss = 0.42242464423179626
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.41677770018577576
Validation loss = 0.4194888174533844
Validation loss = 0.4213462471961975
Validation loss = 0.426718145608902
Validation loss = 0.42752495408058167
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4113193452358246
Validation loss = 0.41251811385154724
Validation loss = 0.41505345702171326
Validation loss = 0.42290249466896057
Validation loss = 0.4222099184989929
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4055265784263611
Validation loss = 0.41327518224716187
Validation loss = 0.41555437445640564
Validation loss = 0.4232192933559418
Validation loss = 0.42330169677734375
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 565
average number of affinization = 460.10743801652893
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 514
average number of affinization = 460.54918032786884
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 539
average number of affinization = 461.1869918699187
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 535
average number of affinization = 461.78225806451616
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 477
average number of affinization = 461.904
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 522
average number of affinization = 462.3809523809524
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.04     |
| Iteration     | 19       |
| MaximumReturn | 21.7     |
| MinimumReturn | -17.9    |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4140116274356842
Validation loss = 0.41903921961784363
Validation loss = 0.42171236872673035
Validation loss = 0.4302733838558197
Validation loss = 0.4319828450679779
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41958850622177124
Validation loss = 0.4200366735458374
Validation loss = 0.4241698980331421
Validation loss = 0.42695295810699463
Validation loss = 0.42959707975387573
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4222005307674408
Validation loss = 0.42426085472106934
Validation loss = 0.4263877272605896
Validation loss = 0.43302321434020996
Validation loss = 0.43479928374290466
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.41930294036865234
Validation loss = 0.41996002197265625
Validation loss = 0.4259350299835205
Validation loss = 0.4280376434326172
Validation loss = 0.4316144585609436
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.42172351479530334
Validation loss = 0.4202626049518585
Validation loss = 0.4227813184261322
Validation loss = 0.4290693402290344
Validation loss = 0.4298425018787384
Validation loss = 0.432546466588974
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 520
average number of affinization = 462.8346456692913
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 506
average number of affinization = 463.171875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 467
average number of affinization = 463.2015503875969
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 547
average number of affinization = 463.84615384615387
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 564
average number of affinization = 464.6106870229008
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 511
average number of affinization = 464.9621212121212
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.28    |
| Iteration     | 20       |
| MaximumReturn | 20.3     |
| MinimumReturn | -23      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42186328768730164
Validation loss = 0.42044487595558167
Validation loss = 0.42535096406936646
Validation loss = 0.4304139316082001
Validation loss = 0.43080681562423706
Validation loss = 0.4335210621356964
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4170527756214142
Validation loss = 0.42513447999954224
Validation loss = 0.4281575083732605
Validation loss = 0.4338390529155731
Validation loss = 0.4332222640514374
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4281998872756958
Validation loss = 0.4286614656448364
Validation loss = 0.4280618727207184
Validation loss = 0.4332895576953888
Validation loss = 0.43570852279663086
Validation loss = 0.43526288866996765
Validation loss = 0.4378599226474762
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42012739181518555
Validation loss = 0.4245262145996094
Validation loss = 0.4295038878917694
Validation loss = 0.42849966883659363
Validation loss = 0.43344563245773315
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4259788393974304
Validation loss = 0.42703038454055786
Validation loss = 0.426812082529068
Validation loss = 0.4316382110118866
Validation loss = 0.4374747574329376
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 613
average number of affinization = 466.0751879699248
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 527
average number of affinization = 466.52985074626866
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 556
average number of affinization = 467.1925925925926
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 542
average number of affinization = 467.74264705882354
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 540
average number of affinization = 468.2700729927007
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 554
average number of affinization = 468.89130434782606
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.12     |
| Iteration     | 21       |
| MaximumReturn | 20.6     |
| MinimumReturn | -25.9    |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42509880661964417
Validation loss = 0.4284507632255554
Validation loss = 0.4335498809814453
Validation loss = 0.43652090430259705
Validation loss = 0.43674376606941223
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4247643053531647
Validation loss = 0.4320525825023651
Validation loss = 0.43748682737350464
Validation loss = 0.43725457787513733
Validation loss = 0.43648824095726013
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4319642186164856
Validation loss = 0.43321850895881653
Validation loss = 0.43481162190437317
Validation loss = 0.4396900236606598
Validation loss = 0.43799397349357605
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42413318157196045
Validation loss = 0.43298643827438354
Validation loss = 0.4347696006298065
Validation loss = 0.4371182918548584
Validation loss = 0.4352664649486542
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4278288781642914
Validation loss = 0.43034258484840393
Validation loss = 0.43611153960227966
Validation loss = 0.4360112249851227
Validation loss = 0.4386650025844574
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 569
average number of affinization = 469.6115107913669
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 534
average number of affinization = 470.07142857142856
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 573
average number of affinization = 470.8014184397163
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 539
average number of affinization = 471.28169014084506
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 529
average number of affinization = 471.68531468531467
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 604
average number of affinization = 472.6041666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.82    |
| Iteration     | 22       |
| MaximumReturn | 13.6     |
| MinimumReturn | -24.5    |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4304160177707672
Validation loss = 0.4334113597869873
Validation loss = 0.4373915195465088
Validation loss = 0.43556681275367737
Validation loss = 0.4356597363948822
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4269653856754303
Validation loss = 0.4335114061832428
Validation loss = 0.4343479573726654
Validation loss = 0.4383445680141449
Validation loss = 0.4404127299785614
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4329940974712372
Validation loss = 0.4354330599308014
Validation loss = 0.4376192092895508
Validation loss = 0.4408017098903656
Validation loss = 0.4419546127319336
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42971810698509216
Validation loss = 0.4348703920841217
Validation loss = 0.43623700737953186
Validation loss = 0.43539664149284363
Validation loss = 0.4457077980041504
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4300500452518463
Validation loss = 0.43400922417640686
Validation loss = 0.43635639548301697
Validation loss = 0.4399298429489136
Validation loss = 0.441683292388916
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 552
average number of affinization = 473.151724137931
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 525
average number of affinization = 473.5068493150685
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 540
average number of affinization = 473.9591836734694
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 562
average number of affinization = 474.55405405405406
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 499
average number of affinization = 474.71812080536915
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 566
average number of affinization = 475.32666666666665
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.36    |
| Iteration     | 23       |
| MaximumReturn | 18       |
| MinimumReturn | -17.8    |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4378531277179718
Validation loss = 0.4391231834888458
Validation loss = 0.4398484528064728
Validation loss = 0.4418627619743347
Validation loss = 0.44219696521759033
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.43663251399993896
Validation loss = 0.44066235423088074
Validation loss = 0.4450593590736389
Validation loss = 0.44421523809432983
Validation loss = 0.446933776140213
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.43558403849601746
Validation loss = 0.4385874271392822
Validation loss = 0.4447675049304962
Validation loss = 0.4446321427822113
Validation loss = 0.4464814066886902
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4339367151260376
Validation loss = 0.43582481145858765
Validation loss = 0.43906211853027344
Validation loss = 0.43939492106437683
Validation loss = 0.4445827901363373
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4359772503376007
Validation loss = 0.4396432042121887
Validation loss = 0.4413677453994751
Validation loss = 0.445383220911026
Validation loss = 0.44442734122276306
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 608
average number of affinization = 476.20529801324506
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 573
average number of affinization = 476.8421052631579
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 591
average number of affinization = 477.5882352941176
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 553
average number of affinization = 478.0779220779221
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 453
average number of affinization = 477.9161290322581
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 580
average number of affinization = 478.5705128205128
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.78     |
| Iteration     | 24       |
| MaximumReturn | 19.5     |
| MinimumReturn | -22.1    |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.43372949957847595
Validation loss = 0.43914058804512024
Validation loss = 0.44025087356567383
Validation loss = 0.44346708059310913
Validation loss = 0.4476192593574524
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.44111543893814087
Validation loss = 0.44284093379974365
Validation loss = 0.4439951479434967
Validation loss = 0.44946596026420593
Validation loss = 0.4462375044822693
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.43722423911094666
Validation loss = 0.44027218222618103
Validation loss = 0.4461398422718048
Validation loss = 0.4466613531112671
Validation loss = 0.44927161931991577
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4364291727542877
Validation loss = 0.43642979860305786
Validation loss = 0.4409944713115692
Validation loss = 0.44596007466316223
Validation loss = 0.44691941142082214
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.440083384513855
Validation loss = 0.4401259124279022
Validation loss = 0.4442318081855774
Validation loss = 0.4470784366130829
Validation loss = 0.44746482372283936
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 600
average number of affinization = 479.343949044586
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 540
average number of affinization = 479.7278481012658
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 635
average number of affinization = 480.70440251572325
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 675
average number of affinization = 481.91875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 629
average number of affinization = 482.832298136646
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 644
average number of affinization = 483.82716049382714
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.08    |
| Iteration     | 25       |
| MaximumReturn | 16       |
| MinimumReturn | -18      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.43549054861068726
Validation loss = 0.4386201500892639
Validation loss = 0.4424559772014618
Validation loss = 0.44468310475349426
Validation loss = 0.4478217661380768
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.43907830119132996
Validation loss = 0.4449473023414612
Validation loss = 0.4456912577152252
Validation loss = 0.4513903856277466
Validation loss = 0.4502493441104889
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.43950846791267395
Validation loss = 0.44312119483947754
Validation loss = 0.4461251497268677
Validation loss = 0.44808316230773926
Validation loss = 0.4504472613334656
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44005027413368225
Validation loss = 0.43871238827705383
Validation loss = 0.4444299340248108
Validation loss = 0.45036014914512634
Validation loss = 0.4491676986217499
Validation loss = 0.4484940767288208
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.44386470317840576
Validation loss = 0.4447048008441925
Validation loss = 0.4472654163837433
Validation loss = 0.4466329514980316
Validation loss = 0.45377612113952637
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 675
average number of affinization = 485.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 680
average number of affinization = 486.1890243902439
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 695
average number of affinization = 487.45454545454544
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 675
average number of affinization = 488.5843373493976
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 637
average number of affinization = 489.47305389221555
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 598
average number of affinization = 490.1190476190476
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -11.7    |
| Iteration     | 26       |
| MaximumReturn | 19.6     |
| MinimumReturn | -23.7    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.44268766045570374
Validation loss = 0.44527071714401245
Validation loss = 0.4464080333709717
Validation loss = 0.44603466987609863
Validation loss = 0.44978979229927063
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.44548267126083374
Validation loss = 0.4483114778995514
Validation loss = 0.44929060339927673
Validation loss = 0.45474550127983093
Validation loss = 0.45272451639175415
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4434394836425781
Validation loss = 0.4472573399543762
Validation loss = 0.44993856549263
Validation loss = 0.45096924901008606
Validation loss = 0.45533058047294617
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4442595839500427
Validation loss = 0.4450025260448456
Validation loss = 0.44961169362068176
Validation loss = 0.45000097155570984
Validation loss = 0.45115819573402405
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4441470801830292
Validation loss = 0.450944721698761
Validation loss = 0.4554223418235779
Validation loss = 0.4528559148311615
Validation loss = 0.4552781581878662
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 627
average number of affinization = 490.92899408284023
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 636
average number of affinization = 491.7823529411765
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 629
average number of affinization = 492.58479532163744
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 606
average number of affinization = 493.24418604651163
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 663
average number of affinization = 494.22543352601156
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 694
average number of affinization = 495.3735632183908
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 8.46     |
| Iteration     | 27       |
| MaximumReturn | 26.4     |
| MinimumReturn | -16.7    |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.44346752762794495
Validation loss = 0.4481663405895233
Validation loss = 0.4466530382633209
Validation loss = 0.4489136040210724
Validation loss = 0.45124733448028564
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4445330500602722
Validation loss = 0.4500952363014221
Validation loss = 0.4504219889640808
Validation loss = 0.45169296860694885
Validation loss = 0.4537983238697052
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44718125462532043
Validation loss = 0.4481884241104126
Validation loss = 0.4518711566925049
Validation loss = 0.4520889222621918
Validation loss = 0.45072323083877563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4420354664325714
Validation loss = 0.4446640908718109
Validation loss = 0.44942277669906616
Validation loss = 0.44951319694519043
Validation loss = 0.45570892095565796
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.44525423645973206
Validation loss = 0.44873854517936707
Validation loss = 0.4499852657318115
Validation loss = 0.45604729652404785
Validation loss = 0.4532233476638794
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 606
average number of affinization = 496.0057142857143
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 620
average number of affinization = 496.71022727272725
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 688
average number of affinization = 497.7909604519774
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 623
average number of affinization = 498.4943820224719
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 601
average number of affinization = 499.0670391061453
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 665
average number of affinization = 499.9888888888889
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 5.79     |
| Iteration     | 28       |
| MaximumReturn | 21.2     |
| MinimumReturn | -20      |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4448327124118805
Validation loss = 0.4473362863063812
Validation loss = 0.4512639045715332
Validation loss = 0.4507352411746979
Validation loss = 0.4515569806098938
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4446418583393097
Validation loss = 0.4502238929271698
Validation loss = 0.45550063252449036
Validation loss = 0.4549046456813812
Validation loss = 0.45685723423957825
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4460960328578949
Validation loss = 0.4465430974960327
Validation loss = 0.45260483026504517
Validation loss = 0.4506467878818512
Validation loss = 0.4530123770236969
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44583362340927124
Validation loss = 0.44775596261024475
Validation loss = 0.4510331451892853
Validation loss = 0.4530197083950043
Validation loss = 0.45130056142807007
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4460445046424866
Validation loss = 0.4484720528125763
Validation loss = 0.45178842544555664
Validation loss = 0.4552381634712219
Validation loss = 0.45583704113960266
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 757
average number of affinization = 501.4088397790055
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 698
average number of affinization = 502.489010989011
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 736
average number of affinization = 503.76502732240436
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 757
average number of affinization = 505.14130434782606
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 747
average number of affinization = 506.44864864864866
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 770
average number of affinization = 507.86559139784947
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 0.973    |
| Iteration     | 29       |
| MaximumReturn | 21.1     |
| MinimumReturn | -11.5    |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4475046396255493
Validation loss = 0.45115983486175537
Validation loss = 0.4501578211784363
Validation loss = 0.4523986279964447
Validation loss = 0.45173317193984985
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4521384835243225
Validation loss = 0.45110371708869934
Validation loss = 0.45264604687690735
Validation loss = 0.45528024435043335
Validation loss = 0.4607572853565216
Validation loss = 0.45647260546684265
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4463386535644531
Validation loss = 0.44932979345321655
Validation loss = 0.45281749963760376
Validation loss = 0.4552229642868042
Validation loss = 0.4534113109111786
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44607582688331604
Validation loss = 0.4512634873390198
Validation loss = 0.4515877068042755
Validation loss = 0.4542909860610962
Validation loss = 0.45415550470352173
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4468291103839874
Validation loss = 0.4504275918006897
Validation loss = 0.45387935638427734
Validation loss = 0.45460036396980286
Validation loss = 0.4550957977771759
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 703
average number of affinization = 508.90909090909093
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 735
average number of affinization = 510.11170212765956
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 727
average number of affinization = 511.25925925925924
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 709
average number of affinization = 512.3
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 663
average number of affinization = 513.0890052356021
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 748
average number of affinization = 514.3125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.25     |
| Iteration     | 30       |
| MaximumReturn | 23.1     |
| MinimumReturn | -21.4    |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.44916126132011414
Validation loss = 0.45249730348587036
Validation loss = 0.4507156312465668
Validation loss = 0.45385563373565674
Validation loss = 0.4529520571231842
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4500976800918579
Validation loss = 0.4524037539958954
Validation loss = 0.4554992914199829
Validation loss = 0.4585879445075989
Validation loss = 0.45674192905426025
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44914937019348145
Validation loss = 0.45151251554489136
Validation loss = 0.45152613520622253
Validation loss = 0.4532257914543152
Validation loss = 0.4565889835357666
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4464401602745056
Validation loss = 0.45124325156211853
Validation loss = 0.45244908332824707
Validation loss = 0.4536873698234558
Validation loss = 0.45489242672920227
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.447986900806427
Validation loss = 0.45062780380249023
Validation loss = 0.45408952236175537
Validation loss = 0.4559513330459595
Validation loss = 0.4578355550765991
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 758
average number of affinization = 515.5751295336787
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 770
average number of affinization = 516.8865979381443
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 648
average number of affinization = 517.5589743589744
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 761
average number of affinization = 518.8010204081633
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 756
average number of affinization = 520.005076142132
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 805
average number of affinization = 521.4444444444445
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -8.97    |
| Iteration     | 31       |
| MaximumReturn | 13.2     |
| MinimumReturn | -22.8    |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.44780072569847107
Validation loss = 0.4521740674972534
Validation loss = 0.45263269543647766
Validation loss = 0.4549016058444977
Validation loss = 0.45418214797973633
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.45119360089302063
Validation loss = 0.4544648230075836
Validation loss = 0.4539315700531006
Validation loss = 0.45722371339797974
Validation loss = 0.4607350528240204
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44944170117378235
Validation loss = 0.4522039592266083
Validation loss = 0.4535321891307831
Validation loss = 0.45543813705444336
Validation loss = 0.4565669894218445
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.44860324263572693
Validation loss = 0.45653682947158813
Validation loss = 0.45475277304649353
Validation loss = 0.4580380320549011
Validation loss = 0.457315057516098
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4512384831905365
Validation loss = 0.4538494646549225
Validation loss = 0.4566541612148285
Validation loss = 0.4567890465259552
Validation loss = 0.4616115987300873
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 699
average number of affinization = 522.3366834170854
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 673
average number of affinization = 523.09
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 711
average number of affinization = 524.0248756218906
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 653
average number of affinization = 524.6633663366337
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 743
average number of affinization = 525.7389162561576
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 721
average number of affinization = 526.6960784313726
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.24    |
| Iteration     | 32       |
| MaximumReturn | 16.4     |
| MinimumReturn | -20.4    |
| TotalSamples  | 136000   |
----------------------------
