Logging to experiments/gym_fswimmer/nov4/SO01w350e1_seed5543
Print configuration .....
{'env_name': 'gym_fswimmer', 'random_seeds': [2312, 1231, 2631, 5543], 'save_variables': False, 'model_save_dir': '/tmp/gym_fswimmer_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'intrinsic_reward_only': False, 'external_reward_evaluation_interval': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [32, 32], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 200, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'trpo_ext_reward': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6897114515304565
Validation loss = 0.43401700258255005
Validation loss = 0.3674159049987793
Validation loss = 0.35221874713897705
Validation loss = 0.3336460590362549
Validation loss = 0.3378334641456604
Validation loss = 0.3306225538253784
Validation loss = 0.35911625623703003
Validation loss = 0.34451740980148315
Validation loss = 0.3480820655822754
Validation loss = 0.35221701860427856
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7396042346954346
Validation loss = 0.44227713346481323
Validation loss = 0.379249632358551
Validation loss = 0.34862959384918213
Validation loss = 0.34234005212783813
Validation loss = 0.33639422059059143
Validation loss = 0.33753079175949097
Validation loss = 0.3596811294555664
Validation loss = 0.3442847728729248
Validation loss = 0.3368784785270691
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6931300163269043
Validation loss = 0.4099288582801819
Validation loss = 0.36550581455230713
Validation loss = 0.3425770401954651
Validation loss = 0.33086609840393066
Validation loss = 0.3335520327091217
Validation loss = 0.32615146040916443
Validation loss = 0.3306165039539337
Validation loss = 0.3497186005115509
Validation loss = 0.3536744713783264
Validation loss = 0.3348117470741272
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6836310625076294
Validation loss = 0.4317174255847931
Validation loss = 0.37110328674316406
Validation loss = 0.3480697274208069
Validation loss = 0.33870404958724976
Validation loss = 0.33176082372665405
Validation loss = 0.3401334881782532
Validation loss = 0.34873515367507935
Validation loss = 0.3423159122467041
Validation loss = 0.33309394121170044
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6687099933624268
Validation loss = 0.4376465678215027
Validation loss = 0.3689638376235962
Validation loss = 0.3467708230018616
Validation loss = 0.3349023461341858
Validation loss = 0.3346555829048157
Validation loss = 0.33528682589530945
Validation loss = 0.3465622663497925
Validation loss = 0.36310485005378723
Validation loss = 0.33806854486465454
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 35
average number of affinization = 5.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 53
average number of affinization = 11.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 43
average number of affinization = 14.555555555555555
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 24
average number of affinization = 15.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 11
average number of affinization = 15.090909090909092
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 32
average number of affinization = 16.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.24    |
| Iteration     | 0        |
| MaximumReturn | -0.5     |
| MinimumReturn | -7.71    |
| TotalSamples  | 8000     |
----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.340282678604126
Validation loss = 0.31986361742019653
Validation loss = 0.320671409368515
Validation loss = 0.3229127526283264
Validation loss = 0.3347513675689697
Validation loss = 0.34075427055358887
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3446839451789856
Validation loss = 0.32498040795326233
Validation loss = 0.31696903705596924
Validation loss = 0.31962308287620544
Validation loss = 0.32187989354133606
Validation loss = 0.3197652995586395
Validation loss = 0.3282235562801361
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34657612442970276
Validation loss = 0.31562793254852295
Validation loss = 0.3192240595817566
Validation loss = 0.3178889751434326
Validation loss = 0.3245972990989685
Validation loss = 0.3427283763885498
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.34304970502853394
Validation loss = 0.31923189759254456
Validation loss = 0.3146580457687378
Validation loss = 0.32219988107681274
Validation loss = 0.32782554626464844
Validation loss = 0.32304465770721436
Validation loss = 0.3321470618247986
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3396461308002472
Validation loss = 0.3187270164489746
Validation loss = 0.31062859296798706
Validation loss = 0.3155224323272705
Validation loss = 0.3189292252063751
Validation loss = 0.32393503189086914
Validation loss = 0.3256799578666687
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 0
average number of affinization = 15.23076923076923
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 143
average number of affinization = 24.357142857142858
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 81
average number of affinization = 28.133333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 62
average number of affinization = 30.25
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 55
average number of affinization = 31.705882352941178
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 34
average number of affinization = 31.833333333333332
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.25    |
| Iteration     | 1        |
| MaximumReturn | 2.81     |
| MinimumReturn | -9.16    |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32249075174331665
Validation loss = 0.31638088822364807
Validation loss = 0.31790050864219666
Validation loss = 0.3304242193698883
Validation loss = 0.3318513333797455
Validation loss = 0.3367050588130951
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3197416365146637
Validation loss = 0.32511621713638306
Validation loss = 0.32454296946525574
Validation loss = 0.32310980558395386
Validation loss = 0.33790671825408936
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.32269763946533203
Validation loss = 0.3204011619091034
Validation loss = 0.32257604598999023
Validation loss = 0.3201247751712799
Validation loss = 0.3374905288219452
Validation loss = 0.34069934487342834
Validation loss = 0.3374067544937134
Validation loss = 0.338265985250473
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.32896938920021057
Validation loss = 0.32181140780448914
Validation loss = 0.32394590973854065
Validation loss = 0.31797024607658386
Validation loss = 0.3420335054397583
Validation loss = 0.34408000111579895
Validation loss = 0.3383720815181732
Validation loss = 0.35269391536712646
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32549041509628296
Validation loss = 0.32400986552238464
Validation loss = 0.326830118894577
Validation loss = 0.3261699378490448
Validation loss = 0.3324543237686157
Validation loss = 0.3389185667037964
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 171
average number of affinization = 39.1578947368421
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 481
average number of affinization = 61.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 399
average number of affinization = 77.33333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 251
average number of affinization = 85.22727272727273
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 355
average number of affinization = 96.95652173913044
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 55
average number of affinization = 95.20833333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.17    |
| Iteration     | 2        |
| MaximumReturn | 14.5     |
| MinimumReturn | -12.8    |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.32996758818626404
Validation loss = 0.3224232494831085
Validation loss = 0.3266155421733856
Validation loss = 0.3351328670978546
Validation loss = 0.33465492725372314
Validation loss = 0.3473659157752991
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32275721430778503
Validation loss = 0.3218423128128052
Validation loss = 0.3243813216686249
Validation loss = 0.3323146402835846
Validation loss = 0.32912302017211914
Validation loss = 0.33993807435035706
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3297496438026428
Validation loss = 0.3345687687397003
Validation loss = 0.33276838064193726
Validation loss = 0.3418101668357849
Validation loss = 0.33890700340270996
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3360198438167572
Validation loss = 0.32762467861175537
Validation loss = 0.3399447202682495
Validation loss = 0.3410259187221527
Validation loss = 0.3620026111602783
Validation loss = 0.34304410219192505
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.32913607358932495
Validation loss = 0.3264501094818115
Validation loss = 0.3257390260696411
Validation loss = 0.3414326310157776
Validation loss = 0.3368968963623047
Validation loss = 0.3413078486919403
Validation loss = 0.352085143327713
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 888
average number of affinization = 126.92
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 829
average number of affinization = 153.92307692307693
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 761
average number of affinization = 176.40740740740742
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 890
average number of affinization = 201.89285714285714
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 834
average number of affinization = 223.68965517241378
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 842
average number of affinization = 244.3
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 4.42     |
| Iteration     | 3        |
| MaximumReturn | 18.8     |
| MinimumReturn | -17.1    |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3250700831413269
Validation loss = 0.33339762687683105
Validation loss = 0.3391261100769043
Validation loss = 0.3426043689250946
Validation loss = 0.3459838032722473
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.32924091815948486
Validation loss = 0.32470667362213135
Validation loss = 0.3401342034339905
Validation loss = 0.33656924962997437
Validation loss = 0.3445996642112732
Validation loss = 0.3456922769546509
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.33274132013320923
Validation loss = 0.3411588966846466
Validation loss = 0.33465105295181274
Validation loss = 0.34548765420913696
Validation loss = 0.34816867113113403
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3393573760986328
Validation loss = 0.34195569157600403
Validation loss = 0.3417516052722931
Validation loss = 0.35335493087768555
Validation loss = 0.35151785612106323
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.33203569054603577
Validation loss = 0.33861708641052246
Validation loss = 0.33618229627609253
Validation loss = 0.34442850947380066
Validation loss = 0.346339613199234
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 677
average number of affinization = 258.258064516129
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 638
average number of affinization = 270.125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 687
average number of affinization = 282.75757575757575
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 733
average number of affinization = 296.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 708
average number of affinization = 307.77142857142854
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 745
average number of affinization = 319.9166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.71     |
| Iteration     | 4        |
| MaximumReturn | 24.8     |
| MinimumReturn | -22.6    |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.342921644449234
Validation loss = 0.3402276337146759
Validation loss = 0.34747055172920227
Validation loss = 0.3501906394958496
Validation loss = 0.35034310817718506
Validation loss = 0.3572043478488922
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3487759828567505
Validation loss = 0.3408629894256592
Validation loss = 0.344571590423584
Validation loss = 0.3529902994632721
Validation loss = 0.35192957520484924
Validation loss = 0.35761430859565735
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.34438833594322205
Validation loss = 0.33996596932411194
Validation loss = 0.3523571789264679
Validation loss = 0.34894856810569763
Validation loss = 0.35897138714790344
Validation loss = 0.35480308532714844
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3440609276294708
Validation loss = 0.3542165756225586
Validation loss = 0.3504025936126709
Validation loss = 0.3535633385181427
Validation loss = 0.36489105224609375
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.34653088450431824
Validation loss = 0.34833595156669617
Validation loss = 0.35696327686309814
Validation loss = 0.3498976230621338
Validation loss = 0.3539651334285736
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 710
average number of affinization = 330.4594594594595
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 691
average number of affinization = 339.94736842105266
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 696
average number of affinization = 349.0769230769231
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 703
average number of affinization = 357.925
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 633
average number of affinization = 364.6341463414634
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 689
average number of affinization = 372.35714285714283
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -5.64    |
| Iteration     | 5        |
| MaximumReturn | 6.27     |
| MinimumReturn | -18.6    |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.35704880952835083
Validation loss = 0.35486727952957153
Validation loss = 0.3567390441894531
Validation loss = 0.36149904131889343
Validation loss = 0.3648434579372406
Validation loss = 0.36479946970939636
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3501487076282501
Validation loss = 0.35439467430114746
Validation loss = 0.35574668645858765
Validation loss = 0.3595016300678253
Validation loss = 0.362430602312088
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3597116768360138
Validation loss = 0.35473302006721497
Validation loss = 0.3669876456260681
Validation loss = 0.36056238412857056
Validation loss = 0.3622705936431885
Validation loss = 0.36846205592155457
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3542417585849762
Validation loss = 0.35195276141166687
Validation loss = 0.36104297637939453
Validation loss = 0.3601929545402527
Validation loss = 0.3627721965312958
Validation loss = 0.3635377287864685
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3530741333961487
Validation loss = 0.3516022264957428
Validation loss = 0.36067885160446167
Validation loss = 0.3584040701389313
Validation loss = 0.36637312173843384
Validation loss = 0.3676692545413971
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 921
average number of affinization = 385.1162790697674
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 861
average number of affinization = 395.9318181818182
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 927
average number of affinization = 407.73333333333335
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 808
average number of affinization = 416.4347826086956
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 893
average number of affinization = 426.5744680851064
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 869
average number of affinization = 435.7916666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 5.11     |
| Iteration     | 6        |
| MaximumReturn | 24       |
| MinimumReturn | -24      |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3587106764316559
Validation loss = 0.36538833379745483
Validation loss = 0.36730754375457764
Validation loss = 0.36609646677970886
Validation loss = 0.37867555022239685
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3530384600162506
Validation loss = 0.36751788854599
Validation loss = 0.3674355447292328
Validation loss = 0.36682265996932983
Validation loss = 0.36843281984329224
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.36251160502433777
Validation loss = 0.366855263710022
Validation loss = 0.3632788062095642
Validation loss = 0.3672664165496826
Validation loss = 0.3727831244468689
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.35796058177948
Validation loss = 0.36139822006225586
Validation loss = 0.3711123466491699
Validation loss = 0.3658362030982971
Validation loss = 0.3758176565170288
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3600463569164276
Validation loss = 0.3608306646347046
Validation loss = 0.3737402856349945
Validation loss = 0.3659012019634247
Validation loss = 0.37471625208854675
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 893
average number of affinization = 445.1224489795918
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 990
average number of affinization = 456.02
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 961
average number of affinization = 465.921568627451
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 970
average number of affinization = 475.61538461538464
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 953
average number of affinization = 484.62264150943395
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 979
average number of affinization = 493.77777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -2.27    |
| Iteration     | 7        |
| MaximumReturn | 19.3     |
| MinimumReturn | -22.2    |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3671228289604187
Validation loss = 0.3702608346939087
Validation loss = 0.37129324674606323
Validation loss = 0.3750920295715332
Validation loss = 0.3788919150829315
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3652273118495941
Validation loss = 0.3708771765232086
Validation loss = 0.37573516368865967
Validation loss = 0.3730069100856781
Validation loss = 0.37631404399871826
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.36963334679603577
Validation loss = 0.3708370625972748
Validation loss = 0.37394702434539795
Validation loss = 0.37660926580429077
Validation loss = 0.3796985149383545
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.36702385544776917
Validation loss = 0.37440067529678345
Validation loss = 0.36910930275917053
Validation loss = 0.37384533882141113
Validation loss = 0.38325098156929016
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3671061396598816
Validation loss = 0.3711373805999756
Validation loss = 0.37331730127334595
Validation loss = 0.37564462423324585
Validation loss = 0.37713783979415894
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 949
average number of affinization = 502.05454545454546
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 956
average number of affinization = 510.1607142857143
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 953
average number of affinization = 517.9298245614035
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 932
average number of affinization = 525.0689655172414
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 982
average number of affinization = 532.8135593220339
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 946
average number of affinization = 539.7
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -4.4     |
| Iteration     | 8        |
| MaximumReturn | 12.2     |
| MinimumReturn | -15.7    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.365045428276062
Validation loss = 0.3706548810005188
Validation loss = 0.37366244196891785
Validation loss = 0.37914615869522095
Validation loss = 0.38054147362709045
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.37280288338661194
Validation loss = 0.37135910987854004
Validation loss = 0.3789185881614685
Validation loss = 0.38630226254463196
Validation loss = 0.3837893605232239
Validation loss = 0.380988210439682
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.36822444200515747
Validation loss = 0.372231125831604
Validation loss = 0.37683677673339844
Validation loss = 0.381815642118454
Validation loss = 0.3867940604686737
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.36898988485336304
Validation loss = 0.36991778016090393
Validation loss = 0.37380680441856384
Validation loss = 0.38958317041397095
Validation loss = 0.3848198354244232
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3716418147087097
Validation loss = 0.3659825325012207
Validation loss = 0.3761826157569885
Validation loss = 0.3804798424243927
Validation loss = 0.3826177418231964
Validation loss = 0.38229528069496155
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 651
average number of affinization = 541.5245901639345
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 676
average number of affinization = 543.6935483870968
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 654
average number of affinization = 545.4444444444445
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 651
average number of affinization = 547.09375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 635
average number of affinization = 548.4461538461538
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 665
average number of affinization = 550.2121212121212
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -3.51    |
| Iteration     | 9        |
| MaximumReturn | 5.36     |
| MinimumReturn | -17.7    |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3716850280761719
Validation loss = 0.36819809675216675
Validation loss = 0.3782750368118286
Validation loss = 0.3847506642341614
Validation loss = 0.38268670439720154
Validation loss = 0.398697167634964
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3723994493484497
Validation loss = 0.3787789046764374
Validation loss = 0.3868808448314667
Validation loss = 0.38582003116607666
Validation loss = 0.39157435297966003
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3726775050163269
Validation loss = 0.37150031328201294
Validation loss = 0.38041555881500244
Validation loss = 0.38541433215141296
Validation loss = 0.39255645871162415
Validation loss = 0.39551007747650146
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3712102770805359
Validation loss = 0.37183257937431335
Validation loss = 0.376161128282547
Validation loss = 0.38414719700813293
Validation loss = 0.39568623900413513
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.36916056275367737
Validation loss = 0.3798285722732544
Validation loss = 0.3850758373737335
Validation loss = 0.38841021060943604
Validation loss = 0.38403767347335815
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 867
average number of affinization = 554.9402985074627
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 889
average number of affinization = 559.8529411764706
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 934
average number of affinization = 565.2753623188406
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 887
average number of affinization = 569.8714285714286
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 898
average number of affinization = 574.4929577464789
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 895
average number of affinization = 578.9444444444445
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.3     |
| Iteration     | 10       |
| MaximumReturn | 23.9     |
| MinimumReturn | -3.47    |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3835557997226715
Validation loss = 0.38708868622779846
Validation loss = 0.3912528455257416
Validation loss = 0.39367032051086426
Validation loss = 0.39911970496177673
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3809690475463867
Validation loss = 0.38925114274024963
Validation loss = 0.3862861096858978
Validation loss = 0.39188873767852783
Validation loss = 0.39938342571258545
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.38470712304115295
Validation loss = 0.39041146636009216
Validation loss = 0.39424729347229004
Validation loss = 0.39766231179237366
Validation loss = 0.40406084060668945
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3827514350414276
Validation loss = 0.3876175880432129
Validation loss = 0.38778701424598694
Validation loss = 0.3921431601047516
Validation loss = 0.39576587080955505
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38648608326911926
Validation loss = 0.38680651783943176
Validation loss = 0.39033615589141846
Validation loss = 0.38971972465515137
Validation loss = 0.39616283774375916
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 591
average number of affinization = 579.1095890410959
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 666
average number of affinization = 580.2837837837837
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 585
average number of affinization = 580.3466666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 639
average number of affinization = 581.1184210526316
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 585
average number of affinization = 581.1688311688312
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 648
average number of affinization = 582.025641025641
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -0.0263  |
| Iteration     | 11       |
| MaximumReturn | 9.1      |
| MinimumReturn | -8.22    |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.386593222618103
Validation loss = 0.38867104053497314
Validation loss = 0.39556220173835754
Validation loss = 0.39596307277679443
Validation loss = 0.4035939574241638
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.3857288360595703
Validation loss = 0.38855937123298645
Validation loss = 0.3981006443500519
Validation loss = 0.39900267124176025
Validation loss = 0.4033842384815216
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3840312957763672
Validation loss = 0.3920362591743469
Validation loss = 0.3983229398727417
Validation loss = 0.39588087797164917
Validation loss = 0.4036003351211548
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.3810650110244751
Validation loss = 0.3890012502670288
Validation loss = 0.3910307288169861
Validation loss = 0.3987029492855072
Validation loss = 0.3998813331127167
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.38755983114242554
Validation loss = 0.3879987597465515
Validation loss = 0.3942648470401764
Validation loss = 0.3977462649345398
Validation loss = 0.4039592444896698
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 589
average number of affinization = 582.1139240506329
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 563
average number of affinization = 581.875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 598
average number of affinization = 582.074074074074
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 575
average number of affinization = 581.9878048780488
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 588
average number of affinization = 582.0602409638554
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 591
average number of affinization = 582.1666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 11.4     |
| Iteration     | 12       |
| MaximumReturn | 33.9     |
| MinimumReturn | -4.45    |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.39115625619888306
Validation loss = 0.39619454741477966
Validation loss = 0.4008356034755707
Validation loss = 0.4023086130619049
Validation loss = 0.40684953331947327
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39846253395080566
Validation loss = 0.39707255363464355
Validation loss = 0.402809202671051
Validation loss = 0.4039556384086609
Validation loss = 0.40804117918014526
Validation loss = 0.41173991560935974
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.38682883977890015
Validation loss = 0.39410310983657837
Validation loss = 0.4009031653404236
Validation loss = 0.4075126051902771
Validation loss = 0.4035522937774658
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39145007729530334
Validation loss = 0.3962549865245819
Validation loss = 0.3994351923465729
Validation loss = 0.40271633863449097
Validation loss = 0.4059028625488281
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3919665217399597
Validation loss = 0.393839031457901
Validation loss = 0.399036705493927
Validation loss = 0.4029518961906433
Validation loss = 0.40968021750450134
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 584
average number of affinization = 582.1882352941177
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 592
average number of affinization = 582.3023255813954
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 569
average number of affinization = 582.1494252873563
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 602
average number of affinization = 582.375
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 613
average number of affinization = 582.7191011235955
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 617
average number of affinization = 583.1
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.1     |
| Iteration     | 13       |
| MaximumReturn | 30.7     |
| MinimumReturn | 2.07     |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3940558135509491
Validation loss = 0.39928966760635376
Validation loss = 0.39760079979896545
Validation loss = 0.402819961309433
Validation loss = 0.4031533896923065
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39685171842575073
Validation loss = 0.4001442790031433
Validation loss = 0.4013843536376953
Validation loss = 0.4061318635940552
Validation loss = 0.40855515003204346
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.3933733403682709
Validation loss = 0.3970222473144531
Validation loss = 0.40093496441841125
Validation loss = 0.40976330637931824
Validation loss = 0.4103890359401703
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.39592063426971436
Validation loss = 0.3930750787258148
Validation loss = 0.396621435880661
Validation loss = 0.4039430022239685
Validation loss = 0.4090254604816437
Validation loss = 0.4086894690990448
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3909973204135895
Validation loss = 0.3942851722240448
Validation loss = 0.4029255211353302
Validation loss = 0.40635523200035095
Validation loss = 0.4086795151233673
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 519
average number of affinization = 582.3956043956044
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 522
average number of affinization = 581.7391304347826
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 529
average number of affinization = 581.1720430107526
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 551
average number of affinization = 580.8510638297872
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 538
average number of affinization = 580.4
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 531
average number of affinization = 579.8854166666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 17       |
| Iteration     | 14       |
| MaximumReturn | 33       |
| MinimumReturn | -1.69    |
| TotalSamples  | 64000    |
----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.3915931284427643
Validation loss = 0.4007503092288971
Validation loss = 0.4048641622066498
Validation loss = 0.40782320499420166
Validation loss = 0.41326653957366943
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.39986562728881836
Validation loss = 0.40643876791000366
Validation loss = 0.40627598762512207
Validation loss = 0.4150870442390442
Validation loss = 0.4190753996372223
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4019189774990082
Validation loss = 0.40212416648864746
Validation loss = 0.403744101524353
Validation loss = 0.4120219349861145
Validation loss = 0.41552606225013733
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.40230593085289
Validation loss = 0.4027916491031647
Validation loss = 0.4142490327358246
Validation loss = 0.40753743052482605
Validation loss = 0.4174210727214813
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4018772840499878
Validation loss = 0.4032732844352722
Validation loss = 0.40453383326530457
Validation loss = 0.407299280166626
Validation loss = 0.40810540318489075
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 534
average number of affinization = 579.4123711340206
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 528
average number of affinization = 578.8877551020408
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 544
average number of affinization = 578.5353535353536
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 536
average number of affinization = 578.11
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 520
average number of affinization = 577.5346534653465
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 534
average number of affinization = 577.1078431372549
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.1     |
| Iteration     | 15       |
| MaximumReturn | 26.3     |
| MinimumReturn | -1.11    |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40108436346054077
Validation loss = 0.4053589105606079
Validation loss = 0.40766218304634094
Validation loss = 0.4073145389556885
Validation loss = 0.4139729142189026
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4069266617298126
Validation loss = 0.4088609218597412
Validation loss = 0.41463062167167664
Validation loss = 0.40777164697647095
Validation loss = 0.41864660382270813
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40389835834503174
Validation loss = 0.4062058925628662
Validation loss = 0.4080474376678467
Validation loss = 0.4104073941707611
Validation loss = 0.4156404137611389
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.41191914677619934
Validation loss = 0.40877917408943176
Validation loss = 0.407833993434906
Validation loss = 0.41155219078063965
Validation loss = 0.41718748211860657
Validation loss = 0.4195018708705902
Validation loss = 0.41978296637535095
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40394237637519836
Validation loss = 0.40392017364501953
Validation loss = 0.4055325388908386
Validation loss = 0.40898779034614563
Validation loss = 0.4073239266872406
Validation loss = 0.4123571217060089
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 548
average number of affinization = 576.8252427184466
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 595
average number of affinization = 577.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 597
average number of affinization = 577.1904761904761
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 607
average number of affinization = 577.4716981132076
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 606
average number of affinization = 577.7383177570093
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 598
average number of affinization = 577.925925925926
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 16.4     |
| Iteration     | 16       |
| MaximumReturn | 26.8     |
| MinimumReturn | 3.32     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40139928460121155
Validation loss = 0.40852078795433044
Validation loss = 0.4096511900424957
Validation loss = 0.41225066781044006
Validation loss = 0.4161166548728943
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.408891499042511
Validation loss = 0.40864798426628113
Validation loss = 0.4109078347682953
Validation loss = 0.41425034403800964
Validation loss = 0.41733092069625854
Validation loss = 0.4216404855251312
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4066908061504364
Validation loss = 0.4075363874435425
Validation loss = 0.40819790959358215
Validation loss = 0.4123791456222534
Validation loss = 0.41832292079925537
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4132792353630066
Validation loss = 0.4135817587375641
Validation loss = 0.4160047173500061
Validation loss = 0.41784122586250305
Validation loss = 0.4202098548412323
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40588271617889404
Validation loss = 0.40912967920303345
Validation loss = 0.41360822319984436
Validation loss = 0.4158856272697449
Validation loss = 0.41693633794784546
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 564
average number of affinization = 577.7981651376147
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 607
average number of affinization = 578.0636363636364
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 594
average number of affinization = 578.2072072072073
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 531
average number of affinization = 577.7857142857143
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 614
average number of affinization = 578.1061946902655
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 560
average number of affinization = 577.9473684210526
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.9     |
| Iteration     | 17       |
| MaximumReturn | 25.3     |
| MinimumReturn | 4.12     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40455999970436096
Validation loss = 0.40874001383781433
Validation loss = 0.40992075204849243
Validation loss = 0.4120158851146698
Validation loss = 0.41741204261779785
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.40873005986213684
Validation loss = 0.4092121124267578
Validation loss = 0.4128013551235199
Validation loss = 0.41210222244262695
Validation loss = 0.42049911618232727
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.40473341941833496
Validation loss = 0.40865033864974976
Validation loss = 0.4152645170688629
Validation loss = 0.41101646423339844
Validation loss = 0.4127643406391144
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4158988893032074
Validation loss = 0.4106440842151642
Validation loss = 0.41616538166999817
Validation loss = 0.4182420074939728
Validation loss = 0.42064642906188965
Validation loss = 0.4228292405605316
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.40378597378730774
Validation loss = 0.41241785883903503
Validation loss = 0.4122500419616699
Validation loss = 0.4107077717781067
Validation loss = 0.41727060079574585
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 563
average number of affinization = 577.8173913043478
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 555
average number of affinization = 577.6206896551724
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 561
average number of affinization = 577.4786324786324
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 588
average number of affinization = 577.5677966101695
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 595
average number of affinization = 577.7142857142857
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 589
average number of affinization = 577.8083333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 12.2     |
| Iteration     | 18       |
| MaximumReturn | 28.7     |
| MinimumReturn | -1.86    |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.40634021162986755
Validation loss = 0.41283416748046875
Validation loss = 0.4125429093837738
Validation loss = 0.42144984006881714
Validation loss = 0.4225254952907562
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.41223764419555664
Validation loss = 0.41179245710372925
Validation loss = 0.4158021807670593
Validation loss = 0.4182632863521576
Validation loss = 0.4210375249385834
Validation loss = 0.42572855949401855
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4073364734649658
Validation loss = 0.41411203145980835
Validation loss = 0.4131333827972412
Validation loss = 0.4148237109184265
Validation loss = 0.42421168088912964
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.415147066116333
Validation loss = 0.4195205569267273
Validation loss = 0.4214487671852112
Validation loss = 0.4244441092014313
Validation loss = 0.42577093839645386
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4087255001068115
Validation loss = 0.4103301465511322
Validation loss = 0.41579145193099976
Validation loss = 0.41784486174583435
Validation loss = 0.4177320599555969
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 528
average number of affinization = 577.396694214876
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 543
average number of affinization = 577.1147540983607
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 564
average number of affinization = 577.0081300813008
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 551
average number of affinization = 576.7983870967741
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 561
average number of affinization = 576.672
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 523
average number of affinization = 576.2460317460317
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.3     |
| Iteration     | 19       |
| MaximumReturn | 25.6     |
| MinimumReturn | -0.973   |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4163169860839844
Validation loss = 0.41685566306114197
Validation loss = 0.4190528094768524
Validation loss = 0.4227730333805084
Validation loss = 0.42594242095947266
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4155462682247162
Validation loss = 0.4189832806587219
Validation loss = 0.42015302181243896
Validation loss = 0.42586714029312134
Validation loss = 0.42531388998031616
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.41522449254989624
Validation loss = 0.41536059975624084
Validation loss = 0.4194129407405853
Validation loss = 0.42204415798187256
Validation loss = 0.425259530544281
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4194105565547943
Validation loss = 0.4216380715370178
Validation loss = 0.4240376651287079
Validation loss = 0.4237743020057678
Validation loss = 0.4290781021118164
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4115223288536072
Validation loss = 0.41751402616500854
Validation loss = 0.41993528604507446
Validation loss = 0.42274025082588196
Validation loss = 0.4239143133163452
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 540
average number of affinization = 575.9606299212599
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 534
average number of affinization = 575.6328125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 531
average number of affinization = 575.2868217054264
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 523
average number of affinization = 574.8846153846154
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 551
average number of affinization = 574.7022900763359
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 542
average number of affinization = 574.4545454545455
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.7     |
| Iteration     | 20       |
| MaximumReturn | 26.5     |
| MinimumReturn | 2.34     |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.41642430424690247
Validation loss = 0.42000722885131836
Validation loss = 0.4201923608779907
Validation loss = 0.4302199184894562
Validation loss = 0.429721474647522
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4176171123981476
Validation loss = 0.4188937842845917
Validation loss = 0.42407578229904175
Validation loss = 0.4263920485973358
Validation loss = 0.4266100227832794
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4204351305961609
Validation loss = 0.4203256666660309
Validation loss = 0.4243950843811035
Validation loss = 0.4251800775527954
Validation loss = 0.4301008880138397
Validation loss = 0.4306291341781616
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4196944534778595
Validation loss = 0.422268271446228
Validation loss = 0.42677050828933716
Validation loss = 0.42581385374069214
Validation loss = 0.4286661148071289
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4158613085746765
Validation loss = 0.42191141843795776
Validation loss = 0.424744576215744
Validation loss = 0.42673757672309875
Validation loss = 0.42506664991378784
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 580
average number of affinization = 574.4962406015038
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 586
average number of affinization = 574.5820895522388
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 555
average number of affinization = 574.437037037037
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 579
average number of affinization = 574.4705882352941
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 557
average number of affinization = 574.3430656934306
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 581
average number of affinization = 574.3913043478261
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 13.8     |
| Iteration     | 21       |
| MaximumReturn | 25.9     |
| MinimumReturn | -0.174   |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4221118092536926
Validation loss = 0.42433962225914
Validation loss = 0.42189526557922363
Validation loss = 0.4287303388118744
Validation loss = 0.4294552505016327
Validation loss = 0.4309232532978058
Validation loss = 0.4332146942615509
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4180077612400055
Validation loss = 0.42019590735435486
Validation loss = 0.42152735590934753
Validation loss = 0.42595726251602173
Validation loss = 0.4294582009315491
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.41755905747413635
Validation loss = 0.4257321059703827
Validation loss = 0.42568856477737427
Validation loss = 0.43042299151420593
Validation loss = 0.4311986565589905
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42308342456817627
Validation loss = 0.42167168855667114
Validation loss = 0.42720797657966614
Validation loss = 0.4265649914741516
Validation loss = 0.4308434724807739
Validation loss = 0.432636022567749
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.41715240478515625
Validation loss = 0.4176158905029297
Validation loss = 0.42379018664360046
Validation loss = 0.4244216978549957
Validation loss = 0.4279576241970062
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 615
average number of affinization = 574.68345323741
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 621
average number of affinization = 575.0142857142857
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 624
average number of affinization = 575.3617021276596
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 638
average number of affinization = 575.8028169014085
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 620
average number of affinization = 576.1118881118881
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 622
average number of affinization = 576.4305555555555
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 16.1     |
| Iteration     | 22       |
| MaximumReturn | 22.3     |
| MinimumReturn | 9.94     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4232381284236908
Validation loss = 0.4241579473018646
Validation loss = 0.4282134771347046
Validation loss = 0.4293065071105957
Validation loss = 0.4277379810810089
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42183732986450195
Validation loss = 0.42052149772644043
Validation loss = 0.4246949851512909
Validation loss = 0.42730453610420227
Validation loss = 0.42714405059814453
Validation loss = 0.4312080442905426
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4196499288082123
Validation loss = 0.422435998916626
Validation loss = 0.4257725477218628
Validation loss = 0.4285808503627777
Validation loss = 0.4290177822113037
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42308309674263
Validation loss = 0.4252048432826996
Validation loss = 0.4259989261627197
Validation loss = 0.4268084466457367
Validation loss = 0.4311220645904541
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.41912147402763367
Validation loss = 0.4212181568145752
Validation loss = 0.42329660058021545
Validation loss = 0.42508020997047424
Validation loss = 0.42712318897247314
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 596
average number of affinization = 576.5655172413793
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 626
average number of affinization = 576.9041095890411
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 594
average number of affinization = 577.0204081632653
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 610
average number of affinization = 577.2432432432432
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 644
average number of affinization = 577.6912751677852
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 592
average number of affinization = 577.7866666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14       |
| Iteration     | 23       |
| MaximumReturn | 21.6     |
| MinimumReturn | -0.423   |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4216574430465698
Validation loss = 0.4214652180671692
Validation loss = 0.42660632729530334
Validation loss = 0.42586854100227356
Validation loss = 0.4311944246292114
Validation loss = 0.4305358827114105
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42124882340431213
Validation loss = 0.4218638241291046
Validation loss = 0.42463624477386475
Validation loss = 0.42542093992233276
Validation loss = 0.4303113520145416
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.420817494392395
Validation loss = 0.42402032017707825
Validation loss = 0.4260256588459015
Validation loss = 0.4268880784511566
Validation loss = 0.4280983805656433
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42315971851348877
Validation loss = 0.4240545630455017
Validation loss = 0.4264252185821533
Validation loss = 0.4282698333263397
Validation loss = 0.430678129196167
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.41988539695739746
Validation loss = 0.41978055238723755
Validation loss = 0.426492303609848
Validation loss = 0.4248387813568115
Validation loss = 0.42919033765792847
Validation loss = 0.43172600865364075
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 613
average number of affinization = 578.0198675496689
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 683
average number of affinization = 578.7105263157895
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 743
average number of affinization = 579.7843137254902
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 662
average number of affinization = 580.3181818181819
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 580
average number of affinization = 580.3161290322581
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 746
average number of affinization = 581.3782051282051
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 17.4     |
| Iteration     | 24       |
| MaximumReturn | 30       |
| MinimumReturn | 3.99     |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4268892705440521
Validation loss = 0.429212749004364
Validation loss = 0.42909687757492065
Validation loss = 0.4309930205345154
Validation loss = 0.43321216106414795
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42352357506752014
Validation loss = 0.4258778393268585
Validation loss = 0.42905357480049133
Validation loss = 0.43194660544395447
Validation loss = 0.4332544207572937
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.42343294620513916
Validation loss = 0.4280698597431183
Validation loss = 0.43075549602508545
Validation loss = 0.430480420589447
Validation loss = 0.43133223056793213
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4246046245098114
Validation loss = 0.42360788583755493
Validation loss = 0.4292293190956116
Validation loss = 0.43159717321395874
Validation loss = 0.43551477789878845
Validation loss = 0.4363463521003723
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4266759753227234
Validation loss = 0.4258417785167694
Validation loss = 0.42782366275787354
Validation loss = 0.43090832233428955
Validation loss = 0.4331119656562805
Validation loss = 0.434486448764801
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 777
average number of affinization = 582.624203821656
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 720
average number of affinization = 583.493670886076
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 792
average number of affinization = 584.8050314465409
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 682
average number of affinization = 585.4125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 804
average number of affinization = 586.7701863354038
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 788
average number of affinization = 588.0123456790124
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 10.4     |
| Iteration     | 25       |
| MaximumReturn | 17.5     |
| MinimumReturn | -1.37    |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42613035440444946
Validation loss = 0.42911258339881897
Validation loss = 0.43031278252601624
Validation loss = 0.4323372542858124
Validation loss = 0.4334518313407898
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42561352252960205
Validation loss = 0.4252665936946869
Validation loss = 0.4278896749019623
Validation loss = 0.4296431839466095
Validation loss = 0.4359844923019409
Validation loss = 0.4330122768878937
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.42450201511383057
Validation loss = 0.427031010389328
Validation loss = 0.43275725841522217
Validation loss = 0.43358418345451355
Validation loss = 0.43293851613998413
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42711493372917175
Validation loss = 0.4312474727630615
Validation loss = 0.43052181601524353
Validation loss = 0.4328300356864929
Validation loss = 0.43560847640037537
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.42784959077835083
Validation loss = 0.42342644929885864
Validation loss = 0.4279947578907013
Validation loss = 0.4301385283470154
Validation loss = 0.4364868700504303
Validation loss = 0.43412455916404724
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 738
average number of affinization = 588.9325153374233
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 703
average number of affinization = 589.6280487804878
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 765
average number of affinization = 590.6909090909091
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 651
average number of affinization = 591.0542168674699
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 695
average number of affinization = 591.6766467065868
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 667
average number of affinization = 592.125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 12.8     |
| Iteration     | 26       |
| MaximumReturn | 31.3     |
| MinimumReturn | -0.371   |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4281599223613739
Validation loss = 0.4267023503780365
Validation loss = 0.43146780133247375
Validation loss = 0.43237563967704773
Validation loss = 0.4367103576660156
Validation loss = 0.43657582998275757
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42431411147117615
Validation loss = 0.42759114503860474
Validation loss = 0.4322679340839386
Validation loss = 0.43196412920951843
Validation loss = 0.43467649817466736
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4251304268836975
Validation loss = 0.4301464259624481
Validation loss = 0.4311569333076477
Validation loss = 0.43374231457710266
Validation loss = 0.4343663156032562
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4287983477115631
Validation loss = 0.42884206771850586
Validation loss = 0.43257051706314087
Validation loss = 0.4347369968891144
Validation loss = 0.4344540536403656
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4243071377277374
Validation loss = 0.42787328362464905
Validation loss = 0.4322417378425598
Validation loss = 0.4324781000614166
Validation loss = 0.43598321080207825
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 731
average number of affinization = 592.9467455621302
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 774
average number of affinization = 594.0117647058823
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 760
average number of affinization = 594.9824561403509
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 740
average number of affinization = 595.8255813953489
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 812
average number of affinization = 597.0751445086705
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 766
average number of affinization = 598.0459770114943
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 12.3     |
| Iteration     | 27       |
| MaximumReturn | 26.5     |
| MinimumReturn | -1.5     |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.43059468269348145
Validation loss = 0.42913705110549927
Validation loss = 0.43357449769973755
Validation loss = 0.4350835680961609
Validation loss = 0.4358968138694763
Validation loss = 0.43605971336364746
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4265736937522888
Validation loss = 0.4283098876476288
Validation loss = 0.43014228343963623
Validation loss = 0.4342847764492035
Validation loss = 0.4348161220550537
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4274030327796936
Validation loss = 0.42886221408843994
Validation loss = 0.43086495995521545
Validation loss = 0.4344047009944916
Validation loss = 0.4344131648540497
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42870938777923584
Validation loss = 0.4286487102508545
Validation loss = 0.43250757455825806
Validation loss = 0.4336361289024353
Validation loss = 0.43549972772598267
Validation loss = 0.43700191378593445
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.42404523491859436
Validation loss = 0.4282863438129425
Validation loss = 0.43064361810684204
Validation loss = 0.4326253831386566
Validation loss = 0.43318817019462585
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 797
average number of affinization = 599.1828571428572
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 740
average number of affinization = 599.9829545454545
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 826
average number of affinization = 601.2598870056497
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 764
average number of affinization = 602.1741573033707
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 788
average number of affinization = 603.2122905027933
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 739
average number of affinization = 603.9666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 14.1     |
| Iteration     | 28       |
| MaximumReturn | 34.3     |
| MinimumReturn | 4.06     |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4279416501522064
Validation loss = 0.4313680827617645
Validation loss = 0.4335392713546753
Validation loss = 0.4352734386920929
Validation loss = 0.4379437565803528
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42746037244796753
Validation loss = 0.42822226881980896
Validation loss = 0.4314567744731903
Validation loss = 0.4325118660926819
Validation loss = 0.4350754916667938
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4258376359939575
Validation loss = 0.4272981584072113
Validation loss = 0.431670218706131
Validation loss = 0.4333486258983612
Validation loss = 0.43691396713256836
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42789360880851746
Validation loss = 0.43147504329681396
Validation loss = 0.4332009553909302
Validation loss = 0.43504592776298523
Validation loss = 0.43719103932380676
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4272540807723999
Validation loss = 0.42917340993881226
Validation loss = 0.43068602681159973
Validation loss = 0.43369045853614807
Validation loss = 0.4356173872947693
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 876
average number of affinization = 605.4696132596686
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 890
average number of affinization = 607.032967032967
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 765
average number of affinization = 607.896174863388
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 783
average number of affinization = 608.8478260869565
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 846
average number of affinization = 610.1297297297298
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 811
average number of affinization = 611.2096774193549
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 15.1     |
| Iteration     | 29       |
| MaximumReturn | 28.6     |
| MinimumReturn | -2.81    |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42853018641471863
Validation loss = 0.4322957396507263
Validation loss = 0.4339013993740082
Validation loss = 0.43498456478118896
Validation loss = 0.43708446621894836
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42840510606765747
Validation loss = 0.42883387207984924
Validation loss = 0.429964154958725
Validation loss = 0.433474600315094
Validation loss = 0.43648213148117065
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4295675754547119
Validation loss = 0.42859184741973877
Validation loss = 0.43237921595573425
Validation loss = 0.43582096695899963
Validation loss = 0.4343215227127075
Validation loss = 0.43564552068710327
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.43049657344818115
Validation loss = 0.43270233273506165
Validation loss = 0.43280094861984253
Validation loss = 0.43476560711860657
Validation loss = 0.4372779428958893
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4251458942890167
Validation loss = 0.42943185567855835
Validation loss = 0.43134066462516785
Validation loss = 0.43290698528289795
Validation loss = 0.4327923059463501
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 811
average number of affinization = 612.2780748663101
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 715
average number of affinization = 612.8244680851063
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 761
average number of affinization = 613.6084656084656
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 802
average number of affinization = 614.6
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 741
average number of affinization = 615.261780104712
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 755
average number of affinization = 615.9895833333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 12.9     |
| Iteration     | 30       |
| MaximumReturn | 29.9     |
| MinimumReturn | -2.98    |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.42802438139915466
Validation loss = 0.4333040714263916
Validation loss = 0.4332813322544098
Validation loss = 0.4362698197364807
Validation loss = 0.4364880323410034
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4305475354194641
Validation loss = 0.4308044910430908
Validation loss = 0.4336402118206024
Validation loss = 0.4339180588722229
Validation loss = 0.4370408058166504
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.42953479290008545
Validation loss = 0.4311394989490509
Validation loss = 0.4315486550331116
Validation loss = 0.43451496958732605
Validation loss = 0.4368600845336914
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.42884600162506104
Validation loss = 0.4304574131965637
Validation loss = 0.43330979347229004
Validation loss = 0.4356011152267456
Validation loss = 0.439020574092865
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4293668270111084
Validation loss = 0.4312379062175751
Validation loss = 0.43296748399734497
Validation loss = 0.43352583050727844
Validation loss = 0.43631863594055176
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 810
average number of affinization = 616.9948186528497
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 753
average number of affinization = 617.6958762886597
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 840
average number of affinization = 618.8358974358974
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 806
average number of affinization = 619.7908163265306
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 740
average number of affinization = 620.4010152284264
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 823
average number of affinization = 621.4242424242424
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 16       |
| Iteration     | 31       |
| MaximumReturn | 29.5     |
| MinimumReturn | -2.12    |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.43103381991386414
Validation loss = 0.432311087846756
Validation loss = 0.43459609150886536
Validation loss = 0.4356938302516937
Validation loss = 0.4381484091281891
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42977356910705566
Validation loss = 0.4318239688873291
Validation loss = 0.4325594902038574
Validation loss = 0.437220960855484
Validation loss = 0.4366074204444885
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.43032923340797424
Validation loss = 0.43579408526420593
Validation loss = 0.43619638681411743
Validation loss = 0.43620628118515015
Validation loss = 0.43694189190864563
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.43078774213790894
Validation loss = 0.4324606955051422
Validation loss = 0.4372027814388275
Validation loss = 0.43685439229011536
Validation loss = 0.43925201892852783
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4318860173225403
Validation loss = 0.4327307939529419
Validation loss = 0.4330226480960846
Validation loss = 0.4354363977909088
Validation loss = 0.43864601850509644
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 1 is 764
average number of affinization = 622.1407035175879
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 1 is 779
average number of affinization = 622.925
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 1 is 694
average number of affinization = 623.2786069651742
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 1 is 761
average number of affinization = 623.960396039604
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 1 is 693
average number of affinization = 624.3004926108374
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 1 is 749
average number of affinization = 624.9117647058823
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 12.6     |
| Iteration     | 32       |
| MaximumReturn | 32.4     |
| MinimumReturn | -3.68    |
| TotalSamples  | 136000   |
----------------------------
