Logging to experiments/hopper/nov1/w350e3_seed2231
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6875240206718445
Validation loss = 0.632596492767334
Validation loss = 0.6256214380264282
Validation loss = 0.622687578201294
Validation loss = 0.6427077054977417
Validation loss = 0.6424970030784607
Validation loss = 0.7046917676925659
Validation loss = 0.7076294422149658
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.7129368782043457
Validation loss = 0.6437629461288452
Validation loss = 0.6193255186080933
Validation loss = 0.627927303314209
Validation loss = 0.6346196532249451
Validation loss = 0.6580578088760376
Validation loss = 0.6690536737442017
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8374477624893188
Validation loss = 0.6365854144096375
Validation loss = 0.6167067289352417
Validation loss = 0.6242541074752808
Validation loss = 0.6536709070205688
Validation loss = 0.6455539464950562
Validation loss = 0.6783179044723511
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7094918489456177
Validation loss = 0.6299054622650146
Validation loss = 0.6242414712905884
Validation loss = 0.6356098055839539
Validation loss = 0.6419968605041504
Validation loss = 0.6552616357803345
Validation loss = 0.7069807052612305
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7124227285385132
Validation loss = 0.6249691247940063
Validation loss = 0.621030330657959
Validation loss = 0.6261141300201416
Validation loss = 0.6410769820213318
Validation loss = 0.661407470703125
Validation loss = 0.6842712163925171
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 197
average number of affinization = 28.142857142857142
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 100
average number of affinization = 37.125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 86
average number of affinization = 42.55555555555556
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 94
average number of affinization = 47.7
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 92
average number of affinization = 51.72727272727273
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 67
average number of affinization = 53.0
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.83e+03 |
| Iteration     | 0         |
| MaximumReturn | -1.3e+03  |
| MinimumReturn | -2.17e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6777867078781128
Validation loss = 0.6392000317573547
Validation loss = 0.6232452392578125
Validation loss = 0.6252623796463013
Validation loss = 0.6396844387054443
Validation loss = 0.6641638278961182
Validation loss = 0.6598252654075623
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6722346544265747
Validation loss = 0.6302027702331543
Validation loss = 0.624980092048645
Validation loss = 0.6206510663032532
Validation loss = 0.6267703771591187
Validation loss = 0.6423050165176392
Validation loss = 0.6640483736991882
Validation loss = 0.6747921109199524
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6656568050384521
Validation loss = 0.6240365505218506
Validation loss = 0.6191166639328003
Validation loss = 0.6196874976158142
Validation loss = 0.6345543265342712
Validation loss = 0.6436470746994019
Validation loss = 0.6593691110610962
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6685007214546204
Validation loss = 0.6249160766601562
Validation loss = 0.6117685437202454
Validation loss = 0.6289023756980896
Validation loss = 0.6348804235458374
Validation loss = 0.6353285312652588
Validation loss = 0.6578857898712158
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6778428554534912
Validation loss = 0.6318823099136353
Validation loss = 0.6208672523498535
Validation loss = 0.6257883310317993
Validation loss = 0.638857364654541
Validation loss = 0.650971531867981
Validation loss = 0.6877040863037109
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 79.6923076923077
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 106.71428571428571
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 505
average number of affinization = 133.26666666666668
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 446
average number of affinization = 152.8125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 171.58823529411765
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 452
average number of affinization = 187.16666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.25e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.09e+03 |
| MinimumReturn | -1.41e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6208152174949646
Validation loss = 0.613028347492218
Validation loss = 0.6341353058815002
Validation loss = 0.6513059735298157
Validation loss = 0.6459420323371887
Validation loss = 0.6641033291816711
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6205479502677917
Validation loss = 0.6128647923469543
Validation loss = 0.626440167427063
Validation loss = 0.6579568982124329
Validation loss = 0.6522214412689209
Validation loss = 0.6695404052734375
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6165488362312317
Validation loss = 0.603337287902832
Validation loss = 0.6174458861351013
Validation loss = 0.6503997445106506
Validation loss = 0.6338537335395813
Validation loss = 0.6560423970222473
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6155937314033508
Validation loss = 0.6177733540534973
Validation loss = 0.6198559403419495
Validation loss = 0.6234371066093445
Validation loss = 0.637911856174469
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6191883683204651
Validation loss = 0.5993692874908447
Validation loss = 0.6176387667655945
Validation loss = 0.6267299056053162
Validation loss = 0.6553484201431274
Validation loss = 0.6551973223686218
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 565
average number of affinization = 207.05263157894737
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 592
average number of affinization = 226.3
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 572
average number of affinization = 242.76190476190476
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 589
average number of affinization = 258.5
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 578
average number of affinization = 272.39130434782606
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 556
average number of affinization = 284.2083333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.19e+03 |
| Iteration     | 2         |
| MaximumReturn | -1.12e+03 |
| MinimumReturn | -1.26e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5589736700057983
Validation loss = 0.5584028959274292
Validation loss = 0.5744986534118652
Validation loss = 0.5723592042922974
Validation loss = 0.5834999084472656
Validation loss = 0.5869535803794861
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5672879219055176
Validation loss = 0.5675144195556641
Validation loss = 0.5687084794044495
Validation loss = 0.5835912227630615
Validation loss = 0.583096981048584
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5614442825317383
Validation loss = 0.5573936700820923
Validation loss = 0.5648879408836365
Validation loss = 0.5735188722610474
Validation loss = 0.5751560926437378
Validation loss = 0.5853469371795654
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5465811491012573
Validation loss = 0.5578713417053223
Validation loss = 0.5606523752212524
Validation loss = 0.5671582221984863
Validation loss = 0.5835607647895813
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5716615319252014
Validation loss = 0.5595253705978394
Validation loss = 0.5752167105674744
Validation loss = 0.580520749092102
Validation loss = 0.5860834121704102
Validation loss = 0.5886750221252441
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 535
average number of affinization = 294.24
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 556
average number of affinization = 304.3076923076923
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 534
average number of affinization = 312.81481481481484
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 531
average number of affinization = 320.60714285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 521
average number of affinization = 327.51724137931035
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 335.96666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1e+03    |
| Iteration     | 3         |
| MaximumReturn | -701      |
| MinimumReturn | -1.47e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5261995196342468
Validation loss = 0.5551033616065979
Validation loss = 0.5577372908592224
Validation loss = 0.5695667266845703
Validation loss = 0.5706239938735962
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5336261987686157
Validation loss = 0.5384771823883057
Validation loss = 0.549083411693573
Validation loss = 0.5595656633377075
Validation loss = 0.5658292174339294
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.529706597328186
Validation loss = 0.5541375875473022
Validation loss = 0.542957603931427
Validation loss = 0.5621336102485657
Validation loss = 0.5654823780059814
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5201741456985474
Validation loss = 0.5413991808891296
Validation loss = 0.5492329597473145
Validation loss = 0.5423532724380493
Validation loss = 0.5539062023162842
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5484771728515625
Validation loss = 0.5433253049850464
Validation loss = 0.5544453263282776
Validation loss = 0.5561980605125427
Validation loss = 0.5760990381240845
Validation loss = 0.5803876519203186
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 718
average number of affinization = 348.2903225806452
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 737
average number of affinization = 360.4375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 752
average number of affinization = 372.3030303030303
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 738
average number of affinization = 383.05882352941177
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 742
average number of affinization = 393.3142857142857
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 733
average number of affinization = 402.75
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.62e+03 |
| Iteration     | 4         |
| MaximumReturn | -1.56e+03 |
| MinimumReturn | -1.71e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5507583022117615
Validation loss = 0.5553264021873474
Validation loss = 0.5639901757240295
Validation loss = 0.5757027864456177
Validation loss = 0.581484854221344
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5466947555541992
Validation loss = 0.5534995198249817
Validation loss = 0.5662619471549988
Validation loss = 0.5785275101661682
Validation loss = 0.5835121273994446
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.546722412109375
Validation loss = 0.5547686219215393
Validation loss = 0.5691743493080139
Validation loss = 0.5713363885879517
Validation loss = 0.5878464579582214
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5402891635894775
Validation loss = 0.5471222996711731
Validation loss = 0.5598311424255371
Validation loss = 0.5657657980918884
Validation loss = 0.5739963054656982
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5490979552268982
Validation loss = 0.559996485710144
Validation loss = 0.5695981383323669
Validation loss = 0.583844780921936
Validation loss = 0.5848227143287659
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 591
average number of affinization = 407.8378378378378
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 602
average number of affinization = 412.94736842105266
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 587
average number of affinization = 417.4102564102564
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 593
average number of affinization = 421.8
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 599
average number of affinization = 426.1219512195122
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 574
average number of affinization = 429.64285714285717
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.04e+03 |
| Iteration     | 5         |
| MaximumReturn | -760      |
| MinimumReturn | -1.26e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5530957579612732
Validation loss = 0.5510097742080688
Validation loss = 0.561955988407135
Validation loss = 0.5733124017715454
Validation loss = 0.5810908675193787
Validation loss = 0.5864638090133667
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5555411577224731
Validation loss = 0.5551725029945374
Validation loss = 0.5670550465583801
Validation loss = 0.5728387832641602
Validation loss = 0.5748938918113708
Validation loss = 0.5825366377830505
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5451956987380981
Validation loss = 0.5557721853256226
Validation loss = 0.5618048906326294
Validation loss = 0.5729337930679321
Validation loss = 0.5803830027580261
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5407372713088989
Validation loss = 0.5434004664421082
Validation loss = 0.5521684885025024
Validation loss = 0.5688229203224182
Validation loss = 0.5732084512710571
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5510067939758301
Validation loss = 0.5505221486091614
Validation loss = 0.5674497485160828
Validation loss = 0.5700554847717285
Validation loss = 0.5952741503715515
Validation loss = 0.5831698775291443
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 433.16279069767444
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 605
average number of affinization = 437.0681818181818
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 624
average number of affinization = 441.22222222222223
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 639
average number of affinization = 445.5217391304348
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 601
average number of affinization = 448.82978723404256
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 620
average number of affinization = 452.3958333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -651     |
| Iteration     | 6        |
| MaximumReturn | -438     |
| MinimumReturn | -853     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.548471212387085
Validation loss = 0.5482897758483887
Validation loss = 0.5585453510284424
Validation loss = 0.5631364583969116
Validation loss = 0.5675450563430786
Validation loss = 0.5719467401504517
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5415019989013672
Validation loss = 0.554772675037384
Validation loss = 0.5557740926742554
Validation loss = 0.5624571442604065
Validation loss = 0.565192699432373
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5357195734977722
Validation loss = 0.541146993637085
Validation loss = 0.551866888999939
Validation loss = 0.5586772561073303
Validation loss = 0.5636831521987915
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5374566316604614
Validation loss = 0.5373468995094299
Validation loss = 0.5439040660858154
Validation loss = 0.555573582649231
Validation loss = 0.5627138614654541
Validation loss = 0.5723575949668884
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5431140661239624
Validation loss = 0.5518107414245605
Validation loss = 0.5645289421081543
Validation loss = 0.5661137104034424
Validation loss = 0.5714277625083923
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 455.0204081632653
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 591
average number of affinization = 457.74
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 615
average number of affinization = 460.8235294117647
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 572
average number of affinization = 462.96153846153845
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 585
average number of affinization = 465.2641509433962
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 598
average number of affinization = 467.72222222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -644      |
| Iteration     | 7         |
| MaximumReturn | -409      |
| MinimumReturn | -1.01e+03 |
| TotalSamples  | 36000     |
-----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5448858737945557
Validation loss = 0.5506805777549744
Validation loss = 0.5544553995132446
Validation loss = 0.5573928356170654
Validation loss = 0.5624605417251587
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5415853261947632
Validation loss = 0.541654109954834
Validation loss = 0.5526219010353088
Validation loss = 0.5543694496154785
Validation loss = 0.5584954023361206
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5481827259063721
Validation loss = 0.5428407192230225
Validation loss = 0.5438936948776245
Validation loss = 0.5570273399353027
Validation loss = 0.560300886631012
Validation loss = 0.5648630261421204
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5348225831985474
Validation loss = 0.5409784913063049
Validation loss = 0.5422701239585876
Validation loss = 0.5462133288383484
Validation loss = 0.5576077103614807
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5377435684204102
Validation loss = 0.5389841198921204
Validation loss = 0.5528128743171692
Validation loss = 0.5548020005226135
Validation loss = 0.55583655834198
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 562
average number of affinization = 469.43636363636364
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 554
average number of affinization = 470.94642857142856
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 568
average number of affinization = 472.64912280701753
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 575
average number of affinization = 474.41379310344826
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 543
average number of affinization = 475.5762711864407
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 551
average number of affinization = 476.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -667      |
| Iteration     | 8         |
| MaximumReturn | -304      |
| MinimumReturn | -1.08e+03 |
| TotalSamples  | 40000     |
-----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.541957437992096
Validation loss = 0.540377140045166
Validation loss = 0.552776038646698
Validation loss = 0.5567348599433899
Validation loss = 0.5562765598297119
Validation loss = 0.5559176802635193
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5365866422653198
Validation loss = 0.5360629558563232
Validation loss = 0.5472602248191833
Validation loss = 0.5457166433334351
Validation loss = 0.5508993864059448
Validation loss = 0.5532950758934021
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5396276712417603
Validation loss = 0.5346662998199463
Validation loss = 0.543845534324646
Validation loss = 0.5533068180084229
Validation loss = 0.5577341914176941
Validation loss = 0.5525861382484436
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5382071733474731
Validation loss = 0.5372400879859924
Validation loss = 0.5409839749336243
Validation loss = 0.5543394088745117
Validation loss = 0.5530253648757935
Validation loss = 0.5534595847129822
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5320514440536499
Validation loss = 0.5390561819076538
Validation loss = 0.5410625338554382
Validation loss = 0.5430446863174438
Validation loss = 0.5545477867126465
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 578
average number of affinization = 478.4918032786885
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 603
average number of affinization = 480.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 569
average number of affinization = 481.9047619047619
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 610
average number of affinization = 483.90625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 594
average number of affinization = 485.6
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 570
average number of affinization = 486.8787878787879
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -235     |
| Iteration     | 9        |
| MaximumReturn | 132      |
| MinimumReturn | -503     |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5444819927215576
Validation loss = 0.5321409106254578
Validation loss = 0.5383979678153992
Validation loss = 0.5401477813720703
Validation loss = 0.5446515083312988
Validation loss = 0.5527040958404541
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5312196612358093
Validation loss = 0.5324113368988037
Validation loss = 0.5344905853271484
Validation loss = 0.5392641425132751
Validation loss = 0.5461595058441162
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5334566235542297
Validation loss = 0.5366476774215698
Validation loss = 0.5316487550735474
Validation loss = 0.5444402098655701
Validation loss = 0.5428807735443115
Validation loss = 0.5474942922592163
Validation loss = 0.5467311143875122
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5268552303314209
Validation loss = 0.5267990827560425
Validation loss = 0.5328049659729004
Validation loss = 0.5430047512054443
Validation loss = 0.5430594086647034
Validation loss = 0.5440546870231628
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5249988436698914
Validation loss = 0.5301398634910583
Validation loss = 0.5388054251670837
Validation loss = 0.5394458174705505
Validation loss = 0.5393950939178467
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 647
average number of affinization = 489.2686567164179
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 664
average number of affinization = 491.8382352941176
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 603
average number of affinization = 493.4492753623188
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 668
average number of affinization = 495.9428571428571
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 655
average number of affinization = 498.1830985915493
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 631
average number of affinization = 500.02777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -369      |
| Iteration     | 10        |
| MaximumReturn | 296       |
| MinimumReturn | -1.07e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5321569442749023
Validation loss = 0.5334903597831726
Validation loss = 0.5366940498352051
Validation loss = 0.5422148108482361
Validation loss = 0.5454193949699402
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5301371216773987
Validation loss = 0.5295341610908508
Validation loss = 0.5384917259216309
Validation loss = 0.5382634401321411
Validation loss = 0.5410850644111633
Validation loss = 0.5427611470222473
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5322042107582092
Validation loss = 0.5345588326454163
Validation loss = 0.5387750864028931
Validation loss = 0.5373590588569641
Validation loss = 0.5380756258964539
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5276665091514587
Validation loss = 0.5355351567268372
Validation loss = 0.5342356562614441
Validation loss = 0.5417423248291016
Validation loss = 0.5388748049736023
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5245445370674133
Validation loss = 0.5233511924743652
Validation loss = 0.5323547124862671
Validation loss = 0.5367533564567566
Validation loss = 0.5394776463508606
Validation loss = 0.5433681607246399
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 627
average number of affinization = 501.7671232876712
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 606
average number of affinization = 503.1756756756757
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 608
average number of affinization = 504.5733333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 586
average number of affinization = 505.64473684210526
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 578
average number of affinization = 506.5844155844156
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 606
average number of affinization = 507.85897435897436
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -277      |
| Iteration     | 11        |
| MaximumReturn | 486       |
| MinimumReturn | -1.61e+03 |
| TotalSamples  | 52000     |
-----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.527954638004303
Validation loss = 0.5324178338050842
Validation loss = 0.546294629573822
Validation loss = 0.5355247259140015
Validation loss = 0.5463125109672546
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5317668914794922
Validation loss = 0.5263564586639404
Validation loss = 0.5310800075531006
Validation loss = 0.5360397696495056
Validation loss = 0.5424047708511353
Validation loss = 0.5424460172653198
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5302059650421143
Validation loss = 0.5273347496986389
Validation loss = 0.5323525667190552
Validation loss = 0.5354361534118652
Validation loss = 0.5367825031280518
Validation loss = 0.5343877077102661
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5338872671127319
Validation loss = 0.5262414216995239
Validation loss = 0.529758632183075
Validation loss = 0.5344948172569275
Validation loss = 0.5358010530471802
Validation loss = 0.5380880236625671
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5331516861915588
Validation loss = 0.5289443135261536
Validation loss = 0.5295401215553284
Validation loss = 0.5365466475486755
Validation loss = 0.5387793779373169
Validation loss = 0.5375704169273376
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 508.1645569620253
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 618
average number of affinization = 509.5375
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 621
average number of affinization = 510.91358024691357
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 548
average number of affinization = 511.3658536585366
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 591
average number of affinization = 512.3253012048193
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 628
average number of affinization = 513.702380952381
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -360      |
| Iteration     | 12        |
| MaximumReturn | 407       |
| MinimumReturn | -1.41e+03 |
| TotalSamples  | 56000     |
-----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5293812155723572
Validation loss = 0.5341025590896606
Validation loss = 0.5346093773841858
Validation loss = 0.5395164489746094
Validation loss = 0.5421720743179321
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5375960469245911
Validation loss = 0.5299580693244934
Validation loss = 0.5349385738372803
Validation loss = 0.5387886762619019
Validation loss = 0.5422312617301941
Validation loss = 0.5410169959068298
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5322785973548889
Validation loss = 0.528114378452301
Validation loss = 0.5319573283195496
Validation loss = 0.5325224995613098
Validation loss = 0.5378584265708923
Validation loss = 0.5372396111488342
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5318935513496399
Validation loss = 0.5302921533584595
Validation loss = 0.5342674851417542
Validation loss = 0.5307437777519226
Validation loss = 0.5392281413078308
Validation loss = 0.5356337428092957
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5298563241958618
Validation loss = 0.5259162783622742
Validation loss = 0.5303652882575989
Validation loss = 0.5372483134269714
Validation loss = 0.538899838924408
Validation loss = 0.540313184261322
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 634
average number of affinization = 515.1176470588235
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 601
average number of affinization = 516.1162790697674
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 624
average number of affinization = 517.3563218390805
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 536
average number of affinization = 517.5681818181819
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 522
average number of affinization = 517.6179775280899
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 603
average number of affinization = 518.5666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -202     |
| Iteration     | 13       |
| MaximumReturn | 547      |
| MinimumReturn | -1.3e+03 |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5233533382415771
Validation loss = 0.5254804491996765
Validation loss = 0.5312439203262329
Validation loss = 0.5325505137443542
Validation loss = 0.5327749252319336
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5266441702842712
Validation loss = 0.5261350870132446
Validation loss = 0.5288251638412476
Validation loss = 0.5313918590545654
Validation loss = 0.5378172993659973
Validation loss = 0.5341833829879761
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5231096744537354
Validation loss = 0.5195109844207764
Validation loss = 0.5296761989593506
Validation loss = 0.5294508337974548
Validation loss = 0.5336268544197083
Validation loss = 0.5326101779937744
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5217535495758057
Validation loss = 0.5236508846282959
Validation loss = 0.5260902643203735
Validation loss = 0.5295048356056213
Validation loss = 0.5311791300773621
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5270894169807434
Validation loss = 0.5250481963157654
Validation loss = 0.526826798915863
Validation loss = 0.5330767035484314
Validation loss = 0.5298696756362915
Validation loss = 0.5334005951881409
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 632
average number of affinization = 519.8131868131868
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 556
average number of affinization = 520.2065217391304
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 571
average number of affinization = 520.752688172043
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 651
average number of affinization = 522.1382978723404
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 606
average number of affinization = 523.021052631579
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 614
average number of affinization = 523.96875
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -125      |
| Iteration     | 14        |
| MaximumReturn | 576       |
| MinimumReturn | -1.31e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5233899354934692
Validation loss = 0.5177021026611328
Validation loss = 0.5251874923706055
Validation loss = 0.5307415127754211
Validation loss = 0.5320727825164795
Validation loss = 0.5340204834938049
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5262755155563354
Validation loss = 0.5238667130470276
Validation loss = 0.5241470336914062
Validation loss = 0.5262263417243958
Validation loss = 0.5293728113174438
Validation loss = 0.5314376354217529
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5246127843856812
Validation loss = 0.5250692367553711
Validation loss = 0.5229071378707886
Validation loss = 0.5246597528457642
Validation loss = 0.5272647738456726
Validation loss = 0.525657057762146
Validation loss = 0.5290756225585938
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5205072164535522
Validation loss = 0.5249461531639099
Validation loss = 0.5216524600982666
Validation loss = 0.5237210988998413
Validation loss = 0.5276482105255127
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5281871557235718
Validation loss = 0.5207054615020752
Validation loss = 0.5229912996292114
Validation loss = 0.52586829662323
Validation loss = 0.5260189771652222
Validation loss = 0.5290603041648865
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 566
average number of affinization = 524.4020618556701
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 608
average number of affinization = 525.2551020408164
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 644
average number of affinization = 526.4545454545455
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 527.71
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 649
average number of affinization = 528.9108910891089
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 693
average number of affinization = 530.5196078431372
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 87.7      |
| Iteration     | 15        |
| MaximumReturn | 764       |
| MinimumReturn | -1.22e+03 |
| TotalSamples  | 68000     |
-----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.521289050579071
Validation loss = 0.5141918063163757
Validation loss = 0.5172713994979858
Validation loss = 0.5196211338043213
Validation loss = 0.5228588581085205
Validation loss = 0.5224059224128723
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5218155384063721
Validation loss = 0.513971745967865
Validation loss = 0.5183568000793457
Validation loss = 0.5162965059280396
Validation loss = 0.5193402171134949
Validation loss = 0.5194264054298401
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5180461406707764
Validation loss = 0.5118646025657654
Validation loss = 0.514349102973938
Validation loss = 0.5149303674697876
Validation loss = 0.5200276970863342
Validation loss = 0.5198371410369873
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5171236991882324
Validation loss = 0.5133014917373657
Validation loss = 0.5162138938903809
Validation loss = 0.5150186419487
Validation loss = 0.5182684659957886
Validation loss = 0.5221313238143921
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5196487307548523
Validation loss = 0.5094731450080872
Validation loss = 0.5149945020675659
Validation loss = 0.5159820318222046
Validation loss = 0.5199209451675415
Validation loss = 0.5173721313476562
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 696
average number of affinization = 532.1262135922331
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 637
average number of affinization = 533.1346153846154
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 699
average number of affinization = 534.7142857142857
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 685
average number of affinization = 536.1320754716982
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 690
average number of affinization = 537.5700934579439
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 660
average number of affinization = 538.7037037037037
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 187      |
| Iteration     | 16       |
| MaximumReturn | 611      |
| MinimumReturn | -461     |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5118277072906494
Validation loss = 0.5078099966049194
Validation loss = 0.511317789554596
Validation loss = 0.5122986435890198
Validation loss = 0.5125365853309631
Validation loss = 0.514315128326416
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5075901746749878
Validation loss = 0.5082151293754578
Validation loss = 0.5076426267623901
Validation loss = 0.5119993090629578
Validation loss = 0.5137779712677002
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5053579807281494
Validation loss = 0.503968358039856
Validation loss = 0.5068331956863403
Validation loss = 0.5144543051719666
Validation loss = 0.5130488872528076
Validation loss = 0.5106524229049683
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5090816617012024
Validation loss = 0.5056055188179016
Validation loss = 0.5076113939285278
Validation loss = 0.5077552795410156
Validation loss = 0.5093770623207092
Validation loss = 0.5149109959602356
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5099847316741943
Validation loss = 0.5051339864730835
Validation loss = 0.5067092180252075
Validation loss = 0.512870192527771
Validation loss = 0.5123106837272644
Validation loss = 0.511949360370636
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 703
average number of affinization = 540.2110091743119
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 682
average number of affinization = 541.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 599
average number of affinization = 542.018018018018
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 625
average number of affinization = 542.7589285714286
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 537
average number of affinization = 542.70796460177
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 559
average number of affinization = 542.8508771929825
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -718      |
| Iteration     | 17        |
| MaximumReturn | 255       |
| MinimumReturn | -2.19e+03 |
| TotalSamples  | 76000     |
-----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5153279304504395
Validation loss = 0.5110911130905151
Validation loss = 0.5139561295509338
Validation loss = 0.5167105793952942
Validation loss = 0.5136521458625793
Validation loss = 0.5144558548927307
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.515234649181366
Validation loss = 0.5109706521034241
Validation loss = 0.5120704770088196
Validation loss = 0.5165565609931946
Validation loss = 0.5147162675857544
Validation loss = 0.5198243260383606
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5090367197990417
Validation loss = 0.5051209330558777
Validation loss = 0.5097829699516296
Validation loss = 0.5149862766265869
Validation loss = 0.5131741762161255
Validation loss = 0.5126972198486328
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5163263082504272
Validation loss = 0.5098932981491089
Validation loss = 0.5109496712684631
Validation loss = 0.5162020325660706
Validation loss = 0.5211572647094727
Validation loss = 0.5209423303604126
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5146864652633667
Validation loss = 0.5099377036094666
Validation loss = 0.5102730989456177
Validation loss = 0.5152005553245544
Validation loss = 0.5173102021217346
Validation loss = 0.5166489481925964
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 543.6086956521739
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 667
average number of affinization = 544.6724137931035
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 670
average number of affinization = 545.7435897435897
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 681
average number of affinization = 546.8898305084746
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 594
average number of affinization = 547.2857142857143
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 707
average number of affinization = 548.6166666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -733      |
| Iteration     | 18        |
| MaximumReturn | 577       |
| MinimumReturn | -2.02e+03 |
| TotalSamples  | 80000     |
-----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5170904994010925
Validation loss = 0.5135849714279175
Validation loss = 0.5181950330734253
Validation loss = 0.5217245817184448
Validation loss = 0.5210269093513489
Validation loss = 0.519878089427948
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.513792872428894
Validation loss = 0.5135568380355835
Validation loss = 0.515294075012207
Validation loss = 0.5195836424827576
Validation loss = 0.5191193222999573
Validation loss = 0.518362820148468
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.518216609954834
Validation loss = 0.5096691846847534
Validation loss = 0.5197254419326782
Validation loss = 0.5236569046974182
Validation loss = 0.5232076048851013
Validation loss = 0.52167147397995
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5170531272888184
Validation loss = 0.5204955339431763
Validation loss = 0.5212641954421997
Validation loss = 0.5236456394195557
Validation loss = 0.5255881547927856
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.516220211982727
Validation loss = 0.5138407945632935
Validation loss = 0.5167564153671265
Validation loss = 0.5222872495651245
Validation loss = 0.5237780809402466
Validation loss = 0.5266034007072449
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 704
average number of affinization = 549.900826446281
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 550.5573770491803
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 645
average number of affinization = 551.3252032520326
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 729
average number of affinization = 552.758064516129
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 711
average number of affinization = 554.024
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 727
average number of affinization = 555.3968253968254
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -277      |
| Iteration     | 19        |
| MaximumReturn | 612       |
| MinimumReturn | -2.25e+03 |
| TotalSamples  | 84000     |
-----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5127852559089661
Validation loss = 0.5083293318748474
Validation loss = 0.5139703750610352
Validation loss = 0.5200304985046387
Validation loss = 0.5182318687438965
Validation loss = 0.5191887021064758
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5164735913276672
Validation loss = 0.5104931592941284
Validation loss = 0.5121424198150635
Validation loss = 0.516158401966095
Validation loss = 0.5159958600997925
Validation loss = 0.5138665437698364
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5183014869689941
Validation loss = 0.5168921947479248
Validation loss = 0.5199687480926514
Validation loss = 0.5206145644187927
Validation loss = 0.524125874042511
Validation loss = 0.5220562815666199
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5242471098899841
Validation loss = 0.5130753517150879
Validation loss = 0.5192189812660217
Validation loss = 0.5258913040161133
Validation loss = 0.5220533013343811
Validation loss = 0.5265241265296936
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5175371170043945
Validation loss = 0.5166856050491333
Validation loss = 0.5163739323616028
Validation loss = 0.5219086408615112
Validation loss = 0.523931622505188
Validation loss = 0.5230681300163269
Validation loss = 0.5256403684616089
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 683
average number of affinization = 556.4015748031496
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 556.9765625
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 779
average number of affinization = 558.6976744186046
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 689
average number of affinization = 559.7
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 656
average number of affinization = 560.4351145038167
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 696
average number of affinization = 561.4621212121212
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -658      |
| Iteration     | 20        |
| MaximumReturn | 486       |
| MinimumReturn | -2.07e+03 |
| TotalSamples  | 88000     |
-----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5164954662322998
Validation loss = 0.5161600708961487
Validation loss = 0.5178084373474121
Validation loss = 0.5176056623458862
Validation loss = 0.5188000798225403
Validation loss = 0.518376350402832
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5130822658538818
Validation loss = 0.5105592012405396
Validation loss = 0.5167351365089417
Validation loss = 0.5158862471580505
Validation loss = 0.513474702835083
Validation loss = 0.5194892287254333
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5192150473594666
Validation loss = 0.5175414681434631
Validation loss = 0.5211140513420105
Validation loss = 0.5215893387794495
Validation loss = 0.5265517830848694
Validation loss = 0.5241312980651855
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5175041556358337
Validation loss = 0.5218352675437927
Validation loss = 0.5222911238670349
Validation loss = 0.5252745747566223
Validation loss = 0.5248076915740967
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5152342915534973
Validation loss = 0.5189507007598877
Validation loss = 0.525719404220581
Validation loss = 0.5243675708770752
Validation loss = 0.5248114466667175
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 700
average number of affinization = 562.5037593984962
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 686
average number of affinization = 563.4253731343283
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 642
average number of affinization = 564.0074074074074
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 675
average number of affinization = 564.8235294117648
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 696
average number of affinization = 565.7810218978102
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 687
average number of affinization = 566.6594202898551
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 503      |
| Iteration     | 21       |
| MaximumReturn | 682      |
| MinimumReturn | 379      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5070962309837341
Validation loss = 0.5100088119506836
Validation loss = 0.5083690881729126
Validation loss = 0.5132209062576294
Validation loss = 0.5130483508110046
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5125540494918823
Validation loss = 0.5050419569015503
Validation loss = 0.5089015364646912
Validation loss = 0.5114756226539612
Validation loss = 0.5087615251541138
Validation loss = 0.5126793384552002
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5161585807800293
Validation loss = 0.5123821496963501
Validation loss = 0.5167310237884521
Validation loss = 0.5175742506980896
Validation loss = 0.5195242166519165
Validation loss = 0.5226454734802246
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5200759768486023
Validation loss = 0.5141971707344055
Validation loss = 0.5243905186653137
Validation loss = 0.5251260995864868
Validation loss = 0.524696409702301
Validation loss = 0.5256789326667786
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.518403947353363
Validation loss = 0.5181111693382263
Validation loss = 0.5206891894340515
Validation loss = 0.5250633358955383
Validation loss = 0.5253796577453613
Validation loss = 0.529336154460907
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 708
average number of affinization = 567.6762589928057
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 652
average number of affinization = 568.2785714285715
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 668
average number of affinization = 568.9858156028369
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 673
average number of affinization = 569.7183098591549
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 684
average number of affinization = 570.5174825174826
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 696
average number of affinization = 571.3888888888889
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 450      |
| Iteration     | 22       |
| MaximumReturn | 1.01e+03 |
| MinimumReturn | -147     |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5115501880645752
Validation loss = 0.5059876441955566
Validation loss = 0.5098231434822083
Validation loss = 0.514009416103363
Validation loss = 0.5094047784805298
Validation loss = 0.5119139552116394
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5042095184326172
Validation loss = 0.5051447749137878
Validation loss = 0.5074350237846375
Validation loss = 0.5096161365509033
Validation loss = 0.5068356990814209
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5146253705024719
Validation loss = 0.5110445618629456
Validation loss = 0.5129263401031494
Validation loss = 0.5189897418022156
Validation loss = 0.5213594436645508
Validation loss = 0.5201015472412109
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.517770528793335
Validation loss = 0.5202939510345459
Validation loss = 0.5221380591392517
Validation loss = 0.5258931517601013
Validation loss = 0.5216900706291199
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5241835713386536
Validation loss = 0.5213778018951416
Validation loss = 0.5249602794647217
Validation loss = 0.5258510112762451
Validation loss = 0.5268476605415344
Validation loss = 0.5290994644165039
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 644
average number of affinization = 571.8896551724138
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 680
average number of affinization = 572.6301369863014
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 653
average number of affinization = 573.1768707482993
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 650
average number of affinization = 573.6959459459459
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 709
average number of affinization = 574.6040268456376
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 658
average number of affinization = 575.16
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -943      |
| Iteration     | 23        |
| MaximumReturn | 303       |
| MinimumReturn | -2.17e+03 |
| TotalSamples  | 100000    |
-----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.49972420930862427
Validation loss = 0.49678540229797363
Validation loss = 0.5001704692840576
Validation loss = 0.49870991706848145
Validation loss = 0.4964712858200073
Validation loss = 0.4984487295150757
Validation loss = 0.501289427280426
Validation loss = 0.5022034645080566
Validation loss = 0.5035890340805054
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4956590533256531
Validation loss = 0.49525943398475647
Validation loss = 0.4966731667518616
Validation loss = 0.501862108707428
Validation loss = 0.5022844076156616
Validation loss = 0.5027071833610535
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4986467659473419
Validation loss = 0.49532830715179443
Validation loss = 0.4994094967842102
Validation loss = 0.497462660074234
Validation loss = 0.5013232231140137
Validation loss = 0.5016106367111206
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.500281035900116
Validation loss = 0.4941929876804352
Validation loss = 0.5010595917701721
Validation loss = 0.5008912086486816
Validation loss = 0.49952203035354614
Validation loss = 0.5010932683944702
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5004059076309204
Validation loss = 0.4957459568977356
Validation loss = 0.49705860018730164
Validation loss = 0.5030917525291443
Validation loss = 0.5000123977661133
Validation loss = 0.5019277930259705
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 693
average number of affinization = 575.9403973509934
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 693
average number of affinization = 576.7105263157895
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 699
average number of affinization = 577.5098039215686
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 645
average number of affinization = 577.9480519480519
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 662
average number of affinization = 578.4903225806452
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 665
average number of affinization = 579.0448717948718
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -84.1     |
| Iteration     | 24        |
| MaximumReturn | 765       |
| MinimumReturn | -1.59e+03 |
| TotalSamples  | 104000    |
-----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.49655869603157043
Validation loss = 0.4930570721626282
Validation loss = 0.4946208596229553
Validation loss = 0.49859270453453064
Validation loss = 0.49611392617225647
Validation loss = 0.4971906840801239
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.49910613894462585
Validation loss = 0.49320366978645325
Validation loss = 0.49600812792778015
Validation loss = 0.49806052446365356
Validation loss = 0.49834245443344116
Validation loss = 0.4983908534049988
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4991712272167206
Validation loss = 0.4917217791080475
Validation loss = 0.4951249957084656
Validation loss = 0.4964611828327179
Validation loss = 0.4980812072753906
Validation loss = 0.4990493059158325
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.49512752890586853
Validation loss = 0.4938446581363678
Validation loss = 0.4971621334552765
Validation loss = 0.49711692333221436
Validation loss = 0.4970059096813202
Validation loss = 0.5001882314682007
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.498412162065506
Validation loss = 0.49335548281669617
Validation loss = 0.49835437536239624
Validation loss = 0.49593597650527954
Validation loss = 0.49949324131011963
Validation loss = 0.4990484118461609
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 614
average number of affinization = 579.2675159235669
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 653
average number of affinization = 579.7341772151899
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 678
average number of affinization = 580.3522012578617
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 700
average number of affinization = 581.1
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 666
average number of affinization = 581.6273291925465
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 686
average number of affinization = 582.2716049382716
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 336      |
| Iteration     | 25       |
| MaximumReturn | 876      |
| MinimumReturn | -308     |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.49284303188323975
Validation loss = 0.49070125818252563
Validation loss = 0.49188607931137085
Validation loss = 0.4941547214984894
Validation loss = 0.4956222474575043
Validation loss = 0.49306395649909973
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4928564131259918
Validation loss = 0.49325618147850037
Validation loss = 0.49254152178764343
Validation loss = 0.4957069754600525
Validation loss = 0.492811918258667
Validation loss = 0.4979211390018463
Validation loss = 0.496194988489151
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.49149489402770996
Validation loss = 0.4881017506122589
Validation loss = 0.4927799701690674
Validation loss = 0.4927382171154022
Validation loss = 0.4938707947731018
Validation loss = 0.4933689832687378
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.494296133518219
Validation loss = 0.488136887550354
Validation loss = 0.4929896891117096
Validation loss = 0.49292677640914917
Validation loss = 0.49469447135925293
Validation loss = 0.49563106894493103
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.49318167567253113
Validation loss = 0.48929399251937866
Validation loss = 0.4917609691619873
Validation loss = 0.49484983086586
Validation loss = 0.49393585324287415
Validation loss = 0.4944376051425934
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 636
average number of affinization = 582.601226993865
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 683
average number of affinization = 583.2134146341464
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 653
average number of affinization = 583.6363636363636
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 660
average number of affinization = 584.0963855421687
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 659
average number of affinization = 584.5449101796407
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 646
average number of affinization = 584.9107142857143
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 129      |
| Iteration     | 26       |
| MaximumReturn | 711      |
| MinimumReturn | -272     |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4908633828163147
Validation loss = 0.4868747293949127
Validation loss = 0.4903091788291931
Validation loss = 0.4910108149051666
Validation loss = 0.4921606481075287
Validation loss = 0.49383407831192017
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4909505844116211
Validation loss = 0.48970428109169006
Validation loss = 0.4901118874549866
Validation loss = 0.49199554324150085
Validation loss = 0.49441763758659363
Validation loss = 0.4960145950317383
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.49062055349349976
Validation loss = 0.4890539348125458
Validation loss = 0.4893309772014618
Validation loss = 0.49283459782600403
Validation loss = 0.49071359634399414
Validation loss = 0.48990127444267273
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.489125519990921
Validation loss = 0.4883670508861542
Validation loss = 0.4914548099040985
Validation loss = 0.492177277803421
Validation loss = 0.4936380386352539
Validation loss = 0.4938540756702423
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.49180760979652405
Validation loss = 0.4897029995918274
Validation loss = 0.4914761483669281
Validation loss = 0.49193769693374634
Validation loss = 0.49243611097335815
Validation loss = 0.49280768632888794
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 620
average number of affinization = 585.1183431952662
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 649
average number of affinization = 585.4941176470588
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 627
average number of affinization = 585.7368421052631
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 687
average number of affinization = 586.3255813953489
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 725
average number of affinization = 587.1271676300578
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 670
average number of affinization = 587.6034482758621
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -411      |
| Iteration     | 27        |
| MaximumReturn | 574       |
| MinimumReturn | -1.44e+03 |
| TotalSamples  | 116000    |
-----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.48970070481300354
Validation loss = 0.48766449093818665
Validation loss = 0.4881206750869751
Validation loss = 0.4900680184364319
Validation loss = 0.4909369647502899
Validation loss = 0.4913301467895508
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4937457740306854
Validation loss = 0.48831769824028015
Validation loss = 0.49007490277290344
Validation loss = 0.49086177349090576
Validation loss = 0.4925152361392975
Validation loss = 0.4915349781513214
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4884710907936096
Validation loss = 0.4873489737510681
Validation loss = 0.4882364869117737
Validation loss = 0.4905995726585388
Validation loss = 0.48928380012512207
Validation loss = 0.49124231934547424
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.49022382497787476
Validation loss = 0.4865773320198059
Validation loss = 0.48935467004776
Validation loss = 0.49313393235206604
Validation loss = 0.4920322895050049
Validation loss = 0.49381446838378906
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.49129417538642883
Validation loss = 0.48864614963531494
Validation loss = 0.4886026978492737
Validation loss = 0.49050331115722656
Validation loss = 0.4904250502586365
Validation loss = 0.4931116998195648
Validation loss = 0.4937169551849365
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 630
average number of affinization = 587.8457142857143
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 645
average number of affinization = 588.1704545454545
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 704
average number of affinization = 588.8248587570622
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 690
average number of affinization = 589.3932584269663
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 717
average number of affinization = 590.1061452513967
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 689
average number of affinization = 590.6555555555556
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 324       |
| Iteration     | 28        |
| MaximumReturn | 883       |
| MinimumReturn | -1.04e+03 |
| TotalSamples  | 120000    |
-----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.48956596851348877
Validation loss = 0.48392218351364136
Validation loss = 0.48817628622055054
Validation loss = 0.4883912205696106
Validation loss = 0.4891182482242584
Validation loss = 0.4878884553909302
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.48682767152786255
Validation loss = 0.48609939217567444
Validation loss = 0.4879797697067261
Validation loss = 0.49135780334472656
Validation loss = 0.4911763072013855
Validation loss = 0.48983848094940186
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4886714518070221
Validation loss = 0.484037309885025
Validation loss = 0.4898057281970978
Validation loss = 0.488160103559494
Validation loss = 0.488228440284729
Validation loss = 0.4890255928039551
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4908955693244934
Validation loss = 0.48769626021385193
Validation loss = 0.48804202675819397
Validation loss = 0.4910633862018585
Validation loss = 0.49136701226234436
Validation loss = 0.48966485261917114
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.48917022347450256
Validation loss = 0.4851769804954529
Validation loss = 0.48825743794441223
Validation loss = 0.4894787073135376
Validation loss = 0.4905548393726349
Validation loss = 0.4949973225593567
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 699
average number of affinization = 591.2541436464088
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 717
average number of affinization = 591.945054945055
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 669
average number of affinization = 592.3661202185792
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 682
average number of affinization = 592.8532608695652
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 671
average number of affinization = 593.2756756756756
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 735
average number of affinization = 594.0376344086021
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -285      |
| Iteration     | 29        |
| MaximumReturn | 350       |
| MinimumReturn | -1.52e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.48755016922950745
Validation loss = 0.4852543771266937
Validation loss = 0.4858578145503998
Validation loss = 0.4886723458766937
Validation loss = 0.48897024989128113
Validation loss = 0.4899202883243561
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4857403337955475
Validation loss = 0.48656603693962097
Validation loss = 0.48739394545555115
Validation loss = 0.49099722504615784
Validation loss = 0.4938402473926544
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.48358622193336487
Validation loss = 0.48395663499832153
Validation loss = 0.488609254360199
Validation loss = 0.487525999546051
Validation loss = 0.48832401633262634
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.48541775345802307
Validation loss = 0.48532044887542725
Validation loss = 0.4874703586101532
Validation loss = 0.4882798492908478
Validation loss = 0.4914265275001526
Validation loss = 0.4909559488296509
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4852093458175659
Validation loss = 0.4848533272743225
Validation loss = 0.48647820949554443
Validation loss = 0.48831993341445923
Validation loss = 0.49059364199638367
Validation loss = 0.49015676975250244
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 673
average number of affinization = 594.4598930481284
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 640
average number of affinization = 594.7021276595744
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 627
average number of affinization = 594.8730158730159
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 692
average number of affinization = 595.3842105263158
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 721
average number of affinization = 596.041884816754
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 750
average number of affinization = 596.84375
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 52.9      |
| Iteration     | 30        |
| MaximumReturn | 864       |
| MinimumReturn | -1.44e+03 |
| TotalSamples  | 128000    |
-----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4838019609451294
Validation loss = 0.48248159885406494
Validation loss = 0.48670926690101624
Validation loss = 0.4857347905635834
Validation loss = 0.4884154498577118
Validation loss = 0.4878351390361786
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.49177995324134827
Validation loss = 0.48570653796195984
Validation loss = 0.4897502362728119
Validation loss = 0.4877469539642334
Validation loss = 0.4884563684463501
Validation loss = 0.4903452396392822
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4894492030143738
Validation loss = 0.4817357361316681
Validation loss = 0.4867308437824249
Validation loss = 0.48493584990501404
Validation loss = 0.48631367087364197
Validation loss = 0.4860260486602783
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.48845648765563965
Validation loss = 0.48589858412742615
Validation loss = 0.486796498298645
Validation loss = 0.49007296562194824
Validation loss = 0.4898868501186371
Validation loss = 0.4866098463535309
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.48578158020973206
Validation loss = 0.48216569423675537
Validation loss = 0.48722559213638306
Validation loss = 0.4878010153770447
Validation loss = 0.4867570400238037
Validation loss = 0.48984864354133606
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 722
average number of affinization = 597.4922279792746
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 699
average number of affinization = 598.0154639175257
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 632
average number of affinization = 598.1897435897436
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 677
average number of affinization = 598.5918367346939
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 703
average number of affinization = 599.1218274111675
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 708
average number of affinization = 599.6717171717172
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 8.73      |
| Iteration     | 31        |
| MaximumReturn | 517       |
| MinimumReturn | -1.11e+03 |
| TotalSamples  | 132000    |
-----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4850511848926544
Validation loss = 0.48332464694976807
Validation loss = 0.48335427045822144
Validation loss = 0.4859447479248047
Validation loss = 0.4887446165084839
Validation loss = 0.4861040413379669
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.48611733317375183
Validation loss = 0.4841017723083496
Validation loss = 0.48689141869544983
Validation loss = 0.48698297142982483
Validation loss = 0.48829036951065063
Validation loss = 0.4879113435745239
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4814944267272949
Validation loss = 0.4833664894104004
Validation loss = 0.4847702383995056
Validation loss = 0.48649752140045166
Validation loss = 0.4860580265522003
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4838889241218567
Validation loss = 0.4832000434398651
Validation loss = 0.48568910360336304
Validation loss = 0.48793619871139526
Validation loss = 0.487797349691391
Validation loss = 0.48825207352638245
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4837699234485626
Validation loss = 0.48283085227012634
Validation loss = 0.48506471514701843
Validation loss = 0.486944317817688
Validation loss = 0.48755818605422974
Validation loss = 0.48751190304756165
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 748
average number of affinization = 600.4170854271357
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 735
average number of affinization = 601.09
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 731
average number of affinization = 601.7363184079602
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 690
average number of affinization = 602.1732673267327
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 646
average number of affinization = 602.3891625615763
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 726
average number of affinization = 602.9950980392157
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 299      |
| Iteration     | 32       |
| MaximumReturn | 952      |
| MinimumReturn | -40.8    |
| TotalSamples  | 136000   |
----------------------------
