Logging to experiments/gym_cheetahO01/oct31/w350e3_Durl_seed4321
Print configuration .....
{'env_name': 'gym_cheetahO01',
 'random_seeds': [4321, 2314, 2341, 3421],
 'save_variables': False,
 'model_save_dir': '/tmp/gym_cheetahO01_models/',
 'restore_variables': False,
 'start_onpol_iter': 0,
 'onpol_iters': 33,
 'num_path_random': 6,
 'num_path_onpol': 6,
 'env_horizon': 1000,
 'max_train_data': 200000,
 'max_val_data': 100000,
 'discard_ratio': 0.0,
 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20},
              'model': 'nn',
              'ensemble': True,
              'ensemble_model_count': 5,
              'enable_particle_ensemble': True,
              'particles': 5,
              'intrinsic_reward_only': False,
              'external_reward_evaluation_interval': 5,
              'obs_var': 1.0,
              'intrinsic_reward_coeff': 1.0,
              'ita': 1.0,
              'mode': 'random',
              'val': True,
              'n_layers': 4,
              'hidden_size': 1000,
              'activation': 'relu',
              'batch_size': 1000,
              'learning_rate': 0.001,
              'epochs': 200,
              'kfac_params': {'learning_rate': 0.1,
                              'damping': 0.001,
                              'momentum': 0.9,
                              'kl_clip': 0.0001,
                              'cov_ema_decay': 0.99}},
 'policy': {'network_shape': [32, 32],
            'init_logstd': 0.0,
            'activation': 'tanh',
            'reinitialize_every_itr': False},
 'trpo': {'horizon': 1000,
          'gamma': 0.99,
          'step_size': 0.01,
          'iterations': 20,
          'batch_size': 50000,
          'gae': 0.95},
 'trpo_ext_reward': {'horizon': 1000,
                     'gamma': 0.99,
                     'step_size': 0.01,
                     'iterations': 20,
                     'batch_size': 50000,
                     'gae': 0.95},
 'algo': 'trpo'}
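The run directory in the first line ends in "seed4321" and the configuration lists four random_seeds, which suggests one process per seed. The driver is not shown in the log; the sketch below is a hypothetical outline of how such a per-seed launch might look (run_experiment and its body are assumptions, not the project's code).

    # Hypothetical per-seed driver sketch; not the project's actual entry point.
    import numpy as np

    def run_experiment(config, seed):
        np.random.seed(seed)   # seed the numpy RNG used for sampling
        # ... build env, dynamics ensemble, and policy, then run the MB-RL loop ...
        print("Logging to experiments/%s/.../seed%d" % (config['env_name'], seed))

    config = {'env_name': 'gym_cheetahO01',
              'random_seeds': [4321, 2314, 2341, 3421]}
    for s in config['random_seeds'][:1]:   # the log above is the seed-4321 run
        run_experiment(config, s)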
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 0
average number of affinization = 0.0
Done generating random rollouts.
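The random-rollout phase above collects num_path_random = 6 paths of env_horizon = 1000 steps each (6000 transitions), with the cumulative timestep counter printed before every path. A minimal sketch of such a sampler is shown below; the gym-style environment is a stand-in for the real gym_cheetahO01 env, and uniform action sampling is an assumption.

    # Sketch of the random-rollout phase with a stand-in gym-like environment.
    import numpy as np

    class DummyEnv:                        # stand-in; real env is gym_cheetahO01
        obs_dim, act_dim = 17, 6
        def reset(self):
            return np.zeros(self.obs_dim)
        def step(self, action):
            return np.random.randn(self.obs_dim), 0.0, False, {}

    def random_rollout(env, horizon, rng):
        obs, acts, next_obs = [], [], []
        o = env.reset()
        for _ in range(horizon):
            a = rng.uniform(-1.0, 1.0, size=env.act_dim)   # random exploration action
            o2, _, done, _ = env.step(a)
            obs.append(o); acts.append(a); next_obs.append(o2)
            o = env.reset() if done else o2
        return np.array(obs), np.array(acts), np.array(next_obs)

    env, rng = DummyEnv(), np.random.default_rng(0)
    paths = [random_rollout(env, horizon=1000, rng=rng) for _ in range(6)]
    print(len(paths), paths[0][0].shape)    # 6 paths, (1000, 17) observations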
Creating normalization for training data.
Done creating normalization for training data.
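The normalization step presumably computes per-dimension statistics of the training data so the dynamics model trains on standardized inputs and targets. The exact quantities are not printed; a common choice, sketched below, is mean/std of observations, actions, and state deltas.

    # Sketch: normalization statistics over the random-rollout data
    # (assumed form: mean/std of observations, actions, and state deltas).
    import numpy as np

    def compute_normalization(obs, acts, next_obs, eps=1e-6):
        deltas = next_obs - obs
        stats = {}
        for name, arr in (("obs", obs), ("acts", acts), ("deltas", deltas)):
            stats[name] = (arr.mean(axis=0), arr.std(axis=0) + eps)
        return stats

    rng = np.random.default_rng(0)
    obs, acts = rng.normal(size=(6000, 17)), rng.normal(size=(6000, 6))
    next_obs = obs + 0.01 * rng.normal(size=obs.shape)
    norm = compute_normalization(obs, acts, next_obs)
    print(norm["obs"][0].shape, norm["deltas"][1].mean())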
Particle ensemble enabled? True
An ensemble of 5 dynamics models of <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
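The ensemble reported above consists of 5 independently initialized NNDynamicsModel instances with the logged sizes (4 hidden layers of 1000 ReLU units). The constructor arguments and the delta-prediction convention in the sketch below are assumptions, not the project's actual class.

    # Sketch: an ensemble of independently seeded MLP dynamics models.
    import numpy as np

    class NNDynamicsModel:
        def __init__(self, obs_dim, act_dim, hidden_size=1000, n_layers=4, seed=0):
            rng = np.random.default_rng(seed)
            sizes = [obs_dim + act_dim] + [hidden_size] * n_layers + [obs_dim]
            # He-style init (std = sqrt(2 / fan_in)) suited to ReLU layers;
            # each ensemble member gets its own weights.
            self.weights = [rng.normal(0, np.sqrt(2.0 / m), size=(m, n))
                            for m, n in zip(sizes[:-1], sizes[1:])]

        def predict_delta(self, obs, act):
            h = np.concatenate([obs, act], axis=-1)
            for w in self.weights[:-1]:
                h = np.maximum(h @ w, 0.0)        # ReLU hidden layers
            return h @ self.weights[-1]           # predicted state delta

    # hidden_size=64 keeps the demo light; the actual run uses 1000.
    ensemble = [NNDynamicsModel(17, 6, hidden_size=64, seed=s) for s in range(5)]
    print(len(ensemble), ensemble[0].predict_delta(np.zeros(17), np.zeros(6)).shape)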
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
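Pre-training is enabled in the config but set to 0 iterations (pre_training itr = 0), so it is effectively a no-op here, and the main loop proceeds with the external environment reward. The config also carries intrinsic-reward settings (intrinsic_reward_coeff, obs_var); one common form of intrinsic reward in model-based exploration is ensemble disagreement, sketched below. Whether this codebase uses that exact form is not visible in the log, so treat the sketch as an illustration only.

    # Hedged sketch: intrinsic reward as ensemble prediction variance.
    import numpy as np

    def intrinsic_reward(ensemble_preds, coeff=1.0, obs_var=1.0):
        # ensemble_preds: (ensemble_size, batch, obs_dim) next-state predictions
        disagreement = ensemble_preds.var(axis=0).mean(axis=-1)   # per-sample variance
        return coeff * disagreement / obs_var

    preds = np.random.default_rng(0).normal(size=(5, 4, 17))
    print(intrinsic_reward(preds))     # one intrinsic reward per batch element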
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.45830589532852173
Validation loss = 0.22810453176498413
Validation loss = 0.17991463840007782
Validation loss = 0.16341421008110046
Validation loss = 0.15869589149951935
Validation loss = 0.16185003519058228
Validation loss = 0.16314223408699036
Validation loss = 0.17079128324985504
Validation loss = 0.16778767108917236
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5371729135513306
Validation loss = 0.22842437028884888
Validation loss = 0.17932087182998657
Validation loss = 0.16438518464565277
Validation loss = 0.16326747834682465
Validation loss = 0.16349822282791138
Validation loss = 0.16276149451732635
Validation loss = 0.17758819460868835
Validation loss = 0.16265997290611267
Validation loss = 0.1660134494304657
Validation loss = 0.16437330842018127
Validation loss = 0.16406217217445374
Validation loss = 0.16386206448078156
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6458083391189575
Validation loss = 0.21049882471561432
Validation loss = 0.171189084649086
Validation loss = 0.1620507389307022
Validation loss = 0.16084809601306915
Validation loss = 0.1618969887495041
Validation loss = 0.16200478374958038
Validation loss = 0.19538480043411255
Validation loss = 0.16191716492176056
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5907207727432251
Validation loss = 0.23361417651176453
Validation loss = 0.18513716757297516
Validation loss = 0.16650764644145966
Validation loss = 0.16039016842842102
Validation loss = 0.15798796713352203
Validation loss = 0.1636076122522354
Validation loss = 0.16056525707244873
Validation loss = 0.16627110540866852
Validation loss = 0.1657114326953888
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4125363230705261
Validation loss = 0.22351297736167908
Validation loss = 0.17909935116767883
Validation loss = 0.16474103927612305
Validation loss = 0.16240300238132477
Validation loss = 0.15958979725837708
Validation loss = 0.16273373365402222
Validation loss = 0.17213287949562073
Validation loss = 0.183513343334198
Validation loss = 0.18720996379852295
Done fitting dynamics.
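Each ensemble member is fitted separately with the settings from the config (batch_size 1000, learning_rate 0.001, up to 200 epochs), and a validation loss is printed once per epoch. The number of printed losses varies between models, which suggests some early-stopping rule on the validation loss; that rule is not stated in the log, so the patience criterion in the sketch below is an assumption, and the loss sequence is synthetic.

    # Sketch of one model's fitting loop with an assumed patience-based stop.
    def fit_one_model(train_step, val_loss_fn, epochs=200, patience=5):
        best, bad_epochs = float("inf"), 0
        for _ in range(epochs):
            train_step()                     # one pass of minibatch updates
            loss = val_loss_fn()
            print("Validation loss =", loss)
            if loss < best:
                best, bad_epochs = loss, 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:   # assumed early-stopping rule
                    break
        return best

    losses = iter([0.5, 0.3, 0.2, 0.17, 0.16, 0.162, 0.163, 0.164, 0.165, 0.166])
    fit_one_model(train_step=lambda: None, val_loss_fn=lambda: next(losses))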
Updating randomness.
Done updating randomness.
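What "Updating randomness" does is not explained in the log. A purely speculative reading, given enable_particle_ensemble = True and particles = 5 in the config, is that it re-draws which ensemble member propagates each simulated particle; the sketch below shows only that assumed interpretation.

    # Highly speculative sketch of the "Updating randomness" step.
    import numpy as np

    def update_randomness(n_particles=5, ensemble_size=5, seed=None):
        rng = np.random.default_rng(seed)
        # one model index per particle, fixed until the next update
        return rng.integers(0, ensemble_size, size=n_particles)

    print(update_randomness(seed=0))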
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
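The policy phase runs 20 TRPO iterations (trpo iterations = 20, batch_size = 50000). Since only 6 real paths are collected per outer iteration, these 50000-sample batches are presumably imagined rollouts through the learned dynamics ensemble; that is an inference, not something the log states. The sketch below shows only that outer structure: trpo_step is a placeholder and does not implement TRPO's KL-constrained natural-gradient update, and the toy policy and dynamics are stand-ins.

    # Sketch of the outer policy-optimization loop; trpo_step is a placeholder.
    import numpy as np

    def imagine_samples(policy, dynamics, init_obs, batch_size):
        obs, samples, total = init_obs, [], 0
        while total < batch_size:
            acts = policy(obs)
            obs = obs + dynamics(obs, acts)        # model predicts state deltas
            samples.append((obs, acts))
            total += len(obs)
        return samples

    def trpo_step(policy_params, samples):
        return policy_params                       # placeholder policy update

    policy = lambda o: np.tanh(o[:, :6])           # toy stand-in policy
    dynamics = lambda o, a: 0.01 * np.ones_like(o) # toy stand-in model
    policy_params = np.zeros(4)
    for it in range(20):
        print("Obtaining samples for iteration %d..." % it)
        batch = imagine_samples(policy, dynamics, np.zeros((50, 17)), 50000)
        policy_params = trpo_step(policy_params, batch)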
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 221
average number of affinization = 31.571428571428573
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 221
average number of affinization = 55.25
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 238
average number of affinization = 75.55555555555556
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 249
average number of affinization = 92.9
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 126
average number of affinization = 95.9090909090909
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 209
average number of affinization = 105.33333333333333
Done generating on-policy rollouts.
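The per-path counts and the running averages above are consistent: each printed average equals the cumulative number of affinization events divided by the total number of paths collected so far, including the six random rollouts (which contributed zero events). What exactly counts as an "affinization" at epsilon = 3 is not defined in the log; the sketch below only reproduces the bookkeeping for this first on-policy batch.

    # Sketch reproducing the running-average bookkeeping visible in the log.
    counts = [0] * 6 + [221, 221, 238, 249, 126, 209]   # random + itr 0 on-policy
    total = 0
    for i, c in enumerate(counts):
        total += c
        print("count = %d, average = %.3f" % (c, total / (i + 1)))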
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -305     |
| Iteration     | 0        |
| MaximumReturn | -258     |
| MinimumReturn | -377     |
| TotalSamples  | 8000     |
----------------------------
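Each outer iteration ends with a fixed-width diagnostics table of evaluation returns and cumulative sample count. The real logger is likely an rllab-style tabular logger; the snippet below is only a formatting illustration of how such a two-column table can be printed.

    # Formatting illustration of the per-iteration diagnostics table.
    def print_tabular(stats):
        key_w = max(len(k) for k in stats)
        val_w = max(len(f"{v:g}") for v in stats.values())
        line = "-" * (key_w + val_w + 7)
        print(line)
        for k, v in stats.items():
            print(f"| {k:<{key_w}} | {v:<{val_w}g} |")
        print(line)

    print_tabular({"AverageReturn": -305, "Iteration": 0,
                   "MaximumReturn": -258, "MinimumReturn": -377,
                   "TotalSamples": 8000})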
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.47505220770835876
Validation loss = 0.4600931406021118
Validation loss = 0.4847739040851593
Validation loss = 0.49868810176849365
Validation loss = 0.6838341951370239
Validation loss = 0.5679582357406616
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4848984479904175
Validation loss = 0.5417726039886475
Validation loss = 0.5546319484710693
Validation loss = 0.5974649786949158
Validation loss = 0.7968135476112366
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.44711047410964966
Validation loss = 0.5036540031433105
Validation loss = 0.5969937443733215
Validation loss = 0.4808275103569031
Validation loss = 0.6256044507026672
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.4804176688194275
Validation loss = 0.41863688826560974
Validation loss = 0.4915352463722229
Validation loss = 0.6323351263999939
Validation loss = 0.7309496402740479
Validation loss = 0.8884394764900208
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.4939820170402527
Validation loss = 0.5394402742385864
Validation loss = 0.500872015953064
Validation loss = 0.3988184928894043
Validation loss = 0.6703228950500488
Validation loss = 0.9351422190666199
Validation loss = 0.8371449112892151
Validation loss = 0.9174886345863342
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 357
average number of affinization = 124.6923076923077
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 376
average number of affinization = 142.64285714285714
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 346
average number of affinization = 156.2
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 358
average number of affinization = 168.8125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 374
average number of affinization = 180.88235294117646
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 358
average number of affinization = 190.72222222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -220     |
| Iteration     | 1        |
| MaximumReturn | -182     |
| MinimumReturn | -274     |
| TotalSamples  | 12000    |
----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.4071553945541382
Validation loss = 1.0904518365859985
Validation loss = 0.8738208413124084
Validation loss = 1.7109473943710327
Validation loss = 1.8691848516464233
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4835946559906006
Validation loss = 0.7578116059303284
Validation loss = 1.142279028892517
Validation loss = 1.398516297340393
Validation loss = 1.665528655052185
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6163988709449768
Validation loss = 0.9601449966430664
Validation loss = 1.3006621599197388
Validation loss = 1.4090800285339355
Validation loss = 1.7527645826339722
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7733848094940186
Validation loss = 1.2510591745376587
Validation loss = 1.5840109586715698
Validation loss = 1.595602035522461
Validation loss = 2.2279186248779297
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7591369152069092
Validation loss = 1.1626743078231812
Validation loss = 1.6030460596084595
Validation loss = 1.70112144947052
Validation loss = 1.8809996843338013
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 503
average number of affinization = 207.1578947368421
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 500
average number of affinization = 221.8
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 474
average number of affinization = 233.8095238095238
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 505
average number of affinization = 246.13636363636363
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 477
average number of affinization = 256.17391304347825
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 508
average number of affinization = 266.6666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -256     |
| Iteration     | 2        |
| MaximumReturn | -241     |
| MinimumReturn | -273     |
| TotalSamples  | 16000    |
----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.8449698686599731
Validation loss = 2.437368869781494
Validation loss = 2.817716121673584
Validation loss = 2.9343652725219727
Validation loss = 3.279966354370117
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9866975545883179
Validation loss = 1.7830407619476318
Validation loss = 2.0111865997314453
Validation loss = 1.4591288566589355
Validation loss = 2.4775140285491943
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 1.4932892322540283
Validation loss = 2.382059097290039
Validation loss = 1.812382698059082
Validation loss = 3.038708448410034
Validation loss = 3.1956253051757812
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.8236678838729858
Validation loss = 2.645577907562256
Validation loss = 2.9650943279266357
Validation loss = 2.7854931354522705
Validation loss = 3.6245691776275635
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8716971278190613
Validation loss = 1.8626606464385986
Validation loss = 2.495978832244873
Validation loss = 2.520143747329712
Validation loss = 2.8836216926574707
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 557
average number of affinization = 278.28
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 500
average number of affinization = 286.8076923076923
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 510
average number of affinization = 295.0740740740741
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 519
average number of affinization = 303.07142857142856
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 524
average number of affinization = 310.6896551724138
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 530
average number of affinization = 318.0
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -288     |
| Iteration     | 3        |
| MaximumReturn | -242     |
| MinimumReturn | -330     |
| TotalSamples  | 20000    |
----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.5650696754455566
Validation loss = 2.786442279815674
Validation loss = 2.9737820625305176
Validation loss = 3.128976345062256
Validation loss = 3.5062527656555176
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.0034830570220947
Validation loss = 2.1555049419403076
Validation loss = 2.3897271156311035
Validation loss = 2.2394890785217285
Validation loss = 2.5629465579986572
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.07940411567688
Validation loss = 2.7552294731140137
Validation loss = 2.8931097984313965
Validation loss = 3.268707752227783
Validation loss = 3.2720913887023926
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 2.519294261932373
Validation loss = 3.2244510650634766
Validation loss = 2.9386579990386963
Validation loss = 3.7840828895568848
Validation loss = 3.806821823120117
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 1.9682512283325195
Validation loss = 2.5242059230804443
Validation loss = 2.345362901687622
Validation loss = 2.600813388824463
Validation loss = 2.8424365520477295
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 519
average number of affinization = 324.48387096774195
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 459
average number of affinization = 328.6875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 529
average number of affinization = 334.75757575757575
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 523
average number of affinization = 340.29411764705884
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 521
average number of affinization = 345.45714285714286
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 490
average number of affinization = 349.47222222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -299     |
| Iteration     | 4        |
| MaximumReturn | -256     |
| MinimumReturn | -344     |
| TotalSamples  | 24000    |
----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.3273943662643433
Validation loss = 2.7960684299468994
Validation loss = 2.9143543243408203
Validation loss = 2.859990119934082
Validation loss = 2.9799387454986572
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 1.2544338703155518
Validation loss = 1.9552850723266602
Validation loss = 2.080641031265259
Validation loss = 2.1920759677886963
Validation loss = 2.1995251178741455
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 1.459649920463562
Validation loss = 2.6651785373687744
Validation loss = 2.9171628952026367
Validation loss = 2.7141025066375732
Validation loss = 2.904832124710083
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.3478026390075684
Validation loss = 3.174710273742676
Validation loss = 3.209362030029297
Validation loss = 3.383863687515259
Validation loss = 3.380038022994995
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 1.616288661956787
Validation loss = 2.1596145629882812
Validation loss = 2.3557064533233643
Validation loss = 2.405614137649536
Validation loss = 2.327686309814453
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 543
average number of affinization = 354.7027027027027
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 541
average number of affinization = 359.60526315789474
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 518
average number of affinization = 363.6666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 572
average number of affinization = 368.875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 521
average number of affinization = 372.5853658536585
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 540
average number of affinization = 376.57142857142856
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -190     |
| Iteration     | 5        |
| MaximumReturn | -165     |
| MinimumReturn | -237     |
| TotalSamples  | 28000    |
----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.5078070163726807
Validation loss = 2.4993491172790527
Validation loss = 2.989250659942627
Validation loss = 2.9066402912139893
Validation loss = 2.795766592025757
Validation loss = 2.9749245643615723
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 1.792077898979187
Validation loss = 2.073582172393799
Validation loss = 2.264965534210205
Validation loss = 2.235762357711792
Validation loss = 2.3526904582977295
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.4869308471679688
Validation loss = 2.765742063522339
Validation loss = 2.9011178016662598
Validation loss = 2.7852442264556885
Validation loss = 2.870023727416992
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 2.894422769546509
Validation loss = 3.251854658126831
Validation loss = 3.603954792022705
Validation loss = 3.597778558731079
Validation loss = 3.5907485485076904
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.198431968688965
Validation loss = 2.389116048812866
Validation loss = 2.5121681690216064
Validation loss = 2.4505743980407715
Validation loss = 2.6207988262176514
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 656
average number of affinization = 383.06976744186045
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 629
average number of affinization = 388.65909090909093
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 641
average number of affinization = 394.26666666666665
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 628
average number of affinization = 399.3478260869565
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 632
average number of affinization = 404.29787234042556
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 673
average number of affinization = 409.8958333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 157      |
| Iteration     | 6        |
| MaximumReturn | 217      |
| MinimumReturn | 87.6     |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.7535815238952637
Validation loss = 2.8705923557281494
Validation loss = 2.878598213195801
Validation loss = 2.8306636810302734
Validation loss = 3.0650441646575928
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.0387587547302246
Validation loss = 2.1781015396118164
Validation loss = 2.3407340049743652
Validation loss = 2.333324909210205
Validation loss = 2.227684497833252
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.62156081199646
Validation loss = 2.8284425735473633
Validation loss = 2.6323134899139404
Validation loss = 2.330886125564575
Validation loss = 3.0329742431640625
Validation loss = 2.797292470932007
Validation loss = 2.8523213863372803
Validation loss = 3.0384297370910645
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.3652381896972656
Validation loss = 3.541053056716919
Validation loss = 3.6569666862487793
Validation loss = 3.642000913619995
Validation loss = 3.8230483531951904
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.4637670516967773
Validation loss = 2.459617853164673
Validation loss = 2.484052896499634
Validation loss = 2.618614912033081
Validation loss = 2.359563112258911
Validation loss = 2.5533411502838135
Validation loss = 2.734123468399048
Validation loss = 2.8145368099212646
Validation loss = 2.775019645690918
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 817
average number of affinization = 418.2040816326531
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 819
average number of affinization = 426.22
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 798
average number of affinization = 433.5098039215686
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 800
average number of affinization = 440.5576923076923
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 808
average number of affinization = 447.49056603773585
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 797
average number of affinization = 453.962962962963
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 520      |
| Iteration     | 7        |
| MaximumReturn | 563      |
| MinimumReturn | 478      |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.878391742706299
Validation loss = 2.724679946899414
Validation loss = 2.9122426509857178
Validation loss = 2.971034049987793
Validation loss = 2.873077630996704
Validation loss = 2.9591283798217773
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.3685004711151123
Validation loss = 2.3810842037200928
Validation loss = 2.435776710510254
Validation loss = 2.32589054107666
Validation loss = 2.4268641471862793
Validation loss = 2.5115206241607666
Validation loss = 2.550654411315918
Validation loss = 2.7416605949401855
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.8749353885650635
Validation loss = 2.8691165447235107
Validation loss = 2.929060697555542
Validation loss = 2.952687978744507
Validation loss = 3.09708309173584
Validation loss = 2.892910957336426
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.709264039993286
Validation loss = 3.6703224182128906
Validation loss = 3.833094358444214
Validation loss = 3.77948260307312
Validation loss = 3.799919605255127
Validation loss = 3.908583641052246
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.5860016345977783
Validation loss = 2.7152953147888184
Validation loss = 2.666381359100342
Validation loss = 2.7128353118896484
Validation loss = 2.646591901779175
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 818
average number of affinization = 460.58181818181816
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 835
average number of affinization = 467.26785714285717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 816
average number of affinization = 473.3859649122807
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 794
average number of affinization = 478.91379310344826
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 796
average number of affinization = 484.2881355932203
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 831
average number of affinization = 490.06666666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 584      |
| Iteration     | 8        |
| MaximumReturn | 724      |
| MinimumReturn | 519      |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.95371413230896
Validation loss = 2.9273741245269775
Validation loss = 2.9303317070007324
Validation loss = 2.955596685409546
Validation loss = 2.922078847885132
Validation loss = 3.0032081604003906
Validation loss = 2.9188637733459473
Validation loss = 3.1076295375823975
Validation loss = 2.8520290851593018
Validation loss = 3.1032731533050537
Validation loss = 3.0440027713775635
Validation loss = 3.00091814994812
Validation loss = 2.9677770137786865
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.581321954727173
Validation loss = 2.743258476257324
Validation loss = 2.568070411682129
Validation loss = 2.4834914207458496
Validation loss = 2.6521432399749756
Validation loss = 2.5257601737976074
Validation loss = 2.642881393432617
Validation loss = 2.730011224746704
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.931959390640259
Validation loss = 2.9029383659362793
Validation loss = 2.5954065322875977
Validation loss = 2.917011022567749
Validation loss = 2.9610705375671387
Validation loss = 2.8940300941467285
Validation loss = 2.936429738998413
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.7139499187469482
Validation loss = 3.690514326095581
Validation loss = 3.6235053539276123
Validation loss = 3.5803802013397217
Validation loss = 3.942110061645508
Validation loss = 3.9707419872283936
Validation loss = 4.07351016998291
Validation loss = 3.8014533519744873
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.596653938293457
Validation loss = 2.486243486404419
Validation loss = 2.673024892807007
Validation loss = 2.9150540828704834
Validation loss = 2.933798313140869
Validation loss = 2.864506244659424
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 851
average number of affinization = 495.9836065573771
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 866
average number of affinization = 501.9516129032258
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 852
average number of affinization = 507.5079365079365
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 857
average number of affinization = 512.96875
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 863
average number of affinization = 518.3538461538461
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 860
average number of affinization = 523.530303030303
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 572      |
| Iteration     | 9        |
| MaximumReturn | 616      |
| MinimumReturn | 533      |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.956833600997925
Validation loss = 2.7601804733276367
Validation loss = 2.856010675430298
Validation loss = 2.836883544921875
Validation loss = 2.8675620555877686
Validation loss = 2.914905309677124
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.575246572494507
Validation loss = 2.5388457775115967
Validation loss = 2.457819938659668
Validation loss = 2.5164830684661865
Validation loss = 2.4648282527923584
Validation loss = 2.5999701023101807
Validation loss = 2.5363235473632812
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.594733238220215
Validation loss = 2.702617883682251
Validation loss = 2.6042656898498535
Validation loss = 2.6249678134918213
Validation loss = 2.7333009243011475
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.6481528282165527
Validation loss = 3.742694854736328
Validation loss = 3.421220302581787
Validation loss = 3.642937660217285
Validation loss = 3.6676414012908936
Validation loss = 3.7661690711975098
Validation loss = 3.6273491382598877
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.8039989471435547
Validation loss = 2.741689443588257
Validation loss = 2.6889705657958984
Validation loss = 2.8951539993286133
Validation loss = 2.983322858810425
Validation loss = 3.0279042720794678
Validation loss = 3.010331630706787
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 887
average number of affinization = 528.955223880597
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 876
average number of affinization = 534.0588235294117
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 855
average number of affinization = 538.7101449275362
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 856
average number of affinization = 543.2428571428571
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 896
average number of affinization = 548.2112676056338
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 880
average number of affinization = 552.8194444444445
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 563      |
| Iteration     | 10       |
| MaximumReturn | 604      |
| MinimumReturn | 452      |
| TotalSamples  | 48000    |
----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.5111477375030518
Validation loss = 2.647089719772339
Validation loss = 2.6227023601531982
Validation loss = 2.5174498558044434
Validation loss = 2.746868371963501
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.406827211380005
Validation loss = 2.576342821121216
Validation loss = 2.2878973484039307
Validation loss = 2.3585469722747803
Validation loss = 2.4298975467681885
Validation loss = 2.345770835876465
Validation loss = 2.4080798625946045
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 2.624901056289673
Validation loss = 2.424762487411499
Validation loss = 2.446197032928467
Validation loss = 2.358898401260376
Validation loss = 2.3406314849853516
Validation loss = 2.3155508041381836
Validation loss = 2.229578733444214
Validation loss = 2.2305712699890137
Validation loss = 2.2235567569732666
Validation loss = 2.1529109477996826
Validation loss = 2.112771511077881
Validation loss = 2.19913649559021
Validation loss = 2.092064142227173
Validation loss = 2.2438361644744873
Validation loss = 2.0667946338653564
Validation loss = 2.1127641201019287
Validation loss = 2.1341657638549805
Validation loss = 2.0013060569763184
Validation loss = 2.049741506576538
Validation loss = 2.102647542953491
Validation loss = 2.115624189376831
Validation loss = 1.832425594329834
Validation loss = 1.9312750101089478
Validation loss = 2.075897455215454
Validation loss = 1.917930006980896
Validation loss = 1.9481401443481445
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.560115098953247
Validation loss = 3.4305419921875
Validation loss = 3.4330408573150635
Validation loss = 3.5863142013549805
Validation loss = 3.5140655040740967
Validation loss = 3.5176894664764404
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.869344711303711
Validation loss = 2.8586721420288086
Validation loss = 3.037302255630493
Validation loss = 3.3378896713256836
Validation loss = 3.286571741104126
Validation loss = 3.3666207790374756
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 918
average number of affinization = 557.8219178082192
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 904
average number of affinization = 562.5
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 890
average number of affinization = 566.8666666666667
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 918
average number of affinization = 571.4868421052631
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 881
average number of affinization = 575.5064935064935
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 935
average number of affinization = 580.1153846153846
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 580      |
| Iteration     | 11       |
| MaximumReturn | 643      |
| MinimumReturn | 469      |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.286912679672241
Validation loss = 2.1571338176727295
Validation loss = 2.3265762329101562
Validation loss = 2.4455068111419678
Validation loss = 2.527064323425293
Validation loss = 2.1839280128479004
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 2.209425449371338
Validation loss = 2.1616320610046387
Validation loss = 2.1734189987182617
Validation loss = 2.1540017127990723
Validation loss = 2.0949349403381348
Validation loss = 1.913617491722107
Validation loss = 1.8179036378860474
Validation loss = 1.9987467527389526
Validation loss = 1.8853408098220825
Validation loss = 1.8792393207550049
Validation loss = 1.8249307870864868
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 1.6005135774612427
Validation loss = 1.6205248832702637
Validation loss = 1.6105111837387085
Validation loss = 1.7524775266647339
Validation loss = 1.5510363578796387
Validation loss = 1.500272512435913
Validation loss = 1.5073727369308472
Validation loss = 1.5505667924880981
Validation loss = 1.5453776121139526
Validation loss = 1.468942642211914
Validation loss = 1.5291597843170166
Validation loss = 1.2920411825180054
Validation loss = 1.2665588855743408
Validation loss = 1.378859043121338
Validation loss = 1.424276351928711
Validation loss = 1.3203200101852417
Validation loss = 1.234237790107727
Validation loss = 1.1985929012298584
Validation loss = 1.2704097032546997
Validation loss = 1.2439172267913818
Validation loss = 1.1332987546920776
Validation loss = 1.1289392709732056
Validation loss = 1.0904775857925415
Validation loss = 1.2492713928222656
Validation loss = 1.1398414373397827
Validation loss = 1.1277217864990234
Validation loss = 1.0580769777297974
Validation loss = 1.0029215812683105
Validation loss = 1.0318182706832886
Validation loss = 1.029654622077942
Validation loss = 1.061468243598938
Validation loss = 1.0577583312988281
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 3.0513880252838135
Validation loss = 2.919769287109375
Validation loss = 3.047218084335327
Validation loss = 3.035971164703369
Validation loss = 2.9537177085876465
Validation loss = 2.895873546600342
Validation loss = 2.797318696975708
Validation loss = 2.9395594596862793
Validation loss = 2.7442524433135986
Validation loss = 2.802377462387085
Validation loss = 2.65944242477417
Validation loss = 2.6244049072265625
Validation loss = 2.79793119430542
Validation loss = 2.615706205368042
Validation loss = 2.8021204471588135
Validation loss = 2.480661630630493
Validation loss = 2.5484304428100586
Validation loss = 2.694350004196167
Validation loss = 2.564757823944092
Validation loss = 2.458258628845215
Validation loss = 2.337608814239502
Validation loss = 2.681495189666748
Validation loss = 2.5401036739349365
Validation loss = 2.4116640090942383
Validation loss = 2.4610583782196045
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 3.0707435607910156
Validation loss = 2.9342763423919678
Validation loss = 2.7620441913604736
Validation loss = 2.88110613822937
Validation loss = 2.8790106773376465
Validation loss = 3.1038873195648193
Validation loss = 2.8600833415985107
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 927
average number of affinization = 584.506329113924
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 921
average number of affinization = 588.7125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 923
average number of affinization = 592.8395061728395
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 912
average number of affinization = 596.7317073170732
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 909
average number of affinization = 600.4939759036145
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 924
average number of affinization = 604.3452380952381
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 642      |
| Iteration     | 12       |
| MaximumReturn | 714      |
| MinimumReturn | 587      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 2.234347105026245
Validation loss = 2.2645506858825684
Validation loss = 2.1464147567749023
Validation loss = 1.9204989671707153
Validation loss = 2.075812339782715
Validation loss = 2.013347625732422
Validation loss = 1.9716371297836304
Validation loss = 1.826995849609375
Validation loss = 1.7870854139328003
Validation loss = 1.8520883321762085
Validation loss = 1.9144408702850342
Validation loss = 1.624690055847168
Validation loss = 1.757269024848938
Validation loss = 1.6883394718170166
Validation loss = 1.5884968042373657
Validation loss = 1.5953212976455688
Validation loss = 1.4945688247680664
Validation loss = 1.5999057292938232
Validation loss = 1.3888444900512695
Validation loss = 1.5315368175506592
Validation loss = 1.4630508422851562
Validation loss = 1.516830563545227
Validation loss = 1.3837025165557861
Validation loss = 1.4211276769638062
Validation loss = 1.4183293581008911
Validation loss = 1.3955984115600586
Validation loss = 1.2408069372177124
Validation loss = 1.254516363143921
Validation loss = 1.2406922578811646
Validation loss = 1.3134727478027344
Validation loss = 1.240098237991333
Validation loss = 1.3104677200317383
Validation loss = 1.3237804174423218
Validation loss = 1.2147401571273804
Validation loss = 1.3167943954467773
Validation loss = 1.3763625621795654
Validation loss = 1.3012069463729858
Validation loss = 1.1747146844863892
Validation loss = 1.1563154458999634
Validation loss = 1.1511633396148682
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 1.6357851028442383
Validation loss = 1.5698360204696655
Validation loss = 1.6859577894210815
Validation loss = 1.667465329170227
Validation loss = 1.5933011770248413
Validation loss = 1.4890992641448975
Validation loss = 1.5438352823257446
Validation loss = 1.480854868888855
Validation loss = 1.415662169456482
Validation loss = 1.405098557472229
Validation loss = 1.4211395978927612
Validation loss = 1.3399497270584106
Validation loss = 1.2662888765335083
Validation loss = 1.2908371686935425
Validation loss = 1.2124947309494019
Validation loss = 1.2198506593704224
Validation loss = 1.1458885669708252
Validation loss = 1.2492048740386963
Validation loss = 1.1399446725845337
Validation loss = 1.0887598991394043
Validation loss = 1.045495629310608
Validation loss = 1.1365946531295776
Validation loss = 1.0782344341278076
Validation loss = 1.1277011632919312
Validation loss = 1.074410080909729
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.9397098422050476
Validation loss = 0.9502712488174438
Validation loss = 0.8474869728088379
Validation loss = 0.9747337102890015
Validation loss = 0.9208935499191284
Validation loss = 0.8880426287651062
Validation loss = 0.9344335794448853
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.9953601360321045
Validation loss = 2.163856267929077
Validation loss = 2.045595169067383
Validation loss = 1.8934978246688843
Validation loss = 2.1346383094787598
Validation loss = 1.8817275762557983
Validation loss = 2.0118906497955322
Validation loss = 1.8307445049285889
Validation loss = 1.9076606035232544
Validation loss = 1.7111483812332153
Validation loss = 1.8464946746826172
Validation loss = 1.8938055038452148
Validation loss = 1.752752423286438
Validation loss = 1.5864242315292358
Validation loss = 1.4679139852523804
Validation loss = 1.7751483917236328
Validation loss = 1.672814965248108
Validation loss = 1.7251145839691162
Validation loss = 1.5064479112625122
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.7686381340026855
Validation loss = 2.661642074584961
Validation loss = 2.7209649085998535
Validation loss = 2.818605422973633
Validation loss = 2.8280560970306396
Validation loss = 2.721466302871704
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 949
average number of affinization = 608.4
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 945
average number of affinization = 612.3139534883721
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 942
average number of affinization = 616.1034482758621
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 956
average number of affinization = 619.9659090909091
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 943
average number of affinization = 623.5955056179776
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 961
average number of affinization = 627.3444444444444
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 707      |
| Iteration     | 13       |
| MaximumReturn | 777      |
| MinimumReturn | 667      |
| TotalSamples  | 60000    |
----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 1.148196816444397
Validation loss = 1.1256657838821411
Validation loss = 1.0645267963409424
Validation loss = 1.1471762657165527
Validation loss = 1.0711567401885986
Validation loss = 1.1340934038162231
Validation loss = 1.0502142906188965
Validation loss = 0.9310197234153748
Validation loss = 0.9583824276924133
Validation loss = 1.1076090335845947
Validation loss = 0.9848334789276123
Validation loss = 1.0079537630081177
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.9824331998825073
Validation loss = 0.9873656630516052
Validation loss = 0.920785129070282
Validation loss = 0.8305031657218933
Validation loss = 0.8413253426551819
Validation loss = 0.8038198947906494
Validation loss = 0.794396162033081
Validation loss = 0.835411787033081
Validation loss = 0.7971702218055725
Validation loss = 0.7664718627929688
Validation loss = 0.678355872631073
Validation loss = 0.6781678795814514
Validation loss = 0.673610508441925
Validation loss = 0.653303325176239
Validation loss = 0.7219292521476746
Validation loss = 0.6427640914916992
Validation loss = 0.6469939351081848
Validation loss = 0.5806704163551331
Validation loss = 0.6777830719947815
Validation loss = 0.7125064134597778
Validation loss = 0.6046668887138367
Validation loss = 0.6932640075683594
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.8548567295074463
Validation loss = 0.8941389322280884
Validation loss = 0.878216028213501
Validation loss = 0.9472551941871643
Validation loss = 0.8942411541938782
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.522704005241394
Validation loss = 1.4871050119400024
Validation loss = 1.4536234140396118
Validation loss = 1.3778877258300781
Validation loss = 1.5390619039535522
Validation loss = 1.558953046798706
Validation loss = 1.330221176147461
Validation loss = 1.4340503215789795
Validation loss = 1.4367601871490479
Validation loss = 1.3962606191635132
Validation loss = 1.4210619926452637
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 2.311152219772339
Validation loss = 2.4433348178863525
Validation loss = 2.3026742935180664
Validation loss = 2.17527437210083
Validation loss = 2.2050061225891113
Validation loss = 2.1302719116210938
Validation loss = 1.977443814277649
Validation loss = 2.104405403137207
Validation loss = 2.090470552444458
Validation loss = 2.070178985595703
Validation loss = 1.7822946310043335
Validation loss = 1.9814997911453247
Validation loss = 1.830350637435913
Validation loss = 1.7322630882263184
Validation loss = 1.4914891719818115
Validation loss = 1.6546738147735596
Validation loss = 1.528670072555542
Validation loss = 1.5666550397872925
Validation loss = 1.4708893299102783
Validation loss = 1.6413800716400146
Validation loss = 1.3748271465301514
Validation loss = 1.3386244773864746
Validation loss = 1.227246642112732
Validation loss = 1.3645286560058594
Validation loss = 1.4907985925674438
Validation loss = 1.3478118181228638
Validation loss = 1.2419416904449463
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 931
average number of affinization = 630.6813186813187
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 933
average number of affinization = 633.9673913043479
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 935
average number of affinization = 637.2043010752689
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 930
average number of affinization = 640.3191489361702
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 921
average number of affinization = 643.2736842105263
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 933
average number of affinization = 646.2916666666666
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 681      |
| Iteration     | 14       |
| MaximumReturn | 745      |
| MinimumReturn | 608      |
| TotalSamples  | 64000    |
----------------------------
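From here on the log repeats the same sequence of phases once per outer iteration (itr #15 through #29): fit the 5-model dynamics ensemble, update randomness, run 20 TRPO sample/update iterations, collect six 1000-step on-policy paths, then refresh normalization and print the summary table. A schematic of that outer loop, with every helper a placeholder standing in for the steps sketched earlier:

def fit_dynamics(): ...
def update_randomness(): ...
def train_policy_trpo(): ...
def generate_rollouts(): return []          # six 1000-step paths per iteration
def update_normalization(paths): ...
def log_summary(itr, paths): ...

def outer_loop(start_itr, end_itr):
    for itr in range(start_itr, end_itr):
        print(f"itr #{itr} | ")
        fit_dynamics()                 # "Fitting dynamics." ... "Done fitting dynamics."
        update_randomness()            # "Updating randomness." / "Done updating randomness."
        train_policy_trpo()            # "Training policy using TRPO." + 20 sample iterations
        paths = generate_rollouts()    # "Generating on-policy rollouts." (Path 0..5)
        update_normalization(paths)    # "Updating normalization."
        log_summary(itr, paths)        # the AverageReturn / ... / TotalSamples table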
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.891275942325592
Validation loss = 0.999472439289093
Validation loss = 0.8794609308242798
Validation loss = 1.0378916263580322
Validation loss = 0.8967921733856201
Validation loss = 0.8856280446052551
Validation loss = 0.9219452142715454
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5410851836204529
Validation loss = 0.6011638641357422
Validation loss = 0.5670090317726135
Validation loss = 0.580606997013092
Validation loss = 0.5430108308792114
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6555904746055603
Validation loss = 0.7454749345779419
Validation loss = 0.7252129316329956
Validation loss = 0.661463737487793
Validation loss = 0.721021294593811
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.3196709156036377
Validation loss = 1.1938667297363281
Validation loss = 1.3297454118728638
Validation loss = 1.034030556678772
Validation loss = 1.1564764976501465
Validation loss = 1.059828519821167
Validation loss = 1.0409188270568848
Validation loss = 1.1104387044906616
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 1.1934367418289185
Validation loss = 1.3464374542236328
Validation loss = 1.114661693572998
Validation loss = 1.1087185144424438
Validation loss = 1.0518590211868286
Validation loss = 1.084311604499817
Validation loss = 1.0617293119430542
Validation loss = 1.0295270681381226
Validation loss = 0.8907620906829834
Validation loss = 1.0258660316467285
Validation loss = 0.9476586580276489
Validation loss = 0.9558665752410889
Validation loss = 0.8394384384155273
Validation loss = 0.895448625087738
Validation loss = 0.8443927764892578
Validation loss = 0.8523783087730408
Validation loss = 0.827140212059021
Validation loss = 0.7620674967765808
Validation loss = 0.8329777717590332
Validation loss = 0.7588886022567749
Validation loss = 0.7306317687034607
Validation loss = 0.8659355640411377
Validation loss = 0.7376080751419067
Validation loss = 0.8262456655502319
Validation loss = 0.8378250598907471
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 932
average number of affinization = 649.2371134020618
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 923
average number of affinization = 652.030612244898
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 935
average number of affinization = 654.8888888888889
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 929
average number of affinization = 657.63
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 937
average number of affinization = 660.3960396039604
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 916
average number of affinization = 662.9019607843137
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 709      |
| Iteration     | 15       |
| MaximumReturn | 805      |
| MinimumReturn | 607      |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.8227692246437073
Validation loss = 0.8437763452529907
Validation loss = 0.86554354429245
Validation loss = 0.8587259650230408
Validation loss = 0.7443829774856567
Validation loss = 0.7482022047042847
Validation loss = 0.7350520491600037
Validation loss = 0.8274590373039246
Validation loss = 0.7412283420562744
Validation loss = 0.8250946998596191
Validation loss = 0.7687907218933105
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5094736814498901
Validation loss = 0.5576819777488708
Validation loss = 0.5185219645500183
Validation loss = 0.5709876418113708
Validation loss = 0.4836447536945343
Validation loss = 0.5271400809288025
Validation loss = 0.5445374250411987
Validation loss = 0.54518723487854
Validation loss = 0.4669005870819092
Validation loss = 0.5249220132827759
Validation loss = 0.512813150882721
Validation loss = 0.5010803937911987
Validation loss = 0.46874967217445374
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6609586477279663
Validation loss = 0.6651120185852051
Validation loss = 0.6353650689125061
Validation loss = 0.6937777400016785
Validation loss = 0.6583055853843689
Validation loss = 0.5962241888046265
Validation loss = 0.6890103816986084
Validation loss = 0.5826554298400879
Validation loss = 0.6024832725524902
Validation loss = 0.6463685035705566
Validation loss = 0.5946282744407654
Validation loss = 0.6581092476844788
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 1.1591204404830933
Validation loss = 1.127259373664856
Validation loss = 0.9703173041343689
Validation loss = 1.0981156826019287
Validation loss = 1.0416765213012695
Validation loss = 1.07863450050354
Validation loss = 1.0907434225082397
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.728965163230896
Validation loss = 0.7857891321182251
Validation loss = 0.818072497844696
Validation loss = 0.715705931186676
Validation loss = 0.7711159586906433
Validation loss = 0.7489566802978516
Validation loss = 0.7855575680732727
Validation loss = 0.7483759522438049
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 915
average number of affinization = 665.3495145631068
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 910
average number of affinization = 667.7019230769231
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 930
average number of affinization = 670.2
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 927
average number of affinization = 672.622641509434
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 916
average number of affinization = 674.8971962616822
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 919
average number of affinization = 677.1574074074074
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 653      |
| Iteration     | 16       |
| MaximumReturn | 773      |
| MinimumReturn | 504      |
| TotalSamples  | 72000    |
----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6867045164108276
Validation loss = 0.862443745136261
Validation loss = 0.7418578267097473
Validation loss = 0.780164897441864
Validation loss = 0.7421095967292786
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.442588746547699
Validation loss = 0.4984128773212433
Validation loss = 0.4589179754257202
Validation loss = 0.44790709018707275
Validation loss = 0.5544923543930054
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6038457751274109
Validation loss = 0.6641873717308044
Validation loss = 0.5779275894165039
Validation loss = 0.5899240970611572
Validation loss = 0.5620535612106323
Validation loss = 0.5680866241455078
Validation loss = 0.6242017149925232
Validation loss = 0.5713241696357727
Validation loss = 0.5062981247901917
Validation loss = 0.5412704944610596
Validation loss = 0.5741636753082275
Validation loss = 0.5302337408065796
Validation loss = 0.5789442658424377
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.9692914485931396
Validation loss = 1.0300081968307495
Validation loss = 0.9591226577758789
Validation loss = 0.8943101167678833
Validation loss = 0.8672030568122864
Validation loss = 0.8977489471435547
Validation loss = 0.9176054000854492
Validation loss = 0.8732958436012268
Validation loss = 0.8381204605102539
Validation loss = 0.7887049317359924
Validation loss = 0.842876672744751
Validation loss = 0.761208713054657
Validation loss = 0.7928661704063416
Validation loss = 0.8252363801002502
Validation loss = 0.8770937919616699
Validation loss = 0.7580686211585999
Validation loss = 0.8258193135261536
Validation loss = 0.8449965715408325
Validation loss = 0.8220710158348083
Validation loss = 0.8509619235992432
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.8428274393081665
Validation loss = 0.7061216831207275
Validation loss = 0.6823918223381042
Validation loss = 0.6811269521713257
Validation loss = 0.7878247499465942
Validation loss = 0.6933851838111877
Validation loss = 0.8000000715255737
Validation loss = 0.7588308453559875
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 923
average number of affinization = 679.4128440366973
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 925
average number of affinization = 681.6454545454545
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 921
average number of affinization = 683.8018018018018
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 918
average number of affinization = 685.8928571428571
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 917
average number of affinization = 687.9380530973451
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 922
average number of affinization = 689.9912280701755
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 635      |
| Iteration     | 17       |
| MaximumReturn | 720      |
| MinimumReturn | 509      |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.609714686870575
Validation loss = 0.7261150479316711
Validation loss = 0.5822980999946594
Validation loss = 0.6765726804733276
Validation loss = 0.6612825989723206
Validation loss = 0.6223752498626709
Validation loss = 0.706704318523407
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.4211197793483734
Validation loss = 0.41126546263694763
Validation loss = 0.4388159215450287
Validation loss = 0.4841475784778595
Validation loss = 0.4776292145252228
Validation loss = 0.4463137090206146
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5546677708625793
Validation loss = 0.5527938008308411
Validation loss = 0.5781847238540649
Validation loss = 0.630185604095459
Validation loss = 0.5704293847084045
Validation loss = 0.6875256299972534
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7303012609481812
Validation loss = 0.7173784971237183
Validation loss = 0.8381202816963196
Validation loss = 0.8045009970664978
Validation loss = 0.9437710642814636
Validation loss = 0.7233866453170776
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7510950565338135
Validation loss = 0.6912792325019836
Validation loss = 0.7873760461807251
Validation loss = 0.736903727054596
Validation loss = 0.6442728042602539
Validation loss = 0.7078657150268555
Validation loss = 0.7175956964492798
Validation loss = 0.6863259077072144
Validation loss = 0.6152024269104004
Validation loss = 0.6916289329528809
Validation loss = 0.6040664315223694
Validation loss = 0.6483610272407532
Validation loss = 0.653421938419342
Validation loss = 0.6504035592079163
Validation loss = 0.5673663020133972
Validation loss = 0.5589070320129395
Validation loss = 0.6486994624137878
Validation loss = 0.5950026512145996
Validation loss = 0.6470026969909668
Validation loss = 0.723132312297821
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 933
average number of affinization = 692.104347826087
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 936
average number of affinization = 694.2068965517242
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 917
average number of affinization = 696.1111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 930
average number of affinization = 698.0932203389831
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 935
average number of affinization = 700.0840336134454
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 930
average number of affinization = 702.0
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 676      |
| Iteration     | 18       |
| MaximumReturn | 777      |
| MinimumReturn | 621      |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.7147337198257446
Validation loss = 0.7145624756813049
Validation loss = 0.777050256729126
Validation loss = 0.7897061705589294
Validation loss = 0.6233386397361755
Validation loss = 0.6736640930175781
Validation loss = 0.6593567728996277
Validation loss = 0.6580888032913208
Validation loss = 0.5694278478622437
Validation loss = 0.7177838683128357
Validation loss = 0.671394944190979
Validation loss = 0.673214316368103
Validation loss = 0.6897515058517456
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5374759435653687
Validation loss = 0.3797440528869629
Validation loss = 0.4546917974948883
Validation loss = 0.4484078884124756
Validation loss = 0.41957932710647583
Validation loss = 0.44400888681411743
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.48369818925857544
Validation loss = 0.5310476422309875
Validation loss = 0.5129881501197815
Validation loss = 0.5928065180778503
Validation loss = 0.48747482895851135
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.751214325428009
Validation loss = 0.7319185733795166
Validation loss = 0.6937256455421448
Validation loss = 0.6556487083435059
Validation loss = 0.7765659689903259
Validation loss = 0.8357139825820923
Validation loss = 0.657564640045166
Validation loss = 0.6563268899917603
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.51177579164505
Validation loss = 0.5263591408729553
Validation loss = 0.6877267360687256
Validation loss = 0.5874837636947632
Validation loss = 0.5377174615859985
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 954
average number of affinization = 704.0826446280992
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 948
average number of affinization = 706.0819672131148
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 943
average number of affinization = 708.0081300813008
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 936
average number of affinization = 709.8467741935484
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 942
average number of affinization = 711.704
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 916
average number of affinization = 713.3253968253969
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 685      |
| Iteration     | 19       |
| MaximumReturn | 788      |
| MinimumReturn | 559      |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6486008763313293
Validation loss = 0.7780376672744751
Validation loss = 0.6127221584320068
Validation loss = 0.587657630443573
Validation loss = 0.7509557604789734
Validation loss = 0.7670019865036011
Validation loss = 0.696082592010498
Validation loss = 0.589224100112915
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.42695480585098267
Validation loss = 0.429932177066803
Validation loss = 0.5185025930404663
Validation loss = 0.4121709167957306
Validation loss = 0.6136523485183716
Validation loss = 0.422181636095047
Validation loss = 0.4573236107826233
Validation loss = 0.47115787863731384
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.6600478887557983
Validation loss = 0.499588280916214
Validation loss = 0.4819948971271515
Validation loss = 0.5156302452087402
Validation loss = 0.5236649513244629
Validation loss = 0.49053874611854553
Validation loss = 0.4487406015396118
Validation loss = 0.5710591077804565
Validation loss = 0.5541775822639465
Validation loss = 0.570641815662384
Validation loss = 0.4725841283798218
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.5901776552200317
Validation loss = 0.7551140785217285
Validation loss = 0.7094771265983582
Validation loss = 0.6056720614433289
Validation loss = 0.7360898852348328
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.5883745551109314
Validation loss = 0.6085748076438904
Validation loss = 0.5272547602653503
Validation loss = 0.5153078436851501
Validation loss = 0.5423451066017151
Validation loss = 0.5548659563064575
Validation loss = 0.6060341596603394
Validation loss = 0.4086844027042389
Validation loss = 0.6394045948982239
Validation loss = 0.6071576476097107
Validation loss = 0.5904618501663208
Validation loss = 0.6385615468025208
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 958
average number of affinization = 715.2519685039371
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 954
average number of affinization = 717.1171875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 956
average number of affinization = 718.968992248062
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 951
average number of affinization = 720.7538461538461
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 957
average number of affinization = 722.5572519083969
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 956
average number of affinization = 724.3257575757576
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 657      |
| Iteration     | 20       |
| MaximumReturn | 715      |
| MinimumReturn | 580      |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5960503816604614
Validation loss = 0.6490427851676941
Validation loss = 0.6239694952964783
Validation loss = 0.6165449619293213
Validation loss = 0.6411245465278625
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.506370484828949
Validation loss = 0.6170511245727539
Validation loss = 0.4350452125072479
Validation loss = 0.5112025737762451
Validation loss = 0.48238953948020935
Validation loss = 0.44409993290901184
Validation loss = 0.5643002390861511
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5055609345436096
Validation loss = 0.48213300108909607
Validation loss = 0.5141414403915405
Validation loss = 0.6126432418823242
Validation loss = 0.5089049935340881
Validation loss = 0.4980659782886505
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7124887704849243
Validation loss = 0.7555487155914307
Validation loss = 0.7576270699501038
Validation loss = 0.7398561835289001
Validation loss = 0.742588460445404
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.6202822327613831
Validation loss = 0.529428243637085
Validation loss = 0.48054006695747375
Validation loss = 0.5630553960800171
Validation loss = 0.4982161819934845
Validation loss = 0.4810284376144409
Validation loss = 0.5276135802268982
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 969
average number of affinization = 726.1654135338346
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 968
average number of affinization = 727.9701492537314
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 963
average number of affinization = 729.7111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 964
average number of affinization = 731.4338235294117
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 972
average number of affinization = 733.1897810218978
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 970
average number of affinization = 734.9057971014493
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 681      |
| Iteration     | 21       |
| MaximumReturn | 780      |
| MinimumReturn | 604      |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5650639533996582
Validation loss = 0.6866247057914734
Validation loss = 0.5927476286888123
Validation loss = 0.6716311573982239
Validation loss = 0.6137561202049255
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5475489497184753
Validation loss = 0.49674269556999207
Validation loss = 0.5220521688461304
Validation loss = 0.5611074566841125
Validation loss = 0.5250716805458069
Validation loss = 0.6240054368972778
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5629602074623108
Validation loss = 0.5390839576721191
Validation loss = 0.5081281065940857
Validation loss = 0.5529859066009521
Validation loss = 0.5341687798500061
Validation loss = 0.5028631687164307
Validation loss = 0.5139738321304321
Validation loss = 0.617152750492096
Validation loss = 0.5435956716537476
Validation loss = 0.5455129146575928
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6995616555213928
Validation loss = 0.7686008214950562
Validation loss = 0.5248768925666809
Validation loss = 0.5680035948753357
Validation loss = 0.6829521059989929
Validation loss = 0.6899958252906799
Validation loss = 0.6077833771705627
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.49526894092559814
Validation loss = 0.4849877953529358
Validation loss = 0.4661172926425934
Validation loss = 0.4649013578891754
Validation loss = 0.403145968914032
Validation loss = 0.4100649952888489
Validation loss = 0.4564962685108185
Validation loss = 0.5143407583236694
Validation loss = 0.46869510412216187
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 958
average number of affinization = 736.5107913669065
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 959
average number of affinization = 738.1
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 965
average number of affinization = 739.7092198581561
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 958
average number of affinization = 741.2464788732394
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 957
average number of affinization = 742.7552447552448
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 968
average number of affinization = 744.3194444444445
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 625      |
| Iteration     | 22       |
| MaximumReturn | 758      |
| MinimumReturn | 542      |
| TotalSamples  | 96000    |
----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5176213383674622
Validation loss = 0.5294969081878662
Validation loss = 0.7218901515007019
Validation loss = 0.7007231712341309
Validation loss = 0.6709017753601074
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.5455065965652466
Validation loss = 0.4949463903903961
Validation loss = 0.5122777819633484
Validation loss = 0.44329074025154114
Validation loss = 0.47241339087486267
Validation loss = 0.6095206141471863
Validation loss = 0.48168322443962097
Validation loss = 0.5185083746910095
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.665005624294281
Validation loss = 0.503807544708252
Validation loss = 0.519737720489502
Validation loss = 0.5798527002334595
Validation loss = 0.5384461283683777
Validation loss = 0.5977412462234497
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.7772032618522644
Validation loss = 0.5382360816001892
Validation loss = 0.564763605594635
Validation loss = 0.5595518350601196
Validation loss = 0.5943701863288879
Validation loss = 0.5218349099159241
Validation loss = 0.6186268925666809
Validation loss = 0.5716257691383362
Validation loss = 0.5778793692588806
Validation loss = 0.5871120095252991
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.46116435527801514
Validation loss = 0.48770225048065186
Validation loss = 0.4423139989376068
Validation loss = 0.4758037030696869
Validation loss = 0.4347783625125885
Validation loss = 0.4383876323699951
Validation loss = 0.4517877995967865
Validation loss = 0.4425308406352997
Validation loss = 0.4748227298259735
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 967
average number of affinization = 745.8551724137931
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 976
average number of affinization = 747.431506849315
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 955
average number of affinization = 748.843537414966
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 961
average number of affinization = 750.277027027027
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 959
average number of affinization = 751.6778523489933
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 965
average number of affinization = 753.1
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 641      |
| Iteration     | 23       |
| MaximumReturn | 781      |
| MinimumReturn | 537      |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.5067375898361206
Validation loss = 0.5527873039245605
Validation loss = 0.6582481265068054
Validation loss = 0.677117109298706
Validation loss = 0.7730312347412109
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6289103031158447
Validation loss = 0.504492998123169
Validation loss = 0.6057771444320679
Validation loss = 0.5383846759796143
Validation loss = 0.579254150390625
Validation loss = 0.48998329043388367
Validation loss = 0.6181079745292664
Validation loss = 0.4761071503162384
Validation loss = 0.5222223401069641
Validation loss = 0.47966647148132324
Validation loss = 0.5258885025978088
Validation loss = 0.6080927848815918
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.4386630654335022
Validation loss = 0.4668630361557007
Validation loss = 0.42888954281806946
Validation loss = 0.4598478078842163
Validation loss = 0.49525150656700134
Validation loss = 0.5250201225280762
Validation loss = 0.46370193362236023
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.6433419585227966
Validation loss = 0.5309770703315735
Validation loss = 0.5464707016944885
Validation loss = 0.5372097492218018
Validation loss = 0.576541543006897
Validation loss = 0.5131009817123413
Validation loss = 0.6122082471847534
Validation loss = 0.4806850850582123
Validation loss = 0.5123946666717529
Validation loss = 0.4358994662761688
Validation loss = 0.5002294778823853
Validation loss = 0.50469970703125
Validation loss = 0.4446878135204315
Validation loss = 0.4777251183986664
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.36740025877952576
Validation loss = 0.4589890241622925
Validation loss = 0.37850692868232727
Validation loss = 0.4456634521484375
Validation loss = 0.5087230801582336
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 966
average number of affinization = 754.5099337748344
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 964
average number of affinization = 755.8881578947369
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 954
average number of affinization = 757.1830065359477
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 964
average number of affinization = 758.525974025974
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 964
average number of affinization = 759.8516129032258
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 963
average number of affinization = 761.1538461538462
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 636      |
| Iteration     | 24       |
| MaximumReturn | 727      |
| MinimumReturn | 523      |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6802611947059631
Validation loss = 0.5248768329620361
Validation loss = 0.5379619002342224
Validation loss = 0.616252601146698
Validation loss = 0.5859934091567993
Validation loss = 0.6670786738395691
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.520437479019165
Validation loss = 0.5225443840026855
Validation loss = 0.5225213170051575
Validation loss = 0.521151602268219
Validation loss = 0.5259724259376526
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.42067840695381165
Validation loss = 0.4535832107067108
Validation loss = 0.482943058013916
Validation loss = 0.5838128924369812
Validation loss = 0.43294069170951843
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.46485254168510437
Validation loss = 0.5322710871696472
Validation loss = 0.38827842473983765
Validation loss = 0.43039506673812866
Validation loss = 0.4344242215156555
Validation loss = 0.44603151082992554
Validation loss = 0.45412173867225647
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.37262779474258423
Validation loss = 0.44924354553222656
Validation loss = 0.6041867733001709
Validation loss = 0.5569809079170227
Validation loss = 0.4618861973285675
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 971
average number of affinization = 762.4904458598726
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 960
average number of affinization = 763.7405063291139
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 967
average number of affinization = 765.0188679245283
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 972
average number of affinization = 766.3125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 970
average number of affinization = 767.5776397515529
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 961
average number of affinization = 768.7716049382716
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 663      |
| Iteration     | 25       |
| MaximumReturn | 812      |
| MinimumReturn | 557      |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6572394967079163
Validation loss = 0.6065574288368225
Validation loss = 0.5244727730751038
Validation loss = 0.5984203219413757
Validation loss = 0.6047626733779907
Validation loss = 0.6064630150794983
Validation loss = 0.5435115098953247
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.6082088351249695
Validation loss = 0.5370511412620544
Validation loss = 0.5100146532058716
Validation loss = 0.48313048481941223
Validation loss = 0.540278971195221
Validation loss = 0.5762860178947449
Validation loss = 0.5177167057991028
Validation loss = 0.42979562282562256
Validation loss = 0.5200064778327942
Validation loss = 0.6421399116516113
Validation loss = 0.6814293265342712
Validation loss = 0.5600385069847107
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.5005101561546326
Validation loss = 0.43370339274406433
Validation loss = 0.4997018575668335
Validation loss = 0.4347623288631439
Validation loss = 0.47048765420913696
Validation loss = 0.4808536767959595
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.380257785320282
Validation loss = 0.3843936324119568
Validation loss = 0.40140166878700256
Validation loss = 0.3928121328353882
Validation loss = 0.37079131603240967
Validation loss = 0.3846553564071655
Validation loss = 0.3556808829307556
Validation loss = 0.41495510935783386
Validation loss = 0.40496528148651123
Validation loss = 0.4114397466182709
Validation loss = 0.3330135941505432
Validation loss = 0.35299739241600037
Validation loss = 0.35979193449020386
Validation loss = 0.3317960500717163
Validation loss = 0.35139337182044983
Validation loss = 0.3756351172924042
Validation loss = 0.4354459345340729
Validation loss = 0.3567988872528076
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.3983488380908966
Validation loss = 0.48218923807144165
Validation loss = 0.42436280846595764
Validation loss = 0.4485411047935486
Validation loss = 0.3647876977920532
Validation loss = 0.46775907278060913
Validation loss = 0.46863657236099243
Validation loss = 0.535224199295044
Validation loss = 0.4753025770187378
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 782
average number of affinization = 768.8527607361963
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 953
average number of affinization = 769.9756097560976
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 961
average number of affinization = 771.1333333333333
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 951
average number of affinization = 772.2168674698795
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 953
average number of affinization = 773.2994011976048
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 962
average number of affinization = 774.422619047619
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 524      |
| Iteration     | 26       |
| MaximumReturn | 665      |
| MinimumReturn | -39.7    |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12268747389316559
Validation loss = 0.11791931092739105
Validation loss = 0.11957231909036636
Validation loss = 0.11876039952039719
Validation loss = 0.1188860610127449
Validation loss = 0.11965373903512955
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1222848892211914
Validation loss = 0.1181565448641777
Validation loss = 0.11982309818267822
Validation loss = 0.11973849684000015
Validation loss = 0.11942408233880997
Validation loss = 0.11934749782085419
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12242455780506134
Validation loss = 0.11868592351675034
Validation loss = 0.11892779916524887
Validation loss = 0.11932722479104996
Validation loss = 0.12012699991464615
Validation loss = 0.1181463748216629
Validation loss = 0.11851869523525238
Validation loss = 0.12022106349468231
Validation loss = 0.11957966536283493
Validation loss = 0.1192830428481102
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12258923053741455
Validation loss = 0.11870364844799042
Validation loss = 0.11851288378238678
Validation loss = 0.12025526911020279
Validation loss = 0.11850929260253906
Validation loss = 0.12016547471284866
Validation loss = 0.11932815611362457
Validation loss = 0.11866645514965057
Validation loss = 0.11952557414770126
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1221931204199791
Validation loss = 0.11993338167667389
Validation loss = 0.11959613859653473
Validation loss = 0.11887265741825104
Validation loss = 0.11857195198535919
Validation loss = 0.11926836520433426
Validation loss = 0.11864548921585083
Validation loss = 0.11962414532899857
Validation loss = 0.118431456387043
Validation loss = 0.11950041353702545
Validation loss = 0.1187470480799675
Validation loss = 0.1185191199183464
Validation loss = 0.11972998082637787
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 957
average number of affinization = 775.5029585798817
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 959
average number of affinization = 776.5823529411765
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 963
average number of affinization = 777.672514619883
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 862
average number of affinization = 778.1627906976744
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 956
average number of affinization = 779.1907514450867
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 969
average number of affinization = 780.2816091954023
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 520      |
| Iteration     | 27       |
| MaximumReturn | 636      |
| MinimumReturn | 5.1      |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11982032656669617
Validation loss = 0.11793369054794312
Validation loss = 0.11990009993314743
Validation loss = 0.119637630879879
Validation loss = 0.11828450858592987
Validation loss = 0.11990223079919815
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11942024528980255
Validation loss = 0.11822161823511124
Validation loss = 0.11832745373249054
Validation loss = 0.11998089402914047
Validation loss = 0.11854977905750275
Validation loss = 0.11901801824569702
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12178889662027359
Validation loss = 0.1190953403711319
Validation loss = 0.11895765364170074
Validation loss = 0.11856964230537415
Validation loss = 0.11880341917276382
Validation loss = 0.12000183761119843
Validation loss = 0.11954721063375473
Validation loss = 0.11867768317461014
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11981319636106491
Validation loss = 0.11918380111455917
Validation loss = 0.11843154579401016
Validation loss = 0.11940524727106094
Validation loss = 0.11951958388090134
Validation loss = 0.11946050822734833
Validation loss = 0.11936447024345398
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1202596053481102
Validation loss = 0.1174096018075943
Validation loss = 0.11713391542434692
Validation loss = 0.11878256499767303
Validation loss = 0.11827459186315536
Validation loss = 0.11835585534572601
Validation loss = 0.11931312829256058
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 960
average number of affinization = 781.3085714285714
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 974
average number of affinization = 782.4034090909091
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 962
average number of affinization = 783.4180790960452
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 961
average number of affinization = 784.4157303370787
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 955
average number of affinization = 785.3687150837989
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 957
average number of affinization = 786.3222222222222
Done generating on-policy rollouts.
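
The two affinization lines printed for every path are a per-path count and a cumulative running mean. The numbers show the mean is taken over every path generated since the start of the run (including the six initial random rollouts, whose counts were 0): updating the value printed after Path 0 of iteration 28 with Path 1's count of 974 reproduces the next printed value exactly. A small sketch of that bookkeeping:

```python
class AffinizationStats:
    """Cumulative running mean of per-path affinization counts."""
    def __init__(self):
        self.n_paths = 0
        self.total = 0

    def update(self, count):
        self.n_paths += 1
        self.total += count
        return self.total / self.n_paths

stats = AffinizationStats()
# State implied by the log at iteration 28, Path 0: 175 paths so far with a
# mean of 781.3085714285714, i.e. a cumulative count of 136729.
stats.n_paths, stats.total = 175, 136729
print(stats.update(974))   # -> 782.4034090909091, matching the Path 1 line
```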
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 705      |
| Iteration     | 28       |
| MaximumReturn | 800      |
| MinimumReturn | 629      |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11899808049201965
Validation loss = 0.11734484136104584
Validation loss = 0.11937304586172104
Validation loss = 0.11861538141965866
Validation loss = 0.11925185471773148
Validation loss = 0.1186647042632103
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11874499171972275
Validation loss = 0.11833677440881729
Validation loss = 0.11830516159534454
Validation loss = 0.11948313564062119
Validation loss = 0.1186935156583786
Validation loss = 0.11949147284030914
Validation loss = 0.11950253695249557
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.12025369703769684
Validation loss = 0.11720695346593857
Validation loss = 0.11796031892299652
Validation loss = 0.1182379499077797
Validation loss = 0.11848223209381104
Validation loss = 0.11929085850715637
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.1193695217370987
Validation loss = 0.11792025715112686
Validation loss = 0.11877430975437164
Validation loss = 0.11844253540039062
Validation loss = 0.1194957047700882
Validation loss = 0.11924442648887634
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11856227368116379
Validation loss = 0.11813893169164658
Validation loss = 0.1180315762758255
Validation loss = 0.11780259013175964
Validation loss = 0.11879434436559677
Validation loss = 0.11813859641551971
Validation loss = 0.11807072162628174
Validation loss = 0.11900218576192856
Done fitting dynamics.
Updating randomness.
Done updating randomness.
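
The log does not say what "Updating randomness." refreshes. Given the particle-ensemble settings in the printed config (particles: 5 over an ensemble of 5 models), one plausible reading is a trajectory-sampling-style re-draw of which ensemble member each particle follows; the sketch below is that guess, not the repo's implementation.

```python
import numpy as np

def update_randomness(n_particles=5, n_models=5, rng=None):
    """Re-draw which ensemble member each particle follows (an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.integers(n_models, size=n_particles)

print(update_randomness())   # e.g. [2 0 4 4 1]
```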
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 967
average number of affinization = 787.3204419889503
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 963
average number of affinization = 788.2857142857143
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 970
average number of affinization = 789.2786885245902
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 967
average number of affinization = 790.2445652173913
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 962
average number of affinization = 791.172972972973
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 966
average number of affinization = 792.1129032258065
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
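
The "Updating normalization." step is likewise not expanded in the log. A common scheme in model-based RL code of this kind, and the assumption made here, is to recompute per-dimension means and standard deviations of observations, actions, and state deltas over the aggregated dataset after each batch of on-policy rollouts, so that dynamics inputs and targets stay roughly unit-scale:

```python
import numpy as np

def compute_normalization(obs, acts, next_obs, eps=1e-8):
    """Per-dimension mean/std for observations, actions, and state deltas."""
    deltas = next_obs - obs
    return {name: (arr.mean(axis=0), arr.std(axis=0) + eps)
            for name, arr in (("obs", obs), ("acts", acts), ("deltas", deltas))}

def normalize(x, mean_std):
    mean, std = mean_std
    return (x - mean) / std
```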
----------------------------
| AverageReturn | 684      |
| Iteration     | 29       |
| MaximumReturn | 755      |
| MinimumReturn | 585      |
| TotalSamples  | 124000   |
----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11982722580432892
Validation loss = 0.11807435005903244
Validation loss = 0.11887529492378235
Validation loss = 0.11963871121406555
Validation loss = 0.11873941123485565
Validation loss = 0.11891638487577438
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11974427103996277
Validation loss = 0.11748799681663513
Validation loss = 0.11923623830080032
Validation loss = 0.11873365193605423
Validation loss = 0.1191747710108757
Validation loss = 0.11890294402837753
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1191912293434143
Validation loss = 0.11667515337467194
Validation loss = 0.11821652203798294
Validation loss = 0.11824917048215866
Validation loss = 0.11790775507688522
Validation loss = 0.1181568130850792
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11882144957780838
Validation loss = 0.11790142953395844
Validation loss = 0.11895962059497833
Validation loss = 0.118759386241436
Validation loss = 0.11802897602319717
Validation loss = 0.11872381716966629
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11817049235105515
Validation loss = 0.11809656769037247
Validation loss = 0.11921878904104233
Validation loss = 0.1183646097779274
Validation loss = 0.1180352121591568
Validation loss = 0.11775603145360947
Validation loss = 0.11700387299060822
Validation loss = 0.11833010613918304
Validation loss = 0.11754456162452698
Validation loss = 0.11819732934236526
Validation loss = 0.1180993914604187
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 959
average number of affinization = 793.0053475935829
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 967
average number of affinization = 793.9308510638298
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 958
average number of affinization = 794.7989417989418
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 955
average number of affinization = 795.6421052631579
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 962
average number of affinization = 796.5130890052357
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 953
average number of affinization = 797.328125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 699      |
| Iteration     | 30       |
| MaximumReturn | 784      |
| MinimumReturn | 645      |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11822670698165894
Validation loss = 0.11731927841901779
Validation loss = 0.11870969831943512
Validation loss = 0.11840490996837616
Validation loss = 0.11849339306354523
Validation loss = 0.11831855773925781
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11841093003749847
Validation loss = 0.11625076085329056
Validation loss = 0.11825048923492432
Validation loss = 0.11817100644111633
Validation loss = 0.11866071820259094
Validation loss = 0.11842940747737885
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11891879141330719
Validation loss = 0.1164732426404953
Validation loss = 0.11942363530397415
Validation loss = 0.117747962474823
Validation loss = 0.11772896349430084
Validation loss = 0.11712946742773056
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11870003491640091
Validation loss = 0.11803044378757477
Validation loss = 0.11864415556192398
Validation loss = 0.1177205741405487
Validation loss = 0.11870076507329941
Validation loss = 0.11836117506027222
Validation loss = 0.11873272061347961
Validation loss = 0.11919437348842621
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.11788354068994522
Validation loss = 0.11771077662706375
Validation loss = 0.11795933544635773
Validation loss = 0.11821938306093216
Validation loss = 0.1185891404747963
Validation loss = 0.11757206916809082
Validation loss = 0.11864229291677475
Validation loss = 0.11760015785694122
Validation loss = 0.11828180402517319
Validation loss = 0.1173471063375473
Validation loss = 0.11835266649723053
Validation loss = 0.11664178222417831
Validation loss = 0.1176607608795166
Validation loss = 0.11725634336471558
Validation loss = 0.11907461285591125
Validation loss = 0.11680664122104645
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 954
average number of affinization = 798.139896373057
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 884
average number of affinization = 798.5824742268042
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 958
average number of affinization = 799.4
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 963
average number of affinization = 800.234693877551
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 958
average number of affinization = 801.0355329949239
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 959
average number of affinization = 801.8333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 548      |
| Iteration     | 31       |
| MaximumReturn | 762      |
| MinimumReturn | -239     |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.11807390302419662
Validation loss = 0.11720971018075943
Validation loss = 0.11740902811288834
Validation loss = 0.11787090450525284
Validation loss = 0.11815465986728668
Validation loss = 0.11818160116672516
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11792239546775818
Validation loss = 0.11727093160152435
Validation loss = 0.1172836422920227
Validation loss = 0.11714760959148407
Validation loss = 0.11755393445491791
Validation loss = 0.11677553504705429
Validation loss = 0.11867231130599976
Validation loss = 0.11724062263965607
Validation loss = 0.1178627535700798
Validation loss = 0.11759385466575623
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11845340579748154
Validation loss = 0.1166268065571785
Validation loss = 0.11763526499271393
Validation loss = 0.11824484169483185
Validation loss = 0.11773938685655594
Validation loss = 0.11713948100805283
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.11823363602161407
Validation loss = 0.11721338331699371
Validation loss = 0.11772382259368896
Validation loss = 0.11749769002199173
Validation loss = 0.1182665079832077
Validation loss = 0.11726875603199005
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1179603636264801
Validation loss = 0.11662069708108902
Validation loss = 0.11710010468959808
Validation loss = 0.11731734126806259
Validation loss = 0.11783101409673691
Validation loss = 0.11823303997516632
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 3 is 960
average number of affinization = 802.6281407035176
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 3 is 951
average number of affinization = 803.37
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 3 is 957
average number of affinization = 804.1343283582089
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 3 is 950
average number of affinization = 804.8564356435644
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 3 is 953
average number of affinization = 805.5862068965517
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 3 is 955
average number of affinization = 806.3186274509804
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 600      |
| Iteration     | 32       |
| MaximumReturn | 733      |
| MinimumReturn | 420      |
| TotalSamples  | 136000   |
----------------------------
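
With onpol_iters = 33 and start_onpol_iter = 0 in the printed config, iteration 32 above is the last iteration configured for the run. A small helper like the one below (an illustration, not part of the repo) can pull the per-iteration summaries back out of a saved copy of this log, for example to plot AverageReturn against Iteration:

```python
import re

def parse_returns(log_path):
    """Return (iteration, average_return) pairs from the boxed summary tables."""
    rows, iteration, avg = [], None, None
    pattern = re.compile(r"\|\s*(\w+)\s*\|\s*(-?[\d.]+)\s*\|")
    with open(log_path) as f:
        for line in f:
            m = pattern.match(line.strip())
            if not m:
                continue
            key, val = m.group(1), float(m.group(2))
            if key == "Iteration":
                iteration = int(val)
            elif key == "AverageReturn":
                avg = val
            if iteration is not None and avg is not None:
                rows.append((iteration, avg))
                iteration, avg = None, None
    return rows
```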
