Logging to experiments/hopper/nov1/w350e03_seed2431
Print configuration .....
{'env_name': 'hopper', 'random_seeds': [1234, 2431, 2531, 2231], 'save_variables': False, 'model_save_dir': '/tmp/hopper_models/', 'restore_variables': False, 'start_onpol_iter': 0, 'onpol_iters': 33, 'num_path_random': 6, 'num_path_onpol': 6, 'env_horizon': 1000, 'max_train_data': 200000, 'max_val_data': 100000, 'discard_ratio': 0.0, 'dynamics': {'pre_training': {'mode': 'intrinsic_reward', 'itr': 0, 'policy_itr': 20}, 'model': 'nn', 'ensemble': True, 'ensemble_model_count': 5, 'enable_particle_ensemble': True, 'particles': 5, 'obs_var': 1.0, 'intrinsic_reward_coeff': 1.0, 'ita': 1.0, 'mode': 'random', 'val': True, 'n_layers': 4, 'hidden_size': 1000, 'activation': 'relu', 'batch_size': 1000, 'learning_rate': 0.001, 'reg_coeff': 0.0, 'epochs': 200, 'kfac_params': {'learning_rate': 0.1, 'damping': 0.001, 'momentum': 0.9, 'kl_clip': 0.0001, 'cov_ema_decay': 0.99}}, 'policy': {'network_shape': [64, 64], 'init_logstd': 0.0, 'activation': 'tanh', 'reinitialize_every_itr': False}, 'trpo': {'horizon': 1000, 'gamma': 0.99, 'step_size': 0.01, 'iterations': 20, 'batch_size': 50000, 'gae': 0.95, 'visualization': False, 'visualize_iterations': [0]}, 'algo': 'trpo'}
Generating random rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 0
average number of affinization = 0.0
Done generating random rollouts.
Creating normalization for training data.
Done creating normalization for training data.
Particle ensemble enabled? True
An ensemble of 5 dynamics model <class 'model.dynamics.NNDynamicsModel'> initialized
Train dynamics model with intrinsic reward only? False
Pre-training enabled. Using only intrinsic reward.
Pre-training dynamics model for 0 iterations...
Done pre-training dynamics model.
Using external reward only.
itr #0 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.6672388315200806
Validation loss = 0.2849127948284149
Validation loss = 0.2601998448371887
Validation loss = 0.230267733335495
Validation loss = 0.22403952479362488
Validation loss = 0.23425838351249695
Validation loss = 0.2220500111579895
Validation loss = 0.24656620621681213
Validation loss = 0.2269115149974823
Validation loss = 0.23550088703632355
Validation loss = 0.23919862508773804
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.8946794271469116
Validation loss = 0.2840608060359955
Validation loss = 0.2544397711753845
Validation loss = 0.23336100578308105
Validation loss = 0.22750046849250793
Validation loss = 0.22203010320663452
Validation loss = 0.22714385390281677
Validation loss = 0.2267189621925354
Validation loss = 0.22960585355758667
Validation loss = 0.2403540313243866
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.677655816078186
Validation loss = 0.28419923782348633
Validation loss = 0.26392650604248047
Validation loss = 0.2330925166606903
Validation loss = 0.22577393054962158
Validation loss = 0.2344713807106018
Validation loss = 0.2342512607574463
Validation loss = 0.22724798321723938
Validation loss = 0.23731805384159088
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.815540075302124
Validation loss = 0.28555595874786377
Validation loss = 0.2610437870025635
Validation loss = 0.2317143678665161
Validation loss = 0.22922571003437042
Validation loss = 0.223149374127388
Validation loss = 0.2354186773300171
Validation loss = 0.26098278164863586
Validation loss = 0.2632373571395874
Validation loss = 0.23573251068592072
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.7177366018295288
Validation loss = 0.281868040561676
Validation loss = 0.24291764199733734
Validation loss = 0.22430574893951416
Validation loss = 0.2219550609588623
Validation loss = 0.22121866047382355
Validation loss = 0.23196539282798767
Validation loss = 0.22570599615573883
Validation loss = 0.22968924045562744
Validation loss = 0.22635474801063538
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 300
average number of affinization = 42.857142857142854
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 276
average number of affinization = 72.0
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 102.11111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 238
average number of affinization = 115.7
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 349
average number of affinization = 136.9090909090909
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 280
average number of affinization = 148.83333333333334
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.54e+03 |
| Iteration     | 0         |
| MaximumReturn | -1.97e+03 |
| MinimumReturn | -3.17e+03 |
| TotalSamples  | 8000      |
-----------------------------
itr #1 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.28687432408332825
Validation loss = 0.2407129853963852
Validation loss = 0.23629839718341827
Validation loss = 0.2440134435892105
Validation loss = 0.23774714767932892
Validation loss = 0.2374112606048584
Validation loss = 0.23489908874034882
Validation loss = 0.2482835203409195
Validation loss = 0.23433008790016174
Validation loss = 0.25617024302482605
Validation loss = 0.25760403275489807
Validation loss = 0.2663828730583191
Validation loss = 0.2519805133342743
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.31100305914878845
Validation loss = 0.2411111444234848
Validation loss = 0.23979756236076355
Validation loss = 0.23849941790103912
Validation loss = 0.2530105710029602
Validation loss = 0.24895945191383362
Validation loss = 0.2462727576494217
Validation loss = 0.2371351420879364
Validation loss = 0.24513548612594604
Validation loss = 0.2470962107181549
Validation loss = 0.24505990743637085
Validation loss = 0.2660095989704132
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.28476622700691223
Validation loss = 0.2426569163799286
Validation loss = 0.23815739154815674
Validation loss = 0.2332381159067154
Validation loss = 0.22519850730895996
Validation loss = 0.23963820934295654
Validation loss = 0.22920075058937073
Validation loss = 0.23670193552970886
Validation loss = 0.25233063101768494
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.27812322974205017
Validation loss = 0.24094930291175842
Validation loss = 0.23273718357086182
Validation loss = 0.24742290377616882
Validation loss = 0.24287644028663635
Validation loss = 0.234201118350029
Validation loss = 0.23431693017482758
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.27890545129776
Validation loss = 0.22897550463676453
Validation loss = 0.2292570173740387
Validation loss = 0.22827588021755219
Validation loss = 0.22945791482925415
Validation loss = 0.22309324145317078
Validation loss = 0.22986742854118347
Validation loss = 0.2289113700389862
Validation loss = 0.25064077973365784
Validation loss = 0.23068146407604218
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 623
average number of affinization = 185.30769230769232
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 619
average number of affinization = 216.28571428571428
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 606
average number of affinization = 242.26666666666668
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 623
average number of affinization = 266.0625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 581
average number of affinization = 284.5882352941176
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 631
average number of affinization = 303.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -2.08e+03 |
| Iteration     | 1         |
| MaximumReturn | -1.81e+03 |
| MinimumReturn | -2.21e+03 |
| TotalSamples  | 12000     |
-----------------------------
itr #2 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.24998997151851654
Validation loss = 0.21632151305675507
Validation loss = 0.2440006136894226
Validation loss = 0.23563982546329498
Validation loss = 0.22544986009597778
Validation loss = 0.23464109003543854
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.25893765687942505
Validation loss = 0.23361897468566895
Validation loss = 0.24077220261096954
Validation loss = 0.2330244779586792
Validation loss = 0.24373559653759003
Validation loss = 0.2415911704301834
Validation loss = 0.24911600351333618
Validation loss = 0.24338024854660034
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.23768384754657745
Validation loss = 0.21301491558551788
Validation loss = 0.2162729948759079
Validation loss = 0.23087255656719208
Validation loss = 0.22574539482593536
Validation loss = 0.24093331396579742
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.24988819658756256
Validation loss = 0.21748284995555878
Validation loss = 0.22938741743564606
Validation loss = 0.24022531509399414
Validation loss = 0.23394675552845
Validation loss = 0.24112963676452637
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2500493824481964
Validation loss = 0.2309103012084961
Validation loss = 0.22887511551380157
Validation loss = 0.22260601818561554
Validation loss = 0.21952782571315765
Validation loss = 0.23118408024311066
Validation loss = 0.23166511952877045
Validation loss = 0.2267807275056839
Validation loss = 0.2238355427980423
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 544
average number of affinization = 316.4736842105263
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 565
average number of affinization = 328.9
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 530
average number of affinization = 338.4761904761905
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 532
average number of affinization = 347.27272727272725
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 590
average number of affinization = 357.82608695652175
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 565
average number of affinization = 366.4583333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -1.89e+03 |
| Iteration     | 2         |
| MaximumReturn | -1.5e+03  |
| MinimumReturn | -2.69e+03 |
| TotalSamples  | 16000     |
-----------------------------
itr #3 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22528041899204254
Validation loss = 0.22656238079071045
Validation loss = 0.22379368543624878
Validation loss = 0.23682114481925964
Validation loss = 0.24003952741622925
Validation loss = 0.2247515767812729
Validation loss = 0.24015814065933228
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2188475877046585
Validation loss = 0.23319929838180542
Validation loss = 0.24125468730926514
Validation loss = 0.23267348110675812
Validation loss = 0.2322753667831421
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.2269887775182724
Validation loss = 0.21417316794395447
Validation loss = 0.21993771195411682
Validation loss = 0.21466751396656036
Validation loss = 0.21738095581531525
Validation loss = 0.2326297163963318
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.22842475771903992
Validation loss = 0.22134873270988464
Validation loss = 0.2083498239517212
Validation loss = 0.22911670804023743
Validation loss = 0.22397729754447937
Validation loss = 0.22445586323738098
Validation loss = 0.2202019840478897
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.2163040190935135
Validation loss = 0.21503743529319763
Validation loss = 0.22201283276081085
Validation loss = 0.21177488565444946
Validation loss = 0.22206348180770874
Validation loss = 0.23828822374343872
Validation loss = 0.22325509786605835
Validation loss = 0.2220015823841095
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 440
average number of affinization = 369.4
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 519
average number of affinization = 375.15384615384613
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 464
average number of affinization = 378.44444444444446
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 380.60714285714283
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 383.2068965517241
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 449
average number of affinization = 385.4
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -759      |
| Iteration     | 3         |
| MaximumReturn | -535      |
| MinimumReturn | -1.17e+03 |
| TotalSamples  | 20000     |
-----------------------------
itr #4 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.22825416922569275
Validation loss = 0.19706302881240845
Validation loss = 0.20149333775043488
Validation loss = 0.20397424697875977
Validation loss = 0.20639602839946747
Validation loss = 0.19764958322048187
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.2122037708759308
Validation loss = 0.20551025867462158
Validation loss = 0.21949680149555206
Validation loss = 0.20007415115833282
Validation loss = 0.19792985916137695
Validation loss = 0.2122540920972824
Validation loss = 0.20710568130016327
Validation loss = 0.20645840466022491
Validation loss = 0.21154184639453888
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1998782604932785
Validation loss = 0.200609490275383
Validation loss = 0.19628940522670746
Validation loss = 0.19086547195911407
Validation loss = 0.19556280970573425
Validation loss = 0.19387999176979065
Validation loss = 0.19556479156017303
Validation loss = 0.19401291012763977
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21027779579162598
Validation loss = 0.21562433242797852
Validation loss = 0.20547457039356232
Validation loss = 0.20648804306983948
Validation loss = 0.199673593044281
Validation loss = 0.22566774487495422
Validation loss = 0.19806678593158722
Validation loss = 0.20414869487285614
Validation loss = 0.2017616331577301
Validation loss = 0.20531317591667175
Validation loss = 0.22392046451568604
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.21893839538097382
Validation loss = 0.19388467073440552
Validation loss = 0.20599427819252014
Validation loss = 0.1922634094953537
Validation loss = 0.20526453852653503
Validation loss = 0.1979866921901703
Validation loss = 0.19213935732841492
Validation loss = 0.1900893896818161
Validation loss = 0.19422568380832672
Validation loss = 0.1977892816066742
Validation loss = 0.19207994639873505
Validation loss = 0.20073020458221436
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 428
average number of affinization = 386.7741935483871
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 418
average number of affinization = 387.75
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 512
average number of affinization = 391.5151515151515
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 393.7647058823529
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 444
average number of affinization = 395.2
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 406
average number of affinization = 395.5
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -898      |
| Iteration     | 4         |
| MaximumReturn | 197       |
| MinimumReturn | -1.58e+03 |
| TotalSamples  | 24000     |
-----------------------------
itr #5 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.1802053302526474
Validation loss = 0.1740461140871048
Validation loss = 0.1703510731458664
Validation loss = 0.17117969691753387
Validation loss = 0.17314253747463226
Validation loss = 0.1703326255083084
Validation loss = 0.1668357402086258
Validation loss = 0.16859208047389984
Validation loss = 0.17192578315734863
Validation loss = 0.16570541262626648
Validation loss = 0.16949254274368286
Validation loss = 0.16586096584796906
Validation loss = 0.1669526845216751
Validation loss = 0.1667717546224594
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18693609535694122
Validation loss = 0.18152306973934174
Validation loss = 0.17996959388256073
Validation loss = 0.17338205873966217
Validation loss = 0.1745815873146057
Validation loss = 0.17150233685970306
Validation loss = 0.16560202836990356
Validation loss = 0.17960543930530548
Validation loss = 0.17302052676677704
Validation loss = 0.17729024589061737
Validation loss = 0.17281168699264526
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1860980987548828
Validation loss = 0.1785195916891098
Validation loss = 0.17223428189754486
Validation loss = 0.16524378955364227
Validation loss = 0.16285012662410736
Validation loss = 0.1652035117149353
Validation loss = 0.160130113363266
Validation loss = 0.1646503359079361
Validation loss = 0.16290144622325897
Validation loss = 0.16172519326210022
Validation loss = 0.16217149794101715
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.18384122848510742
Validation loss = 0.183247372508049
Validation loss = 0.17216815054416656
Validation loss = 0.16484279930591583
Validation loss = 0.16624630987644196
Validation loss = 0.1696159392595291
Validation loss = 0.17097532749176025
Validation loss = 0.17607586085796356
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.18210850656032562
Validation loss = 0.17952872812747955
Validation loss = 0.17774808406829834
Validation loss = 0.16603334248065948
Validation loss = 0.17042194306850433
Validation loss = 0.16397635638713837
Validation loss = 0.16231314837932587
Validation loss = 0.16044048964977264
Validation loss = 0.16210304200649261
Validation loss = 0.16029800474643707
Validation loss = 0.16887454688549042
Validation loss = 0.16396911442279816
Validation loss = 0.16609735786914825
Validation loss = 0.16275371611118317
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 525
average number of affinization = 399.0
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 447
average number of affinization = 400.2631578947368
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 400.94871794871796
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 402.325
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 416
average number of affinization = 402.6585365853659
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 403.9761904761905
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | -585      |
| Iteration     | 5         |
| MaximumReturn | 112       |
| MinimumReturn | -1.64e+03 |
| TotalSamples  | 28000     |
-----------------------------
itr #6 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.18595555424690247
Validation loss = 0.15727770328521729
Validation loss = 0.15431277453899384
Validation loss = 0.14997079968452454
Validation loss = 0.1452799141407013
Validation loss = 0.14790520071983337
Validation loss = 0.15366102755069733
Validation loss = 0.15489284694194794
Validation loss = 0.1535055935382843
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.18467532098293304
Validation loss = 0.16798661649227142
Validation loss = 0.1786392778158188
Validation loss = 0.15969309210777283
Validation loss = 0.15661828219890594
Validation loss = 0.15527431666851044
Validation loss = 0.1530786007642746
Validation loss = 0.15902265906333923
Validation loss = 0.15223003923892975
Validation loss = 0.1585719883441925
Validation loss = 0.15088745951652527
Validation loss = 0.1532062292098999
Validation loss = 0.1533907651901245
Validation loss = 0.15316011011600494
Validation loss = 0.15022329986095428
Validation loss = 0.14794953167438507
Validation loss = 0.14817020297050476
Validation loss = 0.14948700368404388
Validation loss = 0.14449410140514374
Validation loss = 0.16169048845767975
Validation loss = 0.14836473762989044
Validation loss = 0.14682503044605255
Validation loss = 0.14173541963100433
Validation loss = 0.14795196056365967
Validation loss = 0.15330667793750763
Validation loss = 0.1432950645685196
Validation loss = 0.14484846591949463
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.17680899798870087
Validation loss = 0.16202734410762787
Validation loss = 0.14790908992290497
Validation loss = 0.14862428605556488
Validation loss = 0.14278285205364227
Validation loss = 0.14209885895252228
Validation loss = 0.14798901975154877
Validation loss = 0.15003852546215057
Validation loss = 0.14025349915027618
Validation loss = 0.1432776302099228
Validation loss = 0.14388339221477509
Validation loss = 0.13970862329006195
Validation loss = 0.14287637174129486
Validation loss = 0.1470097154378891
Validation loss = 0.14036254584789276
Validation loss = 0.1407063752412796
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16970160603523254
Validation loss = 0.1631205976009369
Validation loss = 0.17003798484802246
Validation loss = 0.14977921545505524
Validation loss = 0.15520046651363373
Validation loss = 0.15586645901203156
Validation loss = 0.15072737634181976
Validation loss = 0.1537623107433319
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1708822101354599
Validation loss = 0.1628105342388153
Validation loss = 0.15321029722690582
Validation loss = 0.15158827602863312
Validation loss = 0.14379312098026276
Validation loss = 0.1429622322320938
Validation loss = 0.1415282040834427
Validation loss = 0.14931592345237732
Validation loss = 0.14325831830501556
Validation loss = 0.14583837985992432
Validation loss = 0.14867238700389862
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 392
average number of affinization = 403.69767441860466
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 404.27272727272725
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 405.44444444444446
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 389
average number of affinization = 405.0869565217391
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 345
average number of affinization = 403.8085106382979
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 403.4583333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 341      |
| Iteration     | 6        |
| MaximumReturn | 1.33e+03 |
| MinimumReturn | -1.4e+03 |
| TotalSamples  | 32000    |
----------------------------
itr #7 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.19433808326721191
Validation loss = 0.16219466924667358
Validation loss = 0.15201309323310852
Validation loss = 0.15149670839309692
Validation loss = 0.1549758017063141
Validation loss = 0.15302976965904236
Validation loss = 0.14837834239006042
Validation loss = 0.14962881803512573
Validation loss = 0.14797595143318176
Validation loss = 0.14893436431884766
Validation loss = 0.15141385793685913
Validation loss = 0.17042276263237
Validation loss = 0.14650481939315796
Validation loss = 0.1434975415468216
Validation loss = 0.14578208327293396
Validation loss = 0.14588026702404022
Validation loss = 0.14952093362808228
Validation loss = 0.14524897933006287
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.1933155655860901
Validation loss = 0.156812846660614
Validation loss = 0.15264034271240234
Validation loss = 0.15520891547203064
Validation loss = 0.15426859259605408
Validation loss = 0.14874008297920227
Validation loss = 0.14652341604232788
Validation loss = 0.14934289455413818
Validation loss = 0.14852139353752136
Validation loss = 0.15474818646907806
Validation loss = 0.1574292778968811
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.18904662132263184
Validation loss = 0.15297630429267883
Validation loss = 0.14538562297821045
Validation loss = 0.1422005295753479
Validation loss = 0.14531312882900238
Validation loss = 0.1466553807258606
Validation loss = 0.14725928008556366
Validation loss = 0.14126288890838623
Validation loss = 0.15778334438800812
Validation loss = 0.141301691532135
Validation loss = 0.1402217149734497
Validation loss = 0.14300212264060974
Validation loss = 0.14366663992404938
Validation loss = 0.1420043706893921
Validation loss = 0.13990435004234314
Validation loss = 0.1401650607585907
Validation loss = 0.15705202519893646
Validation loss = 0.13705573976039886
Validation loss = 0.13448604941368103
Validation loss = 0.13853701949119568
Validation loss = 0.14189384877681732
Validation loss = 0.14353764057159424
Validation loss = 0.13708965480327606
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.21898910403251648
Validation loss = 0.1644444763660431
Validation loss = 0.1592516154050827
Validation loss = 0.161314457654953
Validation loss = 0.1574440896511078
Validation loss = 0.16339047253131866
Validation loss = 0.1517341285943985
Validation loss = 0.15535269677639008
Validation loss = 0.16175729036331177
Validation loss = 0.15429721772670746
Validation loss = 0.1527000069618225
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1700325459241867
Validation loss = 0.15375018119812012
Validation loss = 0.15941877663135529
Validation loss = 0.15101388096809387
Validation loss = 0.15575551986694336
Validation loss = 0.15263298153877258
Validation loss = 0.15220841765403748
Validation loss = 0.1484604924917221
Validation loss = 0.1478862166404724
Validation loss = 0.16238583624362946
Validation loss = 0.1475922167301178
Validation loss = 0.1487760841846466
Validation loss = 0.1476782262325287
Validation loss = 0.1469191312789917
Validation loss = 0.1502305269241333
Validation loss = 0.15132299065589905
Validation loss = 0.1506330966949463
Validation loss = 0.14768652617931366
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 435
average number of affinization = 404.1020408163265
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 370
average number of affinization = 403.42
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 209
average number of affinization = 399.6078431372549
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 429
average number of affinization = 400.1730769230769
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 438
average number of affinization = 400.8867924528302
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 401.9074074074074
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -49.8    |
| Iteration     | 7        |
| MaximumReturn | 511      |
| MinimumReturn | -794     |
| TotalSamples  | 36000    |
----------------------------
itr #8 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.16216275095939636
Validation loss = 0.14549708366394043
Validation loss = 0.1313072293996811
Validation loss = 0.13112656772136688
Validation loss = 0.1342851221561432
Validation loss = 0.12717564404010773
Validation loss = 0.13096843659877777
Validation loss = 0.13590273261070251
Validation loss = 0.13083840906620026
Validation loss = 0.12930943071842194
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.15813496708869934
Validation loss = 0.14464138448238373
Validation loss = 0.13731642067432404
Validation loss = 0.13224677741527557
Validation loss = 0.13504400849342346
Validation loss = 0.1378929764032364
Validation loss = 0.13875627517700195
Validation loss = 0.1336536854505539
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.15769334137439728
Validation loss = 0.1353355050086975
Validation loss = 0.13506433367729187
Validation loss = 0.12470360845327377
Validation loss = 0.12498047947883606
Validation loss = 0.12431029230356216
Validation loss = 0.12526215612888336
Validation loss = 0.12331827729940414
Validation loss = 0.12243804335594177
Validation loss = 0.12551388144493103
Validation loss = 0.12940949201583862
Validation loss = 0.12633512914180756
Validation loss = 0.12394823133945465
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.16690677404403687
Validation loss = 0.15147212147712708
Validation loss = 0.141311377286911
Validation loss = 0.14625683426856995
Validation loss = 0.13791580498218536
Validation loss = 0.13965044915676117
Validation loss = 0.1356472223997116
Validation loss = 0.14376622438430786
Validation loss = 0.1364033818244934
Validation loss = 0.13216273486614227
Validation loss = 0.13658171892166138
Validation loss = 0.1378944218158722
Validation loss = 0.1429436057806015
Validation loss = 0.13042260706424713
Validation loss = 0.13252906501293182
Validation loss = 0.13341662287712097
Validation loss = 0.13534051179885864
Validation loss = 0.1433882862329483
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.156020388007164
Validation loss = 0.14068472385406494
Validation loss = 0.13417893648147583
Validation loss = 0.13577806949615479
Validation loss = 0.13285936415195465
Validation loss = 0.13190896809101105
Validation loss = 0.13371902704238892
Validation loss = 0.1290983408689499
Validation loss = 0.13603496551513672
Validation loss = 0.13032931089401245
Validation loss = 0.1369074583053589
Validation loss = 0.12931698560714722
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 439
average number of affinization = 402.58181818181816
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 518
average number of affinization = 404.64285714285717
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 405.57894736842104
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 477
average number of affinization = 406.8103448275862
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 438
average number of affinization = 407.33898305084745
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 408.21666666666664
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 624      |
| Iteration     | 8        |
| MaximumReturn | 1.1e+03  |
| MinimumReturn | -63.2    |
| TotalSamples  | 40000    |
----------------------------
itr #9 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.13423961400985718
Validation loss = 0.12234090268611908
Validation loss = 0.11397681385278702
Validation loss = 0.11279512941837311
Validation loss = 0.11916597187519073
Validation loss = 0.12117842584848404
Validation loss = 0.1123843640089035
Validation loss = 0.11351299285888672
Validation loss = 0.12198184430599213
Validation loss = 0.11469423770904541
Validation loss = 0.11061348766088486
Validation loss = 0.10849732160568237
Validation loss = 0.11330044269561768
Validation loss = 0.11481010913848877
Validation loss = 0.11675243079662323
Validation loss = 0.11084051430225372
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.13785240054130554
Validation loss = 0.127096027135849
Validation loss = 0.11210568994283676
Validation loss = 0.11341514438390732
Validation loss = 0.11787132918834686
Validation loss = 0.1148814707994461
Validation loss = 0.11281068623065948
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.13589654862880707
Validation loss = 0.1133367046713829
Validation loss = 0.10756087303161621
Validation loss = 0.10841269791126251
Validation loss = 0.10504599660634995
Validation loss = 0.11434338241815567
Validation loss = 0.11167804151773453
Validation loss = 0.11117507517337799
Validation loss = 0.10960616171360016
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.14094862341880798
Validation loss = 0.11872647702693939
Validation loss = 0.11454063653945923
Validation loss = 0.11321743577718735
Validation loss = 0.11841411888599396
Validation loss = 0.11422456800937653
Validation loss = 0.11375759541988373
Validation loss = 0.11604656279087067
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1315171867609024
Validation loss = 0.12387388944625854
Validation loss = 0.11769001185894012
Validation loss = 0.1145198792219162
Validation loss = 0.11493216454982758
Validation loss = 0.12471101433038712
Validation loss = 0.1125054582953453
Validation loss = 0.11393920332193375
Validation loss = 0.11523616313934326
Validation loss = 0.11259672790765762
Validation loss = 0.11522195488214493
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 453
average number of affinization = 408.95081967213116
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 481
average number of affinization = 410.11290322580646
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 455
average number of affinization = 410.8253968253968
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 459
average number of affinization = 411.578125
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 423
average number of affinization = 411.75384615384615
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 401
average number of affinization = 411.59090909090907
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.27e+03 |
| Iteration     | 9        |
| MaximumReturn | 1.68e+03 |
| MinimumReturn | 749      |
| TotalSamples  | 44000    |
----------------------------
itr #10 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.12219874560832977
Validation loss = 0.11270371079444885
Validation loss = 0.10498166084289551
Validation loss = 0.1050393134355545
Validation loss = 0.1098216250538826
Validation loss = 0.1042640209197998
Validation loss = 0.10912098735570908
Validation loss = 0.10538303107023239
Validation loss = 0.11210968345403671
Validation loss = 0.10467595607042313
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.12151611596345901
Validation loss = 0.11335662007331848
Validation loss = 0.1073705181479454
Validation loss = 0.10815854370594025
Validation loss = 0.10820572823286057
Validation loss = 0.11814706027507782
Validation loss = 0.11147589981555939
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1298270970582962
Validation loss = 0.11201231926679611
Validation loss = 0.10564608871936798
Validation loss = 0.10169553011655807
Validation loss = 0.10651647299528122
Validation loss = 0.10327167809009552
Validation loss = 0.09949696809053421
Validation loss = 0.10550069808959961
Validation loss = 0.1037074625492096
Validation loss = 0.1031736433506012
Validation loss = 0.10211307555437088
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.130970299243927
Validation loss = 0.12032299488782883
Validation loss = 0.11000330746173859
Validation loss = 0.11059369891881943
Validation loss = 0.11054453998804092
Validation loss = 0.10739637166261673
Validation loss = 0.1129530742764473
Validation loss = 0.10809750109910965
Validation loss = 0.10833690315485
Validation loss = 0.10239822417497635
Validation loss = 0.10973960906267166
Validation loss = 0.11447148025035858
Validation loss = 0.11399567872285843
Validation loss = 0.10233394801616669
Validation loss = 0.10192279517650604
Validation loss = 0.1080172136425972
Validation loss = 0.11573278158903122
Validation loss = 0.10856065899133682
Validation loss = 0.1027669683098793
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.13593024015426636
Validation loss = 0.11468815803527832
Validation loss = 0.11100966483354568
Validation loss = 0.10633161664009094
Validation loss = 0.10431944578886032
Validation loss = 0.11083287745714188
Validation loss = 0.11170515418052673
Validation loss = 0.10922444611787796
Validation loss = 0.10581447929143906
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 475
average number of affinization = 412.53731343283584
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 413.80882352941177
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 343
average number of affinization = 412.7826086956522
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 494
average number of affinization = 413.9428571428571
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 208
average number of affinization = 411.0422535211268
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 411.6666666666667
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 706       |
| Iteration     | 10        |
| MaximumReturn | 1.98e+03  |
| MinimumReturn | -2.05e+03 |
| TotalSamples  | 48000     |
-----------------------------
itr #11 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10866882652044296
Validation loss = 0.1147795096039772
Validation loss = 0.09760674089193344
Validation loss = 0.10285087674856186
Validation loss = 0.1019558236002922
Validation loss = 0.1008681133389473
Validation loss = 0.10444536060094833
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.11133876442909241
Validation loss = 0.10841219872236252
Validation loss = 0.10250884294509888
Validation loss = 0.1019168272614479
Validation loss = 0.10847679525613785
Validation loss = 0.10230088979005814
Validation loss = 0.10144153237342834
Validation loss = 0.09940362721681595
Validation loss = 0.10086610913276672
Validation loss = 0.09968747943639755
Validation loss = 0.10101250559091568
Validation loss = 0.107138492166996
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11392869800329208
Validation loss = 0.10976044088602066
Validation loss = 0.09561906009912491
Validation loss = 0.09343335777521133
Validation loss = 0.0962320938706398
Validation loss = 0.10200051218271255
Validation loss = 0.10066740959882736
Validation loss = 0.09982790797948837
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.12458834797143936
Validation loss = 0.10067471116781235
Validation loss = 0.09691011160612106
Validation loss = 0.09918884187936783
Validation loss = 0.0998404249548912
Validation loss = 0.10061195492744446
Validation loss = 0.0964607372879982
Validation loss = 0.10098441690206528
Validation loss = 0.09641293436288834
Validation loss = 0.09379560500383377
Validation loss = 0.09675858169794083
Validation loss = 0.0968996062874794
Validation loss = 0.10342586785554886
Validation loss = 0.09490678459405899
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.1108255460858345
Validation loss = 0.11010711640119553
Validation loss = 0.09907945245504379
Validation loss = 0.09808600693941116
Validation loss = 0.10126110166311264
Validation loss = 0.10586259514093399
Validation loss = 0.10248029232025146
Validation loss = 0.0980055034160614
Validation loss = 0.10040450841188431
Validation loss = 0.10720483213663101
Validation loss = 0.10108152031898499
Validation loss = 0.10006565600633621
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 414
average number of affinization = 411.6986301369863
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 511
average number of affinization = 413.0405405405405
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 522
average number of affinization = 414.49333333333334
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 511
average number of affinization = 415.7631578947368
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 416.4935064935065
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 497
average number of affinization = 417.52564102564105
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 625      |
| Iteration     | 11       |
| MaximumReturn | 1.55e+03 |
| MinimumReturn | -1.8e+03 |
| TotalSamples  | 52000    |
----------------------------
itr #12 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10496318340301514
Validation loss = 0.09461359679698944
Validation loss = 0.09633905440568924
Validation loss = 0.09262192994356155
Validation loss = 0.09208913892507553
Validation loss = 0.10395673662424088
Validation loss = 0.09491106122732162
Validation loss = 0.09204187989234924
Validation loss = 0.09051001071929932
Validation loss = 0.09321468323469162
Validation loss = 0.09194093197584152
Validation loss = 0.09458927065134048
Validation loss = 0.09507698565721512
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10216356068849564
Validation loss = 0.09989017993211746
Validation loss = 0.09428860247135162
Validation loss = 0.08871399611234665
Validation loss = 0.0910273939371109
Validation loss = 0.0930308923125267
Validation loss = 0.10588710010051727
Validation loss = 0.08981014788150787
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.1058301255106926
Validation loss = 0.0908159464597702
Validation loss = 0.08821378648281097
Validation loss = 0.09229262173175812
Validation loss = 0.08915101736783981
Validation loss = 0.09006333351135254
Validation loss = 0.08862554281949997
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09990930557250977
Validation loss = 0.08979448676109314
Validation loss = 0.08979406952857971
Validation loss = 0.08773700892925262
Validation loss = 0.08882689476013184
Validation loss = 0.09141939133405685
Validation loss = 0.09620454907417297
Validation loss = 0.08914721757173538
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10910887271165848
Validation loss = 0.09337638318538666
Validation loss = 0.09148787707090378
Validation loss = 0.09173209220170975
Validation loss = 0.09304165840148926
Validation loss = 0.09680989384651184
Validation loss = 0.09566293656826019
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 487
average number of affinization = 418.40506329113924
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 489
average number of affinization = 419.2875
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 461
average number of affinization = 419.8024691358025
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 515
average number of affinization = 420.9634146341463
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 531
average number of affinization = 422.28915662650604
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 479
average number of affinization = 422.9642857142857
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.28e+03 |
| Iteration     | 12       |
| MaximumReturn | 1.97e+03 |
| MinimumReturn | 487      |
| TotalSamples  | 56000    |
----------------------------
itr #13 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.10568945854902267
Validation loss = 0.08904898911714554
Validation loss = 0.08352074772119522
Validation loss = 0.08391518145799637
Validation loss = 0.09062395244836807
Validation loss = 0.0827680379152298
Validation loss = 0.08724571019411087
Validation loss = 0.08852167427539825
Validation loss = 0.0820852667093277
Validation loss = 0.08138815313577652
Validation loss = 0.08526594936847687
Validation loss = 0.08685542643070221
Validation loss = 0.09380413591861725
Validation loss = 0.08062789589166641
Validation loss = 0.07941602170467377
Validation loss = 0.0829283818602562
Validation loss = 0.08126094192266464
Validation loss = 0.08369646966457367
Validation loss = 0.07911361008882523
Validation loss = 0.07879326492547989
Validation loss = 0.08176863938570023
Validation loss = 0.08541072905063629
Validation loss = 0.08275168389081955
Validation loss = 0.07831265777349472
Validation loss = 0.07735081762075424
Validation loss = 0.07861433923244476
Validation loss = 0.08520036935806274
Validation loss = 0.07929660379886627
Validation loss = 0.07769519835710526
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.09384481608867645
Validation loss = 0.0880526527762413
Validation loss = 0.08559004217386246
Validation loss = 0.08717279881238937
Validation loss = 0.08589868247509003
Validation loss = 0.0814630389213562
Validation loss = 0.08404185622930527
Validation loss = 0.08169867098331451
Validation loss = 0.08322194963693619
Validation loss = 0.1028757095336914
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.11379978805780411
Validation loss = 0.08646989613771439
Validation loss = 0.08047634363174438
Validation loss = 0.08108155429363251
Validation loss = 0.0862719863653183
Validation loss = 0.08669184148311615
Validation loss = 0.08743571490049362
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.09969736635684967
Validation loss = 0.08247029781341553
Validation loss = 0.08404053747653961
Validation loss = 0.08649677038192749
Validation loss = 0.08140029013156891
Validation loss = 0.07947595417499542
Validation loss = 0.09087659418582916
Validation loss = 0.08766740560531616
Validation loss = 0.08640362322330475
Validation loss = 0.08022908121347427
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.10766924172639847
Validation loss = 0.08878027647733688
Validation loss = 0.08289986103773117
Validation loss = 0.08211826533079147
Validation loss = 0.08516256511211395
Validation loss = 0.08396857231855392
Validation loss = 0.08652501553297043
Validation loss = 0.08937250077724457
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 458
average number of affinization = 423.3764705882353
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 423.8953488372093
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 464
average number of affinization = 424.35632183908046
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 464
average number of affinization = 424.8068181818182
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 350
average number of affinization = 423.96629213483146
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 455
average number of affinization = 424.31111111111113
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 288       |
| Iteration     | 13        |
| MaximumReturn | 1.4e+03   |
| MinimumReturn | -1.47e+03 |
| TotalSamples  | 60000     |
-----------------------------
itr #14 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.09976457059383392
Validation loss = 0.07758724689483643
Validation loss = 0.07611676305532455
Validation loss = 0.0754670724272728
Validation loss = 0.08303862065076828
Validation loss = 0.07918445765972137
Validation loss = 0.07362402975559235
Validation loss = 0.07327616959810257
Validation loss = 0.07380089908838272
Validation loss = 0.07969406247138977
Validation loss = 0.07369088381528854
Validation loss = 0.07229619473218918
Validation loss = 0.07159512490034103
Validation loss = 0.08719789236783981
Validation loss = 0.07640465348958969
Validation loss = 0.07139746844768524
Validation loss = 0.07302306592464447
Validation loss = 0.08043202012777328
Validation loss = 0.07953215390443802
Validation loss = 0.07585038989782333
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.10112281143665314
Validation loss = 0.08555785566568375
Validation loss = 0.07788302004337311
Validation loss = 0.0756329670548439
Validation loss = 0.07667037099599838
Validation loss = 0.08675187081098557
Validation loss = 0.08380208164453506
Validation loss = 0.07523609697818756
Validation loss = 0.07779259979724884
Validation loss = 0.08040013164281845
Validation loss = 0.07905612885951996
Validation loss = 0.07602600008249283
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.09231612831354141
Validation loss = 0.07958973944187164
Validation loss = 0.07752919942140579
Validation loss = 0.07822024077177048
Validation loss = 0.08020796626806259
Validation loss = 0.08470141142606735
Validation loss = 0.0823788121342659
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08943971246480942
Validation loss = 0.08193695545196533
Validation loss = 0.07864455133676529
Validation loss = 0.07705013453960419
Validation loss = 0.09161591529846191
Validation loss = 0.0806708037853241
Validation loss = 0.07444198429584503
Validation loss = 0.0744493380188942
Validation loss = 0.08075602352619171
Validation loss = 0.0774000734090805
Validation loss = 0.07472958415746689
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08805601298809052
Validation loss = 0.0877474695444107
Validation loss = 0.08283007144927979
Validation loss = 0.07857879996299744
Validation loss = 0.0784943476319313
Validation loss = 0.0817045196890831
Validation loss = 0.08350825309753418
Validation loss = 0.08469229191541672
Validation loss = 0.07825068384408951
Validation loss = 0.0784454494714737
Validation loss = 0.09170150756835938
Validation loss = 0.07807537913322449
Validation loss = 0.07722010463476181
Validation loss = 0.0771336778998375
Validation loss = 0.0852297693490982
Validation loss = 0.0868476927280426
Validation loss = 0.07497712224721909
Validation loss = 0.07685866206884384
Validation loss = 0.07547321915626526
Validation loss = 0.08209211379289627
Validation loss = 0.07894337922334671
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 424.8241758241758
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 497
average number of affinization = 425.60869565217394
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 440
average number of affinization = 425.76344086021504
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 346
average number of affinization = 424.9148936170213
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 503
average number of affinization = 425.7368421052632
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 430
average number of affinization = 425.78125
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 486       |
| Iteration     | 14        |
| MaximumReturn | 1.84e+03  |
| MinimumReturn | -1.22e+03 |
| TotalSamples  | 64000     |
-----------------------------
itr #15 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0802992433309555
Validation loss = 0.07139816880226135
Validation loss = 0.07206130772829056
Validation loss = 0.07069224119186401
Validation loss = 0.07179723680019379
Validation loss = 0.07588721066713333
Validation loss = 0.07404281198978424
Validation loss = 0.06915232539176941
Validation loss = 0.06982205808162689
Validation loss = 0.07978007197380066
Validation loss = 0.07003597915172577
Validation loss = 0.06804518401622772
Validation loss = 0.07665298879146576
Validation loss = 0.06922611594200134
Validation loss = 0.06991948187351227
Validation loss = 0.07924985885620117
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0861692875623703
Validation loss = 0.080607570707798
Validation loss = 0.07485237717628479
Validation loss = 0.07509766519069672
Validation loss = 0.07534092664718628
Validation loss = 0.07711441069841385
Validation loss = 0.07730145752429962
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08123789727687836
Validation loss = 0.08384408801794052
Validation loss = 0.07334502041339874
Validation loss = 0.07730510085821152
Validation loss = 0.07565286010503769
Validation loss = 0.07519284635782242
Validation loss = 0.07816614210605621
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08461029827594757
Validation loss = 0.07356175780296326
Validation loss = 0.07682627439498901
Validation loss = 0.0824761837720871
Validation loss = 0.0756182074546814
Validation loss = 0.07231241464614868
Validation loss = 0.08234308660030365
Validation loss = 0.07266765832901001
Validation loss = 0.07023799419403076
Validation loss = 0.07406393438577652
Validation loss = 0.08217991888523102
Validation loss = 0.0776873379945755
Validation loss = 0.07209593802690506
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08415050804615021
Validation loss = 0.07679356634616852
Validation loss = 0.07602084428071976
Validation loss = 0.07530053704977036
Validation loss = 0.08101063966751099
Validation loss = 0.07764065265655518
Validation loss = 0.07691627740859985
Validation loss = 0.07482615113258362
Validation loss = 0.08354569971561432
Validation loss = 0.07629633694887161
Validation loss = 0.0730738490819931
Validation loss = 0.07286728173494339
Validation loss = 0.08189323544502258
Validation loss = 0.07424309849739075
Validation loss = 0.07185350358486176
Validation loss = 0.07429282367229462
Validation loss = 0.07436344027519226
Validation loss = 0.07225318253040314
Validation loss = 0.07415448874235153
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 341
average number of affinization = 424.9072164948454
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 425.2244897959184
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 490
average number of affinization = 425.8787878787879
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 471
average number of affinization = 426.33
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 380
average number of affinization = 425.8712871287129
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 387
average number of affinization = 425.4901960784314
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 227      |
| Iteration     | 15       |
| MaximumReturn | 1.73e+03 |
| MinimumReturn | -867     |
| TotalSamples  | 68000    |
----------------------------
itr #16 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.091243214905262
Validation loss = 0.0682338997721672
Validation loss = 0.06624720245599747
Validation loss = 0.0675862655043602
Validation loss = 0.0693499967455864
Validation loss = 0.06615513563156128
Validation loss = 0.06441889703273773
Validation loss = 0.07655423879623413
Validation loss = 0.06933039426803589
Validation loss = 0.06397439539432526
Validation loss = 0.06576389074325562
Validation loss = 0.06691059470176697
Validation loss = 0.06350495666265488
Validation loss = 0.06353820115327835
Validation loss = 0.06607996672391891
Validation loss = 0.06923456490039825
Validation loss = 0.06335803866386414
Validation loss = 0.06348348408937454
Validation loss = 0.07358507812023163
Validation loss = 0.0605306550860405
Validation loss = 0.06154096871614456
Validation loss = 0.06812587380409241
Validation loss = 0.06400590389966965
Validation loss = 0.06221518665552139
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.08583173155784607
Validation loss = 0.07448381185531616
Validation loss = 0.07386097311973572
Validation loss = 0.06889694929122925
Validation loss = 0.07004574686288834
Validation loss = 0.08117201179265976
Validation loss = 0.06712810695171356
Validation loss = 0.06854233145713806
Validation loss = 0.07149919122457504
Validation loss = 0.07414617389440536
Validation loss = 0.06555052101612091
Validation loss = 0.06741868704557419
Validation loss = 0.07066074013710022
Validation loss = 0.06802897155284882
Validation loss = 0.06629815697669983
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08037933707237244
Validation loss = 0.07751265168190002
Validation loss = 0.07221800088882446
Validation loss = 0.07041700929403305
Validation loss = 0.07190309464931488
Validation loss = 0.07563012093305588
Validation loss = 0.0694560632109642
Validation loss = 0.07592311501502991
Validation loss = 0.06954532861709595
Validation loss = 0.0680399164557457
Validation loss = 0.06909161806106567
Validation loss = 0.08063966035842896
Validation loss = 0.06786550581455231
Validation loss = 0.06653139740228653
Validation loss = 0.06725525856018066
Validation loss = 0.07447490841150284
Validation loss = 0.0674428790807724
Validation loss = 0.06688307970762253
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.08678509294986725
Validation loss = 0.06916424632072449
Validation loss = 0.07167371362447739
Validation loss = 0.07404018938541412
Validation loss = 0.06907641887664795
Validation loss = 0.06748747825622559
Validation loss = 0.07492507249116898
Validation loss = 0.07261303067207336
Validation loss = 0.06742311269044876
Validation loss = 0.06663203239440918
Validation loss = 0.0749431774020195
Validation loss = 0.06725489348173141
Validation loss = 0.06735808402299881
Validation loss = 0.06576770544052124
Validation loss = 0.06521943211555481
Validation loss = 0.06921668350696564
Validation loss = 0.06814268231391907
Validation loss = 0.06361249089241028
Validation loss = 0.06457541882991791
Validation loss = 0.07384232431650162
Validation loss = 0.06542651355266571
Validation loss = 0.06531432271003723
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.08377119898796082
Validation loss = 0.07181066274642944
Validation loss = 0.0697549358010292
Validation loss = 0.07707374542951584
Validation loss = 0.06876838952302933
Validation loss = 0.0691106915473938
Validation loss = 0.07753169536590576
Validation loss = 0.07281594723463058
Validation loss = 0.06782247871160507
Validation loss = 0.07268396764993668
Validation loss = 0.07612478733062744
Validation loss = 0.06798937171697617
Validation loss = 0.06636154651641846
Validation loss = 0.06783813238143921
Validation loss = 0.06922581046819687
Validation loss = 0.07276708632707596
Validation loss = 0.06509190797805786
Validation loss = 0.06645652651786804
Validation loss = 0.07184717059135437
Validation loss = 0.06794688105583191
Validation loss = 0.06482985615730286
Validation loss = 0.06610936671495438
Validation loss = 0.07394222170114517
Validation loss = 0.06536252796649933
Validation loss = 0.06632779538631439
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 489
average number of affinization = 426.1067961165049
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 238
average number of affinization = 424.2980769230769
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 503
average number of affinization = 425.04761904761904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 524
average number of affinization = 425.9811320754717
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 512
average number of affinization = 426.78504672897196
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 545
average number of affinization = 427.8796296296296
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 270       |
| Iteration     | 16        |
| MaximumReturn | 1.6e+03   |
| MinimumReturn | -1.51e+03 |
| TotalSamples  | 72000     |
-----------------------------
itr #17 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.07801312953233719
Validation loss = 0.06105794012546539
Validation loss = 0.05832977592945099
Validation loss = 0.05874162167310715
Validation loss = 0.06459718197584152
Validation loss = 0.057229578495025635
Validation loss = 0.058895986527204514
Validation loss = 0.06285521388053894
Validation loss = 0.06130726635456085
Validation loss = 0.06224144995212555
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07855843752622604
Validation loss = 0.066059410572052
Validation loss = 0.061618342995643616
Validation loss = 0.06281542778015137
Validation loss = 0.06397596746683121
Validation loss = 0.0627347081899643
Validation loss = 0.0668700560927391
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0744158923625946
Validation loss = 0.07227465510368347
Validation loss = 0.06195475161075592
Validation loss = 0.06340311467647552
Validation loss = 0.06205504387617111
Validation loss = 0.06501790881156921
Validation loss = 0.06380952149629593
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07597223669290543
Validation loss = 0.06342320889234543
Validation loss = 0.06182195246219635
Validation loss = 0.06037361919879913
Validation loss = 0.06421484053134918
Validation loss = 0.06186135485768318
Validation loss = 0.06112215295433998
Validation loss = 0.06653157621622086
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07633320987224579
Validation loss = 0.06535820662975311
Validation loss = 0.06062229350209236
Validation loss = 0.06130504608154297
Validation loss = 0.06204995885491371
Validation loss = 0.06345689296722412
Validation loss = 0.062166452407836914
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 483
average number of affinization = 428.38532110091745
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 448
average number of affinization = 428.56363636363636
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 512
average number of affinization = 429.31531531531533
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 429.5625
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 500
average number of affinization = 430.1858407079646
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 474
average number of affinization = 430.5701754385965
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 218      |
| Iteration     | 17       |
| MaximumReturn | 1.2e+03  |
| MinimumReturn | -576     |
| TotalSamples  | 76000    |
----------------------------
itr #18 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06557360291481018
Validation loss = 0.05739046260714531
Validation loss = 0.05615709349513054
Validation loss = 0.06472615152597427
Validation loss = 0.05797409266233444
Validation loss = 0.0569092221558094
Validation loss = 0.05998218059539795
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07948121428489685
Validation loss = 0.06352643668651581
Validation loss = 0.05881123244762421
Validation loss = 0.06278280168771744
Validation loss = 0.05854109674692154
Validation loss = 0.06542298197746277
Validation loss = 0.057697609066963196
Validation loss = 0.06123943254351616
Validation loss = 0.06869062781333923
Validation loss = 0.06361123919487
Validation loss = 0.057679325342178345
Validation loss = 0.057278960943222046
Validation loss = 0.0713464766740799
Validation loss = 0.0608709417283535
Validation loss = 0.058016348630189896
Validation loss = 0.06518453359603882
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.08275679498910904
Validation loss = 0.06038026884198189
Validation loss = 0.05962171033024788
Validation loss = 0.06386777013540268
Validation loss = 0.060757528990507126
Validation loss = 0.059928204864263535
Validation loss = 0.061508335173130035
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06729092448949814
Validation loss = 0.05971302464604378
Validation loss = 0.05900995805859566
Validation loss = 0.05900613218545914
Validation loss = 0.06935468316078186
Validation loss = 0.059917595237493515
Validation loss = 0.05890749767422676
Validation loss = 0.07078198343515396
Validation loss = 0.05692712962627411
Validation loss = 0.06707585602998734
Validation loss = 0.057534392923116684
Validation loss = 0.05810781568288803
Validation loss = 0.06032809615135193
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07063852995634079
Validation loss = 0.0630316212773323
Validation loss = 0.05934114754199982
Validation loss = 0.06059221923351288
Validation loss = 0.06806093454360962
Validation loss = 0.05978427454829216
Validation loss = 0.06002990901470184
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 544
average number of affinization = 431.55652173913046
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 504
average number of affinization = 432.1810344827586
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 473
average number of affinization = 432.52991452991455
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 445
average number of affinization = 432.635593220339
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 536
average number of affinization = 433.5042016806723
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 345
average number of affinization = 432.76666666666665
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 529      |
| Iteration     | 18       |
| MaximumReturn | 1.24e+03 |
| MinimumReturn | -1e+03   |
| TotalSamples  | 80000    |
----------------------------
itr #19 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06723593175411224
Validation loss = 0.05578475445508957
Validation loss = 0.05448395758867264
Validation loss = 0.05480678752064705
Validation loss = 0.055883415043354034
Validation loss = 0.06046251207590103
Validation loss = 0.05836153030395508
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06683419644832611
Validation loss = 0.06213493272662163
Validation loss = 0.0563160702586174
Validation loss = 0.05905694514513016
Validation loss = 0.05790983512997627
Validation loss = 0.06416868418455124
Validation loss = 0.05683284252882004
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07060501724481583
Validation loss = 0.06213676184415817
Validation loss = 0.058672647923231125
Validation loss = 0.06058398634195328
Validation loss = 0.06567349284887314
Validation loss = 0.058263398706912994
Validation loss = 0.05747649818658829
Validation loss = 0.07257561385631561
Validation loss = 0.06239994242787361
Validation loss = 0.057573020458221436
Validation loss = 0.05718519538640976
Validation loss = 0.06975464522838593
Validation loss = 0.05684167146682739
Validation loss = 0.056449152529239655
Validation loss = 0.06466639786958694
Validation loss = 0.057901423424482346
Validation loss = 0.060224343091249466
Validation loss = 0.06184873729944229
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.07720644772052765
Validation loss = 0.058831680566072464
Validation loss = 0.057559896260499954
Validation loss = 0.056147098541259766
Validation loss = 0.06471588462591171
Validation loss = 0.05948753282427788
Validation loss = 0.05564242601394653
Validation loss = 0.056886959820985794
Validation loss = 0.06704890727996826
Validation loss = 0.056245554238557816
Validation loss = 0.05464586615562439
Validation loss = 0.05679430812597275
Validation loss = 0.059396110475063324
Validation loss = 0.05490241199731827
Validation loss = 0.054841019213199615
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07006774097681046
Validation loss = 0.059429895132780075
Validation loss = 0.057976704090833664
Validation loss = 0.06210742145776749
Validation loss = 0.06321480125188828
Validation loss = 0.0584830641746521
Validation loss = 0.057242702692747116
Validation loss = 0.05928472429513931
Validation loss = 0.06166265159845352
Validation loss = 0.05925903469324112
Validation loss = 0.05644286796450615
Validation loss = 0.056665144860744476
Validation loss = 0.06776527315378189
Validation loss = 0.05879486724734306
Validation loss = 0.0553656741976738
Validation loss = 0.05758174508810043
Validation loss = 0.06397648900747299
Validation loss = 0.058377187699079514
Validation loss = 0.05618009716272354
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 327
average number of affinization = 431.89256198347107
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 478
average number of affinization = 432.2704918032787
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 397
average number of affinization = 431.9837398373984
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 523
average number of affinization = 432.71774193548384
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 489
average number of affinization = 433.168
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 526
average number of affinization = 433.9047619047619
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | -51      |
| Iteration     | 19       |
| MaximumReturn | 971      |
| MinimumReturn | -1.9e+03 |
| TotalSamples  | 84000    |
----------------------------
itr #20 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.06947533041238785
Validation loss = 0.054169911891222
Validation loss = 0.05450564995408058
Validation loss = 0.05667339637875557
Validation loss = 0.053384747356176376
Validation loss = 0.05205187201499939
Validation loss = 0.05761871859431267
Validation loss = 0.057369641959667206
Validation loss = 0.05603694170713425
Validation loss = 0.051896970719099045
Validation loss = 0.06344413757324219
Validation loss = 0.05373331904411316
Validation loss = 0.05147939547896385
Validation loss = 0.052857547998428345
Validation loss = 0.05337584763765335
Validation loss = 0.05445710942149162
Validation loss = 0.05189092829823494
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.07509352266788483
Validation loss = 0.057025227695703506
Validation loss = 0.05975625663995743
Validation loss = 0.056951187551021576
Validation loss = 0.06178634613752365
Validation loss = 0.0545157715678215
Validation loss = 0.05896452069282532
Validation loss = 0.06597806513309479
Validation loss = 0.054311804473400116
Validation loss = 0.05917475372552872
Validation loss = 0.05523582920432091
Validation loss = 0.05883314087986946
Validation loss = 0.0670168548822403
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.07375635951757431
Validation loss = 0.057457584887742996
Validation loss = 0.06087614223361015
Validation loss = 0.055518798530101776
Validation loss = 0.05687166005373001
Validation loss = 0.06233097240328789
Validation loss = 0.053523607552051544
Validation loss = 0.05492648109793663
Validation loss = 0.0646037682890892
Validation loss = 0.05601831153035164
Validation loss = 0.053229279816150665
Validation loss = 0.055714238435029984
Validation loss = 0.05897386744618416
Validation loss = 0.0541888065636158
Validation loss = 0.05372089520096779
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06856833398342133
Validation loss = 0.05944172665476799
Validation loss = 0.056011080741882324
Validation loss = 0.05698637664318085
Validation loss = 0.05318393185734749
Validation loss = 0.05793740600347519
Validation loss = 0.057957280427217484
Validation loss = 0.05323754623532295
Validation loss = 0.06015852466225624
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.07173963636159897
Validation loss = 0.057802557945251465
Validation loss = 0.05435854569077492
Validation loss = 0.05554854869842529
Validation loss = 0.05821074917912483
Validation loss = 0.05539894104003906
Validation loss = 0.05289164185523987
Validation loss = 0.05722822993993759
Validation loss = 0.062441762536764145
Validation loss = 0.0543801449239254
Validation loss = 0.05911031737923622
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 468
average number of affinization = 434.1732283464567
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 512
average number of affinization = 434.78125
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 473
average number of affinization = 435.07751937984494
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 516
average number of affinization = 435.7
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 305
average number of affinization = 434.7022900763359
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 456
average number of affinization = 434.8636363636364
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 966      |
| Iteration     | 20       |
| MaximumReturn | 2.24e+03 |
| MinimumReturn | -902     |
| TotalSamples  | 88000    |
----------------------------
itr #21 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.055637821555137634
Validation loss = 0.050518110394477844
Validation loss = 0.05506235361099243
Validation loss = 0.052935466170310974
Validation loss = 0.0491030253469944
Validation loss = 0.04981584474444389
Validation loss = 0.05464227497577667
Validation loss = 0.05048339441418648
Validation loss = 0.051162220537662506
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06038644537329674
Validation loss = 0.05283752828836441
Validation loss = 0.05356880649924278
Validation loss = 0.06426854431629181
Validation loss = 0.061873696744441986
Validation loss = 0.05261047184467316
Validation loss = 0.053916480392217636
Validation loss = 0.05209219083189964
Validation loss = 0.05106987804174423
Validation loss = 0.05824247747659683
Validation loss = 0.051028549671173096
Validation loss = 0.051982127130031586
Validation loss = 0.06800247728824615
Validation loss = 0.0513724647462368
Validation loss = 0.052653636783361435
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.06386779993772507
Validation loss = 0.05652333423495293
Validation loss = 0.05318215861916542
Validation loss = 0.05379671975970268
Validation loss = 0.05504828318953514
Validation loss = 0.05237603560090065
Validation loss = 0.053285740315914154
Validation loss = 0.05544950067996979
Validation loss = 0.05296200513839722
Validation loss = 0.06005559116601944
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06746422499418259
Validation loss = 0.052798882126808167
Validation loss = 0.052384618669748306
Validation loss = 0.05559346452355385
Validation loss = 0.0545269250869751
Validation loss = 0.05204194784164429
Validation loss = 0.053299807012081146
Validation loss = 0.06051827594637871
Validation loss = 0.052519384771585464
Validation loss = 0.05121194198727608
Validation loss = 0.07266443222761154
Validation loss = 0.05169568210840225
Validation loss = 0.05337435007095337
Validation loss = 0.05578665807843208
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.060526635497808456
Validation loss = 0.05176349729299545
Validation loss = 0.052751366049051285
Validation loss = 0.053388748317956924
Validation loss = 0.05373331159353256
Validation loss = 0.05436204373836517
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 531
average number of affinization = 435.5864661654135
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 520
average number of affinization = 436.2164179104478
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 476
average number of affinization = 436.5111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 353
average number of affinization = 435.8970588235294
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 494
average number of affinization = 436.3211678832117
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 466
average number of affinization = 436.536231884058
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.66e+03 |
| Iteration     | 21       |
| MaximumReturn | 2.47e+03 |
| MinimumReturn | -586     |
| TotalSamples  | 92000    |
----------------------------
itr #22 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05548626556992531
Validation loss = 0.05074870586395264
Validation loss = 0.05133191496133804
Validation loss = 0.051248352974653244
Validation loss = 0.048240263015031815
Validation loss = 0.06382406502962112
Validation loss = 0.04895107075572014
Validation loss = 0.048678070306777954
Validation loss = 0.052205972373485565
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.06011217460036278
Validation loss = 0.05410829931497574
Validation loss = 0.051376208662986755
Validation loss = 0.06795711815357208
Validation loss = 0.0504942424595356
Validation loss = 0.051318418234586716
Validation loss = 0.05363500118255615
Validation loss = 0.050507448613643646
Validation loss = 0.051972340792417526
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.0622645765542984
Validation loss = 0.05045860633254051
Validation loss = 0.05102543160319328
Validation loss = 0.05464315041899681
Validation loss = 0.05230932682752609
Validation loss = 0.05332231521606445
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06524480134248734
Validation loss = 0.05456048995256424
Validation loss = 0.050926681607961655
Validation loss = 0.051596082746982574
Validation loss = 0.054326947778463364
Validation loss = 0.051880210638046265
Validation loss = 0.05070946365594864
Validation loss = 0.050559770315885544
Validation loss = 0.05470678582787514
Validation loss = 0.05426133796572685
Validation loss = 0.05084507539868355
Validation loss = 0.05051479488611221
Validation loss = 0.05816810578107834
Validation loss = 0.050455477088689804
Validation loss = 0.0485326424241066
Validation loss = 0.058738548308610916
Validation loss = 0.05345859378576279
Validation loss = 0.048061635345220566
Validation loss = 0.05372188985347748
Validation loss = 0.057425688952207565
Validation loss = 0.05187342315912247
Validation loss = 0.04943501576781273
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05950126424431801
Validation loss = 0.05214143916964531
Validation loss = 0.05132640525698662
Validation loss = 0.06212868541479111
Validation loss = 0.05330913886427879
Validation loss = 0.050686612725257874
Validation loss = 0.05690430477261543
Validation loss = 0.052972462028265
Validation loss = 0.051565319299697876
Validation loss = 0.06381428241729736
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 499
average number of affinization = 436.9856115107914
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 445
average number of affinization = 437.04285714285714
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 339
average number of affinization = 436.34751773049646
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 504
average number of affinization = 436.82394366197184
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 503
average number of affinization = 437.2867132867133
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 428
average number of affinization = 437.22222222222223
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 1.04e+03  |
| Iteration     | 22        |
| MaximumReturn | 2.51e+03  |
| MinimumReturn | -1.04e+03 |
| TotalSamples  | 96000     |
-----------------------------
itr #23 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.05829794332385063
Validation loss = 0.04855518415570259
Validation loss = 0.046264272183179855
Validation loss = 0.049458373337984085
Validation loss = 0.05732332542538643
Validation loss = 0.04625489190220833
Validation loss = 0.05184851959347725
Validation loss = 0.04882233217358589
Validation loss = 0.04782696068286896
Validation loss = 0.048246681690216064
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05776117369532585
Validation loss = 0.051177363842725754
Validation loss = 0.04889717698097229
Validation loss = 0.04802192375063896
Validation loss = 0.06169135868549347
Validation loss = 0.04785970225930214
Validation loss = 0.04792991280555725
Validation loss = 0.04947065934538841
Validation loss = 0.048610616475343704
Validation loss = 0.0477430485188961
Validation loss = 0.04936588183045387
Validation loss = 0.049673791974782944
Validation loss = 0.04694092273712158
Validation loss = 0.04839998856186867
Validation loss = 0.04717698693275452
Validation loss = 0.05305113270878792
Validation loss = 0.053015608340501785
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05728025361895561
Validation loss = 0.050010088831186295
Validation loss = 0.05241952836513519
Validation loss = 0.052443500608205795
Validation loss = 0.04805489256978035
Validation loss = 0.05996694043278694
Validation loss = 0.04783257842063904
Validation loss = 0.04851004481315613
Validation loss = 0.052367810159921646
Validation loss = 0.04798511788249016
Validation loss = 0.04801132157444954
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.053933948278427124
Validation loss = 0.05581796169281006
Validation loss = 0.04663778841495514
Validation loss = 0.05126761272549629
Validation loss = 0.05730462446808815
Validation loss = 0.05131572484970093
Validation loss = 0.046231430023908615
Validation loss = 0.05321343615651131
Validation loss = 0.045302391052246094
Validation loss = 0.04651105776429176
Validation loss = 0.05344150587916374
Validation loss = 0.046697456389665604
Validation loss = 0.04771049693226814
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.06179371476173401
Validation loss = 0.04910003021359444
Validation loss = 0.061286430805921555
Validation loss = 0.04859909787774086
Validation loss = 0.04858964681625366
Validation loss = 0.05467139557003975
Validation loss = 0.049500059336423874
Validation loss = 0.04971110820770264
Validation loss = 0.04897312447428703
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 446
average number of affinization = 437.28275862068966
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 459
average number of affinization = 437.43150684931504
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 527
average number of affinization = 438.0408163265306
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 501
average number of affinization = 438.4662162162162
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 489
average number of affinization = 438.80536912751677
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 479
average number of affinization = 439.0733333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.03e+03 |
| Iteration     | 23       |
| MaximumReturn | 2.43e+03 |
| MinimumReturn | 1.09e+03 |
| TotalSamples  | 100000   |
----------------------------
itr #24 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04968094453215599
Validation loss = 0.04640424624085426
Validation loss = 0.04678751528263092
Validation loss = 0.0470481738448143
Validation loss = 0.0478588230907917
Validation loss = 0.045880578458309174
Validation loss = 0.04580584540963173
Validation loss = 0.04726221039891243
Validation loss = 0.05651712790131569
Validation loss = 0.04522076994180679
Validation loss = 0.04532289132475853
Validation loss = 0.06019982695579529
Validation loss = 0.0450848788022995
Validation loss = 0.045980535447597504
Validation loss = 0.047457460314035416
Validation loss = 0.04467720910906792
Validation loss = 0.047411851584911346
Validation loss = 0.047240570187568665
Validation loss = 0.04555399343371391
Validation loss = 0.05087009817361832
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.05058523267507553
Validation loss = 0.04682253673672676
Validation loss = 0.04888269305229187
Validation loss = 0.0508640855550766
Validation loss = 0.04826654866337776
Validation loss = 0.04681222885847092
Validation loss = 0.04858221858739853
Validation loss = 0.046603113412857056
Validation loss = 0.04453282430768013
Validation loss = 0.05286521837115288
Validation loss = 0.047321148216724396
Validation loss = 0.04663221538066864
Validation loss = 0.05068527162075043
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.059926509857177734
Validation loss = 0.04841335862874985
Validation loss = 0.05310230329632759
Validation loss = 0.049135565757751465
Validation loss = 0.04650900885462761
Validation loss = 0.05703263357281685
Validation loss = 0.04701398313045502
Validation loss = 0.049368564039468765
Validation loss = 0.05192681774497032
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04912443086504936
Validation loss = 0.046995609998703
Validation loss = 0.046549905091524124
Validation loss = 0.04859757050871849
Validation loss = 0.04711098223924637
Validation loss = 0.04563489928841591
Validation loss = 0.04856302589178085
Validation loss = 0.04694705829024315
Validation loss = 0.04513514041900635
Validation loss = 0.045529820024967194
Validation loss = 0.05010179430246353
Validation loss = 0.04402798041701317
Validation loss = 0.053827110677957535
Validation loss = 0.044700298458337784
Validation loss = 0.04464573413133621
Validation loss = 0.06073280796408653
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.051173675805330276
Validation loss = 0.04680580645799637
Validation loss = 0.048993028700351715
Validation loss = 0.04692496731877327
Validation loss = 0.04777666553854942
Validation loss = 0.04738413915038109
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 427
average number of affinization = 438.9933774834437
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 431
average number of affinization = 438.9407894736842
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 463
average number of affinization = 439.0980392156863
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 515
average number of affinization = 439.59090909090907
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 309
average number of affinization = 438.7483870967742
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 452
average number of affinization = 438.8333333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.3e+03  |
| Iteration     | 24       |
| MaximumReturn | 2.53e+03 |
| MinimumReturn | -500     |
| TotalSamples  | 104000   |
----------------------------
itr #25 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.0557168647646904
Validation loss = 0.04369733855128288
Validation loss = 0.04420095309615135
Validation loss = 0.04835629090666771
Validation loss = 0.04983767494559288
Validation loss = 0.045153673738241196
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.053676072508096695
Validation loss = 0.045317962765693665
Validation loss = 0.04745505750179291
Validation loss = 0.04835507646203041
Validation loss = 0.04767191410064697
Validation loss = 0.044578347355127335
Validation loss = 0.04658607766032219
Validation loss = 0.04991744086146355
Validation loss = 0.051210835576057434
Validation loss = 0.0463716946542263
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.055058855563402176
Validation loss = 0.04559818655252457
Validation loss = 0.048324406147003174
Validation loss = 0.04720211774110794
Validation loss = 0.04502379521727562
Validation loss = 0.061993010342121124
Validation loss = 0.05077391862869263
Validation loss = 0.04641614854335785
Validation loss = 0.046032410115003586
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.06362731009721756
Validation loss = 0.04497043788433075
Validation loss = 0.04458554461598396
Validation loss = 0.04898907244205475
Validation loss = 0.04417722672224045
Validation loss = 0.045247238129377365
Validation loss = 0.04966742545366287
Validation loss = 0.04330647364258766
Validation loss = 0.04477823153138161
Validation loss = 0.044031545519828796
Validation loss = 0.05140967667102814
Validation loss = 0.0444885678589344
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05173967033624649
Validation loss = 0.04655351862311363
Validation loss = 0.04682878777384758
Validation loss = 0.04675561562180519
Validation loss = 0.0464666523039341
Validation loss = 0.04745125398039818
Validation loss = 0.0455503985285759
Validation loss = 0.04647740349173546
Validation loss = 0.046280719339847565
Validation loss = 0.0444975346326828
Validation loss = 0.04969694837927818
Validation loss = 0.045004360377788544
Validation loss = 0.04728388041257858
Validation loss = 0.046931441873311996
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 507
average number of affinization = 439.26751592356686
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 425
average number of affinization = 439.17721518987344
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 549
average number of affinization = 439.8679245283019
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 517
average number of affinization = 440.35
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 500
average number of affinization = 440.72049689440996
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 421
average number of affinization = 440.5987654320988
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.05e+03 |
| Iteration     | 25       |
| MaximumReturn | 2.69e+03 |
| MinimumReturn | -929     |
| TotalSamples  | 108000   |
----------------------------
itr #26 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04689128324389458
Validation loss = 0.04707157984375954
Validation loss = 0.04595721513032913
Validation loss = 0.04782862588763237
Validation loss = 0.04354863241314888
Validation loss = 0.045286793261766434
Validation loss = 0.04502563178539276
Validation loss = 0.05020738020539284
Validation loss = 0.042997028678655624
Validation loss = 0.050075411796569824
Validation loss = 0.04234442487359047
Validation loss = 0.04892542213201523
Validation loss = 0.04228493943810463
Validation loss = 0.04383764788508415
Validation loss = 0.04910387843847275
Validation loss = 0.04359431937336922
Validation loss = 0.04200570657849312
Validation loss = 0.04546094313263893
Validation loss = 0.04293200746178627
Validation loss = 0.044045113027095795
Validation loss = 0.04262074455618858
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.0463988333940506
Validation loss = 0.04519970715045929
Validation loss = 0.04786569997668266
Validation loss = 0.04360246658325195
Validation loss = 0.04631965607404709
Validation loss = 0.04527055844664574
Validation loss = 0.04399227350950241
Validation loss = 0.047013621777296066
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.051807668060064316
Validation loss = 0.045866675674915314
Validation loss = 0.043628208339214325
Validation loss = 0.051600344479084015
Validation loss = 0.04808490350842476
Validation loss = 0.05325654521584511
Validation loss = 0.04413659870624542
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04538201913237572
Validation loss = 0.048965852707624435
Validation loss = 0.04384065419435501
Validation loss = 0.042471010237932205
Validation loss = 0.04408055171370506
Validation loss = 0.043291062116622925
Validation loss = 0.04211147129535675
Validation loss = 0.04730411246418953
Validation loss = 0.04284466803073883
Validation loss = 0.04393615201115608
Validation loss = 0.043990641832351685
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05039887875318527
Validation loss = 0.043928615748882294
Validation loss = 0.04376228526234627
Validation loss = 0.0473628044128418
Validation loss = 0.047529879957437515
Validation loss = 0.04481970891356468
Validation loss = 0.0518801212310791
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 492
average number of affinization = 440.9141104294479
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 434
average number of affinization = 440.8719512195122
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 510
average number of affinization = 441.2909090909091
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 500
average number of affinization = 441.644578313253
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 472
average number of affinization = 441.8263473053892
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 460
average number of affinization = 441.9345238095238
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.87e+03 |
| Iteration     | 26       |
| MaximumReturn | 2.51e+03 |
| MinimumReturn | 1.04e+03 |
| TotalSamples  | 112000   |
----------------------------
itr #27 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04374460503458977
Validation loss = 0.044153012335300446
Validation loss = 0.04171141982078552
Validation loss = 0.04270181432366371
Validation loss = 0.04389757290482521
Validation loss = 0.0423918291926384
Validation loss = 0.04510347917675972
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04608084633946419
Validation loss = 0.04496411606669426
Validation loss = 0.04551975056529045
Validation loss = 0.04738236218690872
Validation loss = 0.04302768036723137
Validation loss = 0.04539588838815689
Validation loss = 0.04532041400671005
Validation loss = 0.043701715767383575
Validation loss = 0.04266145080327988
Validation loss = 0.04526925086975098
Validation loss = 0.0436863973736763
Validation loss = 0.04226166754961014
Validation loss = 0.04925549030303955
Validation loss = 0.043888140469789505
Validation loss = 0.04244401678442955
Validation loss = 0.04335709288716316
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04421510174870491
Validation loss = 0.04657435417175293
Validation loss = 0.044226255267858505
Validation loss = 0.04437180981040001
Validation loss = 0.046331822872161865
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04442846402525902
Validation loss = 0.04711337760090828
Validation loss = 0.04229942336678505
Validation loss = 0.044039446860551834
Validation loss = 0.04424834996461868
Validation loss = 0.04187192767858505
Validation loss = 0.051652032881975174
Validation loss = 0.041220612823963165
Validation loss = 0.0417727530002594
Validation loss = 0.0487799346446991
Validation loss = 0.042619939893484116
Validation loss = 0.04526112601161003
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.047021202743053436
Validation loss = 0.04690917208790779
Validation loss = 0.044320423156023026
Validation loss = 0.044487304985523224
Validation loss = 0.04429803416132927
Validation loss = 0.04254991188645363
Validation loss = 0.050942856818437576
Validation loss = 0.04555409401655197
Validation loss = 0.04382013529539108
Validation loss = 0.04272986203432083
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 369
average number of affinization = 441.50295857988164
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 500
average number of affinization = 441.84705882352944
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 487
average number of affinization = 442.1111111111111
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 470
average number of affinization = 442.2732558139535
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 493
average number of affinization = 442.5664739884393
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 400
average number of affinization = 442.32183908045977
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 920      |
| Iteration     | 27       |
| MaximumReturn | 1.97e+03 |
| MinimumReturn | 175      |
| TotalSamples  | 116000   |
----------------------------
itr #28 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.044726889580488205
Validation loss = 0.04393727332353592
Validation loss = 0.04347873479127884
Validation loss = 0.049042895436286926
Validation loss = 0.043105993419885635
Validation loss = 0.04379425197839737
Validation loss = 0.04356484115123749
Validation loss = 0.042835675179958344
Validation loss = 0.042524322867393494
Validation loss = 0.0419703871011734
Validation loss = 0.04455988481640816
Validation loss = 0.04704912751913071
Validation loss = 0.043169211596250534
Validation loss = 0.04215576872229576
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.053678955882787704
Validation loss = 0.043444838374853134
Validation loss = 0.043626777827739716
Validation loss = 0.04344348981976509
Validation loss = 0.04779011383652687
Validation loss = 0.04209020361304283
Validation loss = 0.04227414354681969
Validation loss = 0.04967769235372543
Validation loss = 0.042905811220407486
Validation loss = 0.042170118540525436
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04927627369761467
Validation loss = 0.04637191444635391
Validation loss = 0.044365011155605316
Validation loss = 0.04570642113685608
Validation loss = 0.04488443210721016
Validation loss = 0.04593021422624588
Validation loss = 0.043604522943496704
Validation loss = 0.04900728166103363
Validation loss = 0.04348301142454147
Validation loss = 0.04444298893213272
Validation loss = 0.04658609256148338
Validation loss = 0.04352917894721031
Validation loss = 0.04739254340529442
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.051080092787742615
Validation loss = 0.0420103594660759
Validation loss = 0.0496414490044117
Validation loss = 0.04330376908183098
Validation loss = 0.04471909627318382
Validation loss = 0.04196661338210106
Validation loss = 0.04398435726761818
Validation loss = 0.04314706474542618
Validation loss = 0.04573705792427063
Validation loss = 0.0415194109082222
Validation loss = 0.041804905980825424
Validation loss = 0.046404238790273666
Validation loss = 0.04080434888601303
Validation loss = 0.043343208730220795
Validation loss = 0.04135647416114807
Validation loss = 0.04202428087592125
Validation loss = 0.04310593008995056
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.048818398267030716
Validation loss = 0.044806066900491714
Validation loss = 0.04337020218372345
Validation loss = 0.05367809161543846
Validation loss = 0.04293191060423851
Validation loss = 0.04291195049881935
Validation loss = 0.04998539760708809
Validation loss = 0.04282393679022789
Validation loss = 0.0427408292889595
Validation loss = 0.04871957004070282
Validation loss = 0.042241618037223816
Validation loss = 0.04569493234157562
Validation loss = 0.043543945997953415
Validation loss = 0.04463711380958557
Validation loss = 0.043092187494039536
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 507
average number of affinization = 442.69142857142856
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 461
average number of affinization = 442.79545454545456
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 450
average number of affinization = 442.8361581920904
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 495
average number of affinization = 443.1292134831461
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 496
average number of affinization = 443.4245810055866
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 462
average number of affinization = 443.52777777777777
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 2.11e+03 |
| Iteration     | 28       |
| MaximumReturn | 2.62e+03 |
| MinimumReturn | 1.41e+03 |
| TotalSamples  | 120000   |
----------------------------
itr #29 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.041934285312891006
Validation loss = 0.043050337582826614
Validation loss = 0.04399610310792923
Validation loss = 0.040908295661211014
Validation loss = 0.04412345960736275
Validation loss = 0.04144474118947983
Validation loss = 0.05495798960328102
Validation loss = 0.040273718535900116
Validation loss = 0.04208337515592575
Validation loss = 0.04233066365122795
Validation loss = 0.04135851562023163
Validation loss = 0.039870165288448334
Validation loss = 0.040649548172950745
Validation loss = 0.043191853910684586
Validation loss = 0.04019541293382645
Validation loss = 0.04431177303195
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04276198148727417
Validation loss = 0.04677664861083031
Validation loss = 0.042777664959430695
Validation loss = 0.04336829110980034
Validation loss = 0.041671428829431534
Validation loss = 0.04204290732741356
Validation loss = 0.045284636318683624
Validation loss = 0.04108908027410507
Validation loss = 0.04122822359204292
Validation loss = 0.04385949298739433
Validation loss = 0.04026911407709122
Validation loss = 0.04420400783419609
Validation loss = 0.03995697945356369
Validation loss = 0.04579013213515282
Validation loss = 0.042828913778066635
Validation loss = 0.04271697625517845
Validation loss = 0.04148920252919197
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04643930867314339
Validation loss = 0.04199640452861786
Validation loss = 0.049932826310396194
Validation loss = 0.04179471731185913
Validation loss = 0.04426692798733711
Validation loss = 0.04254315420985222
Validation loss = 0.04623366892337799
Validation loss = 0.04161923751235008
Validation loss = 0.04239318147301674
Validation loss = 0.04576344043016434
Validation loss = 0.041789207607507706
Validation loss = 0.045799195766448975
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.042441025376319885
Validation loss = 0.040182676166296005
Validation loss = 0.04165997728705406
Validation loss = 0.03994521126151085
Validation loss = 0.04339418187737465
Validation loss = 0.04038412496447563
Validation loss = 0.041645750403404236
Validation loss = 0.04010043665766716
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.046799663454294205
Validation loss = 0.04447925090789795
Validation loss = 0.04742557182908058
Validation loss = 0.04087980464100838
Validation loss = 0.04130314290523529
Validation loss = 0.047537095844745636
Validation loss = 0.0410311333835125
Validation loss = 0.04497498646378517
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 184
average number of affinization = 442.0939226519337
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 479
average number of affinization = 442.2967032967033
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 469
average number of affinization = 442.44262295081967
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 442.33152173913044
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 489
average number of affinization = 442.58378378378376
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 432
average number of affinization = 442.5268817204301
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 1.03e+03  |
| Iteration     | 29        |
| MaximumReturn | 2.45e+03  |
| MinimumReturn | -1.14e+03 |
| TotalSamples  | 124000    |
-----------------------------
itr #30 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.048053186386823654
Validation loss = 0.050311122089624405
Validation loss = 0.04268620163202286
Validation loss = 0.044868163764476776
Validation loss = 0.04610253870487213
Validation loss = 0.04187178984284401
Validation loss = 0.05184166878461838
Validation loss = 0.042978446930646896
Validation loss = 0.04309958964586258
Validation loss = 0.04384371265769005
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04648195579648018
Validation loss = 0.043701477348804474
Validation loss = 0.04826546832919121
Validation loss = 0.045554038137197495
Validation loss = 0.04263618588447571
Validation loss = 0.051466457545757294
Validation loss = 0.04480275511741638
Validation loss = 0.04279879480600357
Validation loss = 0.0461445078253746
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.05084802955389023
Validation loss = 0.04379316791892052
Validation loss = 0.047630779445171356
Validation loss = 0.04378202557563782
Validation loss = 0.05038182809948921
Validation loss = 0.04526665434241295
Validation loss = 0.04577506706118584
Validation loss = 0.04594559594988823
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04740406945347786
Validation loss = 0.04290849342942238
Validation loss = 0.04700515419244766
Validation loss = 0.042138777673244476
Validation loss = 0.04638848453760147
Validation loss = 0.04406406730413437
Validation loss = 0.04425209015607834
Validation loss = 0.044358931481838226
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.05649091675877571
Validation loss = 0.04358217492699623
Validation loss = 0.046597499400377274
Validation loss = 0.04766117036342621
Validation loss = 0.04359592869877815
Validation loss = 0.04717046394944191
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 466
average number of affinization = 442.6524064171123
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 490
average number of affinization = 442.90425531914894
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 326
average number of affinization = 442.2857142857143
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 484
average number of affinization = 442.5052631578947
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 488
average number of affinization = 442.7434554973822
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 484
average number of affinization = 442.9583333333333
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.87e+03 |
| Iteration     | 30       |
| MaximumReturn | 2.54e+03 |
| MinimumReturn | 387      |
| TotalSamples  | 128000   |
----------------------------
itr #31 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04411964863538742
Validation loss = 0.04392686486244202
Validation loss = 0.04124007374048233
Validation loss = 0.04505542293190956
Validation loss = 0.04112184792757034
Validation loss = 0.04523221775889397
Validation loss = 0.042389724403619766
Validation loss = 0.045355796813964844
Validation loss = 0.04283852130174637
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.04785986244678497
Validation loss = 0.04242967814207077
Validation loss = 0.0446556955575943
Validation loss = 0.04256204143166542
Validation loss = 0.04238646849989891
Validation loss = 0.043316181749105453
Validation loss = 0.04188767820596695
Validation loss = 0.042564183473587036
Validation loss = 0.041767656803131104
Validation loss = 0.04057633876800537
Validation loss = 0.04476655274629593
Validation loss = 0.04227122291922569
Validation loss = 0.04391217231750488
Validation loss = 0.04205116629600525
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04757088050246239
Validation loss = 0.043374594300985336
Validation loss = 0.04642853885889053
Validation loss = 0.0439150407910347
Validation loss = 0.04281044378876686
Validation loss = 0.048202600330114365
Validation loss = 0.04282678663730621
Validation loss = 0.044329725205898285
Validation loss = 0.04199930280447006
Validation loss = 0.04434382915496826
Validation loss = 0.04182644188404083
Validation loss = 0.04229886457324028
Validation loss = 0.043281007558107376
Validation loss = 0.04613324999809265
Validation loss = 0.04240712523460388
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.047381192445755005
Validation loss = 0.04242624342441559
Validation loss = 0.043008171021938324
Validation loss = 0.04738398641347885
Validation loss = 0.042077794671058655
Validation loss = 0.047055624425411224
Validation loss = 0.04233495146036148
Validation loss = 0.04166025295853615
Validation loss = 0.04869720712304115
Validation loss = 0.04208459332585335
Validation loss = 0.04652027785778046
Validation loss = 0.042539793998003006
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04883861169219017
Validation loss = 0.04223930090665817
Validation loss = 0.04331650957465172
Validation loss = 0.04893668740987778
Validation loss = 0.04475508630275726
Validation loss = 0.04878620803356171
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 452
average number of affinization = 443.00518134715026
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 435
average number of affinization = 442.9639175257732
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 464
average number of affinization = 443.07179487179485
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 422
average number of affinization = 442.9642857142857
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 426
average number of affinization = 442.8781725888325
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 333
average number of affinization = 442.32323232323233
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
----------------------------
| AverageReturn | 1.11e+03 |
| Iteration     | 31       |
| MaximumReturn | 2.42e+03 |
| MinimumReturn | -588     |
| TotalSamples  | 132000   |
----------------------------
itr #32 | 
Fitting dynamics.
Fitting model 0 (0-based) in the ensemble of 5 models
Validation loss = 0.04262802377343178
Validation loss = 0.042213015258312225
Validation loss = 0.04441174492239952
Validation loss = 0.04221908375620842
Validation loss = 0.04511154070496559
Validation loss = 0.04103310778737068
Validation loss = 0.04335978999733925
Validation loss = 0.04090483486652374
Validation loss = 0.046802278608083725
Validation loss = 0.040889471769332886
Validation loss = 0.04219729080796242
Validation loss = 0.041363537311553955
Validation loss = 0.04062941297888756
Validation loss = 0.04342866688966751
Validation loss = 0.04158556088805199
Validation loss = 0.04077189788222313
Validation loss = 0.04158353433012962
Fitting model 1 (0-based) in the ensemble of 5 models
Validation loss = 0.044624436646699905
Validation loss = 0.044464100152254105
Validation loss = 0.04313420504331589
Validation loss = 0.04235680773854256
Validation loss = 0.04200873523950577
Validation loss = 0.04674164578318596
Validation loss = 0.04216189682483673
Validation loss = 0.0415746234357357
Validation loss = 0.04117143154144287
Validation loss = 0.04545145109295845
Validation loss = 0.040231961756944656
Validation loss = 0.039929717779159546
Validation loss = 0.044078607112169266
Validation loss = 0.04425102472305298
Validation loss = 0.04049457982182503
Validation loss = 0.04235919564962387
Fitting model 2 (0-based) in the ensemble of 5 models
Validation loss = 0.04411375895142555
Validation loss = 0.04170653596520424
Validation loss = 0.04270808771252632
Validation loss = 0.0416366271674633
Validation loss = 0.046972356736660004
Validation loss = 0.04328620433807373
Validation loss = 0.0413445346057415
Validation loss = 0.04289906844496727
Validation loss = 0.04366647079586983
Validation loss = 0.04136725515127182
Validation loss = 0.04497559741139412
Fitting model 3 (0-based) in the ensemble of 5 models
Validation loss = 0.04433683305978775
Validation loss = 0.044757142663002014
Validation loss = 0.04206376150250435
Validation loss = 0.04124491289258003
Validation loss = 0.040948014706373215
Validation loss = 0.04127313569188118
Validation loss = 0.043443627655506134
Validation loss = 0.0413249135017395
Validation loss = 0.04207375645637512
Fitting model 4 (0-based) in the ensemble of 5 models
Validation loss = 0.04637126624584198
Validation loss = 0.04187089204788208
Validation loss = 0.04419509693980217
Validation loss = 0.04607795178890228
Validation loss = 0.04234226793050766
Validation loss = 0.04228367283940315
Done fitting dynamics.
Updating randomness.
Done updating randomness.
Training policy using TRPO.
Re-initialize init_std.
Obtaining samples for iteration 0...
Obtaining samples for iteration 1...
Obtaining samples for iteration 2...
Obtaining samples for iteration 3...
Obtaining samples for iteration 4...
Obtaining samples for iteration 5...
Obtaining samples for iteration 6...
Obtaining samples for iteration 7...
Obtaining samples for iteration 8...
Obtaining samples for iteration 9...
Obtaining samples for iteration 10...
Obtaining samples for iteration 11...
Obtaining samples for iteration 12...
Obtaining samples for iteration 13...
Obtaining samples for iteration 14...
Obtaining samples for iteration 15...
Obtaining samples for iteration 16...
Obtaining samples for iteration 17...
Obtaining samples for iteration 18...
Obtaining samples for iteration 19...
Done training policy.
Generating on-policy rollouts.
Path 0 | total_timesteps 0.
number of affinization with epsilon = 0.3 is 415
average number of affinization = 442.1859296482412
Path 1 | total_timesteps 1000.
number of affinization with epsilon = 0.3 is 276
average number of affinization = 441.355
Path 2 | total_timesteps 2000.
number of affinization with epsilon = 0.3 is 393
average number of affinization = 441.11442786069654
Path 3 | total_timesteps 3000.
number of affinization with epsilon = 0.3 is 457
average number of affinization = 441.1930693069307
Path 4 | total_timesteps 4000.
number of affinization with epsilon = 0.3 is 461
average number of affinization = 441.29064039408865
Path 5 | total_timesteps 5000.
number of affinization with epsilon = 0.3 is 466
average number of affinization = 441.4117647058824
Done generating on-policy rollouts.
Updating normalization.
Done updating normalization.
-----------------------------
| AverageReturn | 1.34e+03  |
| Iteration     | 32        |
| MaximumReturn | 2.44e+03  |
| MinimumReturn | -1.09e+03 |
| TotalSamples  | 136000    |
-----------------------------
