vocab_size not found in data/openwebtext/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 50.96M
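The first line above reflects a nanoGPT-style vocabulary lookup: the training script checks the dataset's meta.pkl for a vocab_size entry and, finding none, falls back to the GPT-2 BPE vocabulary of 50257 tokens. A minimal sketch of that check, assuming this layout (the paths and the GPT2_VOCAB_SIZE name are illustrative, not taken from the run's code):

    import os
    import pickle

    GPT2_VOCAB_SIZE = 50257  # GPT-2 BPE vocabulary size, used as the fallback

    data_dir = os.path.join('data', 'openwebtext')
    meta_path = os.path.join(data_dir, 'meta.pkl')

    # Read vocab_size from the dataset's meta.pkl if it exists.
    vocab_size = None
    if os.path.exists(meta_path):
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        vocab_size = meta.get('vocab_size')

    # Otherwise fall back to the GPT-2 default, as logged above.
    if vocab_size is None:
        print(f"vocab_size not found in {meta_path}, using GPT-2 default of {GPT2_VOCAB_SIZE}")
        vocab_size = GPT2_VOCAB_SIZE
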
step 0: learning rate 0.00060000, train loss 10.9687, val loss 10.9745
step 250: learning rate 0.00060000, train loss 10.9773, val loss 10.9790
step 500: learning rate 0.00060000, train loss 10.9795, val loss 10.9725
step 750: learning rate 0.00060000, train loss 10.9804, val loss 10.9725
step 1000: learning rate 0.00060000, train loss 10.9768, val loss 10.9793
step 1250: learning rate 0.00059999, train loss 10.9769, val loss 10.9828
step 1500: learning rate 0.00059999, train loss 10.9699, val loss 10.9736
step 1750: learning rate 0.00059999, train loss 10.9730, val loss 10.9791
step 2000: learning rate 0.00059999, train loss 10.9810, val loss 10.9820
step 2250: learning rate 0.00059998, train loss 10.9743, val loss 10.9786
step 2500: learning rate 0.00059998, train loss 10.9739, val loss 10.9762
step 2750: learning rate 0.00059997, train loss 10.9710, val loss 10.9771
step 3000: learning rate 0.00059997, train loss 10.9735, val loss 10.9796
step 3250: learning rate 0.00059996, train loss 10.9834, val loss 10.9706
step 3500: learning rate 0.00059995, train loss 10.9776, val loss 10.9748
step 3750: learning rate 0.00059995, train loss 10.9740, val loss 10.9705
step 4000: learning rate 0.00059994, train loss 10.9805, val loss 10.9778
step 4250: learning rate 0.00059993, train loss 10.9813, val loss 10.9762
step 4500: learning rate 0.00059993, train loss 10.9695, val loss 10.9784
step 4750: learning rate 0.00059992, train loss 10.9722, val loss 10.9736
step 5000: learning rate 0.00059991, train loss 10.9761, val loss 10.9821
step 5250: learning rate 0.00059990, train loss 10.9777, val loss 10.9717
step 5500: learning rate 0.00059989, train loss 10.9774, val loss 10.9787
step 5750: learning rate 0.00059988, train loss 10.9813, val loss 10.9783
step 6000: learning rate 0.00059987, train loss 10.9856, val loss 10.9830
step 6250: learning rate 0.00059986, train loss 10.9737, val loss 10.9794
step 6500: learning rate 0.00059984, train loss 10.9793, val loss 10.9721
step 6750: learning rate 0.00059983, train loss 10.9804, val loss 10.9811
step 7000: learning rate 0.00059982, train loss 10.9783, val loss 10.9743
step 7250: learning rate 0.00059981, train loss 10.9774, val loss 10.9769
step 7500: learning rate 0.00059979, train loss 10.9794, val loss 10.9760
step 7750: learning rate 0.00059978, train loss 10.9772, val loss 10.9775
step 8000: learning rate 0.00059976, train loss 10.9746, val loss 10.9805
step 8250: learning rate 0.00059975, train loss 10.9725, val loss 10.9789
step 8500: learning rate 0.00059973, train loss 10.9804, val loss 10.9834
step 8750: learning rate 0.00059972, train loss 10.9786, val loss 10.9756
step 9000: learning rate 0.00059970, train loss 10.9768, val loss 10.9818
step 9250: learning rate 0.00059968, train loss 10.9716, val loss 10.9699
step 9500: learning rate 0.00059967, train loss 10.9780, val loss 10.9810
step 9750: learning rate 0.00059965, train loss 10.9762, val loss 10.9795
step 10000: learning rate 0.00059963, train loss 10.9742, val loss 10.9782
step 10250: learning rate 0.00059961, train loss 10.9775, val loss 10.9796
step 10500: learning rate 0.00059959, train loss 10.9836, val loss 10.9765
step 10750: learning rate 0.00059957, train loss 10.9760, val loss 10.9784
step 11000: learning rate 0.00059955, train loss 10.9705, val loss 10.9781
step 11250: learning rate 0.00059953, train loss 10.9762, val loss 10.9695
step 11500: learning rate 0.00059951, train loss 10.9774, val loss 10.9782
step 11750: learning rate 0.00059949, train loss 10.9715, val loss 10.9711
step 12000: learning rate 0.00059947, train loss 10.9736, val loss 10.9804
step 12250: learning rate 0.00059944, train loss 10.9726, val loss 10.9753
step 12500: learning rate 0.00059942, train loss 10.9834, val loss 10.9819
step 12750: learning rate 0.00059940, train loss 10.9740, val loss 10.9785
step 13000: learning rate 0.00059937, train loss 10.9748, val loss 10.9765
step 13250: learning rate 0.00059935, train loss 10.9783, val loss 10.9706
step 13500: learning rate 0.00059933, train loss 10.9803, val loss 10.9742
step 13750: learning rate 0.00059930, train loss 10.9799, val loss 10.9708
step 14000: learning rate 0.00059927, train loss 10.9706, val loss 10.9817
step 14250: learning rate 0.00059925, train loss 10.9685, val loss 10.9800
step 14500: learning rate 0.00059922, train loss 10.9724, val loss 10.9753
step 14750: learning rate 0.00059920, train loss 10.9760, val loss 10.9735
step 15000: learning rate 0.00059917, train loss 10.9771, val loss 10.9754
step 15250: learning rate 0.00059914, train loss 10.9817, val loss 10.9683
step 15500: learning rate 0.00059911, train loss 10.9732, val loss 10.9761
step 15750: learning rate 0.00059908, train loss 10.9801, val loss 10.9867
step 16000: learning rate 0.00059905, train loss 10.9707, val loss 10.9752
step 16250: learning rate 0.00059902, train loss 10.9815, val loss 10.9778
step 16500: learning rate 0.00059899, train loss 10.9734, val loss 10.9764
step 16750: learning rate 0.00059896, train loss 10.9746, val loss 10.9736
step 17000: learning rate 0.00059893, train loss 10.9753, val loss 10.9756
step 17250: learning rate 0.00059890, train loss 10.9724, val loss 10.9779
step 17500: learning rate 0.00059887, train loss 10.9790, val loss 10.9770
step 17750: learning rate 0.00059883, train loss 10.9743, val loss 10.9817
step 18000: learning rate 0.00059880, train loss 10.9747, val loss 10.9806
step 18250: learning rate 0.00059877, train loss 10.9784, val loss 10.9835
step 18500: learning rate 0.00059873, train loss 10.9765, val loss 10.9791
step 18750: learning rate 0.00059870, train loss 10.9758, val loss 10.9694
step 19000: learning rate 0.00059867, train loss 10.9741, val loss 10.9812
step 19250: learning rate 0.00059863, train loss 10.9727, val loss 10.9771
step 19500: learning rate 0.00059859, train loss 10.9837, val loss 10.9679
step 19750: learning rate 0.00059856, train loss 10.9787, val loss 10.9743
step 20000: learning rate 0.00059852, train loss 10.9818, val loss 10.9723
step 20250: learning rate 0.00059848, train loss 10.9711, val loss 10.9727
step 20500: learning rate 0.00059845, train loss 10.9728, val loss 10.9771
step 20750: learning rate 0.00059841, train loss 10.9758, val loss 10.9763
step 21000: learning rate 0.00059837, train loss 10.9804, val loss 10.9779
step 21250: learning rate 0.00059833, train loss 10.9791, val loss 10.9794
step 21500: learning rate 0.00059829, train loss 10.9729, val loss 10.9757
step 21750: learning rate 0.00059825, train loss 10.9676, val loss 10.9807
step 22000: learning rate 0.00059821, train loss 10.9783, val loss 10.9740
step 22250: learning rate 0.00059817, train loss 10.9701, val loss 10.9754
step 22500: learning rate 0.00059813, train loss 10.9728, val loss 10.9855
step 22750: learning rate 0.00059809, train loss 10.9775, val loss 10.9702
step 23000: learning rate 0.00059804, train loss 10.9756, val loss 10.9723
step 23250: learning rate 0.00059800, train loss 10.9743, val loss 10.9697
step 23500: learning rate 0.00059796, train loss 10.9842, val loss 10.9715
step 23750: learning rate 0.00059792, train loss 10.9793, val loss 10.9816
step 24000: learning rate 0.00059787, train loss 10.9754, val loss 10.9741
step 24250: learning rate 0.00059783, train loss 10.9841, val loss 10.9779
step 24500: learning rate 0.00059778, train loss 10.9748, val loss 10.9767
step 24750: learning rate 0.00059774, train loss 10.9823, val loss 10.9800
step 25000: learning rate 0.00059769, train loss 10.9811, val loss 10.9720
step 25250: learning rate 0.00059764, train loss 10.9755, val loss 10.9749
step 25500: learning rate 0.00059760, train loss 10.9735, val loss 10.9685
step 25750: learning rate 0.00059755, train loss 10.9796, val loss 10.9788
step 26000: learning rate 0.00059750, train loss 10.9785, val loss 10.9682
step 26250: learning rate 0.00059745, train loss 10.9696, val loss 10.9754
step 26500: learning rate 0.00059741, train loss 10.9779, val loss 10.9768
step 26750: learning rate 0.00059736, train loss 10.9787, val loss 10.9745
step 27000: learning rate 0.00059731, train loss 10.9876, val loss 10.9790
step 27250: learning rate 0.00059726, train loss 10.9779, val loss 10.9814
step 27500: learning rate 0.00059721, train loss 10.9784, val loss 10.9737
step 27750: learning rate 0.00059715, train loss 10.9737, val loss 10.9739
step 28000: learning rate 0.00059710, train loss 10.9733, val loss 10.9711
step 28250: learning rate 0.00059705, train loss 10.9803, val loss 10.9747
step 28500: learning rate 0.00059700, train loss 10.9763, val loss 10.9725
step 28750: learning rate 0.00059695, train loss 10.9733, val loss 10.9805
step 29000: learning rate 0.00059689, train loss 10.9721, val loss 10.9776
step 29250: learning rate 0.00059684, train loss 10.9795, val loss 10.9686
step 29500: learning rate 0.00059679, train loss 10.9707, val loss 10.9756
step 29750: learning rate 0.00059673, train loss 10.9768, val loss 10.9837
step 30000: learning rate 0.00059668, train loss 10.9765, val loss 10.9757
step 30250: learning rate 0.00059662, train loss 10.9844, val loss 10.9727
step 30500: learning rate 0.00059656, train loss 10.9743, val loss 10.9723
step 30750: learning rate 0.00059651, train loss 10.9750, val loss 10.9799
step 31000: learning rate 0.00059645, train loss 10.9789, val loss 10.9702
step 31250: learning rate 0.00059639, train loss 10.9768, val loss 10.9777
step 31500: learning rate 0.00059634, train loss 10.9731, val loss 10.9773
step 31750: learning rate 0.00059628, train loss 10.9679, val loss 10.9707
step 32000: learning rate 0.00059622, train loss 10.9756, val loss 10.9784
step 32250: learning rate 0.00059616, train loss 10.9731, val loss 10.9774
step 32500: learning rate 0.00059610, train loss 10.9808, val loss 10.9784
step 32750: learning rate 0.00059604, train loss 10.9764, val loss 10.9812
step 33000: learning rate 0.00059598, train loss 10.9752, val loss 10.9767
step 33250: learning rate 0.00059592, train loss 10.9808, val loss 10.9818
step 33500: learning rate 0.00059586, train loss 10.9773, val loss 10.9774
step 33750: learning rate 0.00059580, train loss 10.9753, val loss 10.9760
step 34000: learning rate 0.00059573, train loss 10.9796, val loss 10.9785
step 34250: learning rate 0.00059567, train loss 10.9799, val loss 10.9784
step 34500: learning rate 0.00059561, train loss 10.9750, val loss 10.9762
step 34750: learning rate 0.00059554, train loss 10.9871, val loss 10.9752
step 35000: learning rate 0.00059548, train loss 10.9813, val loss 10.9631
step 35250: learning rate 0.00059541, train loss 10.9713, val loss 10.9829
step 35500: learning rate 0.00059535, train loss 10.9653, val loss 10.9821
step 35750: learning rate 0.00059528, train loss 10.9789, val loss 10.9722
step 36000: learning rate 0.00059522, train loss 10.9830, val loss 10.9798
step 36250: learning rate 0.00059515, train loss 10.9771, val loss 10.9789
step 36500: learning rate 0.00059508, train loss 10.9772, val loss 10.9685
step 36750: learning rate 0.00059502, train loss 10.9754, val loss 10.9827
step 37000: learning rate 0.00059495, train loss 10.9803, val loss 10.9707
step 37250: learning rate 0.00059488, train loss 10.9753, val loss 10.9800
step 37500: learning rate 0.00059481, train loss 10.9774, val loss 10.9836
step 37750: learning rate 0.00059474, train loss 10.9773, val loss 10.9693
step 38000: learning rate 0.00059467, train loss 10.9749, val loss 10.9777
step 38250: learning rate 0.00059460, train loss 10.9738, val loss 10.9763
step 38500: learning rate 0.00059453, train loss 10.9761, val loss 10.9723
step 38750: learning rate 0.00059446, train loss 10.9739, val loss 10.9736
step 39000: learning rate 0.00059439, train loss 10.9754, val loss 10.9775
step 39250: learning rate 0.00059432, train loss 10.9806, val loss 10.9731
step 39500: learning rate 0.00059425, train loss 10.9724, val loss 10.9749
step 39750: learning rate 0.00059417, train loss 10.9849, val loss 10.9761
step 40000: learning rate 0.00059410, train loss 10.9671, val loss 10.9710
step 40250: learning rate 0.00059403, train loss 10.9803, val loss 10.9767
step 40500: learning rate 0.00059395, train loss 10.9725, val loss 10.9741
step 40750: learning rate 0.00059388, train loss 10.9679, val loss 10.9714
step 41000: learning rate 0.00059380, train loss 10.9746, val loss 10.9717
step 41250: learning rate 0.00059373, train loss 10.9815, val loss 10.9784
step 41500: learning rate 0.00059365, train loss 10.9755, val loss 10.9662
step 41750: learning rate 0.00059357, train loss 10.9803, val loss 10.9788
step 42000: learning rate 0.00059350, train loss 10.9750, val loss 10.9742
step 42250: learning rate 0.00059342, train loss 10.9878, val loss 10.9763
step 42500: learning rate 0.00059334, train loss 10.9733, val loss 10.9802
step 42750: learning rate 0.00059326, train loss 10.9778, val loss 10.9843
step 43000: learning rate 0.00059319, train loss 10.9756, val loss 10.9785
step 43250: learning rate 0.00059311, train loss 10.9773, val loss 10.9755
step 43500: learning rate 0.00059303, train loss 10.9755, val loss 10.9770
step 43750: learning rate 0.00059295, train loss 10.9787, val loss 10.9782
step 44000: learning rate 0.00059287, train loss 10.9806, val loss 10.9823
step 44250: learning rate 0.00059279, train loss 10.9797, val loss 10.9830
step 44500: learning rate 0.00059270, train loss 10.9768, val loss 10.9699
step 44750: learning rate 0.00059262, train loss 10.9785, val loss 10.9718
step 45000: learning rate 0.00059254, train loss 10.9732, val loss 10.9765
step 45250: learning rate 0.00059246, train loss 10.9698, val loss 10.9647
step 45500: learning rate 0.00059237, train loss 10.9738, val loss 10.9864
step 45750: learning rate 0.00059229, train loss 10.9771, val loss 10.9828
step 46000: learning rate 0.00059221, train loss 10.9785, val loss 10.9818
step 46250: learning rate 0.00059212, train loss 10.9732, val loss 10.9785
step 46500: learning rate 0.00059204, train loss 10.9742, val loss 10.9744
step 46750: learning rate 0.00059195, train loss 10.9711, val loss 10.9772
step 47000: learning rate 0.00059187, train loss 10.9849, val loss 10.9738
step 47250: learning rate 0.00059178, train loss 10.9795, val loss 10.9771
step 47500: learning rate 0.00059169, train loss 10.9823, val loss 10.9688
step 47750: learning rate 0.00059161, train loss 10.9704, val loss 10.9868
step 48000: learning rate 0.00059152, train loss 10.9765, val loss 10.9768
step 48250: learning rate 0.00059143, train loss 10.9751, val loss 10.9761
step 48500: learning rate 0.00059134, train loss 10.9889, val loss 10.9782
step 48750: learning rate 0.00059125, train loss 10.9845, val loss 10.9730
step 49000: learning rate 0.00059116, train loss 10.9757, val loss 10.9715
step 49250: learning rate 0.00059107, train loss 10.9770, val loss 10.9734
step 49500: learning rate 0.00059098, train loss 10.9765, val loss 10.9782
step 49750: learning rate 0.00059089, train loss 10.9772, val loss 10.9729
step 50000: learning rate 0.00059080, train loss 10.9740, val loss 10.9810
step 50250: learning rate 0.00059071, train loss 10.9774, val loss 10.9738
step 50500: learning rate 0.00059062, train loss 10.9782, val loss 10.9782
step 50750: learning rate 0.00059052, train loss 10.9775, val loss 10.9795
step 51000: learning rate 0.00059043, train loss 10.9817, val loss 10.9730
step 51250: learning rate 0.00059034, train loss 10.9828, val loss 10.9793
step 51500: learning rate 0.00059024, train loss 10.9709, val loss 10.9799
step 51750: learning rate 0.00059015, train loss 10.9720, val loss 10.9787
step 52000: learning rate 0.00059005, train loss 10.9778, val loss 10.9832
step 52250: learning rate 0.00058996, train loss 10.9779, val loss 10.9800
step 52500: learning rate 0.00058986, train loss 10.9674, val loss 10.9702
step 52750: learning rate 0.00058977, train loss 10.9745, val loss 10.9698
step 53000: learning rate 0.00058967, train loss 10.9730, val loss 10.9746
step 53250: learning rate 0.00058957, train loss 10.9708, val loss 10.9718
step 53500: learning rate 0.00058948, train loss 10.9801, val loss 10.9747
step 53750: learning rate 0.00058938, train loss 10.9802, val loss 10.9791
step 54000: learning rate 0.00058928, train loss 10.9740, val loss 10.9851
step 54250: learning rate 0.00058918, train loss 10.9790, val loss 10.9771
step 54500: learning rate 0.00058908, train loss 10.9785, val loss 10.9738
step 54750: learning rate 0.00058898, train loss 10.9797, val loss 10.9719
step 55000: learning rate 0.00058888, train loss 10.9702, val loss 10.9835
step 55250: learning rate 0.00058878, train loss 10.9874, val loss 10.9771
step 55500: learning rate 0.00058868, train loss 10.9760, val loss 10.9760
step 55750: learning rate 0.00058858, train loss 10.9796, val loss 10.9724
step 56000: learning rate 0.00058848, train loss 10.9819, val loss 10.9709
step 56250: learning rate 0.00058837, train loss 10.9765, val loss 10.9721
step 56500: learning rate 0.00058827, train loss 10.9696, val loss 10.9743
step 56750: learning rate 0.00058817, train loss 10.9785, val loss 10.9757
step 57000: learning rate 0.00058806, train loss 10.9772, val loss 10.9733
step 57250: learning rate 0.00058796, train loss 10.9806, val loss 10.9705
step 57500: learning rate 0.00058786, train loss 10.9732, val loss 10.9706
step 57750: learning rate 0.00058775, train loss 10.9783, val loss 10.9874
step 58000: learning rate 0.00058764, train loss 10.9783, val loss 10.9716
step 58250: learning rate 0.00058754, train loss 10.9759, val loss 10.9748
step 58500: learning rate 0.00058743, train loss 10.9746, val loss 10.9763
step 58750: learning rate 0.00058733, train loss 10.9771, val loss 10.9751
step 59000: learning rate 0.00058722, train loss 10.9763, val loss 10.9742
step 59250: learning rate 0.00058711, train loss 10.9769, val loss 10.9758
step 59500: learning rate 0.00058700, train loss 10.9767, val loss 10.9738
step 59750: learning rate 0.00058689, train loss 10.9816, val loss 10.9798
step 60000: learning rate 0.00058679, train loss 10.9758, val loss 10.9754
step 60250: learning rate 0.00058668, train loss 10.9797, val loss 10.9730
step 60500: learning rate 0.00058657, train loss 10.9777, val loss 10.9864
step 60750: learning rate 0.00058646, train loss 10.9745, val loss 10.9745
step 61000: learning rate 0.00058634, train loss 10.9804, val loss 10.9792
step 61250: learning rate 0.00058623, train loss 10.9694, val loss 10.9691
step 61500: learning rate 0.00058612, train loss 10.9721, val loss 10.9790
step 61750: learning rate 0.00058601, train loss 10.9753, val loss 10.9712
step 62000: learning rate 0.00058590, train loss 10.9744, val loss 10.9719
step 62250: learning rate 0.00058578, train loss 10.9800, val loss 10.9802
step 62500: learning rate 0.00058567, train loss 10.9819, val loss 10.9751
step 62750: learning rate 0.00058556, train loss 10.9765, val loss 10.9807
step 63000: learning rate 0.00058544, train loss 10.9725, val loss 10.9871
step 63250: learning rate 0.00058533, train loss 10.9727, val loss 10.9720
step 63500: learning rate 0.00058521, train loss 10.9786, val loss 10.9775
step 63750: learning rate 0.00058510, train loss 10.9806, val loss 10.9792
step 64000: learning rate 0.00058498, train loss 10.9780, val loss 10.9707
step 64250: learning rate 0.00058487, train loss 10.9767, val loss 10.9781
step 64500: learning rate 0.00058475, train loss 10.9720, val loss 10.9748
step 64750: learning rate 0.00058463, train loss 10.9724, val loss 10.9736
step 65000: learning rate 0.00058451, train loss 10.9825, val loss 10.9781
step 65250: learning rate 0.00058440, train loss 10.9761, val loss 10.9781
step 65500: learning rate 0.00058428, train loss 10.9793, val loss 10.9754
step 65750: learning rate 0.00058416, train loss 10.9729, val loss 10.9709
step 66000: learning rate 0.00058404, train loss 10.9714, val loss 10.9760
step 66250: learning rate 0.00058392, train loss 10.9815, val loss 10.9773
step 66500: learning rate 0.00058380, train loss 10.9841, val loss 10.9789
step 66750: learning rate 0.00058368, train loss 10.9813, val loss 10.9722
step 67000: learning rate 0.00058356, train loss 10.9723, val loss 10.9720
step 67250: learning rate 0.00058343, train loss 10.9804, val loss 10.9693
step 67500: learning rate 0.00058331, train loss 10.9813, val loss 10.9732
step 67750: learning rate 0.00058319, train loss 10.9804, val loss 10.9778
step 68000: learning rate 0.00058307, train loss 10.9751, val loss 10.9730
step 68250: learning rate 0.00058294, train loss 10.9729, val loss 10.9776
step 68500: learning rate 0.00058282, train loss 10.9787, val loss 10.9722
step 68750: learning rate 0.00058269, train loss 10.9781, val loss 10.9736
step 69000: learning rate 0.00058257, train loss 10.9822, val loss 10.9735
step 69250: learning rate 0.00058244, train loss 10.9741, val loss 10.9720
step 69500: learning rate 0.00058232, train loss 10.9760, val loss 10.9736
step 69750: learning rate 0.00058219, train loss 10.9761, val loss 10.9810
step 70000: learning rate 0.00058207, train loss 10.9754, val loss 10.9789
step 70250: learning rate 0.00058194, train loss 10.9720, val loss 10.9829
step 70500: learning rate 0.00058181, train loss 10.9837, val loss 10.9804
step 70750: learning rate 0.00058168, train loss 10.9786, val loss 10.9765
step 71000: learning rate 0.00058156, train loss 10.9786, val loss 10.9784
step 71250: learning rate 0.00058143, train loss 10.9811, val loss 10.9762
step 71500: learning rate 0.00058130, train loss 10.9738, val loss 10.9807
step 71750: learning rate 0.00058117, train loss 10.9769, val loss 10.9750
step 72000: learning rate 0.00058104, train loss 10.9725, val loss 10.9738
step 72250: learning rate 0.00058091, train loss 10.9819, val loss 10.9743
step 72500: learning rate 0.00058078, train loss 10.9733, val loss 10.9821
step 72750: learning rate 0.00058065, train loss 10.9724, val loss 10.9747
step 73000: learning rate 0.00058052, train loss 10.9779, val loss 10.9740
step 73250: learning rate 0.00058038, train loss 10.9709, val loss 10.9748
step 73500: learning rate 0.00058025, train loss 10.9758, val loss 10.9765
step 73750: learning rate 0.00058012, train loss 10.9813, val loss 10.9766
step 74000: learning rate 0.00057999, train loss 10.9722, val loss 10.9725
step 74250: learning rate 0.00057985, train loss 10.9712, val loss 10.9714
step 74500: learning rate 0.00057972, train loss 10.9765, val loss 10.9874
step 74750: learning rate 0.00057958, train loss 10.9764, val loss 10.9774
step 75000: learning rate 0.00057945, train loss 10.9733, val loss 10.9822
step 75250: learning rate 0.00057931, train loss 10.9672, val loss 10.9719
step 75500: learning rate 0.00057918, train loss 10.9728, val loss 10.9714
step 75750: learning rate 0.00057904, train loss 10.9830, val loss 10.9764
step 76000: learning rate 0.00057890, train loss 10.9806, val loss 10.9788
step 76250: learning rate 0.00057877, train loss 10.9736, val loss 10.9733
step 76500: learning rate 0.00057863, train loss 10.9776, val loss 10.9777
step 76750: learning rate 0.00057849, train loss 10.9775, val loss 10.9815
step 77000: learning rate 0.00057835, train loss 10.9785, val loss 10.9747
step 77250: learning rate 0.00057821, train loss 10.9810, val loss 10.9780
step 77500: learning rate 0.00057807, train loss 10.9769, val loss 10.9693
step 77750: learning rate 0.00057793, train loss 10.9751, val loss 10.9724
step 78000: learning rate 0.00057779, train loss 10.9789, val loss 10.9728
step 78250: learning rate 0.00057765, train loss 10.9765, val loss 10.9698
step 78500: learning rate 0.00057751, train loss 10.9762, val loss 10.9736
step 78750: learning rate 0.00057737, train loss 10.9722, val loss 10.9730
step 79000: learning rate 0.00057723, train loss 10.9757, val loss 10.9763
step 79250: learning rate 0.00057709, train loss 10.9762, val loss 10.9765
step 79500: learning rate 0.00057694, train loss 10.9787, val loss 10.9814
step 79750: learning rate 0.00057680, train loss 10.9749, val loss 10.9716
step 80000: learning rate 0.00057666, train loss 10.9805, val loss 10.9768
step 80250: learning rate 0.00057651, train loss 10.9743, val loss 10.9806
step 80500: learning rate 0.00057637, train loss 10.9696, val loss 10.9793
step 80750: learning rate 0.00057622, train loss 10.9669, val loss 10.9776
step 81000: learning rate 0.00057608, train loss 10.9751, val loss 10.9760
step 81250: learning rate 0.00057593, train loss 10.9841, val loss 10.9764
step 81500: learning rate 0.00057579, train loss 10.9807, val loss 10.9798
step 81750: learning rate 0.00057564, train loss 10.9740, val loss 10.9752
step 82000: learning rate 0.00057549, train loss 10.9788, val loss 10.9752
step 82250: learning rate 0.00057535, train loss 10.9801, val loss 10.9740
step 82500: learning rate 0.00057520, train loss 10.9786, val loss 10.9726
step 82750: learning rate 0.00057505, train loss 10.9720, val loss 10.9722
step 83000: learning rate 0.00057490, train loss 10.9724, val loss 10.9795
step 83250: learning rate 0.00057475, train loss 10.9831, val loss 10.9799
step 83500: learning rate 0.00057460, train loss 10.9819, val loss 10.9776
step 83750: learning rate 0.00057445, train loss 10.9801, val loss 10.9731
step 84000: learning rate 0.00057430, train loss 10.9796, val loss 10.9807
step 84250: learning rate 0.00057415, train loss 10.9770, val loss 10.9730
step 84500: learning rate 0.00057400, train loss 10.9787, val loss 10.9756
step 84750: learning rate 0.00057385, train loss 10.9771, val loss 10.9723
step 85000: learning rate 0.00057370, train loss 10.9768, val loss 10.9772
step 85250: learning rate 0.00057355, train loss 10.9804, val loss 10.9719
step 85500: learning rate 0.00057339, train loss 10.9806, val loss 10.9732
step 85750: learning rate 0.00057324, train loss 10.9767, val loss 10.9739
step 86000: learning rate 0.00057309, train loss 10.9798, val loss 10.9752
step 86250: learning rate 0.00057293, train loss 10.9791, val loss 10.9757
step 86500: learning rate 0.00057278, train loss 10.9747, val loss 10.9746
step 86750: learning rate 0.00057262, train loss 10.9777, val loss 10.9775
step 87000: learning rate 0.00057247, train loss 10.9732, val loss 10.9816
step 87250: learning rate 0.00057231, train loss 10.9744, val loss 10.9735
step 87500: learning rate 0.00057216, train loss 10.9775, val loss 10.9777
step 87750: learning rate 0.00057200, train loss 10.9766, val loss 10.9731
step 88000: learning rate 0.00057184, train loss 10.9800, val loss 10.9853
step 88250: learning rate 0.00057168, train loss 10.9808, val loss 10.9751
step 88500: learning rate 0.00057153, train loss 10.9712, val loss 10.9779
step 88750: learning rate 0.00057137, train loss 10.9763, val loss 10.9711
step 89000: learning rate 0.00057121, train loss 10.9745, val loss 10.9731
step 89250: learning rate 0.00057105, train loss 10.9780, val loss 10.9784
step 89500: learning rate 0.00057089, train loss 10.9725, val loss 10.9739
step 89750: learning rate 0.00057073, train loss 10.9738, val loss 10.9813
step 90000: learning rate 0.00057057, train loss 10.9757, val loss 10.9761
step 90250: learning rate 0.00057041, train loss 10.9766, val loss 10.9742
step 90500: learning rate 0.00057025, train loss 10.9837, val loss 10.9749
step 90750: learning rate 0.00057009, train loss 10.9806, val loss 10.9760
step 91000: learning rate 0.00056993, train loss 10.9753, val loss 10.9630
step 91250: learning rate 0.00056976, train loss 10.9765, val loss 10.9814
step 91500: learning rate 0.00056960, train loss 10.9784, val loss 10.9727
step 91750: learning rate 0.00056944, train loss 10.9715, val loss 10.9777
step 92000: learning rate 0.00056927, train loss 10.9768, val loss 10.9831
step 92250: learning rate 0.00056911, train loss 10.9777, val loss 10.9754
step 92500: learning rate 0.00056895, train loss 10.9672, val loss 10.9789
step 92750: learning rate 0.00056878, train loss 10.9748, val loss 10.9805
step 93000: learning rate 0.00056862, train loss 10.9777, val loss 10.9767
step 93250: learning rate 0.00056845, train loss 10.9731, val loss 10.9791
step 93500: learning rate 0.00056829, train loss 10.9745, val loss 10.9733
step 93750: learning rate 0.00056812, train loss 10.9806, val loss 10.9830
step 94000: learning rate 0.00056795, train loss 10.9724, val loss 10.9700
step 94250: learning rate 0.00056778, train loss 10.9735, val loss 10.9737
step 94500: learning rate 0.00056762, train loss 10.9786, val loss 10.9707
step 94750: learning rate 0.00056745, train loss 10.9746, val loss 10.9809
step 95000: learning rate 0.00056728, train loss 10.9786, val loss 10.9685
step 95250: learning rate 0.00056711, train loss 10.9726, val loss 10.9829
step 95500: learning rate 0.00056694, train loss 10.9815, val loss 10.9716
step 95750: learning rate 0.00056677, train loss 10.9765, val loss 10.9741
step 96000: learning rate 0.00056660, train loss 10.9698, val loss 10.9796
step 96250: learning rate 0.00056643, train loss 10.9797, val loss 10.9736
step 96500: learning rate 0.00056626, train loss 10.9782, val loss 10.9755
step 96750: learning rate 0.00056609, train loss 10.9766, val loss 10.9751
step 97000: learning rate 0.00056592, train loss 10.9711, val loss 10.9667
step 97250: learning rate 0.00056575, train loss 10.9868, val loss 10.9789
step 97500: learning rate 0.00056557, train loss 10.9765, val loss 10.9806
step 97750: learning rate 0.00056540, train loss 10.9725, val loss 10.9807
step 98000: learning rate 0.00056523, train loss 10.9746, val loss 10.9748
step 98250: learning rate 0.00056505, train loss 10.9807, val loss 10.9798
step 98500: learning rate 0.00056488, train loss 10.9780, val loss 10.9737
step 98750: learning rate 0.00056471, train loss 10.9845, val loss 10.9774
step 99000: learning rate 0.00056453, train loss 10.9779, val loss 10.9767
step 99250: learning rate 0.00056436, train loss 10.9741, val loss 10.9828
step 99500: learning rate 0.00056418, train loss 10.9775, val loss 10.9807
step 99750: learning rate 0.00056400, train loss 10.9820, val loss 10.9798
step 100000: learning rate 0.00056383, train loss 10.9798, val loss 10.9863
step 100250: learning rate 0.00056365, train loss 10.9774, val loss 10.9822
step 100500: learning rate 0.00056347, train loss 10.9788, val loss 10.9748
step 100750: learning rate 0.00056329, train loss 10.9769, val loss 10.9779
step 101000: learning rate 0.00056312, train loss 10.9755, val loss 10.9792
step 101250: learning rate 0.00056294, train loss 10.9748, val loss 10.9750
step 101500: learning rate 0.00056276, train loss 10.9747, val loss 10.9695
step 101750: learning rate 0.00056258, train loss 10.9808, val loss 10.9813
step 102000: learning rate 0.00056240, train loss 10.9683, val loss 10.9730
step 102250: learning rate 0.00056222, train loss 10.9815, val loss 10.9744
step 102500: learning rate 0.00056204, train loss 10.9765, val loss 10.9779
step 102750: learning rate 0.00056186, train loss 10.9801, val loss 10.9728
step 103000: learning rate 0.00056168, train loss 10.9798, val loss 10.9743
step 103250: learning rate 0.00056150, train loss 10.9724, val loss 10.9824
step 103500: learning rate 0.00056131, train loss 10.9812, val loss 10.9793
step 103750: learning rate 0.00056113, train loss 10.9771, val loss 10.9780
step 104000: learning rate 0.00056095, train loss 10.9722, val loss 10.9728
step 104250: learning rate 0.00056077, train loss 10.9801, val loss 10.9729
step 104500: learning rate 0.00056058, train loss 10.9763, val loss 10.9779
step 104750: learning rate 0.00056040, train loss 10.9788, val loss 10.9717
step 105000: learning rate 0.00056021, train loss 10.9736, val loss 10.9699
step 105250: learning rate 0.00056003, train loss 10.9725, val loss 10.9800
step 105500: learning rate 0.00055984, train loss 10.9711, val loss 10.9722
step 105750: learning rate 0.00055966, train loss 10.9762, val loss 10.9670
step 106000: learning rate 0.00055947, train loss 10.9800, val loss 10.9779
step 106250: learning rate 0.00055928, train loss 10.9760, val loss 10.9747
step 106500: learning rate 0.00055910, train loss 10.9687, val loss 10.9870
step 106750: learning rate 0.00055891, train loss 10.9744, val loss 10.9749
step 107000: learning rate 0.00055872, train loss 10.9737, val loss 10.9726
step 107250: learning rate 0.00055853, train loss 10.9761, val loss 10.9728
step 107500: learning rate 0.00055835, train loss 10.9769, val loss 10.9796
step 107750: learning rate 0.00055816, train loss 10.9778, val loss 10.9748
step 108000: learning rate 0.00055797, train loss 10.9766, val loss 10.9777
step 108250: learning rate 0.00055778, train loss 10.9802, val loss 10.9769
step 108500: learning rate 0.00055759, train loss 10.9691, val loss 10.9821
step 108750: learning rate 0.00055740, train loss 10.9744, val loss 10.9633
step 109000: learning rate 0.00055721, train loss 10.9765, val loss 10.9731
step 109250: learning rate 0.00055702, train loss 10.9758, val loss 10.9720
step 109500: learning rate 0.00055683, train loss 10.9820, val loss 10.9767
step 109750: learning rate 0.00055663, train loss 10.9707, val loss 10.9787
step 110000: learning rate 0.00055644, train loss 10.9792, val loss 10.9810
step 110250: learning rate 0.00055625, train loss 10.9763, val loss 10.9702
step 110500: learning rate 0.00055606, train loss 10.9790, val loss 10.9741
step 110750: learning rate 0.00055586, train loss 10.9759, val loss 10.9830
step 111000: learning rate 0.00055567, train loss 10.9827, val loss 10.9725
step 111250: learning rate 0.00055547, train loss 10.9761, val loss 10.9726
step 111500: learning rate 0.00055528, train loss 10.9813, val loss 10.9713
step 111750: learning rate 0.00055508, train loss 10.9721, val loss 10.9743
step 112000: learning rate 0.00055489, train loss 10.9782, val loss 10.9752
step 112250: learning rate 0.00055469, train loss 10.9766, val loss 10.9802
step 112500: learning rate 0.00055450, train loss 10.9797, val loss 10.9661
step 112750: learning rate 0.00055430, train loss 10.9807, val loss 10.9726
step 113000: learning rate 0.00055410, train loss 10.9787, val loss 10.9771
step 113250: learning rate 0.00055391, train loss 10.9793, val loss 10.9762
step 113500: learning rate 0.00055371, train loss 10.9818, val loss 10.9785
step 113750: learning rate 0.00055351, train loss 10.9771, val loss 10.9779
step 114000: learning rate 0.00055331, train loss 10.9738, val loss 10.9817
step 114250: learning rate 0.00055311, train loss 10.9787, val loss 10.9727
step 114500: learning rate 0.00055291, train loss 10.9842, val loss 10.9723
step 114750: learning rate 0.00055271, train loss 10.9764, val loss 10.9804
step 115000: learning rate 0.00055251, train loss 10.9847, val loss 10.9738
step 115250: learning rate 0.00055231, train loss 10.9714, val loss 10.9841
step 115500: learning rate 0.00055211, train loss 10.9759, val loss 10.9699
step 115750: learning rate 0.00055191, train loss 10.9697, val loss 10.9727
step 116000: learning rate 0.00055171, train loss 10.9759, val loss 10.9755
step 116250: learning rate 0.00055151, train loss 10.9717, val loss 10.9728
step 116500: learning rate 0.00055131, train loss 10.9714, val loss 10.9763
step 116750: learning rate 0.00055110, train loss 10.9767, val loss 10.9798
step 117000: learning rate 0.00055090, train loss 10.9747, val loss 10.9762
step 117250: learning rate 0.00055070, train loss 10.9877, val loss 10.9754
step 117500: learning rate 0.00055049, train loss 10.9786, val loss 10.9686
step 117750: learning rate 0.00055029, train loss 10.9717, val loss 10.9722
step 118000: learning rate 0.00055008, train loss 10.9782, val loss 10.9795
step 118250: learning rate 0.00054988, train loss 10.9816, val loss 10.9673
step 118500: learning rate 0.00054967, train loss 10.9760, val loss 10.9778
step 118750: learning rate 0.00054947, train loss 10.9811, val loss 10.9754
step 119000: learning rate 0.00054926, train loss 10.9703, val loss 10.9824
step 119250: learning rate 0.00054906, train loss 10.9716, val loss 10.9768
step 119500: learning rate 0.00054885, train loss 10.9785, val loss 10.9764
step 119750: learning rate 0.00054864, train loss 10.9750, val loss 10.9750
step 120000: learning rate 0.00054843, train loss 10.9803, val loss 10.9799
step 120250: learning rate 0.00054823, train loss 10.9761, val loss 10.9725
step 120500: learning rate 0.00054802, train loss 10.9771, val loss 10.9843
step 120750: learning rate 0.00054781, train loss 10.9853, val loss 10.9827
step 121000: learning rate 0.00054760, train loss 10.9721, val loss 10.9817
step 121250: learning rate 0.00054739, train loss 10.9876, val loss 10.9723
step 121500: learning rate 0.00054718, train loss 10.9741, val loss 10.9728
step 121750: learning rate 0.00054697, train loss 10.9821, val loss 10.9754
step 122000: learning rate 0.00054676, train loss 10.9709, val loss 10.9774
step 122250: learning rate 0.00054655, train loss 10.9725, val loss 10.9732
step 122500: learning rate 0.00054634, train loss 10.9774, val loss 10.9778
step 122750: learning rate 0.00054613, train loss 10.9735, val loss 10.9753
step 123000: learning rate 0.00054591, train loss 10.9803, val loss 10.9758
step 123250: learning rate 0.00054570, train loss 10.9734, val loss 10.9800
step 123500: learning rate 0.00054549, train loss 10.9818, val loss 10.9833
step 123750: learning rate 0.00054528, train loss 10.9767, val loss 10.9717
step 124000: learning rate 0.00054506, train loss 10.9802, val loss 10.9792
step 124250: learning rate 0.00054485, train loss 10.9767, val loss 10.9751
step 124500: learning rate 0.00054463, train loss 10.9760, val loss 10.9723
step 124750: learning rate 0.00054442, train loss 10.9701, val loss 10.9767
step 125000: learning rate 0.00054421, train loss 10.9770, val loss 10.9724
step 125250: learning rate 0.00054399, train loss 10.9748, val loss 10.9796
step 125500: learning rate 0.00054377, train loss 10.9748, val loss 10.9778
step 125750: learning rate 0.00054356, train loss 10.9802, val loss 10.9742
step 126000: learning rate 0.00054334, train loss 10.9754, val loss 10.9801
step 126250: learning rate 0.00054313, train loss 10.9784, val loss 10.9758
step 126500: learning rate 0.00054291, train loss 10.9782, val loss 10.9704
step 126750: learning rate 0.00054269, train loss 10.9841, val loss 10.9825
step 127000: learning rate 0.00054247, train loss 10.9799, val loss 10.9828
step 127250: learning rate 0.00054225, train loss 10.9802, val loss 10.9824
step 127500: learning rate 0.00054204, train loss 10.9786, val loss 10.9802
step 127750: learning rate 0.00054182, train loss 10.9765, val loss 10.9736
step 128000: learning rate 0.00054160, train loss 10.9752, val loss 10.9702
step 128250: learning rate 0.00054138, train loss 10.9812, val loss 10.9786
step 128500: learning rate 0.00054116, train loss 10.9783, val loss 10.9832
step 128750: learning rate 0.00054094, train loss 10.9755, val loss 10.9660
step 129000: learning rate 0.00054072, train loss 10.9787, val loss 10.9791
step 129250: learning rate 0.00054050, train loss 10.9783, val loss 10.9771
step 129500: learning rate 0.00054027, train loss 10.9758, val loss 10.9834
step 129750: learning rate 0.00054005, train loss 10.9796, val loss 10.9785
step 130000: learning rate 0.00053983, train loss 10.9798, val loss 10.9791
step 130250: learning rate 0.00053961, train loss 10.9756, val loss 10.9712
step 130500: learning rate 0.00053938, train loss 10.9829, val loss 10.9736
step 130750: learning rate 0.00053916, train loss 10.9744, val loss 10.9763
step 131000: learning rate 0.00053894, train loss 10.9773, val loss 10.9767
step 131250: learning rate 0.00053871, train loss 10.9715, val loss 10.9787
step 131500: learning rate 0.00053849, train loss 10.9736, val loss 10.9766
step 131750: learning rate 0.00053826, train loss 10.9739, val loss 10.9760
step 132000: learning rate 0.00053804, train loss 10.9765, val loss 10.9739
step 132250: learning rate 0.00053781, train loss 10.9740, val loss 10.9729
step 132500: learning rate 0.00053759, train loss 10.9879, val loss 10.9733
step 132750: learning rate 0.00053736, train loss 10.9765, val loss 10.9772
step 133000: learning rate 0.00053713, train loss 10.9793, val loss 10.9827
step 133250: learning rate 0.00053691, train loss 10.9718, val loss 10.9777
step 133500: learning rate 0.00053668, train loss 10.9729, val loss 10.9733
step 133750: learning rate 0.00053645, train loss 10.9757, val loss 10.9769
step 134000: learning rate 0.00053622, train loss 10.9724, val loss 10.9783
step 134250: learning rate 0.00053600, train loss 10.9779, val loss 10.9768
step 134500: learning rate 0.00053577, train loss 10.9780, val loss 10.9767
step 134750: learning rate 0.00053554, train loss 10.9752, val loss 10.9775
step 135000: learning rate 0.00053531, train loss 10.9786, val loss 10.9810
step 135250: learning rate 0.00053508, train loss 10.9736, val loss 10.9759
step 135500: learning rate 0.00053485, train loss 10.9745, val loss 10.9712
step 135750: learning rate 0.00053462, train loss 10.9705, val loss 10.9699
step 136000: learning rate 0.00053439, train loss 10.9675, val loss 10.9713
step 136250: learning rate 0.00053416, train loss 10.9785, val loss 10.9815
step 136500: learning rate 0.00053393, train loss 10.9785, val loss 10.9794
step 136750: learning rate 0.00053369, train loss 10.9736, val loss 10.9753
step 137000: learning rate 0.00053346, train loss 10.9853, val loss 10.9846
step 137250: learning rate 0.00053323, train loss 10.9753, val loss 10.9767
step 137500: learning rate 0.00053300, train loss 10.9851, val loss 10.9763
step 137750: learning rate 0.00053276, train loss 10.9803, val loss 10.9841
step 138000: learning rate 0.00053253, train loss 10.9736, val loss 10.9804
step 138250: learning rate 0.00053230, train loss 10.9815, val loss 10.9843
step 138500: learning rate 0.00053206, train loss 10.9742, val loss 10.9755
step 138750: learning rate 0.00053183, train loss 10.9776, val loss 10.9797
step 139000: learning rate 0.00053159, train loss 10.9795, val loss 10.9770
step 139250: learning rate 0.00053136, train loss 10.9755, val loss 10.9762
step 139500: learning rate 0.00053112, train loss 10.9772, val loss 10.9740
step 139750: learning rate 0.00053089, train loss 10.9686, val loss 10.9732
step 140000: learning rate 0.00053065, train loss 10.9784, val loss 10.9820
step 140250: learning rate 0.00053041, train loss 10.9769, val loss 10.9722
step 140500: learning rate 0.00053018, train loss 10.9754, val loss 10.9762
step 140750: learning rate 0.00052994, train loss 10.9783, val loss 10.9738
step 141000: learning rate 0.00052970, train loss 10.9793, val loss 10.9867
step 141250: learning rate 0.00052946, train loss 10.9774, val loss 10.9757
step 141500: learning rate 0.00052922, train loss 10.9718, val loss 10.9843
step 141750: learning rate 0.00052899, train loss 10.9790, val loss 10.9830
step 142000: learning rate 0.00052875, train loss 10.9801, val loss 10.9694
step 142250: learning rate 0.00052851, train loss 10.9763, val loss 10.9718
step 142500: learning rate 0.00052827, train loss 10.9722, val loss 10.9803
step 142750: learning rate 0.00052803, train loss 10.9799, val loss 10.9753
step 143000: learning rate 0.00052779, train loss 10.9725, val loss 10.9746
step 143250: learning rate 0.00052755, train loss 10.9825, val loss 10.9716
step 143500: learning rate 0.00052730, train loss 10.9786, val loss 10.9747
step 143750: learning rate 0.00052706, train loss 10.9800, val loss 10.9775
step 144000: learning rate 0.00052682, train loss 10.9798, val loss 10.9687
step 144250: learning rate 0.00052658, train loss 10.9801, val loss 10.9696
step 144500: learning rate 0.00052634, train loss 10.9759, val loss 10.9788
step 144750: learning rate 0.00052609, train loss 10.9739, val loss 10.9717
step 145000: learning rate 0.00052585, train loss 10.9790, val loss 10.9769
step 145250: learning rate 0.00052561, train loss 10.9838, val loss 10.9822
step 145500: learning rate 0.00052536, train loss 10.9751, val loss 10.9780
step 145750: learning rate 0.00052512, train loss 10.9788, val loss 10.9745
step 146000: learning rate 0.00052488, train loss 10.9802, val loss 10.9784
step 146250: learning rate 0.00052463, train loss 10.9737, val loss 10.9676
step 146500: learning rate 0.00052439, train loss 10.9777, val loss 10.9811
step 146750: learning rate 0.00052414, train loss 10.9781, val loss 10.9762
step 147000: learning rate 0.00052389, train loss 10.9782, val loss 10.9794
step 147250: learning rate 0.00052365, train loss 10.9758, val loss 10.9740
step 147500: learning rate 0.00052340, train loss 10.9858, val loss 10.9782
step 147750: learning rate 0.00052315, train loss 10.9717, val loss 10.9799
step 148000: learning rate 0.00052291, train loss 10.9776, val loss 10.9705
step 148250: learning rate 0.00052266, train loss 10.9721, val loss 10.9807
step 148500: learning rate 0.00052241, train loss 10.9769, val loss 10.9800
step 148750: learning rate 0.00052216, train loss 10.9790, val loss 10.9780
step 149000: learning rate 0.00052192, train loss 10.9791, val loss 10.9756
step 149250: learning rate 0.00052167, train loss 10.9756, val loss 10.9820
step 149500: learning rate 0.00052142, train loss 10.9815, val loss 10.9826
step 149750: learning rate 0.00052117, train loss 10.9768, val loss 10.9763
step 150000: learning rate 0.00052092, train loss 10.9739, val loss 10.9742
step 150250: learning rate 0.00052067, train loss 10.9887, val loss 10.9767
step 150500: learning rate 0.00052042, train loss 10.9753, val loss 10.9741
step 150750: learning rate 0.00052017, train loss 10.9751, val loss 10.9659
step 151000: learning rate 0.00051992, train loss 10.9744, val loss 10.9744
step 151250: learning rate 0.00051967, train loss 10.9709, val loss 10.9799
step 151500: learning rate 0.00051941, train loss 10.9793, val loss 10.9702
step 151750: learning rate 0.00051916, train loss 10.9784, val loss 10.9814
step 152000: learning rate 0.00051891, train loss 10.9778, val loss 10.9887
step 152250: learning rate 0.00051866, train loss 10.9764, val loss 10.9824
step 152500: learning rate 0.00051840, train loss 10.9767, val loss 10.9809
step 152750: learning rate 0.00051815, train loss 10.9774, val loss 10.9783
step 153000: learning rate 0.00051790, train loss 10.9783, val loss 10.9756
step 153250: learning rate 0.00051764, train loss 10.9797, val loss 10.9738
step 153500: learning rate 0.00051739, train loss 10.9855, val loss 10.9837
step 153750: learning rate 0.00051713, train loss 10.9821, val loss 10.9715
step 154000: learning rate 0.00051688, train loss 10.9816, val loss 10.9708
step 154250: learning rate 0.00051662, train loss 10.9774, val loss 10.9815
step 154500: learning rate 0.00051637, train loss 10.9743, val loss 10.9839
step 154750: learning rate 0.00051611, train loss 10.9732, val loss 10.9737
step 155000: learning rate 0.00051586, train loss 10.9756, val loss 10.9816
step 155250: learning rate 0.00051560, train loss 10.9817, val loss 10.9775
step 155500: learning rate 0.00051534, train loss 10.9791, val loss 10.9770
step 155750: learning rate 0.00051509, train loss 10.9689, val loss 10.9742
step 156000: learning rate 0.00051483, train loss 10.9731, val loss 10.9790
step 156250: learning rate 0.00051457, train loss 10.9847, val loss 10.9763
step 156500: learning rate 0.00051431, train loss 10.9785, val loss 10.9826
step 156750: learning rate 0.00051405, train loss 10.9801, val loss 10.9738
step 157000: learning rate 0.00051379, train loss 10.9739, val loss 10.9703
step 157250: learning rate 0.00051354, train loss 10.9708, val loss 10.9723
step 157500: learning rate 0.00051328, train loss 10.9759, val loss 10.9710
step 157750: learning rate 0.00051302, train loss 10.9813, val loss 10.9806
step 158000: learning rate 0.00051276, train loss 10.9828, val loss 10.9739
step 158250: learning rate 0.00051250, train loss 10.9742, val loss 10.9778
step 158500: learning rate 0.00051224, train loss 10.9761, val loss 10.9787
step 158750: learning rate 0.00051197, train loss 10.9759, val loss 10.9817
step 159000: learning rate 0.00051171, train loss 10.9801, val loss 10.9758
step 159250: learning rate 0.00051145, train loss 10.9813, val loss 10.9820
step 159500: learning rate 0.00051119, train loss 10.9796, val loss 10.9709
step 159750: learning rate 0.00051093, train loss 10.9824, val loss 10.9817
step 160000: learning rate 0.00051067, train loss 10.9770, val loss 10.9844
step 160250: learning rate 0.00051040, train loss 10.9757, val loss 10.9764
step 160500: learning rate 0.00051014, train loss 10.9840, val loss 10.9681
step 160750: learning rate 0.00050988, train loss 10.9705, val loss 10.9748
step 161000: learning rate 0.00050961, train loss 10.9793, val loss 10.9776
step 161250: learning rate 0.00050935, train loss 10.9752, val loss 10.9867
step 161500: learning rate 0.00050908, train loss 10.9815, val loss 10.9773
step 161750: learning rate 0.00050882, train loss 10.9746, val loss 10.9806
step 162000: learning rate 0.00050855, train loss 10.9781, val loss 10.9697
step 162250: learning rate 0.00050829, train loss 10.9780, val loss 10.9778
step 162500: learning rate 0.00050802, train loss 10.9844, val loss 10.9801
step 162750: learning rate 0.00050776, train loss 10.9738, val loss 10.9761
step 163000: learning rate 0.00050749, train loss 10.9765, val loss 10.9730
step 163250: learning rate 0.00050722, train loss 10.9769, val loss 10.9704
step 163500: learning rate 0.00050696, train loss 10.9812, val loss 10.9689
step 163750: learning rate 0.00050669, train loss 10.9786, val loss 10.9717
step 164000: learning rate 0.00050642, train loss 10.9746, val loss 10.9761
step 164250: learning rate 0.00050616, train loss 10.9811, val loss 10.9813
step 164500: learning rate 0.00050589, train loss 10.9773, val loss 10.9747
step 164750: learning rate 0.00050562, train loss 10.9744, val loss 10.9770
step 165000: learning rate 0.00050535, train loss 10.9783, val loss 10.9736
step 165250: learning rate 0.00050508, train loss 10.9706, val loss 10.9739
step 165500: learning rate 0.00050481, train loss 10.9761, val loss 10.9685
step 165750: learning rate 0.00050454, train loss 10.9736, val loss 10.9709
step 166000: learning rate 0.00050427, train loss 10.9793, val loss 10.9761
step 166250: learning rate 0.00050400, train loss 10.9755, val loss 10.9702
step 166500: learning rate 0.00050373, train loss 10.9717, val loss 10.9738
step 166750: learning rate 0.00050346, train loss 10.9772, val loss 10.9748
step 167000: learning rate 0.00050319, train loss 10.9708, val loss 10.9744
step 167250: learning rate 0.00050292, train loss 10.9838, val loss 10.9742
step 167500: learning rate 0.00050265, train loss 10.9742, val loss 10.9821
step 167750: learning rate 0.00050238, train loss 10.9713, val loss 10.9787
step 168000: learning rate 0.00050210, train loss 10.9764, val loss 10.9778
step 168250: learning rate 0.00050183, train loss 10.9797, val loss 10.9715
step 168500: learning rate 0.00050156, train loss 10.9821, val loss 10.9785
step 168750: learning rate 0.00050129, train loss 10.9737, val loss 10.9733
step 169000: learning rate 0.00050101, train loss 10.9699, val loss 10.9815
step 169250: learning rate 0.00050074, train loss 10.9753, val loss 10.9780
step 169500: learning rate 0.00050047, train loss 10.9743, val loss 10.9768
step 169750: learning rate 0.00050019, train loss 10.9743, val loss 10.9731
step 170000: learning rate 0.00049992, train loss 10.9753, val loss 10.9748
step 170250: learning rate 0.00049964, train loss 10.9774, val loss 10.9766
step 170500: learning rate 0.00049937, train loss 10.9706, val loss 10.9767
step 170750: learning rate 0.00049909, train loss 10.9762, val loss 10.9655
step 171000: learning rate 0.00049882, train loss 10.9787, val loss 10.9753
step 171250: learning rate 0.00049854, train loss 10.9720, val loss 10.9701
step 171500: learning rate 0.00049826, train loss 10.9742, val loss 10.9766
step 171750: learning rate 0.00049799, train loss 10.9775, val loss 10.9776
step 172000: learning rate 0.00049771, train loss 10.9808, val loss 10.9714
step 172250: learning rate 0.00049743, train loss 10.9790, val loss 10.9790
step 172500: learning rate 0.00049716, train loss 10.9762, val loss 10.9817
step 172750: learning rate 0.00049688, train loss 10.9814, val loss 10.9796
step 173000: learning rate 0.00049660, train loss 10.9843, val loss 10.9815
step 173250: learning rate 0.00049632, train loss 10.9777, val loss 10.9792
step 173500: learning rate 0.00049604, train loss 10.9789, val loss 10.9725
step 173750: learning rate 0.00049576, train loss 10.9755, val loss 10.9816
step 174000: learning rate 0.00049548, train loss 10.9703, val loss 10.9801
step 174250: learning rate 0.00049521, train loss 10.9854, val loss 10.9845
step 174500: learning rate 0.00049493, train loss 10.9739, val loss 10.9830
step 174750: learning rate 0.00049465, train loss 10.9785, val loss 10.9804
step 175000: learning rate 0.00049437, train loss 10.9760, val loss 10.9764
step 175250: learning rate 0.00049409, train loss 10.9788, val loss 10.9738
step 175500: learning rate 0.00049380, train loss 10.9872, val loss 10.9698
step 175750: learning rate 0.00049352, train loss 10.9784, val loss 10.9808
step 176000: learning rate 0.00049324, train loss 10.9738, val loss 10.9708
step 176250: learning rate 0.00049296, train loss 10.9853, val loss 10.9824
step 176500: learning rate 0.00049268, train loss 10.9739, val loss 10.9720
step 176750: learning rate 0.00049240, train loss 10.9706, val loss 10.9794
step 177000: learning rate 0.00049211, train loss 10.9695, val loss 10.9777
step 177250: learning rate 0.00049183, train loss 10.9797, val loss 10.9726
step 177500: learning rate 0.00049155, train loss 10.9742, val loss 10.9750
step 177750: learning rate 0.00049126, train loss 10.9814, val loss 10.9758
step 178000: learning rate 0.00049098, train loss 10.9761, val loss 10.9700
step 178250: learning rate 0.00049070, train loss 10.9752, val loss 10.9733
step 178500: learning rate 0.00049041, train loss 10.9697, val loss 10.9745
step 178750: learning rate 0.00049013, train loss 10.9740, val loss 10.9719
step 179000: learning rate 0.00048984, train loss 10.9730, val loss 10.9776
step 179250: learning rate 0.00048956, train loss 10.9779, val loss 10.9820
step 179500: learning rate 0.00048927, train loss 10.9792, val loss 10.9783
step 179750: learning rate 0.00048899, train loss 10.9774, val loss 10.9802
step 180000: learning rate 0.00048870, train loss 10.9834, val loss 10.9656
step 180250: learning rate 0.00048842, train loss 10.9827, val loss 10.9766
step 180500: learning rate 0.00048813, train loss 10.9746, val loss 10.9757
step 180750: learning rate 0.00048784, train loss 10.9704, val loss 10.9802
step 181000: learning rate 0.00048756, train loss 10.9721, val loss 10.9807
step 181250: learning rate 0.00048727, train loss 10.9707, val loss 10.9773
step 181500: learning rate 0.00048698, train loss 10.9776, val loss 10.9813
step 181750: learning rate 0.00048669, train loss 10.9757, val loss 10.9787
step 182000: learning rate 0.00048641, train loss 10.9795, val loss 10.9735
step 182250: learning rate 0.00048612, train loss 10.9666, val loss 10.9777
step 182500: learning rate 0.00048583, train loss 10.9805, val loss 10.9772
step 182750: learning rate 0.00048554, train loss 10.9822, val loss 10.9838
step 183000: learning rate 0.00048525, train loss 10.9697, val loss 10.9781
step 183250: learning rate 0.00048496, train loss 10.9858, val loss 10.9781
step 183500: learning rate 0.00048467, train loss 10.9848, val loss 10.9759
step 183750: learning rate 0.00048438, train loss 10.9848, val loss 10.9782
step 184000: learning rate 0.00048409, train loss 10.9818, val loss 10.9769
step 184250: learning rate 0.00048380, train loss 10.9793, val loss 10.9673
step 184500: learning rate 0.00048351, train loss 10.9802, val loss 10.9819
step 184750: learning rate 0.00048322, train loss 10.9669, val loss 10.9760
step 185000: learning rate 0.00048293, train loss 10.9725, val loss 10.9805
step 185250: learning rate 0.00048264, train loss 10.9804, val loss 10.9719
step 185500: learning rate 0.00048235, train loss 10.9795, val loss 10.9757
step 185750: learning rate 0.00048205, train loss 10.9787, val loss 10.9728
step 186000: learning rate 0.00048176, train loss 10.9771, val loss 10.9782
step 186250: learning rate 0.00048147, train loss 10.9758, val loss 10.9812
step 186500: learning rate 0.00048118, train loss 10.9811, val loss 10.9730
step 186750: learning rate 0.00048088, train loss 10.9742, val loss 10.9742
step 187000: learning rate 0.00048059, train loss 10.9749, val loss 10.9818
step 187250: learning rate 0.00048030, train loss 10.9762, val loss 10.9663
step 187500: learning rate 0.00048000, train loss 10.9760, val loss 10.9806
step 187750: learning rate 0.00047971, train loss 10.9783, val loss 10.9806
step 188000: learning rate 0.00047942, train loss 10.9810, val loss 10.9706
step 188250: learning rate 0.00047912, train loss 10.9792, val loss 10.9786
step 188500: learning rate 0.00047883, train loss 10.9804, val loss 10.9774
step 188750: learning rate 0.00047853, train loss 10.9769, val loss 10.9817
step 189000: learning rate 0.00047824, train loss 10.9792, val loss 10.9720
step 189250: learning rate 0.00047794, train loss 10.9823, val loss 10.9707
step 189500: learning rate 0.00047764, train loss 10.9797, val loss 10.9793
step 189750: learning rate 0.00047735, train loss 10.9740, val loss 10.9823
step 190000: learning rate 0.00047705, train loss 10.9723, val loss 10.9844
step 190250: learning rate 0.00047676, train loss 10.9753, val loss 10.9795
step 190500: learning rate 0.00047646, train loss 10.9803, val loss 10.9724
step 190750: learning rate 0.00047616, train loss 10.9803, val loss 10.9812
step 191000: learning rate 0.00047586, train loss 10.9754, val loss 10.9724
step 191250: learning rate 0.00047557, train loss 10.9770, val loss 10.9808
step 191500: learning rate 0.00047527, train loss 10.9764, val loss 10.9770
step 191750: learning rate 0.00047497, train loss 10.9817, val loss 10.9754
step 192000: learning rate 0.00047467, train loss 10.9767, val loss 10.9791
step 192250: learning rate 0.00047437, train loss 10.9771, val loss 10.9858
step 192500: learning rate 0.00047408, train loss 10.9819, val loss 10.9771
step 192750: learning rate 0.00047378, train loss 10.9771, val loss 10.9867
step 193000: learning rate 0.00047348, train loss 10.9708, val loss 10.9826
step 193250: learning rate 0.00047318, train loss 10.9744, val loss 10.9796
step 193500: learning rate 0.00047288, train loss 10.9827, val loss 10.9792
step 193750: learning rate 0.00047258, train loss 10.9773, val loss 10.9768
step 194000: learning rate 0.00047228, train loss 10.9751, val loss 10.9791
step 194250: learning rate 0.00047198, train loss 10.9736, val loss 10.9738
step 194500: learning rate 0.00047168, train loss 10.9763, val loss 10.9836
step 194750: learning rate 0.00047138, train loss 10.9762, val loss 10.9740
step 195000: learning rate 0.00047107, train loss 10.9727, val loss 10.9764
step 195250: learning rate 0.00047077, train loss 10.9785, val loss 10.9787
step 195500: learning rate 0.00047047, train loss 10.9803, val loss 10.9707
step 195750: learning rate 0.00047017, train loss 10.9803, val loss 10.9781
step 196000: learning rate 0.00046987, train loss 10.9810, val loss 10.9780
step 196250: learning rate 0.00046956, train loss 10.9721, val loss 10.9788
step 196500: learning rate 0.00046926, train loss 10.9738, val loss 10.9794
step 196750: learning rate 0.00046896, train loss 10.9758, val loss 10.9749
step 197000: learning rate 0.00046866, train loss 10.9764, val loss 10.9731
step 197250: learning rate 0.00046835, train loss 10.9783, val loss 10.9797
step 197500: learning rate 0.00046805, train loss 10.9742, val loss 10.9728
step 197750: learning rate 0.00046775, train loss 10.9700, val loss 10.9806
step 198000: learning rate 0.00046744, train loss 10.9756, val loss 10.9727
step 198250: learning rate 0.00046714, train loss 10.9790, val loss 10.9750
step 198500: learning rate 0.00046683, train loss 10.9775, val loss 10.9806
step 198750: learning rate 0.00046653, train loss 10.9882, val loss 10.9708
step 199000: learning rate 0.00046622, train loss 10.9763, val loss 10.9799
step 199250: learning rate 0.00046592, train loss 10.9747, val loss 10.9716
step 199500: learning rate 0.00046561, train loss 10.9869, val loss 10.9700
step 199750: learning rate 0.00046531, train loss 10.9747, val loss 10.9767
step 200000: learning rate 0.00046500, train loss 10.9792, val loss 10.9803
step 200250: learning rate 0.00046469, train loss 10.9748, val loss 10.9782
step 200500: learning rate 0.00046439, train loss 10.9681, val loss 10.9788
step 200750: learning rate 0.00046408, train loss 10.9756, val loss 10.9792
step 201000: learning rate 0.00046377, train loss 10.9830, val loss 10.9818
step 201250: learning rate 0.00046347, train loss 10.9745, val loss 10.9733
step 201500: learning rate 0.00046316, train loss 10.9757, val loss 10.9790
step 201750: learning rate 0.00046285, train loss 10.9807, val loss 10.9758
step 202000: learning rate 0.00046254, train loss 10.9731, val loss 10.9784
step 202250: learning rate 0.00046224, train loss 10.9788, val loss 10.9701
step 202500: learning rate 0.00046193, train loss 10.9740, val loss 10.9750
step 202750: learning rate 0.00046162, train loss 10.9772, val loss 10.9780
step 203000: learning rate 0.00046131, train loss 10.9869, val loss 10.9737
step 203250: learning rate 0.00046100, train loss 10.9788, val loss 10.9798
step 203500: learning rate 0.00046069, train loss 10.9790, val loss 10.9774
step 203750: learning rate 0.00046038, train loss 10.9786, val loss 10.9738
step 204000: learning rate 0.00046007, train loss 10.9727, val loss 10.9770
step 204250: learning rate 0.00045976, train loss 10.9774, val loss 10.9759
step 204500: learning rate 0.00045945, train loss 10.9717, val loss 10.9792
step 204750: learning rate 0.00045914, train loss 10.9859, val loss 10.9669
step 205000: learning rate 0.00045883, train loss 10.9767, val loss 10.9764
step 205250: learning rate 0.00045852, train loss 10.9721, val loss 10.9774
step 205500: learning rate 0.00045821, train loss 10.9707, val loss 10.9783
step 205750: learning rate 0.00045790, train loss 10.9787, val loss 10.9793
step 206000: learning rate 0.00045759, train loss 10.9799, val loss 10.9719
step 206250: learning rate 0.00045728, train loss 10.9736, val loss 10.9679
step 206500: learning rate 0.00045697, train loss 10.9791, val loss 10.9744
step 206750: learning rate 0.00045665, train loss 10.9763, val loss 10.9749
step 207000: learning rate 0.00045634, train loss 10.9754, val loss 10.9818
step 207250: learning rate 0.00045603, train loss 10.9655, val loss 10.9730
step 207500: learning rate 0.00045572, train loss 10.9878, val loss 10.9723
step 207750: learning rate 0.00045540, train loss 10.9762, val loss 10.9802
step 208000: learning rate 0.00045509, train loss 10.9760, val loss 10.9734
step 208250: learning rate 0.00045478, train loss 10.9828, val loss 10.9814
step 208500: learning rate 0.00045446, train loss 10.9714, val loss 10.9719
step 208750: learning rate 0.00045415, train loss 10.9774, val loss 10.9747
step 209000: learning rate 0.00045384, train loss 10.9734, val loss 10.9736
step 209250: learning rate 0.00045352, train loss 10.9753, val loss 10.9780
step 209500: learning rate 0.00045321, train loss 10.9740, val loss 10.9722
step 209750: learning rate 0.00045289, train loss 10.9710, val loss 10.9825
step 210000: learning rate 0.00045258, train loss 10.9757, val loss 10.9781
step 210250: learning rate 0.00045226, train loss 10.9759, val loss 10.9762
step 210500: learning rate 0.00045195, train loss 10.9802, val loss 10.9786
step 210750: learning rate 0.00045163, train loss 10.9804, val loss 10.9690
step 211000: learning rate 0.00045132, train loss 10.9830, val loss 10.9802
step 211250: learning rate 0.00045100, train loss 10.9781, val loss 10.9783
step 211500: learning rate 0.00045068, train loss 10.9857, val loss 10.9757
step 211750: learning rate 0.00045037, train loss 10.9800, val loss 10.9832
step 212000: learning rate 0.00045005, train loss 10.9821, val loss 10.9834
step 212250: learning rate 0.00044973, train loss 10.9802, val loss 10.9848
step 212500: learning rate 0.00044942, train loss 10.9779, val loss 10.9740
step 212750: learning rate 0.00044910, train loss 10.9668, val loss 10.9816
step 213000: learning rate 0.00044878, train loss 10.9816, val loss 10.9755
step 213250: learning rate 0.00044847, train loss 10.9720, val loss 10.9761
step 213500: learning rate 0.00044815, train loss 10.9824, val loss 10.9734
step 213750: learning rate 0.00044783, train loss 10.9770, val loss 10.9818
step 214000: learning rate 0.00044751, train loss 10.9783, val loss 10.9774
step 214250: learning rate 0.00044719, train loss 10.9748, val loss 10.9742
step 214500: learning rate 0.00044688, train loss 10.9803, val loss 10.9757
step 214750: learning rate 0.00044656, train loss 10.9741, val loss 10.9784
step 215000: learning rate 0.00044624, train loss 10.9753, val loss 10.9763
step 215250: learning rate 0.00044592, train loss 10.9773, val loss 10.9769
step 215500: learning rate 0.00044560, train loss 10.9688, val loss 10.9777
step 215750: learning rate 0.00044528, train loss 10.9749, val loss 10.9761
step 216000: learning rate 0.00044496, train loss 10.9794, val loss 10.9703
step 216250: learning rate 0.00044464, train loss 10.9726, val loss 10.9789
step 216500: learning rate 0.00044432, train loss 10.9769, val loss 10.9788
step 216750: learning rate 0.00044400, train loss 10.9795, val loss 10.9753
step 217000: learning rate 0.00044368, train loss 10.9768, val loss 10.9739
step 217250: learning rate 0.00044336, train loss 10.9838, val loss 10.9758
step 217500: learning rate 0.00044304, train loss 10.9742, val loss 10.9804
step 217750: learning rate 0.00044272, train loss 10.9745, val loss 10.9754
step 218000: learning rate 0.00044240, train loss 10.9741, val loss 10.9776
step 218250: learning rate 0.00044207, train loss 10.9748, val loss 10.9684
step 218500: learning rate 0.00044175, train loss 10.9708, val loss 10.9763
step 218750: learning rate 0.00044143, train loss 10.9735, val loss 10.9787
step 219000: learning rate 0.00044111, train loss 10.9761, val loss 10.9723
step 219250: learning rate 0.00044079, train loss 10.9718, val loss 10.9746
step 219500: learning rate 0.00044046, train loss 10.9751, val loss 10.9802
step 219750: learning rate 0.00044014, train loss 10.9751, val loss 10.9765
step 220000: learning rate 0.00043982, train loss 10.9732, val loss 10.9712
step 220250: learning rate 0.00043950, train loss 10.9766, val loss 10.9699
step 220500: learning rate 0.00043917, train loss 10.9838, val loss 10.9815
step 220750: learning rate 0.00043885, train loss 10.9744, val loss 10.9708
step 221000: learning rate 0.00043853, train loss 10.9763, val loss 10.9834
step 221250: learning rate 0.00043820, train loss 10.9758, val loss 10.9812
step 221500: learning rate 0.00043788, train loss 10.9752, val loss 10.9744
step 221750: learning rate 0.00043755, train loss 10.9787, val loss 10.9856
step 222000: learning rate 0.00043723, train loss 10.9751, val loss 10.9785
step 222250: learning rate 0.00043691, train loss 10.9750, val loss 10.9728
step 222500: learning rate 0.00043658, train loss 10.9827, val loss 10.9714
step 222750: learning rate 0.00043626, train loss 10.9827, val loss 10.9784
step 223000: learning rate 0.00043593, train loss 10.9811, val loss 10.9736
step 223250: learning rate 0.00043561, train loss 10.9815, val loss 10.9759
step 223500: learning rate 0.00043528, train loss 10.9781, val loss 10.9737
step 223750: learning rate 0.00043495, train loss 10.9749, val loss 10.9810
step 224000: learning rate 0.00043463, train loss 10.9732, val loss 10.9835
step 224250: learning rate 0.00043430, train loss 10.9720, val loss 10.9716
step 224500: learning rate 0.00043398, train loss 10.9708, val loss 10.9772
step 224750: learning rate 0.00043365, train loss 10.9781, val loss 10.9737
step 225000: learning rate 0.00043332, train loss 10.9769, val loss 10.9766
step 225250: learning rate 0.00043300, train loss 10.9770, val loss 10.9780
step 225500: learning rate 0.00043267, train loss 10.9762, val loss 10.9745
step 225750: learning rate 0.00043234, train loss 10.9771, val loss 10.9788
step 226000: learning rate 0.00043202, train loss 10.9760, val loss 10.9755
step 226250: learning rate 0.00043169, train loss 10.9744, val loss 10.9705
step 226500: learning rate 0.00043136, train loss 10.9783, val loss 10.9762
step 226750: learning rate 0.00043103, train loss 10.9797, val loss 10.9776
step 227000: learning rate 0.00043071, train loss 10.9739, val loss 10.9722
step 227250: learning rate 0.00043038, train loss 10.9735, val loss 10.9715
step 227500: learning rate 0.00043005, train loss 10.9794, val loss 10.9813
step 227750: learning rate 0.00042972, train loss 10.9780, val loss 10.9751
step 228000: learning rate 0.00042939, train loss 10.9743, val loss 10.9738
step 228250: learning rate 0.00042906, train loss 10.9829, val loss 10.9791
step 228500: learning rate 0.00042874, train loss 10.9755, val loss 10.9747
step 228750: learning rate 0.00042841, train loss 10.9766, val loss 10.9672
step 229000: learning rate 0.00042808, train loss 10.9648, val loss 10.9787
step 229250: learning rate 0.00042775, train loss 10.9769, val loss 10.9754
step 229500: learning rate 0.00042742, train loss 10.9754, val loss 10.9767
step 229750: learning rate 0.00042709, train loss 10.9805, val loss 10.9842
step 230000: learning rate 0.00042676, train loss 10.9815, val loss 10.9697
step 230250: learning rate 0.00042643, train loss 10.9807, val loss 10.9729
step 230500: learning rate 0.00042610, train loss 10.9800, val loss 10.9806
step 230750: learning rate 0.00042577, train loss 10.9796, val loss 10.9752
step 231000: learning rate 0.00042544, train loss 10.9792, val loss 10.9752
step 231250: learning rate 0.00042511, train loss 10.9757, val loss 10.9728
step 231500: learning rate 0.00042478, train loss 10.9809, val loss 10.9700
step 231750: learning rate 0.00042445, train loss 10.9800, val loss 10.9779
step 232000: learning rate 0.00042411, train loss 10.9743, val loss 10.9862
step 232250: learning rate 0.00042378, train loss 10.9783, val loss 10.9744
step 232500: learning rate 0.00042345, train loss 10.9723, val loss 10.9740
step 232750: learning rate 0.00042312, train loss 10.9817, val loss 10.9782
step 233000: learning rate 0.00042279, train loss 10.9801, val loss 10.9782
step 233250: learning rate 0.00042246, train loss 10.9780, val loss 10.9772
step 233500: learning rate 0.00042212, train loss 10.9791, val loss 10.9807
step 233750: learning rate 0.00042179, train loss 10.9709, val loss 10.9739
step 234000: learning rate 0.00042146, train loss 10.9697, val loss 10.9725
step 234250: learning rate 0.00042113, train loss 10.9801, val loss 10.9708
step 234500: learning rate 0.00042079, train loss 10.9774, val loss 10.9744
step 234750: learning rate 0.00042046, train loss 10.9809, val loss 10.9626
step 235000: learning rate 0.00042013, train loss 10.9779, val loss 10.9765
step 235250: learning rate 0.00041979, train loss 10.9878, val loss 10.9805
step 235500: learning rate 0.00041946, train loss 10.9718, val loss 10.9798
step 235750: learning rate 0.00041913, train loss 10.9732, val loss 10.9788
step 236000: learning rate 0.00041879, train loss 10.9793, val loss 10.9758
step 236250: learning rate 0.00041846, train loss 10.9781, val loss 10.9720
step 236500: learning rate 0.00041813, train loss 10.9777, val loss 10.9831
step 236750: learning rate 0.00041779, train loss 10.9802, val loss 10.9762
step 237000: learning rate 0.00041746, train loss 10.9703, val loss 10.9752
step 237250: learning rate 0.00041712, train loss 10.9769, val loss 10.9757
step 237500: learning rate 0.00041679, train loss 10.9804, val loss 10.9771
step 237750: learning rate 0.00041645, train loss 10.9717, val loss 10.9711
step 238000: learning rate 0.00041612, train loss 10.9771, val loss 10.9788
step 238250: learning rate 0.00041578, train loss 10.9782, val loss 10.9806
step 238500: learning rate 0.00041545, train loss 10.9793, val loss 10.9743
step 238750: learning rate 0.00041511, train loss 10.9753, val loss 10.9772
step 239000: learning rate 0.00041478, train loss 10.9793, val loss 10.9812
step 239250: learning rate 0.00041444, train loss 10.9786, val loss 10.9745
step 239500: learning rate 0.00041411, train loss 10.9720, val loss 10.9883
step 239750: learning rate 0.00041377, train loss 10.9855, val loss 10.9795
step 240000: learning rate 0.00041343, train loss 10.9729, val loss 10.9728
step 240250: learning rate 0.00041310, train loss 10.9784, val loss 10.9746
step 240500: learning rate 0.00041276, train loss 10.9750, val loss 10.9749
step 240750: learning rate 0.00041243, train loss 10.9777, val loss 10.9743
step 241000: learning rate 0.00041209, train loss 10.9783, val loss 10.9782
step 241250: learning rate 0.00041175, train loss 10.9719, val loss 10.9764
step 241500: learning rate 0.00041142, train loss 10.9782, val loss 10.9770
step 241750: learning rate 0.00041108, train loss 10.9756, val loss 10.9814
step 242000: learning rate 0.00041074, train loss 10.9796, val loss 10.9750
step 242250: learning rate 0.00041040, train loss 10.9836, val loss 10.9718
step 242500: learning rate 0.00041007, train loss 10.9786, val loss 10.9742
step 242750: learning rate 0.00040973, train loss 10.9844, val loss 10.9753
step 243000: learning rate 0.00040939, train loss 10.9757, val loss 10.9777
step 243250: learning rate 0.00040905, train loss 10.9818, val loss 10.9731
step 243500: learning rate 0.00040872, train loss 10.9766, val loss 10.9735
step 243750: learning rate 0.00040838, train loss 10.9826, val loss 10.9748
step 244000: learning rate 0.00040804, train loss 10.9812, val loss 10.9778
step 244250: learning rate 0.00040770, train loss 10.9857, val loss 10.9860
step 244500: learning rate 0.00040736, train loss 10.9813, val loss 10.9766
step 244750: learning rate 0.00040702, train loss 10.9838, val loss 10.9782
step 245000: learning rate 0.00040668, train loss 10.9745, val loss 10.9767
step 245250: learning rate 0.00040635, train loss 10.9768, val loss 10.9826
step 245500: learning rate 0.00040601, train loss 10.9812, val loss 10.9722
step 245750: learning rate 0.00040567, train loss 10.9826, val loss 10.9743
step 246000: learning rate 0.00040533, train loss 10.9698, val loss 10.9775
step 246250: learning rate 0.00040499, train loss 10.9710, val loss 10.9759
step 246500: learning rate 0.00040465, train loss 10.9778, val loss 10.9794
step 246750: learning rate 0.00040431, train loss 10.9802, val loss 10.9724
step 247000: learning rate 0.00040397, train loss 10.9754, val loss 10.9762
step 247250: learning rate 0.00040363, train loss 10.9794, val loss 10.9802
step 247500: learning rate 0.00040329, train loss 10.9706, val loss 10.9700
step 247750: learning rate 0.00040295, train loss 10.9770, val loss 10.9825
step 248000: learning rate 0.00040261, train loss 10.9819, val loss 10.9785
step 248250: learning rate 0.00040227, train loss 10.9743, val loss 10.9788
step 248500: learning rate 0.00040193, train loss 10.9701, val loss 10.9764
step 248750: learning rate 0.00040159, train loss 10.9750, val loss 10.9824
step 249000: learning rate 0.00040125, train loss 10.9826, val loss 10.9796
step 249250: learning rate 0.00040090, train loss 10.9790, val loss 10.9769
step 249500: learning rate 0.00040056, train loss 10.9721, val loss 10.9679
step 249750: learning rate 0.00040022, train loss 10.9759, val loss 10.9689
step 250000: learning rate 0.00039988, train loss 10.9745, val loss 10.9759
step 250250: learning rate 0.00039954, train loss 10.9836, val loss 10.9782
step 250500: learning rate 0.00039920, train loss 10.9772, val loss 10.9693
step 250750: learning rate 0.00039886, train loss 10.9835, val loss 10.9764
step 251000: learning rate 0.00039851, train loss 10.9689, val loss 10.9809
step 251250: learning rate 0.00039817, train loss 10.9774, val loss 10.9732
step 251500: learning rate 0.00039783, train loss 10.9747, val loss 10.9754
step 251750: learning rate 0.00039749, train loss 10.9743, val loss 10.9702
step 252000: learning rate 0.00039715, train loss 10.9757, val loss 10.9774
step 252250: learning rate 0.00039680, train loss 10.9726, val loss 10.9870
step 252500: learning rate 0.00039646, train loss 10.9762, val loss 10.9821
step 252750: learning rate 0.00039612, train loss 10.9755, val loss 10.9789
step 253000: learning rate 0.00039578, train loss 10.9766, val loss 10.9792
step 253250: learning rate 0.00039543, train loss 10.9781, val loss 10.9769
step 253500: learning rate 0.00039509, train loss 10.9692, val loss 10.9711
step 253750: learning rate 0.00039475, train loss 10.9760, val loss 10.9689
step 254000: learning rate 0.00039440, train loss 10.9757, val loss 10.9762
step 254250: learning rate 0.00039406, train loss 10.9781, val loss 10.9797
step 254500: learning rate 0.00039372, train loss 10.9765, val loss 10.9821
step 254750: learning rate 0.00039337, train loss 10.9739, val loss 10.9722
step 255000: learning rate 0.00039303, train loss 10.9739, val loss 10.9764
step 255250: learning rate 0.00039269, train loss 10.9749, val loss 10.9763
step 255500: learning rate 0.00039234, train loss 10.9814, val loss 10.9818
step 255750: learning rate 0.00039200, train loss 10.9785, val loss 10.9795
step 256000: learning rate 0.00039165, train loss 10.9745, val loss 10.9750
step 256250: learning rate 0.00039131, train loss 10.9729, val loss 10.9844
step 256500: learning rate 0.00039097, train loss 10.9831, val loss 10.9767
step 256750: learning rate 0.00039062, train loss 10.9817, val loss 10.9761
step 257000: learning rate 0.00039028, train loss 10.9750, val loss 10.9763
step 257250: learning rate 0.00038993, train loss 10.9786, val loss 10.9767
step 257500: learning rate 0.00038959, train loss 10.9845, val loss 10.9735
step 257750: learning rate 0.00038924, train loss 10.9733, val loss 10.9723
step 258000: learning rate 0.00038890, train loss 10.9798, val loss 10.9695
step 258250: learning rate 0.00038855, train loss 10.9754, val loss 10.9701
step 258500: learning rate 0.00038821, train loss 10.9787, val loss 10.9759
step 258750: learning rate 0.00038786, train loss 10.9774, val loss 10.9793
step 259000: learning rate 0.00038752, train loss 10.9793, val loss 10.9778
step 259250: learning rate 0.00038717, train loss 10.9815, val loss 10.9828
step 259500: learning rate 0.00038683, train loss 10.9714, val loss 10.9755
step 259750: learning rate 0.00038648, train loss 10.9735, val loss 10.9792
step 260000: learning rate 0.00038614, train loss 10.9781, val loss 10.9714
step 260250: learning rate 0.00038579, train loss 10.9764, val loss 10.9799
step 260500: learning rate 0.00038544, train loss 10.9825, val loss 10.9749
step 260750: learning rate 0.00038510, train loss 10.9695, val loss 10.9794
step 261000: learning rate 0.00038475, train loss 10.9811, val loss 10.9731
step 261250: learning rate 0.00038441, train loss 10.9726, val loss 10.9706
step 261500: learning rate 0.00038406, train loss 10.9769, val loss 10.9792
step 261750: learning rate 0.00038371, train loss 10.9771, val loss 10.9779
step 262000: learning rate 0.00038337, train loss 10.9689, val loss 10.9813
step 262250: learning rate 0.00038302, train loss 10.9677, val loss 10.9711
step 262500: learning rate 0.00038267, train loss 10.9789, val loss 10.9803
step 262750: learning rate 0.00038233, train loss 10.9812, val loss 10.9711
step 263000: learning rate 0.00038198, train loss 10.9807, val loss 10.9734
step 263250: learning rate 0.00038163, train loss 10.9726, val loss 10.9824
step 263500: learning rate 0.00038129, train loss 10.9760, val loss 10.9770
step 263750: learning rate 0.00038094, train loss 10.9805, val loss 10.9747
step 264000: learning rate 0.00038059, train loss 10.9724, val loss 10.9755
step 264250: learning rate 0.00038025, train loss 10.9741, val loss 10.9799
step 264500: learning rate 0.00037990, train loss 10.9765, val loss 10.9652
step 264750: learning rate 0.00037955, train loss 10.9812, val loss 10.9807
step 265000: learning rate 0.00037920, train loss 10.9700, val loss 10.9845
step 265250: learning rate 0.00037886, train loss 10.9699, val loss 10.9803
step 265500: learning rate 0.00037851, train loss 10.9774, val loss 10.9764
step 265750: learning rate 0.00037816, train loss 10.9758, val loss 10.9821
step 266000: learning rate 0.00037781, train loss 10.9811, val loss 10.9752
step 266250: learning rate 0.00037746, train loss 10.9762, val loss 10.9764
step 266500: learning rate 0.00037712, train loss 10.9803, val loss 10.9795
step 266750: learning rate 0.00037677, train loss 10.9773, val loss 10.9739
step 267000: learning rate 0.00037642, train loss 10.9826, val loss 10.9770
step 267250: learning rate 0.00037607, train loss 10.9786, val loss 10.9763
step 267500: learning rate 0.00037572, train loss 10.9786, val loss 10.9763
step 267750: learning rate 0.00037538, train loss 10.9815, val loss 10.9790
step 268000: learning rate 0.00037503, train loss 10.9766, val loss 10.9759
step 268250: learning rate 0.00037468, train loss 10.9714, val loss 10.9754
step 268500: learning rate 0.00037433, train loss 10.9808, val loss 10.9770
step 268750: learning rate 0.00037398, train loss 10.9823, val loss 10.9768
step 269000: learning rate 0.00037363, train loss 10.9742, val loss 10.9736
step 269250: learning rate 0.00037328, train loss 10.9744, val loss 10.9786
step 269500: learning rate 0.00037294, train loss 10.9756, val loss 10.9743
step 269750: learning rate 0.00037259, train loss 10.9717, val loss 10.9734
step 270000: learning rate 0.00037224, train loss 10.9818, val loss 10.9756
step 270250: learning rate 0.00037189, train loss 10.9884, val loss 10.9835
step 270500: learning rate 0.00037154, train loss 10.9782, val loss 10.9813
step 270750: learning rate 0.00037119, train loss 10.9733, val loss 10.9769
step 271000: learning rate 0.00037084, train loss 10.9771, val loss 10.9758
step 271250: learning rate 0.00037049, train loss 10.9783, val loss 10.9709
step 271500: learning rate 0.00037014, train loss 10.9734, val loss 10.9732
step 271750: learning rate 0.00036979, train loss 10.9822, val loss 10.9738
step 272000: learning rate 0.00036944, train loss 10.9742, val loss 10.9765
step 272250: learning rate 0.00036909, train loss 10.9821, val loss 10.9738
step 272500: learning rate 0.00036874, train loss 10.9822, val loss 10.9763
step 272750: learning rate 0.00036839, train loss 10.9762, val loss 10.9767
step 273000: learning rate 0.00036804, train loss 10.9776, val loss 10.9752
step 273250: learning rate 0.00036769, train loss 10.9738, val loss 10.9753
step 273500: learning rate 0.00036734, train loss 10.9751, val loss 10.9796
step 273750: learning rate 0.00036699, train loss 10.9808, val loss 10.9757
step 274000: learning rate 0.00036664, train loss 10.9781, val loss 10.9787
step 274250: learning rate 0.00036629, train loss 10.9766, val loss 10.9771
step 274500: learning rate 0.00036594, train loss 10.9808, val loss 10.9769
step 274750: learning rate 0.00036559, train loss 10.9778, val loss 10.9675
step 275000: learning rate 0.00036524, train loss 10.9790, val loss 10.9714
step 275250: learning rate 0.00036489, train loss 10.9746, val loss 10.9799
step 275500: learning rate 0.00036454, train loss 10.9775, val loss 10.9795
step 275750: learning rate 0.00036419, train loss 10.9781, val loss 10.9713
step 276000: learning rate 0.00036384, train loss 10.9749, val loss 10.9860
step 276250: learning rate 0.00036349, train loss 10.9786, val loss 10.9755
step 276500: learning rate 0.00036314, train loss 10.9738, val loss 10.9760
step 276750: learning rate 0.00036279, train loss 10.9735, val loss 10.9721
step 277000: learning rate 0.00036244, train loss 10.9747, val loss 10.9761
step 277250: learning rate 0.00036209, train loss 10.9802, val loss 10.9754
step 277500: learning rate 0.00036174, train loss 10.9731, val loss 10.9732
step 277750: learning rate 0.00036138, train loss 10.9782, val loss 10.9777
step 278000: learning rate 0.00036103, train loss 10.9798, val loss 10.9719
step 278250: learning rate 0.00036068, train loss 10.9778, val loss 10.9763
step 278500: learning rate 0.00036033, train loss 10.9723, val loss 10.9784
step 278750: learning rate 0.00035998, train loss 10.9804, val loss 10.9741
step 279000: learning rate 0.00035963, train loss 10.9733, val loss 10.9758
step 279250: learning rate 0.00035928, train loss 10.9791, val loss 10.9781
step 279500: learning rate 0.00035893, train loss 10.9712, val loss 10.9737
step 279750: learning rate 0.00035857, train loss 10.9710, val loss 10.9752
step 280000: learning rate 0.00035822, train loss 10.9798, val loss 10.9779
step 280250: learning rate 0.00035787, train loss 10.9741, val loss 10.9712
step 280500: learning rate 0.00035752, train loss 10.9770, val loss 10.9750
step 280750: learning rate 0.00035717, train loss 10.9764, val loss 10.9870
step 281000: learning rate 0.00035682, train loss 10.9795, val loss 10.9783
step 281250: learning rate 0.00035646, train loss 10.9822, val loss 10.9747
step 281500: learning rate 0.00035611, train loss 10.9789, val loss 10.9814
step 281750: learning rate 0.00035576, train loss 10.9806, val loss 10.9742
step 282000: learning rate 0.00035541, train loss 10.9792, val loss 10.9757
step 282250: learning rate 0.00035506, train loss 10.9748, val loss 10.9770
step 282500: learning rate 0.00035471, train loss 10.9797, val loss 10.9784
step 282750: learning rate 0.00035435, train loss 10.9806, val loss 10.9758
step 283000: learning rate 0.00035400, train loss 10.9721, val loss 10.9781
step 283250: learning rate 0.00035365, train loss 10.9766, val loss 10.9763
step 283500: learning rate 0.00035330, train loss 10.9761, val loss 10.9769
step 283750: learning rate 0.00035295, train loss 10.9760, val loss 10.9772
step 284000: learning rate 0.00035259, train loss 10.9788, val loss 10.9781
step 284250: learning rate 0.00035224, train loss 10.9810, val loss 10.9792
step 284500: learning rate 0.00035189, train loss 10.9794, val loss 10.9816
step 284750: learning rate 0.00035154, train loss 10.9791, val loss 10.9780
step 285000: learning rate 0.00035118, train loss 10.9730, val loss 10.9757
step 285250: learning rate 0.00035083, train loss 10.9753, val loss 10.9721
step 285500: learning rate 0.00035048, train loss 10.9765, val loss 10.9731
step 285750: learning rate 0.00035013, train loss 10.9771, val loss 10.9697
step 286000: learning rate 0.00034977, train loss 10.9795, val loss 10.9735
step 286250: learning rate 0.00034942, train loss 10.9756, val loss 10.9751
step 286500: learning rate 0.00034907, train loss 10.9784, val loss 10.9747
step 286750: learning rate 0.00034872, train loss 10.9711, val loss 10.9762
step 287000: learning rate 0.00034836, train loss 10.9757, val loss 10.9810
step 287250: learning rate 0.00034801, train loss 10.9745, val loss 10.9769
step 287500: learning rate 0.00034766, train loss 10.9742, val loss 10.9777
step 287750: learning rate 0.00034731, train loss 10.9754, val loss 10.9752
step 288000: learning rate 0.00034695, train loss 10.9750, val loss 10.9704
step 288250: learning rate 0.00034660, train loss 10.9817, val loss 10.9794
step 288500: learning rate 0.00034625, train loss 10.9755, val loss 10.9719
step 288750: learning rate 0.00034590, train loss 10.9764, val loss 10.9758
step 289000: learning rate 0.00034554, train loss 10.9734, val loss 10.9745
step 289250: learning rate 0.00034519, train loss 10.9796, val loss 10.9756
step 289500: learning rate 0.00034484, train loss 10.9738, val loss 10.9796
step 289750: learning rate 0.00034448, train loss 10.9777, val loss 10.9806
step 290000: learning rate 0.00034413, train loss 10.9681, val loss 10.9749
step 290250: learning rate 0.00034378, train loss 10.9782, val loss 10.9777
step 290500: learning rate 0.00034342, train loss 10.9777, val loss 10.9815
step 290750: learning rate 0.00034307, train loss 10.9753, val loss 10.9748
step 291000: learning rate 0.00034272, train loss 10.9687, val loss 10.9810
step 291250: learning rate 0.00034237, train loss 10.9800, val loss 10.9751
step 291500: learning rate 0.00034201, train loss 10.9764, val loss 10.9739
step 291750: learning rate 0.00034166, train loss 10.9776, val loss 10.9889
step 292000: learning rate 0.00034131, train loss 10.9794, val loss 10.9762
step 292250: learning rate 0.00034095, train loss 10.9710, val loss 10.9764
step 292500: learning rate 0.00034060, train loss 10.9770, val loss 10.9806
step 292750: learning rate 0.00034025, train loss 10.9795, val loss 10.9737
step 293000: learning rate 0.00033989, train loss 10.9793, val loss 10.9722
step 293250: learning rate 0.00033954, train loss 10.9745, val loss 10.9787
step 293500: learning rate 0.00033919, train loss 10.9797, val loss 10.9745
step 293750: learning rate 0.00033883, train loss 10.9714, val loss 10.9735
step 294000: learning rate 0.00033848, train loss 10.9762, val loss 10.9773
step 294250: learning rate 0.00033813, train loss 10.9700, val loss 10.9748
step 294500: learning rate 0.00033777, train loss 10.9727, val loss 10.9686
step 294750: learning rate 0.00033742, train loss 10.9776, val loss 10.9768
step 295000: learning rate 0.00033707, train loss 10.9753, val loss 10.9796
step 295250: learning rate 0.00033671, train loss 10.9728, val loss 10.9742
step 295500: learning rate 0.00033636, train loss 10.9751, val loss 10.9703
step 295750: learning rate 0.00033601, train loss 10.9793, val loss 10.9737
step 296000: learning rate 0.00033565, train loss 10.9714, val loss 10.9733
step 296250: learning rate 0.00033530, train loss 10.9729, val loss 10.9773
step 296500: learning rate 0.00033495, train loss 10.9731, val loss 10.9829
step 296750: learning rate 0.00033459, train loss 10.9777, val loss 10.9735
step 297000: learning rate 0.00033424, train loss 10.9731, val loss 10.9767
step 297250: learning rate 0.00033389, train loss 10.9797, val loss 10.9805
step 297500: learning rate 0.00033353, train loss 10.9752, val loss 10.9758
step 297750: learning rate 0.00033318, train loss 10.9749, val loss 10.9732
step 298000: learning rate 0.00033283, train loss 10.9724, val loss 10.9780
step 298250: learning rate 0.00033247, train loss 10.9753, val loss 10.9756
step 298500: learning rate 0.00033212, train loss 10.9739, val loss 10.9752
step 298750: learning rate 0.00033177, train loss 10.9752, val loss 10.9780
step 299000: learning rate 0.00033141, train loss 10.9724, val loss 10.9803
step 299250: learning rate 0.00033106, train loss 10.9730, val loss 10.9743
step 299500: learning rate 0.00033071, train loss 10.9755, val loss 10.9837
step 299750: learning rate 0.00033035, train loss 10.9831, val loss 10.9783
step 300000: learning rate 0.00033000, train loss 10.9777, val loss 10.9727
step 300250: learning rate 0.00032965, train loss 10.9824, val loss 10.9790
step 300500: learning rate 0.00032929, train loss 10.9746, val loss 10.9749
step 300750: learning rate 0.00032894, train loss 10.9743, val loss 10.9811
step 301000: learning rate 0.00032859, train loss 10.9725, val loss 10.9780
step 301250: learning rate 0.00032823, train loss 10.9714, val loss 10.9720
step 301500: learning rate 0.00032788, train loss 10.9811, val loss 10.9777
step 301750: learning rate 0.00032753, train loss 10.9773, val loss 10.9747
step 302000: learning rate 0.00032717, train loss 10.9704, val loss 10.9825
step 302250: learning rate 0.00032682, train loss 10.9758, val loss 10.9788
step 302500: learning rate 0.00032647, train loss 10.9779, val loss 10.9860
step 302750: learning rate 0.00032611, train loss 10.9763, val loss 10.9711
step 303000: learning rate 0.00032576, train loss 10.9756, val loss 10.9810
step 303250: learning rate 0.00032541, train loss 10.9809, val loss 10.9751
step 303500: learning rate 0.00032505, train loss 10.9717, val loss 10.9808
step 303750: learning rate 0.00032470, train loss 10.9722, val loss 10.9837
step 304000: learning rate 0.00032435, train loss 10.9771, val loss 10.9783
step 304250: learning rate 0.00032399, train loss 10.9771, val loss 10.9737
step 304500: learning rate 0.00032364, train loss 10.9771, val loss 10.9755
step 304750: learning rate 0.00032329, train loss 10.9800, val loss 10.9766
step 305000: learning rate 0.00032293, train loss 10.9773, val loss 10.9811
step 305250: learning rate 0.00032258, train loss 10.9770, val loss 10.9679
step 305500: learning rate 0.00032223, train loss 10.9727, val loss 10.9788
step 305750: learning rate 0.00032187, train loss 10.9762, val loss 10.9805
step 306000: learning rate 0.00032152, train loss 10.9843, val loss 10.9732
step 306250: learning rate 0.00032117, train loss 10.9760, val loss 10.9824
step 306500: learning rate 0.00032081, train loss 10.9765, val loss 10.9764
step 306750: learning rate 0.00032046, train loss 10.9749, val loss 10.9875
step 307000: learning rate 0.00032011, train loss 10.9802, val loss 10.9791
step 307250: learning rate 0.00031975, train loss 10.9854, val loss 10.9763
step 307500: learning rate 0.00031940, train loss 10.9788, val loss 10.9829
step 307750: learning rate 0.00031905, train loss 10.9757, val loss 10.9727
step 308000: learning rate 0.00031869, train loss 10.9629, val loss 10.9757
step 308250: learning rate 0.00031834, train loss 10.9747, val loss 10.9726
step 308500: learning rate 0.00031799, train loss 10.9745, val loss 10.9797
step 308750: learning rate 0.00031763, train loss 10.9893, val loss 10.9738
step 309000: learning rate 0.00031728, train loss 10.9721, val loss 10.9756
step 309250: learning rate 0.00031693, train loss 10.9734, val loss 10.9746
step 309500: learning rate 0.00031658, train loss 10.9859, val loss 10.9755
step 309750: learning rate 0.00031622, train loss 10.9784, val loss 10.9772
step 310000: learning rate 0.00031587, train loss 10.9777, val loss 10.9703
step 310250: learning rate 0.00031552, train loss 10.9689, val loss 10.9725
step 310500: learning rate 0.00031516, train loss 10.9791, val loss 10.9842
step 310750: learning rate 0.00031481, train loss 10.9788, val loss 10.9747
step 311000: learning rate 0.00031446, train loss 10.9766, val loss 10.9763
step 311250: learning rate 0.00031410, train loss 10.9841, val loss 10.9753
step 311500: learning rate 0.00031375, train loss 10.9858, val loss 10.9731
step 311750: learning rate 0.00031340, train loss 10.9747, val loss 10.9760
step 312000: learning rate 0.00031305, train loss 10.9785, val loss 10.9793
step 312250: learning rate 0.00031269, train loss 10.9815, val loss 10.9745
step 312500: learning rate 0.00031234, train loss 10.9714, val loss 10.9777
step 312750: learning rate 0.00031199, train loss 10.9780, val loss 10.9798
step 313000: learning rate 0.00031164, train loss 10.9827, val loss 10.9855
step 313250: learning rate 0.00031128, train loss 10.9736, val loss 10.9753
step 313500: learning rate 0.00031093, train loss 10.9692, val loss 10.9670
step 313750: learning rate 0.00031058, train loss 10.9783, val loss 10.9689
step 314000: learning rate 0.00031023, train loss 10.9793, val loss 10.9758
step 314250: learning rate 0.00030987, train loss 10.9811, val loss 10.9813
step 314500: learning rate 0.00030952, train loss 10.9801, val loss 10.9749
step 314750: learning rate 0.00030917, train loss 10.9793, val loss 10.9850
step 315000: learning rate 0.00030882, train loss 10.9763, val loss 10.9726
step 315250: learning rate 0.00030846, train loss 10.9719, val loss 10.9725
step 315500: learning rate 0.00030811, train loss 10.9784, val loss 10.9787
step 315750: learning rate 0.00030776, train loss 10.9834, val loss 10.9765
step 316000: learning rate 0.00030741, train loss 10.9830, val loss 10.9726
step 316250: learning rate 0.00030705, train loss 10.9747, val loss 10.9746
step 316500: learning rate 0.00030670, train loss 10.9819, val loss 10.9821
step 316750: learning rate 0.00030635, train loss 10.9754, val loss 10.9786
step 317000: learning rate 0.00030600, train loss 10.9696, val loss 10.9709
step 317250: learning rate 0.00030565, train loss 10.9769, val loss 10.9689
step 317500: learning rate 0.00030529, train loss 10.9791, val loss 10.9751
step 317750: learning rate 0.00030494, train loss 10.9798, val loss 10.9797
step 318000: learning rate 0.00030459, train loss 10.9774, val loss 10.9819
step 318250: learning rate 0.00030424, train loss 10.9806, val loss 10.9740
step 318500: learning rate 0.00030389, train loss 10.9751, val loss 10.9788
step 318750: learning rate 0.00030354, train loss 10.9793, val loss 10.9817
step 319000: learning rate 0.00030318, train loss 10.9790, val loss 10.9737
step 319250: learning rate 0.00030283, train loss 10.9783, val loss 10.9750
step 319500: learning rate 0.00030248, train loss 10.9723, val loss 10.9828
step 319750: learning rate 0.00030213, train loss 10.9855, val loss 10.9766
step 320000: learning rate 0.00030178, train loss 10.9756, val loss 10.9836
step 320250: learning rate 0.00030143, train loss 10.9735, val loss 10.9749
step 320500: learning rate 0.00030107, train loss 10.9738, val loss 10.9859
step 320750: learning rate 0.00030072, train loss 10.9741, val loss 10.9766
step 321000: learning rate 0.00030037, train loss 10.9804, val loss 10.9753
step 321250: learning rate 0.00030002, train loss 10.9717, val loss 10.9769
step 321500: learning rate 0.00029967, train loss 10.9818, val loss 10.9797
step 321750: learning rate 0.00029932, train loss 10.9808, val loss 10.9778
step 322000: learning rate 0.00029897, train loss 10.9730, val loss 10.9704
step 322250: learning rate 0.00029862, train loss 10.9710, val loss 10.9794
step 322500: learning rate 0.00029826, train loss 10.9784, val loss 10.9830
step 322750: learning rate 0.00029791, train loss 10.9808, val loss 10.9736
step 323000: learning rate 0.00029756, train loss 10.9776, val loss 10.9733
step 323250: learning rate 0.00029721, train loss 10.9773, val loss 10.9768
step 323500: learning rate 0.00029686, train loss 10.9745, val loss 10.9809
step 323750: learning rate 0.00029651, train loss 10.9779, val loss 10.9759
step 324000: learning rate 0.00029616, train loss 10.9764, val loss 10.9725
step 324250: learning rate 0.00029581, train loss 10.9810, val loss 10.9795
step 324500: learning rate 0.00029546, train loss 10.9835, val loss 10.9837
step 324750: learning rate 0.00029511, train loss 10.9769, val loss 10.9794
step 325000: learning rate 0.00029476, train loss 10.9786, val loss 10.9728
step 325250: learning rate 0.00029441, train loss 10.9752, val loss 10.9792
step 325500: learning rate 0.00029406, train loss 10.9846, val loss 10.9725
step 325750: learning rate 0.00029371, train loss 10.9733, val loss 10.9767
step 326000: learning rate 0.00029336, train loss 10.9757, val loss 10.9762
step 326250: learning rate 0.00029301, train loss 10.9770, val loss 10.9712
step 326500: learning rate 0.00029266, train loss 10.9811, val loss 10.9738
step 326750: learning rate 0.00029231, train loss 10.9804, val loss 10.9760
step 327000: learning rate 0.00029196, train loss 10.9746, val loss 10.9708
step 327250: learning rate 0.00029161, train loss 10.9789, val loss 10.9727
step 327500: learning rate 0.00029126, train loss 10.9810, val loss 10.9723
step 327750: learning rate 0.00029091, train loss 10.9793, val loss 10.9775
step 328000: learning rate 0.00029056, train loss 10.9790, val loss 10.9775
step 328250: learning rate 0.00029021, train loss 10.9752, val loss 10.9727
step 328500: learning rate 0.00028986, train loss 10.9741, val loss 10.9706
step 328750: learning rate 0.00028951, train loss 10.9755, val loss 10.9813
step 329000: learning rate 0.00028916, train loss 10.9742, val loss 10.9772
step 329250: learning rate 0.00028881, train loss 10.9738, val loss 10.9749
step 329500: learning rate 0.00028846, train loss 10.9800, val loss 10.9771
step 329750: learning rate 0.00028811, train loss 10.9828, val loss 10.9765
step 330000: learning rate 0.00028776, train loss 10.9768, val loss 10.9774
step 330250: learning rate 0.00028741, train loss 10.9790, val loss 10.9735
step 330500: learning rate 0.00028706, train loss 10.9805, val loss 10.9851
step 330750: learning rate 0.00028672, train loss 10.9731, val loss 10.9750
step 331000: learning rate 0.00028637, train loss 10.9751, val loss 10.9789
step 331250: learning rate 0.00028602, train loss 10.9848, val loss 10.9745
step 331500: learning rate 0.00028567, train loss 10.9765, val loss 10.9842
step 331750: learning rate 0.00028532, train loss 10.9768, val loss 10.9796
step 332000: learning rate 0.00028497, train loss 10.9787, val loss 10.9729
step 332250: learning rate 0.00028462, train loss 10.9766, val loss 10.9768
step 332500: learning rate 0.00028428, train loss 10.9730, val loss 10.9800
step 332750: learning rate 0.00028393, train loss 10.9793, val loss 10.9777
step 333000: learning rate 0.00028358, train loss 10.9779, val loss 10.9790
step 333250: learning rate 0.00028323, train loss 10.9766, val loss 10.9735
step 333500: learning rate 0.00028288, train loss 10.9756, val loss 10.9798
step 333750: learning rate 0.00028254, train loss 10.9797, val loss 10.9716
step 334000: learning rate 0.00028219, train loss 10.9709, val loss 10.9838
step 334250: learning rate 0.00028184, train loss 10.9719, val loss 10.9757
step 334500: learning rate 0.00028149, train loss 10.9701, val loss 10.9709
step 334750: learning rate 0.00028114, train loss 10.9740, val loss 10.9800
step 335000: learning rate 0.00028080, train loss 10.9761, val loss 10.9757
step 335250: learning rate 0.00028045, train loss 10.9730, val loss 10.9815
step 335500: learning rate 0.00028010, train loss 10.9808, val loss 10.9700
step 335750: learning rate 0.00027975, train loss 10.9748, val loss 10.9801
step 336000: learning rate 0.00027941, train loss 10.9824, val loss 10.9783
step 336250: learning rate 0.00027906, train loss 10.9779, val loss 10.9811
step 336500: learning rate 0.00027871, train loss 10.9817, val loss 10.9727
step 336750: learning rate 0.00027837, train loss 10.9772, val loss 10.9817
step 337000: learning rate 0.00027802, train loss 10.9794, val loss 10.9805
step 337250: learning rate 0.00027767, train loss 10.9795, val loss 10.9833
step 337500: learning rate 0.00027733, train loss 10.9819, val loss 10.9807
step 337750: learning rate 0.00027698, train loss 10.9760, val loss 10.9773
step 338000: learning rate 0.00027663, train loss 10.9780, val loss 10.9816
step 338250: learning rate 0.00027629, train loss 10.9747, val loss 10.9731
step 338500: learning rate 0.00027594, train loss 10.9756, val loss 10.9717
step 338750: learning rate 0.00027559, train loss 10.9728, val loss 10.9719
step 339000: learning rate 0.00027525, train loss 10.9793, val loss 10.9726
step 339250: learning rate 0.00027490, train loss 10.9712, val loss 10.9845
step 339500: learning rate 0.00027456, train loss 10.9788, val loss 10.9800
step 339750: learning rate 0.00027421, train loss 10.9755, val loss 10.9782
step 340000: learning rate 0.00027386, train loss 10.9791, val loss 10.9782
step 340250: learning rate 0.00027352, train loss 10.9712, val loss 10.9782
step 340500: learning rate 0.00027317, train loss 10.9778, val loss 10.9718
step 340750: learning rate 0.00027283, train loss 10.9816, val loss 10.9766
step 341000: learning rate 0.00027248, train loss 10.9796, val loss 10.9771
step 341250: learning rate 0.00027214, train loss 10.9735, val loss 10.9795
step 341500: learning rate 0.00027179, train loss 10.9732, val loss 10.9790
step 341750: learning rate 0.00027145, train loss 10.9696, val loss 10.9832
step 342000: learning rate 0.00027110, train loss 10.9847, val loss 10.9671
step 342250: learning rate 0.00027076, train loss 10.9738, val loss 10.9754
step 342500: learning rate 0.00027041, train loss 10.9765, val loss 10.9801
step 342750: learning rate 0.00027007, train loss 10.9807, val loss 10.9769
step 343000: learning rate 0.00026972, train loss 10.9807, val loss 10.9817
step 343250: learning rate 0.00026938, train loss 10.9814, val loss 10.9748
step 343500: learning rate 0.00026903, train loss 10.9722, val loss 10.9775
step 343750: learning rate 0.00026869, train loss 10.9746, val loss 10.9844
step 344000: learning rate 0.00026835, train loss 10.9817, val loss 10.9732
step 344250: learning rate 0.00026800, train loss 10.9709, val loss 10.9777
step 344500: learning rate 0.00026766, train loss 10.9724, val loss 10.9750
step 344750: learning rate 0.00026731, train loss 10.9752, val loss 10.9849
step 345000: learning rate 0.00026697, train loss 10.9822, val loss 10.9756
step 345250: learning rate 0.00026663, train loss 10.9735, val loss 10.9803
step 345500: learning rate 0.00026628, train loss 10.9749, val loss 10.9684
step 345750: learning rate 0.00026594, train loss 10.9831, val loss 10.9704
step 346000: learning rate 0.00026560, train loss 10.9773, val loss 10.9721
step 346250: learning rate 0.00026525, train loss 10.9742, val loss 10.9757
step 346500: learning rate 0.00026491, train loss 10.9785, val loss 10.9761
step 346750: learning rate 0.00026457, train loss 10.9758, val loss 10.9732
step 347000: learning rate 0.00026422, train loss 10.9726, val loss 10.9751
step 347250: learning rate 0.00026388, train loss 10.9802, val loss 10.9773
step 347500: learning rate 0.00026354, train loss 10.9835, val loss 10.9764
step 347750: learning rate 0.00026320, train loss 10.9758, val loss 10.9725
step 348000: learning rate 0.00026285, train loss 10.9762, val loss 10.9787
step 348250: learning rate 0.00026251, train loss 10.9766, val loss 10.9726
step 348500: learning rate 0.00026217, train loss 10.9797, val loss 10.9699
step 348750: learning rate 0.00026183, train loss 10.9690, val loss 10.9787
step 349000: learning rate 0.00026149, train loss 10.9806, val loss 10.9820
step 349250: learning rate 0.00026114, train loss 10.9699, val loss 10.9794
step 349500: learning rate 0.00026080, train loss 10.9745, val loss 10.9792
step 349750: learning rate 0.00026046, train loss 10.9718, val loss 10.9786
step 350000: learning rate 0.00026012, train loss 10.9717, val loss 10.9771
step 350250: learning rate 0.00025978, train loss 10.9783, val loss 10.9752
step 350500: learning rate 0.00025944, train loss 10.9697, val loss 10.9734
step 350750: learning rate 0.00025910, train loss 10.9720, val loss 10.9722
step 351000: learning rate 0.00025875, train loss 10.9821, val loss 10.9736
step 351250: learning rate 0.00025841, train loss 10.9796, val loss 10.9731
step 351500: learning rate 0.00025807, train loss 10.9778, val loss 10.9694
step 351750: learning rate 0.00025773, train loss 10.9803, val loss 10.9828
step 352000: learning rate 0.00025739, train loss 10.9796, val loss 10.9706
step 352250: learning rate 0.00025705, train loss 10.9797, val loss 10.9810
step 352500: learning rate 0.00025671, train loss 10.9789, val loss 10.9788
step 352750: learning rate 0.00025637, train loss 10.9774, val loss 10.9811
step 353000: learning rate 0.00025603, train loss 10.9745, val loss 10.9745
step 353250: learning rate 0.00025569, train loss 10.9784, val loss 10.9829
step 353500: learning rate 0.00025535, train loss 10.9778, val loss 10.9716
step 353750: learning rate 0.00025501, train loss 10.9856, val loss 10.9755
step 354000: learning rate 0.00025467, train loss 10.9788, val loss 10.9765
step 354250: learning rate 0.00025433, train loss 10.9716, val loss 10.9734
step 354500: learning rate 0.00025399, train loss 10.9751, val loss 10.9762
step 354750: learning rate 0.00025365, train loss 10.9709, val loss 10.9776
step 355000: learning rate 0.00025332, train loss 10.9802, val loss 10.9719
step 355250: learning rate 0.00025298, train loss 10.9769, val loss 10.9804
step 355500: learning rate 0.00025264, train loss 10.9787, val loss 10.9806
step 355750: learning rate 0.00025230, train loss 10.9788, val loss 10.9800
step 356000: learning rate 0.00025196, train loss 10.9840, val loss 10.9772
step 356250: learning rate 0.00025162, train loss 10.9711, val loss 10.9755
step 356500: learning rate 0.00025128, train loss 10.9803, val loss 10.9743
step 356750: learning rate 0.00025095, train loss 10.9745, val loss 10.9760
step 357000: learning rate 0.00025061, train loss 10.9881, val loss 10.9774
step 357250: learning rate 0.00025027, train loss 10.9826, val loss 10.9756
step 357500: learning rate 0.00024993, train loss 10.9698, val loss 10.9744
step 357750: learning rate 0.00024960, train loss 10.9736, val loss 10.9767
step 358000: learning rate 0.00024926, train loss 10.9804, val loss 10.9775
step 358250: learning rate 0.00024892, train loss 10.9766, val loss 10.9817
step 358500: learning rate 0.00024858, train loss 10.9803, val loss 10.9735
step 358750: learning rate 0.00024825, train loss 10.9809, val loss 10.9750
step 359000: learning rate 0.00024791, train loss 10.9797, val loss 10.9780
step 359250: learning rate 0.00024757, train loss 10.9736, val loss 10.9837
step 359500: learning rate 0.00024724, train loss 10.9751, val loss 10.9815
step 359750: learning rate 0.00024690, train loss 10.9795, val loss 10.9736
step 360000: learning rate 0.00024657, train loss 10.9783, val loss 10.9797
step 360250: learning rate 0.00024623, train loss 10.9733, val loss 10.9748
step 360500: learning rate 0.00024589, train loss 10.9860, val loss 10.9705
step 360750: learning rate 0.00024556, train loss 10.9766, val loss 10.9639
step 361000: learning rate 0.00024522, train loss 10.9746, val loss 10.9783
step 361250: learning rate 0.00024489, train loss 10.9735, val loss 10.9765
step 361500: learning rate 0.00024455, train loss 10.9769, val loss 10.9763
step 361750: learning rate 0.00024422, train loss 10.9800, val loss 10.9825
step 362000: learning rate 0.00024388, train loss 10.9787, val loss 10.9757
step 362250: learning rate 0.00024355, train loss 10.9705, val loss 10.9756
step 362500: learning rate 0.00024321, train loss 10.9776, val loss 10.9814
step 362750: learning rate 0.00024288, train loss 10.9773, val loss 10.9792
step 363000: learning rate 0.00024254, train loss 10.9783, val loss 10.9804
step 363250: learning rate 0.00024221, train loss 10.9781, val loss 10.9790
step 363500: learning rate 0.00024187, train loss 10.9788, val loss 10.9737
step 363750: learning rate 0.00024154, train loss 10.9753, val loss 10.9768
step 364000: learning rate 0.00024121, train loss 10.9818, val loss 10.9781
step 364250: learning rate 0.00024087, train loss 10.9727, val loss 10.9812
step 364500: learning rate 0.00024054, train loss 10.9776, val loss 10.9807
step 364750: learning rate 0.00024021, train loss 10.9755, val loss 10.9759
step 365000: learning rate 0.00023987, train loss 10.9759, val loss 10.9826
step 365250: learning rate 0.00023954, train loss 10.9787, val loss 10.9754
step 365500: learning rate 0.00023921, train loss 10.9760, val loss 10.9805
step 365750: learning rate 0.00023887, train loss 10.9794, val loss 10.9850
step 366000: learning rate 0.00023854, train loss 10.9727, val loss 10.9717
step 366250: learning rate 0.00023821, train loss 10.9823, val loss 10.9781
step 366500: learning rate 0.00023788, train loss 10.9748, val loss 10.9791
step 366750: learning rate 0.00023754, train loss 10.9800, val loss 10.9814
step 367000: learning rate 0.00023721, train loss 10.9800, val loss 10.9755
step 367250: learning rate 0.00023688, train loss 10.9784, val loss 10.9758
step 367500: learning rate 0.00023655, train loss 10.9765, val loss 10.9819
step 367750: learning rate 0.00023622, train loss 10.9782, val loss 10.9764
step 368000: learning rate 0.00023589, train loss 10.9796, val loss 10.9782
step 368250: learning rate 0.00023555, train loss 10.9734, val loss 10.9782
step 368500: learning rate 0.00023522, train loss 10.9772, val loss 10.9762
step 368750: learning rate 0.00023489, train loss 10.9756, val loss 10.9851
step 369000: learning rate 0.00023456, train loss 10.9748, val loss 10.9745
step 369250: learning rate 0.00023423, train loss 10.9728, val loss 10.9748
step 369500: learning rate 0.00023390, train loss 10.9831, val loss 10.9725
step 369750: learning rate 0.00023357, train loss 10.9832, val loss 10.9713
step 370000: learning rate 0.00023324, train loss 10.9714, val loss 10.9743
step 370250: learning rate 0.00023291, train loss 10.9737, val loss 10.9742
step 370500: learning rate 0.00023258, train loss 10.9760, val loss 10.9757
step 370750: learning rate 0.00023225, train loss 10.9765, val loss 10.9790
step 371000: learning rate 0.00023192, train loss 10.9831, val loss 10.9704
step 371250: learning rate 0.00023159, train loss 10.9806, val loss 10.9764
step 371500: learning rate 0.00023126, train loss 10.9780, val loss 10.9800
step 371750: learning rate 0.00023094, train loss 10.9767, val loss 10.9739
step 372000: learning rate 0.00023061, train loss 10.9798, val loss 10.9759
step 372250: learning rate 0.00023028, train loss 10.9692, val loss 10.9799
step 372500: learning rate 0.00022995, train loss 10.9772, val loss 10.9764
step 372750: learning rate 0.00022962, train loss 10.9763, val loss 10.9774
step 373000: learning rate 0.00022929, train loss 10.9829, val loss 10.9736
step 373250: learning rate 0.00022897, train loss 10.9699, val loss 10.9821
step 373500: learning rate 0.00022864, train loss 10.9740, val loss 10.9805
step 373750: learning rate 0.00022831, train loss 10.9830, val loss 10.9716
step 374000: learning rate 0.00022798, train loss 10.9749, val loss 10.9802
step 374250: learning rate 0.00022766, train loss 10.9810, val loss 10.9830
step 374500: learning rate 0.00022733, train loss 10.9767, val loss 10.9709
step 374750: learning rate 0.00022700, train loss 10.9799, val loss 10.9757
step 375000: learning rate 0.00022668, train loss 10.9754, val loss 10.9729
step 375250: learning rate 0.00022635, train loss 10.9885, val loss 10.9797
step 375500: learning rate 0.00022602, train loss 10.9787, val loss 10.9778
step 375750: learning rate 0.00022570, train loss 10.9793, val loss 10.9794
step 376000: learning rate 0.00022537, train loss 10.9792, val loss 10.9792
step 376250: learning rate 0.00022505, train loss 10.9736, val loss 10.9723
step 376500: learning rate 0.00022472, train loss 10.9788, val loss 10.9818
step 376750: learning rate 0.00022439, train loss 10.9730, val loss 10.9775
step 377000: learning rate 0.00022407, train loss 10.9798, val loss 10.9785
step 377250: learning rate 0.00022374, train loss 10.9783, val loss 10.9812
step 377500: learning rate 0.00022342, train loss 10.9768, val loss 10.9728
step 377750: learning rate 0.00022309, train loss 10.9751, val loss 10.9670
step 378000: learning rate 0.00022277, train loss 10.9762, val loss 10.9704
step 378250: learning rate 0.00022245, train loss 10.9771, val loss 10.9831
step 378500: learning rate 0.00022212, train loss 10.9798, val loss 10.9795
step 378750: learning rate 0.00022180, train loss 10.9821, val loss 10.9762
step 379000: learning rate 0.00022147, train loss 10.9745, val loss 10.9841
step 379250: learning rate 0.00022115, train loss 10.9774, val loss 10.9782
step 379500: learning rate 0.00022083, train loss 10.9811, val loss 10.9752
step 379750: learning rate 0.00022050, train loss 10.9839, val loss 10.9759
step 380000: learning rate 0.00022018, train loss 10.9808, val loss 10.9912
step 380250: learning rate 0.00021986, train loss 10.9777, val loss 10.9731
step 380500: learning rate 0.00021954, train loss 10.9817, val loss 10.9726
step 380750: learning rate 0.00021921, train loss 10.9799, val loss 10.9746
step 381000: learning rate 0.00021889, train loss 10.9752, val loss 10.9685
step 381250: learning rate 0.00021857, train loss 10.9766, val loss 10.9858
step 381500: learning rate 0.00021825, train loss 10.9684, val loss 10.9762
step 381750: learning rate 0.00021793, train loss 10.9776, val loss 10.9775
step 382000: learning rate 0.00021760, train loss 10.9792, val loss 10.9800
step 382250: learning rate 0.00021728, train loss 10.9744, val loss 10.9735
step 382500: learning rate 0.00021696, train loss 10.9793, val loss 10.9841
step 382750: learning rate 0.00021664, train loss 10.9810, val loss 10.9772
step 383000: learning rate 0.00021632, train loss 10.9786, val loss 10.9762
step 383250: learning rate 0.00021600, train loss 10.9770, val loss 10.9727
step 383500: learning rate 0.00021568, train loss 10.9756, val loss 10.9743
step 383750: learning rate 0.00021536, train loss 10.9758, val loss 10.9839
step 384000: learning rate 0.00021504, train loss 10.9790, val loss 10.9799
step 384250: learning rate 0.00021472, train loss 10.9765, val loss 10.9765
step 384500: learning rate 0.00021440, train loss 10.9766, val loss 10.9802
step 384750: learning rate 0.00021408, train loss 10.9760, val loss 10.9739
step 385000: learning rate 0.00021376, train loss 10.9763, val loss 10.9763
step 385250: learning rate 0.00021344, train loss 10.9800, val loss 10.9786
step 385500: learning rate 0.00021312, train loss 10.9817, val loss 10.9844
step 385750: learning rate 0.00021281, train loss 10.9774, val loss 10.9702
step 386000: learning rate 0.00021249, train loss 10.9766, val loss 10.9806
step 386250: learning rate 0.00021217, train loss 10.9759, val loss 10.9778
step 386500: learning rate 0.00021185, train loss 10.9736, val loss 10.9838
step 386750: learning rate 0.00021153, train loss 10.9759, val loss 10.9840
step 387000: learning rate 0.00021122, train loss 10.9792, val loss 10.9763
step 387250: learning rate 0.00021090, train loss 10.9801, val loss 10.9782
step 387500: learning rate 0.00021058, train loss 10.9809, val loss 10.9779
step 387750: learning rate 0.00021027, train loss 10.9822, val loss 10.9692
step 388000: learning rate 0.00020995, train loss 10.9826, val loss 10.9698
step 388250: learning rate 0.00020963, train loss 10.9749, val loss 10.9730
step 388500: learning rate 0.00020932, train loss 10.9784, val loss 10.9752
step 388750: learning rate 0.00020900, train loss 10.9760, val loss 10.9753
step 389000: learning rate 0.00020868, train loss 10.9710, val loss 10.9753
step 389250: learning rate 0.00020837, train loss 10.9764, val loss 10.9798
step 389500: learning rate 0.00020805, train loss 10.9749, val loss 10.9799
step 389750: learning rate 0.00020774, train loss 10.9788, val loss 10.9794
step 390000: learning rate 0.00020742, train loss 10.9744, val loss 10.9817
step 390250: learning rate 0.00020711, train loss 10.9776, val loss 10.9773
step 390500: learning rate 0.00020679, train loss 10.9796, val loss 10.9805
step 390750: learning rate 0.00020648, train loss 10.9714, val loss 10.9771
step 391000: learning rate 0.00020616, train loss 10.9775, val loss 10.9766
step 391250: learning rate 0.00020585, train loss 10.9754, val loss 10.9820
step 391500: learning rate 0.00020554, train loss 10.9791, val loss 10.9776
step 391750: learning rate 0.00020522, train loss 10.9782, val loss 10.9784
step 392000: learning rate 0.00020491, train loss 10.9761, val loss 10.9759
step 392250: learning rate 0.00020460, train loss 10.9781, val loss 10.9787
step 392500: learning rate 0.00020428, train loss 10.9740, val loss 10.9760
step 392750: learning rate 0.00020397, train loss 10.9727, val loss 10.9754
step 393000: learning rate 0.00020366, train loss 10.9771, val loss 10.9738
step 393250: learning rate 0.00020335, train loss 10.9747, val loss 10.9745
step 393500: learning rate 0.00020303, train loss 10.9818, val loss 10.9780
step 393750: learning rate 0.00020272, train loss 10.9796, val loss 10.9754
step 394000: learning rate 0.00020241, train loss 10.9726, val loss 10.9826
step 394250: learning rate 0.00020210, train loss 10.9751, val loss 10.9739
step 394500: learning rate 0.00020179, train loss 10.9714, val loss 10.9730
step 394750: learning rate 0.00020148, train loss 10.9731, val loss 10.9805
step 395000: learning rate 0.00020117, train loss 10.9807, val loss 10.9722
step 395250: learning rate 0.00020086, train loss 10.9691, val loss 10.9827
step 395500: learning rate 0.00020055, train loss 10.9809, val loss 10.9722
step 395750: learning rate 0.00020024, train loss 10.9750, val loss 10.9762
step 396000: learning rate 0.00019993, train loss 10.9750, val loss 10.9793
step 396250: learning rate 0.00019962, train loss 10.9778, val loss 10.9752
step 396500: learning rate 0.00019931, train loss 10.9790, val loss 10.9723
step 396750: learning rate 0.00019900, train loss 10.9753, val loss 10.9845
step 397000: learning rate 0.00019869, train loss 10.9781, val loss 10.9755
step 397250: learning rate 0.00019838, train loss 10.9721, val loss 10.9760
step 397500: learning rate 0.00019807, train loss 10.9748, val loss 10.9788
step 397750: learning rate 0.00019776, train loss 10.9815, val loss 10.9791
step 398000: learning rate 0.00019746, train loss 10.9648, val loss 10.9788
step 398250: learning rate 0.00019715, train loss 10.9737, val loss 10.9803
step 398500: learning rate 0.00019684, train loss 10.9780, val loss 10.9732
step 398750: learning rate 0.00019653, train loss 10.9793, val loss 10.9723
step 399000: learning rate 0.00019623, train loss 10.9844, val loss 10.9811
step 399250: learning rate 0.00019592, train loss 10.9756, val loss 10.9756
step 399500: learning rate 0.00019561, train loss 10.9714, val loss 10.9756
step 399750: learning rate 0.00019531, train loss 10.9759, val loss 10.9753
step 400000: learning rate 0.00019500, train loss 10.9766, val loss 10.9684
step 400250: learning rate 0.00019469, train loss 10.9821, val loss 10.9737
step 400500: learning rate 0.00019439, train loss 10.9689, val loss 10.9778
step 400750: learning rate 0.00019408, train loss 10.9743, val loss 10.9741
step 401000: learning rate 0.00019378, train loss 10.9780, val loss 10.9769
step 401250: learning rate 0.00019347, train loss 10.9676, val loss 10.9802
step 401500: learning rate 0.00019317, train loss 10.9741, val loss 10.9766
step 401750: learning rate 0.00019286, train loss 10.9781, val loss 10.9720
step 402000: learning rate 0.00019256, train loss 10.9809, val loss 10.9770
step 402250: learning rate 0.00019225, train loss 10.9765, val loss 10.9778
step 402500: learning rate 0.00019195, train loss 10.9839, val loss 10.9810
step 402750: learning rate 0.00019165, train loss 10.9771, val loss 10.9804
step 403000: learning rate 0.00019134, train loss 10.9822, val loss 10.9749
step 403250: learning rate 0.00019104, train loss 10.9777, val loss 10.9709
step 403500: learning rate 0.00019074, train loss 10.9827, val loss 10.9787
step 403750: learning rate 0.00019044, train loss 10.9728, val loss 10.9783
step 404000: learning rate 0.00019013, train loss 10.9782, val loss 10.9738
step 404250: learning rate 0.00018983, train loss 10.9761, val loss 10.9772
step 404500: learning rate 0.00018953, train loss 10.9781, val loss 10.9803
step 404750: learning rate 0.00018923, train loss 10.9774, val loss 10.9818
step 405000: learning rate 0.00018893, train loss 10.9744, val loss 10.9734
step 405250: learning rate 0.00018862, train loss 10.9765, val loss 10.9727
step 405500: learning rate 0.00018832, train loss 10.9741, val loss 10.9795
step 405750: learning rate 0.00018802, train loss 10.9826, val loss 10.9754
step 406000: learning rate 0.00018772, train loss 10.9730, val loss 10.9759
step 406250: learning rate 0.00018742, train loss 10.9745, val loss 10.9781
step 406500: learning rate 0.00018712, train loss 10.9714, val loss 10.9729
step 406750: learning rate 0.00018682, train loss 10.9852, val loss 10.9756
step 407000: learning rate 0.00018652, train loss 10.9784, val loss 10.9730
step 407250: learning rate 0.00018622, train loss 10.9785, val loss 10.9719
step 407500: learning rate 0.00018592, train loss 10.9739, val loss 10.9774
step 407750: learning rate 0.00018563, train loss 10.9787, val loss 10.9766
step 408000: learning rate 0.00018533, train loss 10.9757, val loss 10.9711
step 408250: learning rate 0.00018503, train loss 10.9812, val loss 10.9778
step 408500: learning rate 0.00018473, train loss 10.9782, val loss 10.9785
step 408750: learning rate 0.00018443, train loss 10.9742, val loss 10.9685
step 409000: learning rate 0.00018414, train loss 10.9784, val loss 10.9729
step 409250: learning rate 0.00018384, train loss 10.9726, val loss 10.9757
step 409500: learning rate 0.00018354, train loss 10.9799, val loss 10.9815
step 409750: learning rate 0.00018324, train loss 10.9737, val loss 10.9743
step 410000: learning rate 0.00018295, train loss 10.9760, val loss 10.9750
step 410250: learning rate 0.00018265, train loss 10.9768, val loss 10.9783
step 410500: learning rate 0.00018236, train loss 10.9770, val loss 10.9721
step 410750: learning rate 0.00018206, train loss 10.9767, val loss 10.9711
step 411000: learning rate 0.00018176, train loss 10.9733, val loss 10.9784
step 411250: learning rate 0.00018147, train loss 10.9709, val loss 10.9780
step 411500: learning rate 0.00018117, train loss 10.9749, val loss 10.9774
step 411750: learning rate 0.00018088, train loss 10.9757, val loss 10.9792
step 412000: learning rate 0.00018058, train loss 10.9699, val loss 10.9718
step 412250: learning rate 0.00018029, train loss 10.9804, val loss 10.9820
step 412500: learning rate 0.00018000, train loss 10.9802, val loss 10.9807
step 412750: learning rate 0.00017970, train loss 10.9770, val loss 10.9787
step 413000: learning rate 0.00017941, train loss 10.9776, val loss 10.9767
step 413250: learning rate 0.00017912, train loss 10.9683, val loss 10.9784
step 413500: learning rate 0.00017882, train loss 10.9757, val loss 10.9844
step 413750: learning rate 0.00017853, train loss 10.9827, val loss 10.9757
step 414000: learning rate 0.00017824, train loss 10.9764, val loss 10.9707
step 414250: learning rate 0.00017795, train loss 10.9718, val loss 10.9646
step 414500: learning rate 0.00017765, train loss 10.9855, val loss 10.9755
step 414750: learning rate 0.00017736, train loss 10.9748, val loss 10.9762
step 415000: learning rate 0.00017707, train loss 10.9729, val loss 10.9665
step 415250: learning rate 0.00017678, train loss 10.9770, val loss 10.9753
step 415500: learning rate 0.00017649, train loss 10.9813, val loss 10.9773
step 415750: learning rate 0.00017620, train loss 10.9722, val loss 10.9756
step 416000: learning rate 0.00017591, train loss 10.9763, val loss 10.9780
step 416250: learning rate 0.00017562, train loss 10.9776, val loss 10.9723
step 416500: learning rate 0.00017533, train loss 10.9743, val loss 10.9733
step 416750: learning rate 0.00017504, train loss 10.9772, val loss 10.9760
step 417000: learning rate 0.00017475, train loss 10.9758, val loss 10.9734
step 417250: learning rate 0.00017446, train loss 10.9695, val loss 10.9774
step 417500: learning rate 0.00017417, train loss 10.9736, val loss 10.9677
step 417750: learning rate 0.00017388, train loss 10.9828, val loss 10.9688
step 418000: learning rate 0.00017359, train loss 10.9717, val loss 10.9734
step 418250: learning rate 0.00017331, train loss 10.9796, val loss 10.9765
step 418500: learning rate 0.00017302, train loss 10.9807, val loss 10.9720
step 418750: learning rate 0.00017273, train loss 10.9653, val loss 10.9659
step 419000: learning rate 0.00017244, train loss 10.9762, val loss 10.9789
step 419250: learning rate 0.00017216, train loss 10.9794, val loss 10.9789
step 419500: learning rate 0.00017187, train loss 10.9824, val loss 10.9782
step 419750: learning rate 0.00017158, train loss 10.9755, val loss 10.9810
step 420000: learning rate 0.00017130, train loss 10.9793, val loss 10.9720
step 420250: learning rate 0.00017101, train loss 10.9822, val loss 10.9818
step 420500: learning rate 0.00017073, train loss 10.9800, val loss 10.9801
step 420750: learning rate 0.00017044, train loss 10.9739, val loss 10.9811
step 421000: learning rate 0.00017016, train loss 10.9824, val loss 10.9793
step 421250: learning rate 0.00016987, train loss 10.9804, val loss 10.9759
step 421500: learning rate 0.00016959, train loss 10.9730, val loss 10.9803
step 421750: learning rate 0.00016930, train loss 10.9792, val loss 10.9739
step 422000: learning rate 0.00016902, train loss 10.9722, val loss 10.9805
step 422250: learning rate 0.00016874, train loss 10.9772, val loss 10.9795
step 422500: learning rate 0.00016845, train loss 10.9777, val loss 10.9790
step 422750: learning rate 0.00016817, train loss 10.9743, val loss 10.9767
step 423000: learning rate 0.00016789, train loss 10.9819, val loss 10.9721
step 423250: learning rate 0.00016760, train loss 10.9798, val loss 10.9823
step 423500: learning rate 0.00016732, train loss 10.9784, val loss 10.9768
step 423750: learning rate 0.00016704, train loss 10.9792, val loss 10.9767
step 424000: learning rate 0.00016676, train loss 10.9721, val loss 10.9747
step 424250: learning rate 0.00016648, train loss 10.9768, val loss 10.9821
step 424500: learning rate 0.00016620, train loss 10.9804, val loss 10.9742
step 424750: learning rate 0.00016591, train loss 10.9720, val loss 10.9673
step 425000: learning rate 0.00016563, train loss 10.9755, val loss 10.9765
step 425250: learning rate 0.00016535, train loss 10.9815, val loss 10.9799
step 425500: learning rate 0.00016507, train loss 10.9753, val loss 10.9793
step 425750: learning rate 0.00016479, train loss 10.9739, val loss 10.9749
step 426000: learning rate 0.00016452, train loss 10.9845, val loss 10.9781
step 426250: learning rate 0.00016424, train loss 10.9736, val loss 10.9765
step 426500: learning rate 0.00016396, train loss 10.9737, val loss 10.9776
step 426750: learning rate 0.00016368, train loss 10.9735, val loss 10.9690
step 427000: learning rate 0.00016340, train loss 10.9764, val loss 10.9762
step 427250: learning rate 0.00016312, train loss 10.9791, val loss 10.9803
step 427500: learning rate 0.00016284, train loss 10.9824, val loss 10.9780
step 427750: learning rate 0.00016257, train loss 10.9790, val loss 10.9786
step 428000: learning rate 0.00016229, train loss 10.9794, val loss 10.9744
step 428250: learning rate 0.00016201, train loss 10.9636, val loss 10.9794
step 428500: learning rate 0.00016174, train loss 10.9783, val loss 10.9803
step 428750: learning rate 0.00016146, train loss 10.9768, val loss 10.9785
step 429000: learning rate 0.00016118, train loss 10.9802, val loss 10.9800
step 429250: learning rate 0.00016091, train loss 10.9762, val loss 10.9774
step 429500: learning rate 0.00016063, train loss 10.9807, val loss 10.9805
step 429750: learning rate 0.00016036, train loss 10.9822, val loss 10.9723
step 430000: learning rate 0.00016008, train loss 10.9744, val loss 10.9755
step 430250: learning rate 0.00015981, train loss 10.9763, val loss 10.9797
step 430500: learning rate 0.00015953, train loss 10.9732, val loss 10.9841
step 430750: learning rate 0.00015926, train loss 10.9686, val loss 10.9767
step 431000: learning rate 0.00015899, train loss 10.9845, val loss 10.9705
step 431250: learning rate 0.00015871, train loss 10.9733, val loss 10.9669
step 431500: learning rate 0.00015844, train loss 10.9781, val loss 10.9795
step 431750: learning rate 0.00015817, train loss 10.9775, val loss 10.9760
step 432000: learning rate 0.00015790, train loss 10.9798, val loss 10.9762
step 432250: learning rate 0.00015762, train loss 10.9743, val loss 10.9726
step 432500: learning rate 0.00015735, train loss 10.9773, val loss 10.9735
step 432750: learning rate 0.00015708, train loss 10.9718, val loss 10.9832
step 433000: learning rate 0.00015681, train loss 10.9811, val loss 10.9738
step 433250: learning rate 0.00015654, train loss 10.9792, val loss 10.9784
step 433500: learning rate 0.00015627, train loss 10.9824, val loss 10.9795
step 433750: learning rate 0.00015600, train loss 10.9806, val loss 10.9804
step 434000: learning rate 0.00015573, train loss 10.9876, val loss 10.9780
step 434250: learning rate 0.00015546, train loss 10.9771, val loss 10.9800
step 434500: learning rate 0.00015519, train loss 10.9833, val loss 10.9709
step 434750: learning rate 0.00015492, train loss 10.9780, val loss 10.9781
step 435000: learning rate 0.00015465, train loss 10.9711, val loss 10.9721
step 435250: learning rate 0.00015438, train loss 10.9812, val loss 10.9696
step 435500: learning rate 0.00015411, train loss 10.9847, val loss 10.9767
step 435750: learning rate 0.00015384, train loss 10.9783, val loss 10.9783
step 436000: learning rate 0.00015358, train loss 10.9801, val loss 10.9747
step 436250: learning rate 0.00015331, train loss 10.9666, val loss 10.9745
step 436500: learning rate 0.00015304, train loss 10.9781, val loss 10.9746
step 436750: learning rate 0.00015278, train loss 10.9783, val loss 10.9852
step 437000: learning rate 0.00015251, train loss 10.9831, val loss 10.9785
step 437250: learning rate 0.00015224, train loss 10.9711, val loss 10.9812
step 437500: learning rate 0.00015198, train loss 10.9738, val loss 10.9820
step 437750: learning rate 0.00015171, train loss 10.9749, val loss 10.9786
step 438000: learning rate 0.00015145, train loss 10.9799, val loss 10.9775
step 438250: learning rate 0.00015118, train loss 10.9712, val loss 10.9828
step 438500: learning rate 0.00015092, train loss 10.9652, val loss 10.9878
step 438750: learning rate 0.00015065, train loss 10.9721, val loss 10.9681
step 439000: learning rate 0.00015039, train loss 10.9691, val loss 10.9865
step 439250: learning rate 0.00015012, train loss 10.9798, val loss 10.9779
step 439500: learning rate 0.00014986, train loss 10.9697, val loss 10.9737
step 439750: learning rate 0.00014960, train loss 10.9811, val loss 10.9756
step 440000: learning rate 0.00014933, train loss 10.9787, val loss 10.9866
step 440250: learning rate 0.00014907, train loss 10.9794, val loss 10.9772
step 440500: learning rate 0.00014881, train loss 10.9757, val loss 10.9818
step 440750: learning rate 0.00014855, train loss 10.9736, val loss 10.9838
step 441000: learning rate 0.00014829, train loss 10.9781, val loss 10.9754
step 441250: learning rate 0.00014803, train loss 10.9661, val loss 10.9729
step 441500: learning rate 0.00014776, train loss 10.9798, val loss 10.9832
step 441750: learning rate 0.00014750, train loss 10.9796, val loss 10.9752
step 442000: learning rate 0.00014724, train loss 10.9787, val loss 10.9833
step 442250: learning rate 0.00014698, train loss 10.9814, val loss 10.9754
step 442500: learning rate 0.00014672, train loss 10.9805, val loss 10.9773
step 442750: learning rate 0.00014646, train loss 10.9706, val loss 10.9793
step 443000: learning rate 0.00014621, train loss 10.9748, val loss 10.9682
step 443250: learning rate 0.00014595, train loss 10.9746, val loss 10.9776
step 443500: learning rate 0.00014569, train loss 10.9796, val loss 10.9821
step 443750: learning rate 0.00014543, train loss 10.9819, val loss 10.9783
step 444000: learning rate 0.00014517, train loss 10.9713, val loss 10.9685
step 444250: learning rate 0.00014491, train loss 10.9749, val loss 10.9798
step 444500: learning rate 0.00014466, train loss 10.9819, val loss 10.9676
step 444750: learning rate 0.00014440, train loss 10.9842, val loss 10.9834
step 445000: learning rate 0.00014414, train loss 10.9820, val loss 10.9712
step 445250: learning rate 0.00014389, train loss 10.9819, val loss 10.9795
step 445500: learning rate 0.00014363, train loss 10.9829, val loss 10.9727
step 445750: learning rate 0.00014338, train loss 10.9796, val loss 10.9740
step 446000: learning rate 0.00014312, train loss 10.9768, val loss 10.9783
step 446250: learning rate 0.00014287, train loss 10.9731, val loss 10.9792
step 446500: learning rate 0.00014261, train loss 10.9788, val loss 10.9770
step 446750: learning rate 0.00014236, train loss 10.9773, val loss 10.9815
step 447000: learning rate 0.00014210, train loss 10.9796, val loss 10.9752
step 447250: learning rate 0.00014185, train loss 10.9837, val loss 10.9735
step 447500: learning rate 0.00014160, train loss 10.9762, val loss 10.9756
step 447750: learning rate 0.00014134, train loss 10.9739, val loss 10.9793
step 448000: learning rate 0.00014109, train loss 10.9838, val loss 10.9782
step 448250: learning rate 0.00014084, train loss 10.9734, val loss 10.9767
step 448500: learning rate 0.00014059, train loss 10.9831, val loss 10.9760
step 448750: learning rate 0.00014033, train loss 10.9817, val loss 10.9792
step 449000: learning rate 0.00014008, train loss 10.9776, val loss 10.9846
step 449250: learning rate 0.00013983, train loss 10.9722, val loss 10.9769
step 449500: learning rate 0.00013958, train loss 10.9791, val loss 10.9689
step 449750: learning rate 0.00013933, train loss 10.9773, val loss 10.9713
step 450000: learning rate 0.00013908, train loss 10.9770, val loss 10.9800
step 450250: learning rate 0.00013883, train loss 10.9788, val loss 10.9699
step 450500: learning rate 0.00013858, train loss 10.9704, val loss 10.9729
step 450750: learning rate 0.00013833, train loss 10.9718, val loss 10.9741
step 451000: learning rate 0.00013808, train loss 10.9690, val loss 10.9805
step 451250: learning rate 0.00013784, train loss 10.9733, val loss 10.9767
step 451500: learning rate 0.00013759, train loss 10.9764, val loss 10.9735
step 451750: learning rate 0.00013734, train loss 10.9733, val loss 10.9753
step 452000: learning rate 0.00013709, train loss 10.9669, val loss 10.9723
step 452250: learning rate 0.00013685, train loss 10.9747, val loss 10.9762
step 452500: learning rate 0.00013660, train loss 10.9772, val loss 10.9750
step 452750: learning rate 0.00013635, train loss 10.9787, val loss 10.9696
step 453000: learning rate 0.00013611, train loss 10.9765, val loss 10.9753
step 453250: learning rate 0.00013586, train loss 10.9790, val loss 10.9758
step 453500: learning rate 0.00013561, train loss 10.9767, val loss 10.9754
step 453750: learning rate 0.00013537, train loss 10.9734, val loss 10.9695
step 454000: learning rate 0.00013512, train loss 10.9836, val loss 10.9742
step 454250: learning rate 0.00013488, train loss 10.9707, val loss 10.9678
step 454500: learning rate 0.00013464, train loss 10.9812, val loss 10.9727
step 454750: learning rate 0.00013439, train loss 10.9843, val loss 10.9691
step 455000: learning rate 0.00013415, train loss 10.9766, val loss 10.9729
step 455250: learning rate 0.00013391, train loss 10.9801, val loss 10.9779
step 455500: learning rate 0.00013366, train loss 10.9816, val loss 10.9783
step 455750: learning rate 0.00013342, train loss 10.9736, val loss 10.9701
step 456000: learning rate 0.00013318, train loss 10.9720, val loss 10.9778
step 456250: learning rate 0.00013294, train loss 10.9814, val loss 10.9799
step 456500: learning rate 0.00013270, train loss 10.9745, val loss 10.9793
step 456750: learning rate 0.00013245, train loss 10.9873, val loss 10.9798
step 457000: learning rate 0.00013221, train loss 10.9809, val loss 10.9743
step 457250: learning rate 0.00013197, train loss 10.9751, val loss 10.9801
step 457500: learning rate 0.00013173, train loss 10.9762, val loss 10.9825
step 457750: learning rate 0.00013149, train loss 10.9786, val loss 10.9808
step 458000: learning rate 0.00013125, train loss 10.9824, val loss 10.9781
step 458250: learning rate 0.00013101, train loss 10.9842, val loss 10.9788
step 458500: learning rate 0.00013078, train loss 10.9770, val loss 10.9750
step 458750: learning rate 0.00013054, train loss 10.9761, val loss 10.9763
step 459000: learning rate 0.00013030, train loss 10.9719, val loss 10.9842
step 459250: learning rate 0.00013006, train loss 10.9730, val loss 10.9751
step 459500: learning rate 0.00012982, train loss 10.9812, val loss 10.9762
step 459750: learning rate 0.00012959, train loss 10.9767, val loss 10.9776
step 460000: learning rate 0.00012935, train loss 10.9815, val loss 10.9771
step 460250: learning rate 0.00012911, train loss 10.9802, val loss 10.9806
step 460500: learning rate 0.00012888, train loss 10.9771, val loss 10.9730
step 460750: learning rate 0.00012864, train loss 10.9777, val loss 10.9733
step 461000: learning rate 0.00012841, train loss 10.9813, val loss 10.9734
step 461250: learning rate 0.00012817, train loss 10.9787, val loss 10.9852
step 461500: learning rate 0.00012794, train loss 10.9729, val loss 10.9745
step 461750: learning rate 0.00012770, train loss 10.9767, val loss 10.9702
step 462000: learning rate 0.00012747, train loss 10.9719, val loss 10.9787
step 462250: learning rate 0.00012724, train loss 10.9778, val loss 10.9725
step 462500: learning rate 0.00012700, train loss 10.9698, val loss 10.9723
step 462750: learning rate 0.00012677, train loss 10.9664, val loss 10.9766
step 463000: learning rate 0.00012654, train loss 10.9810, val loss 10.9831
step 463250: learning rate 0.00012631, train loss 10.9809, val loss 10.9754
step 463500: learning rate 0.00012607, train loss 10.9737, val loss 10.9748
step 463750: learning rate 0.00012584, train loss 10.9786, val loss 10.9771
step 464000: learning rate 0.00012561, train loss 10.9748, val loss 10.9767
step 464250: learning rate 0.00012538, train loss 10.9793, val loss 10.9734
step 464500: learning rate 0.00012515, train loss 10.9732, val loss 10.9822
step 464750: learning rate 0.00012492, train loss 10.9785, val loss 10.9802
step 465000: learning rate 0.00012469, train loss 10.9696, val loss 10.9748
step 465250: learning rate 0.00012446, train loss 10.9827, val loss 10.9800
step 465500: learning rate 0.00012423, train loss 10.9807, val loss 10.9742
step 465750: learning rate 0.00012400, train loss 10.9755, val loss 10.9809
step 466000: learning rate 0.00012378, train loss 10.9746, val loss 10.9731
step 466250: learning rate 0.00012355, train loss 10.9786, val loss 10.9772
step 466500: learning rate 0.00012332, train loss 10.9806, val loss 10.9808
step 466750: learning rate 0.00012309, train loss 10.9763, val loss 10.9817
step 467000: learning rate 0.00012287, train loss 10.9767, val loss 10.9830
step 467250: learning rate 0.00012264, train loss 10.9794, val loss 10.9773
step 467500: learning rate 0.00012241, train loss 10.9804, val loss 10.9739
step 467750: learning rate 0.00012219, train loss 10.9710, val loss 10.9763
step 468000: learning rate 0.00012196, train loss 10.9823, val loss 10.9828
step 468250: learning rate 0.00012174, train loss 10.9821, val loss 10.9814
step 468500: learning rate 0.00012151, train loss 10.9770, val loss 10.9775
step 468750: learning rate 0.00012129, train loss 10.9778, val loss 10.9694
step 469000: learning rate 0.00012106, train loss 10.9809, val loss 10.9717
step 469250: learning rate 0.00012084, train loss 10.9729, val loss 10.9774
step 469500: learning rate 0.00012062, train loss 10.9772, val loss 10.9745
step 469750: learning rate 0.00012039, train loss 10.9802, val loss 10.9812
step 470000: learning rate 0.00012017, train loss 10.9784, val loss 10.9776
step 470250: learning rate 0.00011995, train loss 10.9748, val loss 10.9705
step 470500: learning rate 0.00011973, train loss 10.9674, val loss 10.9767
step 470750: learning rate 0.00011950, train loss 10.9776, val loss 10.9763
step 471000: learning rate 0.00011928, train loss 10.9783, val loss 10.9801
step 471250: learning rate 0.00011906, train loss 10.9787, val loss 10.9759
step 471500: learning rate 0.00011884, train loss 10.9800, val loss 10.9777
step 471750: learning rate 0.00011862, train loss 10.9788, val loss 10.9822
step 472000: learning rate 0.00011840, train loss 10.9743, val loss 10.9774
step 472250: learning rate 0.00011818, train loss 10.9792, val loss 10.9782
step 472500: learning rate 0.00011796, train loss 10.9770, val loss 10.9795
step 472750: learning rate 0.00011775, train loss 10.9805, val loss 10.9789
step 473000: learning rate 0.00011753, train loss 10.9738, val loss 10.9779
step 473250: learning rate 0.00011731, train loss 10.9759, val loss 10.9812
step 473500: learning rate 0.00011709, train loss 10.9781, val loss 10.9797
step 473750: learning rate 0.00011687, train loss 10.9750, val loss 10.9744
step 474000: learning rate 0.00011666, train loss 10.9745, val loss 10.9786
step 474250: learning rate 0.00011644, train loss 10.9776, val loss 10.9828
step 474500: learning rate 0.00011623, train loss 10.9750, val loss 10.9778
step 474750: learning rate 0.00011601, train loss 10.9843, val loss 10.9706
step 475000: learning rate 0.00011579, train loss 10.9790, val loss 10.9791
step 475250: learning rate 0.00011558, train loss 10.9715, val loss 10.9769
step 475500: learning rate 0.00011537, train loss 10.9768, val loss 10.9862
step 475750: learning rate 0.00011515, train loss 10.9727, val loss 10.9764
step 476000: learning rate 0.00011494, train loss 10.9818, val loss 10.9763
step 476250: learning rate 0.00011472, train loss 10.9801, val loss 10.9755
step 476500: learning rate 0.00011451, train loss 10.9774, val loss 10.9763
step 476750: learning rate 0.00011430, train loss 10.9763, val loss 10.9783
step 477000: learning rate 0.00011409, train loss 10.9770, val loss 10.9803
step 477250: learning rate 0.00011387, train loss 10.9739, val loss 10.9814
step 477500: learning rate 0.00011366, train loss 10.9826, val loss 10.9803
step 477750: learning rate 0.00011345, train loss 10.9775, val loss 10.9747
step 478000: learning rate 0.00011324, train loss 10.9786, val loss 10.9777
step 478250: learning rate 0.00011303, train loss 10.9763, val loss 10.9853
step 478500: learning rate 0.00011282, train loss 10.9770, val loss 10.9756
step 478750: learning rate 0.00011261, train loss 10.9746, val loss 10.9784
step 479000: learning rate 0.00011240, train loss 10.9790, val loss 10.9748
step 479250: learning rate 0.00011219, train loss 10.9771, val loss 10.9792
step 479500: learning rate 0.00011198, train loss 10.9760, val loss 10.9698
step 479750: learning rate 0.00011177, train loss 10.9813, val loss 10.9811
step 480000: learning rate 0.00011157, train loss 10.9790, val loss 10.9786
step 480250: learning rate 0.00011136, train loss 10.9769, val loss 10.9791
step 480500: learning rate 0.00011115, train loss 10.9805, val loss 10.9706
step 480750: learning rate 0.00011094, train loss 10.9827, val loss 10.9795
step 481000: learning rate 0.00011074, train loss 10.9741, val loss 10.9760
step 481250: learning rate 0.00011053, train loss 10.9773, val loss 10.9846
step 481500: learning rate 0.00011033, train loss 10.9742, val loss 10.9771
step 481750: learning rate 0.00011012, train loss 10.9803, val loss 10.9743
step 482000: learning rate 0.00010992, train loss 10.9682, val loss 10.9808
step 482250: learning rate 0.00010971, train loss 10.9764, val loss 10.9756
step 482500: learning rate 0.00010951, train loss 10.9804, val loss 10.9727
step 482750: learning rate 0.00010930, train loss 10.9812, val loss 10.9799
step 483000: learning rate 0.00010910, train loss 10.9742, val loss 10.9734
step 483250: learning rate 0.00010890, train loss 10.9760, val loss 10.9812
step 483500: learning rate 0.00010869, train loss 10.9794, val loss 10.9757
step 483750: learning rate 0.00010849, train loss 10.9764, val loss 10.9760
step 484000: learning rate 0.00010829, train loss 10.9746, val loss 10.9755
step 484250: learning rate 0.00010809, train loss 10.9777, val loss 10.9757
step 484500: learning rate 0.00010789, train loss 10.9719, val loss 10.9788
step 484750: learning rate 0.00010769, train loss 10.9806, val loss 10.9717
step 485000: learning rate 0.00010749, train loss 10.9804, val loss 10.9762
step 485250: learning rate 0.00010729, train loss 10.9776, val loss 10.9670
step 485500: learning rate 0.00010709, train loss 10.9776, val loss 10.9753
step 485750: learning rate 0.00010689, train loss 10.9785, val loss 10.9740
step 486000: learning rate 0.00010669, train loss 10.9727, val loss 10.9813
step 486250: learning rate 0.00010649, train loss 10.9724, val loss 10.9771
step 486500: learning rate 0.00010629, train loss 10.9783, val loss 10.9791
step 486750: learning rate 0.00010609, train loss 10.9792, val loss 10.9754
step 487000: learning rate 0.00010590, train loss 10.9752, val loss 10.9774
step 487250: learning rate 0.00010570, train loss 10.9730, val loss 10.9706
step 487500: learning rate 0.00010550, train loss 10.9744, val loss 10.9794
step 487750: learning rate 0.00010531, train loss 10.9776, val loss 10.9767
step 488000: learning rate 0.00010511, train loss 10.9735, val loss 10.9798
step 488250: learning rate 0.00010492, train loss 10.9761, val loss 10.9763
step 488500: learning rate 0.00010472, train loss 10.9809, val loss 10.9743
step 488750: learning rate 0.00010453, train loss 10.9806, val loss 10.9780
step 489000: learning rate 0.00010433, train loss 10.9756, val loss 10.9771
step 489250: learning rate 0.00010414, train loss 10.9740, val loss 10.9778
step 489500: learning rate 0.00010394, train loss 10.9778, val loss 10.9795
step 489750: learning rate 0.00010375, train loss 10.9738, val loss 10.9818
step 490000: learning rate 0.00010356, train loss 10.9758, val loss 10.9773
step 490250: learning rate 0.00010337, train loss 10.9847, val loss 10.9778
step 490500: learning rate 0.00010317, train loss 10.9752, val loss 10.9783
step 490750: learning rate 0.00010298, train loss 10.9853, val loss 10.9764
step 491000: learning rate 0.00010279, train loss 10.9799, val loss 10.9805
step 491250: learning rate 0.00010260, train loss 10.9805, val loss 10.9831
step 491500: learning rate 0.00010241, train loss 10.9754, val loss 10.9808
step 491750: learning rate 0.00010222, train loss 10.9798, val loss 10.9721
step 492000: learning rate 0.00010203, train loss 10.9809, val loss 10.9825
step 492250: learning rate 0.00010184, train loss 10.9729, val loss 10.9815
step 492500: learning rate 0.00010165, train loss 10.9806, val loss 10.9721
step 492750: learning rate 0.00010147, train loss 10.9801, val loss 10.9706
step 493000: learning rate 0.00010128, train loss 10.9780, val loss 10.9693
step 493250: learning rate 0.00010109, train loss 10.9717, val loss 10.9795
step 493500: learning rate 0.00010090, train loss 10.9788, val loss 10.9703
step 493750: learning rate 0.00010072, train loss 10.9739, val loss 10.9784
step 494000: learning rate 0.00010053, train loss 10.9771, val loss 10.9732
step 494250: learning rate 0.00010034, train loss 10.9761, val loss 10.9758
step 494500: learning rate 0.00010016, train loss 10.9793, val loss 10.9748
step 494750: learning rate 0.00009997, train loss 10.9819, val loss 10.9750
step 495000: learning rate 0.00009979, train loss 10.9758, val loss 10.9683
step 495250: learning rate 0.00009960, train loss 10.9764, val loss 10.9719
step 495500: learning rate 0.00009942, train loss 10.9732, val loss 10.9746
step 495750: learning rate 0.00009923, train loss 10.9741, val loss 10.9772
step 496000: learning rate 0.00009905, train loss 10.9756, val loss 10.9771
step 496250: learning rate 0.00009887, train loss 10.9791, val loss 10.9817
step 496500: learning rate 0.00009869, train loss 10.9792, val loss 10.9780
step 496750: learning rate 0.00009850, train loss 10.9836, val loss 10.9786
step 497000: learning rate 0.00009832, train loss 10.9790, val loss 10.9731
step 497250: learning rate 0.00009814, train loss 10.9804, val loss 10.9788
step 497500: learning rate 0.00009796, train loss 10.9734, val loss 10.9844
step 497750: learning rate 0.00009778, train loss 10.9789, val loss 10.9751
step 498000: learning rate 0.00009760, train loss 10.9791, val loss 10.9785
step 498250: learning rate 0.00009742, train loss 10.9742, val loss 10.9802
step 498500: learning rate 0.00009724, train loss 10.9751, val loss 10.9831
step 498750: learning rate 0.00009706, train loss 10.9733, val loss 10.9711
step 499000: learning rate 0.00009688, train loss 10.9793, val loss 10.9776
step 499250: learning rate 0.00009671, train loss 10.9812, val loss 10.9737
step 499500: learning rate 0.00009653, train loss 10.9801, val loss 10.9862
step 499750: learning rate 0.00009635, train loss 10.9734, val loss 10.9706
step 500000: learning rate 0.00009617, train loss 10.9874, val loss 10.9719
step 500250: learning rate 0.00009600, train loss 10.9828, val loss 10.9709
step 500500: learning rate 0.00009582, train loss 10.9719, val loss 10.9786
step 500750: learning rate 0.00009564, train loss 10.9777, val loss 10.9736
step 501000: learning rate 0.00009547, train loss 10.9772, val loss 10.9749
step 501250: learning rate 0.00009529, train loss 10.9759, val loss 10.9788
step 501500: learning rate 0.00009512, train loss 10.9771, val loss 10.9750
step 501750: learning rate 0.00009495, train loss 10.9832, val loss 10.9751
step 502000: learning rate 0.00009477, train loss 10.9781, val loss 10.9760
step 502250: learning rate 0.00009460, train loss 10.9726, val loss 10.9699
step 502500: learning rate 0.00009443, train loss 10.9808, val loss 10.9776
step 502750: learning rate 0.00009425, train loss 10.9822, val loss 10.9794
step 503000: learning rate 0.00009408, train loss 10.9782, val loss 10.9816
step 503250: learning rate 0.00009391, train loss 10.9755, val loss 10.9845
step 503500: learning rate 0.00009374, train loss 10.9769, val loss 10.9752
step 503750: learning rate 0.00009357, train loss 10.9751, val loss 10.9886
step 504000: learning rate 0.00009340, train loss 10.9745, val loss 10.9787
step 504250: learning rate 0.00009323, train loss 10.9698, val loss 10.9737
step 504500: learning rate 0.00009306, train loss 10.9760, val loss 10.9711
step 504750: learning rate 0.00009289, train loss 10.9824, val loss 10.9745
step 505000: learning rate 0.00009272, train loss 10.9748, val loss 10.9727
step 505250: learning rate 0.00009255, train loss 10.9806, val loss 10.9815
step 505500: learning rate 0.00009238, train loss 10.9805, val loss 10.9768
step 505750: learning rate 0.00009222, train loss 10.9782, val loss 10.9810
step 506000: learning rate 0.00009205, train loss 10.9692, val loss 10.9744
step 506250: learning rate 0.00009188, train loss 10.9696, val loss 10.9762
step 506500: learning rate 0.00009171, train loss 10.9776, val loss 10.9893
step 506750: learning rate 0.00009155, train loss 10.9738, val loss 10.9763
step 507000: learning rate 0.00009138, train loss 10.9757, val loss 10.9709
step 507250: learning rate 0.00009122, train loss 10.9689, val loss 10.9746
step 507500: learning rate 0.00009105, train loss 10.9748, val loss 10.9795
step 507750: learning rate 0.00009089, train loss 10.9784, val loss 10.9735
step 508000: learning rate 0.00009073, train loss 10.9805, val loss 10.9782
step 508250: learning rate 0.00009056, train loss 10.9759, val loss 10.9800
step 508500: learning rate 0.00009040, train loss 10.9766, val loss 10.9789
step 508750: learning rate 0.00009024, train loss 10.9743, val loss 10.9773
step 509000: learning rate 0.00009007, train loss 10.9732, val loss 10.9835
step 509250: learning rate 0.00008991, train loss 10.9818, val loss 10.9713
step 509500: learning rate 0.00008975, train loss 10.9758, val loss 10.9797
step 509750: learning rate 0.00008959, train loss 10.9726, val loss 10.9830
step 510000: learning rate 0.00008943, train loss 10.9729, val loss 10.9718
step 510250: learning rate 0.00008927, train loss 10.9732, val loss 10.9765
step 510500: learning rate 0.00008911, train loss 10.9845, val loss 10.9796
step 510750: learning rate 0.00008895, train loss 10.9725, val loss 10.9761
step 511000: learning rate 0.00008879, train loss 10.9739, val loss 10.9768
step 511250: learning rate 0.00008863, train loss 10.9725, val loss 10.9738
step 511500: learning rate 0.00008847, train loss 10.9784, val loss 10.9800
step 511750: learning rate 0.00008832, train loss 10.9719, val loss 10.9804
step 512000: learning rate 0.00008816, train loss 10.9760, val loss 10.9816
step 512250: learning rate 0.00008800, train loss 10.9834, val loss 10.9760
step 512500: learning rate 0.00008784, train loss 10.9721, val loss 10.9768
step 512750: learning rate 0.00008769, train loss 10.9730, val loss 10.9790
step 513000: learning rate 0.00008753, train loss 10.9834, val loss 10.9844
step 513250: learning rate 0.00008738, train loss 10.9779, val loss 10.9806
step 513500: learning rate 0.00008722, train loss 10.9721, val loss 10.9802
step 513750: learning rate 0.00008707, train loss 10.9799, val loss 10.9747
step 514000: learning rate 0.00008691, train loss 10.9823, val loss 10.9809
step 514250: learning rate 0.00008676, train loss 10.9800, val loss 10.9790
step 514500: learning rate 0.00008661, train loss 10.9789, val loss 10.9739
step 514750: learning rate 0.00008645, train loss 10.9798, val loss 10.9694
step 515000: learning rate 0.00008630, train loss 10.9773, val loss 10.9716
step 515250: learning rate 0.00008615, train loss 10.9705, val loss 10.9832
step 515500: learning rate 0.00008600, train loss 10.9806, val loss 10.9771
step 515750: learning rate 0.00008585, train loss 10.9774, val loss 10.9810
step 516000: learning rate 0.00008570, train loss 10.9742, val loss 10.9791
step 516250: learning rate 0.00008555, train loss 10.9745, val loss 10.9716
step 516500: learning rate 0.00008540, train loss 10.9738, val loss 10.9778
step 516750: learning rate 0.00008525, train loss 10.9815, val loss 10.9744
step 517000: learning rate 0.00008510, train loss 10.9781, val loss 10.9686
step 517250: learning rate 0.00008495, train loss 10.9787, val loss 10.9743
step 517500: learning rate 0.00008480, train loss 10.9738, val loss 10.9738
step 517750: learning rate 0.00008465, train loss 10.9807, val loss 10.9712
step 518000: learning rate 0.00008451, train loss 10.9750, val loss 10.9713
step 518250: learning rate 0.00008436, train loss 10.9741, val loss 10.9812
step 518500: learning rate 0.00008421, train loss 10.9801, val loss 10.9840
step 518750: learning rate 0.00008407, train loss 10.9796, val loss 10.9751
step 519000: learning rate 0.00008392, train loss 10.9774, val loss 10.9696
step 519250: learning rate 0.00008378, train loss 10.9699, val loss 10.9726
step 519500: learning rate 0.00008363, train loss 10.9812, val loss 10.9701
step 519750: learning rate 0.00008349, train loss 10.9799, val loss 10.9797
step 520000: learning rate 0.00008334, train loss 10.9747, val loss 10.9875
step 520250: learning rate 0.00008320, train loss 10.9748, val loss 10.9716
step 520500: learning rate 0.00008306, train loss 10.9797, val loss 10.9808
step 520750: learning rate 0.00008291, train loss 10.9727, val loss 10.9719
step 521000: learning rate 0.00008277, train loss 10.9812, val loss 10.9879
step 521250: learning rate 0.00008263, train loss 10.9721, val loss 10.9743
step 521500: learning rate 0.00008249, train loss 10.9759, val loss 10.9801
step 521750: learning rate 0.00008235, train loss 10.9783, val loss 10.9672
step 522000: learning rate 0.00008221, train loss 10.9755, val loss 10.9692
step 522250: learning rate 0.00008207, train loss 10.9801, val loss 10.9752
step 522500: learning rate 0.00008193, train loss 10.9780, val loss 10.9800
step 522750: learning rate 0.00008179, train loss 10.9738, val loss 10.9759
step 523000: learning rate 0.00008165, train loss 10.9804, val loss 10.9694
step 523250: learning rate 0.00008151, train loss 10.9786, val loss 10.9745
step 523500: learning rate 0.00008137, train loss 10.9756, val loss 10.9765
step 523750: learning rate 0.00008123, train loss 10.9808, val loss 10.9749
step 524000: learning rate 0.00008110, train loss 10.9752, val loss 10.9744
step 524250: learning rate 0.00008096, train loss 10.9782, val loss 10.9784
step 524500: learning rate 0.00008082, train loss 10.9703, val loss 10.9733
step 524750: learning rate 0.00008069, train loss 10.9767, val loss 10.9781
step 525000: learning rate 0.00008055, train loss 10.9846, val loss 10.9711
step 525250: learning rate 0.00008042, train loss 10.9760, val loss 10.9788
step 525500: learning rate 0.00008028, train loss 10.9746, val loss 10.9728
step 525750: learning rate 0.00008015, train loss 10.9798, val loss 10.9829
step 526000: learning rate 0.00008001, train loss 10.9827, val loss 10.9746
step 526250: learning rate 0.00007988, train loss 10.9761, val loss 10.9768
step 526500: learning rate 0.00007975, train loss 10.9760, val loss 10.9770
step 526750: learning rate 0.00007962, train loss 10.9746, val loss 10.9793
step 527000: learning rate 0.00007948, train loss 10.9840, val loss 10.9770
step 527250: learning rate 0.00007935, train loss 10.9794, val loss 10.9804
step 527500: learning rate 0.00007922, train loss 10.9738, val loss 10.9768
step 527750: learning rate 0.00007909, train loss 10.9707, val loss 10.9763
step 528000: learning rate 0.00007896, train loss 10.9756, val loss 10.9784
step 528250: learning rate 0.00007883, train loss 10.9792, val loss 10.9766
step 528500: learning rate 0.00007870, train loss 10.9703, val loss 10.9744
step 528750: learning rate 0.00007857, train loss 10.9814, val loss 10.9757
step 529000: learning rate 0.00007844, train loss 10.9788, val loss 10.9889
step 529250: learning rate 0.00007832, train loss 10.9770, val loss 10.9821
step 529500: learning rate 0.00007819, train loss 10.9806, val loss 10.9831
step 529750: learning rate 0.00007806, train loss 10.9725, val loss 10.9827
step 530000: learning rate 0.00007793, train loss 10.9706, val loss 10.9747
step 530250: learning rate 0.00007781, train loss 10.9775, val loss 10.9757
step 530500: learning rate 0.00007768, train loss 10.9712, val loss 10.9824
step 530750: learning rate 0.00007756, train loss 10.9772, val loss 10.9788
step 531000: learning rate 0.00007743, train loss 10.9766, val loss 10.9710
step 531250: learning rate 0.00007731, train loss 10.9703, val loss 10.9731
step 531500: learning rate 0.00007718, train loss 10.9734, val loss 10.9806
step 531750: learning rate 0.00007706, train loss 10.9809, val loss 10.9797
step 532000: learning rate 0.00007693, train loss 10.9777, val loss 10.9782
step 532250: learning rate 0.00007681, train loss 10.9874, val loss 10.9786
step 532500: learning rate 0.00007669, train loss 10.9755, val loss 10.9816
step 532750: learning rate 0.00007657, train loss 10.9803, val loss 10.9811
step 533000: learning rate 0.00007644, train loss 10.9807, val loss 10.9840
step 533250: learning rate 0.00007632, train loss 10.9738, val loss 10.9714
step 533500: learning rate 0.00007620, train loss 10.9776, val loss 10.9909
step 533750: learning rate 0.00007608, train loss 10.9690, val loss 10.9786
step 534000: learning rate 0.00007596, train loss 10.9806, val loss 10.9753
step 534250: learning rate 0.00007584, train loss 10.9842, val loss 10.9736
step 534500: learning rate 0.00007572, train loss 10.9724, val loss 10.9685
step 534750: learning rate 0.00007560, train loss 10.9676, val loss 10.9814
step 535000: learning rate 0.00007549, train loss 10.9829, val loss 10.9752
step 535250: learning rate 0.00007537, train loss 10.9761, val loss 10.9770
step 535500: learning rate 0.00007525, train loss 10.9788, val loss 10.9775
step 535750: learning rate 0.00007513, train loss 10.9762, val loss 10.9767
step 536000: learning rate 0.00007502, train loss 10.9758, val loss 10.9796
step 536250: learning rate 0.00007490, train loss 10.9698, val loss 10.9791
step 536500: learning rate 0.00007479, train loss 10.9740, val loss 10.9772
step 536750: learning rate 0.00007467, train loss 10.9710, val loss 10.9779
step 537000: learning rate 0.00007456, train loss 10.9737, val loss 10.9868
step 537250: learning rate 0.00007444, train loss 10.9784, val loss 10.9783
step 537500: learning rate 0.00007433, train loss 10.9707, val loss 10.9743
step 537750: learning rate 0.00007422, train loss 10.9768, val loss 10.9871
step 538000: learning rate 0.00007410, train loss 10.9727, val loss 10.9762
step 538250: learning rate 0.00007399, train loss 10.9756, val loss 10.9735
step 538500: learning rate 0.00007388, train loss 10.9779, val loss 10.9769
step 538750: learning rate 0.00007377, train loss 10.9755, val loss 10.9763
step 539000: learning rate 0.00007366, train loss 10.9759, val loss 10.9723
step 539250: learning rate 0.00007354, train loss 10.9786, val loss 10.9820
step 539500: learning rate 0.00007343, train loss 10.9713, val loss 10.9701
step 539750: learning rate 0.00007332, train loss 10.9776, val loss 10.9755
step 540000: learning rate 0.00007321, train loss 10.9813, val loss 10.9755
step 540250: learning rate 0.00007311, train loss 10.9780, val loss 10.9777
step 540500: learning rate 0.00007300, train loss 10.9775, val loss 10.9802
step 540750: learning rate 0.00007289, train loss 10.9729, val loss 10.9826
step 541000: learning rate 0.00007278, train loss 10.9758, val loss 10.9746
step 541250: learning rate 0.00007267, train loss 10.9827, val loss 10.9713
step 541500: learning rate 0.00007257, train loss 10.9714, val loss 10.9827
step 541750: learning rate 0.00007246, train loss 10.9830, val loss 10.9725
step 542000: learning rate 0.00007236, train loss 10.9762, val loss 10.9719
step 542250: learning rate 0.00007225, train loss 10.9834, val loss 10.9707
step 542500: learning rate 0.00007214, train loss 10.9738, val loss 10.9684
step 542750: learning rate 0.00007204, train loss 10.9802, val loss 10.9706
step 543000: learning rate 0.00007194, train loss 10.9798, val loss 10.9744
step 543250: learning rate 0.00007183, train loss 10.9856, val loss 10.9697
step 543500: learning rate 0.00007173, train loss 10.9746, val loss 10.9783
step 543750: learning rate 0.00007163, train loss 10.9749, val loss 10.9808
step 544000: learning rate 0.00007152, train loss 10.9771, val loss 10.9728
step 544250: learning rate 0.00007142, train loss 10.9767, val loss 10.9811
step 544500: learning rate 0.00007132, train loss 10.9797, val loss 10.9749
step 544750: learning rate 0.00007122, train loss 10.9831, val loss 10.9780
step 545000: learning rate 0.00007112, train loss 10.9773, val loss 10.9767
step 545250: learning rate 0.00007102, train loss 10.9767, val loss 10.9801
step 545500: learning rate 0.00007092, train loss 10.9768, val loss 10.9781
step 545750: learning rate 0.00007082, train loss 10.9758, val loss 10.9595
step 546000: learning rate 0.00007072, train loss 10.9707, val loss 10.9799
step 546250: learning rate 0.00007062, train loss 10.9795, val loss 10.9764
step 546500: learning rate 0.00007052, train loss 10.9796, val loss 10.9792
step 546750: learning rate 0.00007043, train loss 10.9800, val loss 10.9731
step 547000: learning rate 0.00007033, train loss 10.9737, val loss 10.9791
step 547250: learning rate 0.00007023, train loss 10.9709, val loss 10.9795
step 547500: learning rate 0.00007014, train loss 10.9718, val loss 10.9739
step 547750: learning rate 0.00007004, train loss 10.9739, val loss 10.9742
step 548000: learning rate 0.00006995, train loss 10.9843, val loss 10.9761
step 548250: learning rate 0.00006985, train loss 10.9845, val loss 10.9731
step 548500: learning rate 0.00006976, train loss 10.9847, val loss 10.9747
step 548750: learning rate 0.00006966, train loss 10.9738, val loss 10.9764
step 549000: learning rate 0.00006957, train loss 10.9723, val loss 10.9754
step 549250: learning rate 0.00006948, train loss 10.9714, val loss 10.9769
step 549500: learning rate 0.00006938, train loss 10.9781, val loss 10.9699
step 549750: learning rate 0.00006929, train loss 10.9861, val loss 10.9687
step 550000: learning rate 0.00006920, train loss 10.9733, val loss 10.9799
step 550250: learning rate 0.00006911, train loss 10.9826, val loss 10.9768
step 550500: learning rate 0.00006902, train loss 10.9757, val loss 10.9727
step 550750: learning rate 0.00006893, train loss 10.9810, val loss 10.9721
step 551000: learning rate 0.00006884, train loss 10.9749, val loss 10.9759
step 551250: learning rate 0.00006875, train loss 10.9779, val loss 10.9811
step 551500: learning rate 0.00006866, train loss 10.9800, val loss 10.9821
step 551750: learning rate 0.00006857, train loss 10.9794, val loss 10.9745
step 552000: learning rate 0.00006848, train loss 10.9754, val loss 10.9821
step 552250: learning rate 0.00006839, train loss 10.9772, val loss 10.9806
step 552500: learning rate 0.00006831, train loss 10.9788, val loss 10.9761
step 552750: learning rate 0.00006822, train loss 10.9736, val loss 10.9742
step 553000: learning rate 0.00006813, train loss 10.9755, val loss 10.9809
step 553250: learning rate 0.00006805, train loss 10.9674, val loss 10.9831
step 553500: learning rate 0.00006796, train loss 10.9765, val loss 10.9755
step 553750: learning rate 0.00006788, train loss 10.9771, val loss 10.9772
step 554000: learning rate 0.00006779, train loss 10.9729, val loss 10.9736
step 554250: learning rate 0.00006771, train loss 10.9790, val loss 10.9756
step 554500: learning rate 0.00006763, train loss 10.9771, val loss 10.9747
step 554750: learning rate 0.00006754, train loss 10.9753, val loss 10.9734
step 555000: learning rate 0.00006746, train loss 10.9798, val loss 10.9715
step 555250: learning rate 0.00006738, train loss 10.9721, val loss 10.9820
step 555500: learning rate 0.00006730, train loss 10.9774, val loss 10.9750
step 555750: learning rate 0.00006721, train loss 10.9809, val loss 10.9715
step 556000: learning rate 0.00006713, train loss 10.9778, val loss 10.9726
step 556250: learning rate 0.00006705, train loss 10.9751, val loss 10.9763
step 556500: learning rate 0.00006697, train loss 10.9772, val loss 10.9829
step 556750: learning rate 0.00006689, train loss 10.9764, val loss 10.9829
step 557000: learning rate 0.00006681, train loss 10.9769, val loss 10.9683
step 557250: learning rate 0.00006674, train loss 10.9745, val loss 10.9765
step 557500: learning rate 0.00006666, train loss 10.9778, val loss 10.9823
step 557750: learning rate 0.00006658, train loss 10.9792, val loss 10.9786
step 558000: learning rate 0.00006650, train loss 10.9767, val loss 10.9767
step 558250: learning rate 0.00006643, train loss 10.9815, val loss 10.9784
step 558500: learning rate 0.00006635, train loss 10.9761, val loss 10.9665
step 558750: learning rate 0.00006627, train loss 10.9820, val loss 10.9769
step 559000: learning rate 0.00006620, train loss 10.9794, val loss 10.9917
step 559250: learning rate 0.00006612, train loss 10.9732, val loss 10.9778
step 559500: learning rate 0.00006605, train loss 10.9797, val loss 10.9742
step 559750: learning rate 0.00006597, train loss 10.9764, val loss 10.9798
step 560000: learning rate 0.00006590, train loss 10.9816, val loss 10.9759
step 560250: learning rate 0.00006583, train loss 10.9759, val loss 10.9771
step 560500: learning rate 0.00006575, train loss 10.9845, val loss 10.9800
step 560750: learning rate 0.00006568, train loss 10.9761, val loss 10.9826
step 561000: learning rate 0.00006561, train loss 10.9729, val loss 10.9705
step 561250: learning rate 0.00006554, train loss 10.9742, val loss 10.9790
step 561500: learning rate 0.00006547, train loss 10.9771, val loss 10.9813
step 561750: learning rate 0.00006540, train loss 10.9749, val loss 10.9741
step 562000: learning rate 0.00006533, train loss 10.9752, val loss 10.9669
step 562250: learning rate 0.00006526, train loss 10.9730, val loss 10.9794
step 562500: learning rate 0.00006519, train loss 10.9782, val loss 10.9768
step 562750: learning rate 0.00006512, train loss 10.9811, val loss 10.9796
step 563000: learning rate 0.00006505, train loss 10.9733, val loss 10.9814
step 563250: learning rate 0.00006498, train loss 10.9834, val loss 10.9685
step 563500: learning rate 0.00006492, train loss 10.9728, val loss 10.9820
step 563750: learning rate 0.00006485, train loss 10.9757, val loss 10.9743
step 564000: learning rate 0.00006478, train loss 10.9784, val loss 10.9750
step 564250: learning rate 0.00006472, train loss 10.9734, val loss 10.9702
step 564500: learning rate 0.00006465, train loss 10.9774, val loss 10.9805
step 564750: learning rate 0.00006459, train loss 10.9788, val loss 10.9721
step 565000: learning rate 0.00006452, train loss 10.9726, val loss 10.9750
step 565250: learning rate 0.00006446, train loss 10.9760, val loss 10.9789
step 565500: learning rate 0.00006439, train loss 10.9784, val loss 10.9760
step 565750: learning rate 0.00006433, train loss 10.9776, val loss 10.9731
step 566000: learning rate 0.00006427, train loss 10.9780, val loss 10.9788
step 566250: learning rate 0.00006420, train loss 10.9840, val loss 10.9790
step 566500: learning rate 0.00006414, train loss 10.9755, val loss 10.9748
step 566750: learning rate 0.00006408, train loss 10.9683, val loss 10.9782
step 567000: learning rate 0.00006402, train loss 10.9735, val loss 10.9730
step 567250: learning rate 0.00006396, train loss 10.9769, val loss 10.9758
step 567500: learning rate 0.00006390, train loss 10.9752, val loss 10.9738
step 567750: learning rate 0.00006384, train loss 10.9767, val loss 10.9811
step 568000: learning rate 0.00006378, train loss 10.9751, val loss 10.9735
step 568250: learning rate 0.00006372, train loss 10.9739, val loss 10.9819
step 568500: learning rate 0.00006366, train loss 10.9735, val loss 10.9883
step 568750: learning rate 0.00006361, train loss 10.9785, val loss 10.9779
step 569000: learning rate 0.00006355, train loss 10.9839, val loss 10.9707
step 569250: learning rate 0.00006349, train loss 10.9811, val loss 10.9757
step 569500: learning rate 0.00006344, train loss 10.9700, val loss 10.9759
step 569750: learning rate 0.00006338, train loss 10.9742, val loss 10.9801
step 570000: learning rate 0.00006332, train loss 10.9831, val loss 10.9775
step 570250: learning rate 0.00006327, train loss 10.9717, val loss 10.9756
step 570500: learning rate 0.00006321, train loss 10.9714, val loss 10.9746
step 570750: learning rate 0.00006316, train loss 10.9713, val loss 10.9760
step 571000: learning rate 0.00006311, train loss 10.9803, val loss 10.9752
step 571250: learning rate 0.00006305, train loss 10.9846, val loss 10.9854
step 571500: learning rate 0.00006300, train loss 10.9798, val loss 10.9776
step 571750: learning rate 0.00006295, train loss 10.9740, val loss 10.9816
step 572000: learning rate 0.00006290, train loss 10.9784, val loss 10.9850
step 572250: learning rate 0.00006285, train loss 10.9800, val loss 10.9728
step 572500: learning rate 0.00006279, train loss 10.9828, val loss 10.9772
step 572750: learning rate 0.00006274, train loss 10.9787, val loss 10.9759
step 573000: learning rate 0.00006269, train loss 10.9722, val loss 10.9787
step 573250: learning rate 0.00006264, train loss 10.9738, val loss 10.9760
step 573500: learning rate 0.00006259, train loss 10.9762, val loss 10.9716
step 573750: learning rate 0.00006255, train loss 10.9652, val loss 10.9760
step 574000: learning rate 0.00006250, train loss 10.9783, val loss 10.9793
step 574250: learning rate 0.00006245, train loss 10.9745, val loss 10.9765
step 574500: learning rate 0.00006240, train loss 10.9751, val loss 10.9765
step 574750: learning rate 0.00006236, train loss 10.9722, val loss 10.9732
step 575000: learning rate 0.00006231, train loss 10.9775, val loss 10.9730
step 575250: learning rate 0.00006226, train loss 10.9790, val loss 10.9731
step 575500: learning rate 0.00006222, train loss 10.9741, val loss 10.9761
step 575750: learning rate 0.00006217, train loss 10.9842, val loss 10.9741
step 576000: learning rate 0.00006213, train loss 10.9797, val loss 10.9793
step 576250: learning rate 0.00006208, train loss 10.9754, val loss 10.9696
step 576500: learning rate 0.00006204, train loss 10.9772, val loss 10.9694
step 576750: learning rate 0.00006200, train loss 10.9868, val loss 10.9719
step 577000: learning rate 0.00006196, train loss 10.9792, val loss 10.9724
step 577250: learning rate 0.00006191, train loss 10.9812, val loss 10.9743
step 577500: learning rate 0.00006187, train loss 10.9704, val loss 10.9774
step 577750: learning rate 0.00006183, train loss 10.9752, val loss 10.9817
step 578000: learning rate 0.00006179, train loss 10.9741, val loss 10.9699
step 578250: learning rate 0.00006175, train loss 10.9779, val loss 10.9794
step 578500: learning rate 0.00006171, train loss 10.9735, val loss 10.9726
step 578750: learning rate 0.00006167, train loss 10.9787, val loss 10.9790
step 579000: learning rate 0.00006163, train loss 10.9787, val loss 10.9797
step 579250: learning rate 0.00006159, train loss 10.9758, val loss 10.9819
step 579500: learning rate 0.00006155, train loss 10.9689, val loss 10.9719
step 579750: learning rate 0.00006152, train loss 10.9758, val loss 10.9767
step 580000: learning rate 0.00006148, train loss 10.9848, val loss 10.9807
step 580250: learning rate 0.00006144, train loss 10.9808, val loss 10.9835
step 580500: learning rate 0.00006141, train loss 10.9800, val loss 10.9810
step 580750: learning rate 0.00006137, train loss 10.9860, val loss 10.9845
step 581000: learning rate 0.00006133, train loss 10.9750, val loss 10.9783
step 581250: learning rate 0.00006130, train loss 10.9758, val loss 10.9726
step 581500: learning rate 0.00006127, train loss 10.9766, val loss 10.9812
step 581750: learning rate 0.00006123, train loss 10.9796, val loss 10.9714
step 582000: learning rate 0.00006120, train loss 10.9800, val loss 10.9692
step 582250: learning rate 0.00006117, train loss 10.9741, val loss 10.9765
step 582500: learning rate 0.00006113, train loss 10.9771, val loss 10.9797
step 582750: learning rate 0.00006110, train loss 10.9773, val loss 10.9734
step 583000: learning rate 0.00006107, train loss 10.9706, val loss 10.9767
step 583250: learning rate 0.00006104, train loss 10.9725, val loss 10.9748
step 583500: learning rate 0.00006101, train loss 10.9767, val loss 10.9686
step 583750: learning rate 0.00006098, train loss 10.9799, val loss 10.9824
step 584000: learning rate 0.00006095, train loss 10.9826, val loss 10.9700
step 584250: learning rate 0.00006092, train loss 10.9755, val loss 10.9751
step 584500: learning rate 0.00006089, train loss 10.9749, val loss 10.9815
step 584750: learning rate 0.00006086, train loss 10.9781, val loss 10.9764
step 585000: learning rate 0.00006083, train loss 10.9754, val loss 10.9722
step 585250: learning rate 0.00006080, train loss 10.9735, val loss 10.9705
step 585500: learning rate 0.00006078, train loss 10.9781, val loss 10.9739
step 585750: learning rate 0.00006075, train loss 10.9780, val loss 10.9749
step 586000: learning rate 0.00006073, train loss 10.9769, val loss 10.9785
step 586250: learning rate 0.00006070, train loss 10.9814, val loss 10.9838
step 586500: learning rate 0.00006067, train loss 10.9701, val loss 10.9792
step 586750: learning rate 0.00006065, train loss 10.9759, val loss 10.9773
step 587000: learning rate 0.00006063, train loss 10.9754, val loss 10.9797
step 587250: learning rate 0.00006060, train loss 10.9763, val loss 10.9735
step 587500: learning rate 0.00006058, train loss 10.9689, val loss 10.9700
step 587750: learning rate 0.00006056, train loss 10.9798, val loss 10.9764
step 588000: learning rate 0.00006053, train loss 10.9809, val loss 10.9798
step 588250: learning rate 0.00006051, train loss 10.9710, val loss 10.9732
step 588500: learning rate 0.00006049, train loss 10.9777, val loss 10.9714
step 588750: learning rate 0.00006047, train loss 10.9721, val loss 10.9730
step 589000: learning rate 0.00006045, train loss 10.9753, val loss 10.9772
step 589250: learning rate 0.00006043, train loss 10.9704, val loss 10.9668
step 589500: learning rate 0.00006041, train loss 10.9721, val loss 10.9768
step 589750: learning rate 0.00006039, train loss 10.9739, val loss 10.9793
step 590000: learning rate 0.00006037, train loss 10.9760, val loss 10.9746
step 590250: learning rate 0.00006035, train loss 10.9784, val loss 10.9676
step 590500: learning rate 0.00006033, train loss 10.9748, val loss 10.9819
step 590750: learning rate 0.00006032, train loss 10.9760, val loss 10.9790
step 591000: learning rate 0.00006030, train loss 10.9704, val loss 10.9779
step 591250: learning rate 0.00006028, train loss 10.9806, val loss 10.9774
step 591500: learning rate 0.00006027, train loss 10.9746, val loss 10.9781
step 591750: learning rate 0.00006025, train loss 10.9798, val loss 10.9710
step 592000: learning rate 0.00006024, train loss 10.9786, val loss 10.9670
step 592250: learning rate 0.00006022, train loss 10.9750, val loss 10.9712
step 592500: learning rate 0.00006021, train loss 10.9765, val loss 10.9748
step 592750: learning rate 0.00006019, train loss 10.9805, val loss 10.9765
step 593000: learning rate 0.00006018, train loss 10.9754, val loss 10.9786
step 593250: learning rate 0.00006017, train loss 10.9767, val loss 10.9794
step 593500: learning rate 0.00006016, train loss 10.9695, val loss 10.9781
step 593750: learning rate 0.00006014, train loss 10.9747, val loss 10.9769
step 594000: learning rate 0.00006013, train loss 10.9770, val loss 10.9840
step 594250: learning rate 0.00006012, train loss 10.9763, val loss 10.9780
step 594500: learning rate 0.00006011, train loss 10.9701, val loss 10.9773
step 594750: learning rate 0.00006010, train loss 10.9737, val loss 10.9776
step 595000: learning rate 0.00006009, train loss 10.9793, val loss 10.9744
step 595250: learning rate 0.00006008, train loss 10.9751, val loss 10.9802
step 595500: learning rate 0.00006007, train loss 10.9736, val loss 10.9726
step 595750: learning rate 0.00006007, train loss 10.9723, val loss 10.9814
step 596000: learning rate 0.00006006, train loss 10.9785, val loss 10.9815
step 596250: learning rate 0.00006005, train loss 10.9772, val loss 10.9791
step 596500: learning rate 0.00006005, train loss 10.9767, val loss 10.9810
step 596750: learning rate 0.00006004, train loss 10.9747, val loss 10.9823
step 597000: learning rate 0.00006003, train loss 10.9797, val loss 10.9856
step 597250: learning rate 0.00006003, train loss 10.9845, val loss 10.9780
step 597500: learning rate 0.00006002, train loss 10.9790, val loss 10.9737
step 597750: learning rate 0.00006002, train loss 10.9785, val loss 10.9697
step 598000: learning rate 0.00006001, train loss 10.9746, val loss 10.9849
step 598250: learning rate 0.00006001, train loss 10.9751, val loss 10.9734
step 598500: learning rate 0.00006001, train loss 10.9750, val loss 10.9843
step 598750: learning rate 0.00006001, train loss 10.9793, val loss 10.9704
step 599000: learning rate 0.00006000, train loss 10.9779, val loss 10.9718
step 599250: learning rate 0.00006000, train loss 10.9837, val loss 10.9684
step 599500: learning rate 0.00006000, train loss 10.9741, val loss 10.9766
step 599750: learning rate 0.00006000, train loss 10.9786, val loss 10.9780
step 600000: learning rate 0.00006000, train loss 10.9809, val loss 10.9705
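
Note on the learning-rate column: it decays smoothly and bottoms out at exactly 0.00006000 right at step 600000, which is consistent with a cosine decay toward a floor of 6e-5 over a 600,000-iteration horizon. The short Python sketch below is an assumption about that schedule, not an excerpt from the training script; the peak value of 6e-4 is inferred from the printed rates in this section. It reproduces the logged learning rates to the printed precision.

import math

max_lr = 6e-4          # assumed peak learning rate (inferred, not read from the script)
min_lr = 6e-5          # floor observed at the end of the log (0.00006000)
decay_iters = 600_000  # decay horizon matching the final logged step

def cosine_lr(it: int) -> float:
    """Cosine decay from max_lr down to min_lr over decay_iters iterations (warmup ignored)."""
    if it >= decay_iters:
        return min_lr
    ratio = it / decay_iters                          # progress through the decay, in [0, 1)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes from 1 at the start to 0 at the end
    return min_lr + coeff * (max_lr - min_lr)

# Spot checks against the log above (values agree with the printed 8-decimal precision):
for step in (551000, 575000, 600000):
    print(f"step {step}: learning rate {cosine_lr(step):.8f}")
# step 551000: learning rate 0.00006884
# step 575000: learning rate 0.00006231
# step 600000: learning rate 0.00006000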
