model with decay scheduler (10k warm-up, then decay to zero):
    36245709, 39431378, 39431379, 39431380, 39431381, 39431383, 39431384, 39431385, 39431386, 39431387, 39431390, 40408135, 40408173
model with fast warm-up scheduler (1k warm-up, then constant lr):
    40408135, 40408173, 40408181, 40408182
model with slow warm-up scheduler (10k warm-up, then constant lr):
    40467729 (seed42), 40467730 (seed43), 40467732 (seed44), 40467734 (seed45), 40467736 (seed46)
    44542984 (seed47), 44542986 (seed48), 44542987 (seed49), 44542989 (seed50), 44543009 (seed51)
    44543280 (seed52), 44543284 (seed53), 44543288 (seed54), 44543306 (seed55), 44543309 (seed56)
    44543310 (seed57), 44543311 (seed58), 44543314 (seed59), 44543315 (seed60), 44543317 (seed61)
    44543376 (seed62), 44543378 (seed63), 44544142 (seed64), 44543394 (seed65), 44543397 (seed66)
    44543402 (seed67), 44543404 (seed68), 44543417 (seed69), 44543419 (seed70), 44543420 (seed71)

model with 1M steps standard lr decay:
    40480964, 4063156, 40636152, 40636150, 40636147, 40636137, 40480966
model with 300k steps standard lr decay:
    41066097
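The three scheduler variants above can be sketched as one piecewise function. This is a minimal sketch, assuming linear warm-up and linear decay; the actual warm-up/decay shapes (linear vs. cosine) and the base learning rate are assumptions, not stated in these notes.

```python
def lr_schedule(step, base_lr, warmup_steps, total_steps=None):
    """Learning rate at a given step.

    - warm-up phase: lr ramps linearly from 0 to base_lr over warmup_steps
      (assumed linear; shape not confirmed by the notes)
    - total_steps=None: constant lr after warm-up (the 1k / 10k warm-up runs)
    - total_steps set: lr decays linearly to zero at total_steps
      (the decay-to-zero and standard-lr-decay runs)
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if total_steps is None:
        return base_lr  # constant-lr variants
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)
```

For example, the 10k-warm-up decay run over 200k steps would be `lr_schedule(step, base_lr, 10_000, 200_000)`, while the slow-warm-up constant run is `lr_schedule(step, base_lr, 10_000)`.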


What is special about each run:
36245709: the well-studied seed-42 run with 200k steps (original run converges)
40408173: the original model more or less converged
40480964: seed 42 with learning rate decay, 1M steps, clean convergence behavior
40636150: seed 45 with learning rate decay; the original model did not converge