# Debugging and setup
- 38094990
    - testing things with one full rank head
    - everything works. after 200 epochs, loss is 0.0004
- 38095115
    - retrying with 400 epochs
    - noticing that GPU utilization is quite low
- 38095251
    - retrying with bigger batch size, longer time
    - after 400 epochs, loss is 0.0002
- 38096152
    - retrying with 100 heads of rank 50
    - making the scheduler a bit more aggressive
- 38102997
    - try putting the dataloader on the GPU
- 38106043
    - 3x3x3 sweep of lr, nheads, and rank
    - dim = 100, ntokens = 100
    - noticing that I'm probably not initializing right: when there are more heads, VO should be initialized smaller
    - but eventually it recovers from the bad initialization. can't do better than 0.006
    - what PyTorch does is Xavier initialization, where the in-dimension and out-dimension are both the embedding dimension. but for them, the VO matrix is also low rank, and they initialize the two factors separately
- 38107893
    - attempted to fix the initialization problem so that scaling nheads doesn't break things
- 38115797
    - retrying full rank with lr = 1e-3
    - too slow. hit max iter without learning the right thing?
- command line
    - trying with cosine annealing starting from lr = 1e-2
    - my job timed out, maybe worth trying this one again?
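
One possible way to make the initialization independent of nheads and rank, written out as a rough sketch. This is an assumption about what a fix could look like, not necessarily what job 38107893 did. It assumes each head's VO is a product of two zero-mean Gaussian factors, V (dim x rank) and O (rank x dim), with the same entry std, and that the target is the Xavier variance of a single dim x dim matrix, 2 / (dim + dim) = 1 / dim, for the sum over heads. The names `factor_std` and `summed_vo_variance` are hypothetical.

```python
import math


def factor_std(dim: int, n_heads: int, rank: int) -> float:
    """Entry std for each low-rank factor V (dim x rank) and O (rank x dim).

    Assumption: one head's product V @ O has entry variance rank * std**4
    (independent zero-mean entries), and n_heads independent heads add
    their variances. Solving n_heads * rank * std**4 = 1 / dim gives
    std = (dim * n_heads * rank) ** -0.25.
    """
    return (dim * n_heads * rank) ** -0.25


def summed_vo_variance(n_heads: int, rank: int, std: float) -> float:
    """Entry variance of sum_h V_h @ O_h under the assumptions above."""
    return n_heads * rank * std ** 4


if __name__ == "__main__":
    dim = 100
    # Hypothetical grid; the actual sweep values are not recorded in the log.
    for n_heads in (1, 10, 100):
        for rank in (1, 10, 50):
            std = factor_std(dim, n_heads, rank)
            var = summed_vo_variance(n_heads, rank, std)
            print(f"nheads={n_heads:3d} rank={rank:2d} std={std:.4f} summed var={var:.4f}")
```

The resulting std could be passed to `torch.nn.init.normal_(tensor, mean=0.0, std=std)` for each factor. With this scaling the summed variance stays at 1 / dim no matter how many heads are added, which is the property the notes say was missing.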
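
For the cosine annealing run, the schedule itself can be written in closed form, which helps when checking what lr a timed-out job had reached. A minimal sketch, assuming the standard cosine annealing formula (the same closed form documented for `torch.optim.lr_scheduler.CosineAnnealingLR`, with `T_max = total_steps` and `eta_min = lr_min`). The total of 400 steps and `lr_min = 0` in the usage lines are placeholders, not values from the run.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_max: float = 1e-2, lr_min: float = 0.0) -> float:
    """Learning rate at `step` for cosine annealing from lr_max down to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))


if __name__ == "__main__":
    total = 400  # placeholder; the real schedule length is not in the log
    for step in (0, 100, 200, 300, 400):
        print(step, cosine_lr(step, total))
```

The lr stays near lr_max early on and drops fastest around the midpoint, so a job that times out partway through has spent most of its steps at a fairly high lr.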