
Start by learning RTE using MNLI components on roberta-base, training the
RTE classifier head during coefficient fitting.

maybe try decomp of roberta-dna (or whatever it was called, check old bio repo)


Divisibility toy task: divisors from 2 to 30, dividends of up to 6 digits.
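
Rough sketch of a generator for this task. The function name, the roughly
balanced labels, and sampling the digit count uniformly are all assumptions made
for illustration; none of it is settled.

```python
import random


def make_divisibility_examples(n_examples, max_digits=6, min_divisor=2,
                               max_divisor=30, seed=0):
    """Return (dividend, divisor, label) triples; label is 1 iff divisor | dividend."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        divisor = rng.randint(min_divisor, max_divisor)
        # Sample the digit count first so short dividends are not vanishingly rare.
        n_digits = rng.randint(1, max_digits)
        low, high = 10 ** (n_digits - 1), 10 ** n_digits - 1
        first = -(-low // divisor)  # smallest k with k * divisor >= low
        last = high // divisor      # largest k with k * divisor <= high
        if rng.random() < 0.5 and first <= last:
            # Force a positive example by picking a multiple of the divisor.
            dividend = divisor * rng.randint(first, last)
        else:
            # Unconstrained draw; mostly negatives, occasionally a positive.
            dividend = rng.randint(low, high)
        examples.append((dividend, divisor, int(dividend % divisor == 0)))
    return examples


examples = make_divisibility_examples(1000, seed=0)
```

Sorting `examples` by number of digits would give a simple ordering for a
curriculum, if that is the kind of curriculum wanted.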


Divis TODOs:
- Make models/layers module.
- Make script to train and save model (curriculum learning)
- Get some script to compute per-example Fishers.
- Slight refactor of the divisibility dataset stuff.
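
Minimal sketch of a per-example diagonal Fisher, assuming a softmax-linear model
as a stand-in for the real network (the model, `per_example_diag_fisher`, and
the random inputs are hypothetical). The expectation over labels is taken
exactly under the model's own predictions.

```python
import numpy as np


def per_example_diag_fisher(W, x):
    """Diagonal Fisher of a softmax-linear model for a single input x.

    F = E_{y ~ p(y|x)} [ (d log p(y|x) / dW) ** 2 ], elementwise.
    """
    logits = W @ x
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    n_classes = W.shape[0]
    fisher = np.zeros_like(W)
    for y in range(n_classes):
        onehot = np.eye(n_classes)[y]
        grad = np.outer(onehot - p, x)  # d log p(y|x) / dW
        fisher += p[y] * grad ** 2
    return fisher


rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))  # 2 classes, 3 input features
x = rng.normal(size=3)
fisher = per_example_diag_fisher(W, x)
```

For this particular model the result has the closed form p_k (1 - p_k) x_j^2,
which is handy as a unit test for a more general implementation.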

- I'm getting dense_fisher_norms of 0, which seems wrong.


- See if we can do NMF where each component, restricted to a single dense matrix, is a
  rank-1 matrix
- See if I can improve efficiency/quality of the NMFs by running NMF over averages
  of several per-example Fishers rather than over individual ones.
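
Sketch of the averaging idea, assuming each per-example Fisher is flattened into
one non-negative row. The helper names, the group size, and the random stand-in
data are assumptions; the NMF here is the standard multiplicative-update scheme
for Frobenius loss, not necessarily what the real code uses.

```python
import numpy as np


def average_in_groups(fishers, group_size):
    """Average consecutive groups of rows; a trailing remainder is dropped."""
    n_groups = fishers.shape[0] // group_size
    trimmed = fishers[: n_groups * group_size]
    return trimmed.reshape(n_groups, group_size, -1).mean(axis=1)


def nmf(V, n_components, n_iters=200, eps=1e-10, seed=0):
    """Factor V ~= W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iters):
        # eps in the denominators guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


rng = np.random.default_rng(0)
fishers = rng.random((64, 20)) ** 2  # stand-in for 64 flattened Fishers
averaged = average_in_groups(fishers, group_size=4)  # 16 rows instead of 64
W, H = nmf(averaged, n_components=3)
```

The NMF input shrinks by a factor of `group_size`; whether quality improves is
exactly the open question above.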

- Task2vec-style selection mechanism for components?




- Do NMF per "layer" (figure out what "layer" means) for transformers (maybe eventually
  add an option to compute Fishers per layer)
- Use the NMF H matrix to decompose per-example Fishers that were not used to fit the
  decomposition.
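
Sketch of decomposing held-out Fishers against a fixed H: solve for non-negative
coefficients only, leaving H untouched. `fit_coefficients` and the synthetic
data are hypothetical; an exact NNLS solver per row would be an alternative to
the multiplicative updates used here.

```python
import numpy as np


def fit_coefficients(V_new, H, n_iters=500, eps=1e-10, seed=0):
    """Minimize ||V_new - W @ H||_F over W >= 0 with H held fixed."""
    rng = np.random.default_rng(seed)
    W = rng.random((V_new.shape[0], H.shape[0])) + eps
    HHt = H @ H.T      # constant across iterations, so precompute
    VHt = V_new @ H.T
    for _ in range(n_iters):
        W *= VHt / (W @ HHt + eps)
    return W


rng = np.random.default_rng(1)
H = rng.random((3, 20))        # stand-in for components from an earlier NMF
true_W = rng.random((5, 3))
held_out = true_W @ H          # 5 synthetic held-out Fishers in the span of H
W = fit_coefficients(held_out, H)
rel_err = np.linalg.norm(held_out - W @ H) / np.linalg.norm(held_out)
```

On real held-out Fishers `rel_err` will not go to zero; its size is one way to
judge how well the components generalize.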


- Looks like the last 2 NMF sub-blocks NaNed



- Sloppy model-based improvements to ReLU combo stuff? Basically ignore some elements of
  the activation pattern depending on the Fisher of the unit for that example. Could
  help scaling a lot.
    - Convexity of optimization sub-problems though?
