1. query selection (done)
2. reward model last layer
3. dataset size (done)
4. trivial offline RL
5. ATAC
