## Implementation of policy optimization algorithms

To reproduce the experiments in the paper, first train the initial SFT policies with the code in the `sft` directory. Then run the code in the directory for the optimization method you want to use.
