

\section{Conclusion}\label{sec:conclusion}

We introduced STaPLe, a language model self-improvement method that uses model-generated latent principles to learn intrinsic self-correction. These principles serve a purpose akin to a reasoning chain-of-thought, boosting the quality of LM generations on alignment-focused benchmarks. Furthermore, under our approximate posterior-regularized Monte Carlo EM algorithm, the model can continue to improve over multiple iterations while simultaneously compressing the principles into a human-interpretable constitution. We also show that our clustering approach balances performance with the diversity of the generated constitution, adding to the utility of the STaPLe algorithm. The efficacy of STaPLe highlights the potential for constitutional alignment with self-generated principles to improve model responses in an interpretable manner with minimal human supervision.