

\section{Introduction}\label{sec:intro}

Modern language models \citep{grattafiori2024llama3herdmodels,openai2024gpt4ocard, deepseekai2025deepseekv3technicalreport} have achieved striking fluency and coherence in open-ended generation, yet guiding them to satisfy multiple, possibly overlapping human-defined criteria remains a core challenge. Conventional approaches to aligning language models (LMs) rely on human annotations that distinguish a chosen generation from a rejected one, even when the gap in quality between them is nuanced and multi-faceted.
Constitutional AI and related paradigms \citep{bai2022constitutionalaiharmlessnessai,guan2025deliberativealignmentreasoningenables} consider a human-curated ``constitution'' of high-level attributes that the model's responses should follow. While this framework enables models to be steered toward safer behavior, the static nature of the constitution requires experts to anticipate every nuance in advance and to update rules manually as edge cases surface. As use cases proliferate, new failure modes arise, and reliably synthesizing task-specific ``amendments'' and collecting annotations is costly and time-consuming, leading to brittleness and limited adaptability.
We aim to automate the process of discovering the attributes for model improvement, obviating the need for human intervention or explicit domain adaptation.

Automatically discovering the attributes for self-improvement can be seen as a meta-level reasoning process. Recent efforts to induce reasoning capabilities in LMs have often focused on domains such as math and code, where a gold reference answer exists and candidate answers are more easily verifiable \citep{deepseekai2025deepseekr1incentivizingreasoningcapability}. The availability of verifiable responses has also been capitalized on for teaching self-correction \citep{kumar2025training}. In this work, however, we focus on open-ended text generation tasks that are challenging to verify; identifying when a human should intervene to induce a refined response can be especially tricky in such cases.

\begin{figure}[t]
  \centering
\begin{minipage}[t]{0.45\textwidth}
\centering\includegraphics[width=\textwidth]{Figures/flow.png}
  \end{minipage}\hfill
  \begin{minipage}[t]{0.425\textwidth}
\includegraphics[width=\textwidth]{Figures/alpacaeval_gains_v2.png}
  \end{minipage}\hfill

  \caption{%
    We introduce \textbf{\textit{Self-Taught Principle Learning} (STaPLe)}. (Left) Our Monte Carlo EM algorithm alternates between on-policy discovery and learning of latent principles guiding self-correction behavior. The principles may also be clustered into a compressed set, yielding human-interpretable constitutions $\mathcal{C}_t$ and models trained to follow them $\mathcal{M}_t$. (Right) The STaPLe algorithm induces self-improvement in AlpacaEval win-rate over three iterations for all three language models.
  }
  \label{fig:main-figure}

\end{figure}

We introduce a novel approach to discover expressive principles, treating them as latent attributes in the self-correction setting that bridge an initial attempt and a target response. We find that the language model itself serves as an effective principle generator for improving its responses, in contrast to prior work that relies on human annotations or strong model supervision.
We design a Monte Carlo Expectation-Maximization (EM) algorithm, \textbf{\underline{S}elf-\underline{Ta}ught \underline{P}rinciple \underline{Le}arning (STaPLe)}, which first uses rejection sampling to identify principle candidates for self-correction and selects the candidate whose refinement is closest to the gold response, and then trains on these trajectories to learn this principle-guided refinement behavior. Repeating this procedure iteratively yields a model trained on a dynamic constitution of elements that it produced itself, implicitly learning the refinement goal and enabling self-correction at inference time. We also show that the discovered principles can be compressed into a smaller set for human readability by applying hierarchical clustering after the E-step, in a manner akin to posterior regularization, without compromising downstream performance.
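
As a rough sketch of the alternation described above, one iteration might be written as follows. The notation is illustrative and assumed here, not fixed by this section: $x$ denotes a prompt, $y^{1}$ the model's initial attempt, $y^{\star}$ the gold response, $z$ a latent principle, $y^{2}$ a principle-guided refinement, $N$ the number of sampled candidates, and $s(\cdot,\cdot)$ a hypothetical similarity score between a response and the gold. In the E-step, the current model $p_{\theta_t}$ samples candidates and the one closest to the gold is kept:
\[
  z_i \sim p_{\theta_t}(\cdot \mid x, y^{1}), \qquad
  y^{2}_i \sim p_{\theta_t}(\cdot \mid x, y^{1}, z_i), \qquad i = 1, \dots, N,
\]
\[
  i^{\star} = \arg\max_{i} \; s(y^{2}_i, y^{\star}).
\]
In the M-step, the model is trained on the set $\mathcal{D}_t$ of retained trajectories:
\[
  \theta_{t+1} = \arg\max_{\theta} \sum_{(x,\, y^{1},\, z_{i^{\star}},\, y^{2}_{i^{\star}}) \in \mathcal{D}_t} \log p_{\theta}\bigl(z_{i^{\star}}, y^{2}_{i^{\star}} \mid x, y^{1}\bigr).
\]
One plausible acceptance rule, again an assumption rather than a statement of the exact criterion, is to retain a trajectory only when $s(y^{2}_{i^{\star}}, y^{\star}) > s(y^{1}, y^{\star})$, that is, when the refinement moves closer to the gold than the initial attempt.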

We validate the efficacy of this method over several iterations on instruction-following benchmarks including MT-Bench \citep{mt-bench} and AlpacaEval \citep{alpaca_eval}, and leverage Prometheus-v2.0 \citep{prometheus} to analyze win-rates with fine-grained, principle-following rubrics. Our results show that STaPLe outpaces baseline methods such as Self-Taught Reasoner \citep[STaR;][]{zelikman}, modified for non-verifiable responses, and prompted refinement approaches like Self-Refine \citep{madaan}. STaPLe continues to self-improve over multiple iterations before saturating. We also find that training on the clustered principles largely matches or outperforms training on all principles.

Our key contributions can be summarized as follows:

\begin{itemize}
    \item We propose a Monte Carlo EM algorithm for iterative latent principle discovery and learning, to enable language model self-improvement. 
    \item We find that on-policy generated principles are effective stimuli for self-correction in smaller LMs, and that training to learn them improves performance on MT-Bench and AlpacaEval-2.0.

    \item Clustering the set of discovered principles retains most of the full distribution's performance while yielding an interpretable constitution.
\end{itemize}
