\section{LMs Can Self-Improve with Iterative Principle Discovery}\label{sec:section-4}

\subsection{Experimental Setup}\label{sec:expt-setup}

\paragraph{Mixed-Domain Input Corpus.} We form a corpus of 100k samples for the principle discovery phase, consisting of four datasets: Anthropic HH-RLHF \citep{bai2022traininghelpfulharmlessassistant}, UltraFeedback \citep{cui2024ultrafeedbackboostinglanguagemodels}, TL;DR \citep{tldr}, and HotpotQA \citep{yang-etal-2018-hotpotqa}, taken in equal proportion (i.e., 25k samples of each dataset, drawn randomly) and deduplicated by prompt. For preference datasets, we take the chosen response to be the gold answer $y^G$. To run STaPLe, we use the first 50k samples for iteration 1, so as to bootstrap heavily off the first iteration, and then use 10k samples for each iteration thereafter, such that the input prompts are unseen in each iteration.

\paragraph{Models and Hyperparameters.} We evaluate three performant small language models: Llama-3.1-8B-Instruct \citep{grattafiori2024llama3herdmodels, meta-llama-3.1-8b-instruct}, Granite-3.1-8B-Instruct \citep{granite, ibm-granite-3.1-8b-instruct}, and Qwen2.5-7B-Instruct \citep{qwen2025qwen25technicalreport}. We use the all-MiniLM-L6-v2 model \citep{sentence-transformers-all-MiniLM-L6-v2} from SentenceTransformers as the embedding model to compute medoids in our clustering approach. We use the Rouge-L F1 score \citep{lin-2004-rouge} to measure the similarity of candidate responses to the reference answer. We also include an ablation in Appendix \ref{appendix:lm-as-a-judge-sim} using a prompted Phi-4 \citep{phi4} judge to score responses, leveraging additional compute to improve the quality of rejection sampling. We discuss all other major STaPLe algorithm and model training hyperparameters in Appendix \ref{appendix:hypers}.
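
As a reminder of what this similarity captures, the standard Rouge-L F1 between a candidate response $y$ and the gold answer $y^G$ can be written as below. This is the textbook definition, with $\mathrm{LCS}(\cdot,\cdot)$ the length of the longest common subsequence of tokens and $|\cdot|$ the token length; the tokenization and any stemming are implementation details not fixed by this sketch.
\[
P = \frac{\mathrm{LCS}(y, y^G)}{|y|}, \qquad
R = \frac{\mathrm{LCS}(y, y^G)}{|y^G|}, \qquad
F_1(y, y^G) = \frac{2PR}{P + R}.
\]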

\paragraph{Baselines.} We compare our method against several baselines in both the single-iteration and multi-iteration settings, in addition to the scores of each model's initial policy.

\begin{enumerate}
\item Prompted self-refinement to directly produce a self-critique and revision, akin to Self-Refine, without any principle or specific feedback criterion provided a priori.
\item Supervised fine-tuning on the gold responses of the first 50k samples in the mining corpus.
\item Following SCoRe, we adopt a STaR-like baseline for intrinsic self-correction: we apply the STaPLe algorithm and perform supervised fine-tuning on the best refined response (without the principle-based refinement trajectory). This will henceforth be referred to as ``STaR''.
\end{enumerate}
We compare the STaPLe and STaR algorithms over four iterations -- this is performed over the same number of samples per iteration, i.e. 50k samples in the first iteration and 10k samples for each subsequent one. Naturally, the other baselines are performed for a single iteration. 
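
For concreteness, the sample budget implied by this schedule can be tallied as follows. This is a simple worked count, assuming exactly four iterations drawn without overlap from the 100k-sample corpus:
\[
50\mathrm{k} + 3 \times 10\mathrm{k} = 80\mathrm{k} \le 100\mathrm{k},
\]
so, under this assumption, the four-iteration schedule fits within the corpus while keeping the prompts of each iteration disjoint.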

\paragraph{Evaluation.} We evaluate on the MT-Bench \citep{mt-bench} and AlpacaEval-2.0-LC \citep{alpaca_eval, dubois2024length} datasets, instruction-following evaluations designed to reflect alignment abilities of LLMs in chat settings. We also use the Prometheus-8x7B-v2.0 model \citep{prometheus} as a judge on responses to the above datasets and the IFEval \citep{zhou2023instructionfollowingevaluationlargelanguage} dataset, for fine-grained evaluation on principle-following rubrics, with additional experiments included in Appendix \ref{appendix:prometheus-winrates}. At inference time, if a principle was invoked intrinsically given a prompt, the response is parsed so that only the refined generation following the principle proposal is scored; the same is done for the STaR baseline. Otherwise, we score the full generated response, and no special parsing is required. For the Prometheus results, the win-rate is computed with respect to the principle invoked: for example, if the principle is ``Directness'', the judge assesses which of the candidate generation and the initial policy's generation is more direct. Since the STaR baseline does not explicitly invoke a principle, we use the principle invoked for that sample by the STaPLe model.
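
One way to formalize this principle-conditioned win-rate is sketched below. The notation is illustrative rather than part of the algorithm's definition: $x_i$ is the $i$-th prompt, $y_i$ the candidate generation, $y_i^{0}$ the generation from the initial policy, $z_i$ the principle invoked, and $J$ the judge's preferred response. How ties or invalid judgments are handled is not specified by this sketch.
\[
\mathrm{WR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ J(x_i, y_i, y_i^{0}; z_i) = y_i \right].
\]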

\subsection{Results}\label{sec:results}

\paragraph{Latent Principle Learning Improves Response Quality.}

\begin{table}
\footnotesize
  \caption{Comparison of the STaPLe algorithm (unconstrained and constrained) against the baselines. The scores reported below are an average over 5 runs for all benchmarks.}
  \label{table:compare-against-baselines}
  \centering
  \begin{tabular}{cccccc}
    \toprule
    Model    & MT-Bench (avg)  & MT-Bench (T1) & MT-Bench (T2) & AlpacaEval & IFEval WR \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.46 & 8.09 & 6.83 & 26.9 & -- \\
    Self-Refine & 7.40  & 8.05  & 6.75 & 26.1 & 51.2\%   \\
    Gold-only SFT & 7.47 & 8.11  & 6.83 & 26.4 & 56.2\%    \\
    STaR Iter 4   & 7.56 & 8.11 & 7.00 & 31.8 & 62.3\% \\
    STaPLe Iter 4 & \textbf{7.71} & \textbf{8.13} & \textbf{7.30} & 33.4 & 68.9\% \\
    Constrained STaPLe Iter 4  & 7.70 & \textbf{8.13} & 7.28 & \textbf{34.9} & \textbf{69.1\%}  \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.83 & 8.59 & 7.08 & 30.2 & -- \\
    Self-Refine & 7.86 & 8.63  & 7.10 & 31.7 & 57.1\%  \\
    Gold-only SFT  & 7.86 & 8.68  & 7.05 & 30.1 &  55.8\%   \\
    STaR Iter 4   & 7.96 & 8.68 & 7.25 & 35.6 & 62.1\% \\
    STaPLe Iter 4 & \textbf{8.04} & \textbf{8.69} & \textbf{7.41} & 38.4 & 67.6\% \\
    Constrained STaPLe Iter 4   & 8.03 & 8.65 & \textbf{7.41} & \textbf{38.8} & \textbf{68.4\%}  \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 6.83 & 7.34 & 6.31 & 30.4 & -- \\
    Self-Refine & 6.91 & 7.41  & 6.40 & 30.7 & 58.4\%   \\
    Gold-only SFT & 6.89 & 7.43  & 6.35 & 30.0 & 56.9\%    \\
    STaR Iter 4 & 7.14 & 7.63 & 6.66 & 37.8 & 68.4\% \\
    STaPLe Iter 4 & \textbf{7.24} & \textbf{7.64} & \textbf{6.85} & \textbf{40.2} & \textbf{73.4\%} \\
    Constrained STaPLe Iter 4  & 7.22 & 7.60 & 6.84 & 39.9 & 72.1\%  \\
    \bottomrule
  \end{tabular}
\end{table}

The STaPLe algorithm outperforms the baselines on all benchmarks, across all models, as seen in Table \ref{table:compare-against-baselines}. The MT-Bench average exceeds the best baseline by an average of +0.11 over the three models, with the Turn 2 score increasing by an average of +0.22. The AlpacaEval win-rates improve over the initial policy by +5.3--7\% and over the STaR baseline by +1.6--2.8\%. Furthermore, the Prometheus-judged IFEval win-rates on principle-following, measured against the initial policy, exceed those of the STaR baseline by +5--6.6\%. This suggests that training models to \textit{explicitly invoke the principle} as an expressive form of a latent attribute is effective, as opposed to learning it implicitly by simply training on the refined response (the STaR baseline). The Self-Refine baseline improves performance for the Granite and Qwen models, but not for Llama-8B, suggesting that Llama-8B is less effective at zero-shot self-refinement without pre-identified principles. This corresponds with a higher IFEval win-rate for the models with stronger self-refinement abilities.
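
The MT-Bench averages quoted above can be checked directly against Table \ref{table:compare-against-baselines}. The worked arithmetic below takes STaR Iter 4 as the best baseline for each model (it has the highest baseline score in every row group) and lists the models in the order Llama, Granite, Qwen:
\[
\Delta_{\mathrm{avg}} = \frac{1}{3}\left[ (7.71 - 7.56) + (8.04 - 7.96) + (7.24 - 7.14) \right] = \frac{0.15 + 0.08 + 0.10}{3} = 0.11,
\]
\[
\Delta_{\mathrm{T2}} = \frac{1}{3}\left[ (7.30 - 7.00) + (7.41 - 7.25) + (6.85 - 6.66) \right] = \frac{0.30 + 0.16 + 0.19}{3} \approx 0.22.
\]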

\paragraph{Iterative Principle Discovery Enables Self-Improvement.}

The results in Table \ref{table:compare-against-baselines} report the performance of STaPLe at the fourth iteration of our Monte Carlo EM algorithm. STaPLe outpaces the STaR baseline by a sizable margin throughout the execution of both algorithms; we include the full set of results in Appendix \ref{complete_table}. By iteration 3, the STaPLe scores outperform STaR and the initial policy on average across the three models by $+0.16$ and $+0.29$ on MT-Bench (avg.); +3.6\% and +9.2\% on AlpacaEval win-rate; and +7.9\% and +21.0\% on IFEval principle-following win-rate, respectively.
We do observe a slight diminishing returns effect with the STaPLe algorithm, as in iteration 4, the scores either remain at a similar level or drop slightly for Llama-8B and Granite-8B; however, Qwen-7B continues to improve on all three benchmarks. We further analyze principle-following quality in Appendix \ref{appendix:prometheus-winrates} and stepwise win-rates of iteration $t$ against iteration $t-1$ in Appendix \ref{appendix:stepwise-winrates} to reinforce the self-improvement induced by STaPLe. In Appendix \ref{intrinsic-self-correction}, we demonstrate that the model's intrinsic self-correction ability improves over the iterations.

\paragraph{Clustering Balances Interpretability and Performance.}

In Table \ref{table:compare-against-baselines}, we also include the performance of ``constrained'' STaPLe, the version of the algorithm that applies agglomerative clustering after the E-step in each iteration and uses the medoid of each cluster as its representative principle to yield dataset $\widetilde{\mathcal{D}}$. We find that this largely matches the performance of the ``unconstrained'' version, in fact outperforming it in AlpacaEval and IFEval win-rates for Llama-8B and Granite-8B. The full results can be found in Appendix \ref{sec:clustering-methods}, where we ablate across different label replacement schemes (medoids, modes, and a perplexity-based method). For both versions, we observe a strong correlation in the trend (avg. $\rho = 0.95$--$0.96$) between the MT-Bench (avg.) and AlpacaEval results.
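
The medoid-based label replacement can be sketched as follows, under illustrative notation: $C$ is one cluster of principle strings, $e(\cdot)$ is the sentence embedding, and $d$ is a dissimilarity between embeddings. The choice of $d$ (for example, cosine distance) is an assumption of this sketch and is not fixed by the description above.
\[
m(C) = \arg\min_{p \in C} \sum_{q \in C} d\left( e(p), e(q) \right).
\]
Each principle label in $C$ is then replaced by its medoid $m(C)$ when forming $\widetilde{\mathcal{D}}$, so the representative is always a principle the model actually generated rather than a synthetic centroid.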

\begin{figure}[t]
\caption{Principle discovery rates of the STaPLe algorithm in the unconstrained (left) and constrained (right) settings. This represents the fraction of the trajectories saved from the principle discovery process (E-step) that contain a unique principle label that was unseen in previous iterations.}

  \label{fig:principle-discovery-rates}
  \centering
  \begin{minipage}{0.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{Figures/principle-discovery-rate.png}
  \end{minipage}\hfill
  \begin{minipage}{0.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{Figures/constrained-principle-discovery-rate.png}
  \end{minipage}
\end{figure}

\subsection{Analysis of Principle Discovery}\label{sec:principle-discovery-analysis}

It is also valuable to study the nature of the principle discovery process and the model-generated constitutions toward which we have aligned the language model. We include the full constitutions, along with further qualitative analysis of the distribution of their elements under label replacement (the ``density'' of the constitution), in Appendix \ref{appendix:constitutions}. In Figure \ref{fig:size-of-constitution-over-iterations}, we show that the number of principles in the constitution under Constrained STaPLe decreases over the iterations, suggesting that the model converges to a relatively stable distribution of principles. In particular, the size of the constitution by iteration 4 is roughly 50\% of the iteration 1 size, or even smaller.

This finding is reinforced by an analysis of the principle discovery rate -- the fraction of refinement samples with new, unseen principles -- in Figure \ref{fig:principle-discovery-rates}. We show that this rate decreases over the iterations under both the unconstrained and constrained versions of the STaPLe algorithm, suggesting that all models learn to re-use principles accordingly. The observation that constrained STaPLe helps to accelerate this convergence to a condensed set of principles reinforces the motivation behind the introduction of clustering as being akin to a posterior regularization mechanism. This also highlights one of the advantages of using the LM to approximate the posterior distribution, as the changing nature of the learned posterior can be observed over the iterations and elicited via on-policy sampling.
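
One reading of the principle discovery rate at iteration $t$ is written out below. The symbols are illustrative assumptions: $\mathcal{T}_t$ is the set of trajectories saved from the E-step at iteration $t$, $z(\tau)$ is the principle label of trajectory $\tau$, and $\mathcal{Z}_s$ is the set of principle labels observed at iteration $s$.
\[
r_t = \frac{\left| \left\{ \tau \in \mathcal{T}_t : z(\tau) \notin \bigcup_{s < t} \mathcal{Z}_s \right\} \right|}{\left| \mathcal{T}_t \right|}.
\]
A decreasing $r_t$ then means that a growing share of saved trajectories re-use labels already present in earlier iterations.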
