\appendix

\addcontentsline{toc}{section}{Appendix}
\part{\Large{Appendix}} 
\parttoc

\newpage
\section{Formal Description of STaPLe Algorithm}\label{appendix:staple-algorithm}

We provide a full, formal description of the STaPLe algorithm below. We write $y^1$ and $y^2$ with superscripts to avoid confusion with the sample indices. We use general variables for the components that may be ablated: the similarity function $f$, the clustering algorithm $\mathcal{C}$, and the label replacement scheme $\mathcal{R}$. We state the $M$-step in terms of the dataset $D'$ for generality; if clustering is performed, one would use $\widetilde{\mathcal{D}}$ instead.

\input{algorithm_writeup}

\section{Reproducibility Statement}\label{appendix:reproducibility}

In addition to the algorithm description above (Algorithm \ref{alg:STaPLe}) and the experimental details in Section \ref{sec:expt-setup}, we include the hyperparameters used and model training details in Appendix \ref{appendix:hypers} and the prompts used in the STaPLe algorithm in Appendix \ref{appendix:prompts}. We make all evaluation results available in tabular format throughout the main paper and the appendices for comparability. We will also publicly release the code for the STaPLe algorithm to further facilitate reproducibility of our self-improvement method.

\section{STaPLe Hyperparameters and Training Details}\label{appendix:hypers}

\paragraph{STaPLe Algorithm Hyperparameters.} We use a Rouge-L F1 threshold of 0.4 for the similarity function $f(y,y^G)$: if the initial response exceeds this threshold, we do not pursue refinement. For the ablation using a Phi-4 judge in Appendix \ref{appendix:lm-as-a-judge-sim}, the threshold was set to 9 (on a scale of 1--10). The other major hyperparameters involved in the execution of the STaPLe algorithm are $N$, the number of principles to sample, and the distance threshold for the clustering algorithm. STaPLe requires an inference-time budget of $3N+1$ for the Rouge-L version and $4N+2$ for the LLM-as-a-judge version; we set $N=16$ to balance runtime per iteration of the algorithm with sufficient exploration of diverse principles. During principle discovery, we sample principles, critiques, and responses at a temperature of $0.7$; the maximum number of tokens is set at $500$ for principle proposal and critique, and at $1024$ for the refined response. We use $4\times$H100 NVIDIA GPUs for the principle discovery phase, with a separate vLLM \citep{kwon2023efficient} instance per GPU.
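
As a quick sanity check of the budget formulas above, substituting $N=16$ gives
\[3N+1 = 3\cdot 16+1 = 49, \qquad 4N+2 = 4\cdot 16+2 = 66,\]
that is, roughly a $35\%$ larger budget for the LLM-as-a-judge version, under the simplifying assumption that each unit of budget costs about the same.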

We set a distance threshold $\delta$ to avoid fixing a specific target number of clusters when performing agglomerative clustering. The current results involve manually setting this threshold: the authors inspected the resulting set of clusters and, for each of the first three iterations, ensured that there were at least 30 clusters. Given the speed of agglomerative clustering, this is fairly easy to do. For the first iteration, the Euclidean distance thresholds were set at 8 (Llama and Qwen) and 6 (Granite); for iterations 2--4, the thresholds were decreased to 7 and 5, respectively. Alternatively, one could automate this process by designing an objective over the diversity (semantic or surface-level) of the cluster medoid labels and performing a hyperparameter search. This is explored further in Appendix \ref{appendix:bayesian-hypers}, which uses Bayesian hyperparameter optimization tools to identify an appropriate, model-specific distance threshold. The threshold $\tau_{PPL}$ for the perplexity difference label-replacement scheme, described in Appendix \ref{sec:clustering-methods}, was set at $0.2$.

\paragraph{Model Training.} We perform full supervised fine-tuning for 3 epochs at a learning rate of $1\times10^{-6}$ with the AdamW optimizer \citep{loshchilov2018decoupled}, with a sequence length of 4096.
All experiments were performed on $8\times$H100 NVIDIA GPUs.

\section{Derivation of the Monte Carlo EM Gradient}\label{appendix:mc-em-gradient}

Recall that the conditional log-likelihood is defined as:

\[\mathcal{L}(\theta) = \log \sum_{y^2 \in \mathcal{V}^*} \sum_{z \in \mathcal{V}^*} p(y^G \mid x, y^1, z,  y^2) \cdot p(y^2,z \mid x,y^1; \theta)\]


The gradient with respect to this objective is given by

\begin{align}
\nabla_\theta \hspace{0.5mm} \mathcal{L}(\theta) &=  \nabla_\theta \log \sum_{y^2 \in \mathcal{V}^*}\sum_{z \in \mathcal{V}^*} p(y^G \mid x, y^1, z, y^2) \cdot p(y^2,z \mid x,y^1; \theta)\nonumber\\
        &=  \sum_{y^2 \in \mathcal{V}^*}\sum_{z \in \mathcal{V}^*} \frac{p(y^G \mid x, y^1, z, y^2)}{p(y^G \mid x,y^1)}\hspace{0.5mm}\nabla_\theta \hspace{0.5mm} p(y^2,z \mid x,y^1; \theta)\nonumber\\
        &=  \sum_{y^2 \in \mathcal{V}^*}\sum_{z \in \mathcal{V}^*} \frac{p(y^G \mid x,y^1,z,y^2) \cdot p(y^2, z \mid x,y^1; \theta)}{p(y^G \mid x, y^1)}\nabla_\theta \log p(y^2,z \mid x, y^1; \theta)\nonumber\\
        &=  \mathbb{E}_{p(y^2,z \hspace{0.5mm}\mid \hspace{0.5mm} x,y^1,y^G)} \left\{\nabla_\theta \log p(y^2,z \mid x,y^1; \theta)\right\} \label{eq-1}
\end{align}
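
In practice, the expectation in the final line cannot be computed exactly and can be replaced by a sample average. A minimal sketch, assuming $S$ samples $(y^{2,(s)}, z^{(s)})$ drawn (approximately) from the posterior $p(y^2,z \mid x,y^1,y^G)$, where $S$ and the superscript $(s)$ are notation introduced here purely for illustration:
\[\nabla_\theta \hspace{0.5mm} \mathcal{L}(\theta) \approx \frac{1}{S}\sum_{s=1}^{S} \nabla_\theta \log p(y^{2,(s)}, z^{(s)} \mid x, y^1; \theta)\]
Note that the step from the second to the third line of the derivation uses the log-derivative identity $\nabla_\theta \hspace{0.5mm} p = p \hspace{0.5mm} \nabla_\theta \log p$.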

\section{Derivation of Rejection Sampling Rule}\label{appendix:rejection-sampling-rule}

Recall the intractable posterior which we obtain from the MC-EM gradient:

\[p(y^2,z \mid x,y^1,y^G) = \frac{p(y^G \mid x,y^1,z,y^2) \cdot p(y^2,z \mid x,y^1;\theta)}{p(y^G \mid x,y^1)}\]

This can be approximated via Monte Carlo Expectation-Maximization \citep{mc-em}, where sampling techniques are used to obtain samples from the intractable posterior, which are then used to update the model parameters. In particular, we choose rejection sampling \citep{vonNeumann1951RandomDigits} with $\tilde{p}(y^2,z \mid x,y^1, y^G; \theta)$ as the proposal distribution. Given a sample $(y_n, z_n) \sim \tilde{p}(y^2,z \mid x,y^1, y^G; \theta)$, we accept it with probability:

\[p_n = \frac{p(y_n,z_n \hspace{0.5mm}\mid\hspace{0.5mm} x,y^1,y^G; \theta)}{M \cdot \tilde{p}(y_n, z_n \hspace{0.5mm}\mid\hspace{0.5mm} x,y^1, y^G;\theta)}\]

The scaling factor $M$ guarantees that $p_n$ is bounded above by $1$. Formally, we take $M$ to be:
\[M = \max_{y^2 \in \mathcal{V}^*,\, z \in \mathcal{V}^*} \frac{p(y^2,z \mid x,y^1,y^G; \theta)}{\tilde{p}(y^2,z \mid x,y^1, y^G; \theta)} = \frac{1}{p(y^G \mid x,y^1)} \cdot \max_{y^2 \in \mathcal{V}^*,\, z \in \mathcal{V}^*} p(y^G \mid x,y^1,z,y^2)\]

This yields the following acceptance probability for rejection sampling:
\[p_n = \frac{p(y^G \mid x, y^1, z_n,  y_n)}{\max\limits_{y^2 \in \mathcal{V}^*,\, z \in \mathcal{V}^*} p(y^G \mid x, y^1, z,  y^2)}\]

Since the denominator cancels out through marginalization in rejection sampling, all that remains is to specify the validator model $p(y^G \mid x, y^1,z,y^2)$ as an unnormalized distribution; any response matching metric that measures the similarity between $y^G$ and $y^2$ can serve this purpose. In particular, we experiment with Rouge-L similarity and LLM-based similarity judgments. For instance, in the case of Rouge-L, we take a positive increase in score as the acceptance rule, that is:

\[p(y^G \mid x, y^1,z,y^2) \propto \begin{cases}
    f(y^2,y^G) \hspace{0.5mm}-\hspace{0.5mm} f(y^1,y^G),& \text{if } f(y^2,y^G) \hspace{0.5mm}>\hspace{0.5mm} f(y^1,y^G)\\
    0,              & \text{otherwise}
\end{cases}\]
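
As a worked illustration with hypothetical scores (not drawn from our experiments), suppose $f(y^1,y^G) = 0.30$ and two sampled refinements obtain $f(y^{2,a},y^G) = 0.45$ and $f(y^{2,b},y^G) = 0.25$, where the superscripts $a$ and $b$ are labels introduced only for this example. The unnormalized validator weights are then
\[0.45 - 0.30 = 0.15 \quad \text{and} \quad 0,\]
so the second refinement is never accepted, while the first is retained as a candidate.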


\section{Self-Play Equivalence}\label{thrm-1-proof}

The STaPLe Monte Carlo EM approach can equivalently be described through the lens of \textit{self-play}, somewhat akin to SPIN \citep{SPIN}. That is, we can formulate a two-player game wherein the adversary produces a response, and the agent's role is to (1) produce a revised response to the prompt that improves over the adversary's generation relative to the gold, and (2) specify the dimension or aspect on which it improved over the adversary. In the first iteration, we take the same LM to play both roles. In subsequent iterations, given that the policy $\pi_{\theta}$ has now learned self-correction behavior, we take the initial response (the opponent's generation) as the starting point, which we posit to be similar to generations sampled from the base policy $\pi_0$ -- that is, $y_a \sim \pi_{0}(\cdot \mid x) \approx y_b \in (y_b, z, y_c)\sim \pi_{\theta}(\cdot \mid x)$. At the same time, the agent's policy updates to $\pi_{\theta}$, which learns principle-conditioned self-refinement, thus improving the agent's ability to perform its primary objectives.

Formally, we can define the self-play advantage of the refinement over the adversary's generation as \[A(y_2,y_1; x,y^G) = f(y_2,y^G) - f(y_1,y^G)\] Recall that in the STaPLe algorithm, if the agent ``loses'' -- that is, it fails to produce a refinement that improves over the initial response -- the sample is discarded. The nature of the advantage depends on the instantiation of the similarity function $f$; for instance, under exact match, this collapses to a binary indicator. The objective under the self-correction setting is to maximize the expected advantage under $\pi_{\theta}$:

   \[J(\theta) = \mathop{\mathbb{E}}_{y_1,z,y_2 \sim \pi_{\theta}}[A(y_2,y_1;x,y^G)]\]

The score-function gradient is thus:
\begin{align} \label{eq-2}
    \nabla_{\theta}J(\theta) = \mathop{\mathbb{E}}_{y_1,z,y_2 \sim \pi_{\theta}}[A(y_2, y_1;x,y^G)\nabla_{\theta}\log \pi_{\theta}(y_1,z,y_2 \mid x)]
\end{align}


\begin{theorem}[Equivalence of EM and Self-Play Gradients]
Assume the setting of an input $x$, an initial model response $y_1 \sim \pi_{\theta}(\cdot \mid x)$, a latent principle $z \sim \pi_{\theta}(\cdot \mid x,y_1,y^G)$ and critique of $y_1$ with respect to $z$ denoted by $c$, and a refinement $y_2 \sim \pi_{\theta}(\cdot \mid x,y_1,z,c)$. Then, the EM gradient given by Equation \ref{eq-1} is equivalent to the REINFORCE score-function gradient under variance-reduced self-play, given by Equation \ref{eq-2}, under the self-play advantage and the validator assignment \[p(y_2\mid x,y_1) = \textbf{1}(f(y_2,y^G) > f(y_1,y^G))\]
    
\end{theorem}

\begin{proof}
    We begin by marginalizing over the latent principle $z$. By definition, \[p(y_2\mid x, y_1, y^G) = \sum\limits_{z,c} p(y_2,z,c \mid x, y_1, y^G)\]

    By Bayes' rule:
    \[p(y_2,z \mid x,y_1,y^G) = \frac{p(y^G \mid x,y_1,z,y_2) \cdot p(y_2,z \mid x,y_1;\theta)}{p(y^G \mid x,y_1)}\]

    Then, given that only the final term depends on $\theta$, we can rewrite its gradient as \[\nabla_{\theta} \log p(y_2,z\mid x, y_1;\theta) = \nabla_{\theta} \log[\pi_{\theta}(y_1\mid x)\pi_{\theta}(y_2\mid x, y_1)]\]

    Thus, revisiting the EM gradient, we have:
    \begin{align*}
    \nabla_{\theta}\mathcal{L}(y^G,y_1,x,\theta) &= \sum\limits_{y_2} \sum\limits_{z} p(y_2,z\mid x, y_1, y^G; \theta)\nabla_{\theta} \log[p(y_2,z\mid x, y_1;\theta)]\\
    &= \sum_{y_2} p(y_2\mid x,y_1,y^G; \theta)\nabla_{\theta}\log[\pi_{\theta}(y_1\mid x)\pi_{\theta}(y_2\mid x, y_1)]
    \end{align*}

    Next, by Bayes' rule, we can re-express $p(y_2 \mid x, y_1, y^G;\theta)$ in terms of the un-normalized EM validator term:
    \[p(y_2 \mid x, y_1, y^G;\theta) = \frac{p(y^G\mid x,y_1,y_2)\pi_{\theta}(y_2\mid x,y_1)}{Z(x,y_1)} = \frac{p(y_2\mid x, y_1)\pi_{\theta}(y_2\mid x, y_1)}{Z(x,y_1)}\]
    where $Z(x,y_1) = \sum\limits_{y_2} p(y_2 \mid x,y_1) \pi_{\theta}(y_2 \mid x, y_1)$ is the normalization constant, and we take $p(y_2\mid x,y_1) \propto p(y^G\mid x, y_1, y_2)$.

    Thus, we have that the EM gradient takes the form of
    \[\nabla_{\theta}\mathcal{L}(y^G,y_1,x,\theta) = \mathop{\mathbb{E}}_{y_1 \sim \pi_{\theta}}\left[\frac{\sum\limits_{y_2} p(y_2 \mid x,y_1)\pi_{\theta}(y_2 \mid x,y_1)\nabla_{\theta}\log[\pi_{\theta}(y_1\mid x)\pi_{\theta}(y_2\mid x,y_1)]}{\sum\limits_{y_2} p(y_2\mid x,y_1)\pi_{\theta}(y_2\mid x,y_1)}\right]\]

    Next, we consider the assignment of the EM validator to be in terms of the comparison between the initial response $y_1$ and the refined response $y_2$ with respect to $y^G$ over the similarity function $f$. That is, take $p(y_2\mid x, y_1) = \textbf{1}(f(y_2,y^G) > f(y_1,y^G))$, such that we only accept refinements that improve over the initial response relative to the gold. This reflects the STaPLe algorithm's accept/reject criterion. Then, factoring out the summation over $y_2$ into the expectation, and substituting the validator term, we have:
    \[\nabla_{\theta}\mathcal{L}(y^G,y_1,x,\theta) \propto \mathop{\mathbb{E}}_{y_1,y_2 \sim \pi_{\theta}}[\textbf{1}(f(y_2,y^G) > f(y_1,y^G))\nabla_{\theta}\log[\pi_{\theta}(y_1\mid x)\pi_{\theta}(y_2\mid x,y_1)]]\]
    up to the factor $\frac{1}{Z}$, where $Z = \sum\limits_{y_2} p(y_2\mid x,y_1)\pi_{\theta}(y_2\mid x, y_1)$ is the normalization constant. This is a binary reward, and in practice, the selection of the advantage depends on the nature of the similarity function $f$ which is considered. To generalize to real-valued rewards such as Rouge-L, reward models, or LLM-as-a-judge scores, we instead replace this hard indicator with an advantage function $A(y_2,y_1;x,y^G) = f(y_2,y^G) -f(y_1,y^G)$.

    In the canonical REINFORCE self-play setting \citep{dayan1990reinforcement,sutton1984temporal}, the reward $R(\tau)$ over the trajectory $\tau$ is often replaced by an advantage to reduce the variance of the Monte Carlo estimate, introducing $A(\tau) = R(\tau) - b$ for a baseline $b$. This yields a gradient $\nabla_{\theta}J(\theta) = \mathop{\mathbb{E}}[A(\tau)\nabla_{\theta}\log[\pi_{\theta}(\tau)]]$ in practice. In our setting, we simply take the score of the initial response $f(y_1,y^G)$ to be the baseline.

    Performing this substitution in the current form of the EM gradient:
    \[\nabla_{\theta}\mathcal{L}(y^G,y_1,x,\theta) \propto \mathop{\mathbb{E}}_{y_1,y_2 \sim \pi_{\theta}}[A(y_2,y_1;x,y^G)\nabla_{\theta}\log[\pi_{\theta}(y_1\mid x)\pi_{\theta}(y_2\mid x,y_1)]]\]

    Since $z$ was marginalized over, this recovers Equation \ref{eq-2}, the self-play REINFORCE gradient, concluding the proof.
\end{proof}

\newpage
\section{Complete Table: Self-Improvement over Multiple Iterations}\label{complete_table}

In this section, we include the complete tables over four iterations of the STaPLe algorithm, to demonstrate the model's progression of self-improvement. As shown in Table \ref{table:multiple-iters}, STaPLe outpaces the STaR baseline by a substantial margin throughout the execution of both algorithms, despite the improvements of Llama-8B and Granite-8B saturating by the end of iteration 3. While both algorithms have fairly similar MT-Bench Turn-1 scores by iteration 4, the Turn-2 score is substantially higher (average of +0.22) for STaPLe. We observe similar general trends for the models in AlpacaEval win-rate and Prometheus-based IFEval principle-following win-rate as well.

\input{Tables/self-improvement-multi-iters}

\newpage
\section{Prometheus Win-rates on MT-Bench and AlpacaEval}\label{appendix:prometheus-winrates}

Given that in Section \ref{sec:results}, we have sampled responses with intrinsic principle-conditioned self-correction behavior from the language model for MT-Bench and AlpacaEval, we can further study the quality of the Prometheus-8x7B-v2.0 model in producing judgements over a fine-grained rubric. We specifically would like to understand whether the model's responses -- which, as per Tables \ref{table:compare-against-baselines} and \ref{table:multiple-iters}, achieve improvements in score -- actually reflect the principles they invoke. 

This method corresponds to the IFEval principle-following win-rates reported in Section \ref{sec:results}. As such, note that the AlpacaEval win-rate in this section differs from the standard AlpacaEval scoring -- this is the \textbf{\textit{percentage of AlpacaEval (correspondingly MT-Bench) samples on which Prometheus-v2.0 chose the refined response over the base policy's generation, with regard to the principle-following rubric.}} Recall that our STaR baseline also produces an intrinsic self-correction, but without the principle, so we use the principle invoked by the STaPLe model in the Prometheus judge rubric.

\begin{table}[h]
\footnotesize
  \caption{Analysis of the Prometheus-8x7B-v2.0 model's judgements on the self-correction responses of the STaPLe model against the STaR baseline. The baseline win-rate against the base policy is 50\%.}
  \label{table:prometheus-mt-bench-alpacaeval}
  \centering
  \begin{tabular}{ccc}
    \toprule
    Model    & MT-Bench Prometheus Win-rate & AlpacaEval Prometheus Win-rate \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & &  \\
    \midrule
    STaR Iter 1 (28.2k)  & 56.3\% &  54.0\% \\
    STaPLe Iter 1 (28.2k) & 62.5\% &   62.4\% \\
    \midrule
    STaR Iter 2 (6.0k)  & 61.3\% & 58.6\% \\
    STaPLe Iter 2 (6.1k) & 67.5\% & 65.0\% \\
    \midrule
    STaR Iter 3 (6.1k)  & 62.5\% & 61.1\% \\
    STaPLe Iter 3 (6.3k) & \textbf{71.3\%} & \textbf{68.7\%} \\
    \midrule
    STaR Iter 4 (6.3k)  & 66.3\% & 62.4\% \\
    STaPLe Iter 4 (6.6k) & 70.0\% & 64.6\% \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & \\
    \midrule
    STaR Iter 1 (24.1k) & 57.5\% & 56.1\% \\
    STaPLe Iter 1 (24.1k) & 63.8\% & 62.1\% \\
    \midrule
    STaR Iter 2 (5.4k)  & 60.0\% & 60.1\% \\
    STaPLe Iter 2 (5.2k) & 68.8\% & 65.6\% \\
    \midrule
    STaR Iter 3 (5.9k)  & 63.8\% & 62.2\% \\
    STaPLe Iter 3 (5.9k) & 72.5\% & \textbf{69.3\%} \\
    \midrule
    STaR Iter 4 (6.2k)   & 66.3\% & 63.0\% \\
    STaPLe Iter 4 (6.3k) & \textbf{73.8\%} & 68.7\% \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} & & \\
    \midrule
    STaR Iter 1 (30.9k) & 60.0\% & 58.6\% \\
    STaPLe Iter 1 (30.9k) & 67.5\% &   65.2\%  \\
    \midrule
    STaR Iter 2 (6.5k)  & 63.8\% & 63.1\% \\
    STaPLe Iter 2 (6.5k) & 71.3\% & 68.8\% \\
    \midrule
    STaR Iter 3 (7.1k)   & 67.5\% & 65.3\% \\
    STaPLe Iter 3 (7.0k) & 75.0\% & 70.7\% \\
    \midrule
    STaR Iter 4 (7.1k)  & 71.3\% & 65.6\% \\
    STaPLe Iter 4 (7.1k) & \textbf{76.3\%} & \textbf{71.3\%} \\
    \bottomrule
  \end{tabular}
\end{table}

Both algorithms yield gains over the base policy in win-rate, with STaPLe outperforming STaR across all iterations. Interestingly, we find that on MT-Bench, the STaR baseline continues to increase by a sizable amount (2.5--3.8 points) in iteration 4, unlike the true MT-Bench score and the other benchmarks reported in Table \ref{table:multiple-iters}. By contrast, training over principles in the unconstrained STaPLe yields a smaller gain for Granite-8B and Qwen-7B, and a slight drop for Llama-8B, although STaPLe still outperforms STaR by 7.5--8.8 points in iteration 3 and 3.7--7.5 points in iteration 4. Granite-8B does appear to improve in MT-Bench win-rate in iteration 4, despite the average MT-Bench score dropping (as can be seen in Table \ref{table:multiple-iters}). However, given the small size of the dataset (80 samples), this could be a product of noise, unlike for larger datasets such as IFEval (541 samples) and AlpacaEval (805 samples). On AlpacaEval, we observe a similar trend, albeit one more consistent with the AlpacaEval scores reported in Table \ref{table:multiple-iters}.
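
As a rough check on the noise argument, consider the binomial standard error $\sqrt{p(1-p)/n}$ of a win-rate, under the simplifying assumptions (ours, for illustration) that samples are judged independently and that the true win-rate is near $p = 0.5$:
\[\sqrt{\tfrac{0.25}{80}} \approx 0.056, \qquad \sqrt{\tfrac{0.25}{541}} \approx 0.021, \qquad \sqrt{\tfrac{0.25}{805}} \approx 0.018.\]
That is, roughly $5.6$ points on MT-Bench compared to about $2$ points on IFEval and AlpacaEval; moreover, a single MT-Bench sample accounts for $1/80 = 1.25$ points of win-rate.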


\section{Stepwise Win-rate Analysis}\label{appendix:stepwise-winrates}

Recall that the Prometheus win-rates reported thus far compare against generations from each model's initial policy (instruct model) $\pi_0$. To confirm that the model's generations continue to improve in principle-following quality over the iterations, we instead compare the iteration $t$ model's generations against those of iteration $t\hspace{-0.5mm}-\hspace{-0.5mm}1$ in the Prometheus judgement setup. Given that our primary focus in Tables \ref{table:compare-against-baselines} and \ref{table:multiple-iters} was on IFEval, we recompute these win-rates against the initial response in the trained STaPLe model's own generated self-correction trajectories. In iteration 1, the comparison is against the base policy, and thus the win-rates reported are the same as in the aforementioned tables.

\begin{table}[h]
\footnotesize
  \caption{Stepwise win-rates over the iterations of the unconstrained STaPLe algorithm with the Prometheus-v2.0 judge. Instead of comparing against the initial (instruction-tuned) policy for all iterations, this judge compares against the responses sampled from the previous iteration's policy. }
  \label{table:prometheus-stepwise}
  \centering
  \begin{tabular}{cc}
    \toprule
    Model    & IFEval Prometheus Win-rate  \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} &   \\
    \midrule
    STaPLe Iter 1 (28.2k) & 65.6\%   \\
    \midrule
    STaPLe Iter 2 (6.1k) & 58.2\%  \\
    \midrule
    STaPLe Iter 3 (6.3k) & 54.3\%  \\
    \midrule
    STaPLe Iter 4 (6.6k) & 49.4\% \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} &  \\
    \midrule
    STaPLe Iter 1 (24.1k) & 65.1\% \\
    \midrule
    STaPLe Iter 2 (5.2k) & 62.3\% \\
    \midrule
    STaPLe Iter 3 (5.9k) & 58.0\% \\
    \midrule
    STaPLe Iter 4 (6.3k) & 47.9\%  \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} &  \\
    \midrule
    STaPLe Iter 1 (30.9k) &  68.2\%    \\
    \midrule
    STaPLe Iter 2 (6.5k) & 61.2\% \\
    \midrule
    STaPLe Iter 3 (7.0k) & 63.4\%  \\
    \midrule
    STaPLe Iter 4 (7.1k) & 60.8\% \\
    \bottomrule
  \end{tabular}
\end{table}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\linewidth]{Prompts/stepwise-winrates-ifeval-prometheus.png}
    \caption{Visualization of Table \ref{table:prometheus-stepwise}, comparing against the 50\% baseline. While the win-rate exceeds 50\%, the model continues to self-improve.}
    \label{fig:ifeval-prometheus-stepwise}
\end{figure}

Note that a win-rate of 50\% indicates that responses generated under $\pi_{t}$ were equally preferred to responses generated under $\pi_{t-1}$. As such, win-rates remaining above 50\% by a sizable margin are further evidence of the model's self-improvement. These stepwise win-rates are also a useful signal, behaving like an ``elbow'' method, for determining when to terminate the STaPLe algorithm. For instance, observing that the win-rates drop below 50\% for the Llama-8B and Granite-8B models in iteration 4 suggests that their responses degraded compared to their prior iteration's responses (although Llama-8B is only marginally below 50\%). On the other hand, Qwen's win-rates remain above 60\% throughout, suggesting that there is perhaps potential to continue its self-improvement for additional iterations. We plot this progression in Figure \ref{fig:ifeval-prometheus-stepwise} for a visual representation of this selection process.

\newpage
\section{Intrinsic Self-Correction}\label{intrinsic-self-correction}

Given that the trained STaPLe model performs intrinsic self-correction -- given a prompt, it produces an initial response, invokes a principle to improve it, and improves the response, without an external stimulus or re-prompting -- we can analyze the advantage between the model's initial and final responses. We do this using the Prometheus-v2.0 judge, on IFEval prompts, to give a binary preference between the initial response and final response on principle-following, using the same judge prompt as in other experiments in Tables \ref{table:compare-against-baselines}--\ref{table:prometheus-stepwise}. The results are found in Table \ref{table:self-correction-winrate}. We find that the win-rates do improve over the iterations, reinforcing the claim that STaPLe-trained models learn intrinsic self-correction behavior. These win-rates are also consistent with our prior findings that the Llama-8B and Granite-8B models degrade in iteration 4, while Qwen-7B continues to improve.

\begin{table}[h]
\footnotesize
  \caption{Prometheus-v2.0 win-rate in comparing the model-generated initial and refined responses, on the basis of which response better reflects the principle invoked for unconstrained STaPLe.}
  \label{table:self-correction-winrate}
  \centering
  \begin{tabular}{ccc}
    \toprule
    Model    & IFEval Prometheus Win-rate  \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & &  \\
    \midrule
    STaPLe Iter 1 (28.2k) & 72.6\% &   \\
    \midrule
    STaPLe Iter 2 (6.1k) & 74.3\% & \\
    \midrule
    STaPLe Iter 3 (6.3k) & \textbf{75.0\%} &  \\
    \midrule
    STaPLe Iter 4 (6.6k) & 73.4\% & \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & \\
    \midrule
    STaPLe Iter 1 (24.1k) & 76.5\% & \\
    \midrule
    STaPLe Iter 2 (5.2k) & 77.1\% & \\
    \midrule
    STaPLe Iter 3 (5.9k) & \textbf{83.2\%} & \\
    \midrule
    STaPLe Iter 4 (6.3k) & 77.8\% &  \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} & & \\
    \midrule
    STaPLe Iter 1 (30.9k) & 75.8\% &     \\
    \midrule
    STaPLe Iter 2 (6.5k) & 78.0\% & \\
    \midrule
    STaPLe Iter 3 (7.0k) & 79.7\% &  \\
    \midrule
    STaPLe Iter 4 (7.1k) & \textbf{82.1\%} & \\
    \bottomrule
  \end{tabular}
\end{table}

\begin{figure}[h]
\caption{STaPLe refinement rates across 4 iterations for the unconstrained STaPLe algorithm. This represents the fraction of samples in the mining corpus on which at least one principle-conditioned refinement attempt improved over the initial response.}
  \label{fig:STaPLe-refinement-rates}
  \centering
  \includegraphics[width=0.7\linewidth]{Figures/Refinement-rate-plot.png}
\end{figure}

\begin{figure}[h]
    \centering
    \caption{STaPLe refinement rates across 4 iterations for the constrained STaPLe algorithm. This represents the fraction of samples in the mining corpus on which at least one principle-conditioned refinement attempt improved over the initial response.}
    \label{fig:constrained-STaPLe-refinement-rates}
    \includegraphics[width=0.7\linewidth]{Figures/Constrained-refinement-rate-plot.png}
\end{figure}

We compare the refinement rates between the unconstrained and constrained versions of STaPLe in Figures \ref{fig:STaPLe-refinement-rates} and \ref{fig:constrained-STaPLe-refinement-rates}. We observe a similar trend, where Qwen-7B starts with the highest rate (above 0.61) and remains the highest throughout. The refinement rates for Llama-8B and Granite-8B increase similarly in both versions, although they are lower by iteration 4 in the constrained version. In the unconstrained version (Figure \ref{fig:STaPLe-refinement-rates}), the Granite refinement rate spikes during iteration 3 principle discovery (the E-step), which we do not see in the constrained version.

\newpage
\section{Model-Generated Constitutions}\label{appendix:constitutions}

For each model, we include the constitution generated, and a histogram of the densities of each element taught during the final iteration of training. This histogram denotes the number of samples in the cluster for which each principle serves as a representative. 

\subsection{Granite-3.1-8B-Instruct-Generated Constitution}

\input{Constitutions/granite_constitution}

 \begin{figure}[h]
\caption{Breakdown of the Granite-3.1-8B-Instruct iteration 3 model-generated constitution in terms of the number of elements in each cluster. The label on the x-axis denotes the cluster representative element (medoid). The counts also denote the number of fine-tuning samples containing this principle in the augmented dataset $\widetilde{\mathcal{D}}$, following label replacement in the trajectories. We use ellipses for the sake of readability.}

  \label{fig:granite-constitution-histogram}
  \centering
  \includegraphics[width=0.93\linewidth]{Figures/granite_principles_histogram.png}
\end{figure}


In particular, we observe that the ``Clarity and Conciseness'' and ``Empathy and Compassion'' principles are the most emphasized, likely as a result of mining corpus domains including summarization (TL;DR) and harmlessness (HH-RLHF). The phrase ``Emphasize ...'' is repeated fairly often, albeit in different contexts. This reflects the model's stylistic preferences for principles that aid it in self-correcting, one of the key reasons for using on-policy-generated principles in the STaPLe algorithm, rather than introducing ``supervision'' from a stronger model in an off-policy fashion. We also repeat the Granite-generated constitution in Figure \ref{fig:granite-constitution-histogram}, to ease direct comparison of the constitutions across models here in the Appendix.

\subsection{Llama-3.1-8B-Instruct-Generated Constitution}

\input{Constitutions/llama_constitution}

 \begin{figure}[h]
\caption{Analysis of the Llama-3.1-8B-Instruct iteration 4 constitution. We use ellipses for brevity, as in Figure \ref{fig:granite-constitution-histogram}, given the corresponding full principles may be found above.}

  \label{fig:llama-constitution-histogram}
  \centering
  \includegraphics[width=0.97\linewidth]{Figures/llama_principles_histogram.png}
\end{figure}

We observe that, at face value, the elements in the Llama-8B constitution are more ``high-level'', akin to some of the elements in works such as Constitutional AI and Dromedary \citep{bai2022constitutionalaiharmlessnessai, dromedary}. As with Granite, the majority of the mass is placed on elements with the premise of ``Conciseness and Clarity'' (simply swapping the order), as well as ``Empathy and Emotional Validation'', which is fairly similar to ``Empathy and Compassion'' from the Granite-8B constitution. A new element that appears fairly often ($\approx 800$ instances) is ``Directness and Assertiveness''.

\newpage
\subsection{Qwen2.5-7B-Instruct-Generated Constitution}

\input{Constitutions/qwen_constitution}

 \begin{figure}[h]
\caption{Analysis of the Qwen2.5-7B-Instruct iteration 4 constitution. We use ellipses for brevity, as in Figures \ref{fig:granite-constitution-histogram} and \ref{fig:llama-constitution-histogram}, given the corresponding full principles may be found above. }

  \label{fig:qwen-constitution-histogram}
  \centering
  \includegraphics[width=\linewidth]{Figures/qwen_principles_histogram.png}
\end{figure}

Qwen-7B appears to generate a larger constitution than the other models, despite discovering fewer new principles in subsequent iterations, as corroborated by Figure \ref{fig:principle-discovery-rates}. At face value, we find the constitution to be less diverse in its phrasing, given that many of the principles contain ``clarity'' or ``clarify''. However, the contexts in which these terms are used vary quite drastically, e.g. ``Clarity of Information and Timeline Accuracy'' differs greatly from ``Clarity and Specificity in Cultural Context''; this is akin to the phrase ``Emphasize'' noted earlier in the Granite constitution. As such, we still find this to be an appropriate constitution, especially when coupled with the gains that Qwen2.5-7B yields extending into the fourth iteration of STaPLe.

\subsection{Number of Clusters over the Iterations of STaPLe Algorithm}\label{appendix:analysis-of-constitutions}

 \begin{figure}[h]
\caption{We plot the size of the constitutions generated under Constrained STaPLe with the medoids label replacement scheme.}

  \label{fig:size-of-constitution-over-iterations}
  \centering
  \includegraphics[width=0.8\linewidth]{Figures/size_of_constitutions.png}
\end{figure}

We observe that the size of the Qwen2.5-7B-generated constitution is larger throughout the iterations, although all models converge to a roughly fixed size, with the gap in size between the iteration 3 and iteration 4 constitutions being minimal. By iteration 4, the constitution is roughly 50\% smaller (or more) than the iteration 1 constitution, suggesting that the learned distribution is converging to a stable set surrounding this constitution. This is also consistent with Figure \ref{fig:principle-discovery-rates}, where we show that the number of new principles discovered decreases over the iterations.


\section{STaPLe Algorithm Prompts}\label{appendix:prompts}

\subsection{Principle Mining Prompt}

\input{Prompts/principle-mining-prompt}

\subsection{Critique Generation Prompt}

\input{Prompts/critique-generation-prompt}

\subsection{Principle-Conditioned Refinement Prompt}

\input{Prompts/principle-refinement-prompt}

\newpage
\section{Ablations}
\subsection{Label Replacement Method}\label{sec:clustering-methods}

We include a thorough investigation into the performance of the STaPLe algorithm under different label replacement methods. In particular, in addition to the medoid method outlined in Section \ref{sec:pr-clustering}, we explore using the mode of each cluster, based on the counts of principles invoked, and an augmentation of the medoid scheme, where we only perform the label replacement if the difference in perplexity of the trajectory is bounded by a threshold $\tau_{PPL}$, which we set to 0.2. 

\begin{equation}
    \tilde{Z}_{medoid} = \{m_k: m_k = \arg\min\limits_{m \in C_k} \sum_{j \in C_k} \|e_m - e_j\|_2, \hspace{0.5mm} k \in [1,K]\}
    \tag{Medoid Representatives}
\end{equation}

\begin{equation}
  \tilde{Z}_{mode} = \bigl\{\,m_k : 
     m_k = \arg\max_{z\in C_k}\sum_{j\in C_k}\mathbf{1}(z_j = z)
     \,,\quad k=1,\dots,K\bigr\}
  \tag{Mode Representatives}
\end{equation}
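
To make the two representative choices concrete, consider a hypothetical cluster $C_k$ (a toy example of our own, not drawn from the experiments) containing three distinct principles whose embeddings lie on a line at $e_1 = 0$, $e_2 = 1$, and $e_3 = 3$. Assuming the medoid is computed over the distinct principle embeddings, the summed distances are
\[
\sum_{j \in C_k} \|e_1 - e_j\|_2 = 1 + 3 = 4, \qquad
\sum_{j \in C_k} \|e_2 - e_j\|_2 = 1 + 2 = 3, \qquad
\sum_{j \in C_k} \|e_3 - e_j\|_2 = 3 + 2 = 5,
\]
so the medoid representative is the second principle. If, in the same illustrative cluster, the three principles were invoked in $4$, $3$, and $2$ samples respectively, the mode representative would instead be the first principle, since it maximizes the count $\sum_{j \in C_k}\mathbf{1}(z_j = z)$. In this toy case the medoid remains the second principle even if each sample is counted separately in the distance sum (the weighted sums are $9$, $8$, and $18$). The two schemes can therefore disagree whenever the most frequently invoked principle is not the most central one in embedding space.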

For the cluster medoid and mode label-replacement methods, we simply retrieve the cluster to which sample $i$ belongs and replace $\hat{z}_i$ with that cluster's representative $\tilde{z}_i$ from $\tilde{Z}_{medoid}$ or $\tilde{Z}_{mode}$, respectively. For the third method, define the perplexity of a sequence $S$ under the iteration-$t$ language model $M_t$ to be $\mathrm{PPL}(S;\theta_t) = \exp\bigl(-\frac{1}{|S|}\sum\limits_{j=1}^{|S|} \ln P_{\theta_t}(S_j \mid S_{<j})\bigr)$. We then compute the perplexity of the two sequences consisting of the input $x_i$, initial response $y_{i,1}$, principle candidate ($\hat{z}_i$ or $\tilde{z}_i$ from $\tilde{Z}_{medoid}$), critique based on the principle ($c_{\hat{z}_i}$ or $c_{\tilde{z}_i}$, respectively), and the refined response $y_{i,2}$ -- denote these sequences $S_{i,\hat{z}_i}$ and $S_{i, \tilde{z}_i}$, respectively. If the difference in perplexity between these two sequences, $\mathrm{PPL}(S_{i,\tilde{z}_i};\theta_t) - \mathrm{PPL}(S_{i,\hat{z}_i};\theta_t)$, does not exceed the threshold $\tau_{PPL}$, we replace $\hat{z}_i$ with $\tilde{z}_i$ for $\widetilde{\mathcal{D}}$; else, we discard sample $i$. Intuitively, this means that all samples in $\widetilde{\mathcal{D}}$ under this perplexity difference scheme are those where the cluster medoid representative is nearly as good as, if not better than, the original principle, based on the likelihood of generating the sequence, including the refined response. Formally, the set of principles retained is:
\[
\tilde{Z}_{PPL} \;=\;\bigl\{\,
\tilde{z}_i \in \tilde{Z}_{medoid}
\;\bigm|\;
\mathrm{PPL}(S_{i,\tilde{z}_i}; \theta_t)
- \mathrm{PPL}(S_{i,\hat{z}_i}; \theta_t)
\;\leq\;\tau_{PPL},
\;\; i \in [1,|\mathcal{D}'|]
\,\bigr\}
\]
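
As an illustrative calculation (with made-up token probabilities and four-token sequences, far shorter than real trajectories), suppose every token of $S_{i,\hat{z}_i}$ has conditional probability $0.5$, so that $\mathrm{PPL}(S_{i,\hat{z}_i};\theta_t) = \exp(-\ln 0.5) = 2$. If the medoid-labeled sequence has token probabilities $0.5, 0.5, 0.5, 0.4$, then
\[
\mathrm{PPL}(S_{i,\tilde{z}_i};\theta_t) = \bigl(0.5^3 \cdot 0.4\bigr)^{-1/4} = 0.05^{-1/4} \approx 2.115,
\]
and the difference of roughly $0.115$ is below $\tau_{PPL} = 0.2$, so the sample would be kept with the medoid label. If instead the last token had probability $0.25$, then
\[
\mathrm{PPL}(S_{i,\tilde{z}_i};\theta_t) = \bigl(2^{-5}\bigr)^{-1/4} = 2^{5/4} \approx 2.378,
\]
and the difference of roughly $0.378$ exceeds $\tau_{PPL}$, so the sample would be discarded. Note that a medoid label which lowers the perplexity yields a negative difference and always passes the test.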

Regardless of the scheme, this results in a dataset $\widetilde{\mathcal{D}}$ of tuples $(x_i, y_{i,1}, \tilde{z}_i, y_{i,2})$, where $|\widetilde{\mathcal{D}}| \leq |\mathcal{D}'|$ for the perplexity method (with equality otherwise). 


The results of this analysis are included in Table \ref{table:full-label-replacement-table}. We find that using the medoid outperforms using the mode or the perplexity scheme (denoted PPL) across nearly all experiments, with the exception of Granite-8B iteration 4 for MT-Bench (average) and Qwen-7B in iteration 4 for AlpacaEval. That being said, the values across the schemes are generally close to one another, and follow a similar trend to the unconstrained version of STaPLe, suggesting that they are all viable principle cluster labels that may be taught to the LM. As noted in Section \ref{sec:results}, STaPLe with clustering generally avoids the degree of performance degradation seen in the unconstrained version for iteration 4 with Llama-8B and Granite-8B; this extends to the other two label replacement schemes as well. Revisiting the posterior regularization formulation as defined in Section \ref{sec:pr-clustering}, placing mass on the reduced number of elements induced by the clustering thus does seem to have a regularizing effect. 

\begin{table}
\footnotesize
  \caption{Comparison of the label replacement schemes proposed in Section \ref{sec:pr-clustering}, against the unconstrained experiments in Table \ref{table:multiple-iters}, all with the STaPLe algorithm.}
  \label{table:full-label-replacement-table}
  \centering
  \begin{tabular}{ccccccl}
    \toprule
    Model    & MT-Bench (avg)  & MT-Bench (T1) & MT-Bench (T2) & AlpacaEval & IFEval WR \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.46 & 8.09 & 6.83 & 26.9 & -- \\
    \midrule
    Unconstrained Iter 1 (28.2k) & 7.66 & 8.15  & 7.16 & 32.2 &  65.6\%   \\    
    Medoids Iter 1 (28.2k)  & 7.63 & 8.14 & 7.11 & 31.9 & 65.1\%  \\
    Modes Iter 1 (28.2k) & 7.59 & 8.10 & 7.09 & 31.2 & 64.5\% \\
    PPL  Iter 1 (28.2k) & 7.62 & 8.14 & 7.09 & 31.1 & 64.3\% \\
    \midrule
    Unconstrained Iter 2 (6.1k) & \textbf{7.74} & \textbf{8.19} & 7.29 & 34.4 & 66.2\% \\
    Medoids Iter 2 (6.0k)  & 7.70 & 8.15 & 7.25 & 34.6 & 66.0\%  \\
    Modes Iter 2 (6.0k) & 7.66 & 8.14 & 7.18 & 33.8 & 65.1\% \\
    PPL  Iter 2 (5.8k) & 7.65 & 8.14 & 7.16 & 34.0 & 65.4\% \\
    \midrule
    Unconstrained Iter 3 (6.3k) & \textbf{7.74}& 8.16 & \textbf{7.31} & 35.6 & 68.8\%  \\
    Medoids Iter 3 (6.2k)  & 7.72 & 8.16& 7.28 & \textbf{35.7} & 68.4\%  \\
    Modes Iter 3 (6.2k) & 7.66 & 8.14 & 7.18 & 34.9 & 66.0\% \\
    PPL  Iter 3 (6.1k) & 7.68 & 8.13 & 7.23& 35.2 & 66.5\% \\
    \midrule
    Unconstrained Iter 4 (6.6k) & 7.71 & 8.13 & 7.30 & 33.4 &  68.9\% \\
    Medoids Iter 4 (6.4k)  & 7.70 & 8.13 & 7.28 & 34.9 & \textbf{69.1\%}  \\
    Modes Iter 4 (6.3k) & 7.63 & 8.13 & 7.14 & 34.1 & 66.7\% \\
    PPL  Iter 4 (6.1k) & 7.68 & 8.11 & 7.25 & 33.7 & 66.7\% \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.83 & 8.59 & 7.08 & 30.2 & -- \\
    \midrule
    Unconstrained Iter 1 (24.1k) & 7.99 & 8.69  & 7.29 & 36.7 &   65.1\% \\
    Medoids Iter 1 (24.1k)  & 7.98& 8.66& 7.30 & 36.2 & 64.9\%  \\
    Modes Iter 1 (24.1k) & 7.94 & 8.69 & 7.19 & 35.8& 64.0\% \\
    PPL  Iter 1 (24.1k) & 7.93 & 8.64 & 7.23 & 35.2 & 63.3\% \\
    \midrule
    Unconstrained Iter 2 (5.2k) & 8.04 & 8.74 & 7.34 & 38.9 & 65.2\% \\
    Medoids Iter 2 (5.1k)  & 8.01 & 8.68 & 7.35 & 38.7 & 67.3\%  \\
    Modes Iter 2 (5.1k) & 7.98 & 8.71 & 7.25 & 37.8 & 65.6\% \\
    PPL  Iter 2 (4.8k) & 7.99 & 8.65 & 7.33 & 38.1 & 66.7\% \\
    \midrule
    Unconstrained Iter 3 (5.9k) & \textbf{8.06} & \textbf{8.75} & 7.38 & \textbf{39.8} & \textbf{71.6\%} \\
    Medoids Iter 3 (5.4k)  & \textbf{8.06} & 8.74 & 7.39 & 39.4 & 69.9\%   \\
    Modes Iter 3 (5.3k) & 8.02& 8.74 & 7.30 & 38.9 & 68.0\%  \\
    PPL  Iter 3 (5.2k) & 8.05 & 8.73 & 7.38 & 39.1 & 68.6\% \\
    \midrule
     Unconstrained Iter 4 (6.3k) & 8.04 & 8.66 & \textbf{7.41} & 38.4 & 67.6\%  \\
    Medoids Iter 4 (5.8k)  & 8.03 & 8.65 & \textbf{7.41} & 38.8 & 68.4\%  \\
    Modes Iter 4 (5.5k) & 8.01 & 8.68 & 7.35 & 37.3 & 67.1\% \\
    PPL  Iter 4 (5.3k) & 8.04 & 8.65 & 7.43 & 38.2 & 67.7\% \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} & & & &  & \\
    \midrule
    Initial Policy & 6.83 & 7.34 & 6.31 & 30.4 & -- \\
    \midrule
    Unconstrained Iter 1 (30.9k) & 7.03 & 7.48  & 6.59 & 37.3 & 68.2\%     \\
    Medoids Iter 1 (30.9k)  & 6.99 & 7.43 & 6.55 & 36.5 & 67.3\%  \\
    Modes Iter 1 (30.9k) & 6.97 & 7.43 & 6.51 & 36.3 & 67.3\% \\
    PPL  Iter 1 (30.9k) & 6.97 & 7.40 & 6.54 & 36.5 & 66.9\% \\
    \midrule
    Unconstrained Iter 2 (6.5k) & 7.14 & 7.55 & 6.73 & 39.4 & 66.2\% \\
    Medoids Iter 2 (6.5k)  & 7.10 & 7.46 & 6.74 & 38.9 & 68.4\%  \\
    Modes Iter 2 (6.5k) & 7.08 & 7.48 & 6.68 & 38.5& 67.3\% \\
    PPL  Iter 2 (6.3k) & 7.09 & 7.46 & 6.73 & 38.5& 67.7\% \\
    \midrule
    Unconstrained Iter 3 (7.0k) & 7.20 & 7.63 & 6.78 & 39.8 & 72.5\%  \\
   Medoids Iter 3 (6.9k)  & 7.17 & 7.54 & 6.80 & 39.8 & 70.4\%  \\
    Modes Iter 3 (6.9k) & 7.12 & 7.54 & 6.70 & 39.2 & 68.8\% \\
    PPL  Iter 3 (6.8k) & 7.15 & 7.53 & 6.78 & 39.6 & 69.7\% \\
    \midrule
    Unconstrained Iter 4 (7.1k) & \textbf{7.24} & \textbf{7.64} & \textbf{6.85} & \textbf{40.2} & \textbf{73.4\%} \\
    Medoids Iter 4 (7.2k)  & 7.22 & 7.60 & 6.84 & 39.9 & 72.1\%  \\
    Modes Iter 4 (7.1k) & 7.14 & 7.56 & 6.73 & 39.1 & 69.7\% \\
    PPL  Iter 4 (7.1k) & 7.17 & 7.55 & 6.79 & 40.0 & 71.0\% \\
    \bottomrule
  \end{tabular}
\end{table}

\subsection{LLM-as-a-Judge Rejection Sampling}\label{appendix:lm-as-a-judge-sim}

We note in Sections \ref{sec:algorithm} and \ref{sec:expt-setup} that we use the Rouge-L F1 score as the similarity scoring metric between a candidate response and the gold reference. We find this method to work well in practice, as shown by the results thus far. Nonetheless, under the recent paradigm of using an LLM-as-a-judge \citep{mt-bench}, one could use a stronger model as a judge to score closeness to the gold, provided that one is willing to expend the inference-time compute to do so. We explore this setup using the Phi-4 model \citep{phi4}, a 14B parameter model which reduces the latency of performing $N+1$ judge queries (one per refined response, along with the initial response) compared to a larger model such as Mixtral-8x22B or Llama-3.1-405B-Instruct. We use a score threshold of 9 on a scale from 1-10 for the initial response -- that is, if the judge assigns a score of 8 or lower, we proceed to refinement. 
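
For reference, the Rouge-L F1 score is commonly defined through the length of the longest common subsequence (LCS) between the candidate $y$ and the gold $y^G$; the exact tokenization and any stemming depend on the Rouge implementation used, so the following is a sketch under that common definition:
\[
P = \frac{\mathrm{LCS}(y, y^G)}{|y|}, \qquad
R = \frac{\mathrm{LCS}(y, y^G)}{|y^G|}, \qquad
F_1 = \frac{2PR}{P+R}.
\]
As a toy example with hypothetical lengths, a candidate of $5$ tokens and a gold of $8$ tokens sharing an LCS of $3$ tokens give $P = 0.6$, $R = 0.375$, and $F_1 = 0.45/0.975 \approx 0.462$, which exceeds the threshold of $0.4$, so this response would not be refined. In terms of cost, with the budgets stated in Appendix \ref{appendix:hypers} and $N=16$, the Rouge-L version requires $3N+1 = 49$ inference calls per sample and the judge version requires $4N+2 = 66$; the difference of $N+1 = 17$ corresponds to the judge queries.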

\newpage
\subsubsection{Judge Prompt for Similarity Scoring}

We leverage a judge prompt adapted from \cite{katsis2025mtragmultiturnconversationalbenchmark}, focusing on comparison against the reference answer rather than faithfulness to a grounding document. 

\input{Prompts/phi4-judge-prompt}

\subsubsection{Results}

\begin{table}[h]
\footnotesize
  \caption{Self-improvement with the Constrained STaPLe algorithm using a Phi-4 model as a judge to score similarity to the gold response for rejection sampling. We include Constrained STaPLe with Rouge-L, denoted ``STaPLe w/ Rouge'', for direct comparison. }
  \label{table:phi4-judge-results}
  \centering
  \begin{tabular}{ccccccl}
    \toprule
    Model    & MT-Bench (avg)  & MT-Bench (T1) & MT-Bench (T2) & AlpacaEval & IFEval WR \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.46 & 8.09 & 6.83 & 26.9 & -- \\
    \midrule
    STaR Iter 1 (28.2k)  & 7.43  & 8.04  & 6.81 & 29.1 & 55.5\%   \\
    STaPLe w/ Rouge Iter 1 (28.2k)  & 7.63 & 8.14 & 7.11 & 31.9 & 65.1\%  \\
    STaPLe w/ Judge Iter 1 (25.8k) & 7.60 & 8.13 & 8.08 & 31.6 & 64.9\% \\
    \midrule
    STaR Iter 2 (6.0k)  & 7.47 & 8.08 &  6.86 & 30.6 & 57.7\% \\
    STaPLe w/ Rouge Iter 2 (6.0k)  & 7.70 & 8.15 & 7.25 & 34.6 & 66.0\%  \\
    STaPLe w/ Judge Iter 2 (5.7k) & 7.68 & 8.15 & 7.21 & 34.1 & 65.6\% \\
    \midrule
    STaR Iter 3 (6.1k)  & 7.51 & 8.10 & 6.91 & 31.5 & 61.0\% \\
    STaPLe w/ Rouge Iter 3 (6.2k)  & \textbf{7.72} & \textbf{8.16}& \textbf{7.28} & \textbf{35.7} & \textbf{68.4\%}  \\
    STaPLe w/ Judge Iter 3 (6.3k) & 7.70 & \textbf{8.16} & 7.25 & 35.6 & 68.0\%  \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.83 & 8.59 & 7.08 & 30.2 & -- \\
    \midrule
    STaR Iter 1 (24.1k) & 7.83  & 8.61  & 7.05 & 33.0 & 57.3\%  \\
    STaPLe w/ Rouge Iter 1 (24.1k)  & 7.98& 8.66& 7.30 & 36.2 & 64.9\%  \\
    STaPLe w/ Judge Iter 1 (20.9k) & 7.93 & 8.66 & 7.20 & 36.0 & 65.2\% \\
    \midrule
    STaR Iter 2 (5.4k)  & 7.86 & 8.63 & 7.10 & 34.7 & 59.5\% \\
    STaPLe w/ Rouge Iter 2 (5.1k)  & 8.01 & 8.68 & 7.35 & 38.7 & 67.3\%  \\
    STaPLe w/ Judge Iter 2 (5.2k) & 8.01 & 8.70 & 7.31 & 39.0 & 66.9\% \\
    \midrule
    STaR Iter 3 (5.9k)  & 7.92 & 8.66 & 7.18 & 35.4 & 61.9\% \\
    STaPLe w/ Rouge Iter 3 (5.4k)  & 8.06 & 8.74 & \textbf{7.39} & 39.4 & 69.9\%   \\
    STaPLe w/ Judge Iter 3 (6.3k) & \textbf{8.07} & \textbf{8.76} & 7.38 & \textbf{40.4} & \textbf{70.2\%} \\
    \midrule
    \midrule
    \textbf{Qwen2.5-7B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 6.83 & 7.34 & 6.31 & 30.4 & -- \\
    \midrule
    STaR Iter 1 (30.9k) & 6.85  & 7.39  & 6.31 & 34.5 &  61.0\%  \\
    STaPLe w/ Rouge Iter 1 (30.9k)  & 6.99 & 7.43 & 6.55 & 36.5 & 67.3\%  \\
    STaPLe w/ Judge Iter 1 (29.5k) & 6.96 & 7.45 & 6.48& 36.2 &  66.7\%  \\
    \midrule
    STaR Iter 2 (6.5k)  & 6.98 & 7.45& 6.51 & 36.9 & 63.0\% \\
    STaPLe w/ Rouge Iter 2 (6.5k)  & 7.10 & 7.46 & 6.74 & 38.9 & 68.4\%  \\
    STaPLe w/ Judge Iter 2 (6.5k) & 7.05 & 7.48 & 6.63 & 38.1 & 67.7\% \\
    \midrule
    STaR Iter 3 (7.1k)  & 7.08 & 7.58 & 6.59 & 37.6 & 66.4\% \\
    STaPLe w/ Rouge Iter 3 (6.9k)  & \textbf{7.17} & 7.54 & \textbf{6.80} & \textbf{39.8} & \textbf{70.4\%}  \\
    STaPLe w/ Judge Iter 3 (7.2k) & 7.13 & \textbf{7.56} & 6.70 & 39.5 & 69.5\% \\
    \bottomrule
  \end{tabular}
\end{table}


In Table \ref{table:phi4-judge-results}, we present results analogous to Table \ref{table:multiple-iters}, comparing STaPLe over 3 iterations with the judge for rejection sampling in place of Rouge-L scoring. We use constrained STaPLe with the medoids label replacement method. We observe that the MT-Bench average scores drop slightly relative to using the Rouge-L similarity function, but still outperform STaR by a clear margin; in fact, for Granite-8B, the iteration 2 scores are equal and STaPLe with the Phi-4 judge outperforms the Rouge-L version in iteration 3. Notably, the turn-1 scores are generally equal or higher with the Phi-4 judge, while the turn-2 scores generally drop. On AlpacaEval, the scores of STaPLe with the judge are slightly lower than with Rouge-L for Llama-8B and Qwen-7B, while they gain 1.0 points in iteration 3 for Granite-8B. A similar trend persists for the IFEval win-rates, where Granite gains slightly in iterations 1 and 3, while Qwen and Llama drop slightly. Given that the scores are largely similar, we conclude that the STaPLe algorithm generalizes across choices of similarity function. 

\newpage
\subsection{Bayesian Hyperparameter Optimization for Clustering Distance Threshold}\label{appendix:bayesian-hypers}

As discussed in Section \ref{sec:expt-setup}, we use the deterministic agglomerative clustering algorithm to ensure a fast, yet consistent assignment of clusters over the principle embeddings. However, this relies on a hyperparameter, $\delta$, which denotes the Euclidean distance threshold under which clusters will be merged. As such, a lower threshold corresponds to a greater number of clusters, and vice versa, thus controlling the size of the resulting constitutions. This hyperparameter is currently set manually: the authors of this work inspect the size and representative elements (medoids or modes) and adjust the threshold if needed. This resulted in thresholds of 6 (Granite) and 8 (Llama and Qwen) for iteration 1, which were subsequently decreased to 5 and 7, respectively, for iterations 2-4. However, it is desirable for this threshold to be adaptive, and to mathematically encode the target properties that a clustering should satisfy. 

Accordingly, we design an objective function consisting of two terms over a clustering assignment: 1. the inter-medoid diversity and 2. the intra-cluster tightness. The former is measured by the average cosine distance (one minus the cosine similarity, abbreviated as ``cossim'' henceforth) between each pair of medoids, while the latter is the average cosine similarity of the points in each cluster to their own medoid. This can be written mathematically as follows:
\[J(\delta) = \lambda \cdot \frac{2}{|C|(|C|-1)} \sum\limits_{1 \leq i < j \leq |C|} \bigl[1-\mathrm{cossim}(m_i,m_j)\bigr] + (1-\lambda) \cdot \frac{1}{|C|}\sum\limits_{k=1}^{|C|}\frac{1}{|C_k|}\sum\limits_{i \in C_k} \mathrm{cossim}(z_i,m_k)\]

where $C = AggClustering(\delta)$ is the set of clusters assigned by the Agglomerative Clustering algorithm at a threshold of $\delta$. Given we value a balance between medoid diversity and intra-cluster tightness to the medoid for higher quality assignments, we set $\lambda = 0.5$ to weigh both terms equally. 
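
As a small worked example with hypothetical values (not taken from our experiments), suppose a threshold $\delta$ yields $|C| = 3$ clusters whose medoids have pairwise cosine similarities $0.2$, $0.3$, and $0.4$, and whose per-cluster average similarities to the medoid are $0.9$, $0.8$, and $0.7$. Then
\[
\frac{2}{3 \cdot 2}\bigl[(1-0.2) + (1-0.3) + (1-0.4)\bigr] = 0.7, \qquad
\frac{1}{3}\bigl[0.9 + 0.8 + 0.7\bigr] = 0.8,
\]
so that with $\lambda = 0.5$ we obtain $J(\delta) = 0.5 \cdot 0.7 + 0.5 \cdot 0.8 = 0.75$. Intuitively, the two terms typically pull in opposite directions: a lower threshold tends to produce more, tighter clusters (raising the second term) whose medoids lie closer together (lowering the first term).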

We aim to search for a value of $\delta$ that yields a clustering $C$ that maximizes this objective. We use the scikit-optimize package \citep{head2020scikitoptimize} to perform Bayesian optimization via Gaussian Processes to search for an optimal value of $\delta$ over this function. This process performs Gaussian Process regression over the thresholds evaluated so far, uses an expected improvement acquisition function ($-\mathop{\mathbb{E}}[J(x)-J(x^+)]$, where $x^+$ is the best threshold seen so far) to identify the next threshold to evaluate, then clusters and evaluates at the next chosen value of $x$, repeating this process iteratively. We optimize the acquisition function with the L-BFGS algorithm and run 30 evaluations.
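
For readers less familiar with the acquisition step, the textbook closed form of expected improvement under a Gaussian Process posterior, written here for maximizing $J$, is given below. This is only a sketch: the exact variant computed by the library (for example, with an additional exploration offset, or with the sign flipped because the package minimizes its objective) may differ. Let $\mu(x)$ and $\sigma(x)$ be the posterior mean and standard deviation at a candidate threshold $x$, and let $\Phi$ and $\phi$ be the standard normal CDF and PDF. Then
\[
\mathrm{EI}(x) = \mathop{\mathbb{E}}\bigl[\max\bigl(J(x) - J(x^+),\, 0\bigr)\bigr]
= \bigl(\mu(x) - J(x^+)\bigr)\,\Phi(u) + \sigma(x)\,\phi(u),
\qquad u = \frac{\mu(x) - J(x^+)}{\sigma(x)}.
\]
For instance, with illustrative values $\mu(x) = 0.70$, $J(x^+) = 0.68$, and $\sigma(x) = 0.02$, we have $u = 1$, $\Phi(1) \approx 0.8413$, and $\phi(1) \approx 0.2420$, giving $\mathrm{EI}(x) \approx 0.02 \cdot 0.8413 + 0.02 \cdot 0.2420 \approx 0.0217$.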

We evaluate the STaPLe algorithm with the Llama-8B and Granite-8B models at the chosen thresholds $\delta_i^*$ for iteration $i$, which we also report below. Following from the results in Table \ref{table:full-label-replacement-table}, where the medoid label replacement scheme performs the best, we apply this to the clusters yielded using $\delta_i^*$. 


\begin{table}[h]
\footnotesize
  \caption{Analyzing the Constrained STaPLe algorithm performance over three iterations with optimal thresholds on the diversity-tightness objective searched over via Bayesian hyperparameter optimization. We use the medoids label replacement scheme, among the options in Appendix \ref{sec:clustering-methods}.}
  \label{table:bayesian-hypers-results}
  \centering
  \begin{tabular}{ccccccl}
    \toprule
    Model    & MT-Bench (avg)  & MT-Bench (T1) & MT-Bench (T2) & AlpacaEval \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.46 & 8.09 & 6.83 & 26.9  \\
    \midrule
    STaR Iter 1 (28.2k)  & 7.43  & 8.04  & 6.81 & 29.1    \\
    STaPLe ($\delta_1=8.0$) Iter 1 (28.2k)  & 7.63 & 8.14 & 7.11 & 31.9   \\
    STaPLe ($\delta_1^*=7.2$) Iter 1 (28.2k) & 7.64 & 8.14 & 7.14 & 31.8 \\
    \midrule
    STaR Iter 2 (6.0k)  & 7.47 & 8.08 &  6.86 & 30.6  \\
    STaPLe ($\delta_2=7.0$) Iter 2 (6.0k)  & 7.70 & 8.15 & 7.25 & 34.6   \\
    STaPLe  ($\delta_2^*=7.3$) Iter 2 (6.1k) & 7.72 & 8.16 & 7.28 & 34.6 \\
    \midrule
    STaR Iter 3 (6.1k)  & 7.51 & 8.10 & 6.91 & 31.5  \\
    STaPLe ($\delta_3=7.0$) Iter 3 (6.2k)  & 7.72 & 8.16& 7.28 & \textbf{35.7}   \\
    STaPLe  ($\delta_3^*=6.6$) Iter 3 (6.4k) & \textbf{7.75} & \textbf{8.18} & \textbf{7.33} & 35.4  \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & & & & \\
    \midrule
    Initial Policy & 7.83 & 8.59 & 7.08 & 30.2  \\
    \midrule
    STaR Iter 1 (24.1k) & 7.83  & 8.61  & 7.05 & 33.0   \\
    STaPLe ($\delta_1=6.0$) Iter 1 (24.1k)  & 7.98& 8.66& 7.30 & 36.2   \\
    STaPLe ($\delta_1^*=6.3$) Iter 1 (24.1k) & 7.98 & 8.65 & 7.31 & 36.0 \\
    \midrule
    STaR Iter 2 (5.4k)  & 7.86 & 8.63 & 7.10 & 34.7  \\
    STaPLe ($\delta_2=5.0$) Iter 2 (5.1k)  & 8.01 & 8.68 & 7.35 & 38.7   \\
    STaPLe ($\delta_2^*=5.9$) Iter 2 (5.2k) & 8.02 & 8.68 & 7.36 & 38.8 \\
    \midrule
    STaR Iter 3 (5.9k)  & 7.92 & 8.66 & 7.18 & 35.4  \\
    STaPLe ($\delta_3=5.0$) Iter 3 (5.4k)  & 8.06 & 8.74 & 7.39 & \textbf{39.4}    \\
    STaPLe ($\delta_3^*=4.2$) Iter 3 (5.9k) & \textbf{8.08} & \textbf{8.75} & \textbf{7.41} & 39.2 \\
    \bottomrule
  \end{tabular}
\end{table}

Notably, the optimized thresholds drop in iteration 3, suggesting that as the number of clusters decreases, a lower threshold suffices to balance diversity and cluster tightness. We find that the optimized thresholds yield very similar results, albeit with slight improvements on MT-Bench in most cases, in both turns and thus in the average score as well. 

\subsubsection{Diversity of Constitutions with Manually Selected Thresholds}\label{appendix:diversity-of-constitutions}

To further study the claim that the original, hand-set thresholds are fairly well optimized, we use the objective function $J(\delta)$ as a metric for the quality of the resulting clusterings (Table \ref{table:diversity-of-constitutions}). As expected, the optimized thresholds improve the objective, by a sufficient margin to suggest that there exist multiple thresholds which would improve upon the manually set $\delta_i$ values, the best of which are these $\delta_i^*$ values. 

\begin{table}[h]
\footnotesize
  \caption{Analyzing the constitutions yielded by the Constrained STaPLe algorithm, both with and without the Bayesian hyperparameter optimization process detailed above. }
  \label{table:diversity-of-constitutions}
  \centering
  \begin{tabular}{ccccccl}
    \toprule
    Model    & $J(\delta)$ \\
    \midrule
    \textbf{Llama-3.1-8B-Instruct} & & & & & \\
    \midrule
    STaPLe ($\delta_1=8.0$) Iter 1 (28.2k)  &  0.6437  \\
    STaPLe ($\delta_1^*=7.6$) Iter 1 (28.2k) & 0.6502 \\
    \midrule
    STaPLe ($\delta_2=7.0$) Iter 2 (6.0k)  &  0.6625  \\
    STaPLe  ($\delta_2^*=7.3$) Iter 2 (6.1k) & 0.6732 \\
    \midrule
    STaPLe ($\delta_3=7.0$) Iter 3 (6.2k)  & 0.6889   \\
    STaPLe  ($\delta_3^*=6.6$) Iter 3 (6.4k) & 0.7054 \\
    \midrule
    \midrule
    \textbf{Granite-3.1-8B-Instruct} & & & & & \\
    \midrule
    STaPLe ($\delta_1=6.0$) Iter 1 (24.1k)  & 0.6036  \\
    STaPLe ($\delta_1^*=6.3$) Iter 1 (24.1k) & 0.6151 \\
    \midrule
    STaPLe ($\delta_2=5.0$) Iter 2 (5.1k)  & 0.6241  \\
    STaPLe ($\delta_2^*=5.9$) Iter 2 (5.2k) & 0.6482 \\
    \midrule
    STaPLe ($\delta_3=5.0$) Iter 3 (5.4k)  & 0.6765     \\
    STaPLe ($\delta_3^*=4.2$) Iter 3 (5.9k) & 0.6894 \\
    \bottomrule
  \end{tabular}
\end{table}

\section{Qualitative Examples of Principle-Guided Self-Correction}\label{appendix:qualitative-examples}

\subsection{IFEval Examples with STaPLe Iteration 4 Llama-3.1-8B-Instruct}

\input{Examples/ifeval_examples}

\section{Ethics Statement}\label{appendix:ethics}

Our findings suggest that most users can use STaPLe to improve the quality of the model's responses by eliciting and training the model to follow desirable latent attributes. As such, we hope that this induces a positive societal impact by way of producing a set of model-preferred labels which are used effectively to perform self-correction in an expressive, and thus interpretable manner. However, we caveat this by noting that a principle label \textit{alone} does not fully model the latent reasoning process that a human may use in self-correction, but rather, only serves as a stimulus to indicate the most relevant direction that a refined response should ``step'' towards for improvement. 


An adversarial user could potentially use this process to deliberately \textit{misalign} the model, using the principle discovery phase to steer the model further \textit{away} from desirable responses. That is, one could select another objective aside from the gold response to use as a self-correction target; this would likely yield drastically different principles and results. Training on such trajectories would induce \textit{self-degradation} behavior at inference-time, collapsing the quality of the model's responses, rather than the desired self-improvement of its self-correction abilities. We observe that this is a potential risk for all such principle-driven alignment strategies, even with human-curated or strong model-generated principles, but is especially the case with self-generated principles, given the generator is a relatively weaker language model. 

As a mitigation strategy for this potential negative impact, continuing from our discussion in Section \ref{sec:discussion-limitations}, we suggest human oversight by way of human-in-the-loop feedback. Specifically, an external set of reviewers can assess the quality and safety of the principles generated at the end of the E-step of each iteration after clustering before training the model to follow it. One could feasibly provide multiple candidate constitutions -- e.g. one constitution per label replacement strategy described in Appendix \ref{sec:clustering-methods}, or under different clustering thresholds (the impact of which is explored in Appendix \ref{appendix:bayesian-hypers}) -- and the annotators can select the best one and make edits to it as appropriate. For instance, if an annotator were to discard an element, one could simply discard all samples with labels that fall under that cluster. Thus, we acknowledge the role that clustering plays in making informed assessments over the constitution; as such, constrained STaPLe is more \textit{controllable} in comparison to the unconstrained version. While this reintroduces human oversight to balance performance with safety, it would  add minimal human labor overhead, as \textit{judging} a constitution for safety would require substantially fewer annotation hours than \textit{curating} one, presenting an advantage over methods such as Constitutional AI. We believe that this strategy would be effective in enforcing responsible usage of STaPLe. 

The above human-in-the-loop proposal is also an effective strategy to mitigate bias amplification over the iterations. Allowing annotators to discard elements that they judge would propagate biases or stereotypes ensures that these behaviors are not learned by the model and then invoked in subsequent iterations, avoiding a cascading effect. Again, clustering and the label replacement scheme play an important role here, by ensuring that we do not train on principles that are hyper-specific to a particular sample. This is especially relevant when there may be noisy or adversarial prompts designed to induce undesirable behavior. We suggest that users inspect the model-generated constitutions to assess their principles and the alignment of these labels with their values before training over these elements in the M-step.


Even when using STaPLe to improve responses towards the gold, it is possible that this reference answer is noisy -- i.e. it is incorrect (verifiable settings) or still undesirable in some aspect (preference settings). Given the algorithm's generality, dataset selection is left to the user -- we encourage users to analyze the gold responses to filter samples with lower quality gold responses accordingly during pre-processing. This could be done by way of human annotation (using Likert scale annotations on multiple attributes, akin to UltraFeedback), or using trained or model-based filters for undesirable qualities such as profane language. 

We believe that the promise of STaPLe in facilitating self-improvement in language models by alignment to model-generated constitutions outweighs the possible negative impacts. We further suggest that the strategies detailed above -- specifically, the introduction of some human oversight into the STaPLe algorithm -- would largely mitigate these risks and promote responsible usage.

\newpage
\section{Details of Models and Datasets Used}\label{appendix:models-and-datasets-info}

As noted in Section \ref{sec:expt-setup}, we use the following large language models in our experiments:
\begin{itemize}
    \item \href{https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct}{Llama-3.1-8B-Instruct}  \citep{grattafiori2024llama3herdmodels}; this model is available under the custom Llama-3.1 Community License\footnote{\href{https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blob/main/LICENSE}{https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blob/main/LICENSE}} which includes provisions for commercial usage. 
    \item \href{https://huggingface.co/ibm-granite/granite-3.1-8b-instruct}{Granite-3.1-8B-Instruct}  \citep{granite}; this model is available under the permissive, Apache 2.0 open-source license.
    \item \href{https://huggingface.co/Qwen/Qwen2.5-7B-Instruct}{Qwen2.5-7B-Instruct}  \citep{qwen2025qwen25technicalreport}; this model is also available under Apache 2.0. 
\end{itemize}

Furthermore, in Appendix \ref{appendix:lm-as-a-judge-sim}, we explore the use of an LLM-as-a-judge as a similarity scoring function between a candidate response generated on-policy by one of the above models to the gold response. We instantiate this judge with the \href{https://huggingface.co/microsoft/phi-4}{Phi-4} language model \citep{phi4}, which is made available under the permissive MIT license.

We also provide further details of the datasets used in the mining corpus, expanding on our description in Section \ref{sec:expt-setup}:
\begin{itemize}
    \item \href{https://huggingface.co/datasets/Anthropic/hh-rlhf}{Anthropic HH-RLHF}: this dataset consists of a total of 161k preference pairs (chosen-rejected) over helpfulness and harmlessness as described in \cite{bai2022traininghelpfulharmlessassistant}. HH-RLHF is available under the MIT license. 

    \item \href{http://huggingface.co/datasets/openbmb/UltraFeedback}{UltraFeedback} \citep{cui2024ultrafeedbackboostinglanguagemodels}: this dataset consists of 64k prompts; for each prompt, responses are sampled from four different language models. For each response, Likert-scale annotations are obtained over four attributes -- helpfulness, honesty, instruction-following, and truthfulness -- with corresponding rationales. For the STaPLe algorithm, we only consider samples where all Likert scores are at least $3$, forming a list of gold responses. We then score against the gold by taking the average over the multiple reference answers. UltraFeedback has been made available under the MIT license. 

    \item \href{https://huggingface.co/datasets/openai/summarize_from_feedback}{TL;DR} \citep{tldr}: this dataset consists of Reddit posts detailing a situation, along with two candidate summaries, in the ``comparisons'' part, which we use. They include a ``choice'' label, which we use to select our gold response (summary). We use the train set, consisting of 92.9k samples. TL;DR is available under the CC-BY-4.0 license.

    \item \href{https://huggingface.co/datasets/hotpotqa/hotpot_qa}{HotpotQA}: this dataset focuses on Wikipedia-based question answering. We use the train set of the ``fullwiki'' split, consisting of 90.4k samples; these contain a question, context, supporting facts, and a gold response. HotpotQA is available under CC-BY-SA-4.0.
\end{itemize}

Lastly, we discuss the details behind the evaluation datasets and evaluation framework. 

\begin{itemize}
    \item \href{https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md}{MT-Bench} consists of 80 prompts, testing multi-turn, open-ended response generation capabilities for chat assistants. It is available under the Apache 2.0 license, in the FastChat GitHub repository. We use GPT-4o \citep{openai2024gpt4ocard} as the judge model. 

    \item \href{https://github.com/tatsu-lab/alpaca_eval}{AlpacaEval-2.0-LC} \citep{alpaca_eval} consists of 805 samples testing instruction-following abilities, using length-controlled win-rates through a generalized linear modeling approach \citep{dubois2024length}. It is released under the Apache 2.0 license.

    \item \href{https://huggingface.co/datasets/google/IFEval}{IFEval} \citep{zhou2023instructionfollowingevaluationlargelanguage} consists of 541 prompts, similarly testing instruction-following abilities. It is released under the Apache 2.0 license. 
\end{itemize}

We used the \href{https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0}{Prometheus-8x7B-v2.0} language model \citep{prometheus} as a fine-grained judge to compare the quality of the STaPLe models' generations in their principle-following ability. This model is available under the Apache 2.0 license. 
