% Appendix: Model Answer Examples
% Auto-generated examples of model predictions showing compact, bloated, and invalid formulas

\section{Model Answer Examples}
\label{sec:appendix_model_examples}

This appendix presents representative examples of model-generated formulas across three categories:
\emph{compact} solutions (valid formulas shorter than gold),
\emph{bloated} solutions (valid but significantly longer than gold), and
\emph{invalid} solutions from strong models.
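These examples illustrate the range of model behaviors observed in the benchmark.
In the tables, AST denotes the size of a formula's abstract syntax tree and $\Delta$ the size of the prediction minus that of the gold formula.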

% =============================================================================
\subsection{FullObs Examples}
% =============================================================================

\paragraph{Compact Solutions.}
Multiple models find simpler valid formulas for instance \texttt{simple\_006}:

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{10cm}c@{}}
\toprule
Model & AST & Formula & $\Delta$ \\
\midrule
\textit{Gold} & 20 & \texttt{(and (and (not (P x)) (not (Q x))) (exists y (and (R x y) (and (P y) (not (Q y))))))} & --- \\
\addlinespace
Grok4.1f & 12 & \texttt{(and (not (or (P x) (Q x))) (exists y (S x y)))} & $-8$ \\
Grok4 & 13 & \texttt{(and (not (P x)) (not (Q x)) (exists y (S x y)))} & $-7$ \\
GPT-5.2 & 13 & \texttt{(and (not (P x)) (not (Q x)) (exists y (S x y)))} & $-7$ \\
\bottomrule
\end{tabular*}
\end{center}

All three models discover that, in the given worlds, the gold formula's \texttt{R}-successor with property conditions \texttt{(and (P y) (not (Q y)))} can be replaced by the mere existence of an \texttt{S}-successor.
This suggests that the models exploit regularities in the worlds that the planted gold formula does not use.
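The AST sizes and $\Delta$ values in the table above can be reproduced with a short script. The following is a minimal sketch, not the benchmark's own tooling: the helper names \texttt{parse} and \texttt{ast\_size} are hypothetical, and the counting convention (one node per operator, predicate symbol, and variable, with an $n$-ary \texttt{and}/\texttt{or} charged as $n-1$ binary connectives) is an assumption inferred from the short formulas in this appendix.

```python
import re

def parse(text):
    """Parse one S-expression into nested lists of string tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume the closing parenthesis
        return items

    return read()

def ast_size(node):
    """Count one node per symbol (ASSUMED convention, inferred from the tables).

    An n-ary and/or is charged as n-1 binary connectives.
    """
    if isinstance(node, str):
        return 1
    head, args = node[0], node[1:]
    size = 1 + sum(ast_size(a) for a in args)
    if head in ("and", "or") and len(args) > 2:
        size += len(args) - 2
    return size

# Formulas copied from the simple_006 table.
GOLD = "(and (and (not (P x)) (not (Q x))) (exists y (and (R x y) (and (P y) (not (Q y))))))"
GROK41F = "(and (not (or (P x) (Q x))) (exists y (S x y)))"
GROK4 = "(and (not (P x)) (not (Q x)) (exists y (S x y)))"

gold_size = ast_size(parse(GOLD))
for name, text in [("Grok4.1f", GROK41F), ("Grok4", GROK4)]:
    size = ast_size(parse(text))
    print(name, size, size - gold_size)  # model, AST size, Delta
```

Under this assumed convention the script gives sizes 20, 12, and 13 for the three formulas, matching the table; whether it also matches the benchmark's counts for the long predictions is not checked here.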

\paragraph{Bloated Solutions.}
GPT-5.2 sometimes produces extremely long valid formulas, for example on instance \texttt{hard\_082} (gold AST $=15$, $\Delta=+355$):

\noindent\textit{Gold formula:}
\begin{quote}
\small\texttt{(forall y (or (not (S x y)) (exists z (and (R y z) (Q z)))))}
\end{quote}

\noindent\textit{GPT-5.2 prediction (AST=370):}
\begin{quote}
\scriptsize\texttt{(or (and (not (exists y (not (exists z (R y z))))) (not (exists y (not (exists z (S y z))))) (or (and (P x) (Q x)) (and (not (Q x)) (exists y (and (R x y) (Q y))) (or (forall z (or (not (R x z)) (P z))) (exists w (and (R x w) (Q w) (not (P w)))))))) (and (exists y (not (exists z (R y z)))) (not (exists y (not (exists z (S y z))))) (and (Q x) (forall y (or (not (S x y)) (Q y))))) (and (exists y (not (exists z (R y z)))) (exists y (not (exists z (S y z)))) (exists y (and (Q y) (not (P y)) (not (exists z (S y z))))) (or (not (Q x)) (P x) (not (exists y (S x y))))) (and (exists y (not (exists z (R y z)))) (exists y (not (exists z (S y z)))) (not (exists y (and (Q y) (not (P y)) (not (exists z (S y z)))))) (and (not (Q x)) (exists y (exists z (exists w (and (R x y) (R x z) (R x w) (not (= y z)) (not (= y w)) (not (= z w)))))))) (and (not (exists y (not (exists z (R y z))))) (exists y (not (exists z (S y z)))) (not (exists y (S y y))) (not (exists y (S x y)))) (and (not (exists y (not (exists z (R y z))))) (exists y (not (exists z (S y z)))) (exists y (S y y)) (exists y (and (Q y) (not (P y)))) (or (not (P x)) (and (Q x) (forall y (or (not (R x y)) (P y)))) (exists y (and (R x y) (Q y) (not (P y)))) (and (P x) (not (S x x)) (exists y (and (Q y) (S y x) (not (= y x))))))) (and (not (exists y (not (exists z (R y z))))) (exists y (not (exists z (S y z)))) (exists y (S y y)) (not (exists y (and (Q y) (not (P y))))) (Q x)))}
\end{quote}

The gold formula expresses a simple universal constraint, whereas GPT-5.2 constructs an elaborate disjunction whose branches are guarded by closed conditions on the whole world, such as whether every element has an \texttt{R}-successor.
This valid-but-bloated pattern suggests that the model fits a separate rule to each observed world configuration rather than abstracting the underlying rule.

\paragraph{Invalid Solutions from Strong Models.}

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{9cm}l@{}}
\toprule
Instance & Model & Formula & Failure \\
\midrule
\texttt{easy\_003} & GPT-5.2 & \texttt{(forall y (or (not (R x y)) (Q y)))} & 3/4 worlds \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(forall y (or (not (R x y)) (exists z (and (S y z) (P z)))))} & \\
\addlinespace
\texttt{easy\_005} & Grok4 & \texttt{(not (exists y (and (S x y) (Q y))))} & 2/4 worlds \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(forall y (or (not (S x y)) (exists z (and (S y z) (Q z)))))} & \\
\bottomrule
\end{tabular*}
\end{center}

In \texttt{easy\_003}, GPT-5.2's formula uses \texttt{(Q y)} as the consequent instead of the required existential \texttt{(exists z (and (S y z) (P z)))}.
It thus substitutes a unary predicate for a relational condition, which matches the target in one of the four worlds but fails in the other three.
In \texttt{easy\_005}, Grok4 uses a negated existential that forbids any \texttt{S}-successor satisfying \texttt{Q}, whereas the gold's universal-existential pattern requires each \texttt{S}-successor to have a further \texttt{S}-successor satisfying \texttt{Q}.
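To make the \texttt{easy\_003} failure concrete, the sketch below evaluates both formulas on two small finite worlds and compares their extensions (the set of elements \texttt{x} that satisfy the formula). It is an illustration under stated assumptions: the worlds \texttt{AGREE} and \texttt{DIFFER} are invented for this example and are not the benchmark's worlds, and the helpers \texttt{parse}, \texttt{holds}, and \texttt{extension} are hypothetical names.

```python
import re

def parse(text):
    """Parse one S-expression into nested lists of string tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1
        return items

    return read()

def holds(f, world, env):
    """Evaluate formula f in a finite world under the variable assignment env."""
    op = f[0]
    if op == "not":
        return not holds(f[1], world, env)
    if op == "and":
        return all(holds(g, world, env) for g in f[1:])
    if op == "or":
        return any(holds(g, world, env) for g in f[1:])
    if op in ("exists", "forall"):
        quantify = any if op == "exists" else all
        return quantify(holds(f[2], world, {**env, f[1]: d}) for d in world["domain"])
    if op == "=":
        return env[f[1]] == env[f[2]]
    # Atom such as (P x) or (R x y): look up the tuple of assigned values.
    return tuple(env[v] for v in f[1:]) in world[op]

def extension(text, world):
    """Return the set of elements x for which the formula holds."""
    f = parse(text)
    return {d for d in world["domain"] if holds(f, world, {"x": d})}

PRED = "(forall y (or (not (R x y)) (Q y)))"
GOLD = "(forall y (or (not (R x y)) (exists z (and (S y z) (P z)))))"

# HYPOTHETICAL worlds: relations are sets of tuples over the domain.
AGREE = {"domain": [0, 1, 2], "P": {(2,)}, "Q": {(1,)}, "R": {(0, 1)}, "S": {(1, 2)}}
DIFFER = {"domain": [0, 1, 2], "P": set(), "Q": {(1,)}, "R": {(0, 1)}, "S": set()}

for name, world in [("AGREE", AGREE), ("DIFFER", DIFFER)]:
    print(name, extension(PRED, world) == extension(GOLD, world))
```

In \texttt{AGREE} the only \texttt{R}-successor satisfies both \texttt{Q} and the gold's existential condition, so the two extensions coincide; in \texttt{DIFFER} the successor satisfies \texttt{Q} but has no \texttt{S}-successor, so the prediction accepts element 0 while the gold rejects it.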

% =============================================================================
\subsection{CI (Contrastive Induction) Examples}
% =============================================================================

\paragraph{Compact Solutions.}
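CI often admits formulas simpler than the gold because, during problem generation, we keep pools of alternative hypotheses alive while generating YES worlds and rely on the NO worlds to eliminate them, so a simpler hypothesis that survives the NO worlds is also valid.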

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{10cm}c@{}}
\toprule
Model & AST & Formula & $\Delta$ \\
\midrule
\multicolumn{4}{@{}l}{\texttt{C\_core\_047}:} \\
\textit{Gold} & 18 & \texttt{(exists y (and (or (P y) (Q y)) (forall z (or (not (S y z)) (S z x)))))} & --- \\
GPT-5.2 & 8 & \texttt{(or (exists y (R x y)) (Q x))} & $-10$ \\
Grok4 & 8 & \texttt{(or (P x) (exists y (R x y)))} & $-10$ \\
\bottomrule
\end{tabular*}
\end{center}

For \texttt{C\_core\_047}, GPT-5.2 and Grok4 find equally compact valid alternatives (AST 8) that differ only in the unary disjunct, \texttt{(Q x)} versus \texttt{(P x)}.

\paragraph{Bloated Solutions.}
On instance \texttt{C\_core\_095}, GPT-5.2 again produces a valid but much longer formula (gold AST $=16$, $\Delta=+523$):

\noindent\textit{Gold formula:}
\begin{quote}
\small\texttt{(exists y (and (not (P y)) (forall z (or (not (R y z)) (R z x)))))}
\end{quote}

\noindent\textit{GPT-5.2 prediction (AST=539):}
\begin{quote}
\scriptsize\texttt{(or (and (exists y (and (R y y) (forall z (not (S y z))))) (exists y (and (R y y) (forall z (not (S y z))) (R y x)))) (and (not (exists y (and (R y y) (forall z (not (S y z)))))) (exists y (and (Q y) (forall z (or (not (Q z)) (= z y))))) (or (P x) (Q x) (forall y (not (S x y))))) (and (not (exists y (and (R y y) (forall z (not (S y z)))))) (not (exists y (and (Q y) (forall z (or (not (Q z)) (= z y)))))) (exists y (and (R y y) (not (P y)) (not (Q y)))) (exists y (and (R y y) (not (P y)) (not (Q y)) (= x y)))) (and (not (exists y (and (R y y) (forall z (not (S y z)))))) (not (exists y (and (Q y) (forall z (or (not (Q z)) (= z y)))))) (not (exists y (and (R y y) (not (P y)) (not (Q y))))) (exists y (and (Q y) (not (P y)) (R y y) (forall z (or (not (and (Q z) (not (P z)) (R z z))) (= z y))))) (exists y (and (Q y) (not (P y)) (R y y) (forall z (or (not (and (Q z) (not (P z)) (R z z))) (= z y))) (or (and (= x y) (exists w (and (S y w) (P w) (Q w)))) (and (R y x) (not (Q x))))))) (and (not (exists y (and (R y y) (forall z (not (S y z)))))) (not (exists y (and (Q y) (forall z (or (not (Q z)) (= z y)))))) (not (exists y (and (R y y) (not (P y)) (not (Q y))))) (not (exists y (and (Q y) (not (P y)) (R y y) (forall z (or (not (and (Q z) (not (P z)) (R z z))) (= z y)))))) (forall y (or (not (Q y)) (P y))) (and (P x) (not (exists w (and (P w) (not (Q w)) (R w w) (forall z (or (not (and (P z) (not (Q z)) (R z z))) (= z w))) (Q x) (S x w) (forall y (or (not (and (S x y) (not (Q y)))) (= y w)))))))) (and (not (exists y (and (R y y) (forall z (not (S y z)))))) (not (exists y (and (Q y) (forall z (or (not (Q z)) (= z y)))))) (not (exists y (and (R y y) (not (P y)) (not (Q y))))) (not (exists y (and (Q y) (not (P y)) (R y y) (forall z (or (not (and (Q z) (not (P z)) (R z z))) (= z y)))))) (not (forall y (or (not (Q y)) (P y)))) (exists y (and (Q y) (forall z (or (not (R y z)) (not (Q z)))) (forall w (or (not (and (Q w) (forall z (or (not (R w z)) (not (Q z)))))) (= 
w y))) (or (= x y) (R y x))))))}
\end{quote}

The gold formula is a concise existential-universal pattern, whereas GPT-5.2 constructs a disjunction of AST size 539.

\paragraph{Invalid Solutions from Strong Models.}

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{9cm}l@{}}
\toprule
Instance & Model & Formula & Failure \\
\midrule
\texttt{C\_core\_002} & Gemini 3 & \texttt{(exists y (and (R x y) (not (= x y)) (or (not (S x y)) (and (not (exists z (R y z))) (not (P y))))))} & YES fail \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(exists y (exists z (and (and (R x y) (S y z)) (not (P z)))))} & \\
\addlinespace
\texttt{C\_core\_005} & Gemini 3 & \texttt{(exists y (and (S x y) (not (Q y))))} & YES fail \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(forall y (or (not (R x y)) (exists z (and (S y z) (P z)))))} & \\
\bottomrule
\end{tabular*}
\end{center}

In CI, ``YES fail'' means the formula fails to match the extension on at least one YES world.
Gemini 3's formula for \texttt{C\_core\_005} uses only \texttt{S} and \texttt{Q}, whereas the gold formula is built from \texttt{R}, \texttt{S}, and \texttt{P}; it also replaces the gold's universal-existential structure with a single existential, missing the structural pattern entirely.
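The ``YES fail'' verdict can be sketched as a direct comparison of extensions on each YES world. The code below is a minimal illustration, assuming each YES world stores its target extension under a hypothetical \texttt{target} key; the world shown is invented for this example, the function name \texttt{yes\_failures} is hypothetical, and the two formulas for \texttt{C\_core\_005} are hand-translated into Python predicates. Only the YES side is covered, since the handling of NO worlds is not specified in this appendix.

```python
def yes_failures(formula, yes_worlds):
    """Return indices of YES worlds where the formula's extension differs from the target."""
    failures = []
    for i, world in enumerate(yes_worlds):
        predicted = {x for x in world["domain"] if formula(world, x)}
        if predicted != world["target"]:
            failures.append(i)
    return failures

def gemini(w, x):
    # (exists y (and (S x y) (not (Q y))))
    return any((x, y) in w["S"] and y not in w["Q"] for y in w["domain"])

def gold(w, x):
    # (forall y (or (not (R x y)) (exists z (and (S y z) (P z)))))
    return all(
        (x, y) not in w["R"]
        or any((y, z) in w["S"] and z in w["P"] for z in w["domain"])
        for y in w["domain"]
    )

# HYPOTHETICAL YES world; the target is the gold formula's extension in it.
WORLD = {
    "domain": [0, 1, 2],
    "P": {2},
    "Q": set(),
    "R": {(0, 1)},
    "S": {(1, 2)},
    "target": {0, 1, 2},
}

print(yes_failures(gold, [WORLD]))
print(yes_failures(gemini, [WORLD]))
```

In this toy world the gold formula holds for every element, while the Gemini 3 formula holds only for element 1, the single element with an \texttt{S}-successor, so the world is reported as a YES failure for that formula.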

% =============================================================================
\subsection{EC (Existential Completion) Examples}
% =============================================================================

\paragraph{Compact Solutions.}
Instance \texttt{E\_core\_0010} admits a simpler valid formula found by multiple models:

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{10cm}c@{}}
\toprule
Model & AST & Formula & $\Delta$ \\
\midrule
\textit{Gold} & 16 & \texttt{(and (not (P x)) (exists y (and (R x y) (and (P y) (not (Q y))))))} & --- \\
\addlinespace
Grok4.1f & 9 & \texttt{(and (not (P x)) (exists y (R x y)))} & $-7$ \\
DSR & 9 & \texttt{(and (not (P x)) (exists y (R x y)))} & $-7$ \\
GPT-5.2 & 9 & \texttt{(and (not (P x)) (exists y (R x y)))} & $-7$ \\
Gemini 3 & 9 & \texttt{(and (not (P x)) (exists y (R x y)))} & $-7$ \\
Opus 4.5 & 9 & \texttt{(and (not (P x)) (exists y (R x y)))} & $-7$ \\
\bottomrule
\end{tabular*}
\end{center}

Under EC semantics with unknown atoms, the models discover that the gold formula's detailed constraint \texttt{(and (P y) (not (Q y)))} on the existential witness is unnecessary: simply requiring an \texttt{R}-successor suffices to characterize the target concept in the given training worlds.
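One way to see why the witness constraint can be dropped is to enumerate completions of the unknown atoms. The sketch below is only one possible reading of EC and is not taken from the benchmark: it assumes a formula is accepted on a world if at least one assignment of truth values to the unknown atoms makes its extension equal the target. The world, the atom encoding, and the names \texttt{completions}, \texttt{matching\_completions}, and \texttt{accepted} are all hypothetical.

```python
from itertools import product

def completions(world):
    """Yield every total world obtained by fixing each unknown atom to True or False."""
    unknown = sorted(world["unknown"])
    for bits in product([False, True], repeat=len(unknown)):
        facts = set(world["true"]) | {a for a, b in zip(unknown, bits) if b}
        yield {"domain": world["domain"], "facts": facts}

def matching_completions(formula, world):
    """Count completions in which the formula's extension equals the target."""
    return sum(
        {x for x in c["domain"] if formula(c, x)} == world["target"]
        for c in completions(world)
    )

def accepted(formula, world):
    """ASSUMED EC check: at least one completion reproduces the target."""
    return matching_completions(formula, world) > 0

def compact(w, x):
    # (and (not (P x)) (exists y (R x y)))
    return ("P", x) not in w["facts"] and any(
        ("R", x, y) in w["facts"] for y in w["domain"]
    )

def gold(w, x):
    # (and (not (P x)) (exists y (and (R x y) (and (P y) (not (Q y))))))
    return ("P", x) not in w["facts"] and any(
        ("R", x, y) in w["facts"]
        and ("P", y) in w["facts"]
        and ("Q", y) not in w["facts"]
        for y in w["domain"]
    )

# HYPOTHETICAL partial world: atoms neither listed as true nor unknown are false.
WORLD = {
    "domain": [0, 1],
    "true": {("R", 0, 1)},
    "unknown": {("P", 1), ("Q", 1)},
    "target": {0},
}

print(accepted(compact, WORLD), matching_completions(compact, WORLD))
print(accepted(gold, WORLD), matching_completions(gold, WORLD))
```

In this toy world both formulas are accepted, but the gold formula needs the specific completion in which \texttt{(P 1)} is true and \texttt{(Q 1)} is false, whereas the compact formula reproduces the target under all four completions.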

\paragraph{Bloated Solutions.}
On instance \texttt{E\_core\_0003}, GPT-5.2's valid prediction is again far longer than the gold (gold AST $=17$, $\Delta=+107$):

\noindent\textit{Gold formula:}
\begin{quote}
\small\texttt{(forall y (or (or (not (S x y)) (not (S y x))) (and (P y) (Q y))))}
\end{quote}

\noindent\textit{GPT-5.2 prediction (AST=124):}
\begin{quote}
\scriptsize\texttt{(or (and (exists y (and (Q y) (not (P y)) (not (exists z (S y z))))) (and (Q x) (or (not (exists y (S x y))) (exists y (and (S x y) (P y)))))) (and (not (exists y (and (Q y) (not (P y)) (not (exists z (S y z)))))) (or (and (exists y (not (exists z (S y z)))) (or (and (not (exists y (S x y))) (exists y (exists z (and (R x y) (R x z) (not (= y z)))))) (and (P x) (Q x) (not (exists y (R x y)))))) (and (not (exists y (not (exists z (S y z))))) (exists y (exists z (and (R x y) (R x z) (not (= y z)))))))))}
\end{quote}

The gold formula requires every element linked to \texttt{x} by \texttt{S} in both directions to satisfy both \texttt{P} and \texttt{Q}, whereas GPT-5.2's bloated alternative enumerates specific configurations.

\paragraph{Invalid Solutions from Strong Models.}

\begin{center}
\small
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llp{9cm}l@{}}
\toprule
Instance & Model & Formula & Failure \\
\midrule
\texttt{E\_core\_0001} & Grok4 & \texttt{(and (P x) (or (exists y (and (R x y) (Q y))) (exists z (and (R x z) (S z z)))))} & invalid \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(and (P x) (exists y (and (R x y) (Q y))))} & \\
\addlinespace
\texttt{E\_core\_0003} & Grok4 & \texttt{(not (R x x))} & invalid \\
\multicolumn{2}{@{}l}{\textit{Gold}:} & \texttt{(forall y (or (or (not (S x y)) (not (S y x))) (and (P y) (Q y))))} & \\
\bottomrule
\end{tabular*}
\end{center}

Grok4's formula for \texttt{E\_core\_0001} adds an unnecessary disjunct \texttt{(exists z (and (R x z) (S z z)))}, which cannot be satisfied in all training worlds even with existential completion.
For \texttt{E\_core\_0003}, the formula \texttt{(not (R x x))} ignores the \texttt{S} relation entirely and cannot characterize the target concept.

