%%%%%%%% ICML 2026 INDUCTION PAPER %%%%%%%%%%%%%%%%%
%%%% Ported from KR format to ICML format

\documentclass{article}

% Recommended packages
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}

% ICML style (for blind submission)
\usepackage{icml2026}

\usepackage{amsmath,amssymb}
\usepackage{placeins}

% cleveref for smart references
\usepackage[capitalize,noabbrev]{cleveref}

% Short running title
\icmltitlerunning{INDUCTION: Finite-Structure Concept Synthesis in FOL}

% Macros
\newcommand{\Lang}{\mathcal{L}}
\newcommand{\W}{\mathcal{W}}
\newcommand{\Yes}{\mathrm{YES}}
\newcommand{\No}{\mathrm{NO}}
\newcommand{\Target}{T}
\newcommand{\astsz}{\mathrm{AST}}
\newcommand{\QD}{\mathrm{QD}}

% Path to auto-generated tables/figures (relative to icml_paper folder)
\graphicspath{{../auto/figures/}{../auto/figs/}}

\begin{document}

\twocolumn[
  \icmltitle{\textsc{INDUCTION}: Finite-Structure Concept Synthesis in First-Order Logic}

  % Author information hidden for blind review
  \begin{icmlauthorlist}
    \icmlauthor{Anonymous Author(s)}{}
  \end{icmlauthorlist}

  % Empty affiliations for blind review
  \icmlaffiliation{}{}

  \icmlkeywords{first-order logic, concept synthesis, inductive reasoning, benchmark, symbolic reasoning}

  \vskip 0.3in
]

\printAffiliationsAndNotice{}

\begin{abstract}
Large language and reasoning models can be prompted to generate well-formed first-order formulas, but we still
lack evaluations of their ability to produce \emph{correct, compact explanations} under fully
specified, mechanically checkable semantics.
We study \emph{finite-structure concept synthesis}: given several small finite relational worlds that
are labeled extensionally with a unary target predicate $T(x)$, the learner must output a single first-order
formula $\varphi(x)$ that recovers (\emph{explains}) $T$ uniformly across worlds.
Because the domains are finite, correctness is solver-verifiable via exact model checking and SMT.
We introduce \textsc{INDUCTION}, a benchmark suite providing challenging,
end-to-end evaluation of first-order \emph{definition synthesis} from extensional relational evidence.
\textsc{INDUCTION} includes three regimes---\textsc{FullObs} (full observation), \textsc{CI}
(contrastive $\Yes/\No$ worlds), and \textsc{EC} (partial observation under existential completion)---and
reports metrics that penalize formula bloat.
Across tasks we observe sharp difficulty gradients and persistent hard structural families; moreover, held-out
world evaluation shows that among training-correct solutions, low-bloat formulas generalize far better than
high-bloat ones, motivating bloat-aware scoring as a metric for symbolic induction.
\end{abstract}

\section{Introduction}

Recent large language and reasoning models can generate syntactically valid logical
expressions on demand, including first-order formulas.
However, it remains difficult to measure whether they genuinely \emph{generalize} logical
structure, that is, whether they can form correct, compact formulas that
explain a set of observations. This gap is particularly salient for first-order
logic (FOL): quantifiers, relational structure, and nontrivial interactions between predicates are
central to knowledge representation, but existing evaluations often conflate
logical competence with linguistic pragmatics, dataset artifacts, or
unverifiable free-form answers.

More broadly, constructing \emph{simple, robust hypotheses} from limited observations
is central to human scientific discovery, and conjecture-making
is core to mathematical practice. In these settings, success is not merely fitting
observed data, but arriving at concise hypotheses that remain stable under new evidence.
In this paper, we study the ability of models to form compact first-order explanations
of observed data through a fully verifiable, abstract interface.

We study \emph{finite-structure concept synthesis}:
given several small finite worlds (structures) over a fixed relational
signature, and a designated unary target predicate $\Target(x)$ presented
extensionally in each world, the learner must output a single FOL formula
$\varphi(x)$ that recovers $\Target$.
Each world has a finite domain and interpretations for observed predicates
(e.g., unary $P,Q$ and binary $R,S$), and correctness is determined by model
checking over the finite domains.
This setting isolates the core logical challenge---constructing a formula that
captures a relational concept across worlds---while remaining fully
solver-verifiable.

We introduce \textsc{INDUCTION}, a benchmark suite that probes inductive
generalization under three complementary task variants that share a common
language, generator, and evaluation pipeline:
(1) \textbf{FullObs} (full observation), where all predicate
facts are observed and $\varphi$ must match $\Target$ in every training world;
(2) \textbf{CI} (contrastive / Zendo-style), where worlds
are partitioned into $\Yes$ and $\No$ groups and a solution must match
$\Target$ on all $\Yes$ worlds while \emph{failing to match} $\Target$ on each
$\No$ world, thereby requiring discriminative hypotheses;
and (3) \textbf{EC} (partial observation), where some ground
atoms are unknown and correctness is evaluated under an \emph{existential
completion} semantics: a solution is valid if there exists a completion of the
unknowns under which $\varphi$ matches the observed target labels.
These variants are designed to separate different failure modes: overfitting
to few worlds (FullObs), inability to use negative evidence (CI), and reasoning
under missing information (EC).

A core theme in our design is that \emph{accuracy alone is not enough}.
In all three tasks, especially under partial observation, a model can sometimes
satisfy the logical constraints using very large, case-splitting formulas that
exploit accidental regularities or the flexibility of completion.
To address this, \textsc{INDUCTION} reports \emph{budgeted} and
\emph{gold-relative} metrics that quantify how often models succeed with
formulas whose syntactic complexity is near that of the planted gold concept,
rather than allowing unbounded formula bloat.
This makes the benchmark more stable as models improve: progress should ideally
reflect better search and abstraction, not merely longer outputs.

We emphasize \emph{controllable difficulty}.
Inspired by counterexample-guided generation, our generators maintain pools of
plausible distractor hypotheses and construct worlds to eliminate them.
For the contrastive task CI, we develop a targeted ``trap'' mechanism that
makes $\No$ worlds informative: they are generated to match
tempting shortcut formulas that survive the $\Yes$ worlds,
forcing models that rely on such shortcuts to fail in $\No$ worlds.
For FullObs and EC, we track version-space diagnostics (how many candidate
hypotheses remain consistent with the worlds) and world informativeness (how
many candidates each world eliminates), enabling principled sweeps over
hardness axes.

\paragraph{Contributions.}
\begin{itemize}
  \item We formalize a unified finite-structure concept synthesis setting for
  FOL and introduce three tasks (FullObs/CI/EC) with precise,
  solver-verifiable semantics.
  \item We provide difficulty-controlled generation procedures, including
  contrastive trap construction for CI and version-space diagnostics for
  FullObs/EC, with quality filters that reduce trivial or artifact-driven
  instances.
  \item We propose evaluation metrics that combine validity with AST-based
  formula-complexity, highlighting when success
  depends on excessive case-splitting.
  \item We release stable v1 benchmarks and report
  results across a range of models, linking failure modes across
  tasks to structural properties of instances.
\end{itemize}

\section{Problem Setup}
\label{sec:setup}

\textbf{Finite relational worlds.}
All tasks take place in \emph{finite first-order structures}.
We fix a small relational signature
$\Sigma=\{P,Q,R,S\}$
consisting of unary predicates $P(\cdot),Q(\cdot)$ and binary predicates $R(\cdot,\cdot),S(\cdot,\cdot)$.
A \emph{world} $W$ is a finite $\Sigma$-structure with domain $D_W=\{a_1,\dots,a_n\}$ and interpretations
$P^W\subseteq D_W$,
$Q^W\subseteq D_W$,
$R^W\subseteq D_W\times D_W$,
$S^W\subseteq D_W\times D_W$.
In addition, each world comes with a designated \emph{unary target concept} $T_W\subseteq D_W$, given extensionally as a set of elements labeled ``true'' (and, by complement, ``false'').
The learning problem is to recover a single first-order formula that defines $T$ \emph{uniformly} across multiple worlds.

\textbf{Instances (multi-world problems).}
A benchmark \emph{problem instance} $\mathcal{I}$ consists of a finite set of worlds $\{W_1,\dots,W_k\}$ that share the same signature $\Sigma$ but may have different domains and interpretations.
The solver (a language model) must output a single first-order formula $\varphi(x)$ with one free variable $x$.
Intuitively, $\varphi$ should carve out the target set in each world:
$\{a\in D_W \mid W\models \varphi[a]\}\approx T_W$.
Across our tasks, the evaluation criterion varies in how the worlds constrain the solution (\cref{sec:tasks}).

\textbf{Symbolic output language.}
Models output $\varphi(x)$ in a Lisp-style S-expression syntax over predicates $P,Q,R,S$ with standard Boolean connectives and first-order quantifiers.
We also allow the standard equality atom \texttt{(= x y)}; it is parsed and evaluated under the usual first-order semantics.
The full grammar and task prompts are provided in Appendix~D; qualitative model output examples appear in Appendix~C. Outputs are parsed into an AST and checked mechanically.

\textbf{Model checking and solver-verifiable semantics.}
Because all domains are finite, evaluation reduces to finite model checking.
Given a world $W$ and a candidate $\varphi(x)$, we compute the \emph{predicted extension}
\begin{equation}
\hat{T}_W(\varphi) \;=\; \{\,a\in D_W \mid W \models \varphi[a]\,\}.
\end{equation}
Our implementation evaluates $\varphi$ by grounding quantifiers over the finite domain and translating the result to a Boolean circuit (and optionally to Z3 for consistency checks and diagnostics).
This makes correctness \emph{mechanically checkable} and independent of any natural-language interpretation.
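
\textbf{Worked example.}
As a small illustration (a hypothetical toy world, not an instance from the benchmark), take $D_W=\{a_1,a_2,a_3\}$, $P^W=\{a_2\}$, $R^W=\{(a_1,a_2),(a_3,a_3)\}$, and the candidate $\varphi(x)=\exists y\,(R(x,y)\wedge P(y))$.
Grounding the quantifier over $D_W$ turns $\varphi[a]$ into a finite disjunction:
\[
\varphi[a] \;\equiv\; \bigvee_{b\in D_W}\big(R(a,b)\wedge P(b)\big).
\]
For $a_1$ the disjunct $b=a_2$ is true; $a_2$ has no outgoing $R$-edge; and the only $R$-successor of $a_3$ is $a_3$ itself, which is not in $P^W$.
Hence $\hat{T}_W(\varphi)=\{a_1\}$, so $\varphi$ matches this world exactly when $T_W=\{a_1\}$.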

\textbf{Complexity measures and budgeted scoring.}
In our experiments we observe that high-capacity models may satisfy the constraints using extremely long, case-splitting formulas.
To distinguish principled concept recovery from unbounded ``bloat,'' we track two syntactic measures:
(i) \emph{AST size} (the number of nodes in the parsed formula tree), and
(ii) \emph{quantifier depth} (maximum nesting depth of \texttt{forall}/\texttt{exists}).
Many instances are generated from a planted \emph{gold} formula $\varphi^\star$; we therefore report \emph{gold-relative} success rates such as
$\text{Acc}@(\text{gold}+ \Delta)$: the fraction of instances solved with $\text{AST}(\hat\varphi)\le \text{AST}(\varphi^\star)+\Delta$.
We also report a \emph{bloat rate} (e.g., $\text{AST}(\hat\varphi)>\text{AST}(\varphi^\star)+25$) to quantify how often correctness relies on highly non-canonical solutions.
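
\textbf{Worked example.}
To make these measures concrete, consider a sketch under an \emph{assumed} counting convention: one node per quantifier, connective, and atom (the benchmark's exact convention may differ).
For a hypothetical gold formula $\varphi^\star(x)=\exists y\,(R(x,y)\wedge P(y))$, the tree has the nodes $\exists y$, $\wedge$, $R(x,y)$, and $P(y)$, so $\astsz(\varphi^\star)=4$ and $\QD(\varphi^\star)=1$.
A hypothetical valid prediction $\hat\varphi$ with $\astsz(\hat\varphi)=9$ then has
\[
\astsz(\hat\varphi)-\astsz(\varphi^\star) \;=\; 9-4 \;=\; 5,
\]
so it counts toward $\text{Acc}@(\text{gold}+\Delta)$ for every $\Delta\ge 5$ but not for $\Delta<5$, and it is not counted as bloat because $9\le 4+25=29$.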

\section{Induction Tasks}\label{sec:tasks}

All induction tasks share the same objects (worlds, target extensions, and a single output formula $\varphi(x)$) but differ in (i) what is observed in each world and (ii) how worlds constrain a valid solution.
We denote by $\mathcal{I}=\{W_1,\dots,W_k\}$ the worlds in an instance.

\subsection{Full Observation across Multiple Worlds}

In \textbf{FullObs}, each world provides a complete interpretation of the observed predicates $P,Q,R,S$ (closed-world for the provided domain).
A candidate $\varphi(x)$ \emph{solves} an instance iff it matches the target extension in \emph{every} world:
\begin{equation}
\forall W\in \mathcal{I}\;:\;\hat{T}_W(\varphi)=T_W.
\end{equation}
This is the most ``direct'' concept synthesis setting: the only source of difficulty is finding a single relational/quantified definition that generalizes across multiple finite structures.


\subsection{Contrastive Induction}

In \textbf{CI}, the worlds are partitioned into
$\mathcal{I}_{\textsc{yes}}$ and $\mathcal{I}_{\textsc{no}}$.
Intuitively, \textsc{yes} worlds behave like standard training worlds, while \textsc{no} worlds provide \emph{negative evidence about the hypothesis class}.
Formally, define an exact-match predicate:
\begin{equation}
\textsc{Match}(W,\varphi) \;\;\Longleftrightarrow\;\; \hat{T}_W(\varphi)=T_W.
\end{equation}
A candidate $\varphi(x)$ solves an instance iff it matches \emph{every} \textsc{yes} world and fails to match \emph{every} \textsc{no} world:
\begin{multline}
\Big(\forall W\in \mathcal{I}_{\textsc{yes}}:\textsc{Match}(W,\varphi)\Big)
\;\wedge\; \\
\Big(\forall W\in \mathcal{I}_{\textsc{no}}:\neg\textsc{Match}(W,\varphi)\Big).
\end{multline}
The second conjunct ensures that a solution must be \emph{discriminative}: it cannot perfectly explain the
targets on any \textsc{no} world. CI does not require $\varphi$ to invert labels on \textsc{no} worlds; it only requires
that $\varphi$ is not an exact match to the \textsc{no}-world target extension. During generation, CI worlds are
constructed so that the \textsc{no}-world constraint is informative by design (e.g., via trap/survivor mechanisms that
produce tempting shortcut hypotheses).
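
\textbf{Worked example.}
As a hypothetical illustration (not a benchmark instance), let a \textsc{no} world $W$ have $D_W=\{a_1,a_2\}$, $P^W=\{a_1\}$, and $T_W=\{a_1\}$.
The shortcut $\psi(x)=P(x)$ gives
\[
\hat{T}_W(\psi)=\{a_1\}=T_W,
\]
so $\textsc{Match}(W,\psi)$ holds and $\psi$ cannot solve the instance, however well it fits the \textsc{yes} worlds.
In contrast, a candidate $\varphi$ with $\hat{T}_W(\varphi)=\{a_1,a_2\}$ satisfies $\neg\textsc{Match}(W,\varphi)$ even though it agrees with $T_W$ on $a_1$: disagreeing with the target on a single element of each \textsc{no} world is enough.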


\subsection{Partial Observation \& Existential Completion}

The \textbf{EC} variant introduces missing information in the observed predicates.
For each world $W$, ground atoms of $P,Q,R,S$ are partitioned into \emph{known true}, \emph{known false}, and \emph{unknown}.
Let $\Omega_W$ denote the set of unknown ground atoms in $W$.
A \emph{completion} $\mathcal{C}$ assigns a truth value to every atom in $\Omega_W$, yielding a fully specified world $W^{\mathcal{C}}$.
The target extension $T_W$ is still given extensionally and treated as observed.

Under the \emph{existential completion} semantics, a formula is considered valid if, for each world independently, there exists some completion under which it matches the target:
\begin{equation}
\forall W \in \mathcal{I},\; \exists \mathcal{C}_W \in \mathrm{Comp}(W) \;:\;
\hat{T}_{W^{\mathcal{C}_W}}(\varphi) = T_W.
\end{equation}
Completions are world-local: each world has its own unknown atoms, so we quantify $\exists$ over completions independently per world.
Equivalently, $\varphi$ must be compatible with the observed facts and labels in each world, without requiring it to succeed under \emph{all} completions of any single world.
This setting probes whether models can reason about what \emph{could} be true in a partially observed relational world.
In practice, we evaluate EC instances by introducing Boolean variables for unknown atoms, grounding quantifiers over the finite domain, and asking a solver whether the resulting constraints are satisfiable (thereby witnessing a completion).
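
\textbf{Worked example.}
As a hypothetical illustration (not a benchmark instance), let $D_W=\{a_1,a_2\}$ with $P^W=\{a_2\}$ fully observed, $\Omega_W=\{R(a_1,a_2)\}$, all other $R$-atoms known false, and $T_W=\{a_1\}$.
Introduce a Boolean variable $u$ for the unknown atom $R(a_1,a_2)$ and consider $\varphi(x)=\exists y\,(R(x,y)\wedge P(y))$, in which $Q$ and $S$ do not occur.
Grounding over $D_W$ gives
\begin{align*}
\varphi[a_1] &\equiv (\bot\wedge\bot)\vee(u\wedge\top) \equiv u, \\
\varphi[a_2] &\equiv (\bot\wedge\bot)\vee(\bot\wedge\top) \equiv \bot.
\end{align*}
Matching $T_W$ requires $\varphi[a_1]$ to hold and $\varphi[a_2]$ to fail, i.e., the constraint $u\wedge\neg\bot$, which is satisfiable with $u=\top$; the completion that sets $R(a_1,a_2)$ to true witnesses validity in this world.
Had the target instead been $T_W=\{a_2\}$, the constraint would be $\neg u\wedge\bot$, which is unsatisfiable, so no completion could make $\varphi$ match.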

\section{Dataset Generation}\label{sec:generation}

A central goal of \textsc{INDUCTION} is \emph{controllable difficulty}: we want
instances that are neither trivially solvable nor infeasible for current frontier models.
This requires careful attention to (i) the pool of gold target formulas,
(ii) the construction of worlds that provide informative constraints, and
(iii) rejection filters that eliminate degenerate instances.
We describe the generation procedures for FullObs and CI in detail; EC generation
follows similar principles with additional partial-observation machinery.

\subsection{Gold Formula Pool and Structural Tagging}

Gold target concepts $\varphi^\star(x)$ are drawn from a curated template pool
comprising approximately 200 structurally distinct formulas.
Each formula is tagged by \emph{family} (based on quantifier structure:
single-$\exists$, single-$\forall$, nested $\exists\forall$, nested
$\forall\exists$, etc.) and \emph{subfamily} (a fine-grained signature
capturing guard predicates, relation usage, and nesting patterns).

Although equality is allowed in the output language, our v1 planted gold templates
do not use it. This design keeps the target distribution focused on relational induction
(predicates and quantifiers) rather than identity-based constraints. Models are free to use
equality in predictions; such formulas are evaluated normally.

A particularly important subfamily involves \emph{lift-hard} patterns: formulas
where a relation involving the free variable $x$ appears inside a universally
quantified subformula.
For example, $\forall y\,(R(x,y) \to \exists z\, S(y,z))$ requires checking a
property of $x$'s $R$-successors for \emph{all} such successors---a pattern
that models frequently fail to generalize correctly.
We use lift-hard formulas to create headroom slices in each task.

Gold formulas are selected per-band with constraints on quantifier depth
($\QD$), AST size, and family distribution.
To ensure variety, we cap the number of problems per gold formula and per
subfamily, forcing coverage across structural patterns rather than
oversampling easy families.

\subsection{FullObs Generation}

FullObs instances require the learner to match $\Target$ across $k$ fully observed
worlds.
The challenge is constructing worlds that make the gold concept
\emph{highly constraining}---i.e., that eliminate a large fraction of plausible
shortcut or structurally different hypotheses, even though the gold formula is
not guaranteed to be uniquely identified by the finite worlds.

\textbf{Hypothesis pool.}
We maintain a frozen pool of ${\approx}1{,}500$ candidate hypotheses organized into three tiers:
(i) simple shortcuts (atomic predicates, $\QD=1$ formulas with AST $\le 10$);
(ii) near-miss mutants of gold formulas (predicate swaps, guard drops, argument permutations);
(iii) structurally complex distractors ($\QD = 2$, larger AST).

\textbf{World generation with kill tracking.}
For each gold formula $\varphi^\star$, each candidate world is a random finite model with a target 
extension induced by $\varphi^\star$.
Unary predicates $P,Q$ are sampled via Bernoulli trials with $p_{\text{unary}}=0.4$ 
(constrained to 15--85\% true per predicate).
Binary predicates $R,S$ use regular out-degree sampling: each element has exactly 2 outgoing edges per relation, which keeps relational density
stable across domain sizes and avoids degenerate sparse or dense regimes for quantifiers.
Domain sizes vary by band (5--7 for easy, 7--10 for medium, 8--12 for hard).
Before accepting a world, we require that it kills at least one surviving hypothesis from the pool, i.e., a 
hypothesis that matched all previous worlds but fails on this one.
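
One way to formalize this acceptance rule (the symbols $\mathcal{H}$ and $\mathcal{S}_j$ are introduced here for illustration only) is to track the survivor set after each accepted world.
Let $\mathcal{H}$ be the hypothesis pool and set $\mathcal{S}_0=\mathcal{H}$.
After worlds $W_1,\dots,W_j$,
\[
\mathcal{S}_j \;=\; \{\,h\in\mathcal{S}_{j-1} \mid \hat{T}_{W_j}(h)=T_{W_j}\,\},
\]
and a candidate world $W_j$ is accepted only if $\mathcal{S}_j\subsetneq\mathcal{S}_{j-1}$.
For instance, with hypothetical counts, if $|\mathcal{S}_{j-1}|=12$ and a sampled world is matched by all 12 survivors, it is rejected; a world matched by only 9 of them is accepted and kills 3.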


\textbf{Filters.}
After generating $k$ worlds, we apply rejection filters (Appendix \cref{tab:world_gen_params}):
(i) \textbf{atomic}: reject if any atomic formula matches $\Target$ in all worlds;
(ii) \textbf{subformula}: reject if any proper subformula of $\varphi^\star$ matches all worlds;
(iii) \textbf{quantifier-free}: reject if any quantifier-free combination matches all worlds.
These filters ensure that accepted instances require genuine quantifier reasoning.

\textbf{Band configuration.}
FullObs v1 uses six bands with increasing difficulty:
\textbf{simple} ($k=4$, $\QD=1$),
\textbf{easy} ($k=6$, $\QD \ge 2$),
\textbf{medium} ($k=8$),
\textbf{hard} ($k=10$, tighter filters),
and two \textbf{extreme} slices (logic-focused and context-focused) with
$k=10$ and lift-hard formulas.

\subsection{CI Generation: Tiered Trap Pools}

CI instances require the learner to find a formula that matches $\Target$ on
all YES worlds while \emph{failing} on each NO world.
The key insight is that NO worlds should be \emph{informative}---they should
rule out plausible shortcut hypotheses that a model might propose after seeing
only the YES worlds.

\textbf{Trap-based contrastive construction.}
For each gold formula $\varphi^\star$, we construct a per-problem \emph{trap pool} of plausible alternatives:
simple shortcuts (AST $\le 10$, $\QD \le 1$) and near-miss mutants of $\varphi^\star$ (guard drops, predicate swaps, argument permutations).
YES worlds are generated incrementally, tracking which traps \emph{survive}---i.e., match $\Target$ on all YES worlds so far.
We enforce a \emph{tight survivor band}: after all YES worlds, the number of surviving near-miss traps should be in $[1, 2]$ and total survivors in $[2, 4]$.
This band was calibrated empirically: too many survivors makes matching YES worlds with shortcuts too easy; too few makes NO worlds uninformative.
Each NO world is then constructed so that at least one surviving trap matches its target extension exactly; this \emph{kills} the trap under the CI criterion and ensures that models relying on shortcut hypotheses incur NO-world failures.

\textbf{Band configuration.}
CI v1 uses two bands:
\textbf{core} (no lift-hard, 7--8 YES worlds, 2--3 NO worlds) and
\textbf{lift\_mix} (35\% lift-hard formulas, same world counts).
The lift\_mix band provides headroom for models that master the core band.

\subsection{EC Generation: Partial Observation}

EC instances present worlds where some ground atoms are \emph{unknown}---their
truth values are hidden from the model.
A candidate formula $\varphi$ is valid if, for each world, there exists a
\emph{completion} of the unknown atoms under which $\varphi$ matches the
observed target labels exactly.
This existential-completion semantics is checked via Z3: we ground quantifiers
over finite domains and search for a satisfying assignment to the unknowns.

\textbf{Unknown atom selection.}
For each world, we designate a subset of ground atoms as unknown.
In v1, 20\% of binary atoms are masked as unknown: the core band masks atoms from $R$ and $S$, while the hard 
band masks only $R$ atoms (unary $P,Q$ remain fully observed).
The model observes only the known atoms; unknown atoms appear in the prompt as explicitly marked ``unknown.''

\textbf{Band configuration.}
EC v1 uses two bands:
\textbf{Core} (120 problems): $k = 3$ worlds, domain sizes 6--8,
$\QD = 1$ gold formulas, 20\% unknown rate on $R,S$.
\textbf{Hard} (80 problems): $k = 3$ worlds, domain sizes 7--9,
$\QD = 2$ gold formulas, 20\% unknown rate on $R$ only.
Core is designed to be solvable by multiple models, while Hard provides headroom.

\section{Evaluation: Reported Metrics}\label{sec:evaluation}


We report: (1) \textbf{Coverage}: fraction of instances with a parseable formula. Grok4
  is an outlier with low coverage in FullObs and CI: despite multiple retries, it frequently did not return
  an output within the timeout, and we count these cases as missing. (2) \textbf{Unbounded accuracy/validity}: FullObs/CI
  exact-match correctness; EC existential-completion validity. (3) \textbf{Budgeted accuracy} Acc@(+$\Delta$): valid and
  $\astsz(\varphi)\le \astsz(\varphi^\star)+\Delta$. (4) \textbf{Bloat rate}: fraction of valid outputs with
  $\astsz(\varphi)>\astsz(\varphi^\star)+25$. Budgeted metrics are essential in FullObs and EC, where some models achieve high
validity using large case splits.

% \textbf{Decoding settings.}
% All models use a maximum output length of 32{,}000 tokens (when supported) with up to 5 retry attempts for API failures.
% Grok4 nonetheless exhibits frequent non-returns ($\approx$33\% missing on FullObs), which we count as incorrect in \emph{Acc\_all}.
% Full model identifiers and retry settings are in Appendix \cref{tab:model_settings}.

\section{Experiments and v1 Results}\label{sec:experiments}

We now summarize a snapshot of v1 results for FullObs/CI/EC.
All metrics use \emph{all} instances as the denominator (missing outputs count as
incorrect) unless otherwise noted.
These numbers are intended as a stable picture of v1 behavior; they may
shift modestly as remaining runs complete.
All models were evaluated with the same prompt template and standardized decoding limits; see Appendix A.12 for exact model identifiers and decoding parameters.

\textbf{Parsimony and generalization.}
Across tasks, \emph{budgeted} success (Acc@gold$+25$) and \emph{unbounded} success can diverge substantially, especially in FullObs and EC (\cref{tab:summary_across_tasks}; \cref{fig:fo_budget_curve,fig:ec_budget_curve}).
To test whether formula bloat merely reflects harmless syntactic redundancy or instead indicates overfitting, we evaluate \emph{held-out} worlds labeled by the planted gold concept.
The resulting generalization curves (\cref{tab:fo_generalization,tab:ci_generalization}; \cref{fig:fo_gen_bins}) show that low-bloat solutions generalize dramatically better than high-bloat solutions, supporting the use of gold-relative budgets as a proxy for conceptual abstraction rather than case-by-case patching.

% Across-task summary table (auto-generated)
\input{../auto/tables/summary_across_tasks.tex}

\subsection{FullObs multi-band difficulty and bloat}

FullObs v1 contains 375 problems across six bands:
simple ($\QD=1$), easy, medium, hard, and two extreme slices (logic/context) (Appendix \cref{tab:fo_bandwise}).
\cref{fig:fo_budget_curve} shows budgeted accuracy curves.
Additional diagnostic tables appear in
Appendix~\ref{sec:appendix_tables}.

\begin{figure}[tb]
  \centering
  \includegraphics[width=\columnwidth]{fo_budget_curve.pdf}
  \caption{FullObs v1 budgeted accuracy curves: Acc@(+$\Delta$) for
  $\Delta \in [0, 100]$.}
  \label{fig:fo_budget_curve}
\end{figure}

FullObs exhibits a pronounced cliff from $\QD=1$ to $\QD=2$ multi-world induction.
Grok4 attains high accuracy when it completes but suffers low coverage
($\approx$33\% missing) even after multiple retries; we therefore report
accuracy as a fraction of all problems.
GPT-5.2 achieves strong unbounded accuracy but scores lower under budgeted metrics,
revealing heavy reliance on large formulas
(24\% bloat rate).
GPT-4o achieves 0\% FullObs accuracy despite 100\% coverage, suggesting these tasks require capabilities not reliably elicited in this baseline under our standardized prompting.
Some models exploit equality (not used by the gold templates) to express alternative
correct definitions; Appendix~\ref{app:equality} summarizes equality usage and success
rates.


\textbf{Bloat hurts generalization.}
To investigate whether bloated formulas truly capture the underlying concept,
we generated $H=5$ held-out worlds per FullObs instance, drawn IID from the same generator distribution and labeled by the planted gold formula $\varphi^\star$.
We report exact-match rate: the fraction of holdout worlds where the predicted formula matches $\varphi^\star$'s extension on all domain elements.
\cref{tab:fo_generalization} shows that among valid
(train-correct) formulas, \emph{near-gold} solutions (AST $\leq$ gold+1)
generalize at 78--98\% holdout exact-match, while \emph{above-gold} solutions
(AST $>$ gold+1) drop to 15--41\%.
(Note: the near-gold threshold of +1 is stricter than the bloat-rate threshold of +25 used in summary statistics.)
This confirms that bloated formulas often overfit to training worlds rather
than learning the true concept.

The near-gold vs.\ above-gold gap is large, ranging from 47 to 75 percentage points for top models (\cref{tab:fo_generalization}).
Moreover, the effect is monotone: when we bin train-correct predictions by relative AST size, holdout exact-match degrades steadily as $\mathrm{AST}(\hat\varphi)-\mathrm{AST}(\varphi^\star)$ increases (\cref{fig:fo_gen_bins}).

% FullObs generalization table (auto-generated)
\input{../auto/tables/fo_generalization.tex}

\IfFileExists{../auto/figs/fig_fullobs_generalization_vs_bloat_bins.pdf}{%
  \begin{figure}[tb]
    \centering
    \includegraphics[width=\columnwidth]{fig_fullobs_generalization_vs_bloat_bins.pdf}
    \caption{FullObs holdout generalization by AST delta bin. For train-correct
    predictions, we measure exact-match rate on held-out worlds. Generalization
    degrades monotonically with formula complexity across all models.}
    \label{fig:fo_gen_bins}
  \end{figure}
}{}

\textbf{Controlling for instance difficulty.}
Bloated formulas could appear disproportionately on harder problems,
confounding the bloat--generalization relationship.
We address this by comparing multiple train-correct solutions to the \emph{same} problem across models.
For each problem with $\geq 2$ valid predictions, we compare holdout generalization of the shortest
vs.\ longest formula. Importantly, we exclude exact gold matches,
which have perfect holdout generalization by construction.
\cref{tab:within_problem_bloat_fo} shows that even within the same problem, near-gold formulas
\emph{different from the gold formula}
generalize dramatically better: among 42 problems with both near-gold and above-gold solutions, near-gold
formulas achieve 86\% holdout exact-match vs.\ 9.9\% for above-gold,
with 88\% of problems showing positive $\Delta$ and only 5\% showing negative $\Delta$.
This confirms that the parsimony--generalization gap is not an artifact of problem difficulty.


\input{../auto/tables/fullobs_within_problem_bloat_control.tex}

\subsection{Contrastive success and NO-world failures}

CI v1 contains 200 problems with two bands: core (120) and lift\_mix (80).
\cref{tab:ci_failure} decomposes failures into YES-world failures
(the formula fails to match some YES world), NO-world failures (the formula exactly matches some NO world), parse errors, and missing responses.
Additional diagnostic tables appear in
Appendix~\ref{sec:appendix_tables}.

% CI failure decomposition table (auto-generated) - key diagnostic in main
\input{../auto/tables/ci_failure.tex}

Most CI failures are on YES worlds; however, CI problems contain
on average 7.9 YES worlds vs.\ 2.0 NO worlds, so the failure rates are roughly
proportional to world counts (normalized ratio of 0.85) rather than indicating
a specific weakness in positive vs.\ negative generalization.
We found that CI difficulty is sensitive not only to task semantics but also
to generation success conditions.
Because the CI generator must keep a population of plausible trap hypotheses
alive during YES-world construction, so that NO worlds can later kill them,
it produces YES worlds that are easier to satisfy with simple formulas.

\cref{tab:ci_generalization} shows held-out generalization for CI, using 3 YES + 2 NO holdout worlds per instance.
A key subtlety is that CI correctness does \emph{not} uniquely identify the planted gold concept: many formulas can satisfy the CI criterion by matching the finite YES set while avoiding exact matches on NO worlds, without coinciding with $\varphi^\star$.
Accordingly, absolute holdout YES exact-match rates can be low even among train-correct CI solutions (\cref{tab:ci_generalization}).
Nevertheless, the same qualitative pattern persists as in FullObs: near-gold CI solutions generalize better than above-gold ones, and the generalization rate decays monotonically with AST delta (Appendix~\ref{sec:appendix_ci_gen_bins}, \cref{fig:ci_gen_bins}).
The bloat effect persists for CI, though it is smaller than in FullObs (Appendix \cref{tab:within_problem_bloat_ci}).


% CI generalization table (auto-generated)
\input{../auto/tables/ci_generalization.tex}


\subsection{Existential completion and parsimony gap}

EC v1 contains 200 problems with two bands:
core (120; $\QD=1$) and hard (80; $\QD=2$).
\cref{tab:ec_bands} shows the sharp gradient between core and hard bands;
\cref{fig:ec_budget_curve} shows budgeted accuracy curves revealing the
parsimony gap.
Additional diagnostic tables appear in
Appendix~\ref{sec:appendix_tables}.

% EC band table (auto-generated) - kept in main for core vs hard comparison
\input{../auto/tables/ec_bands.tex}

On EC-core, models achieve high budgeted accuracy (e.g.,
GPT-5.2 82.5\%, Grok4 78.3\%, Gemini 75.8\% at +25), which is expected because this band
was designed to be easy and consists entirely of $\QD=1$ problems.
On EC-hard, accuracy drops sharply (GPT-5.2 25.0\%, Gemini 16.2\%, Grok4 15.0\%
at +25), creating headroom.
Notably, GPT-5.2 exhibits a large validity--budget gap and high bloat (18.5\%),
whereas other models achieve most of their validity with compact formulas.
For GPT-5.2, EC provides a probe of the parsimony gap between \emph{finding
any completion-consistent explanation} and \emph{finding a succinct one}.

Finally, as a diagnostic for EC failures, we compute the minimum number of label mismatches
achievable under any completion of the unknown atoms (Appendix~\ref{sec:appendix_ec_best_completion}).
Most invalid formulas are several mismatches from EC validity even under the best completion.

\FloatBarrier
\section{Related Work}\label{sec:related}

\textbf{Inductive logic programming and concept learning.}
Learning logical theories from examples has a long history in inductive logic
programming (ILP), including classic systems and formalisms for learning Horn
clauses and relational rules from structured data
\cite{muggleton1991inductive,muggleton1994inductive,quinlan1990foil,srinivasan2001aleph,cropper2020metagol}.
Our setting differs in two respects: (i) the hypothesis language is full
first-order logic (at small quantifier depth), not restricted to
Horn clauses; and (ii) we focus on \emph{benchmarking generalization behavior
of modern neural models} under mechanically checkable
semantics rather than proposing a new learning algorithm.
Nevertheless, many of our dataset design concerns---avoiding trivial
hypotheses, ensuring identifiability, and constructing informative
examples---mirror ILP practice.

\textbf{Program synthesis and constraint-based learning.}
Synthesizing symbolic programs or formulas that satisfy input--output
constraints is central in program synthesis
\cite{solarlezama2006combinatorial,alur2013sygus,torlak2014rosette}.
Our tasks can be viewed as a constrained synthesis problem where the
specification is given by multiple finite models and target extensions.
Unlike many synthesis benchmarks, we explicitly incorporate \emph{relational
structure} and quantifiers and we evaluate learned hypotheses across multiple
worlds.
We also adopt a benchmark-level analogue of counterexample-guided inductive synthesis
(CEGIS) \cite{solarlezama2008thesis}: the generator constructs additional worlds
that eliminate families of distractor hypotheses, producing instances with
controlled version-space properties.

\textbf{Logical and relational reasoning benchmarks.}
A large body of work evaluates reasoning in logic settings, including
datasets for deductive reasoning, theorem proving, and rule-based inference
\cite{tafjord2021proofwriter,han2022folio,clark2020aristo,polu2020generative}.
Many such benchmarks are mediated through natural language, where ambiguity
and linguistic priors can dominate performance.
\textsc{INDUCTION} instead presents \emph{extensional finite structures} and
requires output in a precise symbolic language, enabling exact model checking
and finer-grained diagnostics about quantifiers, relational patterns, and
formula complexity.
Our contrastive CI variant is conceptually related to discriminative concept
learning (e.g., Zendo/Bongard-style settings), but differs by operating in a
fixed relational signature with solver-verifiable semantics
\cite{bongard1970pattern,zendo2001}.

\textbf{Mathematical reasoning and formal proof benchmarks.}
Recent work evaluates models on mathematical reasoning both in natural language
(e.g., grade-school and competition-style problem sets) and in \emph{formal}
settings where outputs are mechanically verified inside proof assistants.
Natural-language benchmarks include GSM8K-style word problems and competition math
datasets such as MATH \cite{cobbe2021training_verifiers_gsm8k,hendrycks2021math_dataset}.
The formal direction has produced benchmarks for tactic and proof
generation in systems such as Coq and HOL Light, as well as formalization and
autoformalization testbeds (e.g., CoqGym, HOList, miniF2F, ProofNet)
\cite{yang2019coqgym,bansal2019holist,zheng2022minif2f,azerbayev2023proofnet}.
While these benchmarks emphasize deductive search and library-aware proof construction,
\textsc{INDUCTION} targets a complementary capability: \emph{concept induction}, that is,
synthesizing compact first-order definitions that generalize at the quantifier level
across multiple finite structures under solver-verifiable semantics.

\textbf{Neuro-symbolic learning and large language models.}
Recent work explores neural approaches to symbolic reasoning \cite{garcez2023neurosymbolic,polu2020generative}.
Large language models can emit syntactically valid logical expressions, but
their robustness and generalization remain difficult to assess.
Several evaluations report that models may rely on shortcuts, surface patterns,
or excessively long outputs when unconstrained \cite{wei2022chain,dziri2024faith}.
\textsc{INDUCTION} contributes a setting where correctness is exact and
solver-verifiable and negative evidence can be made informative by construction
(CI).


\section{Limitations}
\label{sec:limitations}

\textbf{Restricted scope and synthetic distributions.}
\textsc{INDUCTION} v1 uses a small relational signature (unary $P,Q$ and binary $R,S$) and small finite domains to keep instances solver-verifiable and difficulty-controllable.
Results should therefore be interpreted as competence with \emph{finite-structure} FOL synthesis under a controlled distribution, and may not transfer to richer settings with larger vocabularies, constants/functions, types, or background theories.

\textbf{Template-bounded concepts and non-unique solutions.}
Gold concepts come from a curated template pool (bounded quantifier depth and AST), and worlds are constructed to separate them from plausible distractors.
This improves diagnostic clarity but can induce template biases.
Moreover, many distinct formulas can agree with $T$ on sampled worlds; we use gold-relative AST budgets to prefer succinct hypotheses, but AST is only a proxy for simplicity and we do not attempt full equivalence checking.

\textbf{EC semantics and generation-time artifacts.}
Existential completion can be satisfied by large disjunctions that effectively case-split across worlds; we quantify this via bloat-aware scoring.
CI generation can also introduce selection effects (resampling failures can shift the realized instance distribution).

\textbf{Interface sensitivity and benchmark overfitting.}
Evaluation depends on a fixed prompt, output syntax, and parser; we report coverage/parse failures but do not tune prompts per model.
As with any public benchmark, repeated use risks overfitting or contamination; we recommend held-out splits or periodic regeneration under a frozen spec.

\section{Conclusion}

We introduced \textsc{INDUCTION}, a solver-verifiable benchmark suite for first-order
concept synthesis over finite relational worlds.  The three tasks---\textsc{FullObs}
(exact multi-world matching), \textsc{CI} (contrastive YES/NO worlds), and \textsc{EC}
(existential completion under partial observation)---share a common language and evaluation
pipeline while isolating distinct failure modes.  Across all three, we find sharp
difficulty gradients as quantifier depth and the number of worlds increase, and we
identify hard subfamilies that preserve headroom even as simpler cases saturate.

A central lesson is that \emph{validity alone is not a sufficient indicator of
learned structure}.  Budgeted, gold-relative scoring reveals a substantial \emph{parsimony gap}:
some models achieve high unbounded success by emitting large case-splitting formulas.
Held-out validation shows that the gap is not merely cosmetic: among
training-correct solutions, compact formulas generalize dramatically better than bloated
ones, suggesting that bloat is strongly
associated with overfitting to the finite set of observed worlds.

Viewed through the lens of hypothesis formation, these results suggest that what distinguishes
stronger systems is not only the ability to produce \emph{a} consistent formula, but the
ability to find \emph{succinct hypotheses} that remain stable under additional evidence---a
hallmark of effective conjecture formation in both science and mathematics.
Benchmarks that surface this distinction may therefore serve as a practical proxy for progress
toward machine-supported discovery.


\textsc{INDUCTION} also provides a methodology for building logic benchmarks: because instances
are finite and solver-checkable, we can precisely
define semantics, control difficulty through targeted generation and diagnostics, and
analyze errors at the level of quantifier structure and relational patterns.  Future
work includes extending to richer signatures and relational vocabularies and developing
synthesis baselines for abductive, causal, and other forms of abstract reasoning.
We hope these benchmarks help sharpen empirical claims about logical generalization
in models and encourage evaluation protocols that reward \emph{succinct, stable} hypotheses.

\section*{Software and Data}

We provide an anonymized supplementary artifact
containing the frozen v1 benchmark datasets, the evaluation and analysis code, and cached model outputs
sufficient to regenerate the tables and figures reported in the paper. If accepted, we will release
the full benchmark generation pipeline and complete repository publicly.


\section*{Impact Statement}
This work introduces benchmarks for inductive synthesis of first-order definitions and metrics that reward succinct, generalizable hypotheses.
As with other benchmarks, it may bias research toward the covered distributions or incentivize gaming; we mitigate this by releasing diagnostics and recommending held-out splits or periodic regeneration.



\bibliography{mybib}
\bibliographystyle{icml2026}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix
\onecolumn

\section{Additional Tables and Figures}
\label{sec:appendix_tables}

This appendix contains structural and diagnostic breakdowns referenced in the
main text.

\subsection{FullObs Diagnostic Tables}

\textbf{Overview of additional FullObs diagnostic tables.}
The tables below unpack the headline FullObs results from Section~6.1.
\cref{tab:fo_bandwise} shows band-wise accuracy, revealing the sharp difficulty gradient.
The remaining tables report overall budgeted and unbounded accuracy (denominator=all), coverage, and bloat;
break budgeted accuracy by formula family, highlighting which quantifier/relational patterns remain difficult;
and decompose valid outputs by relative size (compact/equal/longer/bloat), supporting the parsimony-gap distinction.

% FullObs band-wise table (moved from main text)
\begin{table}[t]
\centering
\small
\begin{tabular}{@{}lrrrrr@{}}
\toprule
Model & simple & easy & medium & hard & extreme \\
\midrule
Grok4 & \textbf{100.0\%} & 75.0\% & \textbf{43.0\%} & \textbf{39.0\%} & \textbf{16.0\%} \\
GPT-5.2 & \textbf{100.0\%} & \textbf{89.0\%} & 29.0\% & 21.0\% & 0.0\% \\
Grok4.1f & 92.0\% & 26.0\% & 8.0\% & 8.0\% & 4.0\% \\
Gemini 3 & 96.0\% & 26.0\% & 5.0\% & 4.0\% & 2.0\% \\
DSR & 68.0\% & 13.0\% & 7.0\% & 2.0\% & 0.0\% \\
Opus 4.5 & 80.0\% & 10.0\% & 1.0\% & 1.0\% & 0.0\% \\
Hermes4 & 40.0\% & 0.0\% & 0.0\% & 0.0\% & 0.0\% \\
GPT-4o & 0.0\% & 0.0\% & 0.0\% & 0.0\% & 0.0\% \\
\bottomrule
\end{tabular}
\caption{FullObs v1 band-wise accuracy (Acc\_all, denominator = total problems per band). Extremes are aggregated.}
\label{tab:fo_bandwise}
\end{table}

% FullObs appendix tables (overall, family breakdown, failure modes)
\input{../auto/appendix/fo_appendix_tables.tex}

\subsection{CI Diagnostic Tables}

\paragraph{Overview of additional CI diagnostic tables.}
CI permits many solutions that satisfy the YES/NO criterion without matching the planted gold concept.
Accordingly, the CI tables in this appendix separate \emph{overall} performance (\cref{tab:ci_overall}), \emph{structural} variation by family (\cref{tab:ci_family}), and band-specific failure profiles (\cref{tab:ci_band_core,tab:ci_band_lift_mix}).
Together they clarify whether errors arise primarily from failing to fit the YES evidence (YES-fail) or from falling into trap hypotheses that exactly match a NO target (NO-fail), complementing the summary in \cref{tab:ci_failure} and the holdout analysis in \cref{tab:ci_generalization} and \cref{fig:ci_gen_bins}.

% CI appendix tables (overall, family breakdown, band failure breakdown)
\input{../auto/appendix/ci_appendix_tables.tex}

% CI failure mode figure
\begin{figure}[h]
  \centering
  \includegraphics[width=\columnwidth]{ci_failure_modes.pdf}
  \caption{CI v1 failure mode distribution by model (stacked bar chart).}
  \label{fig:ci_failure_modes}
\end{figure}

\subsection{EC Diagnostic Tables}

\paragraph{Interpreting EC validity and budgeted success.}
EC evaluates existential-completion validity per world and reports budgeted accuracy Acc@gold$+25$ to penalize case-splitting solutions.
\cref{tab:ec_overall} summarizes overall EC validity, budgeted accuracy, coverage, and bloat; \cref{tab:ec_family} breaks Acc@gold$+25$ by family.
\cref{tab:ec_failures} characterizes whether models' valid solutions tend to be near-gold or above-gold.
These diagnostics complement the main EC band split (\cref{tab:ec_bands}) and the budget curves (\cref{fig:ec_budget_curve}).
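
The budgeted metric described above can be sketched in a few lines of Python. This is an illustrative reading of Acc@gold$+\Delta$, not the released evaluation code: the \texttt{Record} layout and the function name \texttt{budgeted\_accuracy} are hypothetical, and AST sizes are taken as given because the size measure itself is defined elsewhere in the paper. The denominator is all instances, so invalid or unparsed outputs count as failures.

\begin{small}
\begin{verbatim}
```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    """One model output on one instance (hypothetical record layout)."""
    valid: bool               # passed the task's validity check
    pred_ast: Optional[int]   # AST size of the prediction; None if unparsed
    gold_ast: int             # AST size of the planted gold formula


def budgeted_accuracy(records: Iterable[Record], delta: int) -> float:
    """Fraction of ALL instances with a valid formula of AST <= gold + delta."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if r.valid and r.pred_ast is not None and r.pred_ast <= r.gold_ast + delta
    )
    return hits / len(records)


records = [
    Record(valid=True, pred_ast=10, gold_ast=8),     # near-gold, valid
    Record(valid=True, pred_ast=40, gold_ast=8),     # valid but bloated
    Record(valid=False, pred_ast=5, gold_ast=8),     # compact but invalid
    Record(valid=False, pred_ast=None, gold_ast=8),  # parse failure
]
print(budgeted_accuracy(records, 25), budgeted_accuracy(records, 100))
```
\end{verbatim}
\end{small}

Sweeping \texttt{delta} over a range of budgets yields a curve of the kind shown in \cref{fig:ec_budget_curve}; the gap between a small and a large budget is one way to read off the parsimony gap for a given model.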

% EC appendix tables (overall, family breakdown, failure modes)
\input{../auto/appendix/ec_appendix_tables.tex}

\subsection{Error Profiles}
\label{sec:appendix_error_profiles}

\textbf{Interpreting FP/FN and NO margin measurements.}
\cref{tab:error_profiles} decomposes per-world errors into false positives (predicting $T$ for elements labeled $\neg T$) and false negatives (missing elements labeled $T$).
For FullObs these rates are averaged over all training worlds; for CI, YES FP/FN are computed on YES worlds only, while the NO margin reports the mean number of mismatched elements on NO worlds (higher values indicate stronger separation from the NO targets).
A useful qualitative pattern is that models with strong accuracy on FullObs tend to have lower FP and FN simultaneously, whereas weak models often exhibit high FP (overly permissive predicates) and/or high FN (overly strict predicates).
In CI, NO failures correspond precisely to the degenerate case of zero NO margin (exactly matching a NO target), which is why NO margin complements the YES/NO-fail decomposition in the main results.
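
As a concrete illustration of these quantities, the following Python sketch computes per-world false-positive and false-negative counts and the NO margin from predicted and labeled extensions. The function names are hypothetical, and the sketch returns raw counts because the normalization used for the reported rates is not restated here.

\begin{small}
\begin{verbatim}
```python
from typing import Iterable, Set, Tuple


def error_counts(pred: Set[str], target: Set[str]) -> Tuple[int, int]:
    """Return (false positives, false negatives) for one world."""
    false_pos = len(pred - target)   # predicted T, labeled not-T
    false_neg = len(target - pred)   # labeled T, missed by the prediction
    return false_pos, false_neg


def no_margin(preds: Iterable[Set[str]],
              targets: Iterable[Set[str]]) -> Tuple[float, bool]:
    """Mean number of mismatched elements over NO worlds, plus a NO-fail flag.

    A NO world with zero mismatches is matched exactly, which is the
    degenerate case counted as a NO failure.
    """
    mismatches = [len(p ^ t) for p, t in zip(preds, targets)]
    if not mismatches:
        raise ValueError("need at least one NO world")
    mean_margin = sum(mismatches) / len(mismatches)
    no_fail = any(m == 0 for m in mismatches)
    return mean_margin, no_fail


# Two NO worlds: the first is missed by one element, the second is matched.
print(no_margin([{"a0"}, {"a1", "a2"}], [{"a0", "a1"}, {"a1", "a2"}]))
```
\end{verbatim}
\end{small}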

% Error profiles table (FP/FN rates across tasks)
\input{../auto/appendix/error_profiles.tex}

\subsection{EC Budget Curves}
\label{sec:appendix_ec_budget}

\cref{fig:ec_budget_curve} shows the budgeted validity curves for EC, analogous to the FullObs curves in the main text.

\begin{figure}[h]
  \centering
  \includegraphics[width=\columnwidth]{ec_budget_curve.pdf}
  \caption{EC v1 budgeted validity curves: Acc@(+$\Delta$) for $\Delta \in [0, 100]$.
  Y-axis shows the fraction of instances with valid formulas having AST $\leq$ gold$+\Delta$.}
  \label{fig:ec_budget_curve}
\end{figure}

\subsection{EC Best-Completion Error Analysis}
\label{sec:appendix_ec_best_completion}

\textbf{EC best-completion error analysis.}
\cref{tab:ec_best_completion} reports, for predictions that fail the EC validity check, how close they could come to validity under an optimal completion of unknown atoms.
A minimum mismatch of 0 would imply EC validity; higher values measure distance from feasibility.
In v1, this diagnostic is primarily a sanity check: most invalid predictions remain at $\geq 3$ mismatches even under the best completion, indicating that failure is typically structural rather than a single mislabeled element.
Because EC validity is existential and completion-dependent, this measure is not intended as a replacement for budgeted accuracy, but it helps distinguish ``almost valid'' hypotheses from those that cannot plausibly be repaired by completion alone.
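
For small numbers of unknown atoms, the best-completion distance can be computed by brute force. The sketch below is an assumption-laden illustration for a single world, not the analysis code behind \cref{tab:ec_best_completion}: it enumerates all $2^u$ completions of $u$ unknown atoms, treats every atom that is neither known true nor unknown as false, and represents the hypothesis as a Python callable. All names are hypothetical.

\begin{small}
\begin{verbatim}
```python
from itertools import product


def min_mismatches(domain, known_true, unknown, target, phi):
    """Minimum |pred XOR target| over all completions of the unknown atoms.

    domain:     list of element names
    known_true: set of ground atoms observed TRUE, e.g. ("R", "a0", "a1")
    unknown:    list of ground atoms whose truth value is unobserved
    target:     set of elements labeled T
    phi:        callable (facts, element) -> bool
    """
    unknown = list(unknown)
    target = set(target)
    best = len(domain)  # a world has at most |D| mislabeled elements
    for bits in product([False, True], repeat=len(unknown)):
        facts = set(known_true) | {atom for atom, on in zip(unknown, bits) if on}
        predicted = {a for a in domain if phi(facts, a)}
        best = min(best, len(predicted ^ target))
        if best == 0:  # EC-valid on this world; no completion can do better
            break
    return best


domain = ["a0", "a1", "a2"]
known_true = {("R", "a0", "a1")}
unknown = [("R", "a1", "a2"), ("R", "a2", "a0")]


def has_r_successor(facts, a):
    """Hypothesis (exists y (R x y)), written directly as a Python predicate."""
    return any(("R", a, b) in facts for b in domain)


print(min_mismatches(domain, known_true, unknown, {"a0", "a1"}, has_r_successor))
print(min_mismatches(domain, known_true, unknown, {"a2"}, has_r_successor))
```
\end{verbatim}
\end{small}

In the second call the hypothesis cannot be repaired by completion alone: the observed atom $R(a_0,a_1)$ forces $a_0$ into the prediction whatever the unknown atoms are set to.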

% EC best-completion table (minimum mismatches for invalid predictions)
\input{../auto/appendix/ec_best_completion.tex}


\subsection{CI Generalization vs Bloat}
\label{sec:appendix_ci_gen_bins}

\cref{fig:ci_gen_bins} shows the CI analogue of the FullObs generalization analysis.
Absolute rates are lower because CI correctness does not uniquely identify the planted gold concept, but the same monotone trend persists.

\IfFileExists{../auto/figs/fig_ci_generalization_vs_bloat_bins.pdf}{%
  \begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{fig_ci_generalization_vs_bloat_bins.pdf}
    \caption{CI holdout generalization by AST delta bin. Same methodology as
    FullObs; bloated formulas generalize worse to new YES worlds.}
    \label{fig:ci_gen_bins}
  \end{figure}
}{}

% CI within-problem bloat control table
\input{../auto/tables/ci_within_problem_bloat_control.tex}

\subsection{Lift-Hard Breakdown}
\label{sec:appendix_lift_hard}

\textbf{Lift-hard as a structural stress test.}
Lift-hard instances implement cross-relational patterns in which relations involving the free variable $x$ appear inside universally quantified contexts, creating a characteristic failure mode (e.g., dropping the lifted literal or confusing argument orientation).
\cref{tab:lift_hard_breakdown} shows that lift-hard instances remain substantially harder than non-lift instances in both FullObs and CI.
Notably, EC v1 contains no lift-hard instances by construction, reflecting a design choice to avoid all-zero headroom in the partial-observation setting while we calibrate nested-quantifier EC hardness.

\input{../auto/tables/tab_lift_hard_breakdown.tex}

\subsection{Generation Design Lessons}
\label{sec:appendix_design_lessons}

Several methodological lessons emerged from generation experiments:
\begin{enumerate}
  \item \textbf{Survivor bands matter.} In CI, the number of traps surviving after YES worlds strongly affects difficulty. Too many survivors make problems easy; too few make generation fail. The tight band $[2,4]$ balances these two pressures.
  \item \textbf{Frozen pools enable diagnostics.} Using a frozen hypothesis pool (rather than per-problem mining) ensures that version-space metrics are comparable across instances and reproducible across runs.
  \item \textbf{Filters are essential.} Without atomic and subformula filters, a significant fraction of instances admit trivial solutions that inflate accuracy without testing quantifier reasoning.
  \item \textbf{Generation success is not uniform.} Some gold formulas are harder to instantiate (valid worlds are rarer). We track generation failure rates and ensure that accepted instances are not biased toward easy-to-generate subfamilies.
\end{enumerate}

\subsection{World Generation Hyperparameters}
\label{sec:appendix_world_gen}

\cref{tab:world_gen_params} lists the exact generation parameters used for each task and band in the v1 benchmark.
These parameters are frozen in the released configuration files.

\begin{table}[h]
\centering
\caption{World generation hyperparameters (v1). $k$ = number of worlds per instance; unknown rate applies to EC only.}
\label{tab:world_gen_params}
\small
\begin{tabular}{@{}llcccl@{}}
\toprule
Task & Band & Domain & $k$ & Unknown rate & Unknown preds \\
\midrule
FullObs & simple & 5--7 & 4 & --- & --- \\
FullObs & easy & 5--7 & 6 & --- & --- \\
FullObs & medium & 7--10 & 8 & --- & --- \\
FullObs & hard & 8--12 & 10 & --- & --- \\
FullObs & extreme & 8--12 & 10 & --- & --- \\
\midrule
CI & core & 7--9 & 7--8 YES + 2--3 NO & --- & --- \\
CI & lift\_mix & 7--9 & 7--8 YES + 2--3 NO & --- & --- \\
\midrule
EC & core & 6--8 & 3 & 20\% & $R,S$ \\
EC & hard & 7--9 & 3 & 20\% & $R$ only \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Unary sampling}: For each unary predicate $U \in \{P, Q\}$, each element $a \in D$ is assigned $U(a)=\text{true}$ independently with probability $p_{\text{unary}}=0.4$, subject to a balance constraint requiring 15--85\% of the domain to satisfy each unary predicate.

\textbf{Binary sampling}: For regular out-degree mode, each element $a$ samples exactly 2 outgoing edges for $R$ (uniformly from $D \setminus \{a\}$) and exactly 2 outgoing edges for $S$, yielding expected edge density $\approx 2/|D|$ per relation.
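
The two sampling steps above can be sketched as follows. This is a minimal illustration, not the released generator: the function names are hypothetical, the balance constraint is enforced here by rejection sampling (the actual mechanism is not specified in this appendix), and the two outgoing edges per element are assumed to be drawn without replacement.

\begin{small}
\begin{verbatim}
```python
import random


def sample_unary(domain, rng, p=0.4, lo=0.15, hi=0.85, max_tries=1000):
    """Sample one unary extension; resample until 15-85% of D satisfies it."""
    n = len(domain)
    for _ in range(max_tries):
        ext = {a for a in domain if rng.random() < p}
        if lo * n <= len(ext) <= hi * n:
            return ext
    raise RuntimeError("balance constraint not met; check p, lo, hi")


def sample_regular_out_degree(domain, rng, out_degree=2):
    """Give every element exactly `out_degree` outgoing edges, no self-loops."""
    edges = set()
    for a in domain:
        others = [b for b in domain if b != a]
        for b in rng.sample(others, out_degree):  # distinct targets
            edges.add((a, b))
    return edges


rng = random.Random(0)
domain = [f"a{i}" for i in range(8)]
world = {
    "P": sample_unary(domain, rng),
    "Q": sample_unary(domain, rng),
    "R": sample_regular_out_degree(domain, rng),
    "S": sample_regular_out_degree(domain, rng),
}
# 2 * |D| edges out of |D|^2 ordered pairs gives density 2 / |D|.
print(len(world["R"]) / len(domain) ** 2)
```
\end{verbatim}
\end{small}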

\textbf{Unknown masking (EC)}: For each world, we collect all ground atoms of the unknown-eligible predicates, shuffle them, and mark the first $\lfloor \text{unknown\_rate} \times \text{total\_atoms} \rfloor$ as unknown.
The unknown rate is 20\% for both EC bands; the core band masks atoms from $R$ and $S$, while the hard band masks only $R$ atoms.
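
The masking step can be sketched in Python as below. The function name and argument layout are hypothetical, and the sketch assumes that the unknown-eligible predicates are binary and that their ground atoms range over all ordered pairs of the domain, including pairs of the form $(a,a)$.

\begin{small}
\begin{verbatim}
```python
import math
import random
from itertools import product


def mask_unknown(domain, eligible_preds, rng, unknown_rate=0.2):
    """Return the set of ground atoms marked unknown for one world."""
    # Collect every ground atom of the unknown-eligible (binary) predicates.
    atoms = [(p, a, b) for p in eligible_preds
             for a, b in product(domain, repeat=2)]
    rng.shuffle(atoms)
    # Mark the first floor(rate * total) atoms of the shuffled list.
    k = math.floor(unknown_rate * len(atoms))
    return set(atoms[:k])


rng = random.Random(0)
domain = [f"a{i}" for i in range(6)]
core_mask = mask_unknown(domain, ["R", "S"], rng)  # core band: R and S
hard_mask = mask_unknown(domain, ["R"], rng)       # hard band: R only
print(len(core_mask), len(hard_mask))
```
\end{verbatim}
\end{small}

With $|D|=6$ there are 36 ground atoms per binary predicate, so the core band masks $\lfloor 0.2 \times 72 \rfloor = 14$ atoms and the hard band masks $\lfloor 0.2 \times 36 \rfloor = 7$ atoms under these assumptions.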

\subsection{EC Relevance Filter Pseudocode}
\label{sec:appendix_ec_filters}

The relevance filters ensure unknown atoms are genuinely relevant to the gold formula:

\begin{small}
\begin{verbatim}
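def check_relevance(world, formula, mode):
  # Extreme completions: every unknown atom set to False / to True.
  C0 = complete_unknowns(world, False)
  C1 = complete_unknowns(world, True)
  # Does the gold formula reproduce the target labels under each?
  m0 = matches_target(formula, world, C0)
  m1 = matches_target(formula, world, C1)
  if mode == "extreme_or":
    # core band: keep the world if some extreme completion works
    return m0 or m1
  elif mode == "middle_allowed":
    # hard band: reject worlds where both extremes work (trivial)
    return not (m0 and m1)
  raise ValueError(f"unknown mode: {mode}")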
\end{verbatim}
\end{small}

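\noindent\texttt{extreme\_or} (core): accept the world if at least one extreme completion (all unknown atoms false, or all unknown atoms true) reproduces the target.\\
\texttt{middle\_allowed} (hard): reject the world if both extreme completions reproduce the target (the trivial case).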

\subsection{Model and Decoding Settings}
\label{sec:appendix_model_settings}

\cref{tab:model_settings} lists the exact model identifiers and decoding parameters used for evaluation.
All models received the same prompt template (Appendix~\ref{sec:appendix_prompts}).

\begin{table}[h]
\centering
\caption{Model identifiers and decoding settings. max\_tok = maximum output tokens (set only for GPT-5.2, Opus 4.5, and Hermes4).}
\label{tab:model_settings}
\small
\begin{tabular}{@{}llc@{}}
\toprule
Display Name & Provider / Model ID & max\_tok \\
\midrule
Grok4 & XAI / \texttt{grok-4-0709} & --- \\
GPT-5.2 & OpenAI / \texttt{gpt-5.2} & 32000 \\
Grok4.1f & XAI / \texttt{grok-4-1-fast-reasoning} & --- \\
Gemini 3 & Google / \texttt{gemini-3-pro-preview} & --- \\
DSR & DeepSeek / \texttt{deepseek-reasoner} & --- \\
Opus 4.5 & Anthropic / \texttt{claude-opus-4-5-20251101} & 21333 \\
Hermes4 & OpenRouter / \texttt{hermes4} & 32000 \\
GPT-4o & OpenAI / \texttt{gpt-4o} & --- \\
\bottomrule
\end{tabular}
\end{table}

All models use JSON output format where supported.
Extended thinking/reasoning is enabled for models that support it (Claude, GPT-5 series, DeepSeek).
Timeout is 1500 seconds for GPT models and 900 seconds for OpenRouter; other providers use default timeouts.
Each API call was retried up to 5 times with exponential backoff on transient failures.
Despite retries, Grok4 failed to return results on approximately 33\% of FullObs problems, which are counted as incorrect in accuracy metrics.

\subsection{Equality Usage Analysis}
\label{app:equality}

The gold formulas in our benchmark do not use equality predicates---all target concepts are expressed using only the unary predicates $P, Q$ and binary predicates $R, S$.
However, models are free to use equality $(= x\, y)$ in their predictions, and some models do so to express equivalent definitions or to construct case splits.

\cref{tab:eq_fullobs,tab:eq_ci,tab:eq_ec} summarize equality usage across tasks.
Key observations:
\begin{itemize}
  \item \textbf{GPT-5.2} uses equality most frequently (13--28\% of predictions across tasks), often in complex case-split formulas with high AST sizes.
  \item Equality-containing formulas in \textbf{FullObs} have lower validity (35.7\%) than overall, suggesting equality often appears in failed attempts at case enumeration.
  \item In \textbf{CI}, equality-containing formulas achieve high validity (79.1\%), possibly because the contrastive setting admits alternative correct definitions.
  \item Most models use equality sparingly ($<$10\% of predictions), with weaker models (Hermes4, GPT-4o) almost never using it.
\end{itemize}

\input{../auto/appendix/equality_usage.tex}

\clearpage
\section{Formula Templates}
\label{sec:appendix_templates}

This section lists the formula templates used to generate gold target concepts in the benchmark.
Templates are organized by quantifier depth (QD) and structural family.
Each formula has one free variable $x$; bound variables are $y$ and $z$.
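
Assuming QD denotes the standard quantifier depth, that is, the maximum number of nested quantifiers along any branch of the formula, it can be computed directly from the S-expression syntax used in this appendix. The sketch below is illustrative only; \texttt{parse} and \texttt{quantifier\_depth} are hypothetical helper names.

\begin{small}
\begin{verbatim}
```python
def parse(src):
    """Parse one S-expression into nested lists of string tokens."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume the closing parenthesis
        return items

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after formula")
    return tree


def quantifier_depth(node):
    """Maximum nesting depth of forall/exists in a parsed formula."""
    if isinstance(node, str):
        return 0
    head, *args = node
    if head in ("forall", "exists"):
        return 1 + quantifier_depth(args[1])  # args = [variable, body]
    if head in ("not", "and", "or"):
        return max(quantifier_depth(a) for a in args)
    return 0  # atomic formula such as (P x), (R x y), or (= x y)


for formula in [
    "(and (P x) (not (Q x)))",
    "(exists y (and (R x y) (P y)))",
    "(forall y (or (not (R x y)) (exists z (S y z))))",
]:
    print(quantifier_depth(parse(formula)), formula)
```
\end{verbatim}
\end{small}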

\subsection{Formula Families}

Formulas are classified into structural families based on their quantifier pattern and guard structure.
\cref{tab:formula_families} describes each family.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}cl@{}}
\toprule
\textbf{Family} & \textbf{Pattern Description} \\
\midrule
A & $\forall y(\neg\text{BIN}(x,y) \lor \exists z\, \text{BIN}(y,z))$ --- simple $\forall\exists$ \\
B & $\exists y(U(y) \land \forall z\ldots)$ --- guarded $\exists\forall$ with positive unary \\
C & $\exists y(\neg U(y) \land \forall z\ldots)$ --- guarded $\exists\forall$ with negated unary \\
D & $\forall y(\neg\text{BIN}(x,y) \lor \exists z(\text{BIN} \land U))$ --- $\forall\exists$ with unary filter \\
F & $\forall y(\neg\text{BIN}(x,y) \lor \exists z(\text{BIN} \land \neg U))$ --- $\forall\exists$ with negated filter \\
G & $\exists y\forall z\ldots$ --- simple $\exists\forall$ (no guard) \\
H & $\exists y\exists z\ldots$ --- path/composition patterns (two existentials) \\
M & $\forall y(\neg\text{BIN}(x,y) \lor \exists z(\text{BIN}(y,z) \land \text{BIN}(x,z)))$ --- mixed-witness \\
Z & $\exists y(U(y) \land \forall z(\neg\text{BIN}(y,z) \lor (\text{BIN}(x,z) \land V(z))))$ --- $\exists\forall$ with $z$-filter \\
\bottomrule
\end{tabular}
\caption{Formula family classification. BIN denotes a binary predicate ($R$ or $S$); $U$, $V$ denote unary predicates ($P$ or $Q$).}
\label{tab:formula_families}
\end{table}

\textbf{Lift-hard patterns.}
A formula is classified as \emph{lift-hard} if a binary predicate involving the free variable $x$ appears inside a universally quantified scope.
For example, $\forall y(\neg R(x,y) \lor \exists z\, S(y,z))$ is lift-hard because $R(x,y)$ appears under $\forall y$.
These patterns require reasoning about $x$'s relationships across all witnesses, which models frequently fail to generalize correctly.
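
Read purely syntactically, the definition above can be checked on the S-expression form of a formula. The following Python sketch is one such reading and is not the benchmark's classifier: the helper names are hypothetical, the check ignores polarity (a negated existential is not treated as a universal), and any further conditions the benchmark may impose, such as a minimum quantifier depth, are not modeled.

\begin{small}
\begin{verbatim}
```python
def parse(src):
    """Parse one S-expression into nested lists of string tokens."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume the closing parenthesis
        return items

    return read()


def is_lift_hard(node, free="x", under_forall=False):
    """True if a binary atom mentioning `free` occurs inside a forall scope."""
    if isinstance(node, str):
        return False
    head, *args = node
    if head == "forall":
        return is_lift_hard(args[1], free, True)
    if head == "exists":
        return is_lift_hard(args[1], free, under_forall)
    if head in ("not", "and", "or"):
        return any(is_lift_hard(a, free, under_forall) for a in args)
    # Atomic formula: only binary predicates count, not equality or unary ones.
    return under_forall and head != "=" and len(args) == 2 and free in args


print(is_lift_hard(parse("(forall y (or (not (R x y)) (exists z (S y z))))")))
print(is_lift_hard(parse("(exists y (and (R x y) (P y)))")))
```
\end{verbatim}
\end{small}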

\subsection{Quantifier-Free Templates (QD=0)}

\begin{small}
\begin{verbatim}
(P x)
(Q x)
(not (P x))
(not (Q x))
(R x x)
(S x x)
(not (R x x))
(not (S x x))
(and (P x) (Q x))
(or (P x) (Q x))
(and (P x) (not (Q x)))
(or (not (P x)) (Q x))
(and (not (P x)) (not (Q x)))
(or (not (P x)) (not (Q x)))
\end{verbatim}
\end{small}

\subsection{Single-Quantifier Templates (QD=1)}

\begin{small}
\begin{verbatim}
(exists y (R x y))
(exists y (S x y))
(exists y (R y x))
(exists y (S y x))
(exists y (and (R x y) (P y)))
(exists y (and (R x y) (Q y)))
(exists y (and (S x y) (P y)))
(exists y (and (R x y) (not (P y))))
(exists y (and (S x y) (not (Q y))))
(forall y (or (not (R x y)) (P y)))
(forall y (or (not (R y x)) (P y)))
(forall y (or (not (S x y)) (Q y)))
(forall y (or (not (R x y)) (Q y)))
(forall y (or (not (S y x)) (P y)))
(and (P x) (exists y (R x y)))
(and (not (P x)) (exists y (and (R x y) (Q y))))
(or (P x) (forall y (or (not (R x y)) (Q y))))
(and (Q x) (exists y (S x y)))
\end{verbatim}
\end{small}

\subsection{Nested-Quantifier Templates (QD=2)}

\begin{small}
\begin{verbatim}
(exists y (forall z (or (not (R y z)) (S x z))))
(exists y (forall z (or (not (S y z)) (R x z))))
(exists y (and (P y) (forall z (or (not (R y z)) (S x z)))))
(exists y (and (Q y) (forall z (or (not (S y z)) (R x z)))))
(exists y (and (not (P y)) (forall z (or (not (R y z)) (S x z)))))
(forall y (or (not (R x y)) (exists z (S y z))))
(forall y (or (not (S x y)) (exists z (R y z))))
(forall y (or (not (R x y)) (exists z (and (S y z) (P z)))))
(forall y (or (not (R x y)) (exists z (and (S y z) (Q z)))))
(forall y (exists z (and (R x y) (S y z))))
(and (P x) (exists y (forall z (or (not (R y z)) (S x z)))))
(and (not (Q x)) (forall y (or (not (R x y)) (exists z (S y z)))))
(or (P x) (exists y (forall z (or (not (R y z)) (S x z)))))
(or (not (P x)) (forall y (or (not (R x y)) (exists z (S y z)))))
(and (Q x) (forall y (or (not (S x y)) (exists z (R y z)))))
(exists y (forall z (or (not (R z y)) (S z x))))
(forall y (or (not (R y x)) (exists z (S z y))))
\end{verbatim}
\end{small}

\textbf{Note:} The actual benchmark uses an expanded pool that includes additional variants with different predicate combinations (e.g., swapping $P$ and $Q$, or $R$ and $S$) and argument orderings.
The templates above represent the core structural patterns; the full pool contains approximately 200 distinct formulas.

\clearpage
\input{../appendix_model_examples}

\clearpage
\section{Prompts}
\label{sec:appendix_prompts}

This section presents the prompts used for each of the three induction tasks.
Each prompt consists of a task description followed by the problem instance (worlds and target labels), and concludes with output format instructions.
The prompts below show the task-specific portions; the problem instance is inserted between the task description and the output instructions.
For FullObs, one example problem description is shown.

\subsection{FullObs Prompt}

\begin{small}
\begin{verbatim}
# First-Order Logic Concept Synthesis

## Task Overview

You are given several finite "worlds," each containing:
- A finite domain of objects (named a0, a1, a2, ...)
- Interpretations of predicates (which objects/pairs satisfy each predicate)
- A target concept T(x) that specifies which objects are "positive" (T is TRUE)
  and which are "negative" (T is FALSE)

**Closed World Assumption**: Only the facts explicitly listed as TRUE are true.
Any predicate application (P(a), R(a,b), etc.) not explicitly listed should be
assumed FALSE.

**Your goal**: Find a first-order logic formula phi(x) with one free variable x
that **perfectly separates** the positive and negative examples:
- phi(c) must evaluate to TRUE for every object c where T(c) is TRUE
- phi(c) must evaluate to FALSE for every object c where T(c) is FALSE

The formula must work correctly for ALL objects in ALL training worlds.

## Output Format

You must output your formula in S-expression syntax. The grammar is:

phi ::= (P x)             -- unary predicate applied to variable
      | (R x y)           -- binary predicate applied to two variables
      | (= x y)           -- equality of two variables
      | (not phi)         -- negation
      | (and phi1 phi2)   -- conjunction (2 or more arguments)
      | (or phi1 phi2)    -- disjunction (2 or more arguments)
      | (forall v phi)    -- universal quantification
      | (exists v phi)    -- existential quantification

**Important constraints:**
- Your formula must have exactly one free variable: x
- All other variables must be bound by quantifiers (forall or exists)
- Variable names should be: x (free), y, z, w (bound by quantifiers)
- Prefer simpler formulas; avoid redundant conjuncts and case-splitting

[... Problem instance inserted here ...]

## Your Task

Analyze the training worlds carefully. Identify what **distinguishes** objects
where T is TRUE from objects where T is FALSE.

**Think step by step:**
1. For each training world, compare the T-TRUE objects against the T-FALSE objects
2. Look for properties (unary predicates P, Q) or relationships (binary predicates
   R, S) that correlate with T
3. Check: Do all T-TRUE objects share some property? Do all T-FALSE objects lack it?
4. Consider whether the pattern involves existential quantification
   ("there exists a y such that...") or universal quantification ("for all y...")
5. Formulate your hypothesis as an S-expression formula
6. Verify: For each object in each training world, check that your formula gives
   TRUE exactly when T is TRUE

## Output

Your final answer must be exactly ONE LINE containing ONLY valid JSON:
- "formula": a single S-expression formula string with one free variable x
- "description": a short plain-English description (one sentence)

Example:
{"formula":"(exists y (and (R x y) (P y)))","description":"x has an R-successor
that satisfies P."}
\end{verbatim}
\end{small}
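
The grammar and the closed-world semantics stated in the prompt above are simple enough to check mechanically. The following Python sketch of a model checker is meant only to make the FullObs criterion concrete; it is not the benchmark's evaluator, and the helper names (\texttt{parse}, \texttt{holds}, \texttt{matches\_target}) and the dictionary layout of a world are assumptions made for illustration.

\begin{small}
\begin{verbatim}
```python
def parse(src):
    """Parse one S-expression into nested lists of string tokens."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume the closing parenthesis
        return items

    return read()


def holds(node, world, env):
    """Evaluate a parsed formula in `world` under variable assignment `env`."""
    head, *args = node
    if head == "not":
        return not holds(args[0], world, env)
    if head == "and":
        return all(holds(a, world, env) for a in args)
    if head == "or":
        return any(holds(a, world, env) for a in args)
    if head in ("forall", "exists"):
        var, body = args
        results = (holds(body, world, {**env, var: a}) for a in world["domain"])
        return all(results) if head == "forall" else any(results)
    if head == "=":
        return env[args[0]] == env[args[1]]
    # Atomic predicate: closed world, so unlisted facts are false.
    return tuple(env[v] for v in args) in world["facts"].get(head, set())


def matches_target(formula, world):
    """True iff the formula agrees with T on every element of the world."""
    tree = parse(formula)
    return all(
        holds(tree, world, {"x": a}) == (a in world["T"])
        for a in world["domain"]
    )


world = {
    "domain": ["a0", "a1", "a2"],
    "facts": {"P": {("a1",)}, "R": {("a0", "a1"), ("a2", "a0")}},
    "T": {"a0"},
}
print(matches_target("(exists y (and (R x y) (P y)))", world))
print(matches_target("(exists y (R x y))", world))
```
\end{verbatim}
\end{small}

A FullObs prediction is then correct when \texttt{matches\_target} returns true on every training world. Evaluation by direct enumeration costs $O(|D|^{q})$ per element for a formula with $q$ nested quantifiers, which is affordable at the domain sizes used here.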

\subsection{CI (Contrastive) Prompt}

\begin{small}
\begin{verbatim}
# First-Order Logic Concept Synthesis (Zendo-Style)

## Task Overview

You are given two sets of finite "worlds":
- **YES worlds**: Worlds where the hidden rule is satisfied
- **NO worlds**: Worlds where the hidden rule is NOT satisfied

Each world contains:
- A finite domain of objects (named a0, a1, a2, ...)
- Interpretations of predicates (which objects/pairs satisfy each predicate)
- A target concept T(x) that labels certain objects as "positive examples"

**Closed World Assumption**: Only the facts explicitly listed as TRUE are true.

Your goal is to find a first-order logic formula phi(x) with one free variable x
such that:
- In YES worlds: phi(x) **exactly matches** T(x) for all objects
- In NO worlds: phi(x) **fails to match** T(x) -- at least one object is misclassified

Think of this like the game Zendo: you must find the secret rule that all YES
worlds follow but NO worlds violate.

## Validity Criterion (Important)

Define **Match(W, phi)** := for all a in domain(W): W |= phi(a) iff a in T_true(W)

Your formula is **correct** iff:
1. For every YES world W: Match(W, phi) is TRUE
2. For every NO world W: Match(W, phi) is FALSE

This means your formula must work perfectly in YES worlds, and must have at least
one error in each NO world.

## Output Format

<same as FullObs prompt>

**Important constraints:**

<same as FullObs prompt>

[... Problem instance inserted here ...]

## Your Task

Analyze the YES and NO worlds carefully. Find the pattern that:
1. Perfectly matches T in all YES worlds
2. Fails to match T in all NO worlds

## Output

<same as FullObs prompt>

\end{verbatim}
\end{small}
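
\noindent The validity criterion stated in the CI prompt can be sketched compactly as follows. The notation is assumed here for illustration: $\W_{\Yes}$ and $\W_{\No}$ denote the sets of YES and NO worlds of an instance, $\mathrm{dom}(W)$ the domain of $W$, and $T^{W}$ the set of objects labeled $T$-TRUE in $W$.
\[
\begin{aligned}
  \mathrm{Match}(W,\varphi) \;&:\Longleftrightarrow\; \forall a \in \mathrm{dom}(W):\ \bigl( W \models \varphi(a) \iff a \in T^{W} \bigr),\\
  \varphi \text{ is correct} \;&:\Longleftrightarrow\; \bigl(\forall W \in \W_{\Yes}:\ \mathrm{Match}(W,\varphi)\bigr)\\
  &\qquad\wedge\; \bigl(\forall W \in \W_{\No}:\ \neg\,\mathrm{Match}(W,\varphi)\bigr).
\end{aligned}
\]
Unfolding the negation, $\neg\,\mathrm{Match}(W,\varphi)$ holds exactly when some $a \in \mathrm{dom}(W)$ has $W \models \varphi(a)$ and $a \notin T^{W}$, or $W \not\models \varphi(a)$ and $a \in T^{W}$. This is the prompt's requirement of at least one misclassified object in each NO world.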

\subsection{EC (Existential Completion) Prompt}

\noindent\textit{(This corrected prompt was used for all reported EC results.)}

\begin{small}
\begin{verbatim}
# First-Order Logic Concept Synthesis (Partial Observation)

## Task Overview

You are given several finite "worlds" with **partial observations**:
- Some predicate facts are **known** (observed as TRUE or FALSE)
- Some predicate facts are **unknown** (not observed)

Each world contains:
- A finite domain of objects (named a0, a1, a2, ...)
- Known facts: predicates whose truth values have been observed
- Unknown atoms: predicates whose truth values are hidden
- A target concept T(x) that specifies which objects are "positive" (T is TRUE)
  and which are "negative" (T is FALSE)

**Observation Rules**:
- Atoms listed under "Known Facts" with explicit TRUE values are TRUE
- Atoms listed under "Unknown Atoms" have unknown truth values
- Any atom NOT listed as TRUE and NOT listed as Unknown is FALSE

The target T(x) values are always fully specified (not unknown).

**Your goal**: Find a first-order logic formula phi(x) with one free variable x
that **perfectly separates** the positive and negative examples:
- phi(c) must be TRUE for every object c where T(c) is TRUE
- phi(c) must be FALSE for every object c where T(c) is FALSE

**Completion semantics**: For each world separately, there must exist at least one
assignment of truth values to the unknown atoms such that phi matches T for all
objects in that world.

## Output Format

<same as FullObs prompt>

**Important constraints:**

<same as FullObs prompt>

[... Problem instance inserted here ...]

## Your Task
Analyze the training worlds carefully, keeping in mind that some facts are unknown.

**Think step by step:**
1. For each training world, compare objects where T is TRUE vs. T is FALSE
2. Note which predicate facts are known vs unknown
3. Look for patterns that **distinguish** T-TRUE objects from T-FALSE objects
   using the known facts
4. Consider whether the pattern involves existential or universal quantification
5. Formulate your hypothesis as an S-expression formula
6. Verify: Check that your formula gives TRUE exactly when T is TRUE, and FALSE
   exactly when T is FALSE (under some valid completion of unknowns)

**Key insight**: Your formula should work based on the known facts. Focus on
patterns that depend on observed predicates.

## Output

<same as FullObs prompt>

\end{verbatim}
\end{small}
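
\noindent The completion semantics described in the EC prompt can be sketched formally as follows. The symbols are assumed here for illustration: $U_W$ is the set of unknown atoms of world $W$, $\sigma$ is a completion assigning a truth value to each unknown atom, $W[\sigma]$ is the fully specified world obtained from $W$ by fixing its unknown atoms according to $\sigma$, and $\mathrm{dom}(W)$ and $T^{W}$ are the domain and the set of $T$-TRUE objects of $W$. Under this reading, $\varphi$ is accepted on $W$ iff
\[
\begin{aligned}
  &\exists\, \sigma : U_W \to \{\mathrm{TRUE}, \mathrm{FALSE}\}\ \ \forall a \in \mathrm{dom}(W):\\
  &\qquad W[\sigma] \models \varphi(a) \iff a \in T^{W}.
\end{aligned}
\]
The existential quantifier over $\sigma$ sits inside the per-world condition, so different worlds may be witnessed by different completions, which matches the prompt's phrase ``for each world separately.''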

\end{document}
