\begin{center}
\textbf{\large Supplementary Materials}
\end{center}
\setcounter{section}{0}
\renewcommand{\thesection}{\Alph{section}}

In the following, the interested reader can find details on evaluations and corresponding information.

\section{\ptc details}\label{app:ptcdetails}

\subsection{Object Representations.}
The approach of \citet{chen2022pix2seq} was originally developed to detect each object in an image and assign it a single class label. We modify this setting by training the model to predict multiple attributes per object. Each attribute is treated individually and is given its own bounding box coordinates. Because Pix2Seq produces an output sequence of variable length, it can emit the same bounding box several times, each time with a different attribute class, within a single prediction. In \ptc, to retrieve the symbolic representations $O$ from such a sequence of detected attributes, the attributes are combined based on their corresponding bounding boxes.
To this end, the bounding box values are compared with a tolerance of 7 to compensate for model errors, and bounding boxes that agree within this tolerance are aggregated. Attribute labels associated with the bounding boxes of one aggregation belong to the same object.
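
One way to make the aggregation criterion concrete is sketched below. This is an illustrative reading rather than a specification: the coordinate-wise comparison and the symbols $b$, $b'$ and $\tau$ are assumptions introduced here, and only the tolerance value of 7 is taken from the description above.
\begin{equation*}
b \sim b' \iff \max_{i \in \{1,\dots,4\}} |b_i - b'_i| \le \tau, \qquad \tau = 7,
\end{equation*}
where $b = (b_1, \dots, b_4)$ and $b' = (b'_1, \dots, b'_4)$ denote the coordinates of two predicted bounding boxes. Under this reading, two detected attributes are assigned to the same object whenever their boxes satisfy $b \sim b'$.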

The object extractor of \ptc was trained independently for 50 epochs. We used the same hyperparameters as \citet{chen2022pix2seq} with ResNet50 as backbone, but decreased the learning rate to 0.0003 and the learning rate of the backbone to 0.00003, and used a batch size of 8.
We fine-tuned a pre-trained version of Pix2Seq\footnote{\url{https://github.com/gaopengcuhk/Pretrained-Pix2Seq}} on 2000 random images each of Kandinsky Patterns and CLEVR~\cite{johnson2017clevr}. Both image sets contain two to ten objects per scene. We report the Average Precision (AP) values for both data sets, computed over five different seeds, in \autoref{tab:pix2seq}.

The metric mAP (reported as AP in \autoref{tab:pix2seq}) averages the Average Precision over ten Intersection over Union (IoU) thresholds, from 0.50 to 0.95 in steps of 0.05. It therefore rewards models that localize objects more precisely. The metrics AP$_{50}$ and AP$_{75}$ give the average precision at IoU thresholds of 50\% and 75\%, respectively. The AP$_{75}$ values are only slightly lower than the AP$_{50}$ values, which shows that there are few object detections with an IoU below 75\%. However, the AP values are noticeably lower than the AP$_{75}$ values. This is because AP also includes IoU thresholds above 75\%, where the model reaches its limits. The metrics AP$_\text{S}$, AP$_\text{M}$ and AP$_\text{L}$ measure the performance of the model on small, medium and large objects. The performance is best on large objects and decreases for medium and small ones, which is typical for object detection models, as larger objects consist of more pixels and therefore provide more features on which a classification can be based.
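
For reference, the relation between these metrics can be written out as follows. This is the standard COCO-style definition, which we assume matches the evaluation used here; the symbols $B_p$, $B_g$ and $\mathrm{AP}_t$ are notation introduced for this sketch.
\begin{align*}
\mathrm{IoU}(B_p, B_g) &= \frac{|B_p \cap B_g|}{|B_p \cup B_g|}, \\
\mathrm{AP} &= \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \dots,\, 0.95\}} \mathrm{AP}_t,
\end{align*}
where $B_p$ and $B_g$ denote a predicted and a ground-truth bounding box, and $\mathrm{AP}_t$ is the average precision when a detection counts as correct only if its IoU is at least $t$. In this notation, AP$_{50}$ and AP$_{75}$ correspond to $\mathrm{AP}_{0.50}$ and $\mathrm{AP}_{0.75}$.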

Overall, Pix2Seq provides high AP values and is, therefore, a well-suited object extractor for our method.

\begin{table}[h]
\centering
\small
\caption{Average Precision of the Pix2Seq approach with multiple attribute classes on \kp and CLEVR images. Models were fine-tuned on the training examples with five different seeds and evaluated on 750 test examples per data set.}
\label{tab:pix2seq}
\begin{tabular}{@{}lllllll@{}}
\hline
Dataset          & AP   & AP$_{50}$  & AP$_{75}$  & AP$_\text{S}$    & AP$_\text{M}$ & AP$_\text{L}$ \\ \hline
\kp & $89.7 \pm 0.45$ & $97.8 \pm 0.31$     & $96.4 \pm 0.49$       & $75.4 \pm 1.10$           & $91.8 \pm 1.08$          & $96.3 \pm 0.61$         \\
CLEVR    & $96.5 \pm 0.26$ & $98.8 \pm 0.19$       & $98.7 \pm 0.16$       & $91.1 \pm 0.8$             & $96.7 \pm 0.24$          & $99.2 \pm 0.32$          \\
\hline
\end{tabular}%
\end{table}

\begin{table}[t]
\caption{DSL used in Pix2Code experiments. t0 can be an arbitrary type.}
\label{tab:dsl}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{@{}p{2cm}l|p{11cm}@{}}
\toprule
Primitive & Types                                      & Description                                                                                 \\ \midrule
true      & bool                                       & Boolean with positive value.                                                                                            \\
not       & bool $\rightarrow$ bool                    & Boolean operator that negates a boolean.                                                    \\
and       & bool $\rightarrow$ bool $\rightarrow$ bool & Boolean operator. If both inputs are true, the output is true. Otherwise it is false.        \\
or        & bool $\rightarrow$ bool $\rightarrow$ bool & Boolean operator. If at least one input is true, the output is true. Otherwise it is false. \\
eq?       & int $\rightarrow$ int $\rightarrow$ bool   & Compares two integer values. If they are equal the output is true. Otherwise it is false.   \\
gt? & int $\rightarrow$ int $\rightarrow$ bool & Compares two integer values. If the first one is greater than the second one, the output is true. Otherwise it is false. \\
find      & t0 $\rightarrow$ list[t0] $\rightarrow$ int                                         &  Searches for the given element in the given list. Returns the index of the element in the list. Throws an error if the element is not found.   \\
max          &  int $\rightarrow$ int $\rightarrow$ int                                          & Given two integer values the function returns the higher value. \\
min & int $\rightarrow$ int $\rightarrow$ int                                          & Given two integer values the function returns the lower value. \\
map & (t0 $\rightarrow$ t1) $\rightarrow$ list[t0] $\rightarrow$ list[t1] & Applies the input function to every element in the given input list of type t0. The output is a list of type t1. \\
index & int $\rightarrow$ list[t0] $\rightarrow$ t0 & Takes an integer and a list as input and outputs the element at the index defined by the input integer. \\
fold & list[t0] $\rightarrow$ t1 $\rightarrow$ (t0 $\rightarrow$ t1 $\rightarrow$ t1) $\rightarrow$ t1 & Inputs a list, a start value and a function. Applies the function to the start value and to the first element in the list and overwrites the start value by the result. Repeats this with every element in the list. \\
length & list[t0] $\rightarrow$ int & Returns the length of the given list. \\
if & bool $\rightarrow$ t0 $\rightarrow$ t0 $\rightarrow$ t0& Inputs a boolean and two options. If the boolean is true, the first option is returned, else the second one.\\
+ & int $\rightarrow$ int $\rightarrow$ int& Adds two integer values.\\
- & int $\rightarrow$ int $\rightarrow$ int& Subtracts two integer values.\\
empty & list[t0] & An empty list.\\
cons & t0 $\rightarrow$ list[t0] $\rightarrow$ list[t0] & Appends the given item to the start of the given list. \\
car & list[t0] $\rightarrow$ t0	& Returns the head of the given list.\\
cdr & list[t0] $\rightarrow$ list[t0]& Returns the tail of the given list.\\
empty? & list[t0] $\rightarrow$ bool & Returns true if the list is empty. Otherwise returns false.\\
forall & (t0 $\rightarrow$ bool) $\rightarrow$ list[t0] $\rightarrow$ bool & Takes a predicate function and a list and applies the function to all elements in the list. If for all the predicate is true, the function returns true. \\
exists & (t0 $\rightarrow$ bool) $\rightarrow$ list[t0] $\rightarrow$ bool &  Takes a predicate function and a list and applies the function to all elements in the list. If for at least one element the predicate is true, the function returns true.\\
count & list[t0] $\rightarrow$ t0 $\rightarrow$ int & Takes a list and an element and counts how often the element appears in the list. \\
0-9 & int & Integer values.\\
10-14* & int & Integer values.\\
\midrule
&&*Only used in revising confounded task experiment. \\
\end{tabular}%
}
\end{table}

\subsection{Program Synthesis.}\label{app:program_synthesis}

For the domain of visual concepts, the input of the program synthesis tasks is a list of symbolic object representations, \ie, of integer lists. Therefore, the domain-specific language (DSL) for Pix2Code is based on the DSL of the list domain of \citet{ellis2023dreamcoder} and was adapted to construct concepts, \eg, with primitives like \textit{forall} and \textit{count}. A list of all primitives used in our evaluations can be found in \autoref{tab:dsl}. The DSL has primitives like \textit{fold} and \textit{map} and logical primitives like \textit{and} and \textit{not}. Further, there are integer values, which are needed for counting but, more importantly, for encoding the attribute values of the objects. \autoref{tab:dsl-ints} provides a mapping from integers to the attributes of \kp and CURI objects.

\begin{table}[t]
\centering
\caption{A mapping from integers to the attributes of \kp and CURI objects.}
\label{tab:dsl-ints}
\begin{tabular}{llll}
\toprule
\kp & & &  \\
size     & color &      shape   \\ 
\midrule
0: small & 0: red    & 0: triangle   \\
1: medium & 1: blue    & 1: square   \\
2: big         & 2: yellow      & 2: circle      \\
&&& \\
&&& \\
&&& \\
&&& \\
\end{tabular}
\quad
\begin{tabular}{llll}
\toprule
CURI & & &  \\
size     & color &      shape &  material  \\ 
\midrule
0: small & 0: gray    & 0: cube &  0: rubber   \\
1: large & 1: blue    & 1: sphere &   1: metal  \\
         & 2: brown      & 2: cylinder  &    \\
         & 3: yellow    &   &    \\
         & 4: red       &   &    \\
         & 5: green     &   &    \\
         & 6: purple    &   &    \\ 
\end{tabular}
\end{table}

For the main evaluations, \ptcs program synthesis component was trained with an enumeration timeout of 720\,s and 96 CPUs. We used the batching strategy ``batch all unsolved'', where the algorithm starts with all tasks and continues with the tasks that were not solved in the previous iterations. Further, a $\lambda$-value of $1.5$, an $\alpha$-value of $30$, and a beam size of $5$ were used. The experiments were run for $15$ iterations.

\subsection{Learned Library Primitives}
\begin{figure}[h]
    \centering
    \includegraphics[width=0.9\columnwidth]{images/abstracted_primitives.pdf}
    \caption{Graphical illustration of the library primitives abstracted by Pix2Code trained on CURI iid.}
    \label{fig:abstracted_primitives}
\end{figure}

In \autoref{fig:abstracted_primitives} we provide a graph of all the abstracted primitives of a Pix2Code model trained on CURI iid. Let us highlight a few observations. One of the basic primitives is f0=($\lambda$ (x y) (forall ($\lambda$ (z) (eq? y (index x z))))). This function can be applied to a list of lists and checks, for each element, whether the value at index x is equal to y. It is quite general and can be used to compare colors, shapes, etc.

In another primitive, f1=(f0 5), the primitive f0 is specialized by fixing the parameter x to the value 5 (which represents the index for color). f1 is in turn used in other primitives, \eg, f8=(f1 6) and f18=(f1 0), where y is set to the values 6 and 0, respectively, which encode the colors purple and gray. This means that f8 represents the concept ``all objects are purple'' and f18 the concept ``all objects are gray''.
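
To make the composition explicit, the primitives can be unfolded by substituting the fixed arguments. The following is a sketch in informal $\lambda$-notation that only applies the definitions of f0, f1 and f8 given above; the list argument $\ell$ is a name introduced here for illustration.
\begin{align*}
\mathrm{f1} &= (\mathrm{f0}\ 5) = \lambda y.\ \mathrm{forall}\ (\lambda z.\ \mathrm{eq?}\ y\ (\mathrm{index}\ 5\ z)), \\
\mathrm{f8} &= (\mathrm{f1}\ 6) = \mathrm{forall}\ (\lambda z.\ \mathrm{eq?}\ 6\ (\mathrm{index}\ 5\ z)), \\
(\mathrm{f8}\ \ell) &= \mathrm{true} \iff \text{every object } z \in \ell \text{ satisfies } (\mathrm{index}\ 5\ z) = 6.
\end{align*}
Since 6 encodes purple for CURI objects (\autoref{tab:dsl-ints}), f8 holds exactly when all objects in the scene are purple.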

\subsection{Classifying an image.}
Given learned concept representations, \ptc can identify whether a specific concept is present in a novel image by first extracting the corresponding object representations and then running a selected program of the concept on them. If the concept is present in the image, the program returns True. \autoref{app:figclassification} sketches this procedure.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\columnwidth]{images/inference.pdf}
    \caption{\ptc overview for the classification of an unseen image. The image gets processed by the object extractor resulting in a symbolic object representation. This is the input to the program used for classification which gives the final prediction.}
    \label{app:figclassification}
\end{figure}

\section{CURI baseline model}\label{app:curib}

As a baseline for comparison with \ptc, we use the model of \citet{vedantam21a}, which we refer to as CURI-B. We train the four proposed architectures with the query loss of \citet{vedantam21a} (\ie, $\alpha = 0$), as we investigate unsupervised concept learning settings in this work. We train for $1000$ steps and use the same hyperparameters as in the authors' main evaluation. Based on \autoref{tab:curi_results_curi_b}, we select the best-performing pooling approach for each dataset and split.

\begin{table}[h!]
\centering
\caption{Mean accuracy (with std) for different datasets and splits, reported individually for the model of \citet{vedantam21a} with each of the proposed pooling approaches.}
\label{tab:curi_results_curi_b}

\begin{tabular}{lcccc}
\toprule
{} &      Concatenation &         Global Avg. &     Relation Net &       Transformer \\
\midrule
\kp iid                 & 50.87 $\pm$ 0.08 & \textbf{59.69} $\pm$ 0.83 & 57.45 $\pm$ 1.47 & 52.76 $\pm$ 0.24 \\
CURI iid                 &   57.96 $\pm$ 1.00 &  65.57 $\pm$ 0.91 &  \textbf{66.68} $\pm$ 1.50 &  62.24 $\pm$ 1.54 \\
Boolean       &   56.63 $\pm$ 0.80 &   63.85 $\pm$ 3.70 &  \textbf{67.86} $\pm$ 1.21 &   66.60 $\pm$ 1.16 \\
Counting         &   54.48 $\pm$ 1.90 &  60.24 $\pm$ 0.68 &  58.87 $\pm$ 2.16 &  \textbf{62.19} $\pm$ 2.44 \\ 
Extrinsic      &  61.89 $\pm$ 2.19 &  67.81 $\pm$ 5.48 &  \textbf{72.56} $\pm$ 0.40 &  70.21 $\pm$ 2.84 \\
Intrinsic      &  53.01 $\pm$ 0.13 &  66.16 $\pm$ 1.33 &  \textbf{67.85} $\pm$ 2.50 &   63.70 $\pm$ 4.54 \\
Binding(color) &  58.41 $\pm$ 0.55 &   65.20 $\pm$ 3.05 &  61.21 $\pm$ 2.61 &  \textbf{69.89} $\pm$ 1.54 \\
Compositional               &  59.16 $\pm$ 1.95 &  65.18 $\pm$ 0.22 &  \textbf{67.63} $\pm$ 0.53 &  65.42 $\pm$ 3.46 \\
Complexity &  57.94 $\pm$ 1.79 &  64.04 $\pm$ 1.31 &  \textbf{65.24} $\pm$ 0.14 &  62.27 $\pm$ 1.74 \\
Binding(shape)               &   59.30 $\pm$ 3.02 &  57.24 $\pm$ 4.67 &  \textbf{66.35} $\pm$ 0.36 &  63.51 $\pm$ 3.45 \\
\bottomrule
\end{tabular}
\end{table}

\section{Data used for main experiments}
\subsection{\kp}\label{app:kp}

\begin{figure}[t]
    \centering
\includegraphics[width=.5\textwidth]{images/kandinsky.pdf}
    \caption{Three example concepts of \kp. The two left images depict positive examples of the concept and the two right images depict negative ones.}
    \label{fig:kp}
\end{figure}

To investigate the relational concept learning abilities of Pix2Code, we constructed the data set \kp, which includes 200 specific Kandinsky Patterns covering concepts of varying complexity with respect to the number of objects, concept types, number of relations, and number of pairs. For each train concept, one support and one query set were created. The support and query sets each consist of 25 examples, with five positive and $20$ negative image examples of the concept. The test concepts have one support set and eight query sets per concept, giving $40$ positive and $160$ negative examples across the query sets. The generated patterns are inspired by the Kandinsky Patterns of \citet{shindo2023alpha}. Example images can be seen in \autoref{fig:kp}.

\begin{table}[h]
\centering
\caption{Overview of the relations that are used to create the \kp data set.}
\label{tab:relations}
\begin{tabular}{p{0.4\columnwidth} | p{0.5\columnwidth}}
\toprule
Relation               & Description                                                                                               \\ \midrule
\texttt{same\_color}           & $\forall x,y: color(x) = color(y)$ \\ \hline
\texttt{same\_shape}            & $\forall x,y: shape(x) = shape(y)$ \\ \hline
\texttt{same\_size}            & $\forall x,y: size(x) = size(y)$ \\ \hline
\texttt{one\_red\_triangle} & $\exists x: color(x)=red \land shape(x)=triangle \land \forall y: (x \neq y) \rightarrow (color(y) \neq red \land shape(y) \neq triangle)$\\
\bottomrule
\end{tabular}%
\end{table}

There are two types of concepts: those where the relations refer to all objects in the image and those where the relations refer to only a pair of objects in the image. The relations include object concepts like ``same shape'' and ``one object is a red triangle''. The relations used are listed in \autoref{tab:relations}. Relations can be combined with \texttt{and} and \texttt{or}, and \texttt{not} can be applied to individual relations.

In \kp, the smallest number of objects is two and the maximum number is six. For concepts where a relation refers to all objects, all objects in an image have to satisfy the relation. For concepts where the relations refer only to a pair, there must exist at least one pair among all objects for which the relations hold, \eg, a pair of objects that have the same shape and the same color. More complex patterns of this type can have relations for the remaining distinct pairs of objects in the image as well. An example of a clause describing such a pattern is given in \autoref{eq:kpexample}, \ie, there is a pair of objects that has the same shape and the same color, there is another distinct pair that also has the same shape and the same color, and there is a third pair that does not have the same shape or does not have the same color.

\begin{align}
\begin{split}
 &\exists x_1 \exists x_2 \exists y_1 \exists y_2 \exists z_1 \exists z_2 \\ 
&    ((x_1 \neq x_2) \land (x_1 \neq y_1) \land (x_1 \neq y_2) \land (x_1 \neq z_1) \land (x_1 \neq z_2)   \\
 &\land (x_2 \neq y_1) \land (x_2 \neq y_2) \land (x_2 \neq z_1) \land (x_2 \neq z_2)\\
&\land (y_1 \neq y_2) \land (y_1 \neq z_1) \land (y_1 \neq z_2) \\
&\land (y_2 \neq z_1) \land (y_2 \neq z_2) \land (z_1 \neq z_2) \\
&\land (same\_shape(x_1, x_2) \land same\_color(x_1, x_2)) \\
&\land (same\_shape(y_1, y_2) \land same\_color(y_1, y_2))\\
&\land (\neg same\_shape(z_1, z_2) \lor \neg same\_color(z_1, z_2)))
\end{split}
\label{eq:kpexample}
\end{align}

\subsection{CURI}\label{app:curi}

The \textbf{CURI} dataset \citep{vedantam21a} is based on CLEVR images~\citep{johnson2017clevr}, which depict 3D objects that possess the attributes \textit{color}, \textit{shape}, \textit{size} and \textit{material} (\cf \autoref{fig:why} for example images). The dataset has a total number of 14 929 abstract concepts. For each concept, the dataset contains at least one \textit{episode}, which consists of a support and a query set of images, each with five positive and 20 negative image examples. Overall, the data set is designed to test for compositional generalization and thus contains eight different concept splits that are based on specific properties that occur only in the test set. 

The counting split tests for counting generalization: its test set contains a selection of 47 counting-based concepts with novel combinations of properties and counts, while some other counting concepts still occur in the train set.
For the intrinsic and extrinsic property splits, concepts like ``green'' and ``metal'', or ``red'' and position ``1'' (on the x or y axis), do not occur together in the training set. For the boolean split, some combinations of properties and logical operators occur only in the test split, \eg, ``green'' and ``or''. Further, the binding splits have some object attributes only in the test concepts, \ie, the shape cylinder occurs only in test concepts for Binding(shape) and the colors purple, cyan and yellow occur only in the test concepts for Binding(color).
The complexity split takes only concepts that are shorter than 10 tokens (\ie, that are less complex) for training and the longer ones for testing. We refer to the original work~\citep{vedantam21a} for further details.

\section{Details on Experimental Evaluations}\label{sec:expdetails}

In the following, we provide additional experimental details, but importantly, also ablation evaluations where both CURI-B and \ptc are provided with ground-truth object information input rather than the raw images. \citet{vedantam21a} refer to this type of input as \textit{schema} input.

All experiments were performed using the following hardware: CPU: AMD EPYC 7742 64-Core Processor, RAM: 2064 GB, GPU: NVIDIA A100-SXM4-40GB GPU with 40 GB of RAM.

\subsection{Choice of CURI Support Sets for \ptc}

The CURI data set is constructed such that a model predicts labels for a query set based on support examples. Therefore, for one concept in CURI there often exist multiple support sets with respective query sets. Our method \ptc works differently in the sense that it retrieves a program from a support set of examples and can then classify arbitrary query examples by executing the program on them. To evaluate \ptc on the CURI dataset, we therefore chose to consider just one support set per concept, which reduces the number of programs that need to be enumerated per concept. In \autoref{tab:curi_results_support_sets}, we analyze whether this changes the performance of \ptc and find that it does not affect the model's performance notably. In our evaluations of \ptc we therefore consider only one support set per concept.

\begin{table}[h]
    \centering
    \caption{Comparison of one support set per task and different support sets per task. Mean Acc@all of \ptc on test tasks of CURI splits with schema inputs are reported.}
    \label{tab:curi_results_support_sets}
    \begin{tabular}{lccc}
    \toprule
     CURI Splits 100   &  Pix2Code  & Pix2Code   \\
     & (same support) & (diff support) \\
    \midrule
     Boolean        & 80.58  & 80.39 \\ 
     Counting       & 58.24  & 57.77  \\ 
     Extrinsic      & 77.51  & 78.03  \\
     Intrinsic      & 89.25  & 89.93  \\
     Binding(color) & 80.89  & 81.04 \\ 
     Compositional  & 77.45  & 77.77 \\ 
     Binding(shape) & 78.35  & 77.58 \\ 
     Complexity     & 73.42  & 73.52 \\ 
     \bottomrule
    \end{tabular}
\end{table}


\subsection{Learning visual concepts.}

\autoref{tab:iid_results_schema} presents the results of an ablation study in which we provide ground-truth symbolic representations of the objects in each image for the \kp iid and CURI iid splits, rather than the representations of \ptcs object extractor or CURI-B's image encoder. We observe that \ptc is competitive with CURI-B, particularly when considering the accuracy on the solved tasks.

\autoref{tab:number_solved_curi_image} (top) presents how many tasks \ptc has solved per dataset, corresponding to an average of $93\%$ for \kp iid and $68.86\%$ for CURI iid.
For schema inputs (\autoref{tab:number_solved_curi_schema}, top), the average share of solved tasks is slightly higher, with $93.67\%$ for \kp and $72.42\%$ for CURI.
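
These shares follow from the counts in \autoref{tab:number_solved_curi_image} and \autoref{tab:number_solved_curi_schema}. The worked computation below is a sketch based on the per-seed counts and rounded averages reported there and introduces no new measurements.
\begin{align*}
\text{CURI iid (image):} \quad & \frac{5777}{8389} \approx 68.86\%, \\
\text{CURI iid (schema):} \quad & \frac{6075}{8389} \approx 72.42\%, \\
\text{\kp iid (schema):} \quad & \frac{92 + 95 + 94}{3 \cdot 100} \approx 93.67\%.
\end{align*}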

\begin{table}[h]
\centering
\small
\caption{Mean test accuracy on Kandinsky and CURI concepts with iid train test splits and \textit{schema} input.}
\label{tab:iid_results_schema}
\resizebox{0.5\columnwidth}{!}{
    \begin{tabular}{@{}lcc|c@{}}
    \toprule
    \multirow{2}{*}{} & \multirow{2}{*}{CURI-B} & \multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@all)\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@solved)\end{tabular}}} \\ 
    & & \\ \midrule
    \multirow{1}{*}{\begin{tabular}[l]{l}\kp\end{tabular}} &  \textbf{91.36 \mbox{\scriptsize$\pm 0.59 $}} & 91.01\mbox{\scriptsize$\pm 0.90 $} &  93.95 \mbox{\scriptsize$\pm 0.51 $}\\  
    \multirow{2}{*}{\begin{tabular}[l]{l}CURI\\(iid Split)\end{tabular}} & \multirow{2}{*}{73.73 \mbox{\scriptsize$\pm 0.31 $}} & \multirow{2}{*}{\textbf{74.16 \mbox{\scriptsize$\pm 1.18 $}}} & \multirow{2}{*}{83.31 \mbox{\scriptsize$\pm 1.65 $}} \\ & & \\
    \bottomrule
    \end{tabular}
}
\end{table}

\begin{table}[h]
    \centering
    \caption{Number of solved CURI concepts for \textit{image} input with \ptc over the three seeds.}
    \label{tab:number_solved_curi_image}
    \begin{tabular}{lccc|cc}
    \toprule
     Datasets    &  Seed 0 & Seed 1 & Seed 2 & Avg. & Total Tasks\\
    \midrule
    \kp iid          &  91 &  94 &  95 &  93 &         100 \\
     CURI iid            &   7005 &  4936 &  5389 &  5777 &       8389          \\ 
     \midrule
    Boolean            &  1665 &  2002 &  1776 &  1814 &      2565 \\
    Counting           &     9 &    20 &    19 &    16 &       47 \\
    Extrinsic            &   540 &   501 &   632 &   558 &      750 \\
    Intrinsic           &   258 &   272 &   223 &   251 &        283 \\
    Binding(color)      &  1927 &  1940 &  2211 &  2026 &        2590 \\
    Compositional        &  1699 &  1597 &  1605 &  1634 &        2402 \\
    Complexity          &  6739 &  6811 &  6914 &  6821 &        8363 \\
    Binding(shape)     &  1002 &   789 &  1165 &   985 &        1484 \\
     \bottomrule
    \end{tabular}
\end{table}

\subsection{Time Costs of Pix2Code}
In \autoref{tab:time_cost} we provide the mean durations (in sec.) for the training of CURI-B and Pix2Code (and its sub-modules) on the iid data set of CURI over three seeds.

\begin{table}[h]
\centering
\caption{Training times of CURI-B and Pix2Code.}
\label{tab:time_cost}
\begin{tabular}{lllll}
\toprule
         & CURI-B             & Pix2Code       & \textit{Object Extractor} & \textit{Program Synthesis} \\ \midrule
Duration & 1094.3 $\pm$ 122 s & 48575.7 $\pm$ 2338 s & 17247.3 $\pm$ 583 s    & 31328.3 $\pm$ 2404 s    \\ \bottomrule
\end{tabular}
\end{table}

The training of CURI-B for $1000$ steps takes, on average, $1094$ seconds (ca.~$18$ minutes). Pix2Code was trained for 15 iterations, which takes on average $48{,}575.7$ seconds (ca.~$13.5$\,h). This is a substantially longer training time. However, we consider it a trade-off for the benefits of improved generalizability, interpretability, and revisability.
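
The figures in \autoref{tab:time_cost} can be put in relation as follows. This is plain arithmetic on the reported mean durations and introduces no new measurements.
\begin{align*}
\frac{48575.7\,\mathrm{s}}{3600\,\mathrm{s/h}} &\approx 13.5\,\mathrm{h}, &
\frac{48575.7}{1094.3} &\approx 44.4, \\
\frac{17247.3}{48575.7} &\approx 35.5\%, &
\frac{31328.3}{48575.7} &\approx 64.5\%.
\end{align*}
That is, training Pix2Code takes roughly 44 times as long as training CURI-B, with about one third of the time spent on the object extractor and about two thirds on program synthesis.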

\subsection{Generalizing to novel combinations of known visual concepts.}
In \autoref{tab:curi_results_schema} we provide the ablation results when both models are trained on schema input. We observe the same trend as in the evaluations of the main text.
\autoref{tab:number_solved_curi_image} (bottom) presents how many tasks \ptc has solved per CURI split, corresponding to a median of $72.56\%$ over all splits. For schema inputs (\autoref{tab:number_solved_curi_schema}, bottom), the median is $74.95\%$.
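
The reported medians can be reproduced from the averaged counts in \autoref{tab:number_solved_curi_image} and \autoref{tab:number_solved_curi_schema}. The following worked computation is a sketch that uses the rounded per-split averages from these tables. With eight splits, the median is the mean of the two middle values of the sorted solve rates; for image input these are the Boolean and Extrinsic splits, and for schema input the Boolean and Binding(shape) splits.
\begin{align*}
\text{image:} \quad & \tfrac{1}{2}\left(\tfrac{1814}{2565} + \tfrac{558}{750}\right) \approx \tfrac{1}{2}\,(70.72\% + 74.40\%) = 72.56\%, \\
\text{schema:} \quad & \tfrac{1}{2}\left(\tfrac{1909}{2565} + \tfrac{1120}{1484}\right) \approx \tfrac{1}{2}\,(74.42\% + 75.47\%) \approx 74.95\%.
\end{align*}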

\begin{table}[h]
\centering
\caption{Mean accuracy (with std) for meta-test tasks of CURI splits reported individually and as the median (with median absolute deviation) over all splits. Hereby the models were provided with \textit{schema} inputs, rather than images.}
\label{tab:curi_results_schema}
\resizebox{0.6\columnwidth}{!}{
    \begin{tabular}{lcc|c}
    \toprule
    \multicolumn{1}{l}{\multirow{2}{*}{\begin{tabular}[l]{l}CURI\\(Comp. Splits)\end{tabular}}} &
     \multirow{2}{*}{CURI-B} & \multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@all)\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@solved)\end{tabular}}} \\ 
     & & \\ \midrule
     Boolean        & 75.69 \mbox{\scriptsize$\pm 0.41 $} & \textbf{80.46 \mbox{\scriptsize$\pm 1.28 $}}  & 91.07 \mbox{\scriptsize$\pm 2.39 $}\\ 
     Counting       & \textbf{70.56 \mbox{\scriptsize$\pm 0.44 $}} & 58.34 \mbox{\scriptsize$\pm 0.45 $}   & 69.16 \mbox{\scriptsize$\pm 2.81 $} \\ 
     Extrinsic      & 76.97 \mbox{\scriptsize$\pm 0.18 $} & \textbf{78.67 \mbox{\scriptsize$\pm 1.82 $}}  & 89.68 \mbox{\scriptsize$\pm 1.93 $} \\
     Intrinsic      & 78.18 \mbox{\scriptsize$\pm 1.61 $} & \textbf{87.47 \mbox{\scriptsize$\pm 3.34 $}}  & 92.64 \mbox{\scriptsize$\pm 0.98 $} \\
     Binding(color) & 78.37 \mbox{\scriptsize$\pm 0.61 $} & \textbf{80.61 \mbox{\scriptsize$\pm 2.27 $}}  & 87.62 \mbox{\scriptsize$\pm 2.49 $} \\ 
     Compositional  & 74.12 \mbox{\scriptsize$\pm 0.91 $} & \textbf{77.19 \mbox{\scriptsize$\pm 0.52 $}}  & 87.21 \mbox{\scriptsize$\pm 0.68 $} \\ 
     Binding(shape) & 72.60 \mbox{\scriptsize$\pm 0.82 $} & \textbf{78.75 \mbox{\scriptsize$\pm 1.68 $}}  & 88.33 \mbox{\scriptsize$\pm 2.26 $} \\ 
     Complexity     & \textbf{75.07 \mbox{\scriptsize$\pm 0.43 $}} &  74.21 \mbox{\scriptsize$\pm 0.56 $}  & 78.43 \mbox{\scriptsize$\pm 0.56 $} \\ 
     \hline 
     Mdn.           & 75.38 \mbox{\scriptsize$\pm 2.19 $} & \textbf{78.71 \mbox{\scriptsize$\pm 1.82 $}}  & 87.98 \mbox{\scriptsize$\pm 2.40 $} \\
     \bottomrule
    \end{tabular}
    }
\end{table}

\begin{table}[h]
    \centering
    \caption{Number of solved CURI concepts for \textit{schema} input with \ptc over the three seeds.}
    \label{tab:number_solved_curi_schema}
    \begin{tabular}{lccc|cc}
    \toprule
     Datasets    &  Seed 0 & Seed 1 & Seed 2 & Avg. & Total Tasks\\
    \midrule
    \kp iid         & 92 & 95 & 94 & 94 & 100\\
     CURI iid                &  6875 &  5482 &  5869 &  6075 &        8389 \\
     \midrule
    Boolean            &  1769 &  2119 &  1840 &  1909 &        2565 \\
    Counting           &    23 &    26 &    20 &    23 &          47 \\
    Extrinsic &   498 &   498 &   634 &   543 &         750 \\
    
    Intrinsic     &   263 &   274 &   211 &   249 &         283 \\
    Binding(color)      &  1963 &  2070 &  2300 &  2111 &        2590 \\
    Compositional         &  1817 &  1754 &  1696 &  1756 &        2402 \\
    Complexity     &  7085 &  7060 &  7132 &  7092 &        8363 \\
    Binding(shape)     &  1130 &   963 &  1267 &  1120 &        1484 \\
     \bottomrule
    \end{tabular}
\end{table}

\subsection{Generalizing to variable number of objects.}\label{app:entity-generalization}

To investigate entity generalization in the context of visual concept learning, we used the CLEVR-Hans repository~\citep{StammerSK21} to generate images for the CURI variations AllCubes-N and AllMetalOneGray-N. In AllMetalOneGray-N, positive images contain only metal objects, at least one of which is gray. Negative images contain one rubber object, while all other objects are metal. Examples of the datasets are depicted in \autoref{fig:all_metal_one_gray}.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.6\columnwidth]{images/metal_one_gray.pdf}
    \caption{Test examples created for AllMetalOneGray-N. Positive images contain only metal objects, at least one of which is gray. Negative images contain one rubber object.}
    \label{fig:all_metal_one_gray}
\end{figure}

For the support sets of AllCubes-N and AllMetalOneGray-N, we used one original, randomly sampled support set of the concepts ``all objects are cubes'' and ``all objects are metal and there exists a gray object'' from the CURI data set. For the query sets, $100$ positive and $100$ negative examples were created and grouped into $25$ examples per query set.

For the evaluations, CURI-B needed to be retrained on the iid train split with the hyperparameter max objects set to $10$, leading to CURI-B-10. The best-performing model was the one with transformer pooling; its test results on the CURI iid split are reported in \autoref{tab:iid_results_schema_10_obj}. For Pix2Code, the originally trained models from the iid split were used to query a program for the support sets of the concepts and to classify the query examples. Both models achieve comparable results on the original data set. However, the evaluations of (Q3) in \autoref{tab:generalizing} show that, in terms of entity generalization, \ptc largely outperforms CURI-B-10.

\begin{table}[h]
\centering
\small
\caption{Mean test accuracy on CURI concepts with iid train test splits and \textit{schema} input where CURI-B is modified to process up to ten objects (CURI-B-10).}
\label{tab:iid_results_schema_10_obj}
\resizebox{0.5\columnwidth}{!}{
    \begin{tabular}{@{}lcc|c@{}}
    \toprule
    \multirow{2}{*}{} & \multirow{2}{*}{CURI-B-10} & \multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@all)\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\(Acc@solved)\end{tabular}}} \\ 
    & & \\ \midrule
    \multirow{2}{*}{\begin{tabular}[l]{l}CURI\\(iid Split)\end{tabular}} & \multirow{2}{*}{74.5 \mbox{\scriptsize$\pm 1.55 $}} & \multirow{2}{*}{74.16 \mbox{\scriptsize$\pm 1.18 $}} & \multirow{2}{*}{83.31 \mbox{\scriptsize$\pm 1.65 $}} \\ & & \\
    \bottomrule
    \end{tabular}
}
\end{table}

\subsection{Interpreting programs.}\label{app:interpret}
To provide the natural language explanations in \autoref{tab:interpretable_programs}, we used
gpt-4-turbo. An exemplary prompt for "all objects are cyan" is shown in \autoref{txt:prompt}. Note that the prompt uses the raw program output, whereas in \autoref{tab:interpretable_programs} and \autoref{tab:interpretable_programs_app} the programs were parsed into a more readable form and integer values were substituted with their semantic meaning.

Further, we evaluated the language model Gemini\footnote{\url{https://blog.google/technology/ai/google-gemini-ai}} for comparison; the results are given in \autoref{tab:interpretable_programs_app}.

\begin{table*}[h]
\caption{Examples of CURI concepts with \ptc programs and natural language translations from an LLM, \ie, gemini-pro-dev-api. All programs achieve 100\% accuracy on the CURI test scenes.}
\label{tab:interpretable_programs_app}
\begin{tabular}{p{4cm}|p{6cm}|p{6cm}}
\toprule
Original Concept                  & Retrieved Program                                                                            & Natural Language Translation (Gemini)     \\ \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{cyan}}  
    & 
    ($\lambda$ (x) (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{color cyan}} x))
    & 
    "All objects have the color cyan."   \\ 
    \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{cubes}}
    & 
    ($\lambda$ (x) (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape cube}} x)) &
     "All objects have the shape cube." \\
     \midrule \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{purple}} \textcolor{violet}{\textbf{and}} \textcolor{brown}{\textbf{all}} objects are \textcolor{teal}{\textbf{spheres}}
    &    ($\lambda$ (x) (\textcolor{violet}{\textbf{and}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape sphere}} x) (($\lambda$ (v w) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (a) (eq? (index v a) w)))) \textcolor{teal}{\textbf{color purple}} x)))  &                             
    "All objects are spheres and purple." \\
     \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{brown}} \textcolor{violet}{\textbf{or}} \textcolor{brown}{\textbf{all}} objects are \textcolor{teal}{\textbf{cubes}}
    & ($\lambda$ (x) (\textcolor{violet}{\textbf{or}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape cube}} x) (($\lambda$ (v w) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (a) (eq? (index v a) w)))) \textcolor{teal}{\textbf{color brown}} x))) 
    & "All objects are either cubes and made of rubber or are brown." \\
     \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{small}} \textcolor{violet}{\textbf{and}} \textcolor{brown}{\textbf{there exists}} a \textcolor{teal}{\textbf{purple}} object
    &
    ($\lambda$ (x) (\textcolor{violet}{\textbf{and}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{size small}} x) (\textcolor{brown}{\textbf{exists}} ($\lambda$ (v) (($\lambda$ (w a b) (eq? (index b w) a)) v \textcolor{teal}{\textbf{purple color}})) x)))
    &  "All objects are small and there exists a green object." \\
    \midrule \midrule
    There are \textcolor{purple}{\textbf{three}} \textcolor{teal}{\textbf{gray}} objects 
    &
     ($\lambda$ (x) (\textcolor{purple}{\textbf{eq?}} (($\lambda$ (y) (\textcolor{purple}{\textbf{count}} (map ($\lambda$ (z) (($\lambda$ (u v) (index u v)) \textcolor{teal}{\textbf{color}} z)) y))) x \textcolor{teal}{\textbf{gray}}) \textcolor{purple}{\textbf{3}}))
    & "All objects have size large." \\
    \midrule
    There exists an arbitrary object and \textcolor{brown}{\textbf{there exist}} \textcolor{purple}{\textbf{three}} other objects that are \textcolor{teal}{\textbf{blue}} 
    &
    ($\lambda$ (x) (\textcolor{purple}{\textbf{gt?}} (($\lambda$ (y) (\textcolor{purple}{\textbf{count}} (map ($\lambda$ (z) (($\lambda$ (u v) (index u v)) \textcolor{teal}{\textbf{color}} z)) y))) x \textcolor{teal}{\textbf{blue}}) \textcolor{purple}{\textbf{2}}))
    & "There are more than 2 objects with size large." \\
 \bottomrule
\end{tabular}
\end{table*}

\clearpage
\begin{small}
\begin{lstlisting}[label=txt:prompt,caption=Example prompt for LLMs.,float,frame=tb, basicstyle=\small]
There is a list of integer lists that represent objects from an image. 
Each object is encoded by four values for the bounding box of the object, 
then one value for the size, one value for the color, one for the shape 
and one for the material. This means an object is encoded by a list of 8 values: 
[x_min, y_min, x_max, y_max, size, color, shape, material]. 

The values for size (index 4): 
0: small
1: large

The values for color (index 5): 
0: gray
1: blue
2: brown
3: yellow
4: red
5: green
6: purple
7: cyan

The values for shape (index 6):
0: cube
1: sphere
2: cylinder

The values for material (index 7):
0: rubber
1: metal

In the following there is a lambda calculus program that processes a list of objects 
and classifies them based on a rule. The rule determines whether the image belongs 
to a pattern or not (True or False). 

Please give description of the pattern that is detected by the program in one 
sentence.

Program: 
(lambda (#(#(lambda (lambda (forall (lambda (eq? (index $2 $0) $1))))) 6 1) $0)) 

Explanation: 
All objects have the shape sphere.

Program: 
(lambda (#(#(#(lambda (lambda (forall (lambda (eq? (index $2 $0) $1))))) 5) 2) $0)) 

Explanation: 
All objects have the color brown.

Program: 
(lambda (#(#(lambda (lambda (forall (lambda (eq? (index $2 $0) $1))))) 5) 7 $0)) 

Explanation:
\end{lstlisting}
\end{small}
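
To make the encoding used in the prompt concrete, the following minimal Python sketch decodes an $8$-value object list and evaluates the rule of the query program (all objects are cyan). The helper names \texttt{decode\_object} and \texttt{all\_objects\_have} as well as the example scene are hypothetical and not part of \ptc; only the attribute tables are taken from the prompt.

\begin{lstlisting}[language=Python,frame=tb,basicstyle=\small]
# Attribute tables copied from the prompt; the list position is the integer code.
SIZES = ["small", "large"]
COLORS = ["gray", "blue", "brown", "yellow", "red", "green", "purple", "cyan"]
SHAPES = ["cube", "sphere", "cylinder"]
MATERIALS = ["rubber", "metal"]


def decode_object(obj):
    """Turn one 8-value object list into a readable dictionary (hypothetical helper)."""
    x_min, y_min, x_max, y_max, size, color, shape, material = obj
    return {
        "bbox": (x_min, y_min, x_max, y_max),
        "size": SIZES[size],
        "color": COLORS[color],
        "shape": SHAPES[shape],
        "material": MATERIALS[material],
    }


def all_objects_have(objects, index, value):
    """Mirror of (forall (eq? (index i obj) v)): every object has value v at index i."""
    return all(obj[index] == value for obj in objects)


# Made-up scene with two cyan objects (color code 7 at index 5).
scene = [[10, 12, 40, 45, 1, 7, 0, 1], [60, 20, 80, 42, 0, 7, 1, 0]]
decoded = [decode_object(obj) for obj in scene]
is_all_cyan = all_objects_have(scene, index=5, value=7)
\end{lstlisting}

The sketch only illustrates how the integer codes relate to attribute names; the actual parsing of the raw programs into the readable form shown in the tables is not reproduced here.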

\clearpage 

\subsection{Revise confounders.}\label{app:confounding}

For the evaluations of confounding in concept learning, we propose \textbf{CURI-Hans}. It consists of original CURI concepts listed in \autoref{tab:curi-hans} and a confounded test task. 
This test task is confounded by "all objects are cyan", which is added to the support set of the test task, \ie, each object gets the color \textit{cyan}. 
The query set stays unconfounded, \ie, every object can have any color from the set of existing colors of the original CLEVR setting.

\begin{table}[h]
\centering
\caption{Subset of CURI concepts for confounded experiment. For each concept, one episode was selected and the test task has been confounded so that in the support set the positive samples had \textbf{all cyan objects.}}
\label{tab:curi-hans}
\begin{tabular}{lrl}
\toprule
Split & Concept & Description \\ \midrule
 Train          &  6746     &  All objects are blue           \\
 Train          &  6001     &  All objects are spheres           \\
 Train          &  7666     &  There exists a metal object and its x-location is greater than 1           \\ 
 Train          & 4399      & There exists a sphere and there exists an object with y-location equal to 7            \\ 
 Train          &  9659     & There exists a metal object and there exists another object which has the y-location 6            \\ 
 Train          &  14275     & All objects are brown and all objects are cylinders            \\ 
 Train          &  13983     & All objects are red and there exists a cube            \\ 
 Train          &  2524     &  There exists a yellow object and all objects are rubber           \\ \midrule
 Test           &   5327    &   There exists a cube and all objects are \textbf{(cyan and)} metal          \\ 
 \bottomrule
\end{tabular}
\end{table}

For revising the program synthesis component of \ptc, we remove program primitives from $L$, as well as collected programs of the training tasks that include the removed program primitives (as the code model $q_\psi$ is trained on them). To do this more easily, we change the object representations so that each object property has its own integer values, leading to \autoref{tab:curi-hans-ints} (in comparison to \autoref{tab:dsl-ints}).

\begin{table}[h]
\centering
\caption{A mapping from integers to the attributes of CURI-EG objects.}
\label{tab:curi-hans-ints}
\begin{tabular}{llll}
\toprule
CURI & & &  \\
size     & color &      shape &  material  \\ 
\midrule
0: small & 2: gray    & 10: cube &  13: rubber   \\
1: large & 3: blue    & 11: sphere &   14: metal  \\
         & 4: brown      & 12: cylinder  &    \\
         & 5: yellow    &   &    \\
         & 6: red       &   &    \\
         & 7: green     &   &    \\
         & 8: purple    &   &    \\ 
         & 9: cyan      &   &   \\
\bottomrule
\end{tabular}
\end{table}

To remove the confounder \textit{color cyan}, we can therefore remove the primitive \textit{9} as well as the color index \textit{5} (because color is at index $5$ in the object representations). We finally finetune the code model on the modified library and reevaluate on the unconfounded query set. 
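
The revision step can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the actual implementation: programs are treated as plain strings, the function names \texttt{atoms} and \texttt{revise} are hypothetical, and the toy library and programs are made up for the example.

\begin{lstlisting}[language=Python,frame=tb,basicstyle=\small]
import re


def atoms(program):
    """Return the set of atoms (primitive names and integers) in a program string."""
    return set(re.findall(r"[^\s()]+", program))


def revise(library, programs, removed):
    """Drop removed primitives from the library and every program that uses one."""
    kept_library = [p for p in library if p not in removed]
    kept_programs = [prog for prog in programs if not (atoms(prog) & removed)]
    return kept_library, kept_programs


# Toy data (illustrative only): "9" is the cyan primitive and "5" the color index.
library = ["forall", "exists", "eq?", "index", "5", "6", "9", "10"]
programs = [
    "(lambda (forall (lambda (eq? (index 5 $0) 9)) $0))",   # uses the confounder
    "(lambda (forall (lambda (eq? (index 6 $0) 10)) $0))",  # unaffected
]
new_library, new_programs = revise(library, programs, removed={"9", "5"})
\end{lstlisting}

In this toy setting, only the second program survives the revision, and the code model would then be finetuned on the reduced library and program collection.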

\subsection{Revise Counting.}\label{app:colorcount}
To revise \ptc for the counting split of CURI, we add four primitives to the library $L$. Each of these primitives counts the number of occurrences of a value for a given attribute (\ie, \textit{size, color, shape} and \textit{material}) in an object representation list. The primitives are the following:

\texttt{(}$\lambda$ \texttt{(x)(}
 $\lambda$ \texttt{(y)(count (map (} $\lambda$ \texttt{(z) (index 4 z))y)x)))} to count how often the value \texttt{x} occurs as size in the object list \texttt{y}.

 \texttt{(}$\lambda$ \texttt{(x)(}
 $\lambda$ \texttt{(y)(count (map (} $\lambda$ \texttt{(z) (index 5 z))y)x)))} to count how often the value \texttt{x} occurs as color in the object list \texttt{y}.

\texttt{(}$\lambda$ \texttt{(x)(}
 $\lambda$ \texttt{(y)(count (map (} $\lambda$ \texttt{(z) (index 6 z))y)x)))} to count how often the value \texttt{x} occurs as shape in the object list \texttt{y}.

 \texttt{(}$\lambda$ \texttt{(x)(}
 $\lambda$ \texttt{(y)(count (map (} $\lambda$ \texttt{(z) (index 7 z))y)x)))} to count how often the value \texttt{x} occurs as material in the object list \texttt{y}.

The four primitives are added to $L$ with a prior log-probability of $-0.3$.
After that, the module is trained for another iteration to update the code model and \ptc is evaluated again on the test tasks of the counting split, achieving a much higher accuracy.
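
For intuition, the semantics of the four counting primitives can be mirrored in Python. This is a minimal sketch under our own assumptions: the names \texttt{make\_count\_primitive} and \texttt{count\_color} are hypothetical, the scene is made up, and objects are encoded as 8-value lists with size, color, shape and material at indices $4$ to $7$.

\begin{lstlisting}[language=Python,frame=tb,basicstyle=\small]
SIZE, COLOR, SHAPE, MATERIAL = 4, 5, 6, 7  # attribute positions in an object list


def make_count_primitive(attribute_index):
    """Build a curried counter: primitive(value)(objects) -> number of matches."""
    def primitive(value):
        def apply(objects):
            # (count (map (lambda (z) (index i z)) objects) value)
            return [obj[attribute_index] for obj in objects].count(value)
        return apply
    return primitive


count_color = make_count_primitive(COLOR)

# Made-up scene: three objects with color code 0 and one with color code 1.
scene = [
    [0, 0, 10, 10, 0, 0, 0, 1],
    [20, 0, 30, 10, 1, 0, 1, 0],
    [40, 0, 50, 10, 0, 0, 2, 1],
    [60, 0, 70, 10, 1, 1, 0, 0],
]
has_three_of_color_0 = count_color(0)(scene) == 3
\end{lstlisting}

The curried form follows the two nested $\lambda$-abstractions of the primitives: the first argument is the attribute value to look for and the second is the list of objects.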

\subsection{Extending Pix2Code to Natural Images.}\label{app:coco}

To evaluate \ptc on real-world concepts, we created $7$ abstract concepts based on the MS COCO dataset \citep{lin2014microsoft}; an example concept is given in \autoref{fig:coco_concept}. Since we are only investigating the potential of applying \ptc to real-world scenarios, we consider a small set of training tasks and do not investigate the generalization to test tasks here. Each training task has $25$ support images from which programs are retrieved and $100$ images for testing the programs. \autoref{tab:coco_results} presents the $5$ COCO concepts for which \ptc synthesized programs. The two concepts for which no program was retrieved are "There exist two dogs" and "There exists a book or a teddy bear". 

In the context of integrating Pix2Code into more realistic image settings, let us discuss potential bottlenecks and ways to mitigate them. Specifically, more realistic image settings can increase the number and complexity of the object token sequences produced by the object extractor. However, this should not be a bottleneck for the object extraction module of Pix2Code, as object extractors like Pix2Seq can handle such settings, as illustrated in the original work~\citep{chen2022pix2seq}.

In comparison, the program synthesis module has the following potential limitations. First, given a large symbolic representation space, the code model can have difficulty processing the entire symbolic input. However, approaches exist to mitigate this, \eg, an attention-based module could be used. Second, a large symbolic representation space may require more base program primitives (\eg, more integer values), which enlarges the search space and can therefore increase the search time. This may limit the applicability in time-sensitive settings. Concerning this, \cite{ellis2023dreamcoder} suggests increasing the number of CPUs, which allows for parallelizing the searches. If time is not of the essence, one can increase the search timeout parameters. Another possible measure to mitigate long search times is to incorporate a form of pre-filtering of objects and their attributes, thereby reducing the search space.
For our evaluations in (Q6), we used the same hyperparameters as for the other evaluations and $16$ CPUs to obtain the results. For a higher number of real-world training tasks, however, it remains to be investigated whether the search time and the number of CPUs need to be increased.

We note that the current architecture of \ptc does not explicitly handle the object extractor's noise. If the noise is too high, it can happen that \ptc does not find a suitable program as in the case of the two COCO concepts. We propose to integrate object representation uncertainty in future work to apply \ptc to real-world settings in a robust way.


