\section{Introduction}

\begin{figure}[t]
    \centering
    \includegraphics[width=1.\columnwidth]{images/motivation6.pdf}
    \caption{Interpretable visual concept learning: learning concepts from few images that can
    \textit{generalize} to unseen examples and unseen concepts, such that human users can \textit{inspect} and potentially \textit{revise} suboptimal learned concepts.
    }
    \label{fig:why}
\end{figure}

Humans possess the ability to identify recurring concepts in their daily lives, \eg, a driver can identify when pedestrians have the priority independent of the number of pedestrians or other changing properties of the traffic.
However, learning such \textit{visual concepts}, particularly without supervision, still poses a major challenge for machine learning (ML) models. 
This is notably due to the diversity of visual scenes (\cf \autoref{fig:why}), but also the immense space of possible concepts that can be arbitrarily composed of many subconcepts. 
Moreover, it remains necessary for human users to be able to inspect the learned concepts and revise potential errors or shortcuts before deployment of such learning systems, particularly in unsupervised learning settings.

Current ML approaches~(\eg, \citep{santoro2017simple}) that tackle this challenging task still struggle, \eg, with detecting visual concepts based on object relations and with generalizing in few-shot learning scenarios (\cf \citet{stabinger2021evaluating} for a survey). \citet{kim2018not} propose an approach that generalizes to unseen image samples of a concept, but not to unseen concepts. 
\citet{vedantam21a}, on the other hand, show promising results in terms of generalization to unseen concepts; however, the authors do not investigate other forms of generalization, \eg, when the number of objects in an image is increased.
Moreover, the implicit nature of \textit{neural} representations makes the learned concepts opaque to human users and impractical to revise.

An orthogonal research field that incorporates generalization, inspectability and revisability for concept learning in general is program synthesis, where knowledge is learned in the form of explicit \textit{programs}. 
Not only do programs allow extrapolation to novel, unseen inputs regardless of the number of objects, but their compositional nature is also useful for learning to reuse existing knowledge in new ways~\citep{ellis2021dreamcoder, stengel2024regal}, notably in symbolic list processing and text editing settings~\citep{balog2017deepcoder,  ellis2021dreamcoder}. 
In addition, even the longest programs are \textit{readable} for human users \citep{cambronero2023flashfill}, thereby offering an inherent form of interpretability. 
Lastly, program synthesis approaches offer easy human revision~\citep{trivedi2021learning} such as rewriting or updating the programs.
Despite all of this, program synthesis approaches have not been utilized to learn complex \textit{visual} concepts from raw images up to now, likely due to the difficulty of mapping images to symbolic representations.

This work introduces Pix2Code, a neuro-symbolic framework for generalizable, inspectable and revisable visual concept learning. Using both neural and program synthesis components, \ptc integrates the power of neural representations with the generalizability and readability of program representations. 
During inference, \ptc extracts symbolic object representations from raw image inputs and uses these to synthesize $\lambda$-calculus programs that serve as concept classifiers (\ie, ``Do novel images contain this concept?''), but also as inherent interpretations of these concepts. 
\ptc learns to abstract visual concepts by training both a generative program library and a program recognition model based on wake-sleep learning. 
In our evaluations, we investigate the advantages of \ptc in terms of generalization, \eg, for novel concept combinations, but also extend the evaluation setting of previous works to entity generalization, \ie, generalizing to scenes with an unseen number of objects. Lastly, we show that the retrieved concept representations of \ptc are inspectable and can easily be revised in case of confounded or suboptimal behavior.\footnote{Code and data available at~\href{https://github.com/ml-research/pix2code}{\url{github.com/ml-research/pix2code}}.}\\
Overall, we make the following contributions: 
\begin{description}[itemsep=1pt,parsep=1pt,topsep=0pt,partopsep=0pt]
    \item[(i)] We frame visual concept learning in the context of program synthesis in our \ptc framework.
    \item[(ii)] \ptc learns visual concept representations that are generalizable to unseen concepts. 
    \item[(iii)] We effectively revise the learned representations via human guidance to mitigate suboptimal behavior.
    \item[(iv)] We identify limitations with respect to concept generality in the existing concept learning benchmarks and show how this can be alleviated via \ptc.
\end{description}

Let us now provide a formal description of our \ptc framework, its inference, learning, and revision processes. Next, we move on to experimental evaluations and conclude after presenting related work.

\begin{figure*}[t]
    \centering
    \includegraphics[width=0.97\textwidth]{images/overview.pdf}
    \caption{\textbf{The \ptc architecture.} Objects with bounding box and attribute information are extracted from each positive and negative image example of a visual concept. These representations are converted into a binary classification formulation. The program synthesis component searches for programs to solve each task.
    This search is based on a probabilistic library that is learned and enhanced during training by frequently used program parts.
    The result of the search is the visual concept of the image in form of an executable program that can be translated into a corresponding natural language statement.}
    \label{fig:overview}
\end{figure*}

\section{Pix2Code}
\label{sec:method}

In our work, we consider learning \textit{visual concepts}, \ie, general ideas that are fundamental to the understanding of a visual scene (\cf \autoref{fig:why}).
The goal of the \ptc framework is to discover such concepts in a generalizable, interpretable, and revisable manner. This is achieved by combining differentiable token-based object representations with program synthesis such that concepts are represented as programs. 

Formally, we consider a set of images $X$. 
For an image $x \in X$, we write $c \subset x$ if a concept $c$ appears in $x$. 
Following the setup of \cite{vedantam21a}, we consider the goal  of identifying a specific concept, $c$, from a positive subset of $X$, $X^{+} := \{x_i^{+}\}_{i=1}^N$, and a negative set $X^{-} := \{x_i^{-}\}_{i=1}^M$. Thus, the goal is to obtain a model, $f_\Theta$ (parameterized by $\Theta$) which proposes a concept,
$f_{\Theta}(X^{+}, X^{-}) = c$, that separates $X^{+}$ from $X^{-}$, \ie, it must hold that 
$\forall x^+_i \in X^+, \ c \subset x^+_i$ \textit{and} $\forall x^-_i \in X^-, \ c \not\subset x^-_i$.  
An overview of how \ptc achieves this is shown in \autoref{fig:overview}. 
Let us provide a step-by-step description of this, beginning with \ptcs inference, followed by its training and revision procedures.

\textbf{Concept learning as a program synthesis task.}
To obtain visual concepts (\eg, \textit{all objects are spheres}) from an image, the first step of the \ptc framework is to cast the above problem of unsupervised concept learning into a suitable program synthesis setting.
For that, we recast the initial task (that consists of the tuple of a positive and negative image set, \cf \autoref{fig:why}), $\{X^+, X^- \}$, to a binary classification task: 
\begin{align}\label{eq:task}
    T &:= \{(x_i, y_i)\}_{i=1}^{N+M},     
\end{align}
where $x_i \in X^+$ with $y_i = 1$ for $i \in \{1, \dots, N\}$, and $x_i \in X^-$ with $y_i = 0$ for ${i \in \{N+1, \dots, N+M\}}$.
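For illustration, consider a hypothetical task with $N = 2$ positive and $M = 3$ negative images (these sizes are chosen only for this example). Under the definition above, the task then reads
\begin{align*}
    T = \{(x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0)\},
\end{align*}
where $x_1, x_2 \in X^+$ and $x_3, x_4, x_5 \in X^-$, so that $|T| = N + M = 5$.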

\noindent \textbf{Transforming images to symbolic object representations.}
Visual concepts are based on objects, their attributes and relations. 
A necessary first step for performing visual concept learning is therefore to identify relevant objects from an image and extract corresponding object representations. 
Moreover, in order to perform visual concept learning via program synthesis, we specifically require symbolic object representations. 
Given a pretrained object extraction model, $h_{\psi}$, \ptc extracts a set of discrete representations, ${O_i}$, from an image $x_i$ that contains $K_i$ objects:
\begin{align}
    O_i &:= h_{\psi}(x_i) = \{o_j\}_{j=1}^{K_i}.    
\end{align}

Each object representation, $o_j \in O_i$, corresponds to a sequence of tokens: $o_j := [x_\text{min}, y_\text{min}, x_\text{max}, y_\text{max}, a_1, ..., a_C ]$, that includes the object's bounding box coordinates, $[x_\text{min}, y_\text{min}, x_\text{max}, y_\text{max}] \in \mathbb{N}^4$, as well as relevant object properties $[a_1, ..., a_C] \in \mathbb{N}^{C}$. 
For notational simplicity, we assume that all objects in our images possess attributes from the same number of categories, $C \in \mathbb{N}$, \eg, size, shape, color, and material for the objects of \autoref{fig:why}. 
Moreover, each category contains a finite number of possible attribute instantiations, \ie, $\forall k \in \{1, \dots, C\}: a_k \in \{1, \dots, d_k \}$ with $d_k \in \mathbb{N}$, \eg, \textit{sphere}, \textit{cube} and \textit{cylinder} for the shape category or \textit{red}, \textit{blue}, \textit{green}, \textit{yellow}, etc. for the color category. 
In summary, the object extractor, $h_{\psi}$, extracts a set of symbolic object representations. Applying the two previous steps transforms the original input into the following task representation:
\begin{align}
    \bar{T} &:= h_{\psi}(X^+, X^-) = \{(O_i, y_i)\}_{i=1}^{N+M}.
\end{align}
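As an illustrative sketch, assume $C = 4$ attribute categories ordered as size, shape, color, and material, and a hypothetical integer encoding in which the shape value $1$ denotes \textit{sphere}. The concrete numbers below are invented for this example and do not reflect the actual token vocabulary. An image $x_i$ with $K_i = 2$ objects could then be represented as
\begin{align*}
    o_1 &= [12, 40, 58, 86, \; 2, 1, 1, 1], \\
    o_2 &= [70, 35, 96, 61, \; 1, 1, 3, 2],
\end{align*}
so that $O_i = \{o_1, o_2\}$ and the corresponding entry of $\bar{T}$ is $(O_i, y_i)$. In this example both objects carry the shape token $1$, so a concept such as \textit{all objects are spheres} can be checked directly on $O_i$ without further access to the pixels.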

\noindent \textbf{Synthesizing programs from object representations.}\\
Having obtained symbolic representations of the input task, we now move on to learning abstract programs via \ptcs program synthesis module, $g_{L, \phi}$. This consists of two components: a library of learned primitives, $L$, 
and a code model, $q_{\phi}$, which predicts the most likely primitives of $L$ given a task.
For describing the inference procedure, we consider that $L$ and $q_{\phi}$ result from an already trained framework.
Specifically, $L$ contains base primitives (\eg, \textit{forall}, \textit{eq?}) as well as learned program \textit{primitives} (\eg, \textit{same shape}), represented as reusable functional programs $L := [p_0, ...,p_B]$. Each primitive, $p_j$, possesses a specific arity $m_j \in \mathbb{N}$ (\ie, number of input variables). With $L$ as vocabulary, the code model $q_{\phi}$ proposes the most likely primitives given a task $\bar{T}$. 
Thus, during an enumerative search, $q_{\phi}$ is used to synthesize the most likely program $P$ which encodes the concept separating the examples of $\bar{T}$.

During the enumerative search, programs are constructed by sampling primitives from $q_\phi$. For an efficient search, beam search is used in order to extend only the most likely partial programs under $q_\phi$. 
At step $\tau$, we denote the concept (in form of an incomplete program) as $P_{\tau}$ and the primitive selected to extend it as $p^*_\tau$. 
Specifically, given task $\bar{T}$ and the current partial program $P_{\tau-1}$, the code model $q_{\phi}$ provides a distribution over the next primitives:  
\begin{align}
\forall p_j \in L: q_{\phi}(\bar{T}, P_{\tau-1}, p_j) = \rho_{\tau}(p_j) \in [0,1].
\end{align}
From this distribution, $p^*_{\tau}$ is sampled and added to the current program, \ie, $P_\tau = P_{\tau-1} \otimes p^*_{\tau}$, where $\otimes$ corresponds to placing $p^*_\tau$ within $P_{\tau-1}$. The search starts from the empty program, \ie, $P_0 = \{ \}$.
A partial program is complete when the variables of each $p^*_{\tau}$ are set, \ie, each variable of $p^*_{\tau}$ has been substituted with a primitive of arity zero or the variable itself is a primitive with set variables.
This final step is denoted as $\hat{\tau}$.
Finally, at the end of the search the most likely program $P_{\hat{\tau}}$ is obtained, such that 
\begin{align}
g_{L,\phi}(\bar{T}) = P_{\hat{\tau}} =: P.
\end{align}
The resulting program is a composition of primitives, \ie, $P = p^*_1 \otimes p^*_2 \otimes \dots \otimes p^*_{\hat{\tau}}$. 
In the example illustrated in \autoref{fig:overview}, the retrieved program is
$ P = \lambda$ (x) $\otimes$ \texttt{forall} $\otimes$ \texttt{same\_shape} $\otimes$ \texttt{sphere} $\otimes$ x, which checks whether all objects in the image have a \textit{spherical shape}. 
In summary, the overall inference procedure of \ptc is:
\begin{align}
    c := P = g_{L,\phi}(h_{\psi}(X^+, X^-)).
\end{align}
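To make the recursion $P_\tau = P_{\tau-1} \otimes p^*_\tau$ concrete, the example program of \autoref{fig:overview} can be unrolled step by step. Assuming, purely for illustration, that the primitives are selected in the order in which they appear in $P$ (the actual order is determined by the search), the partial programs are
\begin{align*}
    P_0 &= \{ \}, \\
    P_1 &= \lambda (x), \\
    P_2 &= \lambda (x) \otimes \texttt{forall}, \\
    P_3 &= \lambda (x) \otimes \texttt{forall} \otimes \texttt{same\_shape}, \\
    P_4 &= \lambda (x) \otimes \texttt{forall} \otimes \texttt{same\_shape} \otimes \texttt{sphere}, \\
    P_5 &= \lambda (x) \otimes \texttt{forall} \otimes \texttt{same\_shape} \otimes \texttt{sphere} \otimes x = P,
\end{align*}
with $\hat{\tau} = 5$ under this assumption. Only $P_5$ is complete, since the remaining variable is bound in the last step.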

Lastly, we note that the final concept, represented as a program $P$, serves two important purposes. 
$P$ can be used to (i) classify unseen images (\cf \autoref{app:ptcdetails} and \autoref{app:figclassification}) and at the same time to (ii) provide a transparent procedure of this classification, thus directly serving as an explanation. 
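As a minimal sketch of purpose (i), and assuming that a program is applied to the extracted object set $O = h_{\psi}(x)$ of a new image $x$ to yield a binary output $P(O) \in \{0, 1\}$, the example program of \autoref{fig:overview} corresponds to
\begin{align*}
    P(O) = 1 \iff \forall o \in O: \ \mathrm{shape}(o) = \textit{sphere},
\end{align*}
where $\mathrm{shape}(\cdot)$ is a hypothetical accessor for the shape token of an object, introduced here only for illustration. An image with three spheres yields $P(O) = 1$, whereas an image with two spheres and one cube yields $P(O) = 0$, irrespective of how many objects the image contains.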

\noindent \textbf{Learning programs from images.}
To train \ptc, we need to optimize each of its parameters $\Theta := \{\psi, L, \phi\}$. 
In our evaluations, we differentiate between optimizing the parameter set of the object extractor, $\psi$, and jointly optimizing the library $L$ and parameter set of the code model, $\phi$, which both represent parameters of the program synthesis model. 
The training of the object extractor is independent and can, in principle, be done in an unsupervised manner \citep{delfosse2021moc}. We here 
follow the procedure of \cite{chen2022pix2seq}. 
However, instead of detecting one class per object as in the original work, we detect $C$ classes per object (one for each attribute category). Specifically, given a training image $x$ with $K$ objects and $C$ attribute categories, the corresponding object sequences are $\hat{y} := \{ \hat{o}_j \}_{j=1}^{K}$, with $\hat{o}_j = (\hat{x}^j_\text{min}, \hat{y}^j_\text{min}, \hat{x}^j_\text{max}, \hat{y}^j_\text{max}, \hat{a}^j_1, ..., \hat{a}^j_C)$. 
The object extractor, $h_{\psi}$, is trained to maximize the likelihood objective (${\mathrm{argmax}_{\psi}}\, LL(\hat{y}, h_{\psi}(x))$) via gradient-based optimization. 
In this way, $h$ is optimized to identify multiple attributes per object. Further details are provided in \autoref{app:ptcdetails}.

On the other hand, the library and the code model of the program synthesis component are jointly optimized based on the probabilistic approach of \cite{ellis2023dreamcoder} and the wake-sleep algorithm of \cite{hinton1995wake}. 
Specifically, $L$ and $q_{\phi}$ bootstrap each other.
$L$ initially contains only base primitives, \ie, the domain-specific language (DSL), and is parametrized by $\mu$, which corresponds to the prior probability of each primitive (initialized uniformly).
For a training task $\Bar{T}_i \in \Bar{T}_{\text{train}}$, a set $\Pi^{i} = \{P^{i}_s\}_{s=1}^S$ of programs is sampled from $L_\mu$ (wake phase), where $S \in \mathbb{N}$ denotes the maximum number of programs considered per task. 
The retrieved programs are used to train $q_{\phi}$ (sleep phase), following:  

\begin{equation}
    \label{eq:lossQ}
    \mathcal{L} =\mathbb{E}_{\bar{T}_i \sim \Bar{T}_{\text{train}}}\left[\log q_\phi(\underset{P^{i}_s \in \Pi^{i}}{\arg \max\ } 
    p( P^{i}_s \mid \bar{T}_i, L_\mu))\right].
\end{equation}

Moreover, \ptc uses a dreaming phase in which new task-program pairs are created, \ie, 
new object-centric data and programs are generated to additionally train $q_{\phi}$ following \autoref{eq:lossQ}. 
$L_\mu$ is optimized using programs sampled from the updated $q_{\phi}$. 
For this, $S$ programs for each task are sampled via $q_{\phi}$: $P_{\text{train}} = \{ q_{\phi}(\bar{T}_i) \mid \bar{T}_i \in \bar{T}_{\text{train}} \}$. 
With this set of sampled programs, the probabilities of the library primitives are updated via maximum a posteriori estimation. 
Further, frequently used program parts within $P_{\text{train}}$ are identified and added to $L$ to improve its objective function (\cf \autoref{app:ptcdetails} and \cite{ellis2023dreamcoder} for further details).

\noindent \textbf{Revising latent concept representations.}
\ptc integrates interpretable and accessible components and latent representations that allow human users to identify and revise potentially suboptimal model behavior (\eg, overfitting, confounding~\citep{SchramowskiSTBH20} or other forms of shortcut learning~\citep{GeirhosJMZBBW20}). 

This work mainly differentiates between the following revision possibilities:
(i) removing possibly undesirable primitives from $L$, (ii) adding relevant, yet previously undiscovered primitives to $L$ and (iii) modifying existing primitives in $L$. 
This last form of revision can be further subdivided into (iii-a) modifying the explicit program representation of a primitive or (iii-b) finetuning $q_\phi$ to reweight the probabilities of specific primitives in $L$, \eg, via one of the loss-based approaches of eXplanatory Interactive Learning (XIL)~\citep{SchramowskiSTBH20,FriedrichSSK23}.

\section{Experimental Evaluation}
In our experimental evaluations, we show how \ptc uses programs to discover complex visual patterns from few examples in an interpretable and revisable manner. 
Overall, our evaluations aim to answer the following questions:

\begin{enumerate}[itemsep=2pt, partopsep=3pt]
    \item[\textbf{(Q1)}] Is \ptc able to learn abstract visual concepts? 
    \item[\textbf{(Q2)}] Can \ptc learn concepts that generalize to unseen combinations of concept components?
    \item[\textbf{(Q3)}] Can these concepts generalize to inputs with an unseen number of objects?
    \item[\textbf{(Q4)}] Are the concept representations interpretable?
    \item[\textbf{(Q5)}] Can \ptc be revised to correct for suboptimal behavior?
    \item[\textbf{(Q6)}] Can \ptc abstract concepts from real-world data?
\end{enumerate}

\subsection{Experimental Setup}
We here provide setup details to allow for reproducibility. 

\noindent \textbf{Data.}
\label{exp:data}
For evaluating \ptc, we create an extensive dataset from the \textbf{Kandinsky Patterns} framework~\citep{holzinger2019kandinsky} called \textbf{\kp}, that contains images of 2D objects (depicted in \autoref{fig:kp}), with the attributes \textit{color}, \textit{shape} and \textit{size}. 
The images embed patterns such as ``there are two pairs of objects with the same shape'', similar to \cite{shindo2023alpha} (for further details, \cf~\autoref{app:kp}). We further use the \textbf{CURI} dataset \citep{vedantam21a}, containing images of 3D objects (illustrated in \autoref{fig:why}), with the attributes \textit{color}, \textit{shape}, \textit{size} and \textit{material}. 
CURI is designed to test compositional generalization. 
It contains $8$ different concept splits, which are based on specific properties. 
For each split, concepts with these properties occur only in the test sets and not in the training sets (\cf \autoref{app:curi} for more details).
For both datasets, the images are grouped by abstract visual concepts, which are based on the objects' attributes and relations between them. 
For each concept there is a \textit{task} that contains at least one support and one query set, each holding positive and negative image examples of the concept.  
The objective is to recognize the underlying concept from the support set and, based on that, classify the examples from the query set correctly.
The datasets contain training tasks and held-out test tasks. To reduce the computational burden, we randomly select a subset of $100$ training concepts from each CURI split; however, we evaluate on the full, original test concepts of each split. 
For investigating entity generalization (Q3) and confounding behavior (Q5), we introduce extensions of CURI. These contain images created via the data generator framework of \citet{StammerSK21} (\cf \autoref{app:entity-generalization} and \autoref{app:confounding}). 
Finally, to evaluate real-world concepts we created a small set of abstract concepts based on the popular MS COCO dataset \citep{lin2014microsoft} (\cf \autoref{app:coco}).

\noindent \textbf{Models.}
In our evaluations, we compare the performance of our neuro-symbolic \ptc approach to the purely neural model of \cite{vedantam21a}, here referred to as \textit{CURI-B}, and provide further details in \autoref{app:curib}.
The model of \citet{vedantam21a} was introduced with $4$ different pooling alternatives. 
We report performances of the best performing alternative (\cf \autoref{tab:curi_results_curi_b} for a detailed comparison).
For \ptc, we base the pretrained object extraction on the approach of \cite{chen2022pix2seq} to transform the input images into sequences of natural numbers (representing the objects and their attributes (\cf \autoref{tab:pix2seq})) and the program synthesis component on the approach of \cite{ellis2023dreamcoder}. 
We utilize a domain-specific language (DSL) that operates on the specific object representations and contains base program primitives, \eg, functions like \textit{forall} and logical operators like \textit{and} and \textit{or} (\cf \autoref{tab:dsl} and \autoref{app:ptcdetails} for details).
The program synthesis component is pretrained on the ground truth object representations (denoted as \textit{schema} input).
However, unless noted otherwise, it receives the neurally extracted object representations during evaluations. 
Notably, whereas CURI-B must be optimized on both the support and query examples of a task, \ptc is only optimized based on the support examples. 

\noindent \textbf{Metrics.} 
We evaluate both models' accuracies on the query sets of the test tasks, each averaged over $3$ seeded reruns. 
Since \kp and CURI contain more negative than positive examples, we report class-balanced accuracies (\ie, the mean of the accuracies on the positive and the negative set) over all test tasks.
Since \ptc uses an enumerative search to retrieve programs that solve test tasks, it may occur that no program is found that solves the task within the preset search time. 
In this case, we assume a random accuracy for the corresponding test task (\ie, $50\%$), to appropriately compare to the neural baseline model (which always produces an output). 
We thus distinguish between the class-balanced accuracy with random guessing for tasks without a found program, denoted as ``Acc@all'', and the mean accuracy over only the tasks with a found program, denoted as ``Acc@solved''.
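Written out, and assuming the standard definition of class-balanced accuracy together with an equal weighting of all test tasks within a single run, the two metrics can be sketched as
\begin{align*}
    \mathrm{Acc}_{\text{bal}} &= \tfrac{1}{2} \left( \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} + \frac{\mathit{TN}}{\mathit{TN} + \mathit{FP}} \right), \\
    \text{Acc@all} &= s \cdot \text{Acc@solved} + (1 - s) \cdot 50\%,
\end{align*}
where $\mathit{TP}$, $\mathit{FN}$, $\mathit{TN}$ and $\mathit{FP}$ count true positives, false negatives, true negatives and false positives on a query set, and $s \in [0, 1]$ is the fraction of test tasks for which a program is found (a symbol introduced here only for illustration). For instance, with hypothetical values $s = 0.8$ and $\text{Acc@solved} = 90\%$, one obtains $\text{Acc@all} = 0.8 \cdot 90\% + 0.2 \cdot 50\% = 82\%$. Since reported numbers are averaged over seeds, this relation holds per run and only approximately for the averaged values.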

\begin{table}[t!]
\centering
\caption{Mean test accuracy on Kandinsky and CURI concepts with iid train/test splits.}
\label{tab:iid_results}
\resizebox{1.\columnwidth}{!}{
\setlength\tabcolsep{5 pt}
    \begin{tabular}{@{}lcc|c@{}}
    \toprule
    \multirow{2}{*}{Dataset} & \multirow{2}{*}{CURI-B} & \multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\Acc@all\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\Acc@solved\end{tabular}}} \\ 
    & & \\ \midrule
    \kp &  59.69 \mbox{\scriptsize$\pm 0.83 $} &  \textbf{90.05} \mbox{\scriptsize$\pm 0.80 $} &  92.93 \mbox{\scriptsize$\pm 0.98 $}\\  
    CURI (iid) & 
    66.68 \mbox{\scriptsize$\pm 1.50 $} & \textbf{71.54 \mbox{\scriptsize$\pm 1.15 $}} & 81.75 \mbox{\scriptsize$\pm 3.12 $} \\ 
    \bottomrule
    \end{tabular}
}
\end{table}


\begin{table}[t]
\centering
\caption{Mean accuracy (with std) for meta-test tasks of CURI splits reported individually and as the median (with median absolute deviation) over all splits.}
\label{tab:curi_results}
\resizebox{1.\columnwidth}{!}{
\setlength\tabcolsep{5 pt}
    \begin{tabular}{lcc|c}
    \toprule
    \multicolumn{1}{l}{\multirow{2}{*}{\begin{tabular}[l]{l}CURI\\(Splits)\end{tabular}}} &
     \multirow{2}{*}{CURI-B} & \multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\Acc@all\end{tabular}}} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c}\ptc\\Acc@solved\end{tabular}}} \\ 
     & & \\ \midrule
     Boolean        & 67.86 \mbox{\scriptsize$\pm 1.21 $} & 
     \textbf{78.93 \mbox{\scriptsize$\pm 1.14 $}}  & 
     91.05 \mbox{\scriptsize$\pm 2.33 $}\\ 
     Counting       & 
     \textbf{62.19 \mbox{\scriptsize$\pm 2.44 $}} & 
     55.52 \mbox{\scriptsize$\pm 2.14 $}   & 
     65.73 \mbox{\scriptsize$\pm 2.19 $} \\ 
     Extrinsic      & 
     72.56 \mbox{\scriptsize$\pm 0.40 $} & 
     \textbf{78.31 \mbox{\scriptsize$\pm 1.60 $}}  & 
     88.23 \mbox{\scriptsize$\pm 1.70 $} \\
     Intrinsic      & 
     67.85 \mbox{\scriptsize$\pm 2.50 $} & 
     \textbf{87.35 \mbox{\scriptsize$\pm 3.21  $}}  & 
     92.09 \mbox{\scriptsize$\pm 0.40$} \\
     Bind.(color) & 
     69.89 \mbox{\scriptsize$\pm 1.54 $} & 
     \textbf{79.03 \mbox{\scriptsize$\pm 2.26 $}}  & 
     87.14 \mbox{\scriptsize$\pm 2.27 $} \\ 
     Composition  & 
     67.63 \mbox{\scriptsize$\pm 0.53 $} & 
     \textbf{74.82 \mbox{\scriptsize$\pm 0.10 $}}  & 
     86.51 \mbox{\scriptsize$\pm 0.98 $} \\ 
     Bind.(shape) & 
     66.35 \mbox{\scriptsize$\pm 0.36 $} & 
     \textbf{74.37 \mbox{\scriptsize$\pm 2.40 $}}  & 
     87.14 \mbox{\scriptsize$\pm 2.27 $} \\ 
     Complexity     & 
     65.24 \mbox{\scriptsize$\pm 0.14 $} &  
     \textbf{72.37} \mbox{\scriptsize$\pm 0.51 $}  & 
     77.43 \mbox{\scriptsize$\pm 0.35 $} \\ 
     \hline 
     \rule{0pt}{2.4ex}
     Median         & 67.74\mbox{\scriptsize$\pm 0.87$} & \textbf{76.57}\mbox{\scriptsize$\pm 1.87$}  & 
     87.14\mbox{\scriptsize$\pm 1.95$} \\
     \bottomrule
    \end{tabular}
    }
\end{table}

\subsection{Evaluations}

Let us now verify if \ptc can learn abstract visual interpretable concept representations, useful for solving logically challenging tasks from both synthetic and real world data, and can easily generalize to novel situations or be revised. 

\noindent \textbf{Learning visual concepts (Q1).}
We first investigate whether \ptc can learn visual concepts that separate positive from negative images.
Specifically, we evaluate \ptc and the (neural) CURI-B algorithm on the \kp and CURI datasets. 
For these evaluations, we assume independent and identically distributed (iid) training and test task sets. 
This stands in contrast to more structured, curriculum-like task splits of later evaluations. 
Focusing first on the two left columns of \autoref{tab:iid_results}, we observe that \ptc substantially outperforms the neural baseline on both datasets, even when assuming random performance for test tasks for which no program is found. 
The accuracy over only the solved test tasks ($80.93\%$ of tasks are solved across both datasets, \cf \autoref{tab:number_solved_curi_image}), reported in the right-most column of \autoref{tab:iid_results}, is higher still on both datasets. Thus, when \ptc discovers relevant concepts, these provide considerably improved generalization to unseen image samples of the same task, which motivates further investigation of how the learned visual concepts generalize. We refer to \autoref{fig:abstracted_primitives} for a visualization of \ptcs learned concept library.
Overall, our results indicate that visual concept learning via our program synthesis based \ptc represents a competitive alternative to purely neural based approaches (\cf \autoref{tab:iid_results_schema} for ablations). 

In the following, we focus on the generalization performances of the evaluated models. 
We distinguish between two forms of generalization: the ability of a model to reuse previously acquired concepts for composing novel concepts (denoted as \textit{compositional generalization}) and the ability of an extracted  representation to generalize to an unseen number of objects (denoted as \textit{entity generalization}). While the first focuses more on the generalizability of the learning components, the second focuses on the generalizability of the concept representations themselves. Let us begin by investigating compositional generalization.

\textbf{Generalizing to novel combinations of known concepts (Q2).}
We focus these evaluations on the $8$ original \textit{compositional} concept splits of the CURI dataset (in contrast to the previous iid splits), which were specifically designed for investigating compositional generalization in concept learning. 
We provide both models' accuracies in \autoref{tab:curi_results}, of each concept split individually, and the median performance over all splits. 
We observe that in $7$ out of $8$ CURI splits,
\ptc greatly outperforms the CURI-B model in terms of generalizing to unseen combinations of concepts (also seen in the median accuracy over all splits).
Notably, the counting split appears to be more challenging for \ptc, but this can easily be revised, as we show in later evaluations. 
A possible reason for the overall performance advantage of our approach is that CURI-B must learn an individual concept representation for each novel composition, while the modular nature of \ptcs programs allows for easy combinations of existing knowledge to form novel representations.
In summary, we answer Q2 affirmatively and conclude that \ptc possesses better generalization abilities to unseen concept compositions than the neural baseline (\cf \autoref{tab:curi_results_schema} for ablations). 
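As a worked check of the median row in \autoref{tab:curi_results}, sorting the eight per-split accuracies gives the two middle values $67.63$ and $67.85$ for CURI-B, and $74.82$ and $78.31$ for the Acc@all column:
\begin{align*}
    \mathrm{median}_{\text{CURI-B}} &= \tfrac{1}{2} (67.63 + 67.85) = 67.74, \\
    \mathrm{median}_{\text{Acc@all}} &= \tfrac{1}{2} (74.82 + 78.31) = 76.565 \approx 76.57.
\end{align*}
The resulting gap of roughly $8.8$ percentage points refers to the median over splits and therefore is not dominated by the single split (counting) on which the baseline performs better.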

\begin{table}[t]
\centering
\caption{Class-balanced accuracy on AllCubes-$N$ and AllMetalOneGray-$N$ for CURI-B and \ptc.}
\label{tab:generalizing}
\begin{tabular}{@{}llc@{}}
\toprule
Dataset    & CURI-B & \ptc \\ \midrule
AllCubes-CURI        & 77.19  \mbox{\scriptsize$\pm 6.56 $}   & \textbf{100.00 \mbox{\scriptsize$\pm 0.00 $}} \\
AllCubes-5  &  70.33 \mbox{\scriptsize$\pm 5.25 $}        & \textbf{100.00 \mbox{\scriptsize$\pm 0.00 $}}         \\
AllCubes-8  & 57.83 \mbox{\scriptsize$\pm 6.20 $}        &  \textbf{100.00 \mbox{\scriptsize$\pm 0.00 $}}       \\
AllCubes-10 & 56.00 \mbox{\scriptsize$\pm 5.61 $}        & \textbf{100.00 \mbox{\scriptsize$\pm 0.00 $}}        \\ \midrule
AllMetalOneGray-CURI & 60.00  \mbox{\scriptsize$\pm 3.33 $}       &  \textbf{96.94 \mbox{\scriptsize$\pm 4.32 $} }   \\
AllMetalOneGray-5  & 52.50 \mbox{\scriptsize$\pm 3.54 $}       &  \textbf{100.00 \mbox{\scriptsize$\pm 0.00 $} }       \\
AllMetalOneGray-8  & 52.17 \mbox{\scriptsize$\pm 3.06 $}        &  \textbf{89.00 \mbox{\scriptsize$\pm 15.56 $} }      \\
AllMetalOneGray-10 & 54.17 \mbox{\scriptsize$\pm 4.25 $}        &  \textbf{89.17 \mbox{\scriptsize$\pm 15.32 $} }      \\ \bottomrule
\end{tabular}
\end{table}

\begin{figure}[t!]
    \centering
    \includegraphics[width=\columnwidth]{images/all-cubes.pdf}
    \caption{Test examples of AllCubes-$5$ (left), AllCubes-$8$ (middle) and AllCubes-$10$ (right) sets. Positive images contain only cubes, while negative images contain only cubes except for one cylinder or sphere. }
    \label{fig:curin}
\end{figure}

\textbf{Generalizing to variable number of objects (Q3).}
Let us now move on to the second form of generalizability, \ie, entity generalization. 
While the setup of CURI is valuable for testing the compositional generalization ability of a model, it is insufficient for testing the entity generalizability of its learned concept representations. 
To investigate Q3, we first need to extend the initial CURI dataset accordingly. 
To this end, we select $2$ arbitrary concepts from CURI's original tasks, ``all objects are cubes'' and ``all objects are metal and one is gray'', and increase the number of objects in the corresponding test images\footnote{The original CURI images contain between $2$ and $5$ objects for both the training and test splits.} and investigate how well a model can classify these as (still) representing the original concept.
Specifically, we create $3$ data sets for each of these two concepts with respectively $5$, $8$ and $10$ objects in the test scenes. We refer to these data sets as AllCubes-$N$ and AllMetalOneGray-$N$, where $N \in \{5,8,10\}$, and to the original CURI images with ``-CURI''. We provide example images of AllCubes-$N$ in \autoref{fig:curin}. 
Each dataset contains $100$ positive image examples and $100$ negative ones. We refer to these CURI variations as \textbf{CURI-EG}.

As these investigations focus on concept generalization, we revert to using CURI-B and \ptc models that were trained with schema input to avoid image encoding noise. 
These models are trained on the CURI iid split (\cf \autoref{tab:iid_results_schema_10_obj}) and evaluated on the CURI-EG test sets. 
Our results, provided in~\autoref{tab:generalizing}, show that even with only a few more objects in the test images, the accuracy of CURI-B drops significantly. In contrast, the performance of \ptc stays solid at 100\% for the AllCubes-$N$ sets. 
For every AllMetalOneGray-$N$ set, \ptc significantly outperforms CURI-B, though in $1$ out of $3$ seeds \ptc did not find the ``perfect'' concept representation, but one that overfits to the support images. 
This illustrates that programs alone do not overcome suboptimal behaviors such as overfitting. 
In the next evaluations, we will focus on identifying and revising such behavior.

Overall, these evaluations indicate that the concept representations of \ptc can generalize well to inputs with unseen numbers of objects, answering Q3 affirmatively. 
The results particularly highlight the importance of entity generalization as a relevant aspect of concept learning validation that neither the evaluations of \autoref{tab:iid_results} nor \autoref{tab:curi_results} could reveal. Moreover, CURI-B's performance in \autoref{tab:generalizing} raises questions concerning the generalizability of its concepts. 

\begin{table*}[t!]
\caption{\textbf{\ptcs concepts are transparent programs and can be translated into natural language by LLMs.} Examples of original CURI concepts (left), corresponding \ptcs program representations (middle), and natural language translations from an LLM (right; here gpt-4-turbo). We have grouped related concepts and provide syntax highlighting for easier comparison. All programs achieve 100\% accuracy on the CURI test scenes.}
\label{tab:interpretable_programs}
\centering
\resizebox{0.98\textwidth}{!}{
\setlength\tabcolsep{12 pt}
\begin{tabular}{p{4cm}|p{6.5cm}|p{5.5cm}}
    \toprule
    Original Concept                  & Retrieved Program                                                                            & GPT4 Natural Language Translation       \\ 
    \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{cyan}}
    & 
    ($\lambda$ (x) (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{color cyan}} x))
    & 
    "All objects are cyan in color."   \\ 
    \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{cubes}} 
    & 
    ($\lambda$ (x) (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape cube}} x)) &
     "All objects are cubes." \\
     \midrule \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{purple}} \textcolor{violet}{\textbf{and}} \textcolor{brown}{\textbf{all}} objects are \textcolor{teal}{\textbf{spheres}}
    &    ($\lambda$ (x) (\textcolor{violet}{\textbf{and}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape sphere}} x) (($\lambda$ (v w) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (a) (eq? (index v a) w)))) \textcolor{teal}{\textbf{color purple}} x)))  & 
    "All objects are spheres and all objects are purple." \\
     \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{brown}} \textcolor{violet}{\textbf{or}} \textcolor{brown}{\textbf{all}} objects are \textcolor{teal}{\textbf{cubes}}
    & ($\lambda$ (x) (\textcolor{violet}{\textbf{or}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{shape cube}} x) (($\lambda$ (v w) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (a) (eq? (index v a) w)))) \textcolor{teal}{\textbf{color brown}} x))) 
    & "All objects are either cubes or all objects are brown." \\
     \midrule
    \textcolor{brown}{\textbf{All}} objects are \textcolor{teal}{\textbf{small}} \textcolor{violet}{\textbf{and}} \textcolor{brown}{\textbf{there exists}} a \textcolor{teal}{\textbf{purple}} object
    &
    ($\lambda$ (x) (\textcolor{violet}{\textbf{and}} (($\lambda$ (y z) (\textcolor{brown}{\textbf{forall}} ($\lambda$ (u) (eq? (index y u) z)))) \textcolor{teal}{\textbf{size small}} x) (\textcolor{brown}{\textbf{exists}} ($\lambda$ (v) (($\lambda$ (w a b) (eq? (index b w) a)) v \textcolor{teal}{\textbf{purple color}})) x)))
    &  "All objects are small in size, and there is at least one purple object." \\
    \midrule \midrule
    There are \textcolor{purple}{\textbf{three}} \textcolor{teal}{\textbf{gray}} objects 
    &
     ($\lambda$ (x) (\textcolor{purple}{\textbf{eq?}} (($\lambda$ (y) (\textcolor{purple}{\textbf{count}} (map ($\lambda$ (z) (($\lambda$ (u v) (index u v)) \textcolor{teal}{\textbf{color}} z)) y))) x \textcolor{teal}{\textbf{gray}}) \textcolor{purple}{\textbf{3}}))
    & "There are three objects that are gray." \\
    \midrule
    There exists an arbitrary object and \textcolor{purple}{\textbf{there exist three}} other objects that are \textcolor{teal}{\textbf{blue}} 
    &
    ($\lambda$ (x) (\textcolor{purple}{\textbf{gt?}} (($\lambda$ (y) (\textcolor{purple}{\textbf{count}} (map ($\lambda$ (z) (($\lambda$ (u v) (index u v)) \textcolor{teal}{\textbf{color}} z)) y))) x \textcolor{teal}{\textbf{blue}}) \textcolor{purple}{\textbf{2}}))
    & "There are more than two objects that are blue in color." \\
     \bottomrule
\end{tabular}
}
\end{table*}

\newpage
\noindent \textbf{Interpreting \ptcs concept representations (Q4).}
Although our previous results suggest that \ptcs programs are more generalizable, \ptc can provide suboptimal programs (\cf results on AllMetalOneGray) when overfitting or learning shortcuts. 
However, a considerable advantage of \ptc is the readable nature of its program representations, which allows human users to understand and thus detect such suboptimal behaviors. 
This stands in stark contrast to the opaque concept representations of purely neural approaches such as CURI-B. 

We exhibit \ptcs transparency in \autoref{tab:interpretable_programs}, where we present program solutions for a collection of test tasks from the CURI dataset. 
The leftmost column describes the target underlying concepts (with increasing complexity over the rows). 
The middle column presents \ptcs corresponding concept representations. 
Although these programs are written in $\lambda$-calculus (which may appear difficult to read for novices), they possess a straightforward reading procedure with well-defined variable and operation semantics (provided in~\autoref{tab:dsl}).
For example, the first program of \autoref{tab:interpretable_programs} reads as follows: the program takes an input list of objects x and applies the function $\lambda$ (y z), parameterized by \textit{color} and \textit{cyan}, to x. This function applies \textit{forall} with a predicate function $\lambda$ (u) on each object. The specific predicate function tests if the color attribute of the object representation equals cyan. Overall, the program returns \textit{true} if and only if all objects in x are of color cyan. 
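
As a compact restatement of this reading, and assuming that \textit{index} performs an attribute lookup on a single object representation (our reading of \autoref{tab:dsl}, used here only for illustration), the first program $p$ corresponds to the first-order statement
\[
    p(x) = \mathrm{true} \iff \forall u \in x:\ \mathrm{index}(\mathit{color}, u) = \mathit{cyan}.
\]
The inner abstraction $\lambda$ (y z) merely binds $y = \mathit{color}$ and $z = \mathit{cyan}$, so exchanging these two arguments for \textit{shape} and \textit{cube} yields the second program of \autoref{tab:interpretable_programs}.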

Furthermore, large language models can help to translate \ptcs $\lambda$-calculus programs into natural language statements. 
We exemplify this in the right-most column of \autoref{tab:interpretable_programs} based on gpt-4-turbo~\citep{gpt4} (\cf \autoref{app:interpret} for prompting details). 
In principle, other LLMs can be used as well (as we show in~\autoref{tab:interpretable_programs_app}). Such translations can help human users to further understand \ptcs proposed concepts.
Thus, although the $\lambda$-calculus structure of \ptcs program representations can present a challenge for novice $\lambda$-calculus users, the programs constitute a readable and executable knowledge representation and can be translated by LLMs. This provides an affirmative answer to Q4. 

\textbf{Mitigating Confounders (Q5).}
Once a user has identified suboptimal behavior in an AI model, it remains important that they can revise it~\citep{TesoK19, SchramowskiSTBH20}, \eg, to ensure trust between model and human.
In our last evaluations, we investigate the revisability of \ptcs representations and showcase the first two revision forms of~\autoref{sec:method}: (i) removing primitives from and (ii) adding primitives to \ptcs library.

\begin{figure}[t!]
\centering
    \includegraphics[width=0.98\columnwidth]{images/confounded_count.pdf}
\caption{Class-balanced accuracies after revising \ptc by removing suboptimal primitives from $L$ on the confounded CURI-Hans set (left) and by adding helpful primitives to $L$ in the counting split of CURI (right). \ptc+XIL indicates the revised models.}
\label{fig:revise}
\end{figure}

\ptc can be affected by shortcut learning. This is exemplified by the last concept of \autoref{tab:interpretable_programs}. For this task, there is no negative example that contains \textit{only} $3$ blue objects, and thus the retrieved program obtains perfect accuracy while diverging from the intended concept.
Further, confounding can occur when unknown spurious correlations, absent from the query images, appear in the support set images. In this case, \ptcs retrieved programs might classify the support images based on these features and thus fail to generalize to the unconfounded query data. 
We demonstrate this via a confounded task $T_{\text{conf}}$, using the concept ``all objects are metal and there exists one cube''. 
In the support set, all objects are cyan (confounding feature), irrespective of the actual underlying concept. In the query set, however, objects possess varying colors. 
We train \ptc on a set of $8$ original tasks from the CURI dataset and evaluate on such a confounded test task (\cf \autoref{sec:expdetails} for details). 
We refer to this data split as \textbf{CURI-Hans} and provide test query accuracy results of \ptc and CURI-B in \autoref{fig:revise} (left). 
Both approaches are strongly influenced by the confounder, as indicated by the low query set accuracy (in contrast to the CURI iid base accuracy of \autoref{tab:iid_results_schema}), though for \ptc this effect is slightly reduced. 
We can now, however, easily mitigate the behavior of the \ptc model by removing the library primitives for color and cyan, as well as abstracted functions that use these. 
The revised model (``\ptc+XIL'') reaches $100\%$ test accuracy. It thus appears to ignore the confounder and capture the true concept. 
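
For reference, and assuming the standard definition of class-balanced accuracy for binary tasks (the symbols $TP$, $FN$, $TN$ and $FP$ are introduced here only for illustration and denote true positives, false negatives, true negatives and false positives on the query set), the metric averages the per-class recalls:
\[
    \mathrm{Acc}_{\mathrm{bal}} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right).
\]
Under this definition, a model that assigns the same label to every query image scores $50\%$, which is the chance level against which the confounded accuracies in \autoref{fig:revise} (left) should be read.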

\textbf{Revising \ptc to Count (Q5).}
Another way to revise \ptcs representations is via the addition of relevant library elements. 
For example, \ptc could lack some relevant DSL primitives such that it can only find shortcut-based programs for some concepts. 
Upon inspection of \ptcs concept representations, a user may, however, identify the missing concepts and add these. 
To test this scenario, we revert to the \textit{Counting} split of CURI, for which \ptc had obtained low test accuracy (\cf \autoref{tab:curi_results} and \autoref{tab:curi_results_schema}). 
As the concepts from this test set all contain some form of counting operation, the low test accuracy indicates that \ptc was not able to properly capture the basic concept of ``counting''. We, therefore, formulate program primitives that count the number of occurrences of each existing attribute 
(\cf \autoref{sec:expdetails} for details) and
add these primitives to the library of the trained \ptc model of \textit{Counting}. 
The test accuracy of the revised model (\ptc+XIL) jumps to nearly $90\%$ in \autoref{fig:revise}~(right). 
In conclusion, \ptc allows for easy revision of its programs to overcome suboptimal behavior, answering Q5 affirmatively.
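
To illustrate the kind of counting primitive meant here, one possible form is sketched below. This is a hypothetical formulation rather than the exact primitives (which are described in \autoref{sec:expdetails}): the name $\mathrm{count}_{a}$ and the map $\mathrm{attr}$, which returns the attribute type of a value $a$ (\eg, \textit{color} for \textit{gray}), are assumptions made for this example. For a scene $x$, such a primitive returns the number of objects whose attribute value equals $a$:
\[
    \mathrm{count}_{a}(x) = \left|\left\{u \in x : \mathrm{index}(\mathrm{attr}(a), u) = a\right\}\right|.
\]
With a primitive of this form, a concept such as ``there are three gray objects'' reduces to the single comparison $\mathrm{count}_{\mathit{gray}}(x) = 3$.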

\begin{figure}[t!]
\centering
    \includegraphics[width=0.98\columnwidth]{images/coco_concept.pdf}
\caption{Examples of the concept ``Exists person and exists dog'' based on the COCO dataset.}
\label{fig:coco_concept}
\end{figure}

\textbf{Extending Pix2Code to Real-World Images (Q6).}
In our previous evaluations, we observed that \ptc performs impressively at detecting abstract concepts in synthetic data sets. In our last evaluation, we investigate \ptcs potential for discovering abstract concepts also in real-world image scenarios. 
Applying \ptc to this setting requires the object extractor to detect more visually complex objects and the program synthesis module to find patterns in more complex input representations (more details in \autoref{app:coco}).
We illustrate this setting based on concepts that are contained in the MS COCO dataset \citep{lin2014microsoft},
\eg, concepts like ``There exists a dog'' and ``There exists a dog or there exists a cat'' (\cf \autoref{app:coco}).
In \autoref{tab:coco_results} we provide the accuracy of \ptcs learned concepts.
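We observe that \ptcs learned programs reach high test set accuracies for most concepts, with the counting concept ``There are 3 persons'' being the exception (\cf \autoref{tab:coco_results}), suggesting that \ptc is indeed able to synthesize programs for these real-world concepts. 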
We further refer to \autoref{app:coco} for additional discussions, \eg, the influence of noisy object perception. 
Overall, these results suggest that \ptc can abstract concepts from real-world images. We thus answer Q6 affirmatively.

\begin{table}[t!]
\centering
\caption{Accuracy of programs synthesized by Pix2Code for concepts in the MS COCO dataset. For each concept, 25 example images were provided and the programs were evaluated on 100 test images.}
\label{tab:coco_results}
\vspace{-2mm}
\resizebox{1.\columnwidth}{!}{
\begin{tabular}{lc}
\toprule
COCO Concept & Pix2Code \\ \midrule
\texttt{Exists dog}                 &  0.8938        \\ \midrule
\texttt{Exists cat}                 &  0.9211        \\  \midrule
\texttt{Exists dog and exists person} & 0.9286         \\  \midrule
\texttt{Exists dog or cat}           &  0.9324        \\  \midrule
\texttt{There are 3 persons}         &  0.6813        \\ 
 \bottomrule
\end{tabular}
}
\end{table}

In summary, our evaluations provide evidence of the advantages of utilizing the power of 
program synthesis for visual concept learning via \ptc, in terms of generalizability, interpretability, and revisability.

\section{Related Work}
\ptc is closely related to several lines of research, among which program synthesis and concept extraction from images are the closest.

\textbf{Program Synthesis.} There has been a recent interest in the task of program synthesis within the realm of machine learning~\citep{chen2018execution,nye2020learning,odena2020bustle}. Program synthesis has been looked at from various points of view such as neuro-symbolic AI~\citep{parisotto2016neuro,bhatia2018neuro}, lifelong learning~\citep{valkov2018houdini} and interactive machine learning~\citep{zhang2020interactive,ferdowsifard2021loopy}. The various application domains for program synthesis include videos~\citep{sun2018neural,le2021ccvs}, images~\citep{laich2020guiding,ellis2021dreamcoder} and text~\citep{ellis2019write,desai2016program}. There are several methods that develop program synthesis libraries focused on visual reasoning, such as LILO~\citep{grand2023lilo}, ROAP~\citep{tang2023perception} and DreamCoder~\citep{ellis2023dreamcoder}. The biggest drawbacks of these approaches are their limited generalization and the lack of a process for revising the learned concepts, both of which Pix2Code addresses.

\textbf{Interpretable (Relational) Concepts Learned from Images.}
The Neuro-Symbolic Concept Learner of \citet{mao2019neuro} learns visual concepts from images without any explicit supervision, whereas \citet{StammerMSK22} learn single object-based visual concepts via weak supervision. Lime-Aleph~\citep{rabold2020enriching} combines the explainable AI method of Lime~\citep{ribeiro2016should} with the classical inductive logic programming system Aleph~\citep{srinivasan2001aleph}. The method learns explainable relational concepts on the blocksworld domain. 
Recently, \citet{shindo2023alpha} proposed $\alpha$ILP, a neuro-symbolic framework that can learn generalized rules from complex visual scenes. The advantage of $\alpha$ILP is that it uses differentiable inductive logic programming, where the logic programs are learned using gradient descent. This was further extended in NEUMANN~\citep{shindo2023learning}, where a graph-based differentiable forward reasoner is used for more efficient reasoning. 

\citet{Delfosse2023InterpretableAE} integrate $\alpha$ILP into reinforcement learning agents. 
The value of interpretable relational concepts was further evidenced in the context of reinforcement learning by \citet{Delfosse2024InterpretableCB}, though both works require prior relational functions. Such interpretable RL agents can also be translated into tree programs~\citep{kohler2024interpretable}. However, these RL applications do not learn to extract concepts, but assume their extraction (\eg, using OCAtari~\citep{Delfosse2023OCAtariOA}). Instead, one could integrate \ptc into concept bottleneck RL agents. 
Lastly, \citet{WebbM023, Kerg22, VaishnavS23} focus on purely neural object-centric approaches for learning relational concepts, which, however, do not allow users to inspect and revise the model's concepts.

\section{Conclusion}
In this work, we propose \ptc, a neuro-symbolic framework for generalizable, inspectable, and revisable visual concept learning. 
It captures and reuses concepts as program primitives to compose new concepts, thereby making \ptc generalizable to unseen tasks.
Our evaluations show that \ptcs generalization is especially effective when the number of objects in the visual scene increases, in stark contrast to the neural baseline.
Moreover, we show empirically that \ptcs learned concepts are interpretable and can be revised via human guidance.

$\lambda$-calculus programs are interpretable but not very natural to humans and can become quite nested for more complex programs. Handling this via a more general program synthesis framework is a natural next step. Integrating the natural language interpretations as part of the training procedure by labeling the learned library primitives with semantic descriptions is another important direction, as is applying \ptc to more natural images and relations. 
Additionally, in our evaluations, we have focused on the first two revision procedures of \ptc as these represent the more fundamental interactions that a user can perform. We suspect the third type of revision procedure should be a straightforward combination of the two investigated ones or, otherwise, an application of standard XIL approaches. However, future investigations should confirm this.
Finally, making the program synthesis component less dependent on the quality of the extracted object representations by allowing probabilistic inputs for the programs can make the \ptc framework more widely applicable. 