
\documentclass[accepted]{uai2022}

%% Choose your variant of English; be consistent
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
%\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%
% Use the postscript times font!
\usepackage{times}
\usepackage{soul}
\usepackage{url}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{cleveref}
\urlstyle{same}
%
% My commands
\usepackage{amsfonts}
\usepackage[inline]{enumitem}
\usepackage{xcolor}
\newenvironment{inlinelist}{\begin{enumerate*}[label=\emph{(\roman{*})}]}{\end{enumerate*}}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{todonotes}
\usepackage{color}
\newcommand{\aanote}[1]{{\todo[color=cyan,inline]{\textbf{Agiollo:} #1}}}
\newcommand{\aatodo}[1]{{\todo[size=\tiny,color=cyan]{\textbf{Agiollo:} #1}}}
\newcommand{\nn}{NN}
\newcommand{\nnlong}{Neural Networks}
\newcommand{\nnlowercase}{\lowercase\expandafter{\nnlong}}
\newcommand{\gnn}{GNN}
\newcommand{\gnnlong}{Graph Neural Networks}
\newcommand{\gnnlowercase}{\lowercase\expandafter{\gnnlong}}
\newcommand{\gnntognn}{GNN2GNN}
\newcommand{\gnntognnlong}{\gnnlong{} to Generate \nnlong}
\newcommand{\gnntognnlowercase}{\lowercase\expandafter{\gnntognnlong}}
\newcommand{\nas}{NAS}
\newcommand{\naslong}{Neural Architecture Search}
\newcommand{\naslol}{NAS101}
\newcommand{\nats}{NATS}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\gnntognn: \gnntognnlong}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Add authors
\author[1,2]{\href{mailto:<andrea.agiollo@unibo.it>?Subject=GNN2GNN UAI paper 301}{Andrea Agiollo}{}}
\author[1]{Andrea Omicini}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science and Engineering (DISI), \textsc{Alma Mater Studiorum}---University of Bologna, Italy
}
\affil[2]{%
  The Research Hub by Electrolux Professional S.p.A., Pordenone, Italy
}

%\author{
%Author 1$^{1,2}$
%\and
%Author 2$^1$
%\affiliations
%$^1$ University 1\\
%$^2$ University 2
%\emails
%\{auth1, auth2\}@university.uni
%}

\begin{document}

\maketitle

\begin{abstract}
  %
  The success of \nnlowercase{} (\nn{}s) is tightly linked with their architectural design---a complex problem by itself.
  %
  We here introduce a novel framework leveraging \emph{\gnntognnlong} (\gnntognn{}), where powerful \nn{} architectures can be learned from a set of available architecture-performance pairs.
  %
  \gnntognn{} relies on a three-way adversarial training of \gnn{}s to optimise a generator model capable of producing predictions about powerful \nn{} architectures.
  %
  Unlike \naslong{} (\nas) techniques proposing efficient searching algorithms over a set of \nn{} architectures, \gnntognn{} relies on learning \nn{} architectural design criteria.
  %
  \gnntognn{} learns to propose \nn{} architectures in a single step -- i.e., training of the generator --, overcoming the recursive approach characterising \nas{}.
  %
  Therefore, \gnntognn{} avoids the expensive and inflexible search of efficient structures typical of \nas{} approaches.
  %
  Extensive experiments over two state-of-the-art datasets prove the strength of our framework, showing that it can generate powerful architectures with high probability.
  %
  Moreover, \gnntognn{} outperforms possible counterparts for generating \nn{} architectures, and shows flexibility against dataset quality degradation.
  %
  Finally, \gnntognn{} paves the way towards generalisation between datasets.
  %
\end{abstract}

\section{Introduction}
%
Deep Learning (DL) techniques have recently seen an unstoppable rise in popularity: DL has changed the approach to most intelligence tasks, ranging from vision to text and audio processing.
%
Among those techniques, \nnlowercase{} (\nn{}s) have become the most popular solution.
%
\nn{} performance is tightly linked with the operations they leverage and how these are connected -- their architecture -- whose design has been shown to be as complex as it is relevant.

On the other hand, there is no trivial way to find out the best \nn{} design for a specific task.
%
\emph{\naslong} (\nas) techniques have recently emerged to tackle the \nn{} design issue~\citep{ElskenJmlr2019}: the underlying concept of \nas{} approaches is to efficiently search for the best \nn{} architecture over a set of structures defined as a search space.
%
Despite its success, \nas{} exhibits several drawbacks (see \Cref{sec:related}), as it relies on looking for the best architecture rather than learning architectural criteria for building the optimal \nn{}.
%
Moreover, \nas{} approaches are not flexible with respect to slight changes of the application scenario, require a huge amount of resources to run, and are limited by their search space specifications.
%
Also inspired by these limitations, in this work we present \gnntognn{}, a novel tool leveraging \gnnlowercase{} to generate \nn{} architectures.

\gnntognn{} is a meta-learning framework exploiting \gnnlong{} (\gnn{}s) to learn to generate efficient \nn{} structures.
%
\gnn{}s are particular models proposed to tackle \emph{graph}-processing tasks via convolution-equivalent operations over graphs~\citep{WuIeee2021}.
%
A \nn{} structure can be seen as a Directed Acyclic Graph (DAG) where nodes represent layers -- implementing common operations like convolution, pooling, etc. -- and edges represent how the output of one layer is fed to the following one.
%
In this context, we propose a three-way adversarial learning setup to allow \gnn{}s to learn the features of an efficient \nn{} structure and generate novel architectures.
%
In more detail, a generator \gnn{} is trained to produce plausible architectures, while a discriminator \gnn{} is optimised to distinguish between generated and real architectures.
%
Finally, a valuer \gnn{} aims at optimising the performance of the generated architectures.
%
During training, the generator loss is defined as a mixture of the discriminator and valuer feedback, therefore aiming at enabling the learning of realistic -- i.e., discriminator feedback -- and powerful -- i.e., valuer feedback -- architectures.

While being embeddable into a broader \nas{} approach, \gnntognn{} represents a powerful approach to propose \nn{} architectures by itself.
%
Indeed, differently from \nas{} techniques, which aim at efficiently searching \nn{} architectures over a set of available ones, \gnntognn{} aims at intrinsically learning architectural criteria from a set of available architecture-performance pairs.
%
While \nas{} recursively proposes, evaluates and optimises a set of \nn{} structures (see \Cref{fig:nas-vs-gnn2gnn} left), we here consider learning to propose architectures from a set of \nn{}s in a single step, i.e., the training of the generator \gnn{} (see \Cref{fig:nas-vs-gnn2gnn} right).
%
Once trained, \gnntognn{} is capable of proposing multiple efficient \nn{} architectures at once, rather than focusing solely on the local optimum obtained from the deployed search algorithm.
%
Therefore, \gnntognn{} significantly shifts the paradigm of the approach to the problem of \nn{} architecture design, from relying on \emph{searching} architectures to \emph{learning} design criteria.

%
\begin{figure*}
  \centering
  \includegraphics[width=\linewidth]{images/nas_vs_gnn2gnn}
  \caption{\textbf{Left}: \nas{} approaches rely on a recursive sampling, evaluation and optimisation procedure.
  A \nas{} policy is used to sample architectures from the search space.
  The sampled architectures are then trained and evaluated to optimise the \nas{} policy depending on their performance.
  Once a convergence criterion is met, \nas{} identifies the sub-optimal \nn{} architecture.
  \textbf{Right}: The \gnntognn{} approach relies on a single training procedure where \gnntognn{} learns to propose effective \nn{} architectures.
  The trained generator is able to produce multiple powerful \nn{} architectures, rather than identifying solely the local sub-optimal \nn{} architecture.}
  \label{fig:nas-vs-gnn2gnn}
\end{figure*}
%

To summarise, the contributions of our work are the following:
%
\begin{itemize}
  %
  \item We present \gnntognn{}, the first -- to the best of our knowledge -- framework leveraging \gnn{}s to generate powerful \nn{} architectures, without relying on inefficient search procedures.
  %
  \item We show the effectiveness of our framework over two state-of-the-art datasets, highlighting its flexibility and generalisation capability.
  %
\end{itemize}

\section{Preliminaries on \gnnlong}
%
As the proposed framework relies on graph manipulation via \gnn{}s, here we briefly introduce \gnnlong{}, presenting their fundamental concepts.
%
\gnnlong{} (\gnn{}s) have been proposed as an extension of traditional \nn{}s to enable processing of non-rigidly structured data such as graphs.
%
\gnn{}s are mathematical models operating upon directed graphs, whose vertices (respectively, arcs) are labelled with vectors (or matrices, or tensors) of real numbers -- $\mathbf{x}_{v} \in \mathbb{R}^{d}$ for vertex $v$, and $\mathbf{a}_{v, w} \in \mathbb{R}^{c}$ for the arc between vertex $v$ and $w$ --, each one carrying further numeric information about the corresponding vertex (resp., arc).
%
\gnn{}s rely on \emph{graph convolution}, which represents the generalisation of a 2-dimensional convolution over graph-structured data.
%
Graph convolution is defined over a single vertex $v$ and its neighbourhood $N(v)$, and relies on three successive phases:
%
\begin{description}[leftmargin=0pt,font=\normalfont\itshape]
\item[propagation] --- the information $\mathbf{x}_{v'}$ belonging to each vertex $v' \in N(v)$ is weighted by the information $\mathbf{a}_{v,v'}$ belonging to the arc between $v$ and $v'$, then propagated to vertex $v$;
\item[aggregation] --- the information propagated from each vertex $v' \in N(v)$ to $v$ is aggregated via an aggregation function;
\item[transformation] --- the aggregated information corresponding to vertex $v$ is transformed into a new embedding vector and assigned back to vertex $v$, as its new state $\mathbf{x}_{v}'$.
\end{description}
%
The single convolution operation is applied in parallel to each vertex of the graph, updating the whole graph representation.
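%
As a compact sketch of the three phases above, one graph convolution step can be written as follows, where the propagation function $\psi$, the aggregation operator $\bigoplus$ (e.g., sum or mean) and the transformation function $\phi$ are placeholder symbols introduced here for illustration only, since their concrete form depends on the chosen \gnn{} variant:
%
\begin{equation*}
  \mathbf{x}_{v}' = \phi\Bigl( \bigoplus_{v' \in N(v)} \psi\bigl( \mathbf{a}_{v,v'}, \mathbf{x}_{v'} \bigr) \Bigr)
\end{equation*}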

\gnn{}s have proven to be successful in many tasks involving graph-structured data.
%
The most common applications concern computational chemistry~\citep{FungNature2021}, social recommendations~\citep{FanWww2019}, computer vision~\citep{WangAcm2019}, and many others~\citep{HamiltonNips2017,YuIjcai2018}.
%
However, a comprehensive review of \gnn{}s and the underlying techniques is clearly out of the scope of this paper: therefore, we refer interested readers to~\cite{WuIeee2021,ZhouAi2020}.
%

\section{\gnntognn{}}\label{sec:gnn2gnn}

In this section we present our framework, namely \gnntognn{}.
%
\gnntognn{} leverages \underline{G}raph \underline{N}eural \underline{N}etworks to \underline{G}enerate \underline{N}eural \underline{N}etworks.
%
We first present briefly how \nn{} architectures can be mapped into graph structures (\Cref{ssec:nns-as-graphs}).
%
We then introduce the general pipeline for generating and processing \nn{} architectures (\Cref{ssec:gnn2gnn}), focusing specifically on its components.
%

\subsection{\nnlong{} as Graphs}\label{ssec:nns-as-graphs}
%
\nn{} architectures are uniquely defined by a set of layers $\mathcal{L}$, a set of operations $\mathcal{O}$ applied on layers, and a set $\mathcal{I}$ of interconnections between layers.
%
Each layer $l_{v} \in \mathcal{L}$, with $v \in \left[ 0, \lvert \mathcal{L} \rvert \right]$, identifies a specific component of the \nn{} architecture and is characterised by a specific operation $o_{v} \in \mathcal{O}$.
%
Interconnections between layers, on the other hand, define how layers are linked to each other.
%
An interconnection $i_{v,w} \in \mathcal{I}$ identifies that layers $l_{v}$ and $l_{w}$ are connected, and more specifically, it identifies that the output of the operation $o_{v}$ applied at layer $l_{v}$ is used as an input for the operation $o_{w}$ applied at layer $l_{w}$.
%
It is important to notice that, thanks to the feedforward nature of standard \nn{}s, there exists a total ordering among the layers in $\mathcal{L}$, and interconnections can only go from a layer to a later one.
%
Mathematically speaking, $i_{v,w} \in \mathcal{I} \implies w > v$.
%

Following the above notations, \nn{} architectures can be mapped easily into graph structures, specifically to Directed Acyclic Graphs (DAGs).
%
Layers in $\mathcal{L}$ are mapped into a set of vertices $\mathcal{V}$ characterised by a set of features $\mathcal{X}$ representing layers operations ($\mathcal{O}$), while interconnections ($\mathcal{I}$) are mapped into a set of directed edges $\mathcal{E}$.
%
Vertices -- i.e., layers --, along with their features -- i.e., operations --, are defined as vectors $\mathbf{x}_{v} \in \mathbb{R}^{d}$, where $v$ enumerates the graph vertices, and $d$ represents the cardinality of the operation set $\mathcal{O}$.
%
On the other hand, the set of edges -- i.e., interconnections --, is denoted by the adjacency matrix $\mathbf{E} \in \mathbb{R}^{\vert \mathcal{V} \vert \times \vert \mathcal{V} \vert} $, where  $\mathbf{e}_{v,w}=1 \iff \exists i_{v,w}$.
%
Therefore, a \nn{} architecture can be mapped into a DAG defined by $\mathbf{X} \in \mathbb{R}^{\vert \mathcal{V} \vert \times d}$, where $\mathit{row}_{v}(\mathbf{X})=\mathbf{x}_{v}$ -- i.e., a matrix of vertices characterised by the operations they apply -- and $\mathbf{E}$, i.e., the adjacency matrix defining how operations are connected.
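%
As an illustrative sketch of this mapping, consider a hypothetical five-layer cell (not taken from any dataset) with layers $l_{0}$ (input), $l_{1}$ and $l_{2}$ ($3\times 3$ convolution), $l_{3}$ (max-pooling) and $l_{4}$ (output), and interconnections $i_{0,1}$, $i_{0,3}$, $i_{1,2}$, $i_{2,4}$ and $i_{3,4}$.
%
Assuming, for the sake of the example, that operations are one-hot encoded in the order (input, $3\times 3$ convolution, max-pooling, output), so that $d=4$, the cell would be represented by:
%
\begin{equation*}
  \mathbf{X} =
  \begin{pmatrix}
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1
  \end{pmatrix},
  \qquad
  \mathbf{E} =
  \begin{pmatrix}
    0 & 1 & 0 & 1 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 \\
    0 & 0 & 0 & 0 & 1 \\
    0 & 0 & 0 & 0 & 0
  \end{pmatrix}
\end{equation*}
%
In this example $\mathbf{E}$ is strictly upper triangular, reflecting the fact that interconnections only point from earlier to later layers.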

\subsection{Adversarially Generate Architectures}\label{ssec:gnn2gnn}
%
The proposed framework relies on a generative adversarial approach (GAN)~\citep{GoodfellowNips2014}, applied over graph-structured data leveraging \gnnlong{}.
%
The proposed framework is presented in \Cref{fig:gnn2gnn} and relies on three components:
%
\begin{itemize}
  \item A \emph{generator} \gnn{} $G$ is in charge of proposing graph structures representing \nn{} architectures.
  %
  \item A \emph{discriminator} model $D$ is responsible for distinguishing between \nn{} structures proposed by $G$ and real architectures.
  %
  \item A \emph{valuer} network $V$ is responsible for predicting the architecture performance, therefore optimising the generation towards powerful structures.
  %
\end{itemize}
%
\gnntognn{} relies on such a triplet of \gnn{}s to allow $G$ to intrinsically learn optimal architectural criteria.
%
Indeed, the discriminator and valuer models are used during training to optimise the generator status.
%
In more detail, the generator loss is defined as a mixture of the discriminator and valuer feedback:
%
\begin{equation}\label{eq:gen-loss-generic}
  \mathcal{L}_{G} = \lambda \cdot \underbrace{\mathcal{F}_{D}}_\text{$D$ feedback} + (1 - \lambda) \cdot \underbrace{\mathcal{F}_{V}}_\text{$V$ feedback}
\end{equation}
%
Here, $\lambda$ represents the balancing factor between the two feedback terms.
%
Leveraging such mixture loss, $G$ is capable of proposing realistic -- i.e., $D$ feedback -- and powerful -- i.e., $V$ feedback -- architectures.
%
Finally, once trained, the \gnntognn{} framework exploits solely the generator component to propose significant \nn{} architectures.
%
\begin{figure*}
  \centering
  \includegraphics[width=\linewidth]{images/gnn2gnn}
  \caption{The \gnntognn{} framework. The generator \gnn{} produces \nn{} architectures, starting from randomly initialised fully connected DAGs.
  The discriminator \gnn{} aims at distinguishing artificial \nn{} architectures from the real ones.
  The valuer network aims at predicting architectures performance, distinguishing between strong and weak structures or regressing their accuracy.
  Different colors of graph nodes represent different operations---embeddings.
  \textcolor{red}{Red} nodes identify input/output nodes, while \textcolor{green}{green} and \textcolor{yellow}{yellow} nodes may represent $3\times 3$ convolution and max-pooling respectively.
  \textcolor{gray}{Gray} nodes represent randomly initialised node embeddings.}
  \label{fig:gnn2gnn}
\end{figure*}
%

\subsubsection{Generator}\label{sssec:gnn2gnn-gen}
%
Generating graph structures that satisfy specific properties is complex and represents an open research issue~\citep{YouIcml2018,LiCorr2018}.
%
The complexity of this task is threefold:
%
\begin{itemize}
  \item[\textbf{Q1}] \emph{Generate realistic structures.} For a generated structure to be realistic, the generative framework should learn which nodes should be linked and which should not.
  \item[\textbf{Q2}] \emph{Generate realistic nodes.} The generated graph should be characterised by nodes having realistic features.
  \item[\textbf{Q3}] \emph{Stopping criteria.} In the generating process, it is important to identify when the generated graph structure has reached its final structure, which is non-trivial.
\end{itemize}
%
To tackle the aforementioned problems and generate realistic \nn{} architectures, we here propose a novel generative \gnn{}.
%
Indeed, \gnn{}s are particularly suited for handling interconnections and node features, while they exhibit limitations on stopping criteria identification.
%
However, given the nature of available \nas{} techniques and datasets, this limitation of \gnn{}s is not an issue.
%
Indeed, available \nas{} techniques restrict their search space, limiting the number of layers -- vertices -- that compose the \nn{} architecture.
%
Therefore, publicly available \nas{} datasets build on top of this rationale, fixing the number of \nn{} layers.
%

Building on top of the same rationale, we here propose a generator model that receives as input a fully-connected DAG -- i.e., where every node is connected with every other node -- having $N$ nodes.
%
$N$ is a hyperparameter of the framework, and can be arbitrarily set depending on the features of the \nas{} dataset at hand, or on the complexity of the architecture to generate.
%
Fixing $N$ immediately satisfies property \textbf{Q3}, implicitly setting a stopping criterion for the graph generation process.
%
It is also important to notice that the value of $N$ only provides an upper limit on the number of layers composing the generated architectures.
%
Indeed, architectures having $n \leq N$ layers can be generated by the proposed approach, thanks to edge removal and node isolation.
%
Finally, node features of the input graph are randomly initialised, mirroring the usual GAN approach.

The proposed generator framework relies on four successive steps, presented in \Cref{fig:generator} along with an example of input graph and generated architecture, and explained in detail below.
%
\paragraph{Graph convolution.}
%
The generator applies $\mu$ layers of graph convolution to the randomly generated fully-connected graph received as input.
%
Graph convolution layers allow elaboration of the random information received, building the backbone of the generated \nn{} architecture.
%
Depending on the number $\mu$ of convolutional layers selected, we should expect more or less fine-grained embeddings as output of this step.
%
However, given the fully-connected nature of the input graph, a small value of $\mu$ is enough to obtain a meaningful graph embedding.
%
\paragraph{Edge scoring \& sampling.}
%
Once a proposal of fully-connected architecture is obtained from the graph convolution layers, the generator applies a learnable scoring function to each edge of the graph at hand.
%
This procedure allows different scores to be assigned to each edge of the architecture, depending on their relevance.
%
To score edges, we first build edge feature vectors through the concatenation of the features of adjacent vertices.
%
Mathematically speaking, the feature vector of the edge connecting vertex $v$ to vertex $w$ is $\mathbf{e}_{v,w}=\mathbf{x}_v \mathbin\Vert \mathbf{x}_w$, where $\mathbin\Vert$ denotes concatenation.
%
Once the edge feature is obtained, the edge relevance is scored using a standard densely-connected layer followed by normalisation in $\left[0, 1\right]$, obtaining $\mathbf{e}^{'}_{v,w}$, which represents the score given to the edge between $v$ and $w$.
%
To avoid non-differentiability issues that may arise from score thresholding, edges are then sampled depending on their scores using a Gumbel-softmax layer.
%
This procedure ensures the survival of relevant -- from the generator perspective -- edges only, aiming at satisfying \textbf{Q1}.
%
Edge scoring and sampling are here presented as a unique step, given their logical bond.
%
However, it is also possible to conceive these two as separate steps, as done in \Cref{fig:generator} to ease reader understanding of the framework.
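%
As a sketch of how scoring and sampling could be written down, assume that the densely-connected layer has a weight vector $\mathbf{w}$ and bias $b$, and that the normalisation in $\left[0, 1\right]$ is a sigmoid $\sigma$ (these symbols and choices are assumptions made for illustration, not details fixed by the framework).
%
The edge score would then read:
%
\begin{equation*}
  \mathbf{e}^{'}_{v,w} = \sigma\bigl( \mathbf{w}^{\top} ( \mathbf{x}_{v} \mathbin\Vert \mathbf{x}_{w} ) + b \bigr)
\end{equation*}
%
One common binary form of the Gumbel-softmax relaxation with temperature $\tau$ then yields a differentiable sample $\tilde{\mathbf{e}}_{v,w}$, with $g_{0}$ and $g_{1}$ being independent Gumbel noise terms:
%
\begin{equation*}
  \tilde{\mathbf{e}}_{v,w} = \frac{\exp\bigl( ( \log \mathbf{e}^{'}_{v,w} + g_{1} ) / \tau \bigr)}{\exp\bigl( ( \log \mathbf{e}^{'}_{v,w} + g_{1} ) / \tau \bigr) + \exp\bigl( ( \log ( 1 - \mathbf{e}^{'}_{v,w} ) + g_{0} ) / \tau \bigr)},
  \qquad
  g_{i} = -\log( -\log u_{i} ), \quad u_{i} \sim \mathcal{U}(0, 1)
\end{equation*}
%
As $\tau$ approaches $0$, the sample $\tilde{\mathbf{e}}_{v,w}$ approaches a hard keep-or-drop decision, while larger values of $\tau$ give smoother samples.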
%
\paragraph{Layers operations generation.}
%
The aim of this step is to assign one operation to each vertex -- i.e., layer -- of the graph corresponding to the \nn{} architecture.
%
To do so, the graph embedding obtained from the graph convolution step is combined with the sampled edge scores and used as input for a new layer of graph convolution.
%
A softmax operation is then applied to the output of the convolutional layer to produce the one-hot encoding of the operations corresponding to each node.
%
This specific step, aiming at identifying realistic nodes features -- i.e., operations --, is meant to satisfy \textbf{Q2}.
%
Here, the layer generation step focuses solely on the layer type -- i.e., operation to deploy --, ignoring layer dimensioning issues.
%
Indeed, we consider layer sizes to be automatically inferable from the overall \nn{} architecture, as stated in \cite{YingPmlr2019}.
%
\paragraph{Graph refinement.}
%
Finally, the generator removes unsampled edges from the graph as well as isolated nodes, obtaining the final \nn{} architecture.
%
Possible cycles and pending nodes are also removed during this step, thereby ensuring that the output is a DAG architecture.
%
The graph-refinement operation is left as the last operation to avoid possible non-differentiability issues which may arise from removal of nodes or edges.
%
However, this does not influence the generation of layer operations, since zero-scored edges do not propagate information in the previous convolution step.
%

\begin{figure*}
  \centering
  \includegraphics[width=\linewidth]{images/generator}
  \caption{The generator receives as input randomly initialised -- \textcolor{gray}{gray} nodes -- fully connected DAGs, and processes them via graph convolution (1.).
  %
  The new graph embedding, obtained from (1.) is used to score edges (2.a.).
  %
  Light gray (dark gray) edges represent links having small (high) score.
  %
  Edges are sampled (2.b.) and the scores are propagated to the next graph convolution step to obtain operations embedding (3.).
  %
  Different colors of graph nodes represent different operations---embeddings.
  %
  \textcolor{red}{Red} nodes identify input/output nodes, while \textcolor{green}{green} and \textcolor{yellow}{yellow} nodes may represent $3\times 3$ convolution and max-pooling respectively.
  %
  Finally, the graph is refined by removing unsampled edges and isolated nodes (4.).}
  \label{fig:generator}
\end{figure*}
%

\subsubsection{Discriminator}\label{sssec:gnn2gnn-dis}
%
The discriminator model aims at distinguishing between synthetically generated architectures and architectures available in the dataset at hand.
%
In the proposed framework we build the discriminator model stacking $\nu$ layers of graph convolution, followed by a single densely-connected classification layer.
%
Graph convolutional layers extract graph-structured features from the input graph, while the classification layer outputs a binary prediction.
%
The complexity of the discriminator model -- i.e., the number of graph convolutions $\nu$ -- depends on the complexity of the architectures under inspection.
%
Available \nas{} datasets consider fairly small architectures, as they deal with identical block structures; therefore, in our experiments we set $\nu=2$.
%

\subsubsection{Valuer}\label{sssec:gnn2gnn-val}
%
The valuer model aims at identifying the performance of the architectures given as input.
%
Structurally speaking, we build the valuer model similarly to the discriminator, stacking a few layers of graph convolution, followed by a single densely-connected layer.
%
The predictions of the valuer model over the structures generated by $G$ are also used for the generator optimisation, aiming to push $G$ toward the generation of more powerful \nn{} architectures.
%
Indeed, the generator model is trained by minimising a combination of the standard GAN loss and the reward loss obtained from the valuer \nn{}:
%
\begin{equation}\label{eq:gen-loss}
  \mathcal{L}_{G} = \lambda \cdot \underbrace{\log(1 - D(G(z)))}_\text{standard GAN loss} + (1 - \lambda) \cdot \underbrace{\mathcal{L}_{R}(V(G(z)))}_\text{reward loss}
\end{equation}
%
where $z$ represents the randomly initialised graph used as input for $G$, $\mathcal{L}_{R}$ represents the reward loss and $\lambda$ represents a balancing factor between the two loss terms.
%
The definition of $\mathcal{L}_{R}$ depends on the role of the final densely-connected layer of $V$, which can be used either to regress the performance of the graph at hand or to classify graphs as strong or not-strong architectures.
%
In the first approach, $\mathcal{L}_{R}$ is the mean-squared error between the predicted performance of the generated architecture and the performance of the best-performing architecture.
%
In the second, the reward loss is represented via cross-entropy loss between predicted classification and strong architecture labels.
%
Our experiments (see \Cref{ssec:exp-abl}) show that the second approach is more consistent.
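%
For a single generated graph $G(z)$, the two options can be sketched as follows, where $a^{\star}$ denotes the performance of the best-performing architecture and, in the classification case, $V(\cdot)$ is assumed to output the probability of the input being a strong architecture (both are notational assumptions introduced here for illustration):
%
\begin{align*}
  \text{regression:} \quad & \mathcal{L}_{R}(V(G(z))) = \bigl( V(G(z)) - a^{\star} \bigr)^{2}, \\
  \text{classification:} \quad & \mathcal{L}_{R}(V(G(z))) = -\log V(G(z)).
\end{align*}
%
In practice, both expressions would be averaged over the graphs in a training batch.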
%

\section{Experiments and Results}\label{sec:exp-results}
%
In this section we propose a set of experiments to show the effectiveness of \gnntognn{} for generating strong \nn{}s.
%
Our source code is available at \url{https://github.com/AndAgio/GNN-2-GNN}.
%
\subsection{Datasets}\label{ssec:datasets}
%
To test our framework performance we rely on the \naslol{}~\citep{YingPmlr2019} and \nats{}~\citep{DongTpami2021} benchmark datasets.
%
Both datasets contain a set of \nn{} architectures along with their recorded performance over a specific image classification task.
%
Here, \nn{} architectures are built from the repetition of identical cells, which are the target of our \gnntognn{} approach.
%
\naslol{} contains $423k$ \nn{} architectures trained multiple times over CIFAR-10~\citep{KrizhevskyLearning2009}.
%
On the other hand, \nats{} contains a set of $15k$ \nn{} topologies trained over three different datasets:
%
\begin{inlinelist}
  \item CIFAR-10;
  \item CIFAR-100;
  \item ImageNet-16-120.
\end{inlinelist}
%
However, \nats{} represents operations over graph edges, while \gnntognn{} and \naslol{} represent operations over graph nodes, as introduced in \Cref{ssec:nns-as-graphs}.
%
Therefore, we translate \nats{} architectures into the form of \Cref{ssec:nns-as-graphs} and remove possible duplicates, thus obtaining a refined version of \nats{} consisting of $7k$ unique architectures.
%

\naslol{} and \nats{} datasets rely on similar search spaces used for the construction of \nn{}s.
%
Indeed, both consider a small set of operations, containing:
%
\begin{inlinelist}
  \item $3\times 3$ convolution,
  \item $1\times 1$ convolution, and
  \item \emph{pooling}---\naslol{} considers max-pooling, while \nats{} examines average-pooling.
\end{inlinelist}
%
\naslol{} contains \nn{} cells with at most $7$ nodes and $9$ edges, while \nats{} examines cells with at most $8$ nodes, without imposing any restriction on the number of edges.
%

\subsection{Experimental Setup}\label{ssec:exp-setup}
%
To test \gnntognn{}'s ability to produce novel architectures and generate strong cells, we remove part of the architectures from the training dataset.
%
We eliminate some randomly picked cells from the dataset, as well as the best $10\%$ of architectures with respect to their classification accuracy.
%
Under these settings, the generator cannot extract information from the strongest models during training, rendering the generation task more complex.
%
Therefore, a generator capable of producing architectures among the best $10\%$ is to be considered a strong model.
%
The $10\%$ threshold was selected since in \naslol{} there exists quite a significant performance delta between the top-$10\%$ architectures and the rest.
%

Each \gnntognn{} instance is trained for $20$ epochs over the training set using standard Stochastic Gradient Descent and setting the learning rate to $0.001$ and the batch size to $32$.
%
Moreover, during the first half of the training procedure we set $\lambda=1$.
%
This is done to allow $V$ to properly learn to distinguish between strong and weak architectures, before leveraging it to optimise $G$ with backpropagation.
%
Indeed, backpropagating information from a partially trained $V$ to the generator $G$ may increase the noise of its training, slowing down or hindering its optimisation.
%
Therefore, in the first $10$ epochs the generator model is optimised only through the discriminator $D$.
%
After this setup period $\lambda$ is set back to its desired value, enabling the interaction between $G$ and $V$ as described by \Cref{eq:gen-loss}.
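%
The resulting schedule can be summarised as follows, where $t$ denotes the epoch index (assumed here to start from $1$) and $\lambda^{\star}$ is a symbol introduced only for this sketch to denote the desired value of the balancing factor:
%
\begin{equation*}
  \lambda(t) =
  \begin{cases}
    1 & \text{if } 1 \leq t \leq 10, \\
    \lambda^{\star} & \text{if } 10 < t \leq 20.
  \end{cases}
\end{equation*}
%
For $t \leq 10$, \Cref{eq:gen-loss} therefore reduces to $\mathcal{L}_{G} = \log(1 - D(G(z)))$, so that no gradient flows from $V$ to $G$.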
%

\subsection{Evaluation Metrics}
%
Throughout our experiments, we consider only models which always output valid \nn{} architectures, since they output DAGs thanks to a refinement step.
%
Therefore, the metrics that we define refer solely to the quality of the generated architectures.
%
Moreover, since our framework is not directly comparable with \nas{} approaches, we avoid considering common \nas{} metrics, e.g., convergence time.
%
%Validity is defined as the percentage of valid architectures generated.
%
\emph{Novelty} measures the percentage of generated architectures not belonging to the training set used.
%
The \emph{Top-$N$} metrics measure the percentage of generated architectures that belong to the best $N\%$ of architectures in terms of classification accuracy.
%
%Finally, Acc$_{n}$ measures the percentage of generated architectures that reach an accuracy greater than $n$ out of the models that belong to the dataset.
%%
\emph{Acc$_{n}$} measures the ratio between the number of generated architectures that reach an accuracy greater than $n$, and the number of generated models that belong to the dataset.
%
Finally, \emph{$|$Acc$|$} measures the average accuracy reached by generated architectures.
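%
Under the assumption that $\mathcal{S}$ denotes the collection of generated architectures, $\mathcal{A}$ the full dataset, $\mathcal{A}_{\mathit{train}} \subset \mathcal{A}$ the training set, $\mathcal{A}_{N} \subset \mathcal{A}$ the best $N\%$ of architectures, and $\mathit{acc}(s)$ the recorded accuracy of architecture $s$ (all symbols introduced here for illustration only), the first three metrics above can be sketched as:
%
\begin{align*}
  \text{Novelty} &= \frac{\lvert \{ s \in \mathcal{S} : s \notin \mathcal{A}_{\mathit{train}} \} \rvert}{\lvert \mathcal{S} \rvert}, \\
  \text{Top-}N &= \frac{\lvert \{ s \in \mathcal{S} : s \in \mathcal{A}_{N} \} \rvert}{\lvert \mathcal{S} \rvert}, \\
  \text{Acc}_{n} &= \frac{\lvert \{ s \in \mathcal{S} : s \in \mathcal{A},\ \mathit{acc}(s) > n \} \rvert}{\lvert \{ s \in \mathcal{S} : s \in \mathcal{A} \} \rvert}.
\end{align*}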
%

\subsection{Ablation Study}\label{ssec:exp-abl}
%
To identify the best hyperparameter setup for \gnntognn{}, we propose a thorough ablation study.
%
The ablation study is performed over the \naslol{} dataset, given its higher degree of expressiveness w.r.t.\ \nats{}.
%

\paragraph{Hyperparameters.}
%
We consider the influence of the parametric values that may alter the generation of \nn{} architectures.
%
We take into account the balancing factor $\lambda$ used during training, the temperature $\tau$ of the Gumbel-softmax layer used to perform edge sampling, and the number $\mu$ of graph convolution layers used by the generator.
%
\Cref{tab:ablation_generator} shows the results of the ablation study on such parameters.
%
The results show that the model is highly affected by the balancing factor $\lambda$, which injects performance-critical information into the generator.
%
Indeed, smaller values of $\lambda$ increase the performance of the proposed architectures, as the generator focuses more on the information received from $V$ through backpropagation.
%
Smaller $\lambda$ values also improve the ability of \gnntognn{} to generate more complex models.
%
%Architectures generated using $\lambda=0.1$ have on average $9.6 \cdot 10^{6}$ parameters, while their $\lambda=1$ counterparts only $4.0 \cdot 10^{6}$.
%%
Architectures generated using $\lambda=0.1$ have on average twice the number of parameters of their $\lambda=1$ counterparts.
%
This phenomenon is encouraging, as it suggests that \gnntognn{} is capable of mapping the whole space, thanks to $V$.
%
However, smaller values of $\lambda$ increase the risk of mode collapse, as highlighted by the slight drop in novelty observed in three of the four $(\mu,\tau)$ configurations when moving from $\lambda=0.1$ to $\lambda=0.01$.
%
Finally, $\mu$ and $\tau$ do not seem to heavily influence the \gnntognn{} performance.
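%
For reference, $\tau$ enters the sampling step through the standard Gumbel-softmax relaxation, sketched here in illustrative notation that is not taken from the rest of the paper.
%
Given $k$ class probabilities $\pi_{1},\dots,\pi_{k}$ and independent noise samples $g_{i}\sim\mathrm{Gumbel}(0,1)$, the relaxed sample $y$ has components
\begin{equation*}
  y_{i} = \frac{\exp\bigl((\log\pi_{i}+g_{i})/\tau\bigr)}{\sum_{j=1}^{k}\exp\bigl((\log\pi_{j}+g_{j})/\tau\bigr)},
\end{equation*}
which approaches a one-hot vector as $\tau\to 0$ and becomes increasingly uniform as $\tau$ grows.
%
How $G$ maps its internal representation to the probabilities $\pi_{i}$ of each candidate edge is left unspecified in this sketch.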
%
%\input{tables/ablation.tex}
\begin{table}[!h]%[tb]
    \caption{Ablation study over hyperparameters of $G$. Bold values highlight the best setup for each metric.}
    \resizebox{\columnwidth}{!}{\begin{tabular}{ c | c | c | c | c | c | c | c | c }
        \toprule
        \multicolumn{3}{c|}{Parameters} & \multirow{2}{*}{Novelty} & \multirow{2}{*}{Top-5} & \multirow{2}{*}{Top-10} & \multirow{2}{*}{Top-50} & \multirow{2}{*}{Acc$_{90}$} & \multirow{2}{*}{$|$Acc$|$}\\
        \cline{1-3}
        $\mu$ & $\tau$ & $\lambda$ &  &  &  &  && \\
        \midrule
        % content
        \multirow{8}{*}{$1$} & \multirow{4}{*}{$0.01$} & \multirow{1}{*}{$1$} & $50.13\%$ & $10.60\%$ &$13.70\%$ & $27.20\%$ & $45.58\%$ & $88.55\%$ \\
         &  & \multirow{1}{*}{$0.5$} & $71.23\%$ & $34.66\%$ & $40.20\%$ & $52.98\%$ & $75.18\%$ & $90.38\%$ \\
         &  & \multirow{1}{*}{$0.1$} & $82.32\%$ & $46.30\%$ & $50.50\%$ & $57.00\%$ & $80.50\%$ & $91.53\%$ \\
         &  & \multirow{1}{*}{$0.01$} & $81.63\%$ & $45.10\%$ & $49.40\%$ & $58.10\%$ & $80.14\%$ & $91.44\%$ \\
        \cline{2-9}
         & \multirow{4}{*}{$0.1$} & \multirow{1}{*}{$1$} & $51.79\%$ & $12.10\%$ & $15.40\%$ & $25.20\%$ & $40.23\%$ & $88.10\%$ \\
         &  & \multirow{1}{*}{$0.5$} & $67.81\%$ & $19.01\%$ & $22.62\%$ & $39.20\%$ & $59.23\%$ & $89.48\%$ \\
         &  & \multirow{1}{*}{$0.1$} & $82.52\%$ & $45.19\%$ & $50.53\%$ & $58.91\%$ & $80.60\%$ & $91.48\%$ \\
         &  & \multirow{1}{*}{$0.01$} & $82.47\%$ & $\mathbf{46.50\%}$ & $52.14\%$ & $57.30\%$ & $79.32\%$ & $91.35\%$ \\
        \cline{1-9}
        \multirow{8}{*}{$2$} & \multirow{4}{*}{$0.01$} & \multirow{1}{*}{$1$} & $51.66\%$ & $8.57\%$ & $11.40\%$ & $26.83\%$ & $41.63\%$ & $88.53\%$ \\
         &  & \multirow{1}{*}{$0.5$} & $73.84\%$ & $40.41\%$ & $45.28\%$ & $54.61\%$ & $75.01\%$ & $90.89\%$ \\
         &  & \multirow{1}{*}{$0.1$} & $82.06\%$ & $46.09\%$ & $51.03\%$ & $57.82\%$ & $81.55\%$ & $91.54\%$ \\
         &  & \multirow{1}{*}{$0.01$} & $82.23\%$ & $45.66\%$ & $50.20\%$ & $57.19\%$ & $79.69\%$ & $91.42\%$ \\
        \cline{2-9}
         & \multirow{4}{*}{$0.1$} & \multirow{1}{*}{$1$} & $53.57\%$ & $10.08\%$ & $12.63\%$ & $25.21\%$ & $42.15\%$ & $88.49\%$ \\
         &  & \multirow{1}{*}{$0.5$} & $68.76\%$ & $25.59\%$ & $30.84\%$ & $45.54\%$ & $69.94\%$ & $90.45\%$ \\
         &  & \multirow{1}{*}{$0.1$} & $\mathbf{82.60\%}$ & $45.91\%$ & $\mathbf{52.21\%}$ & $57.37\%$ & $\mathbf{81.79\%}$ & $\mathbf{92.04\%}$ \\
         &  & \multirow{1}{*}{$0.01$} & $81.51\%$ & $45.90\%$ & $51.10\%$ & $\mathbf{59.50\%}$ & $79.98\%$ & $91.27\%$ \\
         \bottomrule
    \end{tabular}}
    \label{tab:ablation_generator}
\end{table}

\paragraph{Valuer mode}
%
The mechanism used by the valuer network $V$ to identify strong and weak architectures may cause variation in the generation performance of \gnntognn{}.
%
We distinguish between a classification-based valuer C and a regression-based valuer R.
%
The former identifies strong architectures as the cells belonging to the best half of the dataset.
%
On the other hand, in the regression-based setup, $V$ aims at predicting precisely the classification accuracy of a cell from its structure.
%
We pick the three best models in \Cref{tab:ablation_generator}, retrain them using a regression-based $V$, and compare them against their classification-based counterparts.
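%
The two modes can be summarised by the training target assigned to each cell; the following is a sketch in illustrative notation, where $a(c)$ is the classification accuracy of cell $c$ and $\theta$ is a threshold splitting the dataset into two halves (assumed here to be the median accuracy, which is our reading of ``best half''):
\begin{equation*}
  t_{\mathrm{C}}(c) =
  \begin{cases}
    1 & \text{if } a(c)\geq\theta,\\
    0 & \text{otherwise,}
  \end{cases}
  \qquad
  t_{\mathrm{R}}(c) = a(c).
\end{equation*}
%
The loss used to fit each target is not part of this sketch.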
%

\Cref{tab:ablation_valuer} shows the results of the ablation study, highlighting the superiority of the classification-based setup.
%
Indeed, regressing the exact performance of an \nn{} from its architecture is hard, mostly because a few small architectural modifications may lead to significant performance changes.
%
Such variability is complex to handle in a regression setup and hinders the ability of $V$ to correctly predict the strength of a cell.
%
%\input{tables/valuer_mode.tex}
\begin{table}[!h]%[tb]
    \caption{Ablation study over the evaluation mode adopted by $V$.}
    \resizebox{\columnwidth}{!}{\begin{tabular}{ c | c | c | c | c | c | c | c | c | c }
        \toprule
        \multicolumn{3}{c|}{Parameters} & \multirow{2}{*}{$V_{mode}$} & \multirow{2}{*}{Novelty} & \multirow{2}{*}{Top-5} & \multirow{2}{*}{Top-10} & \multirow{2}{*}{Top-50} & \multirow{2}{*}{Acc$_{90}$} & \multirow{2}{*}{$|$Acc$|$} \\
        \cline{1-3}
        $\mu$ & $\tau$ & $\lambda$ &  &  &  &  &  &  & \\
        \midrule
        % content
        \multirow{2}{*}{$2$} & \multirow{2}{*}{$0.1$} & \multirow{2}{*}{$0.1$} & C & $\mathbf{82.60\%}$ & $\mathbf{45.91\%}$ & $\mathbf{52.21\%}$ & $\mathbf{57.37\%}$ & $\mathbf{81.79\%}$ & $\mathbf{92.04\%}$ \\
         &  &  & R & $72.10\%$ & $26.59\%$ & $32.92\%$ & $50.10\%$ & $67.11\%$ & $89.93\%$ \\
        \midrule
        \multirow{2}{*}{$1$} & \multirow{2}{*}{$0.1$} & \multirow{2}{*}{$0.01$} & C & $\mathbf{82.47\%}$ & $\mathbf{46.50\%}$ & $\mathbf{52.14\%}$ & $\mathbf{57.30\%}$ & $\mathbf{79.32\%}$ & $\mathbf{91.35\%}$ \\
         &  &  & R & $71.06\%$ & $25.90\%$ & $34.13\%$ & $51.07\%$ & $65.79\%$ & $90.07\%$ \\
        \midrule
        \multirow{2}{*}{$2$} & \multirow{2}{*}{$0.1$} & \multirow{2}{*}{$0.01$} & C & $\mathbf{81.51\%}$ & $\mathbf{45.90\%}$ & $\mathbf{51.10\%}$ & $\mathbf{59.50\%}$ & $\mathbf{79.98\%}$ & $\mathbf{91.27\%}$ \\
         &  &  & R & $70.33\%$ & $27.04\%$ & $33.54\%$ & $50.97\%$ & $66.43\%$ & $90.01\%$ \\
        \bottomrule
    \end{tabular}}
    \label{tab:ablation_valuer}
\end{table}
%

%
In the remainder of the experiments, we build the \gnntognn{} model employing a classification-based $V$ and the best hyperparameter values, i.e.\ $\mu=2$, $\tau=0.1$, $\lambda=0.1$, as highlighted in \Cref{tab:ablation_generator}.
%
Indeed, these values represent a good starting point for deploying \gnntognn{} over multiple scenarios, given the generality of \naslol{}.


\subsection{Performance Comparison}
%
To show the effectiveness of the proposed approach, we compare \gnntognn{} against other generative mechanisms.
%
We first consider generating random \nn{} architectures using the Erd\H{o}s--R\'{e}nyi model \citep{ErdosPmihas1960}.
%
We then evaluate the strength of our approach against two GAN-based frameworks, relying on different generation strategies:
%
\begin{description}
    %
    \item[MOLGAN-like]
    %
    The model generates nodes and edges independently and simultaneously, recalling the approach by~\cite{DecaoCorr2018}.
    %
    Two matrices representing node types and connections between them are generated from a random input vector and sampled using Gumbel-softmax.
    %
    \item[RNN]
    %
    The model generates architectures starting from a single input node and appending new vertices -- with corresponding edges -- until a stopping criterion is met.
    %
    This approach resembles the one by~\cite{ZhangNips2019} and leverages Recurrent \nn{}s to deal with graph construction via recursive node appending.
    %
\end{description}
%
To make the comparison fair, both the MOLGAN-like and the RNN models are built using the three-way \nn{}s adversarial approach that characterises \gnntognn{}.
%
Therefore, the three approaches differ solely in the generation criterion embodied by the generator model $G$.
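%
As a sketch of the random baseline, an Erd\H{o}s--R\'{e}nyi-style generator over $n$ ordered nodes draws each admissible edge independently with probability $p$.
%
Restricting edges to pairs $i<j$ is an assumption made here to guarantee acyclicity, and the symbols $n$, $p$, and $A$ are illustrative; under this assumption the adjacency matrix $A$ satisfies
\begin{equation*}
  A_{ij}\sim\mathrm{Bernoulli}(p)\ \text{for } i<j,
  \qquad
  A_{ij}=0\ \text{otherwise},
\end{equation*}
so that the expected number of edges is $p\,n(n-1)/2$.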
%

\Cref{tab:perf_comparison} shows the performance of the different models.
%
\gnntognn{} vastly outperforms its counterparts, as it produces more accurate predictions for strong \nn{} architectures.
%
Moreover, more than $80\%$ of the predictions performed by our model are \nn{}s characterised by an accuracy greater than $90\%$, while the best counterpart model -- i.e., MOLGAN -- fails to reach even $60\%$.
%
This highlights the generation consistency of \gnntognn{}.
%
Indeed, even the simple random generation approach can sporadically generate powerful architectures, as shown also in~\cite{XieIccv2019}.
%
However, it suffers in terms of consistency, as it is uncommon to obtain articulate architectures starting from a random empty graph.
%
%\input{tables/perf_comparison.tex}
\begin{table}[!h]%[tb]
    \caption{Performance comparison between \gnntognn{} and other GAN-based approaches to generate \nn{} architectures.}
    \centering
    \resizebox{\columnwidth}{!}{\begin{tabular}{l | c | c | c | c | c | c | c }
        \toprule
        Dataset & Model & Novelty & Top-5 & Top-10 & Top-50 & Acc$_{90}$ & $|$Acc$|$ \\
        \midrule
        % content
        \multirow{4}{*}{\naslol{}} & \gnntognn{} & $82.60\%$ & $\mathbf{45.91\%}$ & $\mathbf{52.21\%}$ & $\mathbf{57.37\%}$ & $\mathbf{81.79\%}$ & $\mathbf{92.04\%}$ \\
         & MOLGAN & $65.41\%$ & $22.63\%$ & $27.29\%$ & $45.20\%$ & $59.71\%$ & $89.39\%$ \\
         & RNN & $\mathbf{96.34\%}$ & $1.69\%$ & $2.32\%$ & $4.81\%$ & $53.04\%$ & $89.32\%$ \\
         & Random & $51.81\%$ & $11.17\%$ & $13.74\%$ & $28.43\%$ & $43.18\%$ & $88.54\%$ \\
%        \midrule
%        \multirow{4}{*}{\nats{}} & \gnntognn{} & - & - & - & - & - & \\
%         & MOLGAN & - & - & - & - & - & \\
%         & RNN & - & - & - & - & - & \\
%         & Random & - & - & - & - & - & \\
        \bottomrule
    \end{tabular}}
    \label{tab:perf_comparison}
\end{table}
%
%We also show the superiority of \gnntognn{}, both in terms of accuracy and model complexity, plotting the distribution of the generated models against the counterparts in \Cref{fig:models_distribution}.
%%
%\begin{figure}
%  \centering
%  \includegraphics[width=\linewidth]{images/distribution_different_models}
%  \caption{Distribution of architectures generated by \gnntognn{}, MOLGAN, RNN.}
%  \label{fig:models_distribution}
%\end{figure}

\subsection{Resistance to Dataset Quality Degradation}\label{ssec:exp-qual-degrad}
%
To study the flexibility of our approach against poorly-constructed datasets, we analyse the performance of \gnntognn{} when a high number of strong models are removed from the training dataset.
%
In more detail, we first remove the best $N\%$ of models from the \naslol{} training set, varying $N$ between $10$ and $90$, and then retrain \gnntognn{}.
%
\Cref{tab:perf_resistance} shows the results of these tests.
%
The performance loss between different setups is minimal, highlighting the robustness of \gnntognn{} against dataset quality degradation.
%
Indeed, even when almost all of the best models are removed from the training set, \gnntognn{} produces strong predictions, showing a loss of just over $3$ percentage points in the Top-$5$ metric and a decrease of $0.82$ percentage points in the average accuracy reached by generated models.
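%
Concretely, comparing the $N=10\%$ and $N=90\%$ rows of \Cref{tab:perf_resistance}, the differences in percentage points are
\begin{align*}
  \Delta\text{Top-}5 &= 45.91 - 42.77 = 3.14, \\
  \Delta\text{Acc}_{90} &= 81.79 - 78.24 = 3.55, \\
  \Delta|\text{Acc}| &= 92.04 - 91.22 = 0.82,
\end{align*}
while novelty rises by $85.71 - 82.60 = 3.11$ percentage points.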
%
%\input{tables/perf_resistance.tex}
\begin{table}[!h]%[tb]
    \caption{Performance comparison when the best $N\%$ of architectures are removed from the training dataset.}
    \centering
    \resizebox{\columnwidth}{!}{\begin{tabular}{l | c | c | c | c | c | c | c }
        \toprule
        Dataset & $N$ & Novelty & Top-5 & Top-10 & Top-50 & Acc$_{90}$ & $|$Acc$|$ \\
        \midrule
        \multirow{9}{*}{\naslol{}} & $10\%$ & $82.60\%$ & $45.91\%$ & $52.21\%$ & $57.37\%$ & $81.79\%$ & $92.04\%$ \\
         & $20\%$ & $83.01\%$ & $45.54\%$ & $52.32\%$ & $58.48\%$ & $82.05\%$ & $91.98\%$ \\
         & $30\%$ & $83.67\%$ & $46.10\%$ & $51.07\%$ & $57.04\%$ & $80.14\%$ & $91.68\%$ \\
         & $40\%$ & $84.89\%$ & $45.80\%$ & $51.03\%$ & $58.71\%$ & $81.03\%$ & $91.55\%$ \\
         & $50\%$ & $85.00\%$ & $44.01\%$ & $48.81\%$ & $56.30\%$ & $79.24\%$ & $91.33\%$ \\
         & $60\%$ & $84.58\%$ & $44.40\%$ & $49.12\%$ & $57.72\%$ & $80.30\%$ & $91.30\%$ \\
         & $70\%$ & $84.20\%$ & $44.66\%$ & $49.38\%$ & $56.71\%$ & $79.52\%$ & $91.37\%$ \\
         & $80\%$ & $84.33\%$ & $43.90\%$ & $48.27\%$ & $55.51\%$ & $78.16\%$ & $91.27\%$ \\
         & $90\%$ & $85.71\%$ & $42.77\%$ & $46.70\%$ & $55.72\%$ & $78.24\%$ & $91.22\%$ \\
        \bottomrule
    \end{tabular}}
    \label{tab:perf_resistance}
\end{table}
%

\subsection{Generalisation Between Datasets}\label{ssec:exp-general}
%
We now consider the generalisation ability of our framework.
%
We start by training \gnntognn{} over \nats{} and showing its performance.
%
As \Cref{tab:generalisation} shows, the performance obtained over \nats{} is poor, probably due to the small size of \nats{}, which contains only $7K$ \nn{} architectures.
%
We then apply the generator model trained over \naslol{} to \nats{}, analysing its prediction performance.
%
\Cref{tab:generalisation} shows the results of our generalisation study.
%
Although the results are still not satisfactory, the Top-$N$ and average accuracy metrics improve consistently when \gnntognn{} is transferred from \naslol{} to \nats{}.
%
This is encouraging, especially if we consider the strong difference between \naslol{} and \nats{}.
%
Indeed, only $576$ \nats{} architectures are also available in \naslol{}, and their performance varies on average by $16.183\%$ between the two datasets.
%
%\input{tables/generalisation.tex}
\begin{table}[!h]%[tb]
    \scriptsize
    \caption{Performance of \gnntognn{} when generalising between different datasets. Subscript refers to \nats{} split. $C10$ and $C100$ stand for CIFAR10 and CIFAR100, while $I$ stands for ImageNet.}
    \centering
    \resizebox{\columnwidth}{!}{\begin{tabular}{ c | c | c | c | c | c | c }
        \toprule
        \multicolumn{2}{c|}{Dataset} & \multirow{2}{*}{Novelty} & \multirow{2}{*}{Top-5} & \multirow{2}{*}{Top-10} & \multirow{2}{*}{Top-50} & \multirow{2}{*}{$|$Acc$|$} \\
        \cline{1-2}
        Train & Test &  &  &  &  & \\
        \midrule
        % content
        \nats$_{C10}$ & \nats$_{C10}$ & $\mathbf{76.73\%}$ & $1.65\%$ & $3.39\%$ & $15.61\%$ & $68.91\%$ \\
        \naslol{} & \nats$_{C10}$ & $73.67\%$ & $\mathbf{2.64\%}$ & $\mathbf{5.08\%}$ & $\mathbf{16.55\%}$ & $\mathbf{70.25\%}$ \\
        \midrule
        \nats$_{C100}$ & \nats$_{C100}$ & $71.71\%$ & $1.30\%$ & $3.28\%$ & $13.10\%$ & $33.03\%$ \\
        \naslol{} & \nats$_{C100}$ & $\mathbf{72.03\%}$ & $\mathbf{2.64\%}$ & $\mathbf{4.82\%}$ & $\mathbf{16.90\%}$ & $\mathbf{35.40\%}$ \\
        \midrule
        \nats$_{I}$ & \nats$_{I}$ & $81.03\%$ & $0.91\%$ & $2.17\%$ & $8.42\%$ & $16.75\%$ \\
        \naslol{} & \nats$_{I}$ & $\mathbf{82.30\%}$ & $\mathbf{1.93\%}$ & $\mathbf{3.64\%}$ & $\mathbf{11.49\%}$ & $\mathbf{18.84\%}$ \\
        \bottomrule
    \end{tabular}}
    \label{tab:generalisation}
\end{table}
\normalsize
%

\subsection{Preliminary Comparison Against \nas{}}\label{ssec:exp-nas-comparison}
%
\gnntognn{} does not represent a traditional \nas{} technique, as it does not rely on search space exploration and focuses solely on the architecture generation procedure.
%
However, we can compare \gnntognn{} against state-of-the-art \nas{} in terms of the performance obtained by the generated architectures over \naslol{}.
%
The results for the \nas{} approaches shown in \Cref{tab:nas_comparison} are taken from \cite{YuIclr2020}, while those for \gnntognn{} consider 1000 generated samples.
%
The average accuracy of the architectures generated by \gnntognn{} is comparable with that of other \nas{} approaches.
%
Meanwhile, results show that \gnntognn{} outperforms \nas{} techniques in terms of best accuracy.
%
Indeed, the best architecture generated by \gnntognn{} achieves $94.32\%$, while NAO tops out at $93.33\%$.
%
Moreover, the best architecture generated by \gnntognn{} achieves a lower rank value, meaning that it is closer to the optimal architecture.
%
Indeed, the best achievable accuracy in \naslol{} is $95.06\%$, which is only about $0.7$ percentage points above what \gnntognn{} achieves.
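%
For reference, the best-accuracy margins of \gnntognn$_{10}$ over the \nas{} baselines in \Cref{tab:nas_comparison}, in percentage points, are
\begin{align*}
  \text{DARTS:}\quad & 94.32 - 93.02 = 1.30, \\
  \text{NAO:}\quad & 94.32 - 93.33 = 0.99, \\
  \text{ENAS:}\quad & 94.32 - 92.54 = 1.78,
\end{align*}
whereas its average accuracy trails that of NAO by $92.59 - 92.04 = 0.55$ percentage points.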
%
%\input{tables/perf_comparison_nas.tex}
\begin{table}[!h]%[tb]
    \caption{\gnntognn{} performance against state-of-the-art \nas{} approaches over \naslol{}. Subscript refers to the percentage of samples removed from \naslol{}.}
    \centering
    \resizebox{0.8\columnwidth}{!}{\begin{tabular}{c | c | c | c}
        \toprule
        Model & $|$Acc$|$ & Best Acc & Best Rank \\
        \midrule
        DARTS~\cite{LiuIclr2019} & $92.21\%$ & $93.02\%$ & $57079$  \\
        NAO~\cite{LuoNips2018} & $\mathbf{92.59\%}$ & $93.33\%$ & $19552$  \\
        ENAS~\cite{PhamIcml2018} & $91.83\%$ & $92.54\%$ & $96939$ \\
        \gnntognn$_{10}$ & $92.04\%$ & $\mathbf{94.32\%}$ & $\mathbf{5372}$ \\
        \gnntognn$_{30}$ & $91.68\%$ & $94.18\%$ & $7843$ \\
        \gnntognn$_{60}$ & $91.30\%$ & $94.01\%$ & $9371$ \\
        \gnntognn$_{90}$ & $91.22\%$ & $93.69\%$ & $12570$ \\
        \bottomrule
    \end{tabular}}
    \label{tab:nas_comparison}
\end{table}

\section{Discussion}\label{sec:discussion}
%
\gnntognn{} relies on a dataset of architecture-performance pairs to remove part of the complexity burden of measuring architecture performance.
%
This might represent a possible drawback, as it requires the training of a set of hand-crafted \nn{}s.
%
However, the results of \Cref{ssec:exp-qual-degrad} show that \gnntognn{} learns to generate effective \nn{}s even when most -- i.e., 90\% -- of the architecture-performance pairs are not available.
%
Moreover, \gnntognn{} does not impose any requirement on the dataset quality, as it can generate powerful architectures even when trained on the worst part of the dataset, i.e., the worst 10\%.
%
Finally, \Cref{ssec:exp-general} hints at how \gnntognn{} can translate the generation process to a new setup, without requiring the extraction of a new dataset.
%

\section{Related Work}\label{sec:related}
%
The proposed framework is tightly related to state-of-the-art techniques for searching \nn{} structures, namely Neural Architecture Search.
%
\nas{} techniques have been recently proposed to tackle \nn{} design~\citep{ElskenJmlr2019,RenAcm2021}.
%
\nas{} techniques define a search space $\mathcal{S}$, containing \nn{} architectures.
%
A set of architectural rules identifies the list of operations available at \nn{} layers, as well as the admissible connections between \nn{} layers.
%
\nas{} approaches then aim at efficiently exploring $\mathcal{S}$ to identify the strongest \nn{} architecture.
%
While the search space is explored, architectures are sampled, the corresponding \nn{}s are built, trained over the task at hand, and evaluated, depending on a performance estimation strategy.
%
Proposed \nas{} approaches vary in the selected search space and in the exploration strategy they deploy~\citep{RealAaai2019,TanIcml2019,ChuEccv2020,AgiolloExtraamas2021}.
%

\nas{} approaches have proven to be successful in identifying powerful \nn{} architectures.
%
However, these approaches present drawbacks, which raise concerns about their applicability, namely:
%
\begin{description}[leftmargin=0pt,font=\normalfont\itshape]
  %
  \item[Learning to search vs.\ learning to build.]
  %
  Whereas \nas{} approaches aim at efficiently searching for a sub-optimal \nn{} among the ones available in $\mathcal{S}$ -- i.e., \emph{learning to search} -- they fail to learn any proper architectural criteria -- i.e., \emph{learning to build}.
  %
  In contrast, \gnntognn{}, relying on graph learning techniques, aims at implicitly learning \nn{} design criteria, representing a step forward towards \emph{learning to build} \nn{}s.
  %
  \item[Flexibility.]
  %
  \nas{} techniques currently lack flexibility and generality, as they cannot identify architectural criteria and focus on a specific task and dataset.
  %
  Instead, \gnntognn{} is limited only by the dataset at hand, being adaptable to the most diverse architectures, operations, and interconnections.
  %
  Moreover, aiming to learn architectural criteria, \gnntognn{} paves the way towards generalisation between datasets.
  %
  \item[Search space restrictions.]
  %
  Most, if not all, \nas{} techniques limit the number of available operations or the way in which they can be connected to ease the searching procedure over $\mathcal{S}$.
  %
  \gnntognn{} exploits \gnn{}s, which are applicable to any graph structure, avoiding restrictions on architectural rules.
  %
\end{description}

Our work is also related to the application of \gnn{}s to \nn{} architectures.
%
Indeed, some works have exploited, with some success, the graph processing nature of \gnn{}s to extract relevant information about \nn{}s from their architecture.
%
A common approach here consists of predicting the performance of an \nn{} from its design, aiming to avoid expensive training processes \citep{LukasikDagm2020}.
%
A few of them even integrate \gnn{}s into \nas{} algorithms~\citep{YanNips2020,WenEccv2020}.
%
Such frameworks, however, rely on the regressive power of \gnn{}s solely to rapidly evaluate the performance of the proposed architectures, removing the training process from the \nas{} loop.
%
Therefore, such approaches exploit \gnn{}s mostly as a speed-up component for the \nas{} procedure, failing to capture the proper expressive power of \gnn{}s, which is their capability to learn and sub-symbolically express \nn{} architectural criteria.
%
Other works aim at finding relevant embeddings for the \nn{} architectures at hand.
%
Such embeddings identify a continuous embedding space, which can then be used to optimise the \nn{} structure \citep{LuoNips2018} or propose quick search mechanisms \citep{LiAaai2020}.
%

\section{Conclusions and Future Work}
%
In this paper, we present a novel \gnn{}-based three-way adversarial framework for learning to generate strong \nn{} architectures.
%
The experiments completed over two state-of-the-art datasets highlight the strength of our approach.
%
We show the ability of \gnntognn{} to predict optimal \nn{} architectures and its superiority over other available generation approaches.
%
Moreover, given its flexibility against dataset quality degradation, the proposed framework represents a step forward towards learning architectural criteria for \nn{} design.
%
Indeed, the \gnntognn{} generator is capable of predicting unseen strong architectures even when dealing with an unsound dataset.
%
Finally, some experiments on knowledge transferability suggest the generalisability of our approach.
%

While aiming to overcome \nas{} limitations -- via removal of search algorithms -- \gnntognn{} can also be integrated into a \nas{} approach as a proposal technique.
%
Here, the adversarial framework characterising \gnntognn{} would require online training, similarly to other \nas{} approaches.
%
The implementation of a \gnntognn{}-based \nas{} approach, along with a thorough comparison with other \nas{} approaches, is left for future work.
%
Finally, we also intend to focus on boosting \gnntognn{} generalisation ability, by introducing dataset embedding techniques.
%

\begin{acknowledgements}
This paper has been partially supported by the CHIST-ERA IV project ``EXPECTATION'' (G.A. CHIST-ERA-19-XAI-005).
\end{acknowledgements}

\bibliography{agiollo_301}

\end{document}

