\section{Introduction}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Intro agents
% Not all agents are intelligent
% What are intelligent agents doing?

% The RL and DL fields have focused on systems that opt for a single well-defined goal, which might be opposed to what intelligent beings are doing (+ fallacy of the objective K. O. Stanley)

% The reward is enough hypothesis

% Open-ended field (UED, focus on the env., not on the agent)

% Our proposal, the "optimal explorer"

% Connections to the open-ended field

% Connections to the free energy principle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Intro agents
Agents---understood as systems acting by themselves according to certain goals or norms in an environment \citep{barandiaran2009defining}---are the substrates of intelligent life as we know it. 
% Not all agents are intelligent
% What are intelligent agents doing?
However, not all agents can be classified as intelligent. For instance, a thermostat fits most definitions of agency, yet it does not exhibit the kind of intelligence found in biological agents.
This raises the question: \textit{what are intelligent agents doing?} In other words, what are the objectives that intelligent agents pursue that give rise to the incredibly complex emergent behaviors that we broadly observe in natural life? 


% The field of AI and RL
Within Artificial Intelligence (AI), Reinforcement Learning (RL) is the research area that has most prominently focused on intelligent agents \citep{kaelbling1996reinforcement,sutton2018rlbook}. 
Although RL has led to many breakthroughs in the last decades \citep{silver2016alphago,abramson2024alphafold}, most RL literature has focused on developing agents to pursue a single, well-defined objective. 
% The reward is enough hypothesis
In fact, Sutton's \textit{reward hypothesis} states that all goals can be framed as cumulative reward maximization \citep{sutton2004rewardhypo,bowling2023settling}. This led \cite{silver2021reward} to hypothesize that optimizing reward can lead to the emergence of general intelligence and complex behavior in a sufficiently rich environment. 
% However, scalar rewards are inherently limited [D. Abel dogmas 2024 RLC] and it is not clear how agents could be able to learn to fulfill these tasks just by RL.  

% ==> Open-endedness
% Stanley book page 8
On the other hand, as discussed by \cite{stanley2015greatness} and \cite{soros2017open}, following an explicit objective can lead to dead ends. These works
challenge the effectiveness of explicit objectives, arguing that direct goal optimization often fails to discover necessary stepping stones. They emphasize open-ended exploration over direct optimization, suggesting that breakthroughs arise from serendipity and novelty search rather than predefined goals \citep{lehman2011abandoning,kumar2024asal}. The emergence of intelligent behavior via open-ended novelty search has inspired a growing number of works in recent years \citep{bauer2023human,bruce2024genie,matthews2025kinetix}, with some even characterizing open-endedness as essential for superhuman-level intelligence \citep{hughes2024position}. 
Many of these works focus on automating the generation of open-ended environments (tasks) \citep{bruce2024genie,faldor2025omniepic} and on learning a robust policy that generalizes to unseen tasks, an approach known as Unsupervised Environment Design (UED) \citep{parker2022evolving,bauer2023human,rigter2024rewardfree,beukman2024refining}. 
%% NOTE: Voyager assumes high-level control and access to high-level semantics via a high-level API for Minecraft. 
However, most of the open-endedness literature assumes either that the learning method can access and control the environment to generate vast numbers of tasks (e.g., UED) or that the agent acts through high-level control (e.g., Voyager by \cite{wang2024voyager}). 
% is limited to environments that allow procedurally generating thousands of diverse scenarios (\cite{bauer2023human}, $10^{40}$) requiring vast amounts of interactions with those.
% Although some works exist on open-ended lifelong learning agents in a single complex environment \citep{wang2024voyager}.

Instead, in natural life, agents interact with a (single) complex environment only through perception and (low-level) action.
Since \cite{helmholtz1867handbuch}, the most prominent theories of cognition have agreed that the brain maintains and updates a model of its environment (i.e., the real world) \citep{doya2002bayesian,friston2009brain}.
Based on these ideas and on recent work on lifelong learning and open-endedness theory \citep{abel2023definition,hughes2024position}, this work hypothesizes that (informally):
\begin{center}
    \begin{minipage}{0.9\textwidth}
    \textit{
    The agents that most efficiently learn an internal model of the environment are more likely to produce emergent intelligent behavior in reward-free scenarios over a bounded time scope.
    }
\end{minipage}
\end{center}
In this context, the agent's model of the environment---referred to as the \textit{world model}---is trained on agent-generated trajectories (i.e., sequences of interactions with the environment). Efficiency is measured as the expected sum across timesteps of the world model's prediction error with respect to the environment over all the possible trajectories.
We refer to this as the \textit{optimal explorer hypothesis}.  
Note that this hypothesis does not state that the agents that most efficiently learn their world model are the only or the most likely ones to induce emergent behaviors, just that they are more likely to cause them by doing so. 

In the search for emergent behavior, this hypothesis directly introduces the intrinsic objective of acting to generate the most informative trajectories for the world model in the long run.  
Based on this hypothesis and on the literature on active inference \citep{friston2009brain} and model-based RL \citep{chua2018deep}, the next part of this project will propose a practical implementation of an agent that optimizes this long-term intrinsic objective. We aim to introduce an agent that, equipped with an ensemble-based deep Neural Network (NN) world model \citep{lakshminarayanan2017deepensemble}, uses the Cross-Entropy Method (CEM) \citep{rubinstein1999cem} to plan and select the action sequences that maximize the world model's epistemic uncertainty in the long run. 
This way, the action selection policy and the world model (constantly updated with the trajectories sampled by the former) play a minimax game: the agent explores in the face of the unknown and otherwise exploits what it has already learned in order to explore. 
Finally, we will conduct an extensive empirical evaluation on challenging environments to analyze the behavior of the proposed agent. We expect the agent to solve complex games (in episodic setups) even without a reward signal, and to improve the sample efficiency of reward-based RL methods, both model-based (e.g., DreamerV3 \citep{hafner2023dreamerv3}) and model-free (e.g., proximal policy optimization \citep{schulman2017proximal}).

% In the next lines, we first introduce previous work on open-ended learning, active inference, and planning. Then, we develop the first contribution of this work, defining and formalizing the \textit{optimal explorer hypotheis}. Next, we propose a practical implementation of an open-ended learning agent based on this hypothesis. In the last section, we validate this agent scaling to challenging partially observable environments and draw connections to previous work on active inference and open-ended theory.

In summary, the main objectives of this project are the following:

\begin{enumerate}
    \item \textbf{Formalization of the optimal explorer hypothesis.} Define and analyze the hypothesis, establishing its theoretical foundations and connections to related research areas such as open-ended learning and active inference.
    \item \textbf{Combinatorial optimization formulation.} Frame the problem of optimal exploration as a Combinatorial Optimization (CO) task, identifying suitable problem representations and constraints.
    \item \textbf{Algorithm development.} Design and implement an approximate optimal explorer agent by leveraging techniques from model-based RL and combinatorial optimization, such as Estimation of Distribution Algorithms \citep{larranaga2002estimation} (employed in the CEM). 
    \item \textbf{Empirical evaluation.} Conduct experiments in diverse and challenging environments to assess the effectiveness of the proposed agent in inducing emergent behaviors in reward-free scenarios.
\end{enumerate}


\section{Previous work}

This section gives a brief overview of the fields and prior work on which this project mainly builds.

\paragraph{Lifelong and open-ended learning.} 
Lifelong and open-ended learning focus on agents that continuously acquire and refine knowledge over time, adapting to novel scenarios by leveraging past experiences. 
% CRL
However, learning continuously introduces many challenges, such as catastrophic forgetting and interference, loss of plasticity, and computational cost \citep{hadsell2020embracing}.
Addressing these issues is an active area of research \citep{khetarpal2022towards,wolczyk2024fine,malagon2024self}.
% UED
Other works depart from sequential tasks and focus on meta-learning a robust policy on a distribution of environments \citep{parker2022evolving,beukman2024refining}.
% Final
Although these deeply connected fields have gained increasing attention in recent years, they are still in the phase of formally defining themselves  \citep{abel2023definition,hughes2024position}.   

% Lifelong learning emphasizes retaining and transferring knowledge across tasks, while open-ended learning explores unbounded environments where agents autonomously generate goals and behaviors. Both are crucial for developing adaptive, intelligent systems capable of handling dynamic and unpredictable real-world scenarios.


\paragraph{Exploration strategies.}
%% Intrinsic motivation
%% Planning to explore (MPC)
Although many goals can be framed as a reward maximization problem \citep{sutton2004rewardhypo}, learning a policy can be extremely difficult in the absence of a dense, informative reward signal. Thus, the field of RL has produced a vast body of work on intrinsic reward: an auxiliary reward function that guides exploration toward promising trajectories \citep{pathak2017curiosity,burda2018exploration,nikulin2023anti}. Even with intrinsic motivation, RL agents suffer greatly from sample inefficiency. To address this, model-based methods learn (or directly employ, when available) a model of the environment, which is used to plan the actions \citep{kaiser2020model,hafner2023dreamerv3}. However, model-based RL introduces additional complexity, and agents can exploit biases in the model, which can substantially degrade performance \citep{janner2019trust}.

\paragraph{Active inference.} Active inference is based on Friston's Free Energy Principle (FEP) \citep{friston2009brain}. According to the FEP, living beings minimize expected free energy, maximizing the probability of being in desirable states (maintaining homoeostatic equilibrium) while maximizing information gain (minimizing epistemic uncertainty) in the long run \citep{friston2015epistemic}.
% Information-directed data acquisition has been used in Bayesian optimization and active inference fields for decades \cite{eric2007active,friston2009brain}. 
Despite the appeal of biologically plausible active inference agents \citep{friston2010free} and recent efforts to incorporate deep neural networks \citep{fountas2020deep}, scaling beyond toy environments remains a challenge for these methods \citep{sajid2021active}.
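
For reference, the two terms mentioned above can be made explicit. The following is a sketch of one common way of writing the expected free energy of a policy $\pi$ at a future time $\tau$, in the spirit of \cite{friston2015epistemic}. The notation is introduced here only for illustration: $q$ denotes the agent's approximate beliefs, $o_\tau$ and $s_\tau$ future observations and hidden states, and $p(o_\tau)$ the prior preferences over observations.
\begin{equation}
    G(\pi,\tau) = -\,\mathbb{E}_{q(o_\tau|\pi)}\left[D_{\mathrm{KL}}\left[q(s_\tau|o_\tau,\pi)\,\|\,q(s_\tau|\pi)\right]\right] - \mathbb{E}_{q(o_\tau|\pi)}\left[\ln p(o_\tau)\right].
\end{equation}
Under this reading, minimizing $G$ favors policies with a high expected information gain (first term) and a high expected log-probability of preferred observations (second term).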

\section{The optimal explorer hypothesis} \label{sec:hypothesis}

As described in the introduction, we focus on agents that maintain and update an internal model (i.e., a world model) of their environment.\footnote{The world model can be naturally defined as the distribution over the possible next states given the current state and action, $p_\phi(s_{t+1}|s_t,a_t)$.} Moreover, the environment consists only of a transition function and has no reward function (i.e., it is a reward-free environment). 
At every timestep, the agent interacts with the environment, generating a new transition, and the world model is updated accordingly.

In this setup, we hypothesize that the agents that generate the trajectories (sequences of interactions) that most efficiently update their world model are more likely to induce emergent intelligent behavior in a finite scope of time.
%
In this context, we define the efficiency of an agent as the expected sum, over timesteps, of the global world model error incurred when following the agent's policy.\footnote{We refer to an agent's policy in the classic RL sense, that is, the probability distribution over actions given the current state $p(a|s)$.} In turn, the global error is the world model's error in modeling the environment over all the possible trajectories. Thus, if the global error is zero, the world model and the environment define the same probability distribution. 
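
As a rough sketch of how this definition might be written down, consider the following notation, which is an assumption introduced for illustration and not a settled formalization. Let $p$ denote the environment's transition function, $\phi_t$ the world model parameters after $t$ interactions, $T$ the time budget, $D$ a divergence between distributions (e.g., the Kullback--Leibler divergence), and $\mu$ a hypothetical reference distribution with full support over state-action pairs. The global error $\mathcal{E}$ and the efficiency measure $J$ of a policy $\pi$ could then read
\begin{align}
    \mathcal{E}(\phi) &= \mathbb{E}_{(s,a)\sim\mu}\left[D\left(p(\cdot|s,a)\,\|\,p_\phi(\cdot|s,a)\right)\right], \\
    J(\pi) &= \mathbb{E}_{\pi}\left[\sum_{t=1}^{T}\mathcal{E}(\phi_t)\right],
\end{align}
where the outer expectation is taken over the trajectories generated by $\pi$ in the environment, which in turn determine the sequence $\phi_1,\ldots,\phi_T$. Under this reading, a more efficient agent is one with a lower $J(\pi)$, and $\mathcal{E}(\phi)=0$ implies that $p_\phi$ and $p$ coincide $\mu$-almost everywhere.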


Intuitively, the agents that most efficiently optimize their world models will be those that find (often by exploiting \textit{shortcuts} in complex environments) the best trajectories to explore their environments. Note that this substantially differs from random exploration (e.g., $\epsilon$-greedy exploration), as the most efficient agents will be those that exploit to explore. For instance, in an episodic environment such as an Atari game \citep{machado2018revisiting} (and most games), an efficiently exploring agent (in terms of our hypothesis) would have to solve the game as fast as possible in order to update its model with interactions from advanced stages of the game. 
%
% Finally, in many complex environments, it is well known that reward maximizing agents often find ways to \textit{hack} reward functions---find strategies to maximize reward in ways that were not intended to do so. 


\section{Proposing a practical implementation based on CO}\label{sec:implem}

In this part of the project, we aim to explore the implications of the hypothesis proposed in the previous section. Specifically, we leverage the ideas from the optimal explorer hypothesis to propose an agent that efficiently explores its environment in the absence of a reward function (i.e., without explicit objectives). 
Note that many implementations of such an agent are possible and that the one described here is only a proposal to analyze the experimental implications of the hypothesis. 

Specifically, we aim to leverage previous work on uncertainty quantification for deep NN models \citep{gal2016dropout,lakshminarayanan2017deepensemble} to select those trajectories that are more informative to the world model (a deep NN). 
Finding the action that will cause the most efficient world model update (in terms of the hypothesis) at each timestep can be framed as searching for the action sequence that will lead the agent toward the most informative interactions---of highest epistemic uncertainty---for the world model.  
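
One standard way to quantify this epistemic uncertainty with an ensemble is the disagreement among its members. It is given here only as an illustrative assumption about how the measure could be instantiated. With $K$ ensemble members $p_{\phi_1},\ldots,p_{\phi_K}$ (hypothetical notation) and $H[\cdot]$ the Shannon entropy, the epistemic uncertainty at a state-action pair $(s,a)$ could be estimated as the entropy of the averaged prediction minus the average entropy of the individual predictions:
\begin{equation}
    EU(s,a) = H\left[\frac{1}{K}\sum_{k=1}^{K} p_{\phi_k}(\cdot|s,a)\right] - \frac{1}{K}\sum_{k=1}^{K} H\left[p_{\phi_k}(\cdot|s,a)\right].
\end{equation}
This quantity is non-negative by the concavity of the entropy, vanishes when all members predict the same distribution, and grows where the members disagree, which typically happens on state-action pairs that are scarce in the data collected so far.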

Note that this problem can be framed as a \textbf{combinatorial optimization} problem where the space of \textbf{possible solutions} $\Omega$ consists of the action sequences $\mathbf{x} = (a_1, a_2,\ldots,a_n)$, with $a_i\in\mathcal{A}$, of a given length $n$ (which corresponds to the planning horizon).\footnote{Where $\mathcal{A}$ is finite and its elements discrete.} Accordingly, the \textbf{objective function} $f$ is the cumulative epistemic uncertainty of the world model over the interactions, where the interactions are autoregressively sampled from the world model itself. Formally, considering the world model a probability distribution parametrized by $\phi$ over the next states of the environment conditioned on the current state and action, the objective function $f$ can be written as
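\begin{equation}
    f(\mathbf{x}, i, s) = EU(p_\phi(\cdot|s, \mathbf{x}_i)) + \mathbb{E}_{s'\sim p_\phi(\cdot|s,\mathbf{x}_i)}[f(\mathbf{x}, i+1, s')],
\end{equation}
where $\mathbf{x}_i$ denotes the $i$-th action of $\mathbf{x}$, $EU(p_\phi(\cdot|s, \mathbf{x}_i))$ is the epistemic uncertainty of the world model $p_\phi$ for the state $s$ and action $\mathbf{x}_i$, and the recursion ends at the planning horizon, i.e., $f(\mathbf{x}, n+1, s) = 0$ for every state $s$.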
Note that the fitness of a solution $\mathbf{x}$ is always given with respect to a specific state $s$, as the utility of a given action sequence depends entirely on the initial state in which it is taken. Thus, being in a state $s$, the proposed agent would select the action $a$ such that 
\begin{equation}
    a = \underset{a_1\in\mathcal{A}}{\arg \max}\ \max_{a_2,\ldots,a_n\in\mathcal{A}} f((a_1, a_2, \ldots, a_n), 1, s).
\end{equation}
Intuitively, at each state, the agent would choose the action that maximizes the expected long-term epistemic uncertainty of its world model.
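
To make the recursion concrete, assume that it stops after the last action of the sequence, i.e., $f(\mathbf{x}, n+1, \cdot) = 0$. For a planning horizon of $n=2$, the objective then unrolls to
\begin{equation}
    f(\mathbf{x}, 1, s) = EU(p_\phi(\cdot|s, \mathbf{x}_1)) + \mathbb{E}_{s'\sim p_\phi(\cdot|s,\mathbf{x}_1)}\left[EU(p_\phi(\cdot|s', \mathbf{x}_2))\right].
\end{equation}
Since the expectation over imagined states is generally intractable, one option would be a Monte Carlo estimate from $M$ rollouts sampled from the world model (the symbols $M$ and $s_i^{(m)}$ are hypothetical and introduced only for this sketch):
\begin{equation}
    \hat{f}(\mathbf{x}, 1, s) = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{n} EU(p_\phi(\cdot|s_i^{(m)}, \mathbf{x}_i)), \qquad s_1^{(m)} = s, \quad s_{i+1}^{(m)} \sim p_\phi(\cdot|s_i^{(m)}, \mathbf{x}_i).
\end{equation}
A CEM search over $\Omega$ could then proceed as sketched below. This is an illustration under stated assumptions and not the final design: the factorized categorical sampling distribution $q_\theta(\mathbf{x}) = \prod_{i=1}^{n}\theta_i(\mathbf{x}_i)$, the population size $N$, the elite size $N_e$, and the smoothing factor $\alpha\in(0,1]$ are all hypothetical choices.
\begin{enumerate}
    \item Sample $N$ action sequences from $q_\theta$ and evaluate $\hat{f}(\mathbf{x}, 1, s)$ for each of them.
    \item Select the elite set $E$ formed by the $N_e$ sequences with the highest estimated value.
    \item For every position $i$ and action $a\in\mathcal{A}$, update $\theta_i(a) \leftarrow (1-\alpha)\,\theta_i(a) + \alpha\,\frac{1}{N_e}\sum_{\mathbf{x}\in E}\mathbf{1}[\mathbf{x}_i = a]$, and repeat from the first step for a fixed number of iterations.
\end{enumerate}
After the last iteration, the agent would execute the first action of the best sequence found and replan from the next state.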

\section{Conclusion}

This project outlines a novel approach to emergent intelligent behavior in the absence of explicit objectives (i.e., without a reward function). We first (informally) introduce the optimal explorer hypothesis, which connects the efficiency of learning a model of the environment with the likelihood of inducing emergent behaviors. From this hypothesis, we structure the project into four objectives: (1) formalizing the hypothesis, (2) formulating it as a combinatorial optimization problem, (3) developing an agent based on this formulation, and (4) conducting an extensive empirical analysis of the agent. 
The project aims to advance our understanding of exploration strategies and learning dynamics in artificial agents, paving the way for more adaptable and intelligent systems and for their formal analysis. 
