\section{Introduction}
\label{introduction}

Machine learning has become ubiquitous in computer networks, with applications in areas such as traffic engineering \cite{zuo2019learning}, performance optimization \cite{ali2020performance}, network security \cite{akbar2012improving}, anomaly detection \cite{shon2007hybrid} and root cause analysis \cite{gonzalez2017root}. Reinforcement Learning (RL) in particular has gained substantial momentum in developing congestion control (CC) algorithms \cite{badarla2011learning,jay2019deep, jin2018congestion, lan2019deep,li2016learning,li2018qtcp,mai2019self,nie2019dynamic,ramana2005learning,silva2016smart,tarraf1995reinforcement, xiao2019tcp, xu2019experience, abbasloo2020classic}, routing \cite{mammeri2019reinforcement}, video rate control \cite{huang2018qarc, mao2017neural}, network access \cite{wang2018deep, naparstek2018deep}, network security \cite{nguyen2019deep, uprety2020reinforcement} and proactive caching \cite{zhu2018deep, he2017integrated}. Learning-based algorithms have the potential to adapt to a wide range of network deployments, topologies, traffic workloads and link technologies without requiring engineering (and tuning) of bespoke algorithms to cover said diversity. This, in turn, opens up opportunities for developing future-proof algorithms that can be trained offline, and continuously optimise their behaviour as networks and user workloads evolve.

Learning-based protocols are still in their infancy and substantial research is required to yield deployable solutions \cite{fuhrer2022implementing}. Developing an RL-based protocol is a complex process that requires (1) deciding on a particular RL algorithm, (2) devising a suitable and effective RL model, (3) training the agent(s) in a realistic network setup, and (4) deploying the agent in the wild. The first two of these involve a range of design decisions related to the action and state space of the RL agent, the reward function and the RL algorithm itself. Training the agent is far from trivial; a potentially large number of hyper-parameters (e.g., the discount rate $\gamma$, the size of the replay/memory buffer, etc.) need to be explored to ensure that the resulting policy is the best possible given the selected RL algorithm and model, the training setup (e.g., in terms of the network parameters, topology and workload used) and the expected deployment parameters. At the same time, training an agent requires the collection of (very) large amounts of agent experience, which can be done using a real network deployment (e.g., as in \cite{abbasloo2020classic}), a network emulator (e.g., as in \cite{sacco2021owl}) or network simulations (e.g., as in \cite{tessler2022reinforcement}).

We argue that network simulators, in particular discrete event, packet level simulators, provide a very effective platform for training RL-based network protocols and respective algorithms. First, simulators offer a \textit{fully controllable and configurable} experimentation environment. Network simulators employ domain-specific languages that make it easy to define the `networks' to be simulated, including the underlying topology, link characteristics, and traffic workloads. Moreover, simulations can run independently of, and in parallel with, each other. It would be extremely expensive and time-consuming to enable such configurability on a real network. Network emulators, such as the one used in \cite{netravali2015mahimahi}, support only limited configurability. Second, training an agent for a network protocol (e.g., a CC policy for a TCP sender) requires exchanging traffic between multiple endpoints. In a real or emulated network, said traffic must actually be played out in the network which, depending on the training setup and parameters, may add a substantial overhead to the training process, in terms of the time it takes to collect agent experience (e.g., in a scenario where the network capacity is very small or when the agent is still far from optimised)\footnote{In \cite{abbasloo2020classic}, models are trained on a pool of servers with a total of 320 CPU cores and edge-switches connected through high-speed links, with training and experimentation constrained by the real-time cost of transmitting data in a real network, in addition to the cost of setting up the system.}.
To make things worse, training congestion control policies in high-speed network deployments can be very problematic (or, in fact, impossible) because the agent needs actions much faster than the trainer can compute them\footnote{For example, the authors of \cite{abbasloo2020classic} fix the agent's action calculation/monitoring interval to 20ms, indicating that much lower values would not be possible.}. In contrast, network traffic generated by inefficient policies can be played out very quickly within a simulated network, while the simulated nature of the deployment eliminates the issue around high-performance network deployments.
Third, network simulators support \textit{reproducibility of results} by design, which is crucial for learning-based approaches; it is surprising that many papers in RL-based CC (and other fields of computer networks) are being published without any provision for (or even discussion around) reproducibility\footnote{A shining exception is Remy, one of the earliest learning-based CC algorithms \cite{winstein2013tcp, sivaraman2014experimental}.}, when other communities have had reproducibility protocols embedded in their peer review processes. Reproducibility with real or emulated networks is not really possible. Finally, we posit that the development of a \textit{training playground} for computer network researchers would be beneficial for the research community; we envisage a framework such as \cite{brockman2016openai} where researchers can formulate problems, train agents to operate within the given problem space boundaries, and share learned policies (along with all selected parameters and hyper-parameters).

In this paper, we introduce \textit{RayNet}, a scalable and flexible framework for developing learning-based network protocols; we focus on \textit{RL-based CC algorithms} but there is no fundamental limitation in the framework that would prevent the development of other types of learning-based protocols, such as RL-based routing \cite{stampa2017deep}. RayNet integrates two state of the art frameworks, namely OMNeT++ \cite{varga2008overview} and Ray \cite{moritz2018ray}, in an elegant and resource-efficient way. OMNeT++ is a state-of-the-art packet level, discrete event network simulator that is widely used by the networking research community and supports fully reproducible simulations. Ray is a general-purpose and universal distributed compute framework designed to perform any compute-intensive job (written in Python) with flexibility, including distributed training, hyper-parameter tuning, deep RL, and production model serving. RLlib \cite{liang2018rllib} in particular is an open-source RL toolkit that employs fine-grained nested parallelism to achieve state-of-the-art performance across a wide variety of RL workloads and provides scalable abstractions for assembling new RL algorithms with minimal programming overhead. RayNet is embedded deeply within OMNeT++ by tapping into the simulator's event loop, to control the simulation when needed; e.g., to collect experience - observations and reward - and enforce agent actions within the RL setup. At the same time, RayNet operates as a Ray Trainer, enabling users to run distributed multi-node training on OMNeT++ simulated networks through a set of Python bindings. Said integration is very efficient and the only (minimal) overhead is induced by the Python bindings that allow Ray to control a network simulation. We prototype an RL-driven CC approach as a case study to demonstrate how RayNet facilitates the design, engineering, and assessment of RL solutions for complex networking problems. 
Through experimentation, we show how RayNet enables optimisation and analysis of our RL-driven congestion control protocol, decoupling the learning logic configuration from the networking environment set-up and providing a multi-agent architecture. RayNet is, to the best of our knowledge, the first framework to integrate Ray/RLlib and OMNeT++ end-to-end. In \cite{gawlowicz2019ns}, an ns-3-based framework that exports an OpenAI Gym interface is presented. Unlike \cite{gawlowicz2019ns}, RayNet directly integrates OMNeT++ within a Ray worker by utilising Python bindings and by controlling the execution of the simulator. This avoids the potentially significant overhead associated with the inter-process communication required by \cite{gawlowicz2019ns}. This, along with RayNet's end-to-end interaction with Ray, results in large-scale, parallel training that is on par, in terms of efficiency, with the most advanced frameworks (as explained in Section \ref{experimentation_efficiency}). RayNet is available as an open source project at \textit{https://github.com/giacomoni/raynet}.
 
\section{Background}
\label{background}
In this section we briefly present work related to RayNet. First, we provide an overview of RL and CC. Then, we discuss OMNeT++ (and INET), focussing on its discrete event nature and programming interface for controlling network simulations. Finally, we discuss the key characteristics of Ray and how one can scale up RL using RLlib.

\subsection{Reinforcement Learning}
\label{rl}

\begin{figure}
\includegraphics[width=0.6\textwidth]{diagrams/interaction.pdf}
\centering
\caption{The agent–environment interaction in a Markov decision process \cite{sutton2018reinforcement}.}
\label{rl_interaction}
\end{figure}

RL is the process of learning how to maximise a numerical reward signal by mapping states to actions. The agent is initialised with a random decision-making strategy, thus it must experiment to determine which actions yield the greatest reward. Some actions may affect not only the immediate reward but also some or all of subsequent rewards.
The problem of RL can be formalised as the optimal control of partially-known Markov Decision Processes (MDP), a straightforward framing of the problem of learning from interaction to achieve a goal. The agent and environment interact at each of a sequence of discrete time steps, $t_0,t_1,t_2,t_3,\ldots,t_n$, as shown in Fig. \ref{rl_interaction}. At each time step $t$, the agent receives a representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}$. One time step later, partly as a consequence of its action, the agent receives a numerical reward, $R_{t+1}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a \textit{trajectory}, a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$. The ultimate goal for an RL agent is to find the policy, a mapping from states to actions $\pi: \mathcal{S} \rightarrow \mathcal{A}$, that maximises the \textit{discounted} cumulative reward:

\begin{equation}
    \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
\end{equation}

where $\gamma \in [0,1]$ is the \textit{discount rate}\footnote{$\gamma = 1$ is valid only for episodic environments.}. Deep Reinforcement Learning (DRL) combines RL methods with non-linear function approximation, i.e., neural networks, to cope with target tasks in which the states encountered may never have been seen before. In order to make rational judgements in such situations, the agent must generalise from prior experience of different states that are in some way similar to the current one. Non-stationarity, bootstrapping, and delayed targets are challenges that do not often emerge in traditional supervised learning with neural networks, but do in RL with function approximation. A more detailed discussion of RL can be found in \cite{sutton2018reinforcement}.
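
As a brief worked illustration of the discounted cumulative reward defined above, consider an episode with rewards $R_{t+1} = 1$, $R_{t+2} = 2$, $R_{t+3} = 3$ and zero reward thereafter, with $\gamma = 0.9$. These values are assumptions chosen purely for illustration and are not taken from our experiments. The sum then evaluates to
\[
    \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = 1 + 0.9 \cdot 2 + 0.9^2 \cdot 3 = 1 + 1.8 + 2.43 = 5.23 .
\]
For a constant reward of $1$ at every step of a non-episodic task, the geometric series gives $\sum_{k=0}^{\infty} \gamma^k = 1/(1-\gamma) = 10$ for $\gamma = 0.9$, which illustrates why $\gamma < 1$ is needed for the sum to remain finite in that setting.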

RL algorithms need to sample trajectories of experience in order to improve the current policy. On-policy algorithms learn from experience sampled by the policy itself, whereas off-policy algorithms use (or re-use) experience collected by other policies to improve the currently optimised policy. \textit{Sample efficiency} is a key performance metric when designing and testing new RL algorithms \cite{yu2018towards}. Collecting trajectories of experience can be non-trivial for certain tasks. First, RL performs a trial-and-error search, which gives rise to the trade-off between exploration and exploitation. To gain a large cumulative reward, an agent must favour actions that it has previously tried and found to be rewarding. To discover such actions, however, it must try actions that have not been chosen previously. In certain applications (e.g., autonomous driving/robotics), the risk associated with exploration can be high and the agent has to be trained in a safe environment before being deployed in the real one. Second, some tasks may span large time-scales and collecting samples of experience may be time-consuming. It is therefore very common for simulations to be used to provide an efficient and safe environment for training RL agents. Agents can subsequently be deployed `in the field' using the learned policy, which may or may not continue to be updated based on experience collected by the deployed agent.
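
A simple and widely used way to balance exploration and exploitation is $\epsilon$-greedy action selection \cite{sutton2018reinforcement}. Algorithm \ref{sketch_epsilon_greedy} is a minimal sketch that assumes a finite action set and an action-value estimate $Q$; the symbols $Q$, $\epsilon$ and $u$ are introduced here for illustration only, and this is not necessarily the exploration scheme used by any of the algorithms discussed in this paper.

\begin{algorithm}
\caption{Sketch: $\epsilon$-greedy action selection (illustrative)}\label{sketch_epsilon_greedy}
\begin{algorithmic}[1]
\State $u \gets$ sample drawn uniformly from $[0,1)$
\If{$u < \epsilon$}
\State $A_t \gets$ action drawn uniformly at random from $\mathcal{A}$ \Comment{explore}
\Else
\State $A_t \gets \arg\max_{a \in \mathcal{A}} Q(S_t, a)$ \Comment{exploit current estimate}
\EndIf
\State apply $A_t$ to the environment
\end{algorithmic}
\end{algorithm}

A larger $\epsilon$ favours exploration; a common design choice is to decrease $\epsilon$ over the course of training.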

\subsection{Congestion Control}
\label{congestion control}

Multiple users accessing the same network must share available network resources - bandwidth and buffers - which are finite. Network congestion is a network state characterised by increased network delays, high packet loss and an overall network performance degradation, as a result of having one or more flows going through a bottleneck link where the required bandwidth exceeds the available one; this, in turn, results in severe degradation of users' quality of experience. Congestion must therefore be monitored and controlled. Congestion control is a distributed process which involves end-hosts, and potentially in-network devices, and aims to maximise network resource utilisation while fairly allocating resources among all users. During data transmission, the amount of required network resources depends on the sender's transmission rate. In modern computer networks, regulation of the transmission rate is done by end-hosts. 

Research on congestion control has been active for the past four decades for two main reasons. First, the decentralised nature of computer networks, coupled with the heterogeneity of application requirements and network architectures, makes congestion control an inherently complex problem. Second, networks and traffic patterns evolve, and new network architectures constantly emerge. This requires frequent rethinking of congestion control algorithms or the introduction of new ones, otherwise current algorithms may operate sub-optimally. As a result, new iterations of design, engineering and experimentation of protocols are required. For example, multiple iterations of CC have been deployed in the wild, since TCP Tahoe, including TCP Cubic \cite{ha2008cubic}, TCP Compound \cite{song2006compound} and MultiPath TCP \cite{paasch2014multipath}. At the same time, there has been a plethora of proposed CC algorithms for data centre (\cite{alizadeh2010data, mittal2015timely, wu2010ictcp}), wireless (\cite{mascolo2001tcp, kliazovich2006tcp, shimonishi2005improving}) and satellite (\cite{akyildiz2001tcp, taleb2006refwa, durresi2001congestion}) networks all of which present wildly different characteristics and challenges compared to each other.

Most commonly deployed congestion control protocols use a sliding window of data packets (with or without explicit data pacing by the sender) to monitor and control the amount of data in transit, by adjusting the window's size. CC algorithms modify the size of the window in response to congestion events (e.g., inferred packet loss, acknowledgement of packet delivery, or in-network marking of packets). Historically, there have been three fundamental categories of congestion control solutions: loss-based, delay-based, and network-specific solutions. Loss-based (or reactive) solutions, such as TCP Cubic \cite{ha2008cubic}, are general-purpose protocols that perform congestion control based on inferred packet loss; the window size increases until packet loss is inferred, at which point the window size is decreased. Delay-based (or proactive) solutions, such as TCP Vegas \cite{brakmo1994tcp}, monitor per-packet delay and adjust the congestion window to proactively decrease connection latency and, therefore, avoid congestion-induced packet loss. Some protocols, such as TCP Compound \cite{song2006compound} and BBR \cite{cardwell2017bbr}, take a combined approach, reacting to both loss and delay increase. Network-specific solutions are built to operate in specific networking settings, allowing for stronger assumptions during protocol design. For example, protocols like DCTCP \cite{alizadeh2010data} target data centre networks, characterised by shallow buffers, high-speed links and standardised network topologies (such as Clos topologies); TCP Westwood \cite{mascolo2001tcp} uses bandwidth estimation to distinguish between congestion and stochastic packet loss in wireless channels.
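
The window dynamics of a loss-based scheme, as described above, can be sketched with the classic additive-increase/multiplicative-decrease (AIMD) rule. Algorithm \ref{sketch_aimd} is a deliberately simplified illustration and not the behaviour of TCP Cubic or of any other protocol cited here; the variable $\textit{cwnd}$ (congestion window, in packets), the increase of roughly one packet per round-trip time and the halving factor are assumptions taken from the textbook formulation.

\begin{algorithm}
\caption{Sketch: AIMD window adjustment (illustrative)}\label{sketch_aimd}
\begin{algorithmic}[1]
\State $\textit{cwnd} \gets 1$
\While{the connection is open}
\State wait for the next congestion signal
\If{an acknowledgement for new data is received}
\State $\textit{cwnd} \gets \textit{cwnd} + 1/\textit{cwnd}$ \Comment{additive increase, about one packet per RTT}
\ElsIf{packet loss is inferred}
\State $\textit{cwnd} \gets \max(\textit{cwnd}/2,\, 1)$ \Comment{multiplicative decrease}
\EndIf
\EndWhile
\end{algorithmic}
\end{algorithm}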

Recently, a new approach for end-to-end congestion control has arisen, arguing that congestion signals and control actions are too complex for humans to interpret and that algorithms may provide superior policies. This approach defines an objective function to guide the development of a control strategy (invoked, e.g., on each ACK or periodically) that optimises the specified function. Early work in this thread included off-line optimization of a fixed rule table \cite{winstein2013tcp, sivaraman2014experimental} and online gradient ascent optimization \cite{dong2015pcc, dong2018pcc}, with later work adopting sequential decision-making optimization via Reinforcement Learning algorithms \cite{badarla2011learning,jay2019deep, jin2018congestion, lan2019deep,li2016learning,li2018qtcp,mai2019self,nie2019dynamic,ramana2005learning,silva2016smart,tarraf1995reinforcement, xiao2019tcp, xu2019experience, abbasloo2020classic}.

\subsection{OMNeT++ Simulator}
\label{OMNeT++}
OMNeT++ is an extensible, modular and component-based C++ simulation framework that is primarily intended for the development of network simulators. The term `network' is used in a broad sense, encompassing wired and wireless communication networks, on-chip networks, queuing networks, etc. A simulation model consists of one or more components that encapsulate network functionality (e.g., the TCP/IP protocol stack, communication channels, mobility models) and interact with each other through gates. Components are developed in C++, then integrated using a high-level language into compound components and, finally, models to be simulated. INET is a model library for the OMNeT++ simulation environment that incorporates functionality for simulating computer networks (including the TCP/IP stack and several link/physical layer technologies).

OMNeT++ is a discrete-event simulator, which means that time progresses through scheduling events in the simulated future. The event queue is therefore a key data structure; scheduled events are placed in the queue according to their scheduled execution (simulated) time and the simulator executes events sequentially from the head of the queue, until no more events exist in the queue; i.e. the end of the simulation.

\begin{algorithm}
\caption{Simulation Life Cycle}\label{standalone}
\begin{algorithmic}[1]
\State initialise network \textit{model}
\While{ $\textit{queue.size} \neq 0$}
\State $\textit{event} \gets \textit{queue.get}$
\State $\textit{currentTime} \gets \textit{event.timestamp}$ \Comment{advance simulated time}
\State \textbf{process(}\textit{event}\textbf{)}  \Comment{processing may alter event queue}
\EndWhile
\State  finish simulation  \Comment{statistics collection and clean-up }
\end{algorithmic}
\end{algorithm}

OMNeT++ simulations run as a single-threaded process; the pseudo-code in Algorithm \ref{standalone} shows, at a high level, the life cycle of a standalone simulation execution. The network model is first imported and initialised (line 1). All simulation components are declared and interconnected in a collection of descriptive files (NED files) that constitute the model. The NED files specify the simulation model's structure, such as the number of nodes in the network, the links between nodes, and the protocol stacks supported at each node. Configuration (.ini) files set model components to work in a certain way, including the kind of application running on each node, the type of traffic, the physical link attributes, etc. Each component, whose behaviour is implemented in C++, is then dynamically linked to the simulation kernel. During initialisation of the network model, including its individual components (e.g., applications, links, TCP/IP stack), one or more events may be instantiated and inserted in the event queue. If none is created, then the simulation is completed and clean-up is performed (line 6). Otherwise, the simulator iterates over the events in the queue in chronological order and processes each one (lines 2 - 5); such processing may involve sending a message through a network link, or handling a timeout event. Processing an event may therefore generate new events; e.g., a timer is reset and rescheduled for the simulated future after its expiration. The simulation ends when no more events exist in the queue.
OMNeT++ offers an Application Programming Interface (API) through which simulations can be executed and controlled programmatically. More specifically, through this API, a programmer can iterate over events in the queue and process them individually; this feature is key to RayNet's integration with Ray and RLlib, as discussed in Section \ref{raynet_implementation}.
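
To illustrate what such programmatic control makes possible, Algorithm \ref{sketch_stepped_loop} sketches a variant of Algorithm \ref{standalone} in which an external controller processes events only up to a chosen simulated time and then regains control. The names $\Delta$, $\textit{limit}$ and $\textit{queue.peek}$ are hypothetical and do not correspond to identifiers in the OMNeT++ API, and the time-based stopping condition is an assumption made for illustration; the mechanism RayNet actually uses is described in Section \ref{raynet_implementation}.

\begin{algorithm}
\caption{Sketch: externally controlled event processing (illustrative)}\label{sketch_stepped_loop}
\begin{algorithmic}[1]
\State $\textit{limit} \gets \textit{currentTime} + \Delta$ \Comment{$\Delta$: assumed length of one control interval}
\While{$\textit{queue.size} \neq 0$ \textbf{and} $\textit{queue.peek.timestamp} \leq \textit{limit}$}
\State $\textit{event} \gets \textit{queue.get}$
\State $\textit{currentTime} \gets \textit{event.timestamp}$ \Comment{advance simulated time}
\State \textbf{process(}\textit{event}\textbf{)} \Comment{processing may alter event queue}
\EndWhile
\State return control to the caller \Comment{any remaining events stay queued}
\end{algorithmic}
\end{algorithm}

Calling this procedure repeatedly reproduces the standalone loop, while giving the caller an opportunity to inspect or modify the simulation state between calls.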

\subsection{Ray and RLlib}
\label{ray_rllib}

Ray \cite{moritz2018ray} is a platform for general-purpose cluster computing that supports simulation, training, and serving for RL applications. RLlib \cite{liang2018rllib} is an open-source library that provides scalable software primitives for RL and enables a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib supports a variety of environment interfaces for training agents.\footnote{See https://docs.ray.io/en/latest/rllib/rllib-env.html for an excellent discussion of supported RLlib environments.} \textit{OpenAI Gym} is the primary interface for single-agent training environments. When an episode begins, the environment is reset to its initial state and the initial observation is returned to the agent. The agent interacts with the environment by providing an action, and it receives a reward for the action performed together with the subsequent observation. This interaction takes place in \textit{steps}, until an episode termination condition is met, either because the environment has reached a terminal state or because the environment's maximum number of steps has been reached. In a multi-agent environment, multiple agents may act simultaneously, sequentially, or in a combination of the two. The \textit{MultiAgentEnv} interface enables mapping of trajectories to individual agents and the assignment of distinct policies to distinct agents. An agent can be mapped to a single policy, while a policy can be mapped to multiple agents. In many cases, such as with game engines or robotics simulators, the environment must run autonomously, outside of RLlib's control, although RLlib is still employed for training the respective agents' policies. For such cases, instead of having the agent actively step the environment and wait for the returned tuple, RLlib provides the \textit{ExternalEnv} interface, which permits querying a policy for actions and logging end-of-step tuples.
RLlib allows experience collection to scale in two ways: (1) vectorising multiple environments within a single process and batching policy evaluations across these environments; and (2) distributing multiple environments across multiple processes, where each environment runs as an independent process.

\section{Design Principles}
\label{principles}

In this section, we discuss the key design principles underpinning RayNet, leaving all the implementation-specific details for Section \ref{raynet_implementation}.

\noindent\textbf{Separation of environment from learning.} As with standard RL training playgrounds, the environment must be logically separated from the learning process and its execution needs to adhere to a few simple operations (and, programmatically, to a respective API). This ensures that one can change learning algorithms and hyper-parameters without having to change anything in the environment where the agent(s) operate(s). Conversely, it is crucial that the environment can change without requiring any changes to the learning infrastructure, in order to support learning in different contexts and scenarios. A typical example of such a separation is the \textit{OpenAI Gym} abstraction that is widely adopted in numerous RL setups. Such a separation is particularly important in the context of computer networks, where a single agent may have to be trained in diverse network setups, with respect to the physical topology, number of end-hosts, traffic workloads, etc. In addition, the step size definition may differ depending on the nature of the problem being solved by the RL agent. Typically, games such as Chess or Go have a discrete turn-based structure in which each step corresponds to a player's turn. Other tasks, including Atari Games and Robot Control, necessitate the discretisation of time. For networking-related problems, time is continuous\footnote{Note that we refer to continuous time despite the fact that RayNet employs a discrete event simulator where simulated time progresses through the execution of events.}. Agents act at predefined time intervals that signal an RL step, the length of which depends on the particular task. It is therefore important that fine-grained control of the step size is supported, where agents can step independently of each other, in groups or individually.
RayNet allows for a step-based approach (built on top of the \textit{OpenAI Gym} environment) by having Ray directly controlling the event execution loop for each simulation, as discussed in Section \ref{event_looping}.

\noindent\textbf{Support for multi-agent environments.} Learning policies for core network functions, such as CC and routing, requires operating multiple agents within a single environment. For example, as part of the learning process, one could have multiple TCP flows (i.e., CC agents that use the same policy) competing for bandwidth on a network link. Similarly, multiple routers in a network may be acting on flows independently, following the same or different policies. It is therefore crucial to support environment execution (for training and/or evaluation purposes) with multiple agents that may or may not learn/employ the same policy. RayNet supports this by integrating Ray/RLlib's multi-agent interface with a bespoke signalling system for disseminating and collecting actions, rewards, and observations that we developed using OMNeT++'s API, as detailed in Section \ref{raynet_environment}.

\noindent\textbf{Reproducibility.} Reproducibility enables researchers to replicate published results, identify errors or limitations and propose ways forward. RL results are generally hard to reproduce due to the algorithms' intrinsic variance, the environments' stochasticity, and the potentially large number of hyper-parameters that can go unreported. \textit{RayNet} aims to minimise factors that can lead to non-reproducible results by employing OMNeT++ as the underlying environment for collecting experience to optimise agent policies. OMNeT++ simulations are deterministic by design, allowing for fully reproducible results; OMNeT++'s sophisticated pseudo-random number generator framework allows for controlling randomness and enabling truly independent runs of the same simulation (i.e., a RayNet environment). On top of this, Ray and RLlib support state-of-the-art reporting of hyper-parameters, limiting non-determinism to the algorithms' intrinsic variance.

\noindent\textbf{Efficiency and scalability.} RL requires the collection of a very large amount of experience through which agents learn how to best interact with their environment. Experience is collected into a replay/memory buffer and learning is usually done by drawing an experience batch out of this buffer. Three issues are crucial in this process: (1) the environment must be quick in transforming agents' actions into a reward and some partially observable state, as modelled in the learning process; (2) the learning itself must be done efficiently; and (3) all available computational resources must be used as efficiently as possible, by enabling parallel instantiation and execution of as many environments as possible and by minimising the overhead in the interaction between the learning and environment execution components of the RL setup. RayNet adheres to this principle by using lightweight Python bindings (see Section \ref{overview}) to integrate Ray/RLlib with OMNeT++ in a programmatic fashion. Ray allows for running multiple environments in parallel, and OMNeT++ itself runs each environment (i.e., a network simulation) efficiently as a single-threaded process, which enables parallel learning at scale.

\section{RayNet Architecture}
\label{raynet_implementation}

In this section, we discuss RayNet in detail. We first provide a high-level overview of its architecture, and describe how the RLlib environment is integrated with OMNeT++'s event loop and simulation models. We then focus on the bespoke signal system we developed within OMNeT++ so that Ray/RLlib can efficiently communicate actions to agents and collect observations and rewards from them at user-defined learning steps. Finally, we discuss how we can deploy trained agents (and associated policies) into a simulated environment and in a real-world computer network.

\subsection{Overview}
\label{overview}
 
\begin{figure}

\includegraphics[width=\textwidth]{diagrams/worker.pdf}
\centering
\caption{An overview of a RayNet worker's components. The \textit{Trainer} and \textit{Worker} are Ray processes. The Worker initialises and controls a Python class that implements the \textit{OpenAI Gym} interface and serves as the RL environment. Through the \textit{pybind11} API, OMNeT++ and the simulation model are embedded into the environment.}
\label{core_modules}
\end{figure}

RayNet employs Ray and RLlib which operate as discussed in Section \ref{ray_rllib}. OMNeT++ is used to instantiate learning environments. Ray's trainer and workers are completely oblivious of the implementation details of the learning environment, which are abstracted away through the \textit{OpenAI Gym} interface supported by RLlib. Figure \ref{core_modules} illustrates an overview of RayNet; for clarity, only a single worker is shown to interact with Ray's trainer. Having multiple workers running in parallel is trivially supported by Ray and wouldn't affect any of the discussed components in this section.

The trainer, i.e. the process running the RL algorithm, delegates policy evaluation to one or more parallel processes, referred to as rollout workers, to speed up experience collection during training. Each rollout worker is assigned one (as in Figure \ref{core_modules}) or more RayNet environments and interacts with them only by calling the methods exported by the \textit{OpenAI Gym} interface, namely \textit{initialise()}, \textit{step()} and \textit{reset()}, as shown in Figure \ref{core_modules}. 
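
The interaction between a rollout worker and a single environment can be summarised as in Algorithm \ref{sketch_rollout}. This is a simplified single-agent sketch written with the method names used in this paper; $\pi$ denotes the current policy, $\textit{done}$ the episode-termination flag and $\textit{batch}$ a hypothetical container for collected experience. None of these are RLlib identifiers, and the actual worker logic inside RLlib is more elaborate.

\begin{algorithm}
\caption{Sketch: experience collection by a rollout worker (illustrative)}\label{sketch_rollout}
\begin{algorithmic}[1]
\State $\textit{env} \gets \textit{initialise()}$ \Comment{create the environment once}
\State $\textit{obs} \gets \textit{env.reset()}$ \Comment{start a new episode}
\State $\textit{done} \gets \textit{false}$
\While{\textbf{not} $\textit{done}$}
\State $\textit{action} \gets \pi(\textit{obs})$ \Comment{policy evaluation}
\State $(\textit{nextObs}, \textit{reward}, \textit{done}) \gets \textit{env.step(action)}$
\State append $(\textit{obs}, \textit{action}, \textit{reward}, \textit{nextObs})$ to $\textit{batch}$
\State $\textit{obs} \gets \textit{nextObs}$
\EndWhile
\State hand $\textit{batch}$ over to the trainer
\end{algorithmic}
\end{algorithm}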

The environment itself is an OMNeT++ simulation, which consists of the discrete event loop handler, core OMNeT++ classes, including a user interface class, such as \textit{cmdenv}\footnote{OMNeT++ refers to these as environments; they are meant to facilitate the configuration and execution of simulations and have nothing to do with the concept of RLlib environments.}, and all the different simulation models that compose an OMNeT++ `network'. As discussed in Section \ref{OMNeT++}, OMNeT++ exports an API to initialise and run simulations programmatically, instead of running a simulation as a standalone process. RayNet does not run OMNeT++ simulations as independent processes; instead, it employs the exported API to programmatically control the life cycle of a simulation; said life cycle is mapped to the methods exported by the \textit{OpenAI Gym} interface, effectively integrating Ray/RLlib with OMNeT++. Below, we briefly describe these methods, abstracting away from the details of OMNeT++. In Section \ref{raynet_environment}, we analyse in detail the interactions within the OMNeT++ simulation in response to calling these methods.

\begin{itemize}
    \item \textbf{initialise():} this method creates and initialises the environment. Objects of all core OMNeT++ classes along with classes that comprise the network are instantiated. The event loop handler is instantiated and, as part of the network initialisation, one or more events are scheduled for the future (e.g., an application sends a packet down to the data transport layer or a wireless node broadcasts a link layer frame). Note that the simulation has not started at this point; an episode can be started only after the \textit{reset()} method below is called. 
    
    \item \textbf{reset():} at any point during an episode (i.e., a running OMNeT++ simulation) or at its end, the Ray/RLlib worker can call this method to restore the environment to a random or starting state. The function returns the starting observation for the new episode that is about to begin.
    
    \item \textbf{step(action):} progression within an episode is done by calling this method. The worker provides the action to be performed on the environment, and the environment transitions to the subsequent state. The function returns an observation of the newly attained state, a reward value for the performed action, and a Boolean flag indicating whether the new state is final (i.e., the end of an episode) or not. Multiple agents can step in the environment simultaneously, in which case all the aforementioned scalar values become vectors of values, with each element in a vector (e.g., the reward vector) being associated with a specific agent operating in the environment. The definitions of the step and of the action, reward and observation spaces, along with how the environment transitions from one state to another, are problem-specific. The RL algorithm that runs within the trainer (see Figure \ref{core_modules}) feeds the action, reward and observation values into its internal model, but is oblivious to how the environment transitions or, in fact, what the environment is. In RayNet, all environment-specific knowledge is embedded within the OMNeT++ code. In the next section we explain how OMNeT++ signifies the end of a step, which triggers the \textit{step()} method to return control to the worker.
    
\end{itemize}
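
The three methods above can be illustrated with a minimal, self-contained sketch. The class \texttt{ToyEnv}, its toy dynamics and the helper \texttt{run\_episode} are hypothetical and are not part of RayNet, OMNeT++ or RLlib; they only show how a worker can drive an episode while knowing nothing about the environment beyond this interface.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an environment (not RayNet's actual code)."""

    def __init__(self, episode_length=3, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.initialised = False
        self.steps_taken = 0
        self.state = 0.0

    def initialise(self):
        # Build the simulation objects; the episode has not started yet.
        self.initialised = True

    def reset(self):
        # Restore a starting state and return the first observation.
        if not self.initialised:
            raise RuntimeError("initialise() must be called before reset()")
        self.steps_taken = 0
        self.state = self.rng.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        # Apply the action, advance the environment, report the outcome.
        self.state += action
        self.steps_taken += 1
        reward = -abs(self.state)  # toy reward: stay close to zero
        done = self.steps_taken >= self.episode_length
        return self.state, reward, done


def run_episode(env, policy):
    """Worker-side loop: only the interface methods are used."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
    return total_reward


env = ToyEnv(episode_length=3)
env.initialise()
episode_return = run_episode(env, policy=lambda obs: -obs)
print(episode_return)
```

The worker-side loop never inspects the environment's internals, which is the property that allows the same loop to drive a full network simulation.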

The last but crucial link of this integration is the Python bindings, implemented using \textit{pybind11}\footnote{https://pybind11.readthedocs.io/}. As shown in Figure \ref{core_modules}, the Python bindings sit between the worker and the OMNeT++ API; they implement the \textit{OpenAI Gym} methods by calling bindings of C++ methods that, in turn, directly call methods exported by the OMNeT++ API.

\subsection{Event Looping and Environment Stepping}
\label{event_looping}

\begin{algorithm}
\caption{Environment Stepping}\label{embedded}
\begin{algorithmic}[1]
\Procedure{Step}{}
\While{ $\textit{queue.size} \neq 0$}
\State $\textit{event} \gets \textit{queue.get}$ \Comment{retrieve event from head of event queue}
\State $\textit{currentTime} \gets \textit{event.timestamp}$ \Comment{advance simulated time}
\If{\textit{event}.type = \textsc{step}}
\State $\textit{obs} \gets \textit{model.getObs}$
\State $\textit{reward} \gets \textit{model.getReward}$
\State $\textit{done} \gets \textit{model.getDone}$
\State \Return \textit{(obs, reward, done)} \Comment{return control to the worker}
\Else
\State \textbf{process(}\textit{event}\textbf{)}  \Comment{processing may alter event queue}
\EndIf
\EndWhile
\EndProcedure
\end{algorithmic}
\end{algorithm}

Performing a step in the integrated OMNeT++/OpenAI Gym environment requires executing one or many\footnote{In a large scale network simulation there could be thousands of simulated events in a single environment step.} simulated events. As a result of executing an event, the simulated time advances to the future time at which the event was scheduled to be executed.

As discussed in Section \ref{overview}, RayNet integrates OMNeT++ into Ray workers through OMNeT++'s API and by tapping into its event processing loop. The pseudo-code shown in Algorithm \ref{embedded}, which describes the behaviour of the \textit{step()} method mentioned in the previous section, illustrates this integration. Every time a worker calls the \textit{step()} method through the \textit{pybind11} API (see Figure \ref{core_modules}), control passes to the event loop in Algorithm \ref{embedded}, which iterates over the events in the queue in chronological order and processes each of them (lines 2--11), until either no more events exist or a special \textsc{step} event is found (line 5). In the former case, the simulation (and implicitly the last step of the RL episode) is completed, and the worker cleans up the simulation outside this method. In the latter case, the end of the step is signified and the worker collects the reward (or vector of rewards for multiple agents) and observation (or vector of observations) from the agent(s) operating in the environment (lines 6--8), as discussed in Section \ref{raynet_environment}.

There are two important points related to the described integration. (1) Ray/RLlib (i.e., the trainer and workers) is completely oblivious to the nature of the step and the internals of the environment. The worker only knows to call the \textit{step()} method, which, in turn, consumes events in OMNeT++'s queue, effectively progressing the simulation (i.e., performing operations in the environment) without any need to understand the semantics of these events. Similarly, the step does not need to be defined by the trainer or worker(s); the \textit{step()} method returns when a \textsc{step} event is found in the queue. (2) Crucially, it is the responsibility of the environment to place this special event in the queue. This provides flexibility in defining a step within a specific problem space; e.g., in a CC problem the step may end after a fixed amount of time elapses, or after a fixed or dynamically calculated number of packet acknowledgments are received by the sender. In all these cases, it is some user-defined OMNeT++ module that schedules a \textsc{step} event when some problem-specific conditions are met.
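
The stepping logic of Algorithm \ref{embedded} can be sketched in a few lines of Python. This is an illustrative approximation only: the class \texttt{ToySimulation}, the event kinds and the toy observation are assumptions made for this example and do not correspond to OMNeT++ or RayNet code.

```python
import heapq
import itertools

STEP = "step"      # special event type that ends an environment step
PACKET = "packet"  # stand-in for any ordinary simulation event


class ToySimulation:
    """Hypothetical discrete-event loop; all names are illustrative only."""

    def __init__(self):
        self.queue = []                   # min-heap ordered by timestamp
        self.counter = itertools.count()  # tie-breaker for equal timestamps
        self.current_time = 0.0
        self.processed = 0

    def schedule(self, timestamp, kind):
        heapq.heappush(self.queue, (timestamp, next(self.counter), kind))

    def process(self, kind):
        # A real model could schedule further events here.
        self.processed += 1

    def step(self):
        while self.queue:
            timestamp, _, kind = heapq.heappop(self.queue)
            self.current_time = timestamp  # advance simulated time
            if kind == STEP:
                obs = self.processed       # toy observation
                reward = 0.0               # toy reward
                done = not self.queue      # toy termination rule
                return obs, reward, done   # return control to the worker
            self.process(kind)
        return None  # queue drained: the simulation is complete


sim = ToySimulation()
for t in (0.1, 0.2, 0.3):
    sim.schedule(t, PACKET)
sim.schedule(0.25, STEP)  # placed by the environment, not by the worker
result = sim.step()
print(result, sim.current_time)
```

In this sketch the first call returns at simulated time 0.25, after two ordinary events have been processed; a second call drains the remaining event and reports that the simulation has finished.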


\subsection{RayNet Environment}
\label{raynet_environment}

A RayNet environment consists of OMNeT++ simulation models, implemented as C++ modules, and RayNet-specific modules (namely \textit{RL agents}, the \textit{Stepper} and \textit{Broker} as depicted in Figure \ref{environment_internals}) that facilitate environment stepping and its interaction with the Ray workers. A RayNet environment contains one or more RL agents that act based on policies trained outside the environment (i.e., within the Ray trainer, as depicted in Figure \ref{core_modules}). For example, in a CC setup, an agent could operate within the data transport layer, controlling the transmission rate for a specific network flow, while calculating the required observations and reward that are used for training one or more policies (see Section \ref{cc_with_raynet} for more details on our CC use case). The Stepper module is responsible for coordinating with the RL agents to enable the environment to transition to a new state when a Ray worker calls the \textit{step()} method. The Broker module is responsible for serialising and de-serialising action/observation/reward values (scalar or vectors) and disseminating these to agents (actions) and the Ray worker (observations/rewards), at the beginning and end of a step, respectively.

Interaction between the RayNet-specific modules is implemented using the signalling system provided by OMNeT++, which adopts a publish/subscribe communication paradigm. More specifically, an OMNeT++ module can subscribe by name to one or more signal types; multiple modules can subscribe to the same signal. When a module publishes a signal of one of the types to which other modules have previously subscribed, OMNeT++ passes the signal to these modules through a callback mechanism. A key advantage of such a paradigm is that the coupling between publisher and subscriber modules is loose; i.e., these modules do not need to know of each other to be able to communicate. This is crucial in RayNet because an environment may contain multiple agents that appear and disappear at different times during the life cycle of an environment; e.g., TCP senders for respective TCP flows. With OMNeT++'s signalling system, RL agents can communicate with the Stepper and Broker modules by publishing and subscribing to a priori known signal types, without having to reference each other at compile time.
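
The loose coupling described above can be illustrated with a small publish/subscribe sketch. The \texttt{SignalBus} class and the signal names used here are hypothetical; they only mimic the idea of named signals with callbacks and are not the OMNeT++ signal API.

```python
from collections import defaultdict


class SignalBus:
    """Hypothetical publish/subscribe hub that mimics named signals."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, signal_name, callback):
        self.subscribers[signal_name].append(callback)

    def publish(self, signal_name, payload):
        # Deliver the payload to every callback registered for this signal.
        for callback in list(self.subscribers[signal_name]):
            callback(payload)


class Broker:
    def __init__(self, bus):
        self.observations = {}
        bus.subscribe("observation", self.on_observation)

    def on_observation(self, payload):
        agent_id, obs = payload
        self.observations[agent_id] = obs


class Agent:
    def __init__(self, agent_id, bus):
        self.agent_id = agent_id
        self.bus = bus
        self.last_action = None
        bus.subscribe("action", self.on_action)

    def on_action(self, actions):
        # Each agent picks its own entry from the broadcast action map.
        self.last_action = actions.get(self.agent_id)

    def report(self, obs):
        self.bus.publish("observation", (self.agent_id, obs))


bus = SignalBus()
broker = Broker(bus)
agents = [Agent("flow1", bus), Agent("flow2", bus)]
bus.publish("action", {"flow1": 0.5, "flow2": -1.0})
for agent in agents:
    agent.report(obs=agent.last_action * 2)
print(broker.observations)
```

Neither the broker nor the agents hold references to each other; both depend only on the bus and on agreed signal names, so agents can be created or removed at any time.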

Upon environment initialisation, the Stepper and Broker modules are also initialised and, as part of this, they subscribe to specific signal types so that they can (1) coordinate with each other and (2) receive messages from RL agents. RL agents that are present when the environment is initialised register their presence with the Stepper and Broker modules by publishing signals of specific types, and subscribe to specific signal types so that they can receive messages from them; RL agents that appear during the life cycle of an environment follow the same registration process with the Stepper and Broker modules. At this point, initialisation is complete and the \textit{initialise()} method that triggered the operations described above returns control to the Ray worker.

Next, Ray workers call the \textit{reset()} method of the OpenAI Gym API, which brings the environment to a state where the first step can be taken. This may involve processing zero or more simulated events that were placed in the event queue during initialisation, as well as events that need to be scheduled and executed as part of resetting the environment. It is the Stepper module that signals this state by inserting a \textsc{step} event at the front of the queue (i.e., scheduling the event at the present simulated time). The environment is now ready to be stepped and the \textit{reset()} method returns control to the Ray worker, along with the initial environment observation.

\begin{figure}
\includegraphics[width=\textwidth]{diagrams/environment.pdf}
\centering
\caption{OMNeT++ module interactions throughout the life cycle of a step}
\label{environment_internals}
\end{figure}

A Ray worker subsequently steps the environment (by repeatedly calling the \textit{step()} method) until the end of the episode. The definition of the episode is problem-specific; e.g., the end of the simulation or reaching some internal milestone. The end of the episode is signified by the environment, by having the \textit{step()} method return with the relevant Boolean flag set to true. Figure \ref{environment_internals} illustrates the sequence of RayNet-specific operations that take place during a step. Note that, as discussed in Section \ref{event_looping}, a potentially large number of simulated events may be executed during a step; here we focus only on events, signalling and data exchange related to the step itself; all other events are standard OMNeT++ events that simulate some intended network functionality. When the Ray worker calls the \textit{step()} method, the \textsc{step} event that was previously inserted in the event queue is consumed and a new environment step begins ((1) in Figure \ref{environment_internals}). The value of the action is passed to the Broker module directly through the OMNeT++ API, which allows accessing OMNeT++ modules by name ((2) in Figure \ref{environment_internals}). Depending on the number of RL agents present in the environment, this value may be a scalar value or a vector of \{agent-id, action\} pairs, one for each RL agent. The Broker module then `broadcasts' the action(s) by publishing a signal to which RL agents are subscribed ((3) in Figure \ref{environment_internals}). At this point every RL agent is aware of what it needs to do in the current step. The Ray worker loops over events in the event queue until a \textsc{step} event is encountered. As discussed above, such a \textsc{step} event is inserted in the queue by the Stepper module when some environment-specific condition is met ((5) in Figure \ref{environment_internals}).
Throughout the duration of each step, the Stepper module coordinates with the Agent modules using the signalling system to delineate the end of the step ((4) in Figure \ref{environment_internals}); e.g., in Section \ref{cc_with_raynet}, we describe how we have used per-agent timers (implemented as self-messages in OMNeT++) to step the environment in our CC use case. RL agents signal their observation and reward to the Broker module at the end of the step ((6) in Figure \ref{environment_internals}). When the Ray worker encounters the \textsc{step} event in the event queue, it collects observations and rewards directly from the Broker module through the OMNeT++ API ((7) in Figure \ref{environment_internals}). These are communicated back to the Ray trainer which places them into the RL replay/memory buffer.

\subsection{Policy Deployment}
\label{deployment}

Deploying a learned policy within RayNet's simulated environment is straightforward; Ray/RLlib allows previously trained policies to be reloaded, either for evaluation or to resume training from a saved checkpoint. During evaluation, exploration is not necessary and can be disabled when computing actions. In fact, unless it is part of the input features of the observation, the reward is not required by the agent during the decision-making process; it is only used during training of the policy. In Section \ref{experimental_evaluation}, we show how trained agents perform when deployed in a wide variety of environments, all of which are simulated.

Deploying a policy that was learned with RayNet in a real network deployment would require (1) exporting the learned model(s) that comprise the policy, which is a feature of Ray/RLlib, and (2) integrating the policy with a real-world implementation of the network functionality under consideration. For example, in the CC use case, one could integrate the learned policy into a user-space process that communicates with a kernel-space implementation of a modified TCP protocol using the sockets API, as is done in \cite{abbasloo2020classic}. The modified TCP protocol collects observations and passes them to the user-space process that, in turn, feeds back to it an action calculated by taking into account the received observations.

\section{Learning Congestion Control with RayNet}
\label{cc_with_raynet}

In this section, we describe a use case of RayNet, namely Congestion Control (CC) with deep reinforcement learning (DRL), that we have developed to showcase RayNet's functionality. As discussed in Section \ref{congestion control}, CC is responsible for adjusting the amount of in-flight (i.e., sent but unacknowledged) data and/or the pace at which data is sent, so that (1) network utilisation is maximal, (2) perceived latency is minimal, and (3) some form of fairness is adhered to (e.g. max-min fairness). A CC policy dictates an action (e.g., increasing the amount of allowed in-flight bytes) in response to some signalled or inferred state from the network (e.g. an increase in the experienced packet round trip time, or receiving three duplicate acknowledgments).

\begin{figure}[H]
\includegraphics[width=0.75\textwidth]{diagrams/ccusecase.pdf}
\centering
\caption{Timeline of the congestion window evolution of a simulated episode with two flows (agents). After the initialisation, flows are scheduled and \textit{reset()} is called on the environment. The end of the initial step marks the end of a \textit{reset()} call and control of the window is delegated to the agents. The agents adjust the congestion window (red squares) at the beginning of each step.}
\label{ccusecase}
\end{figure}

In our use case, an RL agent sits on the sender side of each data transport flow in the network. For each flow, data transmission takes place in steps, and at the beginning of each step the RL policy fixes the congestion window size for the whole step duration; no updates to the congestion window size occur within a step in response to incoming acknowledgments or any other in-network signals. Depending on the network path's propagation delay and bandwidth, the steady-state congestion window size of a flow can range over several orders of magnitude. Therefore, similarly to \cite{abbasloo2020classic}, to reduce the size of the action space, the policy action $\alpha$ determines a multiplier that is applied to the current congestion window size. At the beginning of each step $t$, RL agents set the value of $cwnd_{t}$ according to Equation \ref{cwnd_eq}.

\begin{equation}
\label{cwnd_eq}
    cwnd_{t} = 2^{\alpha} \times cwnd_{t-1}
\end{equation}

The choice\footnote{We do not discuss in any detail the rationale behind selecting the specific action, observation and reward and present these here only for completeness so that we can discuss experimental results presented in Section \ref{experimental_evaluation}.} of $\alpha$ is limited to the range $[-2,2]$, so that, in a single step, the congestion window can grow to at most four times, and shrink to no less than a quarter of, its current size.
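
As a worked example of Equation \ref{cwnd_eq}, assume a window of $cwnd_{t-1} = 10$ packets at the end of the previous step (a value chosen here purely for illustration). The two extreme actions and the neutral action then give:
\begin{align*}
\alpha = 2:  &\quad cwnd_{t} = 2^{2} \times 10 = 40 \text{ packets},\\
\alpha = 0:  &\quad cwnd_{t} = 2^{0} \times 10 = 10 \text{ packets},\\
\alpha = -2: &\quad cwnd_{t} = 2^{-2} \times 10 = 2.5 \text{ packets}.
\end{align*}
Because the action enters through the exponent, equal and opposite actions cancel: applying $\alpha$ in one step and $-\alpha$ in the next returns the window to its original size.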

RL agents infer the state of the network through observations defined as follows:
\begin{enumerate}
    \item The throughput $\mathcal{R}$ achieved in the last step over the estimated maximum bandwidth $\mathcal{R}^{max}$ of the connection.
    \item The min-max normalised smoothed round trip time $\tilde{d}$, measured in the last step, where the min and max RTT values, $d^{min}$ and $d^{max}$ respectively, are estimated from the beginning of the connection.
    \item The ratio of packets lost $\mathcal{L}$ over total packets transmitted in the last step.
    \item The current congestion window size.
\end{enumerate}

At each step, the reward $r$ assigned to the agent depends on the throughput $\mathcal{R}$, round trip time $d$ and loss rate $\mathcal{L}$ measured in the step as follows:

\begin{equation}
  r=\begin{cases}
    \frac{\mathcal{R}}{\mathcal{R}^{max}} - \mathcal{L}, & \text{if $\frac{\mathcal{R}}{\mathcal{R}^{max}} - \mathcal{L} < 1 \land d = d^{min}$},\\
    \left(\frac{\mathcal{R}}{\mathcal{R}^{max}} - \mathcal{L}\right) \cdot \frac{d^{min}}{d} \cdot (1 - \tilde{d}), & \text{otherwise},
  \end{cases}
\end{equation}

where $\tilde{d} = \frac{d - d^{min}}{d^{max} - d^{min}}$.
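
The reward definition above can be transcribed into a short function. The following is a minimal sketch: the function and argument names are our own, the numerical inputs are made up for illustration, and the code assumes $d^{max} > d^{min}$ so that the normalisation is well defined.

```python
def normalised_rtt(d, d_min, d_max):
    """Min-max normalised RTT; assumes d_max > d_min."""
    return (d - d_min) / (d_max - d_min)


def reward(throughput, max_throughput, loss, d, d_min, d_max):
    """Transcription of the two-case reward; names are illustrative."""
    utility = throughput / max_throughput - loss
    if utility < 1 and d == d_min:
        # No queuing delay was observed in this step.
        return utility
    # Otherwise, scale the utility down as the RTT grows above its minimum.
    return utility * (d_min / d) * (1 - normalised_rtt(d, d_min, d_max))


# Illustrative inputs: 90% utilisation, 1% loss, RTTs in milliseconds.
no_queue = reward(90.0, 100.0, 0.01, d=20.0, d_min=20.0, d_max=60.0)
with_queue = reward(90.0, 100.0, 0.01, d=30.0, d_min=20.0, d_max=60.0)
print(no_queue, with_queue)
```

With these inputs the reward is about 0.89 when the measured RTT equals the minimum RTT, and about 0.445 when the RTT rises to 30\,ms, which shows how queuing delay is penalised.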

The life cycle of an example episode with two flows is illustrated in Figure \ref{ccusecase}. After the environment initialisation, Flow 1 (shown in blue) is scheduled to start first at time $t_{f1}$ and Flow 2 (shown in green) is scheduled to start at time $t_{f2}$. Upon calling the \textit{reset()} method, the simulation loop starts executing events until the first flow starts. When the flow starts, it sets the congestion window to a small fixed value (as in \cite{ha2008cubic}), publishes a signal to the \textit{Stepper} module to register its presence, and declares the duration of its initial step ((4) in Figure \ref{environment_internals}). For simplicity, here we assume that some known average of the Round-Trip Time (RTT) in the network is selected as step size.\footnote{In practice, we employ a \textit{slow-start} phase (as TCP does) during which the congestion window size is increased exponentially in every RTT, until either a threshold is reached or packet loss occurs. This enables the sender to acquire good estimates of the current RTT, the minimum observed RTT and the maximum observed throughput; the first is used to set the duration of the first step, and the other two are used in calculating rewards.} At time $t_{s1}$ the RL agent calculates its environment observation for the initial step and publishes it to the \textit{Broker}, signalling the end of the step and returning from the \textit{reset()} method. After inferring the next action from the received observation, the Ray worker calls the \textit{step()} method, passing the new action to the environment. The duration of this (and all subsequent) steps is calculated by the agent at the start of each step.
Each RL agent (one for each flow) in the network steps independently, and each step lasts for an amount of simulated time equal to twice the minimum RTT that the sender has observed in the last 10 seconds for this connection; the minimum RTT should change only when the network path changes due to re-routing. Each RL agent declares its own step duration to the Stepper module ((4) in Figure \ref{environment_internals}), which subsequently schedules a \textsc{step} event accordingly ((5) in Figure \ref{environment_internals}); note that the step duration can vary from step to step, as depicted in Figure \ref{ccusecase}. Flow 2 (shown in green) starts at time $t_{f2}$, in the middle of one of Flow 1's steps. Flow 2 also performs its initial step; at time $t_{s2}$, and before Flow 1 steps, the first observation value is returned and a new step for Flow 2 begins upon receiving a new action drawn from the RL policy. Note that, in the figure, RL agents of different flows step independently of each other, collecting observations and rewards that are all fed into the same training process; i.e., learning a single RL policy.
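
The step-duration rule described above (twice the minimum RTT observed over the last 10 seconds) can be sketched as follows. The \texttt{StepTimer} helper is hypothetical and the RTT samples are made up; the sketch only illustrates the sliding-window minimum, not the OMNeT++ implementation.

```python
from collections import deque


class StepTimer:
    """Hypothetical helper: step length = 2 x windowed minimum RTT."""

    def __init__(self, window=10.0, multiplier=2.0):
        self.window = window          # seconds of RTT history to keep
        self.multiplier = multiplier
        self.samples = deque()        # (timestamp, rtt) pairs, oldest first

    def add_sample(self, now, rtt):
        self.samples.append((now, rtt))
        # Drop samples that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def step_duration(self):
        if not self.samples:
            raise ValueError("no RTT samples yet")
        min_rtt = min(rtt for _, rtt in self.samples)
        return self.multiplier * min_rtt


timer = StepTimer()
timer.add_sample(now=0.0, rtt=0.020)
timer.add_sample(now=5.0, rtt=0.030)
first = timer.step_duration()   # minimum in window is 0.020 s
timer.add_sample(now=11.0, rtt=0.035)
second = timer.step_duration()  # the 0.020 s sample has expired
print(first, second)
```

In this example the step duration grows from 40\,ms to 60\,ms once the oldest sample leaves the window, mirroring how the step length can vary from step to step.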


\begin{figure}
\begin{minipage}[b]{0.47\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{diagrams/Dumbell.pdf}
\caption{Dumbbell topology}
\label{dumbel_topology}
\end{minipage}
\hfill
\begin{minipage}[b]{0.47\textwidth}
\centering
\begin{tabular}[t]{||c | c | c||} 
 \hline
 \textbf{Bandwidth} & \textbf{RTT} & \textbf{Buffer} \\ [0.5ex] 
 \hline\hline
 64--128\,Mbps & 16--64\,ms & 80--800 packets \\ [0.5ex] 
 \hline
\end{tabular}
\tabcaption{Network parameter ranges used during training}
\label{tab:training_range}
\end{minipage}
\end{figure}


\section{Experimenting with RayNet}
\label{experimental_evaluation}

In this section, we explore RayNet's capabilities and performance characteristics through experimentation with (1) the congestion control use case discussed in Section \ref{cc_with_raynet} and (2) a simple CartPole environment \cite{barto1983neuronlike} that we also developed in RayNet. Our aim is to showcase that RayNet meets the design principles set out in Section \ref{principles}, namely \textit{separation of environment from learning}, \textit{support for multi-agent environments}, and \textit{efficiency and scalability}.\footnote{We do not discuss the \textit{reproducibility} principle any further here; OMNeT++ simulations are deterministic, therefore any reproducibility limitations stem only from the intrinsics of RL algorithms.} First, we demonstrate how \textit{RayNet}'s design facilitates learning in the context of a complex networking task, which involves the search and optimisation of RL algorithms and respective hyper-parameters, evaluation on diverse networking environments, and analysis of multi-agent performance. Second, we demonstrate that RayNet's overhead, in terms of CPU utilisation, memory utilisation and learning efficiency, is negligible when compared to learning to perform the same baseline task (i.e., the CartPole task) using OpenAI's CartPole environment. All experiments were conducted on a Linux server with 32 CPUs and 128GB of RAM.

\subsection{Separation of environment from learning}
\label{experimentation_separation}

Here, we train models to yield efficient congestion control policies using the reward function and observations discussed in Section \ref{cc_with_raynet}. Specifically, we train a single RL agent (i.e., a sender) on the dumbbell network shown in Figure \ref{dumbel_topology}. Delay, loss rate, and maximum bandwidth for the connection are determined by the path bottleneck's traffic load, physical transmission rate, and buffer size. Consequently, we model any end-to-end network path as a single bottleneck link with propagation delay equal to the path's delay and link rate equal to the lowest link rate in the path. We adopt a `train and deploy' approach for our congestion control policy. During training, a specific RL algorithm optimises the policy: samples of experience are generated and the policy is continuously updated so as to maximise the expected cumulative reward.

\noindent\textbf{Varying environment parameters.} First, we demonstrate how RayNet enables varying the environment completely independently of its learning components. We expose the congestion control agent to a variety of network configurations, with network parameters sampled from the ranges shown in Table \ref{tab:training_range}. We train the agent for 1 million environment steps using sixteen parallel rollout workers. Each worker creates its own RL environment (i.e., simulated network) by uniformly sampling its parameters over each of the parameter ranges listed in Table \ref{tab:training_range}. Workers simultaneously produce and provide the trainer with experience by stepping their environments independently. Parallel experience gathering from multiple networking scenarios prevents the model from overfitting particular network conditions and avoids ``catastrophic forgetting'' of network scenarios for which experience had been gathered earlier in the training process \cite{abbasloo2020classic}. We define the parameterised networks using OMNeT++'s NED language and set the precise values of bottleneck bandwidth, propagation delay, and buffer size at the beginning of each episode. The agent is trained using Deep Deterministic Policy Gradient (DDPG), a model-free actor-critic algorithm based on the deterministic policy gradient that can operate over a continuous action space \cite{lillicrap2015continuous}, in conjunction with distributed prioritised experience replay \cite{horgan2018distributed}. Similar to other deep learning and reinforcement learning solutions, DDPG comes with a set of hyper-parameters, such as the replay buffer size, neural network learning rates, target network update scale, etc., which are frequently optimised for the specific task at hand through search. Optimization of the algorithm's hyper-parameters is outside the scope of this study, thus we have fixed them to the default values of the RLlib implementation. 
However, RLlib allows hyper-parameters to be easily configured using a simple key-value map. Ray also includes Ray Tune \cite{}, an automated tool for tuning hyper-parameters, which can be customised to employ sophisticated search techniques, such as grid search and Bayesian optimisation. All this advanced functionality is complementary to RayNet and accessible by default through Ray.
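
The per-worker sampling of network parameters described above can be sketched as follows. The numerical ranges are those of Table \ref{tab:training_range}; the dictionary keys, the function \texttt{sample\_network} and the seeding scheme are assumptions made for this illustration and do not reflect RayNet's configuration format.

```python
import random

# Training ranges taken from the table of network parameters.
TRAINING_RANGES = {
    "bandwidth_mbps": (64.0, 128.0),
    "rtt_ms": (16.0, 64.0),
    "buffer_packets": (80, 800),
}


def sample_network(rng):
    """Draw one network configuration uniformly from the training ranges."""
    low, high = TRAINING_RANGES["buffer_packets"]
    return {
        "bandwidth_mbps": rng.uniform(*TRAINING_RANGES["bandwidth_mbps"]),
        "rtt_ms": rng.uniform(*TRAINING_RANGES["rtt_ms"]),
        "buffer_packets": rng.randint(low, high),  # inclusive on both ends
    }


# One independently seeded configuration per rollout worker (16 workers).
configs = [sample_network(random.Random(worker)) for worker in range(16)]
print(configs[0])
```

Seeding each worker separately, as assumed here, keeps the sampled scenarios reproducible while still exposing the policy to a spread of network conditions.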

We evaluate the DDPG-trained policy on networks with parameters sampled from wider ranges than the ones used for training, so that we can assess how well the policy generalises to unseen environments. Using neural networks, deep RL models map a continuous state space to actions and/or expected rewards. Because the state space is extremely large, the model cannot explore it exhaustively during training and will eventually encounter states it has never seen before. Assessing the model's performance in regions of the state space that were not observed during training can help prevent deployment failures. We evaluate the agent by varying one of the three studied dimensions (bandwidth, propagation delay, and buffer size) within a range that includes but is broader than the respective training range (see Table \ref{tab:training_range}), while keeping the other two fixed at the mean value of the training range. We assess the performance of the agent (and its underlying congestion control policy) by measuring three key performance metrics: \textit{normalised throughput} (i.e., the measured throughput over the theoretical maximum), \textit{queuing delay} (at the network bottleneck), and \textit{packet loss}.
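
The one-dimension-at-a-time evaluation described above can be sketched as a simple scenario generator. The training ranges are those of Table \ref{tab:training_range}; the evaluation points, the key names and the \texttt{sweep} helper are hypothetical and chosen only for illustration.

```python
TRAINING_RANGES = {
    "bandwidth_mbps": (64.0, 128.0),
    "rtt_ms": (16.0, 64.0),
    "buffer_packets": (80.0, 800.0),
}


def midpoint(bounds):
    low, high = bounds
    return (low + high) / 2


def sweep(parameter, values):
    """Vary one parameter; pin the others to the middle of their range."""
    baseline = {name: midpoint(b) for name, b in TRAINING_RANGES.items()}
    return [{**baseline, parameter: value} for value in values]


# Hypothetical evaluation points that extend beyond 64-128 Mbps.
scenarios = sweep("bandwidth_mbps", [16.0, 64.0, 128.0, 256.0])
for scenario in scenarios:
    print(scenario)
```

Each generated scenario differs from the others only in the swept parameter, so any change in the measured metrics can be attributed to that parameter alone.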

\begin{figure}[t]
\centering
   \includegraphics[width=1\linewidth]{plots/bandwidth2.pdf}
   \caption{Normalised throughput, queuing delay and average loss rate of a single flow as the bottleneck bandwidth varies. Shaded region indicates the bottleneck bandwidth range used during training.}
   \label{fig:bandwidth} 



\centering
   \includegraphics[width=1\linewidth]{plots/rtt2.pdf}
   \caption{Normalised throughput, queuing delay and average loss rate of a single flow as the propagation delay varies. Shaded region indicates the propagation delay range used during training.}
   \label{fig:rtt}



\centering
   \includegraphics[width=1\linewidth]{plots/buffer2.pdf}
   \caption{Normalised throughput, queuing delay and average loss rate of a single flow as the buffer size varies. Shaded region indicates the buffer size range used during training.}
   \label{fig:buffer}
\end{figure}

Figures \ref{fig:bandwidth}, \ref{fig:rtt}  and \ref{fig:buffer} depict the (a) normalised throughput, (b) queuing delay, and (c) loss-rate for a single flow in networks when varying bandwidth, propagation delay, and buffer size, respectively. Lines depict the mean values and the blue-shaded regions depict the standard deviation. On the x-axis and y-axis, we show the varying parameter of the network and measured performance metric, respectively. The red-shaded region depicts the training range for the varying network parameter. Although analysing these results in depth is beyond the scope of this paper, for completeness, we briefly discuss the behaviour of the learned policy, noting that it is RayNet that allows such analysis through its clean separation of the environment from the learning components. 

The network's bottleneck bandwidth influences the flow's throughput, queuing delay and loss rate more than the other two dimensions, i.e., propagation delay and buffer size. In fact, if the bandwidth of the network falls within the training range, the flow achieves the highest bandwidth utilisation (Figure \ref{fig:bandwidth}.a), lowest queuing delay (Figure \ref{fig:bandwidth}.b) and negligible packet loss rate (Figure \ref{fig:bandwidth}.c). If the bottleneck bandwidth falls outside the training range, the flow's throughput degrades when the available bandwidth is less than the experienced values (Figure \ref{fig:bandwidth}.a), due to large congestion windows that overfill the bottleneck buffer, increasing queuing delays (Figure \ref{fig:bandwidth}.b) and packet loss (Figure \ref{fig:bandwidth}.c); when the available bandwidth is greater than the experienced range, the policy's control of the congestion window results in underutilisation of the bottleneck link, characterised by low throughput (Figure \ref{fig:bandwidth}.a) and no queuing delay (Figure \ref{fig:bandwidth}.b). Even when the available bandwidth is underutilised, the flow experiences packet loss (Figure \ref{fig:bandwidth}.c). This is due to increased packet loss at the conclusion of the slow-start phase in connections with a larger bandwidth-delay product (BDP), where the number of in-flight packets is larger and so is the packet loss. When the propagation delay is raised, the resultant behaviour is comparable (Figure \ref{fig:rtt}.c). Setting the propagation delay outside the training range does not affect performance as much as varying the bandwidth does. In fact, the stepping mechanism of the policy is delay agnostic, as the step duration is always set based on the minimum round trip time measured during the connection. With increased propagation delay, the BDP and the loss rate towards the end of slow start rise (Figure \ref{fig:rtt}.c). Queue build-up is also more likely (Figure \ref{fig:rtt}.b).
Variation in buffer size mostly increases queuing times, which increase linearly with buffer size (Figure \ref{fig:buffer}.b).


\begin{figure}[t]
\centering
\begin{minipage}[t]{.32\textwidth}
  \centering
  \includegraphics[width=\linewidth]{plots/episode_reward_mean.pdf}
  \captionof{figure}{Cumulative episode reward during training over the number of experience steps collected so far. Mean and standard deviation are computed across the 15 parallel episodes and seeds at each training step. }
  \label{fig:episode_reward_mean}
\end{minipage}
\hfill
\begin{minipage}[t]{.32\textwidth}
  \centering
  \includegraphics[width=\linewidth]{plots/episode_length_mean.pdf}
  \captionof{figure}{Episode length during training over the number of experience steps collected so far. Mean and standard deviation are computed across the 15 parallel episodes and seeds at each training step.}
  \label{fig:episode_len_mean}
\end{minipage}
\hfill
\begin{minipage}[t]{.32\textwidth}
  \centering
  \raisebox{0.3cm}{\includegraphics[width=\linewidth]{plots/time_over_algorithm.pdf}}
  \captionof{figure}{Duration of a training session of 1 million steps with PPO, SAC and DDPG. Error bars depict the standard deviation across training sessions with different seeds.}
  \label{fig:alg_len}
\end{minipage}
\end{figure}

\noindent\textbf{Varying learning parameters and hyper-parameters.} Discovering and optimising RL policies often requires empirical evaluation to identify the best RL algorithm and hyper-parameters for a given task, and it has been shown that specific algorithms perform better on some problems than on others \cite{xu2014reinforcement}. For instance, some tasks inherently require a stochastic decision-making policy to maximise the objective, as in Rock-Paper-Scissors, where any deterministic policy would inevitably lead to sub-optimal decisions. Even after fixing the RL algorithm for a specific use case, one needs to evaluate its efficacy across a broad range of the hyper-parameters associated with it. For example, DDPG requires fine-tuning of the exploration strategy \cite{lillicrap2015continuous}; SAC encourages exploration by including the maximisation of the policy's entropy in the reward formulation, at the cost of introducing a new \textit{temperature} hyper-parameter \cite{haarnoja2018soft} that trades off exploration against reward; PPO uses a surrogate loss function to keep the step from the old policy to the new policy within a safe range, which requires a clipping threshold, a weighted KL divergence factor or a combination of the two \cite{schulman2017proximal}. Several hyper-parameters are common to multiple RL algorithms but need to be optimised separately for each algorithm and task at hand. For example, off-policy algorithms often store experience in replay buffers, whose size must be set; the discount factor $\gamma$ is common to the majority of RL algorithm implementations; and function approximators such as neural networks bring their own set of hyper-parameters, including learning rates, loss optimisers and activation functions.
RayNet supports such exploratory studies out of the box by integrating Ray/RLlib with OMNeT++ so that the RL environment is completely separated from the learning components. To demonstrate this capability, we train RL policies in the experimental setup discussed above using three state-of-the-art algorithms: PPO \cite{schulman2017proximal}, a policy gradient algorithm; (APEX) DDPG \cite{lillicrap2015continuous}, a deterministic policy gradient algorithm with distributed prioritised experience replay; and SAC \cite{haarnoja2018soft}, a soft policy optimisation version of the actor-critic algorithm. All three algorithms are part of RLlib \cite{liang2018rllib}, and all relevant configuration is done using RLlib's APIs, completely independently of the underlying environments (which are parameterised as in the previous section).
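
For reference, the algorithm-specific hyper-parameters mentioned above enter the objectives of the cited papers as follows. The notation is that of \cite{schulman2017proximal} and \cite{haarnoja2018soft} rather than anything specific to RayNet. PPO's clipped surrogate objective, with clipping threshold $\epsilon$ and advantage estimate $\hat{A}_t$, is
\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathrm{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]
while its KL-penalised variant, with weight $\beta$, is
\[
L^{\mathrm{KLPEN}}(\theta) = \hat{\mathrm{E}}_t\left[r_t(\theta)\hat{A}_t - \beta\,\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right].
\]
SAC instead maximises an entropy-augmented return, in which the temperature $\alpha$ weights the policy entropy $\mathcal{H}$ against the reward $r$:
\[
J(\pi) = \sum_{t} \mathrm{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\,\mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right].
\]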

Figure \ref{fig:episode_reward_mean} illustrates the average cumulative episode reward attained while training the agent with the three aforementioned algorithms. Both PPO and SAC optimise a stochastic policy, that is, a distribution over actions given the state, and each policy update is constrained either by selecting a safe region for the update (PPO) or by imposing the maximum entropy principle (SAC). In both cases, the cumulative reward increases monotonically, and both algorithms converge to their asymptotic optimum after the same number of training steps (around 125K). The asymptotic cumulative reward of SAC depends on the entropy weight factor, and the default entropy maximisation strategy led to a lower cumulative reward than PPO. DDPG, meanwhile, optimises a deterministic policy. Its policy update is not constrained, and the policy change is contingent on the exploration strategy. Figure \ref{fig:episode_reward_mean} shows that an initial warm-up period of 200,000 steps of random experience gathering delays the learning of DDPG. Once the policy begins training, the cumulative reward reaches a local maximum before recovering and converging to a higher asymptotic value. This behaviour is a result of the varying duration of episodes. As per the definition of our congestion control model, an episode can end in three ways: (1) the policy creates a level of congestion that may be too difficult to recover from, at which point the episode ends; (2) the flow is entirely delivered; or (3) the episode reaches the maximum number of steps (400 in our case). In the second scenario, the policy's cumulative reward may be deceptive regarding the quality of the policy itself.
Figure \ref{fig:episode_len_mean} and Figure \ref{fig:episode_reward_mean} illustrate that longer episodes may result in greater cumulative reward; for a fixed-size flow, however, shorter episodes imply shorter flow completion times, which are a consequence of well-behaved congestion control policies. Among the three algorithms, SAC requires the longest wall clock time to complete training, twice as long as PPO and DDPG (Figure \ref{fig:alg_len}).
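
The relation between episode length and cumulative reward can be made explicit with a minimal sketch, assuming that the plotted metric is the undiscounted sum of the per-step rewards $r_t$ over an episode of $T$ steps (the symbols $G$, $r_t$ and $T$ are introduced here for illustration):
\[
G = \sum_{t=0}^{T-1} r_t, \qquad T \le 400.
\]
If the per-step rewards are mostly positive, $G$ grows with $T$, so an episode that ends early because the flow has been fully delivered can yield a lower $G$ than a longer episode produced by a slower policy. The per-step average $G/T$ does not carry this length bias.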

\subsection{Support for multi-agent environments}
\label{eval_multi_agent}

\begin{figure}[t]
\centering
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=.85\linewidth]{plots/cwnd_multi.png}
  \caption{Evolution of the congestion window of two contending flows. Flows are controlled by two independent agents, but both agents use the same policy.}
  \label{fig:cwnd_multi}
\end{minipage}
\hfill
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=.85\linewidth]{plots/throughput_multi.png}
  \caption{Throughput achieved by two contending flows governed by two independent agents using the same policy. }
  \label{fig:thr_multi}
\end{minipage}
\end{figure}

In this section, we showcase \textit{RayNet}'s support for multi-agent environments by experimenting with networks where two flows contend for resources at the shared bottleneck shown in Figure \ref{dumbel_topology}. To simplify the discussion, we only allow multiple agents to run in the network when evaluating a previously learned policy. The policy itself is learned using a single agent that experiences different networks, as discussed in the previous section. Internally, \textit{RayNet} employs exactly the same mechanisms (discussed in Section \ref{environment_internals}) to enable multiple agents to operate in an environment and communicate with Ray, whether a policy is being learned or evaluated. We run the two flows on a network with a 440-packet buffer, a 100\,Mbps bottleneck and a 35\,ms propagation delay.

Figure \ref{fig:cwnd_multi} shows the evolution of the congestion window size for the two flows. When the first agent (blue line) starts controlling the window, it brings its size to match the connection's bandwidth-delay product (BDP), which is the optimal window size for a single flow traversing an empty network path \cite{kleinrock2018internet}. When the second flow (green line) starts transmitting, its window grows exponentially as part of the Slow Start phase until loss occurs. The RL agent then takes over control and adjusts the congestion window towards the BDP of the connection. Despite the lack of multi-flow experience during training, the policy achieves a relatively fair allocation of bandwidth between the two flows (Figure \ref{fig:thr_multi}). As soon as the second flow terminates, the first flow (agent) brings its window back to the BDP.
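
As a rough worked example of the BDP in this setup, assume that the 35\,ms figure is the base round-trip time $\mathrm{RTT}_{\min}$ and that packets are 1500 bytes long (both are assumptions made here for illustration; neither is stated above). With bottleneck rate $C$:
\[
\mathrm{BDP} = C \cdot \mathrm{RTT}_{\min} = 100 \times 10^{6}\,\mathrm{bit/s} \times 0.035\,\mathrm{s} = 3.5 \times 10^{6}\,\mathrm{bits} = 437.5\,\mathrm{kB},
\]
that is, about 292 packets of 1500 bytes. Under a perfectly fair split between two flows, and ignoring queuing, each flow would keep roughly half of this amount in flight, or about 146 packets.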

\subsection{Efficiency and Scalability}
\label{experimentation_efficiency}

\begin{figure}[t]
\centering
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=0.85\linewidth]{plots/cpu_usage.pdf}
  \caption{Average CPU usage of RayNet and OpenAI Gym when training a DQN agent on the CartPole task.}
  \label{fig:cpu_usage}
\end{minipage}
\hfill
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=0.85\linewidth]{plots/ram_usage.pdf}
  \caption{Average RAM usage of RayNet and OpenAI Gym when training a DQN agent on the CartPole task.}
  \label{fig:ram_usage}
\end{minipage}
\end{figure}





So far we have shown that RayNet enables rich experimentation with RL-based network protocols by separating the environment from the learning components and by supporting multiple agents when learning and evaluating RL policies. It is crucial that such features do not come at the cost of slow and resource-hungry learning. In this section, we provide evidence that RayNet's overhead is minimal when compared to a state-of-the-art learning framework. We do so by implementing the CartPole task (\textit{CartPole-v1}) as an OMNeT++ model within RayNet and comparing its learning efficiency with that of the CartPole task implemented in OpenAI Gym.

In the CartPole task, a pole is attached by an un-actuated joint to a cart that travels along a frictionless track. The pole starts upright on the cart, and the objective is to keep it balanced by applying forces that push the cart to the left or to the right. Before detailing the specifics of the experiments, we briefly discuss our \textit{RayNet} implementation of the CartPole environment, which is based on \cite{1606.01540}, using Figure \ref{environment_internals} as a reference. The dynamics of the CartPole state transitions are implemented in a single simulation component. During environment initialisation, the \textit{Broker}, \textit{Stepper} and CartPole modules register to specific signal types so that actions, observations and rewards can be exchanged. When \textit{reset()} is called on the environment, the internal state of the CartPole component is reset to a random value of the state space. The CartPole component immediately sends the randomly generated observation to the \textit{Broker}, and the \textit{Stepper} inserts a \textsc{step} event into the queue. After retrieving the observation from the \textit{Broker} and calculating a new action, the rollout worker invokes \textit{step(action)} on the environment with the action to execute. The \textit{Broker} delivers the action to the CartPole component through a signal, and the state transition, together with the reward calculation, follows immediately. The new observation and reward are then pushed to the \textit{Broker}, the \textit{Stepper} inserts a new \textsc{step} event into the queue, and the \textit{step()} method returns.
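
For concreteness, the state transition computed by the CartPole component can be sketched as follows, assuming that it mirrors the reference Gym implementation of \textit{CartPole-v1}. The symbols are introduced here for illustration, and the constants are the Gym defaults, which may differ in the OMNeT++ model. With state $(x, \dot{x}, \theta, \dot{\theta})$, applied force $F \in \{-10, +10\}$\,N, cart mass $m_c = 1.0$\,kg, pole mass $m_p = 0.1$\,kg, pole half-length $l = 0.5$\,m and $g = 9.8\,\mathrm{m/s^2}$:
\[
u = \frac{F + m_p\, l\, \dot{\theta}^{2} \sin\theta}{m_c + m_p},
\]
\[
\ddot{\theta} = \frac{g \sin\theta - u \cos\theta}{l \left(\frac{4}{3} - \frac{m_p \cos^{2}\theta}{m_c + m_p}\right)},
\]
\[
\ddot{x} = u - \frac{m_p\, l\, \ddot{\theta} \cos\theta}{m_c + m_p}.
\]
The state is then advanced with an explicit Euler step of $\tau = 0.02$\,s, for example $x \leftarrow x + \tau \dot{x}$ and $\dot{x} \leftarrow \dot{x} + \tau \ddot{x}$ (and likewise for $\theta$ and $\dot{\theta}$), and a reward of $1$ is returned for every step.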

\begin{figure}[t]
\centering
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=0.85\linewidth]{plots/reward_over_absolute_time.pdf}
  \captionof{figure}{Mean reward over wall clock time, averaged across multiple DQN training sessions on the CartPole task with 2, 4, 8, 16, 32 and 64 parallel workers. Policies trained with the OpenAI Gym environment and with RayNet yield similar rewards.}
  \label{fig:reward_on_time}
\end{minipage}
\hfill
\begin{minipage}[t]{.45\textwidth}
  \centering
  \includegraphics[width=0.85\linewidth]{plots/reward_and_time_by_workers.pdf}
  \captionof{figure}{Relative wall clock time and reward achieved when training DQN on the CartPole task with a varying number of parallel workers. The target reward is achieved in the same wall clock time with the OMNeT++-based RayNet environment and with the OpenAI Gym environment. The maximum reward achieved depends on the number of parallel workers.}
  \label{fig:reward_time_on_workers}
\end{minipage}
\end{figure}

For both setups, we train a DQN \cite{mnih2013playing} policy using a varying number of rollout workers, between 2 and 64, that operate in parallel. For each \{environment, number of workers\} pair, we run the training with three different seeds; each run terminates upon reaching a mean cumulative reward of 450 across all parallel workers or after 2000 seconds of training. The maximum achievable cumulative reward for the CartPole-v1 environment is 500, but exploration during training can lead to suboptimal actions, so the maximum reward may never be observed. Figures \ref{fig:cpu_usage} and \ref{fig:ram_usage} show the CPU and RAM usage, respectively, when training the DQN agent with the \textit{RayNet} and OpenAI Gym environments. Given the low computational cost of modelling the mechanics of the CartPole, as the number of parallel workers increases, the CPU usage is bounded by the I/O operations required for communication between the trainer and the workers. We observe that \textit{RayNet}'s penalty for integrating OMNeT++ through the \textit{pybind11} API is negligible regardless of the number of workers producing experience in parallel. RAM utilisation grows linearly with the number of parallel workers, and both implementations use a similar amount of memory.

Crucially, the training time of the DQN agent is also the same for the two environment implementations. Figure \ref{fig:reward_time_on_workers} shows the relative wall clock time taken to train the policy and the mean episode reward achieved at the end of training. Any extra complexity added by \textit{RayNet}'s environment does not affect the time required to train the DQN agent compared to the OpenAI Gym environment. The reward achieved is sensitive to the initialisation of the neural network weights, the randomness introduced by the distributed nature of the system (e.g. arrival times of batches of experience at the trainer) and the number of parallel workers; however, the agent's cumulative reward during training is similar regardless of the environment used for training (Figure \ref{fig:reward_on_time}).

\section{Conclusion}
\label{conclusion}

In this paper, we presented \textit{RayNet}, a simulation platform for training and evaluating reinforcement learning-driven network protocols. RayNet integrates a widely used, off-the-shelf discrete-event simulator, OMNeT++, with Ray/RLlib, a distributed platform for reinforcement learning at scale. The integration is achieved through Python bindings and the signalling system implemented by OMNeT++, which together allow RL agents to control decision making in the simulated environment.

We presented a case study on the design and experimentation of an RL-driven approach to congestion control. Our results show how RayNet's design facilitates learning-based protocol development: it allows separate and extensible configuration of the learning algorithm and the networking environment; it supports multi-agent simulation, with each agent stepping independently; and it introduces minimal overhead compared to existing general-purpose training frameworks.

Our future work is on developing more use cases for RayNet, including RL-based routing and traffic engineering. RayNet is available as an open-source project for the research community to use and develop further. At the same time we are currently using RayNet in researching fair congestion control algorithms and studying existing RL-based CC models in depth.

\bibliographystyle{ACM-Reference-Format}
