
\section{Conclusions}
\label{sec:conclusion}

This paper surveys ML-based systems deployed in the real world from a DOA perspective. We offer a detailed discussion of DOA principles and how they can support software designers and developers to address the challenges that emerge from the deployment of ML algorithms as part of larger systems. We analyse to what extent existing real-world ML-based systems have adopted these principles. We have identified few works that have fully adopted the data-oriented principles to address requirements such as big-data management, efficient real-time processing, and resilience. Most of the reviewed systems partially adopt the principles. We observe that while DOA is not a widespread paradigm yet, it offers a range of properties desirable for the deployment of data-driven solutions. This fact opens a wide research agenda towards developing a better understanding of DOA and its suitability for various ML deployment scenarios. We hope that our work will increase community awareness of DOA and ignite interest in research, development, and adoption of this promising paradigm.
\section{Data-Oriented Architectures Paradigm and ML Challenges}
\label{sec:doa-principles}

\begin{figure}[t!]
	\centering
	\includegraphics[width=\textwidth]{figures/challenges-principles-map.pdf}
	\caption{Map between ML Workflow Challenges at Deployment~\cite{paleyes2022challenges} and DOA principles. The left side shows the ML challenges at deployment and the right side shows the DOA principles. The links between them indicate which challenges are addressed by which principles.}
	\label{fig:map-challenges-principles}
\end{figure}

Currently prevalent software design paradigms (e.g., SOA) provide modularity and scalability~\cite{oreilly2020microservices, aniche2019current}, which are critical for building robust systems. They rely on individual components that hide their data, expose their functionalities through interfaces (APIs), and interact via calls to these interfaces~\cite{stopford2016data}. Modularity enables the assignment of system components to different development teams (i.e., separation of concerns). However, it also makes it harder for teams to monitor the quality of data flowing between separate modules. Interface calls not only prioritise control flow, they also make the data exchanged during communication ephemeral. This obstacle can negatively impact data-driven components (e.g., ML-based components), as they perform worse when the quality of the data they process is poor. Additional efforts are needed to create external mechanisms (e.g., middleware) that monitor systems' data quality and adapt individual components accordingly~\cite{paleyes2022fbpsoa}. Recognizing this drawback of SOA, which impedes faster and smoother deployments of data-driven ML solutions, the community has proposed multiple strategies to bridge this gap, such as data meshes~\cite{dehghani2019move} or domain-oriented SOA~\cite{gluck2020doma}.

Data-Oriented Architecture (DOA) is an emerging software architecture paradigm for building systems that aims to close the same gap. Compared with SOA, DOA places strong emphasis on data that is highly available by design. Such availability facilitates data monitoring and adaptation to guarantee its quality in the whole system. DOA proposes a set of principles for building data-oriented software: considering data as a first class citizen, prioritising decentralisation, and openness~\cite{joshi2007data,vorhemus2017data,ning2019middleware}. Systems that follow these principles achieve high data availability while treating data in a reusable and maintainable way. They also display scalability, resilience, and autonomy. These properties can help software developers to mitigate the challenges that real-world environments pose to the deployment of ML, and that are difficult to address or require additional efforts when following current paradigms (e.g., SOA). Figure~\ref{fig:map-challenges-principles} maps ML deployment challenges to the DOA principles that address them. The left side of the figure shows the challenges of the ML workflow at deployment discussed by Paleyes et al.~\cite{paleyes2022challenges}, while the right side shows the DOA principles we have extracted from the literature~\cite{joshi2007data,vorhemus2017data,ning2019middleware}.

The remainder of this section discusses each DOA principle in greater detail, while in Section~\ref{sec:doa-survey} we analyse to what extent these principles have been applied in practice.

\subsection{Data as a First Class Citizen}
\label{sec:datafirst}

The integration of ML algorithms as components of larger software systems requires such systems to become data-driven. The system's outputs and overall performance depend on the quality of the data that flows through the system's components~\cite{fisler2021datacentric}. Thus, data management tasks become critical. Real-world environments challenge these tasks as they usually generate high volumes of variable data, which needs to be analysed in real time to create value for organisations and individuals. These properties are also known as ``the four Vs'' of big data: volume, variability, velocity, and value~\cite{nazabal2020data}. The integration of ML algorithms requires engineers to understand, parse, and organise large amounts of heterogeneous and dynamic data into data structures that support systems operations. The quality of the data must be monitored during system execution to evaluate performance, identify failures, and trigger adaptations. The quality of ML algorithms directly depends on the data sets collected from data management tasks. The nature of such data influences the model selection and impacts its training efficiency and hyper-parameter selection. The resulting data sets must encode the problem requirements, which are in turn used to verify and test the learning algorithms~\cite{paleyes2022challenges}. In addition, the way that data flows between system components and the quality of the data define the ethics, trust, and security requirements of systems at deployment. Components in current software design paradigms (e.g., microservices or objects) usually fall into ``The Data Dichotomy'' as they hide their data, while data management requires exposing data. The dichotomy does not suit data-related tasks as additional efforts are needed to access the data that flows through the system~\cite{paleyes2022fbpsoa}.

DOA proposes to treat \textbf{data as a first class citizen}, understanding data as the common denominator between disparate components~\cite{joshi2007data}. This means that the data in a DOA-based system is primary and the operations on data are secondary. This principle makes systems \textit{data-driven} by design, which matches the nature of ML algorithms. DOA-based systems rely on an \textit{invariant shared data model} which is processed and nourished by multiple system components. The shared data model is a single data structure equivalent to the one that data engineers build in the data management stage of the ML workflow. The key difference is that this data model is automatically built from the system components' interactions. System components do not expose any APIs, and instead interact via data mediums, where input is listened to and output is offloaded. This kind of interaction enables DOA-based systems to achieve \textit{data coupling}, considered the loosest form of coupling~\cite{OLSSON753212}. The shared data model stores the history of the system during its whole life cycle. The data that describes the system's current and past states is fully available, which facilitates data management tasks, as well as system monitoring, failure detection, and adaptation. Component behaviour is programmatically observable, traceable, and auditable. Such transparency benefits the responsible design of data-driven systems regarding ethics, trust, and security.
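
The interaction style described above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the class \texttt{SharedDataModel}, its methods, the topic names, and the threshold of 30 are all hypothetical, and an in-memory append-only log stands in for a real data medium. Components never call each other; they only listen to records and offload new ones, so the log accumulates the full history of the system.

```python
from collections import defaultdict
from typing import Any, Callable


class SharedDataModel:
    """Hypothetical append-only log that doubles as the communication medium."""

    def __init__(self) -> None:
        self.log: list[tuple[str, Any]] = []  # full history of the system
        self._listeners: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        """Register a component that listens to records on a topic."""
        self._listeners[topic].append(handler)

    def publish(self, topic: str, record: Any) -> None:
        """Append a record (never mutated afterwards) and notify listeners."""
        self.log.append((topic, record))
        for handler in list(self._listeners[topic]):
            handler(record)

    def history(self, topic: str) -> list[Any]:
        """Return every record ever published on a topic, in order."""
        return [record for t, record in self.log if t == topic]


def build_pipeline(model: SharedDataModel) -> None:
    """Wire two components that are coupled only through the data they exchange."""

    def clean(reading: Any) -> None:
        # Data-cleaning component: drops missing readings, offloads floats.
        if reading is not None:
            model.publish("clean", float(reading))

    def predict(value: float) -> None:
        # Stand-in for an ML component: flags values above an assumed threshold.
        model.publish("prediction", value > 30.0)

    model.subscribe("raw", clean)
    model.subscribe("clean", predict)


model = SharedDataModel()
build_pipeline(model)
for reading in ["21.5", None, "35"]:
    model.publish("raw", reading)

print(model.history("prediction"))
```

Because every intermediate record stays in the log, a monitoring component could be added later by subscribing to a topic or replaying its history, without changing the existing components.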

\subsection{Prioritise Decentralisation}
\label{sec:decentralise}

The breakthroughs of ML algorithms were enabled by the growth of available data and increasingly powerful hardware~\cite{lecun2015deep,dovsilovic2018explainable}. However, these enablers are not always present in systems deployed in the real world. Modern systems are logically decentralised based on microservices architectures. But these systems are also deployed in cloud-based data centers where all of their data is stored and processed. Such physical centralisation creates single points of failure that threaten systems' data availability despite the efforts of cloud infrastructure engineers (e.g., server replicas, load balancers, etc.). Cloud resources are expensive and they are not available in real-world environments which are constrained in budget and technical knowledge. These resource-constrained environments also threaten the practical adoption of ML algorithms as different stages of the ML workflow are computationally expensive. For example, the training stage is an iterative process that solves an optimization problem to find learning model parameters. This process is computationally expensive in complex models (e.g., neural networks with billions of parameters) that learn from non-quadratic, non-convex, and high dimensional data sets~\cite{judd1990neural,orr2003neural,goodfellow2017deep}. Hyperparameters improve the efficiency of the training process as well as the accuracy of the learning models~\cite{goodfellow2014qualitatively}. However, the selection of these hyperparameters is also a resource-demanding optimisation problem~\cite{paine2020hyperparameter,bischl2021hyperparameter}. ML algorithms with high computational and/or memory requirements are challenging to deploy in the real world despite their potential. Real-world environments exacerbate these requirements because they are likely to produce considerable amounts of high-dimensional and dynamic data.
Real-world systems also have low latency and high robustness requirements, and they are composed of a considerable number of components that interact with each other towards a goal~\cite{cabrera2018services,asghari2018service}. Large data-driven systems need scalable architectures that support the integration, monitoring, and adaptation of their components. 

The simplest way to deliver a shared data model (Principle~\ref{sec:datafirst}) would seem to be to centralise it, but in practice scalability requirements mean that in DOA we \textbf{prioritise decentralisation}. Such decentralisation should be logical and physical. Logical decentralisation enables organisations to scale when developing data-driven software as different development teams focus on smaller system components (i.e., separation of concerns). Physical decentralisation enables the deployment of ML models in constrained environments where there is no access to expensive computational resources. Data-driven system components should be deployed as decentralised \textit{entities that store data chunks} of the shared information model described in the previous principle. These entities first perform their operations with their local resources (i.e., \textit{local first}). If local resources (i.e., data, computing time, or storage) are not enough, entities have the ability to connect temporarily with other participants to share resources. Entities first scan their local environment for potential resources they need. They prioritise interactions with nodes in the close vicinity to share or ask for data and computing resources (i.e., \textit{peer-to-peer first}). Cloud servers are used as fallback mechanisms~\cite{vorhemus2017data}. This principle enables data replication by design as different entities can store the same data chunk. Such replication provides data availability because if one entity fails, its information is not lost. Similarly, replication provides scalability as different entities can respond to concurrent data requests. Prioritising decentralisation also alleviates the high demand for resources of ML algorithms as data-driven systems can perform their data-related tasks (e.g., ML model training) on data sets that are partitioned by design.
In addition, decentralisation creates a flexible ecosystem where resources from different devices can be used on demand. This DOA principle advocates for a more sustainable approach that prioritises the computational power available in everyday devices over expensive cloud resources. It is important to note that despite such prioritisation, there are environments where cloud resources are available and are the best option to build systems on, for example, large corporations whose information systems rely on strong cloud-based backbones. In such cases, fully decentralised architectures cannot be considered. However, data replication, partitioned data sets, and flexible resource management are DOA-enabled properties that can still benefit even partial decentralisation of ML-based systems in the cloud.
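
The lookup order implied by \textit{local first} and \textit{peer-to-peer first} can be sketched as follows. This is a simplified illustration under our own assumptions rather than an implementation from the cited literature: the \texttt{Entity} class and its \texttt{fetch} method are hypothetical, peers are plain in-process objects instead of devices discovered in the vicinity, and the cloud is modelled as a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical decentralised entity holding chunks of the shared data model."""

    name: str
    chunks: dict[str, bytes] = field(default_factory=dict)
    peers: list["Entity"] = field(default_factory=list)

    def fetch(self, key: str, cloud: dict[str, bytes]) -> tuple[bytes, str]:
        """Return (data, source), trying local storage, then peers, then the cloud."""
        # 1. Local first: use the chunk if this entity already stores it.
        if key in self.chunks:
            return self.chunks[key], "local"
        # 2. Peer-to-peer first: ask nearby entities before going further.
        for peer in self.peers:
            if key in peer.chunks:
                self.chunks[key] = peer.chunks[key]  # replicate the chunk locally
                return self.chunks[key], f"peer:{peer.name}"
        # 3. Cloud as the fallback mechanism.
        if key in cloud:
            self.chunks[key] = cloud[key]
            return self.chunks[key], "cloud"
        raise KeyError(key)


cloud = {"c": b"cloud-data"}
sensor = Entity("sensor", chunks={"a": b"local-data"})
gateway = Entity("gateway", chunks={"b": b"peer-data"})
sensor.peers.append(gateway)

# The second request for "b" is served locally because the chunk was replicated.
sources = [sensor.fetch(key, cloud)[1] for key in ("a", "b", "c", "b")]
print(sources)
```

Replicating a chunk on first access is one possible design choice; it trades local storage for availability when the peer that originally held the chunk fails.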

\subsection{Openness}
\label{sec:openness}

Data-driven systems require the development of automated mechanisms to support engineers at different stages. Such automation is mainly required because of the large amount of data these systems manage, the complex processes they perform, and the fact that their users are usually experts in domains different from AI (e.g., healthcare, physics, etc.)~\cite{waring2020automated}. AutoML emerged as a recent sub-field that aims to automate the whole ML life cycle including processes such as data processing, model selection, and hyperparameter optimisation~\cite{escalante2020automated,vaccaro2021empirical}. Real-world environments pose particular automation requirements when adopting ML algorithms, in addition to the challenges AutoML already explores. Data-driven systems rely on the interaction of a significant number of components. These components must be integrated, composed, monitored, and adapted in a way that satisfies end-to-end system quality requirements. The scale and dynamic nature of real-world environments make human intervention infeasible when performing these tasks~\cite{cabrera2019self}.

DOA proposes \textbf{openness} as a principle that data-driven systems should follow. System components should be \textit{autonomous, asynchronous} and communicate with each other using a \textit{message exchange protocol}. This principle creates open environments where system components interact autonomously~\cite{joshi2007data}. Data-driven systems can take advantage of such environments when adopting ML algorithms. Entities could perform the integration, composition, monitoring, and adaptation tasks in an autonomous and decentralised manner. This principle aims to build a self-adaptive system that creates a feedback loop to optimise the system's overall performance~\cite{lalanda2013autonomic}. Asynchronous entities produce their outputs and can subscribe to inputs at any time. These steps are public and explicit in favor of data trust, traceability, and transparency. Similarly, entities are autonomous to decide which data to store, which data to make public, and which data to hide for security and privacy~\cite{vorhemus2017data}. The message exchange protocol between asynchronous entities replaces dependencies between components in traditional architectures with asynchronous messages between data producers and consumers, achieving a loose form of coupling.
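
A minimal sketch of two asynchronous entities exchanging messages is given below. It assumes a single in-process queue as the message channel and a made-up message format with \texttt{type} and \texttt{value} fields; neither is prescribed by the DOA literature. The producer never calls the consumer, and the consumer decides autonomously which readings it keeps.

```python
import asyncio


async def producer(outbox: asyncio.Queue, readings: list[float]) -> None:
    """Offload readings as messages, then signal the end of the stream."""
    for value in readings:
        await outbox.put({"type": "reading", "value": value})
    await outbox.put({"type": "end"})


async def consumer(inbox: asyncio.Queue, threshold: float) -> list[float]:
    """Listen for messages and autonomously decide which readings to store."""
    kept: list[float] = []
    while True:
        message = await inbox.get()
        if message["type"] == "end":
            return kept
        if message["value"] >= threshold:
            kept.append(message["value"])


async def main() -> list[float]:
    channel: asyncio.Queue = asyncio.Queue()
    # Both entities run concurrently and are coupled only via the channel.
    _, kept = await asyncio.gather(
        producer(channel, [0.2, 0.9, 0.5, 0.7]),
        consumer(channel, threshold=0.5),
    )
    return kept


kept = asyncio.run(main())
print(kept)
```

In a deployed system the queue would be replaced by a network-level message medium, but the producer and consumer code would keep the same shape: each side depends only on the message format.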

\subsection{Data-Oriented Design}

While DOA as a software paradigm for ML applications is an emerging pattern, the principles behind DOA are not new. In fact, many software engineering industries have already discovered and are reaping the benefits of applying these principles. In this section, we discuss a similar paradigm known as data-oriented design (DOD), which applies many DOA principles at a lower level of abstraction, with claims of significant improvements over analogous object-oriented programming (OOP) solutions. Its primary motivation comes from the observation that OOP often results in poor data placement in memory, which leads to suboptimal usage of the CPU cache. Instead of grouping together data records that represent different traits of one entity in an object, DOD proposes to focus on transformations of data, ensuring data records are separated, grouped, or sorted according to where and when they are needed.

Video game development is one industry that has successfully applied DOD to improve memory and cache utilisation~\cite{acton2014data}. For example, Coherent Labs utilized DOD while creating their proprietary game engine Hummingbird~\cite{nikolov2018oop}. The studio has released several successful game titles using engines based on Chromium and WebKit, and the key motivation behind developing their own solution was the performance limitations of these engines. DOD was considered a key design paradigm to lean on while developing Hummingbird because of the way it separates data from logic, removes hidden states from the system, and promotes deep domain knowledge. This is in contrast with the OOP paradigm used by Chromium, where data and operations are inseparable, and where heterogeneous data is encapsulated as hidden state inside objects that are considered black boxes. As a concrete example, the authors discuss the implementation of animations in both Chromium and Hummingbird. The implementation of the animation tick in OOP is scattered across 6 classes with a non-trivial inheritance tree that gives unclear object lifetime semantics. Because of the hidden state that tracks whether the animation is active, many branch mispredictions and cache misses are observed. Finally, there is an unwanted coupling with other mechanisms, such as events or styles. In contrast, the DOD implementation is designed as a flow of data between tables through stateless operations. The authors show how existence-based predication allows the elimination of hidden states and reduces branches, and thus the problem with branch misprediction. Orientation on data allows for simple templating that improves cache hits by orders of magnitude (in the particular example used in the talk, OOP resulted in 6000 cache misses while DOD resulted in only 2). The authors report that the DOD approach gave a significant performance improvement of Hummingbird compared to Chromium, e.g., 6.12$\times$ for the animations.
They also argue, after two years of developing and maintaining Hummingbird, that their DOD solution scales better to the multi-threaded case, allows for simpler unit testing, and is easier to modify in the long run. Nevertheless, the authors stress that DOD has several downsides. In particular, correct data coupling can be difficult to find, existence-based predication is not always possible, quick modifications are easier with OOP, and the whole paradigm can be difficult to grasp for someone who is coming from an OOP background.
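
To make the idea of existence-based predication more concrete, the sketch below contrasts two ways of ticking animations. It is our own simplified illustration and not code from Hummingbird or Chromium: the \texttt{Animation} class, the function names, and the progress/speed fields are hypothetical. Python is used only to show the data layout, so the sketch does not reproduce the cache or branching behaviour reported by the authors.

```python
from dataclasses import dataclass


@dataclass
class Animation:
    """OOP-style record: the 'active' flag is state hidden inside each object."""

    progress: float
    speed: float
    active: bool


def tick_oop(animations: list[Animation], dt: float) -> None:
    """Visit every object and branch on its flag, including inactive ones."""
    for animation in animations:
        if animation.active:
            animation.progress += animation.speed * dt
            if animation.progress >= 1.0:
                animation.active = False


def tick_dod(
    progress: list[float], speed: list[float], dt: float
) -> tuple[list[float], list[float]]:
    """Existence-based variant: a row is in the tables only while it is running.

    No 'active' flag is tested; finished rows are simply not carried over
    into the tables returned for the next tick.
    """
    next_progress: list[float] = []
    next_speed: list[float] = []
    for p, s in zip(progress, speed):
        p += s * dt
        if p < 1.0:
            next_progress.append(p)
            next_speed.append(s)
    return next_progress, next_speed


# OOP: three objects, one of them already inactive.
animations = [
    Animation(0.0, 0.5, True),
    Animation(0.5, 1.0, False),
    Animation(0.75, 1.0, True),
]
tick_oop(animations, dt=0.5)
oop_running = [a.progress for a in animations if a.active]

# DOD: the inactive animation does not exist in the tables at all.
dod_progress, dod_speed = tick_dod([0.0, 0.75], [0.5, 1.0], dt=0.5)
print(oop_running, dod_progress)
```

Both variants end the tick with the same set of running animations; the difference is that the second one encodes the running state in table membership rather than in a per-object flag.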

Applications of DOD are not limited to the game industry. Mironov et al.~\cite{Mironov2021comparison} utilized DOD to improve the performance of a trading strategy backtesting utility. A trading backtester is a simulation in which a trading algorithm is tested on exchange data from the past. Backtesters are traditionally implemented with OOP, and process incoming data sequentially. The authors realized that DOD can allow for greater parallelization of the backtester. They created two functionally equivalent versions of the backtester, one using OOP and another using DOD. The DOD implementation follows the ``structure of arrays versus array of structures'' principle, reorganizing data into matrices as much as possible to streamline its processing and improve data allocation in memory. Instead of operating on a single object at a time, each operation inside the backtester was changed to process matrices of inputs, and to output intermediate results of calculations instead of passing them to downstream objects. The performance tests showed up to a 66\% performance increase for the DOD solution, as well as improved opportunities for parallelization. However, the authors note a few drawbacks they observed in the DOD solution: it has higher memory requirements, and falls back to sequential execution in cases of limited bandwidth.
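
The difference between the two layouts can be shown with a small sketch. It is not the backtester of Mironov et al.: the \texttt{Tick} and \texttt{TickColumns} types are hypothetical, and the volume-weighted average price is simply a convenient calculation we chose for illustration. The standard-library \texttt{array} module stands in for the matrices a real implementation would use.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Tick:
    """Array of structures: one object per record, fields stored together."""

    price: float
    volume: float


def vwap_aos(ticks: list[Tick]) -> float:
    """Volume-weighted average price computed object by object."""
    total_volume = sum(t.volume for t in ticks)
    return sum(t.price * t.volume for t in ticks) / total_volume


@dataclass
class TickColumns:
    """Structure of arrays: one contiguous column of doubles per field."""

    price: array
    volume: array


def vwap_soa(columns: TickColumns) -> float:
    """Same calculation expressed over whole columns."""
    weighted = sum(p * v for p, v in zip(columns.price, columns.volume))
    return weighted / sum(columns.volume)


ticks = [Tick(10.0, 2.0), Tick(12.0, 1.0), Tick(11.0, 1.0)]
columns = TickColumns(
    price=array("d", [t.price for t in ticks]),
    volume=array("d", [t.volume for t in ticks]),
)

print(vwap_aos(ticks), vwap_soa(columns))
```

Both functions return the same value; the column layout keeps each field contiguous in memory, which is the property that makes batch and parallel processing easier in lower-level languages.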

As can be seen from these use cases, DOD is strongly \textit{data driven} and has a \textit{shared data model} with components \textit{coupled via data} inputs and outputs. DOD programs are decentralized dataflow-like pipelines where each component accesses only the parts of memory it needs (similarly to the \textit{local data chunks} principle). Thus the DOA-like principles, especially around data prioritisation, emerge at a different level of abstraction with a different motivation, and nevertheless bring significant benefits to software developers.









\section{ML Applications Survey from DOA Principles Perspective}
\label{sec:doa-survey}

\begin{table}
\caption{All papers reviewed in our survey. For each paper we show whether it adopts (fully or partially) each of the DOA sub-principles discussed in Section~\ref{sec:doa-principles}.}
\resizebox{0.7\textwidth}{!}{%
\label{tab:survey}
\begin{tabular}{l|lll|lll|lll|}
\cline{2-10} & 
    \multicolumn{3}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}Data as a First \\ Class Citizen\end{tabular}}} &
    \multicolumn{3}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}Prioritise \\ Decentralisation\end{tabular}}} &
    \multicolumn{3}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}Openness\end{tabular}}}
\\ \hline
    \multicolumn{1}{|c|}{\textbf{\begin{tabular}[c]{@{}c@{}}Research\\ work\end{tabular}}}  & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Data\\ driven\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Invariant and shared \\ data model\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Data \\ coupling\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Local\\ data chunks\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Local\\ first\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Peer-to-peer\\ first\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Autonomous\\ entities\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Asynchronous\\ entities\end{tabular}}}} & 
    \multicolumn{1}{c|}{\rotatebox[origin=c]{90}{\textbf{\begin{tabular}[c]{@{}c@{}}Message exchange\\ protocol\end{tabular}}}}
\\ \hline
   
    \multicolumn{1}{|l|}{Junchen et al.~\cite{Junchen2017}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}   
\\ \hline
   
    \multicolumn{1}{|l|}{Lebofsky et al.~\cite{lebofsky2019breakthrough}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} &
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Herrero et al.~\cite{herrero2022i40}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} &
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Zhang et al.~\cite{zhang2016emotion}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Karageorgou et al.~\cite{karageorgou2020sentiment}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} &
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}

\\ \hline
   
    \multicolumn{1}{|l|}{Sultana et al.~\cite{Sultana2021}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Calancea et al.~\cite{calancea2019iassistme}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Schumann et al.~\cite{schumann2012software}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Alves et al.~\cite{alves2020industry}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{De Caro et al.~\cite{decaro2022toolkit}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Nguyen et al.~\cite{nguyen2021finger}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Xu et al.~\cite{Xu2018}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
    \multicolumn{1}{|l|}{Alonso et al.~\cite{Alonso2020}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Sarabia-J\'acome et al.~\cite{Sarabia2020}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Santana et al.~\cite{santana2020smartbuildings}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Shih et al.~\cite{shih2020warning}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Lu et al.~\cite{lu2020digitaltwin}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Brumbaugh et al.~\cite{brumbaugh2019bighead}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Shan et al.~\cite{shan2022poligraph}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Schubert et al.~\cite{schubert2021onorbit}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Dai et al.~\cite{dai2019bigdl}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\checkmark} &
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Zhang et al.~\cite{zhang202148learningadd}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Quintero et al.~\cite{Quintero2019}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Habibi et al.~\cite{habibi2019itelescope}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Gorkin et al.~\cite{gorkin2020sharkey}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Shi et al.~\cite{Shi2019}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Franklin et al.~\cite{franklin2014lida}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Bayerl et al.~\cite{Bayerl2020}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Bellocchio et al.~\cite{bellocchio2016smartseal}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark}
\\ \hline
   
    \multicolumn{1}{|l|}{Johny et al.~\cite{Johny2021}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Barachi et al.~\cite{barachi2020crowdsensing}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Salhaoui et al.~\cite{Salhaoui2020}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Hegemier et al.~\cite{Hegemier2021}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{--}
\\ \hline
   
    \multicolumn{1}{|l|}{Cabanes et al.~\cite{cabanes2019autonomous}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Agarwal et al.~\cite{agarwal2016making}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{M\"{u}ller et al.~\cite{Muller2019}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Gao et al.~\cite{gao2016icn}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Amrollahi et al.~\cite{amrollahi2020aidex}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{--} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Niu et al.~\cite{niu2017adls}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Gallagher et al.~\cite{gallagher2019intellimav}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Conroy et al.~\cite{conroy2022infection}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Falcao et al.~\cite{falcao2021piwims}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Hawes et al.~\cite{hawes2017strands}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Kemsaram et al.~\cite{kemsaram2020vision}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Qiu et al.~\cite{qiu2020phm}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
   
    \multicolumn{1}{|l|}{Ali et al.~\cite{ali2016idviewer}} & 
    
   
    \multicolumn{1}{c|}{\checkmark} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} &
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
    
   
    \multicolumn{1}{c|}{--} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}} & 
   
    \multicolumn{1}{c|}{\cellcolor{gray!25}}
\\ \hline
    \multicolumn{10}{c}{\checkmark = Adopted, -- = Partially adopted, \crule[gray!25]{1cm}{0.3cm} = Not adopted}
\end{tabular}
}
\end{table}

This section presents the survey of ML-based systems deployed in real-world environments. The main goal of this survey is to understand to what extent and how the DOA principles (Section~\ref{sec:doa-principles}) have been applied in practice. The answer to this question allows us to identify the systems' requirements that DOA principles satisfy, and the practical approaches for implementing DOA, as well as to define a research agenda to develop the next generation of DOAs for ML-based systems. 

Table~\ref{tab:survey} shows the final list of reviewed papers against the DOA principles. We found that the ML-based systems reported in these papers adopt the principles fully, partially, or not at all. The degree of adoption depends on the requirements these systems aim to satisfy, the nature of the data they handle, and the environments where they are deployed. The rest of this section quantifies and describes this adoption for each principle and sub-principle.

\subsection{Data as a First Class Citizen}

\begin{figure}[t!]
	\centering
	\includegraphics[width=.7\textwidth]{figures/data-first.pdf}
	\caption{Adoption of the data as a first-class citizen principle. All reviewed systems are data-driven, over 60\% at least partially adopt a shared data model, and just under half utilize data coupling.}
	\label{fig:data-first}
\end{figure}

Figure~\ref{fig:data-first} shows the degree to which the reviewed papers adopt the \textbf{data as a first-class citizen} principle. We found that all reviewed papers report \textit{data-driven} systems. This result is expected, as we are reviewing ML-based systems, which are \textit{data-driven} by nature. For example, Schumann et al.~\cite{schumann2012software} use a Bayesian network for health monitoring of space vehicles (e.g., rovers); Sarabia-Jacome et al.~\cite{Sarabia2020} build a DL model for fall detection using data that IoT devices collect from Ambient Assisted Living (AAL) environments; Junchen et al.~\cite{Junchen2017} propose Pytheas, a data-driven approach that optimises the Quality of Experience (QoE) of applications based on network QoE metrics; and Agarwal et al.~\cite{agarwal2016making} discuss the deployment of an RL-based decision-making system.

We found that 58.7\% of the reviewed systems fully adopt the sub-principle of handling their information using \textit{shared data models}, which are implemented using either data streams or database schemas. Systems use streams when data is generated and needs to be analysed in real time. We observed that the use of streams is particularly common for systems that are built with a dataflow architecture~\cite{culler1986dataflow,paleyes2022fbpsoa}, where data inputs are transformed by different components while they flow through the system~\cite{Quintero2019,Sultana2021,Junchen2017,cabanes2019autonomous,habibi2019itelescope,shih2020warning,barachi2020crowdsensing,decaro2022toolkit,zhang2016emotion,lebofsky2019breakthrough,lu2020digitaltwin,santana2020smartbuildings,nguyen2021finger,herrero2022i40,conroy2022infection}. A good example of a system architected in this way is described by Sultana et al.~\cite{Sultana2021}. Their software for ML model deployment with the Acumos and ONAP platforms is separated into decoupled components that exchange data via Kafka streams. Similarly, Herrero et al.~\cite{herrero2022i40} present a data-intensive platform for Industry 4.0 following the RAI4.0 reference architecture~\cite{lopez2021datacentric}, where software and hardware components are modelled as stream producers and consumers. The nature of the data also plays a role when selecting streams as the system's data model. According to our review, data that is produced continuously by different sources fits the stream data model. 
This is the case for works that process video streams~\cite{gao2016icn,Junchen2017,habibi2019itelescope,falcao2021piwims}, sensor data~\cite{schumann2012software,cabanes2019autonomous,lu2020digitaltwin,shih2020warning,barachi2020crowdsensing,zhang202148learningadd,nguyen2021finger,decaro2022toolkit,herrero2022i40,conroy2022infection}, social media data~\cite{zhang2016emotion,Xu2018,Muller2019,shan2022poligraph,brumbaugh2019bighead}, and network metrics~\cite{santana2020smartbuildings,Sultana2021}. Database schemas are more appropriate as \textit{shared data models} when system components need to store historical records~\cite{hawes2017strands,calancea2019iassistme,alves2020industry} or process large amounts of data~\cite{schubert2021onorbit}. For example, Alves et al.~\cite{alves2020industry} propose a system for industrial predictive maintenance based on historical data collected from devices, while Schubert et al.~\cite{schubert2021onorbit} introduce a deep learning-based system in which components store and retrieve data from AWS S3 buckets. The Texas Spacecraft Laboratory uses this tool to generate synthetic images and support on-orbit spacecraft operations. Thirteen percent of the reviewed papers partially adopt a \textit{shared data model} between their systems' components. This is the case for edge deployments like the ones described by Sarabia-Jacome et al.~\cite{Sarabia2020}, Alonso et al.~\cite{Alonso2020}, and Bayerl et al.~\cite{Bayerl2020}. Edge nodes in these works have common models that are used to store local data. This data is not shared with other edge nodes and is usually preprocessed and then transmitted to cloud servers where learning models are trained. Other systems combine different storage technologies in data models that are shared but not unique. 
For example, the AIDEx platform~\cite{amrollahi2020aidex} queries an electronic medical records (EMR) database and passes this data to a set of microservices that predict patients' risk of sepsis infection using their own data models. AIDEx creates patient health data streams based on the prediction results, which are stored in a MongoDB instance that is used for visualisation purposes. Karageorgou et al.~\cite{karageorgou2020sentiment} propose a system for multilingual sentiment analysis of Twitter streams. This platform uses RabbitMQ queues to collect tweets in different languages, and Kafka streams to analyse them. Just over a quarter of the reviewed papers report systems whose components do not have common \textit{shared data models}. Some of the components of these systems act as independent entities that interact towards a goal. They can act as sensors that transmit data to cloud servers~\cite{bellocchio2016smartseal,niu2017adls,gallagher2019intellimav,Salhaoui2020,gorkin2020sharkey,Hegemier2021,Johny2021} or as system subcomponents that hide their own data model~\cite{franklin2014lida,Quintero2019,Shi2019,qiu2020phm,kemsaram2020vision}. 

Systems' components in 30.4\% of the reviewed papers are designed following the \textit{data coupling} sub-principle. This means that these systems' components interact with each other by reading from and writing to data mediums. We found that components of real-time systems~\cite{schumann2012software,Junchen2017,cabanes2019autonomous,karageorgou2020sentiment,nguyen2021finger,Sultana2021,herrero2022i40,decaro2022toolkit,brumbaugh2019bighead} act as subscribers and publishers of data to streams that represent the state of the data at different stages in a workflow, again in accordance with the principles of dataflow architecture. For example, the components of Pytheas~\cite{Junchen2017} apply operations on video streams to group them in sessions and optimise their users' QoE. Stream-based systems make use of different technologies such as Apache Kafka~\cite{Junchen2017,Sultana2021,herrero2022i40} or Spark Streaming~\cite{Junchen2017,brumbaugh2019bighead}, sometimes adopting the underlying stream-based programming model for the entire system~\cite{dai2019bigdl}. Message queues (e.g., RabbitMQ) are also used to collect heterogeneous data from different sources~\cite{nguyen2021finger} or to enable interaction between components that act in parallel~\cite{karageorgou2020sentiment,lebofsky2019breakthrough}. Systems that handle large amounts of data~\cite{schubert2021onorbit} or require historical data to satisfy user needs~\cite{alves2020industry} tend to prefer databases as the data medium. Components read and write data in shared databases that reflect the data states and enable batch processing. The system proposed by Zhang et al.~\cite{zhang2016emotion} illustrates both types of \textit{data coupling} by combining streams and databases. Social media data from Weibo and Chinese forums are stored in distributed Apache Kafka streams that are processed using Apache Storm to enable real-time sentiment analysis. 
The system also offers batch processing, where social media data is stored using the Hadoop Distributed File System (HDFS) and an HBase database. The Apache Spark machine learning library (MLlib) is used to analyse such data in a distributed fashion. Systems that partially adopt \textit{data coupling} (i.e., 13\%)~\cite{gao2016icn,Muller2019,habibi2019itelescope,amrollahi2020aidex,shih2020warning,brumbaugh2019bighead} are the ones where some components communicate through data mediums and others use traditional process calls (e.g., API calls or Remote Procedure Calls (RPC)). These design decisions are driven by particular requirements that these systems need to satisfy. For example, systems can use API calls to collect data from data sources~\cite{Muller2019,amrollahi2020aidex} or to receive requests from and send responses to end users~\cite{gao2016icn}. The rest of the reviewed papers (i.e., 56.5\%) do not adopt \textit{data coupling}, as data is hidden by each system's components and these components interact using traditional process calls. They can use API calls~\cite{niu2017adls,Quintero2019,gallagher2019intellimav,gorkin2020sharkey,santana2020smartbuildings,lu2020digitaltwin,barachi2020crowdsensing,kemsaram2020vision,Johny2021,shan2022poligraph,conroy2022infection} or RPC~\cite{franklin2014lida,bellocchio2016smartseal,hawes2017strands,Xu2018,Shi2019,calancea2019iassistme,Bayerl2020,qiu2020phm,Sarabia2020,Alonso2020,Salhaoui2020,falcao2021piwims,zhang202148learningadd,Hegemier2021} for communication. We found that not all the systems where components share a common data model necessarily follow the \textit{data coupling} sub-principle. This is the case for distributed systems (e.g., edge architectures) where distributed components have a common local data model, but there is no interaction between them~\cite{Xu2018,Sarabia2020,Alonso2020,Bayerl2020,zhang202148learningadd}. 
These distributed components play the role of intermediate nodes in the systems, which collect and transmit data to centralised servers where decision-making processes take place.
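
As a consistency check on the shares reported in this subsection, the three adoption levels of each sub-principle should together cover all reviewed systems. Taking ``thirteen percent'' as $13.0\%$ and reading ``just over a quarter'' as the remaining share (an inference from the other two figures rather than a reported value), the shares add up as follows:
\[
58.7\% + 13.0\% + 28.3\% = 100.0\% \quad \mbox{(shared data model)},
\]
\[
30.4\% + 13.0\% + 56.5\% = 99.9\% \approx 100\% \quad \mbox{(data coupling)},
\]
where the $0.1\%$ gap in the second sum is consistent with rounding each share to one decimal place.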

\subsection{Prioritise Decentralisation}

\begin{figure}[t!]
	\centering
	\includegraphics[width=.7\textwidth]{figures/decentralised.pdf}
	\caption{Adoption of the prioritise decentralisation principle. While approximately half of the reviewed works at least partially follow the ``local data chunks'' and ``local first'' sub-principles, fewer than 20\% use a peer-to-peer type of communication.}
	\label{fig:decentralised}
\end{figure}

Figure~\ref{fig:decentralised} presents to what extent the reviewed papers \textbf{prioritise decentralisation} when deploying their ML-based systems. Thirty percent of the papers report systems where distributed entities store \textit{local data chunks}. These works can be classified as distributed~\cite{Junchen2017,Shi2019,Sarabia2020,Alonso2020,zhang202148learningadd} and decentralised approaches~\cite{zhang2016emotion,Xu2018,Quintero2019,Bayerl2020,shan2022poligraph,lebofsky2019breakthrough}. Distributed approaches create federated networks where servers at the edge (i.e., front-end servers) have \textit{local data storage}. Such servers transmit all or part of their data to back-end servers that have full control of the system's state. These approaches are nowadays implemented as edge architectures~\cite{Alonso2020,Sarabia2020}, which aim to improve system performance (e.g., latency, resource usage) while preserving the control provided by a central server in the back end~\cite{patel2014mobile,tabatabaee2022mecsurvey}. In decentralised approaches, there are no central entities behind the nodes. Nodes have a partial view of the system state, as each node stores part of the whole system's data. Decentralised approaches usually rely on distributed protocols and technologies. For example, Zhang et al.~\cite{zhang2016emotion} use HBase and Kafka, and Karageorgou et al.~\cite{karageorgou2020sentiment} use Spark as tools for scalable sentiment analysis on data from social media; Xu et al.~\cite{Xu2018} use Distributed Hash Tables (DHTs) to store and group social data by topic; Shan et al.~\cite{shan2022poligraph} store and replicate data using a Byzantine Fault Tolerant (BFT) distributed system; and Herrero et al.~\cite{herrero2022i40} integrate Zookeeper, Kafka, and Apache Cassandra in a decentralised platform for a predictive maintenance industrial service. Half of the reviewed papers (i.e., 50\%) report ML-based systems that rely on centralised storage. 
Some of these systems are deployed on single cloud servers~\cite{franklin2014lida,gao2016icn,habibi2019itelescope,Muller2019,amrollahi2020aidex,qiu2020phm,schubert2021onorbit,conroy2022infection}, where all data is stored. Other systems are deployed on resource-constrained devices like robots~\cite{schumann2012software,hawes2017strands,kemsaram2020vision}, or small processing units (e.g., Raspberry Pis)~\cite{cabanes2019autonomous, Johny2021,falcao2021piwims}. While cloud servers are flexible and can handle big-data requirements like the ones addressed by Habibi et al.~\cite{habibi2019itelescope} or Schubert et al.~\cite{schubert2021onorbit}, resource-constrained devices limit the amount of stored data to the point where only trained learning models are deployed~\cite{Johny2021}. Finally, there are systems that partially distribute the storage of data (i.e., 19.6\% of the papers). They have distributed devices acting as sensors, which can store some data but mainly transmit it to cloud nodes where the main processing takes place. Among these devices we found IoT sensors~\cite{bellocchio2016smartseal,niu2017adls,gallagher2019intellimav,lu2020digitaltwin,gorkin2020sharkey,barachi2020crowdsensing,santana2020smartbuildings,shih2020warning,alves2020industry,decaro2022toolkit}, smartphones~\cite{calancea2019iassistme}, robots~\cite{Hegemier2021}, and UAVs~\cite{Salhaoui2020}.

We found that 30.4\% of the reviewed systems prioritise \textit{local processing} before sending requests to centralised entities (e.g., cloud servers). Learning models are deployed on distributed computing nodes that process requests, make decisions, and provide systems' functionalities. This approach also proves popular for big data pre- and postprocessing pipelines~\cite{lebofsky2019breakthrough,gao2016icn,dai2019bigdl,karageorgou2020sentiment}. In most cases these nodes provide such functionalities without any direct cooperation between them~\cite{Junchen2017,Quintero2019,Shi2019,Bayerl2020,Sarabia2020,karageorgou2020sentiment,Johny2021,zhang202148learningadd}. Such loose coupling and decentralisation are achieved by utilising a dataflow architecture in these systems. There are few cases where distributed nodes cooperate to offer systems' functionalities. These systems also follow the \textit{peer-to-peer first} sub-principle and correspond to 13\% of the reviewed papers. Such peer-to-peer systems are based on distributed technologies to handle the complexity of decentralised computing. Examples of the use of these technologies are Zhang et al.~\cite{zhang2016emotion} using Storm and Spark, Junchen et al.~\cite{Junchen2017} based on Kafka, Xu et al.~\cite{Xu2018} using DHT, Shan et al.~\cite{shan2022poligraph} using BFT, and Herrero et al.~\cite{herrero2022i40} based on Zookeeper. Workflow orchestration tools, such as Apache Airflow, can also be used for similar purposes~\cite{brumbaugh2019bighead}. Fifteen percent of the reviewed systems partially adopt the \textit{local first} sub-principle~\cite{calancea2019iassistme,santana2020smartbuildings,Alonso2020,Salhaoui2020,Hegemier2021}. Different nodes are in charge of different stages of the learning models' life cycle in these systems. Cloud servers are usually in charge of the data management, model learning, and model verification stages, as they collect all the data of the system. 
Edge servers are in charge of the execution of the trained models (i.e., decision making) as well as some tasks related to data collection and preprocessing. We found that just over half of the papers report systems where the whole decision-making process and the full learning models' life cycles happen in central nodes. Some of these works are implemented using traditional client-server architectures~\cite{franklin2014lida,bellocchio2016smartseal,gao2016icn,Muller2019,habibi2019itelescope,amrollahi2020aidex,qiu2020phm,Sultana2021,schubert2021onorbit}, others use sensor devices to perform basic data collection tasks~\cite{niu2017adls,gallagher2019intellimav,barachi2020crowdsensing,lu2020digitaltwin,gorkin2020sharkey,shih2020warning,alves2020industry,decaro2022toolkit,conroy2022infection}, and others work in extremely resource-constrained environments, deploying the whole system on single, isolated devices~\cite{schumann2012software,hawes2017strands,cabanes2019autonomous,kemsaram2020vision,nguyen2021finger,falcao2021piwims}.

As mentioned above, systems that follow the \textit{peer-to-peer first} sub-principle account for 13\% of the reviewed papers. The rest of the papers (i.e., 87\%) do not follow this sub-principle. They are implemented either as centralised architectures~\cite{schumann2012software,franklin2014lida,gao2016icn,hawes2017strands,cabanes2019autonomous,Muller2019,habibi2019itelescope,shih2020warning,barachi2020crowdsensing,kemsaram2020vision,amrollahi2020aidex,qiu2020phm,Sultana2021,schubert2021onorbit,falcao2021piwims,agarwal2016making}, or as federated architectures where there are no interactions between nodes at the same layer (e.g., servers at the edge layer)~\cite{bellocchio2016smartseal,niu2017adls,calancea2019iassistme,Quintero2019,gallagher2019intellimav,gorkin2020sharkey,lu2020digitaltwin,Sarabia2020,Alonso2020,Salhaoui2020,alves2020industry,santana2020smartbuildings,karageorgou2020sentiment,nguyen2021finger,Johny2021,Hegemier2021,decaro2022toolkit,conroy2022infection}.

\subsection{Openness}

\begin{figure}[t!]
	\centering
	\includegraphics[width=.7\textwidth]{figures/openness.pdf}
	\caption{Adoption of openness principles. The majority of reviewed systems use autonomous entities, but only a third use them asynchronously. More than 50\% utilize message protocols.}
	\label{fig:openness}
\end{figure}

Figure~\ref{fig:openness} shows to what extent the reviewed systems adopt the \textbf{openness} principle. We found that 28.3\% of systems are based on flexible architectures where new components can be easily added. These components can be new sensor devices or entire software components. The automatic discovery and inclusion of new sensors rely on interoperable protocols that use different communication technologies to integrate sensors from different providers~\cite{schumann2012software,Alonso2020,alves2020industry,zhang202148learningadd}. For example, Alonso et al.~\cite{Alonso2020} use Fiware as a middleware to manage things (i.e., devices) joining and leaving. Architectural patterns such as DHT~\cite{Xu2018}, dataflow~\cite{Sultana2021,Junchen2017,alves2020industry,schumann2012software,shan2022poligraph,dai2019bigdl}, the observer design pattern~\cite{franklin2014lida}, or publish/subscribe~\cite{zhang2016emotion,Junchen2017,karageorgou2020sentiment,Sultana2021,decaro2022toolkit,herrero2022i40} are used to provide flexibility and autonomy at the software component level. Federated architectures (i.e., 39.1\% of the papers) partially adopt the \textit{autonomous entities} subprinciple~\cite{bellocchio2016smartseal,niu2017adls,Quintero2019,calancea2019iassistme,gallagher2019intellimav,gorkin2020sharkey,santana2020smartbuildings,barachi2020crowdsensing,Sarabia2020,qiu2020phm,shih2020warning,Salhaoui2020,lu2020digitaltwin,Johny2021,Hegemier2021,nguyen2021finger,conroy2022infection}. These architectures provide flexibility to add new devices at the sensor layer or integrate new users at any time, but they are not flexible at the edge or cloud layer. The addition of new sensors is also based on the use of different communication technologies, while the addition of edge or cloud servers requires manual effort. A third of the papers report ML-based systems whose components are not autonomous.
This is mainly because these systems are designed as self-contained~\cite{hawes2017strands,Muller2019,cabanes2019autonomous,Shi2019,kemsaram2020vision,Bayerl2020,falcao2021piwims,shan2022poligraph} with static architectures where all their components are predefined~\cite{gao2016icn,habibi2019itelescope,amrollahi2020aidex,schubert2021onorbit,ali2016idviewer}.

A third of the papers report systems that are based on \textit{asynchronous entities}. These entities (e.g., software components) interact in an asynchronous fashion. In some cases, such interaction is based on publish/subscribe protocols (e.g., MQTT or RabbitMQ) where entities act as producers and consumers of data~\cite{schumann2012software,franklin2014lida,bellocchio2016smartseal,Shi2019,habibi2019itelescope,alves2020industry,karageorgou2020sentiment,nguyen2021finger}. Data-coupled systems~\cite{zhang2016emotion,karageorgou2020sentiment,schubert2021onorbit,decaro2022toolkit,herrero2022i40}, including those following dataflow architecture~\cite{Junchen2017,Sultana2021}, also enable asynchronous communication as their components interact by writing to and reading from data mediums. Only one in twenty papers reports systems where only some of the components interact in an asynchronous fashion. Calancea et al.~\cite{calancea2019iassistme} propose a system to support visually challenged people that uses RabbitMQ to interact with end users. Similarly, Sharkeye~\cite{gorkin2020sharkey} uses the AWS Simple Notification Service (SNS) to send messages warning people about the presence of sharks via smartwatches. The rest of the reviewed papers (i.e., 63.0\%) report systems whose components interact synchronously~\cite{gao2016icn,hawes2017strands,Xu2018,Quintero2019,Muller2019,cabanes2019autonomous,gallagher2019intellimav,barachi2020crowdsensing,amrollahi2020aidex,santana2020smartbuildings,lu2020digitaltwin,Sarabia2020,Alonso2020,qiu2020phm,Bayerl2020,Salhaoui2020,kemsaram2020vision,Hegemier2021,Johny2021,falcao2021piwims,zhang202148learningadd,shan2022poligraph,ali2016idviewer,conroy2022infection}. These systems are based on traditional communication patterns such as RPC or REST calls.

Asynchronous communications require entities to know a protocol that rules when and how messages are produced and consumed. This is the case of systems whose components interact based on publish/subscribe protocols~\cite{schumann2012software,franklin2014lida,bellocchio2016smartseal,karageorgou2020sentiment,Alonso2020,alves2020industry,nguyen2021finger} or data mediums~\cite{zhang2016emotion,Junchen2017,santana2020smartbuildings,karageorgou2020sentiment,Sultana2021,schubert2021onorbit,decaro2022toolkit,herrero2022i40}. These systems fully adopt the \textit{message exchange protocol} subprinciple and correspond to 32.6\% of the reviewed papers. Just over 20 percent of systems partially adopt this subprinciple~\cite{Quintero2019,habibi2019itelescope,calancea2019iassistme,gorkin2020sharkey,Sarabia2020,lu2020digitaltwin,shih2020warning,Salhaoui2020,schubert2021onorbit,Hegemier2021,Johny2021}. These systems are based on federated architectures that use protocols to exchange messages between layers (e.g., from IoT sensors to edge servers). Finally, just under half of the papers describe systems whose components do not use any message exchange protocol. These systems are either self-contained~\cite{hawes2017strands,Shi2019,Muller2019,cabanes2019autonomous,Bayerl2020,amrollahi2020aidex,falcao2021piwims,shan2022poligraph,conroy2022infection} or use traditional RPC or API calls~\cite{gao2016icn,niu2017adls,Xu2018,gallagher2019intellimav,barachi2020crowdsensing,kemsaram2020vision,qiu2020phm,zhang202148learningadd,ali2016idviewer}.

\subsection{Summary}

In this section we summarize our findings and observations from the survey of the selected papers, and provide practical advice for practitioners towards building DOA systems.

Even though DOA can be considered an emerging concept, we have found ML-based systems~\cite{zhang2016emotion,Junchen2017,lebofsky2019breakthrough,herrero2022i40} that fully adopt the DOA principles. These systems have common requirements regarding big data management, real-time processing, and flexibility, which will become increasingly widespread in future real-world data-driven systems. We observe that the adoption of the DOA principles enables such systems to satisfy these requirements when deployed in large real-world environments.

The selection of the data models that systems' components share is influenced by the nature of the data that systems handle. Data coupling based on databases is appropriate for systems that work with data that needs to be persisted over time, while streams are a better fit for systems that handle continuous data from different data sources. In both cases, data coupling enables easier and more transparent big data management as well as more efficient real-time processing. This is mainly because data is read from and written directly to the mediums, which avoids transmitting data between systems' components as payloads of direct calls (e.g., REST calls). Distributed technologies such as Apache Kafka, Spark Streaming, HDFS, or HBase are used by the systems that partially or fully adopt the \textit{data as a first-class citizen} principle. \textbf{Practical advice:} consider using data communication mediums, such as databases, streams and message queues, to improve big data availability and management in the system.

There is a clear preference for centralised architectures in the reviewed papers. This reflects the current prevalence of cloud platforms that offer flexible services and facilitate the deployment of systems in production. Further research on distributed architectures is necessary to realise the benefits of decentralisation while also making it a feasible option for systems deployment. Edge computing has emerged as a research trend toward this goal. It proposes federated architectures with increasingly powerful edge servers. However, according to our survey, direct interaction between edge nodes is missing in most cases. Back-end cloud servers still play the main role in most of the systems. Greater collaboration between nodes at lower layers in edge architectures has the potential to enable more sustainable systems by exploiting the computing power of everyday devices. Together with distributed storage technologies (e.g., Apache Kafka and HDFS), DHTs and BFT are two distributed protocols that were used by the reviewed papers to handle the complexity caused by decentralised solutions. \textbf{Practical advice:} replacing a central orchestrator with direct communication between any two nodes of the system is a straightforward way to move towards decentralisation. Such decentralisation favours more sustainable systems that exploit the computing power of everyday devices.

Systems that collect data from data-generating components, such as sensors, are usually open. New sensors and data sources can be added to these systems based on the interoperability that current communication technologies offer (e.g., Wi-Fi, Bluetooth, and Zigbee). Architectures are more closed and static at upper levels, where software components are predefined in most cases. These components are designed as static entities which usually communicate in a synchronous fashion via RPC or REST calls (i.e., tight coupling). Data coupling and open environments have a strong correlation according to our survey. Reading from and writing to data mediums is an asynchronous process where systems' components are modelled as data consumers and producers. This process requires components to know message exchange protocols that rule how and when to read and write data. Such protocols also enable seamless and flexible architectures where components can join or leave at any time. Communication technologies such as MQTT and RabbitMQ are well-known tools on top of which open systems are built. \textbf{Practical advice:} using message exchange protocols and data coupling results in systems that are open and flexible, enabling data availability and horizontal scalability as resources are added on demand.

We have observed that systems built with high-level architectures such as dataflow and publish/subscribe turned out to be more data-oriented and followed more of the principles we discussed in Section~\ref{sec:doa-principles}. This is not surprising, as these patterns are described in terms of data exchange between components, and do not assume any coupling at the control-flow level. \textbf{Practical advice:} we recommend adopting dataflow or publish/subscribe architectures when designing DOA systems.

\subsection{Threats to validity}
In this section we discuss threats to validity and limitations of our work.

\begin{itemize}
    \item [a.] \textit{Research design validity.} Many ML deployments are not described in the scientific literature. Sometimes they are presented as blog posts, but more often their details are not published anywhere. In addition, we dismissed some of the published reports because they omitted information about their software architecture. We addressed this threat by covering a wide range of fields and areas of ML applications.

    \item [b.] \textit{Publication selection validity.} We used a multi-stage selection process to find papers for the survey. While we have followed an established methodology for such a process, this approach has its validity threats. First, we may have missed search terms while implementing lookup in digital libraries. We iteratively improved our search procedure multiple times to mitigate that risk. Second, the search functionalities differ between the databases in our automatic search. We tried to include as many databases as possible even if they overlap. Our tool is publicly available and easy to extend to include new sources in the future. Finally, the automatic filters applied to select papers could exclude relevant papers. The number of retrieved papers made it necessary to automate the filtering process. We used state-of-the-art algorithms for the automatic filters and tested them under different configurations to get the best possible result.

    \item [c.] \textit{Analysis validity.} The quality of the analysis and conclusions of this paper hinges on the expertise of its authors, and can therefore be prone to personal biases. To alleviate this risk, we sought feedback on our work, presenting intermediate results of our study internally to our research group and at external scientific events.
\end{itemize}


\section{Introduction}
\label{sec:introduction}

Artificial intelligence (AI) solutions based on machine learning algorithms have gained a lot of attention in recent years. They have been deployed to solve challenging problems in domains as varied as healthcare, agriculture, robotics, physics, and transportation~\cite{lecun2015deep}. Their success has been driven by the growth of available data, increasingly powerful hardware, and the development of novel machine learning (ML) algorithms~\cite{dovsilovic2018explainable}. Many of these algorithms were originally developed in academic environments, but their demonstrated practical value has led to their rapid adoption in real-world software systems. The contrast between real-world environments that are usually large, complex and dynamic~\cite{joshi2007data,cabrera2019self}, and the more controlled environment from which these algorithms originate, makes systems built on ML difficult to manage. The unstable nature of real-world environments presents software developers with difficult challenges when they adopt and deploy ML algorithms as part of larger systems~\cite{paleyes2022challenges}. In particular, real-world environments produce large amounts of heterogeneous, dynamic, and high dimensional data, which require ML-based systems to be scalable, adaptable, secure, and autonomous while enabling data availability, reusability, monitoring, and trust~\cite{polyzotis2018data, paleyes2022challenges, Lwakatare2020LargescaleML}. Currently, the most common software design paradigm is Service-Oriented Architecture (SOA)~\cite{oreilly2020microservices, aniche2019current}. This paradigm is formulated around services, as well as their interactions. This approach provides modularity and scalability, which are critical for building robust and available systems. 
However, SOA is not suitable to satisfy data-related requirements because services hide the data in the system behind interfaces, and do not offer any mechanism for collection, monitoring or discovery of data on the paradigm level. This situation is known as ``The Data Dichotomy'': high-quality data management is about exposing data, the services and objects paradigms hide data~\cite{stopford2016data}. Traditionally in software engineering data and logic are kept separate, to allow them to evolve independently. However, ML algorithms require data to learn, leading to a tight coupling between data and logic, thus violating some of the principles of modern software systems design. Consequently, data-related tasks require additional efforts from developers and data scientists when working with systems built with existing paradigms~\cite{paleyes2022fbpsoa}.

Data-Oriented Architecture (DOA) is an emerging software engineering paradigm that aims to support data-related tasks by design while creating loosely coupled, decentralised, scalable and open systems. DOA proposes to achieve these goals by considering data as the common denominator between disparate system components~\cite{joshi2007data,ning2019middleware}. The components in DOA are distributed, autonomous, and communicate with each other at the data level (i.e., data coupling) using asynchronous message exchange protocols~\cite{vorhemus2017data}. This design approach allows DOA-based systems to achieve desirable properties such as data availability, reusability, and monitoring, as well as systems adaptability, scalability, and autonomy~\cite{joshi2007data,vorhemus2017data,ning2019middleware}. Such properties have the potential to benefit the adoption and deployment of ML algorithms in real-world systems. For example, a DOA-based system can automatically store results from different preprocessing treatments a data engineer applies to a data set in the form of data snapshots. An ML engineer or a meta-learning system can reuse these snapshots to select the most suitable data-driven model based on the available data. Similarly, the data coupling and automatic communication between system components in DOA create open systems which can enable a transparent and autonomous integration of ML algorithms into larger real-world applications. Despite these benefits highlighted by applications in various domains~\cite{cai2019survey,bohg2013data,qin2012survey}, DOA has not yet proliferated among ML practitioners. There are no community-accepted best practices around DOA: tool recommendations, design patterns, and typical architecture choices are yet to be formulated. Additionally, while there are several works formulating high-level concepts of DOA, there is no unified research agenda for the community to work on to develop the paradigm.

This paper presents a survey of real-world systems based on ML from a data-oriented software architecture perspective. Even though the majority of existing reports on ML deployments do not mention DOA explicitly, their authors had to resolve the same challenges that DOA aims to solve, and thus implicitly embedded some of the DOA principles in their projects to achieve successful delivery. By observing commonalities among the existing deployed ML applications, we evaluate to what extent DOA principles are implemented in practice and distil a set of DOA best practices. We first introduce the DOA principles and discuss their potential to mitigate the challenges of ML algorithms deployment in real-world systems in Section~\ref{sec:doa-principles}. This established correspondence between DOA principles and ML challenges is then used to analyse research works that apply ML algorithms in different domains and identify how DOA principles are being adopted in Sections \ref{sec:survey-method} and \ref{sec:doa-survey}. We finish with an outline for open questions and research directions for the next stage of DOA paradigm development in Section~\ref{sec:open-issues}. Analysis of related work (Section~\ref{sec:related-work}) and final remarks (Section~\ref{sec:conclusion}) conclude the paper.

\section{Survey Methodology}
\label{sec:survey-method}

\begin{figure}[t]
	\centering
	\includegraphics[width=\textwidth]{figures/slr-process.pdf}
	\caption{Survey process that depicts the steps from the review need identification to the full-text reading of the selected papers. This process is based on the methodology proposed by Kitchenham et al.~\cite{kitchenham2007guidelines, Kitchenham20132049}.}
	\label{fig:survey-process}
\end{figure}

In this paper, we want to survey research works that have used ML to solve problems in different domains, and have actually been deployed and tested as systems in real-world settings. We are particularly interested in works that report the software architectures behind these systems and, if possible, the design decisions authors made toward such architectures. The selection of these works is not straightforward, as a myriad of papers that apply ML in different domains have been published in recent years. This large number of papers makes their manual selection infeasible. For this reason, we developed a semi-automatic framework based on a well-known methodology for systematic literature reviews (SLRs) in software engineering~\cite{kitchenham2007guidelines, Kitchenham20132049}. The framework is available as a public GitHub project\footnote{Semi-automatic Literature Survey: \url{https://github.com/cabrerac/semi-automatic-literature-survey}} to allow the reproducibility of this work, as well as the reusability of this framework in other surveys. This framework queries the search APIs from different digital libraries to retrieve papers' metadata (e.g., title, abstract, and citations) in an automatic fashion. It then applies syntactic and semantic filters over the retrieved papers to reduce the search space, which is manually explored to select the papers to be surveyed. Figure~\ref{fig:survey-process} depicts the stages of the survey process. It has two principal stages, which are described in the following subsections.

\begin{table}
\caption{Search query format used to retrieve papers. The query is composed of three search terms in conjunction. Each of them is replaced by the values in the second column.}
\label{tab:search-query}
\begin{tabular}{|l|l|}
\hline
\textbf{Search Query}                                    & \textless{}search\_term\_1\textgreater AND \textless{}search\_term\_2\textgreater AND \textless{}search\_term\_3\textgreater{}                                                                                  \\ \hline
\textbf{\textless{}search\_term\_1\textgreater{}} & \begin{tabular}[c]{@{}l@{}}"autonomous vehicle" OR "health" OR "industry" OR "smart cities" OR  "multimedia" OR \\ "science" OR "robotics" OR "oceanology" OR "finance" OR "space" OR "e-commerce"\end{tabular} \\ \hline
\textbf{\textless{}search\_term\_2\textgreater{}} & "machine learning"                                                                                                                                                                 \\ \hline
\textbf{\textless{}search\_term\_3\textgreater{}} & "real world" AND "deploy"                                                                                                                                                                                       \\ \hline
\end{tabular}
\end{table}

\begin{table}
\caption{Synonyms to extend queries. Search terms in the Word column are expanded using their respective synonyms.}
\label{tab:synonyms}
\begin{tabular}{|l|l|}
\hline
\multicolumn{1}{|c|}{\textbf{Word}} & \multicolumn{1}{c|}{\textbf{Synonyms}}                                                                                                                                                                          \\ \hline
"health"                            & "healthcare", "health care", "health-care", "medicine", "medical", "diagnosis".                                                                                                                                 \\ \hline
"industry"                          & "industry 4", "manufacture", "manufacturing", "factory", "manufactory", "industrial".                                                                                                                           \\ \hline
"smart cities"                      & \begin{tabular}[c]{@{}l@{}}"sustainable city", "smart city", "digital city", "urban", "city", "cities", "mobility", \\ "transport", "transportation system".\end{tabular}                                       \\ \hline
"multimedia"                        & \begin{tabular}[c]{@{}l@{}}"virtual reality", "augmented reality", "3D", "digital twin", "video games", "video", \\ "image recognition", "audio", "speech recognition", "speech".\end{tabular}                  \\ \hline
"science"                           & \begin{tabular}[c]{@{}l@{}}"pyshics", "physicology", "chemistry", "biology", "geology", "social", "maths",\\"materials", "astronomy", "climatology", "oceanology", "space".\end{tabular}                      \\ \hline
"autonomous vehicle"                & \begin{tabular}[c]{@{}l@{}}"self-driving vehicle", "self-driving car", "autonomous car", "driverless car", \\"driverless vehicle", "unmanned car", "unmanned vehicle", "unmanned aerial vehicle".\end{tabular} \\ \hline
"networking"                        & "computer network", "intranet", "internet", "world wide web".                                                                                                                                                   \\ \hline
"e-commerce"                        & "marketplace", "electronic commerce", "shopping", "buying".                                                                                                                                                      \\ \hline
"robotics"                          & "robot".                                                                                                                                                                                                        \\ \hline
"finance"                           & "banking".                                                                                                                                                                                                      \\ \hline
"machine learning"                  & \begin{tabular}[c]{@{}l@{}}"ML", "deep learning", "neural network", "reinforcement learning", \\"supervised learning", "unsupervised learning", "artificial intelligence", "AI".\end{tabular}                  \\ \hline
"deploy"                            & "deployment", "deployed", "implemented", "implementation", "software".                                                                                                                                          \\ \hline
"real world"                        & "reality", "real", "physical world".                                                                                                                                                                            \\ \hline
\end{tabular}
\end{table}


\subsection{\textbf{Planning Stage}}

The first stage consists of the review plan definition as follows:

\begin{itemize}
    
    \item [a.]\textit{Need identification:} Section~\ref{sec:doa-principles} introduced the DOA principles and how these can support software developers when addressing challenges of ML deployment in the real world. Despite these potential benefits, and several surveys in ML and its applications (Section~\ref{sec:related-work}), it is not clear yet to what extent current ML-based systems have adopted these principles. We want to conduct a survey of deployed ML-based systems from a DOA perspective to fill this gap and to identify best practices, and open research directions toward the development of the next generation of DOAs for ML systems in the real world.
    
    \item [b.]\textit{Research questions:} The main research question we want to answer with this survey is \textit{to what extent current ML-based systems have adopted the DOA principles?} The answer to this question will allow us to identify the research gaps and directions to develop the next generations of DOAs. The long-term goal of our work is to establish DOA as a mature and competitive paradigm for designing, developing, implementing, deploying, monitoring, and adapting ML-based systems.
    
    \item [c.]\textit{Search terms:} We want to search for papers that present ML-based systems deployed in real-world environments in different domains. Table~\ref{tab:search-query} shows the query format and the search terms we use to retrieve such papers. The query is composed of three search terms in conjunction (i.e., AND operator). The first term refers to popular domains where ML has been applied. The second term filters papers that apply machine learning in these domains, and the third term filters the papers that actually deploy their solution in the real world. Search engines in scientific databases use different matching algorithms. Some of them search for exact words in the papers' attributes (e.g., title or abstract), which can be too restrictive. We therefore expand these queries by including synonyms for the different words in the search terms (Table~\ref{tab:synonyms}). Synonyms extend queries using inclusive disjunction (i.e., OR operator) with their respective words.
    
    \item [d.]\textit{Source selection:} We search in the most popular scientific repositories using the APIs they offer. They are IEEEXplore\footnote{IEEEXplore API: \url{https://developer.ieee.org/}}, Springer Nature\footnote{Springer Nature API: \url{https://dev.springernature.com/}}, ScienceDirect\footnote{ScienceDirect API: \url{https://www.elsevier.com/solutions/sciencedirect/librarian-resource-center/api}}, Semantic Scholar\footnote{Semantic Scholar: \url{https://www.semanticscholar.org/product/api}}, CORE\footnote{CORE API: \url{https://core.ac.uk/services/api}}, and ArXiv\footnote{ArXiv API: \url{https://arxiv.org/help/api/}}. Some popular repositories, such as the ACM digital library, could not be used as they do not provide an API to query. Nevertheless, because of the significant overlap with other sources (papers can be published in multiple libraries, or indexed by meta-repositories such as Semantic Scholar), we are confident in the sufficient coverage of our search.
\end{itemize}
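
As an illustrative formalisation of the query construction described above (the notation $D$, $S(\cdot)$, and $E(\cdot)$ is introduced here for exposition only and is not taken from the framework's implementation), let $D$ be the set of domain words in the first search term of Table~\ref{tab:search-query}, let $S(w)$ be the set of synonyms of a word $w$ in Table~\ref{tab:synonyms}, and let $E(w)$ be the expansion of $w$ by inclusive disjunction with its synonyms:
\[
E(w) = w \lor \bigvee_{s \in S(w)} s .
\]
Under these assumptions, the logical structure of the expanded query $Q$ can be read as
\[
Q = \Bigl( \bigvee_{d \in D} E(d) \Bigr) \land E(\mbox{``machine learning''}) \land E(\mbox{``real world''}) \land E(\mbox{``deploy''}) .
\]
For instance, $E(\mbox{``robotics''}) = \mbox{``robotics''} \lor \mbox{``robot''}$. This expression describes only the logical structure of the query, since the way each repository's search engine matches individual terms may differ.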

\begin{table}
\caption{Categories and keywords for Lbl2Vec algorithm~\cite{lbl2vec2021}}
\label{tab:categories}
\begin{tabular}{|l|l|}
\hline
\multicolumn{1}{|c|}{\textbf{Category}} & \multicolumn{1}{c|}{\textbf{Keywords}}                                                                                                                           \\ \hline
"system"                                & "architecture", "framework", "platform", "tool",  "prototype".                                                                                                   \\ \hline
"software"                              & \begin{tabular}[c]{@{}l@{}}"develop", "engineering", "methodology", "architecture", "design", \\ "implementation", "open", "source", "application".\end{tabular} \\ \hline
"deploy"                                & \begin{tabular}[c]{@{}l@{}}"production", "real", "world", "embedded", "physical", "cloud", "edge",  \\ "infrastructure".\end{tabular}                            \\ \hline
"simulation"                            & "synthetic", "simulate".                                                                                                                                         \\ \hline
\end{tabular}
\end{table}

\subsection{\textbf{Conducting Stage}}

The second stage consists of the review execution as follows:

\begin{itemize}
    \item [a.]\textit{Automatic search:} We implemented clients that consume the APIs exposed by the selected repositories as part of our semi-automatic framework. In this step, each client translates the query into the format the respective API understands, submits the request, and stores the search results in separate CSV files. The search results are the metadata of the retrieved papers (e.g., title, abstract, publication data).
    \item [b.]\textit{Preprocessing of Retrieved Data:} Each API provides papers' metadata in its own format, there are duplicated papers between repositories, and some records can be incomplete (e.g., a paper missing an abstract). The preprocessing step prepares the data for the following steps in our semi-automatic framework. It joins the papers' metadata into a single file, cleans the data, and removes repeated and incomplete papers. A total of 34,932 papers were selected after this step.
    \item [c.]\textit{Syntactic and Semantic Filters:} All the data of the retrieved papers are stored in a single file after the preprocessing step, but the number of papers is still too large for manual processing. We reduce the search space by applying two filters. A syntactic filter selects the papers whose abstracts mention real-world deployments. In particular, this filter searches for the words "real world" and "deploy" and their synonyms (Table~\ref{tab:synonyms}) in the papers' abstracts. We found that, after the syntactic filtering, the selected papers can be classified into four categories. The first category includes the papers that present architectures of deployed ML-based systems, the second category includes papers that present software engineering approaches to build ML-based systems in practice, the third category includes papers that present physical implementations (e.g., edge architectures) of ML-based systems with a special focus on the infrastructure, and the final category includes papers that experiment with and evaluate ML algorithms and systems based on synthetic data and simulated environments. We used the unsupervised-learning algorithm Lbl2Vec to semantically classify the selected papers into these four groups, following the work proposed by Schopf et al.~\cite{lbl2vec2021}. Lbl2Vec requires as inputs the set of texts to classify (i.e., the selected papers' abstracts) and a set of predefined categories (Table~\ref{tab:categories}). The algorithm assigns papers to the most relevant category. We use this semantic classification to select the papers that belong to the first three categories (i.e., system, software, and deploy). These filters produced a total of 5,559 papers.
    \item [d.]\textit{Semi-automatic filtering:} The syntactic filters in the previous step reduce the set of papers to a number that is more feasible to be manually explored. Our framework in this step shows the paper information to the user in a centralised interface where papers are selected as included or excluded. Such manual selection has two stages following the methodology defined by Kitchenham et al.~\cite{kitchenham2007guidelines, Kitchenham20132049}. We select the papers by reading their abstracts in the first stage. Then, we filter the selected papers by skimming the full text. The papers that pass these two filters are part of the final set of selected papers. Manuals filters produce a total of 101 papers.
    \item [e.]\textit{Snowballing:} We use the API from Semantic Scholar to retrieve metadata of the papers that cite the selected papers (i.e., 101 papers) from the previous stage. We apply the preprocessing stage as well as the syntactic, semantic, and manual filters to the resulting papers from this snowballing process. The papers that pass these filters are added to the final set of selected papers after removing repeated papers. The preprocessing step produced 596 papers, the syntactic and automatic filters produced 18 papers from which 3 were manually selected. A total of 103 papers were selected for full-text reading after removing repeated papers.
    \item [f.]\textit{Full-text reading:} We read the 103 papers and selected 46 papers to report in this survey. We made annotations about to what extent these papers adopt the DOA principles (Section~\ref{sec:doa-principles}). The survey of these 46 papers from a data-oriented perspective is presented in Section~\ref{sec:doa-survey}.
    \item [g.]\textit{Exclusion criteria:} We exclude the following works during our semi-automatic process:
        \begin{itemize}
        \item Papers that report experiments on synthetic data or simulated environments.
        \item Papers that propose systems architectures or methodologies without a proper real-world deployment and evaluation.
        \item Papers that present isolated ML algorithms which are not part of larger systems.
        \item Papers with missing metadata that cannot be analysed by our framework (e.g., a paper without an abstract).
        \item Papers that are duplicates of already included papers.
        \item Papers that are not written in English.
        \item Survey and review papers.
        \item Thesis and report documents.
    \end{itemize}
\end{itemize}
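
The preprocessing and syntactic filtering steps described above can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions and not the implementation of our framework: the function names, the dictionary-based paper records, and the term lists are hypothetical placeholders (the actual synonyms are those of Table~\ref{tab:synonyms}), and we assume here that an abstract must match both term groups to be selected.

```python
import re

# Placeholder term lists (assumption); the real synonyms are listed in the paper's table.
REAL_WORLD_TERMS = ["real world", "real-world", "in practice", "production"]
DEPLOY_TERMS = ["deploy", "roll out", "rollout"]


def preprocess(papers):
    """Drop records without a title or abstract and remove duplicated titles."""
    seen, cleaned = set(), []
    for paper in papers:
        title = (paper.get("title") or "").strip().lower()
        abstract = (paper.get("abstract") or "").strip()
        if not title or not abstract or title in seen:
            continue
        seen.add(title)
        cleaned.append(paper)
    return cleaned


def _contains_any(text, terms):
    """True if any term occurs in the text, matched at the start of a word."""
    text = text.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def syntactic_filter(papers):
    """Keep papers whose abstract matches both term groups (assumed conjunction)."""
    return [
        paper
        for paper in papers
        if _contains_any(paper["abstract"], REAL_WORLD_TERMS)
        and _contains_any(paper["abstract"], DEPLOY_TERMS)
    ]
```

Matching at word starts lets a single stem such as \texttt{deploy} cover ``deployed'' and ``deployment'' without listing every inflection.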
\section{Research Agenda}
\label{sec:open-issues}

In the previous section, we observe that the DOA principles are not widely adopted while they actually offer the desired properties that enable systems to achieve demanding requirements at deployment. We believe more research efforts are needed to advance the community's understanding of why and how to build DOA systems and take advantage of the capabilities they provide. This section highlights new and exciting opportunities for DOA research and development.

\subsection{Systems Monitoring and Shadow Systems}

Regression testing is one of the key techniques to ensure continuous delivery of software updates to a live system. Deployment pipelines often include automated testing steps to ensure changes do not introduce new issues or degradation of performance. Unlike traditional software, DOA systems rely on data as much as they do on code, and input data is often generated by mechanisms that are outside of the developers' control. This opens an opportunity to develop new practices around monitoring of DOA systems. An approach that fits very well with the key features of DOA is a network of shadow emulators, or ``shadow system''. A statistical emulator is a probabilistic surrogate model of a given process that can be trained in a data-efficient manner and quantifies uncertainty to inform decision making \cite{emukit2019}. By exposing all intermediary data streams within the system, DOA makes it possible to create and maintain a separate emulator (or a set of emulators) that corresponds to each component of the system. Crucially, because intermediate processing data is available at arbitrary points, DOA allows us to single out any subsystem of choice and emulate it. A network of such emulators acts as a shadow system, which is capable of measuring the gap in behaviour between the real world and the software system, monitoring and identifying fluctuations in data streams, and estimating the effects of potential changes.
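
To make the idea of a single shadow emulator concrete, the following Python sketch fits a probabilistic surrogate to the observed inputs and outputs of one component and flags observations that fall outside its predictive uncertainty. It is a deliberately simple illustration: the class \texttt{LinearEmulator}, its parameters, and the three-sigma rule are hypothetical choices, and Bayesian linear regression stands in for the richer surrogate models that a real shadow system would likely need.

```python
import math


class LinearEmulator:
    """Bayesian linear regression y = w0 + w1 * x used as a toy probabilistic surrogate."""

    def __init__(self, prior_precision=1e-6, noise_precision=100.0):
        self.alpha = prior_precision  # precision of the zero-mean Gaussian prior on weights
        self.beta = noise_precision   # precision (1 / variance) of the observation noise

    def fit(self, xs, ys):
        """Compute the Gaussian posterior over (w0, w1) in closed form."""
        n = len(xs)
        sx, sxx = sum(xs), sum(x * x for x in xs)
        sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
        # Posterior precision matrix A = alpha * I + beta * Phi^T Phi (2 x 2).
        a00 = self.alpha + self.beta * n
        a01 = self.beta * sx
        a11 = self.alpha + self.beta * sxx
        det = a00 * a11 - a01 * a01
        # Posterior covariance S = A^{-1}.
        self.s00, self.s01, self.s11 = a11 / det, -a01 / det, a00 / det
        # Posterior mean m = beta * S * Phi^T y.
        self.m0 = self.beta * (self.s00 * sy + self.s01 * sxy)
        self.m1 = self.beta * (self.s01 * sy + self.s11 * sxy)
        return self

    def predict(self, x):
        """Return the predictive mean and variance of the component output at x."""
        mean = self.m0 + self.m1 * x
        var = 1.0 / self.beta + self.s00 + 2.0 * x * self.s01 + x * x * self.s11
        return mean, var

    def is_anomalous(self, x, y_observed, n_sigma=3.0):
        """Flag an observed output that deviates from the emulator by more than n_sigma."""
        mean, var = self.predict(x)
        return abs(y_observed - mean) > n_sigma * math.sqrt(var)
```

Because the emulator only consumes recorded input and output pairs, it can be attached to any data stream that the architecture exposes without modifying the emulated component.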

To move the idea of a shadow system beyond a mere concept, two main research efforts are required. First, the DOA paradigm should be developed to identify the most efficient approaches and practices around automatic fitting of surrogate models to software components. Existing work mostly focuses on auto-tuning of system parameters \cite{Alabed2022BoGraphSB, dalibard2017boat} and has limited scalability potential. Thus, more case studies are needed that illustrate the use of shadow emulators as monitoring and explainability tools for software, and that suggest scalable ways of automatically building surrogate models of systems' components. Second, innovation in the mathematical composition of individual emulators is required to build networks that can efficiently propagate uncertainty between components \cite{Damianou2013DeepGP}. While there is prior work on uncertainty propagation in software \cite{mishra2011uncertainty}, it is still uncommon to see interfaces and APIs that provide access to, or otherwise handle, input or output uncertainty. A network of shadow emulators can provide visibility into data-related uncertainty within the system, thus enabling new research directions.

\subsection{End-to-end Systems Optimisation}

Optimisation is a ubiquitous problem in production environments, where developers and users often seek to find the best configuration of a certain hardware or software tool. With the ever-growing intricacy of modern software systems, their end-to-end optimisation becomes progressively more complex. There are many sources of such complexity. First, individual components within a system may have their own parameters affecting their behaviour, and the end-to-end optimisation process has to account for them, thus growing the input space. Second, isolated optimisation of a single component can have unforeseen downstream effects, which emphasises the need for joint optimisation of all components that takes their interactions into account \cite{zeng2016joint}. Third, developers have to deal with many, often conflicting, priorities, which adds the complexity of multi-objective optimisation and Pareto front discovery \cite{Avent2020AutomaticDO}. Finally, large-scale systems can be expensive to execute, which limits the number of times their performance can be observed with different parameter values.

While not eliminating all of these challenges completely, DOA provides tools that can make them easier to tackle. Availability of intermediate data allows automated data-driven analysis of connections and interactions between components. Networks of emulators described above allow the construction of a surrogate model of the entire system, thus making multi-objective Bayesian optimisation techniques applicable \cite{paleyes2022hippo}. Clarity of data dependencies between components can allow for more informed optimisation procedures \cite{Aglietti2020CausalBO}. High-quality surrogate models can also significantly reduce the need for running the real system and thus improve experimentation.

As an example, end-to-end optimisation is widely used in recommender systems, which personalise the response to a user. As machine learning becomes prevalent online (e.g., Large Language Models~\cite{brown2020language}), aligning models with the user's intention is becoming more and more important. However, it is difficult to accurately determine how user feedback should be propagated through the system and which nodes should be credited for the outcome. When a user purchases an object online, several factors could influence the decision, such as loading time, imagery, and the order in which objects are presented. Traditionally, A/B tests are used to manually assess the various decision points in a system. However, by exposing each node and its data in a DOA, credit assignment can be automated using reinforcement learning and causal inference techniques~\cite{johnson2020causal,dubslaff2022causal}.

Another open research question around end-to-end optimisation of software systems lies in the area of ``deep emulation'' --- propagation of uncertainty in hierarchical structures arising from dataflow graphs of systems, and ability to combine multiple emulators that form a single network. New methods to represent uncertainties in hierarchical and multi-component systems are required, as well as the ability to evaluate explicit and implicit variational approximation techniques for deep structural learning.

\subsection{Edge Computing and Federated Learning}

Low latency requirements have become common for systems that must process information in real time~\cite{cabrera2022maaco}. The realisation of the Internet of Things (IoT) paradigm is exacerbating these requirements as IoT devices generate more data that needs to be collected and processed, while end users require faster responses. This new data enables novel applications in different domains (e.g., health care and virtual reality) that constitute a fertile ground for ML~\cite{Al-Fuqaha2015, M.A2015}. Edge Computing is a research area that enables low latency by processing data at the edge through small data centers closer to end users~\cite{shi2016edge,tabatabaee2022mecsurvey,cabrera2022maaco}. Such local processing implies the deployment of software components (e.g., ML models) in a distributed fashion. Federated Learning enables the training of ML models using decentralised data~\cite{bonawitz2019federated}. This approach constitutes an initial step towards the goal of ``bringing the code to the data, instead of the data to the code'' and addresses challenges related to data privacy and ownership.

Edge Computing and Federated Learning naturally complement each other. For example, service placement and offloading algorithms from the edge computing domain~\cite{tabatabaee2022mecsurvey} can be used as schedulers to optimise the training process in terms of efficiency and resource usage. These service placement and offloading algorithms are already based on ML models that can be implemented using available Federated Learning technologies such as TensorFlow Federated\footnote{TensorFlow Federated: \url{https://github.com/tensorflow/federated}}. DOAs can rely on both edge computing and Federated Learning approaches to enable decentralised systems. At the same time, DOAs can support these approaches by enabling higher data availability. Edge computing and Federated Learning approaches can use shared and exposed data models to make better-informed decisions. For example, data status can inform edge architectures about corrupted edge nodes that should not be considered by schedulers, or it can inform Federated Learning approaches about possible bias when training ML models. A successful synergy between Edge Computing, Federated Learning, and DOAs requires research efforts towards the adaptation of current edge computing approaches to the ML domain, the adaptation of complex and expensive ML models to versions that can run on resource-constrained devices, and a generalisation from Federated Learning to Federated Computation, which implies the inclusion of additional tasks from the ML life cycle in the federated framework~\cite{bonawitz2019federated}.
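
As a concrete illustration of training on decentralised data, the sketch below implements a toy version of federated averaging in plain Python: each client refines the shared model on its own data, and only the resulting weights, never the raw records, are sent back and averaged in proportion to the local sample counts. The one-parameter model $y = w x$, the function names, and the hyperparameters are hypothetical simplifications chosen for readability; production frameworks such as TensorFlow Federated expose different and much richer interfaces.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of the mean squared error of the model y = w * x.
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_averaging(client_data, rounds=20, w0=0.0):
    """Average locally trained weights, weighted by each client's sample count."""
    w = w0
    total = sum(len(data) for data in client_data)
    for _ in range(rounds):
        # Each client starts from the current global weight; raw data stays local.
        local_weights = [local_update(w, data) for data in client_data]
        w = sum(len(data) * lw for data, lw in zip(client_data, local_weights)) / total
    return w
```

In a DOA setting, the per-client sample counts and data status used for weighting are exactly the kind of metadata that a shared data model could expose to the scheduler.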

\subsection{Self-adaptive and Continual Learning Systems}

Systems must deal with evolving requirements and unexpected failures when they are deployed in dynamic real-world environments~\cite{gerasimou2019sefias}. Self-adaptive systems address these challenges by sensing possible sources of changes in the environments and triggering adaptations of the system's behaviour and configurations~\cite{lalanda2013autonomic,giese2013software}. These systems use machine learning techniques (e.g., reinforcement learning) to predict possible changes in the environment and act accordingly~\cite{gerasimou2019sefias,cabrera2019self}. One desired property of self-adaptive systems is the ability to learn new tasks without forgetting about the past ones. Continual learning (CL) systems have this ability as they focus on learning a large number of tasks without forgetting previous knowledge~\cite{liu2017lifelong}. Besides the ability to adapt to new tasks, multiple additional applications of CL for systems are possible. ML-based systems suffer from hidden feedback loops~\cite{sculley2015hidden}, and CL techniques can help mitigate these problems \cite{khritankov2021hidden}. Real-life data contains outliers and exhibits sudden distribution shifts, which can be addressed with strategies proposed in the CL literature~\cite{cai2021online}.

DOAs propose to create shared data models where systems' historic data is fully available and traceable. This high data availability enables the building of CL and self-adaptive systems that can easily access systems' past states. For example, Diethe et al. describe a reference architecture for self-adaptive systems~\cite{diethe2019continual}, noting that these systems are capable of self-maintenance and can therefore handle one of the biggest challenges modern software faces: evolving data. The suggested architecture is decentralised and stream-based, two of the core principles behind DOA. Any advances in building such a self-learning system will help develop an overall understanding of DOA benefits and challenges. Self-learning systems are a step towards life-long learning~\cite{silver2013lifelong,liu2017lifelong}, an ultimate goal of intelligent and autonomous systems research.

\subsection{Systems FITness}

There is a growing interest in understanding the impact of automated decision making on individuals and communities. In parallel, there is a legislative effort to control and mitigate the potentially negative effects of such decision-making systems. In the area of ML, a lot of attention is given to algorithmic fairness, which aims to understand the impact of ML models on different population groups, and to privacy, which aims to protect the data of individuals used for model training. Overall, the community is increasingly concerned with FIT models, that is, models that are Fair, Interpretable, and Transparent. While these research efforts are commendable, they might be shifting the focus of the community towards standalone models. Crucially, ML systems include many components in addition to the models themselves. This raises important questions about the FITness of ML systems as a whole. While understanding the behaviour of a single model is important, it is equally important to understand the behaviour of the entire system and its effect on a particular individual or group. This transition opens a range of research opportunities in understanding the FITness of DOA systems, such as the analysis of complex interactions between individual components, the propagation of the effects of data shifts, the tracing of a system output throughout the decision pipeline, and system-wide counterfactual explanations \cite{wachter2017counterfactual}.

DOA systems are ideally suited to address this shift of attention from FIT models to FIT systems because of the decomposability, data flow modelling, and traceability they exhibit. Since components in DOA software are decentralised and loosely coupled via data interfaces, any subset of connected components can be extracted and examined independently, with the entire historical data of its inputs and outputs available for fairness and interpretation analysis. Such subsets can range from an individual component (e.g., a model) to an entire system, providing engineers and analysts with flexibility. System FITness is closely related to legislative initiatives on data privacy and protection, such as GDPR or the Equal Credit Opportunity Act (ECOA). Many existing works highlight dataflow as a key feature that allows systems to successfully fulfil these legal requirements \cite{Schwarzkopf2019PositionGC, singh2018decision, akoush2022desiderata}. The DOA paradigm exposes the flow of a system's data by design, thus naturally providing support for concepts such as compliance by construction \cite{Schwarzkopf2019PositionGC}, data provenance \cite{carata2014primer}, and decision provenance \cite{singh2018decision}.

\subsection{Security and Privacy}

The DOA paradigm advocates for open systems where entities are autonomous and can freely access shared data models. These data models store systems' current and past states, which naturally raises questions regarding systems' security and privacy. Malicious entities can access and modify systems' data and behaviour at any time in such open environments. Depending on the application considered, data access and systems' components may need to be restricted to a specific set of users. For example, a healthcare management system needs to implement restrictive data access policies to avoid data privacy issues. The decentralisation principle can mitigate the security and privacy threats by storing and processing data in devices closer to end users (e.g., smartphones)~\cite{shi2016edge,tabatabaee2022mecsurvey,cabrera2022maaco}. However, managing authentication, permissions, and encryption keys in such a setup is challenging. 

Different research efforts from the security community can be applied to DOA open setups to address security and privacy challenges. Homomorphic encryption~\cite{fontaine2007survey} is an interesting direction for performing decentralised computations directly on encrypted data, without needing to provide the decryption key to the participant nodes. For example, a payment system might inquire about the validity of a transaction without having access to the underlying data (e.g., a bank account number). Early deployments of zero-knowledge proofs~\cite{fiege1987zero} and homomorphic encryption are taking place in industry~\cite{blum2019non}. These technologies offer a solution to privacy issues in decentralised networks, but the field is still in its infancy. The algorithms developed so far are often computationally expensive, so further research is needed to make them more practical on resource-constrained devices. In addition to these technical advances, security and privacy issues also require authorities to develop novel initiatives and policy frameworks to keep up with advances in technology~\cite{montgomery2021policy}.
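
The principle of computing on encrypted data can be illustrated with a toy Python version of the Paillier cryptosystem, a well-known additively homomorphic scheme: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, so an untrusted node can aggregate amounts without holding the private key. The function names and the small fixed primes are assumptions made for readability only; such key sizes are insecure, and the sketch is not meant as a basis for a real deployment.

```python
import math
import secrets


def keygen(p=293, q=433):
    """Return (public n, private (lam, mu)) from two small demo primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, lam, mu, c):
    """Recover the plaintext from ciphertext c with the private key (lam, mu)."""
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Combine two ciphertexts so that the result decrypts to the sum of the plaintexts."""
    return (c1 * c2) % (n * n)
```

Only the holder of \texttt{lam} and \texttt{mu} can read the aggregated value, while any participant that knows the public modulus can contribute to or combine ciphertexts.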
\section{Related Work}
\label{sec:related-work}

A wide range of papers have surveyed the application of AI in the last few years. They usually focus on the AI techniques and algorithms that have been applied to solve problems in specific domains. They report the challenges that these algorithms face and the open research gaps that require the development of novel AI methods. For example, Cai et al.~\cite{cai2019survey} provide a survey of multimodal data-driven techniques applied to the domain of smart healthcare. They review systems for disease analysis, triage, diagnosis, and treatment. Bohg et al.~\cite{bohg2013data} survey data-driven methodologies for robot grasping. They focus on techniques based on object recognition and pose estimation for known, familiar, and unknown objects. The review includes a comparison between data-driven methodologies and analytic approaches. Qin et al.~\cite{qin2012survey} review the state of the art of data-driven methods for industrial fault detection and diagnosis. The focus of their survey is on fault detectability and identifiability methods for industrial processes of different complexity and at different scales. Wong et al.~\cite{Wong2020TheRE} present the challenges of applying deep learning (DL) techniques in radio frequency applications. This work reviews DL applications in the radio frequency domain from the perspective of data trust, security, and hardware/software issues for DL deployment in real-world wireless communication applications. Joshi et al.~\cite{Joshi2022} survey the deployment of DL approaches at the edge. Their review presents DL model architectures, enabling technologies, and adaptation techniques. Their work also describes different metrics for DL models at the edge, which can be used to design and evaluate DL techniques.

Recent survey papers have also addressed the AutoML domain, which focuses on the automatic selection, composition, and parametrisation of learning models~\cite{waring2020automated}. Faes et al.~\cite{faes2019e232} reviewed and evaluated automated DL software and tools that healthcare professionals can use to develop medical image diagnostic classifiers. They found that these professionals can use AutoML software to develop DL algorithms whose performance is comparable to the ones applied in the existing literature. Waring et al.~\cite{waring2020automated} review the state of the art of automatic machine learning from a computer science and biomedical perspective. This paper shows that automated techniques can support experts in different ML tasks by reducing processing times. Escalante~\cite{escalante2020automated} describes the main paradigms of AutoML in the context of supervised learning. This paper surveys different research works in this area and outlines future research opportunities. Similarly, Zheng et al.~\cite{zheng2023automl} review how AutoML techniques have been applied to the domain of recommender systems. They propose a reference architecture for a recommender system and analyse the state of the art according to the architecture's components.

Previous surveys study the application and automation of data-driven algorithms in specific domains. But they do not consider the systems architectures where such algorithms are integrated and how such architectures are deployed in real-world environments. In this paper, we survey ML-based systems deployed in the real world and we consider data-driven algorithms as components of larger systems. A data-oriented perspective allows us to determine the principles that systems should follow to address the ML life-cycle challenges at deployment~\cite{paleyes2022challenges}. We use this perspective to quantify the extent to which current systems adopt these principles, and to identify their design decisions, good practices, and enabling technologies. We then outline a research agenda to define a vision for the next generation of Data-Oriented Architectures.