\section{Introduction}\label{sec:1}
\subsection{Background and Motivation}
High mobility, agility, and accessibility are representative advantages of unmanned aerial vehicles (UAVs) that can help support a wide range of services in beyond-5G or 6G networks~\cite{9044378,9428629,9459481,9023459,saad2019vision}. UAV deployments continue to grow, and the UAV fleet is predicted to reach 1.6 million units by 2024, according to the Federal Aviation Administration (FAA)~\cite{mozaffari2021toward}. Examples of UAV applications include real-time surveillance~\cite{yun2022cooperative}, package delivery~\cite{9098900}, smart factories~\cite{10012051,lee2018online}, and disaster monitoring/management~\cite{9583845}. Notably, UAVs can provide wireless communication service without a fixed terrestrial infrastructure, creating mobile access networks that offer broader wireless coverage and real-time service~\cite{tvt202106jung,8660516}.
UAVs can rapidly establish wireless connections with ground user equipment (UE) in emergencies (\textit{e.g.}, military operations or disasters) and in extreme areas where installing terrestrial base stations is difficult for technical or economic reasons~\cite{zhang2021joint, bor2016new,zeng2016wireless}.
However, constructing a UAV-enabled mobile access system is demanding because a real environment involves many unexpected dynamics and uncertainties (\textit{e.g.,} obstacles, gales, collisions, energy limitations, or malfunctions). For this reason, a machine learning algorithm that autonomously and adaptively determines the optimal UAV trajectories is a promising solution to these issues.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{Figure/Sys_architecture.pdf}
    \caption{Proposed architecture of mobile access for quantum multi-UAV systems.}
    \label{fig:sys_architecture}
\end{figure}

Although various machine learning (ML) and deep learning (DL) algorithms can serve as solutions, reinforcement learning (RL) has shown the most notable performance in sequential decision-making problems~\cite{pieee202105park}. RL can enable effective resource management in a dynamic networking system, including UAV-aided mobile access~\cite{lahmeri2021artificial}. In addition, RL scales to the control of multiple UAVs through multi-agent reinforcement learning (MARL). However, in MARL there are various interactions between UAVs, which train their policies simultaneously and affect each other's training performance. Moreover, UAVs can observe only partial information about the dynamic environment due to physical limitations. These factors make the MARL environment considerably non-stationary.

Quantum MARL (QMARL) using quantum entanglement can handle the above non-stationarity problem~\cite{oh2020tutorial}. Building quantum neural networks (QNNs) using quantum computing (QC) enables efficient use of computing resources with fewer model parameters than conventional neural networks, resulting in more reliable and faster learning~\cite{chen2020variational, yun2022quantum,9555241}.
Existing studies have performed computation based on centralized training and decentralized execution (CTDE) for the cooperation of multiple agents using a structure called \textit{CommNet}~\cite{sukhbaatar2016learning}. \textit{CommNet} receives the state information of all agents as input and learns the actions of all agents in a single neural network. If each of $m$ agents represents its state as an $n$-dimensional vector, the input size of \textit{CommNet} becomes $m\times n$, \textit{i.e.}, it scales up linearly with the number of agents. This linear scale-up is very burdensome in the QC environment because the number of usable qubits is limited in the noisy intermediate-scale quantum (NISQ) era, which is a chronic scalability issue~\cite{shor1995scheme}.

\subsection{Algorithm Design Rationale}
We address the aforementioned scalability issue in the quantum domain by using a structure in which a \textit{centralized critic} network trains multiple \textit{actor} networks iteratively~\cite{yun2022quantum}. This paper proposes a CTDE-based training algorithm, referred to as the {QMACN} algorithm, that utilizes the benefits of QC (\textit{i.e.}, quantum entanglement and efficient utilization of computing resources with fewer parameters). Using the proposed algorithm, we can realize cooperative mobile access that fully exploits the advantages of quantum computing. As observed in Fig.~\ref{fig:sys_architecture}, the CTDE-based approach can overcome the physical limitations due to weight or storage conditions (\textit{e.g.}, cryogenic environment) that arise when mounting a real quantum computer on a UAV. In addition, inter-UAV communications are not needed thanks to the information sharing provided by CTDE.

Finally, this paper designs concrete environments by taking into account the noise in the UAVs' information, which usually hinders them from behaving as intended. Considering this factor, this paper verifies through performance evaluations that UAVs can react more adaptively to environmental noise.

\subsection{Contributions} 
The contributions of this work are as follows.
\begin{itemize}
    \item This paper presents a {QMACN} algorithm enabling multiple UAVs to provide autonomous mmWave communication services cooperatively by training UAVs to find optimal trajectories in uncertain and dynamic environments.
    \item In addition, this paper proposes a CTDE framework to realize QMARL by overcoming the limitations of quantum computing, \textit{i.e.}, the quantum errors that arise when increasing the number of qubits in the NISQ era and the impossibility of mounting a quantum computer on a UAV.
    \item Moreover, we design a realistic environment by noise injection to construct robust mobile access in multi-UAV systems, where the environmental noises are modeled on real-world conditions. One of the representative characteristics of RL is that it trains agents to respond adaptively to dynamic elements.
    \item Lastly, this paper corroborates the advantages of QMARL and of reflecting realistic noise in policy training by carrying out various data-intensive evaluations of the policy training and inference processes.
\end{itemize}

\subsection{Organization} 
The rest of this paper is organized as follows. Sec.~\ref{sec:2} reviews preliminary knowledge. The details of the baseline models are described in Sec.~\ref{sec:3}. Sec.~\ref{sec:4} presents a novel quantum reinforcement learning algorithm for multi-UAV mobile access. Sec.~\ref{sec:5} evaluates the performance via data-intensive simulations. Lastly, Sec.~\ref{sec:6} concludes this paper.

\section{Quantum Machine Learning}\label{sec:2}

\begin{figure}
    \centering
    \subfigure[Classical MARL.]
    {
    \includegraphics[width=0.95\linewidth]{Figure/MARL_Layer.pdf}
    \label{fig:MARL_Layer}
    }
    \subfigure[Quantum MARL.]
    {
    \includegraphics[width=0.95\linewidth]{Figure/Quantum_Layer.pdf}
    \label{fig:Quantum_Layer}
    }
    \caption{Architectures of classical computing and quantum computing.}
\end{figure}

\subsection{Qubits}
Qubits are the basic unit of information in QC. Unlike a classical bit, which is either 0 or 1, a qubit is expressed as a combination of two basis states, $|0\rangle$ and $|1\rangle$. A qubit that is composed of both basis states is said to be in a state of \textit{superposition}~\cite{bouwmeester2000physics}. In addition, \textit{entanglement} between qubits is also possible, which can significantly increase the correlation between individual qubits. This nature of qubits allows QC to contain and control more information than classical computing. For a system of $q$ qubits, a quantum state in the Hilbert space of the system can be expressed as $|\psi\rangle=\alpha_{1}|0\cdots 0\rangle + \cdots + \alpha_{2^q}|1\cdots 1\rangle$,
where $\alpha_i$ denotes a probability amplitude satisfying $\sum^{2^q}_{i=1}|\alpha_{i}|^2=1$.
As each $\alpha_i$ is a complex number, the state of a single qubit can be expressed as a point on the Bloch sphere.
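
As a brief illustration of this notation (the specific state is chosen here only as an example and is not taken from the experiments), consider $q=2$ qubits, for which the state has $2^q=4$ amplitudes:
\[
    |\psi\rangle=\alpha_1|00\rangle+\alpha_2|01\rangle+\alpha_3|10\rangle+\alpha_4|11\rangle .
\]
Choosing $\alpha_1=\alpha_4=1/\sqrt{2}$ and $\alpha_2=\alpha_3=0$ gives the Bell state
\[
    |\psi\rangle=\frac{1}{\sqrt{2}}\left(|00\rangle+|11\rangle\right),\qquad |\alpha_1|^2+|\alpha_4|^2=\frac{1}{2}+\frac{1}{2}=1 .
\]
Measuring either qubit yields $0$ or $1$ with probability $1/2$ each, and the two outcomes always agree, which is the kind of correlation that entanglement provides.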

\subsection{Quantum Neural Network}
To design and compute a QNN using qubits, the qubits should be controllable so that the neural network can be trained. This control is achieved with basic quantum gates, which move the qubit states over the Bloch sphere. Representative examples are the rotation gates $R_x$, $R_y$, and $R_z$, which are unitary operations on a single qubit that rotate it by a specified angle about the $x$-, $y$-, and $z$-axes, respectively. These gates can not only control qubits but also encode classical data. While the rotation gates are single-qubit gates that act on one qubit at a time, there are also multi-qubit gates that act on two or more qubits simultaneously. For example, the \textit{CNOT} gate can create entanglement between qubits by performing an \textit{XOR} operation on two qubits~\cite{williams1998explorations}. QNN models can be built by assembling these gates. Conventional QNN models consist of the following three components: \textit{i)} state encoding circuit (refer to \textsf{A} in Fig.~\ref{fig:Quantum_Layer}), \textit{ii)} parameterized quantum circuit (PQC) (refer to \textsf{B} in Fig.~\ref{fig:Quantum_Layer}), and \textit{iii)} quantum measurement (refer to \textsf{C} in Fig.~\ref{fig:Quantum_Layer}) layers.
More details for these layers are as follows.

\subsubsection{State Encoding Circuit}
First, the encoding layer encodes classical data into quantum states because quantum circuits cannot take classical bits as input. The state encoder passes $q$ qubits initialized to $|0\rangle$ through an array of rotation gates whose parameters, denoted as $\theta_{enc}$, are given by the classical data. The input data $X$ is split into $[x_{1} \cdots x_{N}]$ so that each element can be used individually as a parameter, where $N$ is the number of elements into which $X$ is split. The output quantum state of this layer thus contains the information of the classical data.


\subsubsection{Parameterized Quantum Circuit}
Second, the PQC carries out the desired computation; it is analogous to a classical deep neural network (DNN), in particular to the accumulated multiplications of its hidden layers. In this layer, the input quantum state is rotated by specific angles using quantum gates such that the output gives the required values (\textit{i.e.}, action and state values). In this paper, the qubits are processed using the \textit{Controlled-Universal (CU)} gate, which has flexible control over the direction of rotation, entanglement, and disentanglement.
The structure of the QNN model in this paper is illustrated in Fig.~\ref{fig:Quantum_Layer}, where the encoding layer followed by the CU3 layer is repeated several times. This structure follows the data re-uploading technique~\cite{P_rez_Salinas_2020}, which repeatedly encodes and rotates the qubits. As a result, the computational efficiency of each qubit is maximized, \textit{i.e.}, the number of qubits required to produce the values needed for MARL is reduced.

\subsubsection{Quantum Measurement}
Lastly, the quantum state produced by the PQC becomes the input of the measurement layer. In this stage, the input is measured so that the quantum data can be decoded back into classical data for optimization. This measurement operation is equivalent to multiplication by a projection matrix with respect to the $z$-axis. While the $z$-axis is most commonly used for measurement, any other properly defined direction can be used. Once the quantum state is measured, it collapses, and the measured value is obtained as an \textit{observable}.
After the decoding procedure, the \textit{observable} is used to minimize the loss function. The loss should then be differentiated for backpropagation; however, quantum data cannot be differentiated directly because applying the chain rule would collapse the qubits. Thus, the loss gradient is obtained via the symmetric difference quotient of the loss function for QNN training.

\subsection{Quantum Reinforcement Learning}
The final \textit{observable} value of the quantum circuit represents the agent's action probability in MARL computation. It corresponds to the \textit{softmax} function illustrated in Fig.~\ref{fig:MARL_Layer}. The action distribution of the agents in the environment is computed using the information in the replay buffer. Then, as the agents move based on these actions, new state, observation, and reward data are produced for the next learning iteration~\cite{yun2022quantum}. Numerous previous studies have reported better performance than DNN-based RL methods with the same number of parameters~\cite{lockwood2020reinforcement, yun2022quantum, chen2020variational}. The algorithm proposed in this paper takes advantage of QC for robust mobile access, and its performance is evaluated against traditional MARL to show the advantage of QC in the considered problem.

\begin{figure*}[t]
    \centering
    \includegraphics[width=\linewidth]{Figure/overall_training_architecture.pdf}
    \caption{Proposed CTDE-based training pipeline for constructing the quantum multi-UAV systems.}
    \label{fig:sys_pipeline}
\end{figure*}

\section{Model}\label{sec:3}
\subsection{Mobile Access Model using Multi-UAV Networks}
We propose the {QMACN} algorithm, which uses QMARL to construct reliable and robust autonomous multi-UAV networks in a dynamic environment, as illustrated in Fig.~\ref{fig:sys_pipeline}. The proposed {QMACN} algorithm can overcome the scalability and physical issues of QC described in Sec.~\ref{sec:1}. Regarding the delay, we assume that the state of the server's queue is ideal. We also assume that if the transmission of a UAV's experience fails, the UAV retransmits until it succeeds, as in the standard TCP protocol. In a realistic CTDE environment, UAVs may encounter multifaceted uncertainties, represented as noise, when sending their experiences for training to the central server. Not only the interactions between UAVs but also various noise factors can affect the training of the agents' policies. This is the reason for using RL, and two types of noise are fed into our MARL system to reflect a realistic environment: \textit{i)} state noise and \textit{ii)} action noise. These noises are described in detail in Sec.~\ref{sec:3-c}.



\subsection{Noise Distribution Model in UAV Positioning}\label{sec:3-c}

\BfPara{State Noise}
UAVs trained by our {QMACN} algorithm determine their own locations using a global positioning system (GPS), which is commonplace in many mobility systems and platforms~\cite{mobisys2010paek}.
However, GPS receivers are subject to noise caused by, \textit{e.g.}, interference, jamming, and the delay lock loop (DLL), which can be modeled by Gaussian or non-Gaussian distributions. The state noise of the GPS sensors is modeled here by a canonical model, \textit{i.e.}, generalized Cauchy noise~\cite{liu2008performance}, as follows,
\begin{equation}
    p_{cy}(z,\sigma_z)=\frac{Y}{\left(1+v^{-1}[|z|/X]^k\right)^{v+1/k}},
    \label{eq:state_noise}
\end{equation}
where $X=\Big[\sigma_z^2\frac{\Gamma(1/k)}{\Gamma(3/k)}\Big]^{1/2}$, $Y=\frac{kv^{-1/k}\Gamma(v+1/k)}{2X\Gamma(v)\Gamma(1/k)}$, and $\Gamma(t)=\int_{0}^{\infty} x^{t-1}e^{-x}\, dx$ is the Gamma function. In the generalized Cauchy density, $k$ and $v$ characterize the impulsiveness and variance of the GPS noise~\cite{kassam2012signal}. Fig.~\ref{fig:state_noise} depicts the probability density function (PDF) of the state noise, modeled as the generalized Cauchy density with the scale parameter $\sigma_z$, where we set $k=0.20$, $v=40$, and $\sigma_z^2=0.22$~\cite{liu2008performance}.

\BfPara{Action Noise}
The wind noise can be modeled by the Weibull distribution, a widely used probability distribution for wind speed modeling, which is formulated as~\cite{feng2015modelling},
\begin{equation}
    p_{wb}(v,\sigma_v)=\left(c/A\right)\left(v/A\right)^{c-1}\exp\left(-\left(v/A\right)^{c}\right),
    \label{eq:action_noise}
\end{equation}
where $v$ is the wind velocity, and $A=\frac{\Bar{v}}{\Gamma(1+1/c)}$ and $c=\left(\frac{\sigma_v}{\Bar{v}}\right)^{-1.086}$ stand for the scale and shape parameters, computed from the mean $\bar{v}$ and the standard deviation $\sigma_v$ of the wind speed data~\cite{feng2015modelling}. The meteorological measurements, $145{,}048$ data points in total, were collected over three years (from June 1, 1999 to May 31, 2002). Fig.~\ref{fig:action_noise} shows the PDF of the Weibull distribution with $A=10.97\,\mathrm{m/s}$ and $c=2.29$, which represents the probability of the wind speed. In addition, Fig.~\ref{fig:wind_direction} shows the probability of each of the $12$ directions from which the wind can blow.
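
As a quick consistency check on these parameters (a sketch that assumes the standard two-parameter Weibull form with scale $A$ and shape $c$), the mean wind speed implied by $A=10.97\,\mathrm{m/s}$ and $c=2.29$ is approximately
\[
    \bar{v}=A\,\Gamma\!\left(1+1/c\right)\approx 10.97\times\Gamma(1.437)\approx 10.97\times 0.886\approx 9.72~\mathrm{m/s}.
\]
Under the same assumption, the cumulative distribution function is $F(v)=1-\exp\left(-(v/A)^{c}\right)$, so a wind-speed sample could be drawn by inverse-transform sampling from a uniform random number $u\in(0,1)$:
\[
    v=A\left(-\ln(1-u)\right)^{1/c}.
\]
The sampling procedure used in the simulations is not specified here; this formula only illustrates one way in which the action noise could be generated.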

\begin{figure}[ht]
    \centering
    \subfigure[State Noise.]{
    \includegraphics[width=0.465\linewidth]{Figure/Caucy_Noise_Model-eps-converted-to.pdf}
    \label{fig:state_noise}
    }
    \subfigure[Action Noise.]{
    \includegraphics[width=0.465\linewidth]{Figure/Weibull_Distribution-eps-converted-to.pdf}
    \label{fig:action_noise}
    }\\
    \subfigure[Wind Direction.]{
    \includegraphics[width=\linewidth]{Figure/Direction_Prob-eps-converted-to.pdf}
    \label{fig:wind_direction}
    }
    \caption{Probability distribution functions of state and action noises.}
    \label{fig:Noise_model}
\end{figure}

\BfPara{Modeling the Gap between Real-World and Ideal}
Considering the aforementioned noise models, we can reconfigure the \textit{actor-critic} model. This paper considers both \textit{ideal} and \textit{real} state/action cases. The ideal case does not consider the noise in the environment, whereas the real case does. The relationship between these cases can be modeled as follows,
\begin{align}
    s_{\textrm{real}} & = s_{\textrm{ideal}} + n_s,
    \\
    a_{\textrm{real}} & = a_{\textrm{ideal}} + n_a,
\end{align}
where $n_s$ and $n_a$ stand for the noises of states and actions. Note that $n_s\sim p_{cy}(z, \sigma_z)$ with $z\in\{x,y\}$ and $n_a \sim p_{wb}(v, \sigma_v)$, where $x$ and $y$ are the coordinates of the UAVs' positions in the two-dimensional plane.

\section{MARL Algorithm for Multi-UAV Cooperation}\label{sec:4}
\subsection{MARL Formulation}


Our multi-UAV system is mathematically defined as a decentralized partially observable Markov decision process (Dec-POMDP), which is widely used in the CTDE MARL framework~\cite{rashid2020monotonic}, because each UAV conducts distributed sequential decision-making with partial environment information due to its physical limitations.
The Dec-POMDP of $M$ UAVs can be denoted as $\langle\mathcal{S}, \mathcal{A}, P, r, \mathcal{O}, \mathcal{Z}, \gamma\rangle$ where
    $\mathcal{S}$ is a set of states with $s\in\mathcal{S}$;
    $\mathcal{A}$ is a set of actions, where ${a_m}\in A_m \subset \mathcal{A}$ denotes the $m$-th UAV's action, $A_m$ is the set of the $m$-th UAV's actions, and the individual actions compose the joint UAV action $\textbf{a}\in \mathcal{A}$;
    $P$ is the state transition probability $P(s'\,|\,s,\textbf{a}):\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ under the joint UAV action $\textbf{a}$;
    $r$ is a shared reward given equally to all UAVs, with $r(s,\textbf{a},s'):\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$.
    In addition,
    $\mathcal{O}$ is a set of observations, where the $m$-th UAV's observation is denoted as $o_m\in O_m \subset \mathcal{O}$ and $O_m$ is the set of observations of the $m$-th UAV. The joint UAV observation for training is denoted as $\textbf{o}\in\mathcal{O}$.
    $\mathcal{Z}$ is the conditional observation probability function $\mathcal{Z}(s',\textbf{a},\textbf{o})$, and $\gamma$ is a discount factor.

When the $m$-th UAV takes action $a_m$ while observing $o_m$ according to $\mathcal{Z}(s',\textbf{a}, \textbf{o}):\mathcal{S}\times\mathcal{A}\times\mathcal{O}\rightarrow[0,1]$, the state is updated based on $P$. The reward is then generated by $r(s,\textbf{a},s')$. More details about the observations, states, actions, rewards, and objective are as follows.

\BfPara{Observations}
The $m$-th UAV can partially observe its own environment information, which is composed of its position $p_m=(x_m,y_m)$ and energy state $e_m$, where $x_m$ and $y_m$ stand for Cartesian coordinates. In addition, the UAV can observe the distance between its own position and that of any other UAV within the observable scope of the $m$-th UAV, \textit{i.e.},
\begin{equation}
    d_{mm'} = \begin{cases}
        \|p_m - p_{m'}\|_2,& \text{if }\|p_m - p_{m'}\|_2 \leq D_{\textit{th}}, \\
        -1,& \text{otherwise},
    \end{cases}
\end{equation}
where $D_{\textit{th}}$ and $\|\cdot\|_2$ denote the observation scope and the L2-norm. The observation of the $m$-th UAV is defined as $o_m \triangleq\{p_m, e_m, \bigcup_{m'=1}^{M}\{d_{mm'}\}\}$, $\forall m,m' \in \{1,\dots,M\}$.
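
For illustration, with hypothetical values that are not taken from the evaluation setup, let $D_{\textit{th}}=10$, $p_1=(0,0)$, $p_2=(3,4)$, and $p_3=(30,40)$. Then
\[
    d_{12}=\sqrt{3^2+4^2}=5\leq D_{\textit{th}},\qquad \|p_1-p_3\|_2=\sqrt{30^2+40^2}=50> D_{\textit{th}}\;\Rightarrow\; d_{13}=-1,
\]
so the first UAV observes its distance to the second UAV, whereas the third UAV lies outside its observation scope and is marked by $-1$.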

\BfPara{States} 
The state consists of two service details in mobile access: \textit{i)} availability of service $c_{mn}\in\{0,1\}$ and \textit{ii)} quality of service (QoS) $q_{mn}$. The QoS of the $n$-th user supported by the $m$-th UAV is formulated as follows,
\begin{equation}
    q_{mn}=
    \begin{cases}
        \left(1+\exp\left(-w_{a}\left(\kappa_{mn}-w_{b}\right)\right)\right)^{-1},\;\textit{(video traffic)},\\
        \log\left(w_{c}\cdot\kappa_{mn}+w_{d}\right),\;\textit{(otherwise)},
    \end{cases}
\label{eq:quality}
\end{equation}
where $\kappa_{mn}$ is the data rate of the $n$-th user served by the $m$-th UAV. In addition, the values of the weight parameters are $w_a$\,=\,0.01, $w_b$\,=\,1024, $w_c$\,=\,1, and $w_d$\,=\,1~\cite{jung2021infrastructure}.
Here, all UAVs have the same state information simultaneously because the total service is affected by all UAVs.
Accordingly, the state information can be denoted as $s\triangleq \bigcup^{M}_{m=1}\bigcup^{N}_{n=1}\{c_{mn}, q_{mn}\}$.
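
As a worked illustration of~\eqref{eq:quality} with the stated weights (the data-rate values below are chosen only as examples, and the unit of $\kappa_{mn}$ is left as in the cited model), video traffic served at $\kappa_{mn}=w_b=1024$ sits at the midpoint of the sigmoid, and a rate higher by $100$ gives
\[
    q_{mn}\big|_{\kappa_{mn}=1024}=\frac{1}{1+e^{0}}=0.5,\qquad
    q_{mn}\big|_{\kappa_{mn}=1124}=\frac{1}{1+e^{-0.01\times 100}}=\frac{1}{1+e^{-1}}\approx 0.73 .
\]
For the other traffic type, $w_c=w_d=1$ gives $q_{mn}=\log(\kappa_{mn}+1)$, which equals $0$ at $\kappa_{mn}=0$ and grows logarithmically with the data rate.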

\BfPara{Actions}
All UAVs take actions sequentially based on their policies at time step $t$. UAVs can move in the four cardinal directions because they operate in the two-dimensional Cartesian coordinates $(x,y)\in\mathbb{R}^2$. Therefore, the set of actions that the UAVs can take is $\mathrm{a}\triangleq\{x_m \pm (v_m \times t), y_m \pm (v_m \times t)\}_{m=1}^M$, where $v_m$ is the $m$-th UAV's velocity.

\BfPara{Rewards}
The reward $r(s,\textbf{a},s')$ is generated from the current state $s$ and all UAVs' selected actions $\textbf{a}$. The reward function is as follows,
\begin{equation}
    r(s,\textbf{a},s')=w_c\times\sum_{m=1}^{M}\sum_{n=1}^{N}\left(c_{mn} \times q_{mn}\right)\times\tau\times\mathbbm{1}(e_m),
    \label{eq:reward}
\end{equation}
where $w_c$ is a reward weight that makes the learning process more stable; $\tau$ is the overlapped rate of all UAVs at $t$, which has to be decreased to reduce interference among them; and $\mathbbm{1}(\cdot)$ is an indicator function that indicates whether its argument, the energy state $e_m$, is zero or not. According to this reward function, UAVs therefore try to maximize the supported rate and QoS of the ground users within the energy limit.

\BfPara{Objective}
Our main objective in MARL is formulated as,
\begin{equation}
\begin{split}
    & \pi^*_{\boldsymbol{\theta}} = \\
    & \argmaxD_{\boldsymbol{\theta}}\mathbb{E}_{s_{\textrm{real}}\sim E,\,\mathrm{a}_{\textrm{real}}\sim\pi_{\boldsymbol{\theta}}}\left[\sum_{t=1}^T\gamma^{t-1}\!\cdot\!r\left(s_{\textrm{real}},\mathbf{a}_{\textrm{real}},s'_{\textrm{real}}\right)\right],
    \label{eq:goal}
\end{split}
\end{equation}
where $E$, $T$, and $\gamma \in [0,1)$ stand for the environment where UAVs exist, the episode length, and the discount factor, respectively.
By redefining the objective function in RL, UAVs can take realistic environmental noise into account (\textit{i.e.,} $s_{\textrm{real}}$, $\mathbf{a}_{\textrm{real}}$, and $s'_{\textrm{real}}$) to make their policies more robust to various noises. More details about how UAVs achieve optimal decision-making with the reconfigured objective function are discussed in Sec.~\ref{sec:4B}.


\subsection{QMACN for Cooperative Multi-UAV Mobile Access}\label{sec:4B}
\BfPara{QMARL Algorithm Design}
Sec.~\ref{sec:1} describes the scalability issue caused by the limited number of qubits. In the existing \textit{actor-critic} training method~\cite{konda1999actor}, the number of qubits must increase as the number of agents grows in MARL. This increase induces quantum errors that degrade system stability~\cite{shor1995scheme,yun2022quantum}.
Accordingly, we propose a novel {QMACN} algorithm utilizing CTDE, a methodology based on the multi-agent \textit{actor-critic} RL framework~\cite{yun2022quantum, lowe2017multi}.
In CTDE, there is one \textit{centralized critic} network and multiple \textit{actor} networks, where the number of actor networks equals the number of agents in the Dec-POMDP~\cite{oliehoek2016concise}, as observed in the server of Fig.~\ref{fig:sys_pipeline}. CTDE-based agents make sequential decisions in a distributed manner and train their \textit{actor} networks, which correspond to their policies, using the value evaluated by the \textit{centralized critic} network. The outputs of the two networks can be expressed as follows,
\begin{align}
\!\!\!Q(o,a;\boldsymbol{\theta}) &=\beta_a \langle O_a \rangle_{o,\boldsymbol{\theta}}\!\!\!\!&=\beta_a\text{Tr}(U^{a\dagger}(o;\boldsymbol{\theta})M_a U^{a}(o;\boldsymbol{\theta}))
\label{obs:actor}\\
\!\!\!V(s;\boldsymbol{\phi}) &=\beta_c \langle O \rangle_{s,\boldsymbol{\phi}}\!\!\!\!&=\beta_c\text{Tr}(U^{c\dagger}(s;\boldsymbol{\phi})M_c U^{c}(s;\boldsymbol{\phi}))
\label{obs:critic}
\end{align}
where $\text{Tr}(\cdot)$, $U(\cdot)$, and $(\cdot)^\dagger$ represent the trace operator, the unitary operation for qubit rotation and entanglement of multiple qubits, and the conjugate transpose, respectively. Because the output of a quantum measurement (known as the observable) lies between $-1$ and $1$, \textit{i.e.}, $\forall \langle O\rangle\in [-1,1]$, we utilize the scaling hyper-parameters $(\beta_a,\beta_c)$ so that the \textit{actor} and \textit{critic} networks can be well trained.
Note that $M_a$ and $M_c$ are Hermitian matrices. With \eqref{obs:actor}, \eqref{obs:critic}, and the hyper-parameters, the \textit{actor-critic} networks can approximate the value functions.

\BfPara{Quantum Actor}
At every time step $t$, the $m$-th quantum \textit{actor} chooses the action with the highest probability among the currently possible actions based on its state and observation information, which is represented as follows,
\begin{equation}
    \mathrm{a}_{m,\textrm{real}} = \argmaxD_{\mathrm{a}}\pi_{\boldsymbol{\theta}_m}(\mathrm{a}_{\textrm{ideal}}|s_{\textrm{ideal}}+n_s,{o}_m)+n_a,
\end{equation}
subject to
\begin{align}
    & \pi_{\boldsymbol{\theta}_m}(\mathrm{a}_{\textrm{ideal}}|s_{\textrm{ideal}}+n_s,{o}_m) \triangleq \textit{softmax}(Q(o,a;\boldsymbol{\theta}_m)),\\
    & \textit{softmax}(\mathbf{x}) \triangleq \left[\frac{e^{x_1}}{\sum_{i=1}^N e^{x_i}},\cdots,\frac{e^{x_N}}{\sum_{i=1}^N e^{x_i}}\right],
\end{align}
where the \textit{softmax}$(\cdot)$ is an activation function to normalize the inputs. By using it, we can extract all actions' probabilities of the \textit{actor} with the observable $\langle O_a\rangle_{o,\boldsymbol{\theta}}$ in~\eqref{obs:actor}.
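
To illustrate the role of the scaling hyper-parameter $\beta_a$ (the numbers are hypothetical and serve only as an example), suppose that an actor has two candidate actions whose measured observables are $0.5$ and $-0.5$. With $\beta_a=1$ and $\beta_a=2$, the resulting action probabilities are
\[
    \textit{softmax}([0.5,-0.5])=\left[\frac{1}{1+e^{-1}},\frac{1}{1+e^{1}}\right]\approx[0.73,\,0.27],\qquad
    \textit{softmax}([1,-1])=\left[\frac{1}{1+e^{-2}},\frac{1}{1+e^{2}}\right]\approx[0.88,\,0.12].
\]
Because each observable is bounded in $[-1,1]$, a larger $\beta_a$ allows the policy to express sharper action preferences than the raw observables would permit.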

\BfPara{Quantum Centralized Critic}
The CTDE has a \textit{centralized critic} responsible for valuing the current state with a state-value function as follows,
\begin{multline}
        V_{\boldsymbol{\phi}}(s) \!=\! \langle O_c \rangle_{s, \boldsymbol{\phi}} \!\simeq \! \\ 
        \mathbb{E}_{s_{\textrm{real}}\sim E,\,\mathrm{a}_{\textrm{real}}\sim\pi_{\boldsymbol{\theta}}}\left[\sum_{t'=t}^{T} \gamma^{t'-t}\!\!\cdot r(s_{\textrm{real}},\!\mathbf{a}_{\textrm{real}},s'_{\textrm{real}})\right],
\end{multline}
where $s$ is the state measured at the current time step $t$. The observable of the \textit{critic} network can thus be used to evaluate the value of the current state.

\subsubsection{Training and Inference}
As formulated in~\eqref{eq:goal}, agents in MARL try to maximize the expected return. We utilize the state-value function $V_{\boldsymbol{\phi}}$ of the \textit{centralized critic} network to derive the gradients for maximizing this common goal. With the parameters of the \textit{actor} and \textit{critic} networks, denoted as $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, we can configure a multi-agent policy gradient (MAPG) based on the temporal-difference \textit{actor-critic} model derived from the \textit{Bellman optimality equation}, as follows,
\begin{multline}
    \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}) = \\ \mathbb{E}_{s_{\textrm{real}} \sim E} \left[\sum\limits^{T}_{t=1}\sum\limits^{M}_{m=1} 
    \delta_{\boldsymbol{\phi}}\cdot\nabla_{\boldsymbol{\theta}_m}\log\pi_{\boldsymbol{\theta}_m}(\mathrm{a}_m|s_{\textrm{real}}, o_m)\right], 
    \label{eq:l_actor}
\end{multline}
and
\begin{equation}
    \nabla_{\boldsymbol{\phi}}\mathcal{L}(\boldsymbol{\phi}) = \sum^{T}_{t=1}\nabla_{\boldsymbol{\phi}}\left\|\delta_{\boldsymbol{\phi}}\right\|^2,
    \label{eq:l_critic}
\end{equation}
subject to
\begin{equation}
    \delta_{\boldsymbol{\phi}} = r\left(s_{\textrm{real}},\mathbf{a}_{\textrm{real}},s'_{\textrm{real}}\right)+\gamma V_{\boldsymbol{\phi}}(s'_{\textrm{real}})-V_{\boldsymbol{\phi}}(s_{\textrm{real}}),
    \label{eq:delta}
\end{equation}
where \eqref{eq:l_actor} is the gradient of the objective function for the \textit{actor} networks, and \eqref{eq:l_critic} is the gradient of the loss function for the \textit{critic} network.
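
As a numerical illustration of~\eqref{eq:delta} (all values are hypothetical), let the reward be $r=1$, the discount factor be $\gamma=0.9$, and the critic outputs be $V_{\boldsymbol{\phi}}(s'_{\textrm{real}})=2$ and $V_{\boldsymbol{\phi}}(s_{\textrm{real}})=2.5$. Then
\[
    \delta_{\boldsymbol{\phi}}=1+0.9\times 2-2.5=0.3 .
\]
The positive temporal difference error indicates that the outcome was better than the critic predicted, so~\eqref{eq:l_actor} increases the log-probability of the selected actions, while this transition contributes $\|\delta_{\boldsymbol{\phi}}\|^2=0.09$ to the quantity minimized by the critic in~\eqref{eq:l_critic}.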

We now describe how to obtain the loss gradients with quantum and classical computing. Hereafter, we denote the \textit{actor} and \textit{critic} networks by their parameters $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ for notational simplicity.
Their loss values are calculated with the temporal difference error $\delta_{\boldsymbol{\phi}}$ of the \textit{centralized critic} in~\eqref{eq:delta}, and the derivatives with respect to the $i$-th \textit{actor} and \textit{critic} parameters are expressed as follows,
\begin{eqnarray}
    \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_i} = \frac{\partial J(\boldsymbol{\theta})}{\partial \pi_{\boldsymbol{\theta}}} \cdot \frac{\partial \pi_{\boldsymbol{\theta}}}{\partial \langle O \rangle_{o,\boldsymbol{\theta}}}\cdot \frac{{\partial \langle O \rangle_{o,\boldsymbol{\theta}}}}{\partial \theta_i},\label{eq:loss_actor_derivative} \\
    \frac{\partial\mathcal{L}(\boldsymbol{\phi})}{\partial \phi_i}  = \frac{\partial\mathcal{L}(\boldsymbol{\phi})}{\partial V_{\boldsymbol{\phi}}} \cdot \frac{\partial V_{\boldsymbol{\phi}}}{\partial \langle O \rangle_{s,\boldsymbol{\phi}}} \cdot \frac{{\partial \langle O \rangle_{s,\boldsymbol{\phi}}}}{\partial \phi_i}, \label{eq:loss_critic_derivative}
\end{eqnarray}
where the first and second factors on the RHS of \eqref{eq:loss_actor_derivative} and \eqref{eq:loss_critic_derivative} can be calculated by classical computing. However, the last factor cannot be calculated in the same way because the quantum state is unknown before its measurement.
Thus, we use the parameter-shift rule~\cite{crook19}, which bridges classical and quantum computing by multiplying the partial derivatives obtained from each. Applying the parameter-shift rule to~\eqref{eq:l_actor}--\eqref{eq:l_critic}, the derivative with respect to the $i$-th \textit{actor} parameter can be obtained from evaluations of the observable alone (\textit{i.e.}, zeroth-order information), as follows,
\begin{equation}
   \frac{{\partial \langle O \rangle_{o,\boldsymbol{\theta}}}}{\partial \theta_i} =\frac{1}{2}\left(\langle O \rangle_{o,\boldsymbol{\theta} + \frac{\pi}{2} \mathbf{e}_i } - \langle O \rangle_{o,\boldsymbol{\theta} - \frac{\pi}{2} \mathbf{e}_i }\right),\label{eq:param-shift}
\end{equation}
where $\mathbf{e}_i$ corresponds to the $i$-th basis vector of $\boldsymbol{\theta}$. Similarly, the quantum derivative on the RHS of \eqref{eq:loss_critic_derivative} is obtained via \eqref{eq:param-shift}.
Finally, we can calculate the gradient of the objective function as elaborated in \eqref{eq:l_actor}--\eqref{eq:l_critic}.
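As a sanity check of the shift structure in \eqref{eq:param-shift}, consider a hypothetical single-parameter circuit that is not part of the proposed model: a rotation $R_y(\theta)=\exp(-i\theta Y/2)$ applied to $|0\rangle$ and measured with $O=Z$. Under this assumed gate convention, the expectation value and its shifted difference are
\begin{eqnarray*}
    \langle O \rangle_{\theta} &=& \cos\theta, \\
    \langle O \rangle_{\theta + \frac{\pi}{2}} - \langle O \rangle_{\theta - \frac{\pi}{2}} &=& \cos\left(\theta+\frac{\pi}{2}\right) - \cos\left(\theta-\frac{\pi}{2}\right) = -2\sin\theta,
\end{eqnarray*}
whereas the analytic derivative is $\partial \langle O \rangle_{\theta} / \partial \theta = -\sin\theta$. Two zeroth-order evaluations of the circuit therefore recover the exact derivative up to the constant factor $1/2$ fixed by this gate convention, and such a constant can be absorbed into the learning rate.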
\subsubsection{Algorithm Pseudo-Code}
Details of the proposed CTDE-based training and inference procedure are given in Algorithm~\ref{alg:CTDE}, and the corresponding descriptions are as follows.
\begin{enumerate}
    \item
    Initialize the parameters of \textit{actor} and \textit{centralized critic} networks, which are $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, respectively \textsc{(line 1)}.
    \item 
    All agents learn their policies by repeating the procedure below until the maximum number of training epochs is reached:
    (i) At the start of every epoch, initialize the environment to set the starting state $s_0$ \textsc{(line 3)}.
    (ii) Each UAV selects an action based on its policy at every time step. By taking the actions, the environment transitions to the next state ${s}^{t+1}$ \textsc{(lines 5--7)}. The total reward $R^t$ is calculated, and the transition pairs (\textit{i.e.,} experiences) of the UAVs, $\xi = \{s^t, \textbf{o}^t, \textbf{a}^t, R^t, s^{t+1}, \textbf{o}^{t+1}\}$, are stored in the replay buffer $\mathcal{D}$ \textsc{(lines 8--10)}.
    (iii) UAVs randomly sample a mini-batch from $\mathcal{D}$ to obtain $V_{\boldsymbol{\phi}}$. Random sampling improves the learning performance by reducing the temporal correlation of the data used for training~\cite{mnih2013playing}. Note that training starts only once $\mathcal{D}$ holds enough experiences, which prevents the direction of training from being biased toward the initial data \textsc{(lines 12--13)}.
    (iv) Update the parameters of the \textit{centralized critic} network $\boldsymbol{\phi}$ by gradient descent on the loss function to reduce its value \textsc{(lines 14--15)}.
    (v) Update the parameters of all \textit{actor} networks by gradient ascent on the objective function, in the direction that increases its value as evaluated by the \textit{centralized critic} network \textsc{(line 16)}.
    \item
    After the policies of all UAVs are trained, the UAVs perform multiple inference processes in the environment \textsc{(lines 19--27)}.
\end{enumerate}
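The two updates in steps (iv) and (v) can be written compactly as follows. The learning rates $\eta_{\boldsymbol{\phi}}$ and $\eta_{\boldsymbol{\theta}}$ are symbols introduced here only for illustration, and plain gradient steps are shown for readability, whereas Table~\ref{tab:parameter} lists Adam as the optimizer:
\begin{eqnarray*}
    \boldsymbol{\phi} &\leftarrow& \boldsymbol{\phi} - \eta_{\boldsymbol{\phi}} \nabla_{\boldsymbol{\phi}}\mathcal{L}(\boldsymbol{\phi}), \\
    \boldsymbol{\theta}_m &\leftarrow& \boldsymbol{\theta}_m + \eta_{\boldsymbol{\theta}} \nabla_{\boldsymbol{\theta}_m}J(\boldsymbol{\theta}_m), \quad \forall m \in [1,M].
\end{eqnarray*}
The opposite signs reflect that the \textit{centralized critic} minimizes its loss while each \textit{actor} maximizes its objective. With the values in Table~\ref{tab:parameter}, this sketch would use $\eta_{\boldsymbol{\phi}} = 0.00025$ and $\eta_{\boldsymbol{\theta}} = 0.001$.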

\begin{algorithm}[t]
\small
    Initialize weights of the \textit{actor} and \textit{centralized critic} networks which are denoted as $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, $\forall m \in [1,M]$ \\
    \For{Epoch = 1, MaxEpoch}{
        $\triangleright$ \textbf{Initialize Multi-UAV Environments}, set $s_0$ \\
        \For{time step = 1, $T$}{
            \For{each UAV $m$}{
                $\triangleright$ Select the action $a_m$ based on its policy $\pi_{\boldsymbol{\theta}_m}(a_m\,|\,o_m)$ at time step $t$ \\
            }
            $\triangleright$ $s^t \rightarrow s^{t+1}$, $\textbf{o}^t \rightarrow \textbf{o}^{t+1}$ with the reward $R^t$

            $\triangleright$ Set $\xi= \{s^t, \textbf{o}^t, \textbf{a}^t, R^t, s^{t+1}, \textbf{o}^{t+1}\}$ \\
            $\triangleright$ Update replay buffer $\mathcal{D}$: $Enqueue(\mathcal{D},\xi)$ \\
        }
        \If {$\mathcal{D}$ \textbf{is full enough to train:}}{
        
        \For{each UAV $m$}{
        $\triangleright$ Get $V_{\boldsymbol{\phi}}$ by sampling mini-batch from $\mathcal{D}$\\
        
        $\triangleright$ Update $\boldsymbol{\phi}$ by \textbf{gradient descent} to loss function of the \textit{centralized critic} network: $\nabla_{\boldsymbol{\phi}}\mathcal{L}(\boldsymbol{\phi})$ \\
        
        $\triangleright$ Update $\boldsymbol{\theta}_m$ by \textbf{gradient ascent} to objective function of the \textit{actor} network: $\nabla_{\boldsymbol{\theta}_m}J(\boldsymbol{\theta}_m)$
        }
    }
    }
    \For{Episode = 1, MaxEpisode}{
        $\triangleright$ \textbf{Initialize Multi-UAV Environments}, set $s_0$ \\
        \For{time step = 1, $T$}{
            \For{each UAV $m$}{
                    $\triangleright$ Select the action $a_m$ based on its policy $\pi_{\boldsymbol{\theta}_m}(a_m\,|\,o_m)$ at time step $t$ \\
                }
                $\triangleright$ $s^t \rightarrow s^{t+1}$, $\textbf{o}^t \rightarrow \textbf{o}^{t+1}$ with the reward $R^t$ \\
        }
    }
    \caption{QMACN training and inference process for multi-UAV cooperation}
    \label{alg:CTDE}
\end{algorithm}

\begin{table}[t!]
\small
\caption{Experimental setup parameters}
\centering
\begin{tabular}{c|l|r}\toprule[1pt]
    \multicolumn{2}{c}{\textbf{{Parameters}}} & \textbf{{Values}} \\
    \midrule
    $M$ & The number of UAVs & $4$ \\
    $N$ & The number of UEs  & $25$ \\
    $T$ & Episode length & $30\,\mathrm{min}$ \\
    - & State dimension & $179$ \\
    - & Action dimension & $4$ \\
    $\mathcal{D}$ & Replay buffer size & $50$k \\
    - & Mini-batch size & $32$ \\
    $\gamma$ & Discount factor & $0.98$ \\
    $w_c$ & Reward coefficient & $0.01$ \\
    $\epsilon$ & Initial value of epsilon & $0.275$ \\
    - & Annealing epsilon & $0.00005$ \\
    - & Minimum epsilon & $0.01$ \\
    - & Training epochs & $10$k \\
    - & Learning rate of \textit{actor} & $0.001$ \\
    - & Learning rate of \textit{centralized critic} & $0.00025$ \\
    - & Activation function & ReLU \\
    - & Optimizer & Adam \\
    \bottomrule[1pt]
\end{tabular}
\label{tab:parameter}
\end{table}

\section{Performance Evaluation}\label{sec:5}
\subsection{System Setup}
\subsubsection{Environment}
Our environment includes two primary objects, ground users (\textit{i.e.}, UEs) and UAVs. UEs are randomly distributed over a $6{,}000\times6{,}000\,\mathrm{m}^2$ area (map), and UAVs fly at an altitude of $2{,}500\,\mathrm{m}$. For UAV mechanical modeling, the DJI Phantom4 Pro v2.0 is used~\cite{tvt202106jung}. We assume that all UAVs are located at the center of the map at the start of every episode.
Table~\ref{tab:parameter} summarizes the other environmental parameters.
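As a rough reading of the exploration settings in Table~\ref{tab:parameter}, assume that $\epsilon$ decreases linearly by the annealing value once per training epoch and is clipped at its minimum. The schedule granularity is an assumption made only for this illustration, and $k$ denotes a hypothetical epoch index:
\[
    \epsilon_k = \max\left(0.01,\; 0.275 - 0.00005\,k\right).
\]
Under this assumption, $\epsilon$ reaches its minimum after $(0.275-0.01)/0.00005 = 5{,}300$ epochs, which is roughly halfway through the $10$k training epochs.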

\subsubsection{Communication Methodology}
A 60\,GHz mmWave wireless technology is considered for communications.
The 60\,GHz wireless network is considered because it has \textit{i)} a large channel bandwidth, \textit{ii)} low-latency transmission, \textit{iii)} high beam directivity, \textit{iv)} high diffraction, and \textit{v)} high scattering~\cite{singh2011interference}.
These characteristics reduce the sensitivity to interference from nearby mobile access (\textit{i.e.}, they enable spatial reuse), at the cost of sensitivity to blocking~\cite{singh2011interference}.
In an urban environment, for example, blocking by high-rise buildings may affect mmWave wireless systems. However, our proposed algorithm can mitigate this problem through cooperative positioning of the UAVs. Thus, it is possible to provide better QoS to users by exploiting high directivity as well as low latency~\cite{singh2011interference}. More details are given in~\cite{park2022cooperative} and summarized in Appendix~\ref{sec:mmwave}.


\subsubsection{Benchmark}
We compare the performance of our proposed method against the following benchmarks to substantiate the proposed QMARL algorithm.
\begin{itemize}
   
    \item `{{w/ Dual noise (Proposed)}}' trains each UAV's policy with our QMARL algorithm in a realistic environment with both state and action noise. Accordingly, the position of the UAV is affected not only by its action decision but also by the noise. We conduct an \textit{ablation study} to investigate how each noise component affects the performance and robustness of our model.
    \item `{{w/ State noise}}' and `{{w/ Action noise}}' train the policy with the QMARL algorithm under only state noise from GPS sensors and only action noise from winds, respectively.
    \item `{{Ideal}}' trains the policy using the QMARL algorithm in an environment without noise. That is, the position of the UAV depends only on its action decision.
   
    \item `{{CMARL}}' is a MARL benchmark that uses classical neural networks and serves to compare the training performance with our proposed QMARL algorithm. The UAVs in this training scheme train their policies by classical CTDE using backpropagation~\cite{hecht1992theory}. Note that we set the same number of model parameters as in our proposed QMARL training benchmarks.
    \item `{{Random walk}}' means that all UAVs decide their actions randomly, regardless of the given state and observation information. Comparing against it allows us to verify the benefit of MARL.
\end{itemize}


\subsection{Feasibility of Reward Function Design}
In this subsection, we scrutinize the feasibility of the reward function design with Fig.~\ref{fig:overall_result}.
From the results, we observe that the noise-reflected benchmarks (\textit{i.e.}, `w/ Dual noise', `w/ State noise', and `w/ Action noise') show larger fluctuations (\textit{i.e.,} higher variance) than `Ideal' due to the noise effect.
However, the total rewards of the QMARL-based training algorithms converge to larger values than those of `CMARL' and `Random walk' at the end of the training process. We emphasize that, despite fewer parameters, our proposed QMARL algorithm achieves better policy training performance than CMARL. UAVs trained by QMARL obtain similar final reward values, in the order of `Ideal', `w/ Action noise', `w/ State noise', and `w/ Dual noise'.
Next, the average support rate and QoS of the ground UEs follow trends similar to the reward convergence, as observed in Figs.~\ref{fig:support_rate}--\ref{fig:qos}. These results show that our reward function formulation is well suited for multi-UAVs to achieve the intended goal of reliable wireless communication service provisioning.
In summary, we confirm that our reward design is justified not only for the ideal condition but also for noisy environments.
\begin{figure*}[!ht]
    \centering
    \includegraphics[width=0.75\linewidth]{Figure/Comp_description.pdf}\\
    \subfigure[Average total reward.]{
    \includegraphics[width=0.31\linewidth]{Figure/reward-eps-converted-to.pdf}
    \label{fig:reward}
    }
    \subfigure[Average support rate.]{
    \includegraphics[width=0.31\linewidth]{Figure/support_rate-eps-converted-to.pdf}
    \label{fig:support_rate}
    }
    \subfigure[Average total QoS.]{
    \includegraphics[width=0.31\linewidth]{Figure/quality_of_service-eps-converted-to.pdf}
    \label{fig:qos}
    }
    \caption{Average values of the training results of all training benchmarks in every time step. Fig.~\ref{fig:reward} shows the UAVs' total obtained rewards over the entire training epochs, Fig.~\ref{fig:support_rate} shows the average support rate achieved by the UAVs over the entire training epochs, and Fig.~\ref{fig:qos} shows the total QoS of the ground UEs served by the UAVs.}
    \label{fig:overall_result}
\end{figure*}

\begin{figure}
    \centering
    \includegraphics[width=0.75\linewidth]{Figure/NN_description.pdf}\\
    \includegraphics[width=0.75\linewidth]{Figure/reward_dim-eps-converted-to.pdf}
    \caption{Comparison of the reward convergence of each training method with classical neural networks and quantum computing. \textit{dim} denotes the dimension of the hidden layer in each classical neural network.}
    \label{fig:with_classical_NN}
\end{figure}
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.9999\linewidth]{Figure/Comp_description2.pdf}\\
    \subfigure[Support Rate.]{
    \includegraphics[width=0.43\linewidth]{Figure/num_user-eps-converted-to.pdf}
    \label{fig:inference_support_rate}
    }
    \subfigure[QoS.]{
    \includegraphics[width=0.43\linewidth]{Figure/qos-eps-converted-to.pdf}
    \label{fig:inference_qos}
    }
    \\
    \subfigure[Average Support Rate.]{
    \includegraphics[width=0.415\linewidth]{Figure/Bar_support_rate-eps-converted-to.pdf}
    \label{fig:inference_support_rate_avg}
    }
    \subfigure[Average QoS.]{
    \includegraphics[width=0.415\linewidth]{Figure/Bar_qos-eps-converted-to.pdf}
    \label{fig:inference_qos_avg}
    }
    \caption{Overall support rate and quality of service in the {inference process} in the presence of state and action noise.}
    \label{fig:inference}
\end{figure}


\subsection{Comparison with Classical Neural Networks}

This subsection determines the hidden-layer dimension of `{{CMARL}}' and investigates the performance of the quantum approach compared to existing neural networks. Fig.~\ref{fig:with_classical_NN} illustrates the reward convergence of the classical neural networks with hidden dimensions varying from 1 to 128, of `Ideal', and of the random walk in an ideal environment. Placing an adequate number of neurons in the hidden layers can reduce training time while retaining high accuracy, but overfitting arises when neurons or layers are increased unnecessarily. That is, neural networks with more neurons do not necessarily achieve better training performance. Among all training benchmarks, our QMARL performs best in terms of the reward convergence value and speed. Except for QMARL, only the classical neural networks with 64 and 128 hidden dimensions obtain higher rewards than the random walk at the end of the training epochs. However, the classical neural network with 128 hidden dimensions shows more unstable training performance than the random walk and obtains lower rewards than `Ideal'. Therefore, it is reasonable to set the hidden-layer dimension of CMARL to 64.
In Fig.~\ref{fig:reward}, the reward convergence trends of all training benchmarks are illustrated over the whole training epochs. {{CMARL}} and {{the random walk}}, which are not based on QMARL algorithms, achieve lower total reward values than the QMARL-based training benchmarks.
In summary, our proposed QMARL training algorithm (\textit{i.e.}, `Ideal') enables UAVs to learn a policy closer to the optimum than CMARL in the ideal environment.


\subsection{Impact of Noise on Robustness in Realistic Environment}
After the learning process, all UAVs sequentially make decisions based on the learned policy in the inference process to verify the policy's feasibility. We compare our QMARL algorithms trained in various environments (\textit{i.e.,} ablation study) over 100 inference iterations. As illustrated in Fig.~\ref{fig:inference}, `w/ Dual noise' achieves the most robust and highest values regarding support rate (avg.\,60.9\,\%) and QoS (avg.\,37.2) over the entire inference process. `w/ State noise' and `w/ Action noise' constitute the second most reliable mobile access, with an 18\,\% lower support rate and a 25\,\% lower QoS than `w/ Dual noise'. Among all QMARL training algorithms, `Ideal' provides the poorest wireless communication service due to the environmental noise present during inference. It provides, on average, a 77\,\% lower support rate and QoS to ground UEs than `w/ Dual noise'. In a nutshell, the more noise resembling the real environment we consider during training, the more robust and reliable the policies that the agents learn for building mobile access in the real world.

\subsection{Discussion}
This subsection discusses the performance of our proposed noise-reflected QMARL training algorithm based on the experimental results in Sec.~\ref{sec:5}.

\subsubsection{Quantum Advantage in MARL Environments}
Figs.~\ref{fig:overall_result} and~\ref{fig:with_classical_NN} show that our QMARL algorithms based on PQC achieve comparable policy training performance with fewer parameters when compared to CMARL~\cite{yun2022quantum2}. This empirical advantage lies in memorization, a representative hallmark of PQC~\cite{jerbi2021parametrized}. Preserving knowledge about existing labels when learning new labels helps train policies near-optimally~\cite{yun2022quantum2}. PQCs have sufficient flexibility in this respect, whereas conventional neural networks (\textit{i.e.,} DNN policies) do not~\cite{jerbi2021parametrized}. This difference affects the speed and performance of training, as observed in our numerical performance evaluations. In addition, quantum gates employing quantum features (\textit{i.e.,} superposition and entanglement) can offer an exponential computational gain over classical computation for certain problems~\cite{shor1999polynomial}.

\subsubsection{Effects of Noise Injection for Realistic Environment Design}
This paper conducted an ablation study with noise reflection. In general, environmental noise affects the behavior of agents in RL. However, some Gaussian noise improves policy robustness~\cite{he2019parametric}, and selective noise injection (SNI) alleviates the harmful effects of noise on the gradient quality~\cite{igl2019generalization}. Such schemes help the agent achieve desired goals more efficiently and work robustly under natural noise. Our noise models in~\eqref{eq:state_noise}--\eqref{eq:action_noise}, which follow Gaussian distributions, expose UAVs to abundant experience in the environment and prevent overfitting to restricted training environments, as observed in Fig.~\ref{fig:reward}. Even taking some fluctuations into account, the benchmarks with noise (`w/ Dual noise', `w/ State noise', and `w/ Action noise') show a faster convergence speed and higher reward values than `Ideal'. Furthermore, we investigated the model robustness in realistic environments with noise in Fig.~\ref{fig:inference}.

\section{Concluding Remarks}\label{sec:6}
This paper proposes a CTDE-based quantum multi-agent actor-critic networks ({QMACN}) training algorithm for constructing robust mobile access using multiple UAVs. For the practical use of QC, we adopt CTDE to overcome scalability and physical issues in order to realize quantum supremacy. Our proposed {QMACN} algorithm verifies the advantage of QMARL with remarkable performance improvements in terms of training speed and wireless service quality in various data-intensive evaluations. Moreover, we validate that the noise injection scheme can help multiple UAVs react to environmental uncertainties, making mobile access more robust. In summary, our proposed {QMACN} algorithm shows considerable performance improvements in constructing cooperative multi-UAV mobile access compared to conventional training methods, with fewer model parameters, which leads to efficient computing resource management.

\bibliographystyle{IEEEtran}

