\section{Introduction}

\IEEEPARstart{R}{ecent} advances within the Smart Grid (SG) paradigm are geared towards the incorporation of several Internet of Things (IoT) based devices and advanced computing technologies to ensure the reliability, flexibility and efficiency of critical power systems \cite{Lamnatou_Chemisana_Cristofari_2022}. With the prevalence of Artificial Intelligence (AI), the enormous amount of highly granular power-related data generated by such intelligent devices enables energy service providers to improve load forecasts, maximize financial gains, and devise effective demand management and other grid operation strategies \cite{Sakhnini_Karimipour_Dehghantanha_Parizi_Srivastava_2021}. Besides, consumers can experience better quality of service through personalization of power system applications and tools \cite{9478223}. However, present data analytics solutions for SGs primarily emphasize centralized and decentralized approaches that require the direct sharing of data and/or learning models with dedicated central servers \cite{9381850}. In such cases, the sharing of fine-grained load consumption profiles collected from individual smart meters with central data servers raises several privacy concerns for energy data owners \cite{husnoo2021false, reda2021taxonomy}. For instance, several studies \cite{10.1145/1878431.1878446, 10.1145/2528282.2528295} have highlighted that simple analysis of load consumption patterns recorded by smart meters can reveal household occupancy rates, the presence of people within a house, and the sleep/wake-up times of residents, without any prior knowledge. Indeed, a higher resolution of smart meter data leads to finer-grained information and allows third parties to infer more sensitive information about households.

In such a scenario, Federated Learning (FL) emerges as a viable privacy-preserving distributed computing alternative which transfers computation to energy data owners and allows the training of a global model through collaboration of devices without requiring the migration of data to a central repository \cite{9084352}. Typically, edge devices in an energy system network iteratively train a local model and upload the resulting parameters to a central aggregator, which accumulates and processes the parameters and then sends the updated parameters back to the edge devices. The communication rounds continue until the model converges. In addition to the privacy preservation benefits that stem from not sharing raw data, FL is also efficient in terms of communication resource usage and has higher scalability \cite{9084352}. Recently, FL has gained much attention from researchers exploring its potential benefits within several smart grid domains, such as short-term load forecasting \cite{husnoo2022fedrep, 9148937} and energy theft detection \cite{9531953}. Nevertheless, despite its promising privacy-preserving potential, recent literature has revealed that FL may fail to provide sufficient privacy guarantees in certain circumstances. For example, researchers have shown that the original raw data can be reconstructed from the model gradients shared during training iterations \cite{iDLG}. Furthermore, due to its distributed nature, FL is vulnerable to Byzantine faults/attacks whereby client nodes behave arbitrarily, which may be a result of adversarial manipulations or software/hardware faults \cite{FLTrust}. Therefore, it is imperative to design FL mechanisms that are tolerant to such behaviours, provide good generalisation performance and are communication efficient. Consequently, we address this research gap in the field of smart grids through the following contributions:

\begin{enumerate}[leftmargin=*]
    \item Inspired by the idea of gradient quantization, we develop a state-of-the-art privacy-preserving federated learning-based framework that leverages the SIGNSGD algorithm to improve the robustness of FL strategies for residential short-term load forecasting against Byzantine attacks. 
    \item Specifically, in this paper, we highlight three key data integrity attacks against short-term load forecasting FL models. We design the data integrity threat models and their countermeasures.
    
    \item We further extend the proposed framework towards a privacy-preserving SIGNSGD-based FL approach whereby the clients locally perturb their trained parameters by adding noise prior to uploading to the server for aggregation to prevent parameter information leakage and ensure privacy preservation more effectively.
    
    \item We conduct comprehensive case studies and extensive empirical evaluations to verify the effectiveness of our proposed scheme using a real Australian energy consumption dataset obtained from Ausgrid Network.
\end{enumerate}

\noindent The rest of this paper is structured as follows. Section \ref{sect:prelim} provides some background information in relation to our conceptual framework. Section \ref{sect:probdef} covers the problem definition, where we discuss some popular adversarial Byzantine threat models on FL. In Section \ref{propmethod}, we describe our proposed FL architecture, followed by Section \ref{Results}, which focusses on the evaluation and comparison of our proposed framework under several scenarios. Finally, Section \ref{Conclusion} concludes this manuscript and provides some potential future directions for research.

\section{Related Works}

Typically, Byzantine threats in federated learning scenarios consist of clients sending arbitrary model parameters to the server with the aim of impairing the convergence of the model \cite{BarossoFedThreatSurvey}. More specifically, Byzantine attacks are typically untargeted threats during which adversarial clients either train their local models on corrupted datasets or fabricate random model updates. Inherently, Byzantine threats are usually less stealthy and can be detected through close analysis of the global model performance \cite{9220780}. To address Byzantine resiliency in FL, a number of works have been proposed in recent literature. Throughout this section, we briefly summarize the main studies on Byzantine resiliency in FL.

A common approach to Byzantine resiliency in FL is to employ aggregation operators which are based on statistically robust estimators. For instance, the authors in \cite{FLTrust, 10.1145/3154503, 9029245} leveraged Byzantine-robust aggregation rules by comparing the local updates of clients and filtering out statistical outliers prior to global model updates. Furthermore, Blanchard et al. \cite{10.5555/3294771.3294783} proposed the computationally expensive \textit{Krum} algorithm, which selects, during each iteration, the gradient update that has the least sum of distances from its nearest gradient updates. In addition, \cite{pmlr} introduced \textit{Bulyan} as an extension of Krum that recursively finds a subset of nodes using Krum and eventually performs an element-wise pruned mean on the updates to exclude high-magnitude values. Similarly, a handful of other Byzantine-robust aggregation operators \cite{distChen, RSALi, pmlr_v80_yin18a, 9153949,PillutlaAggregation, GonzalezByzantine, ShuhaoResidual} have been proposed in the existing literature to mitigate the vulnerability of FL to Byzantine attacks. Another interesting study \cite{9669031} utilized a mixed-strategy game-theoretic approach between the server and the clients whereby each client can send either good or corrupted model parameters while the server can choose to either accept or discard them. By employing the Nash Equilibrium property, the clients' updates were selected based on their probability of providing correct updates.

In addition to the design of Byzantine-robust operators, several other defence strategies have been employed, such as anomaly detection \cite{ShiqiDefending, 9054676, 8975792} and pre-processing methods \cite{https://doi.org/10.48550/arxiv.2004.04986}. However, while much work has been carried out to mitigate the threats to FL, to the best of our knowledge, little to no work has been carried out on secure, privacy-preserving and fault-tolerant FL frameworks for residential short-term electrical load forecasting.

\section{Preliminary}
\label{sect:prelim}
Throughout this section, we discuss some preliminary background knowledge on FL and Differential Privacy (DP). Furthermore, we discuss a conventional FL set-up for short-term load forecasting, which will be used as a baseline during the evaluation of our proposed scheme.

\subsection{Federated Learning}
For the past couple of decades, Artificial Intelligence (AI) has transformed every walk of life and proven its benefits within several fields. However, one of the biggest real-world challenges faced by AI is the design of high-performing models in the presence of natural data fragmentation coupled with security and privacy enforcement. To alleviate this issue, \cite{Fedpap} introduced a fundamentally novel learning technique known as \textit{Federated Learning}, which enables the collaborative decentralised training of machine learning models without the physical migration of raw data, as depicted in Fig. \ref{fig:fedillus}.
\begin{figure}[!h]
    \centering
    \includegraphics[width=7cm]{Federated_Learning_illustration.png}
    \caption{An illustration of the steps involved in FL.}
    \label{fig:fedillus}
\end{figure}

Suppose we have $N$ clients and each client $C_i$ holds a local training dataset $D_i$ where $i \in \{1,2,\ldots,N\}$. An active $C_i$, participating in the local training, aims to collaboratively learn the weights $w_i$ of the shared global model such that a certain empirical loss $L_i$ is minimized. Therefore, we can formulate the optimization problem solved by multiple data owners as $w^* = \underset{w_i}{\mathrm{arg\ min}} \displaystyle \sum_{i=1}^N L_{i} (w_{i})$. Specifically, each communication round proceeds as shown in Fig. \ref{fig:fedillus} through the following steps: (1) The central server sends a unanimous global model $w$ to the active FL clients. (2) Each client trains the local model by using its own local dataset $D_i$ in order to solve the optimization problem $\underset{w_i}{\mathrm{min}}\ L_i(D_i, w_i)$. (3) Each client uploads its local model parameters to the central server. (4) The server computes the global model update by aggregating the parameters received from the local models. (5) Lastly, the server sends the updated parameters back to the local models. This iterative process continues until convergence of the global model.

Furthermore, there are two baseline approaches to train models in an FL set-up, namely Federated Averaging (Fed-Avg) and Federated Stochastic Gradient Descent (Fed-SGD). Generally, Fed-SGD \cite{pmlr-v119-malinovskiy20a} averages the locally computed gradient at every step of the learning phase while Fed-Avg \cite{9488877} averages local model updates when all the clients have completed training their models. However, as mentioned before, regardless of the approach used, FL is prone to several privacy and security threats, which are discussed in Section \ref{sect:probdef}.

\subsection{Differential Privacy}
\label{sect:diffpriv}
Due to the several drawbacks of data anonymization techniques, such as loss of data utility and risks of re-identification, Differential Privacy (DP) emerged as a formal framework that quantifies how well individual privacy is preserved within a statistical database during the release of useful aggregate information \cite{9594795}. We formally define some related DP concepts as follows:

\noindent \textbf{Definition 1}: A randomized algorithmic mechanism $M: X \longrightarrow R$ with domain $X$ and range $R$ satisfies $(\epsilon, \delta)$-differential privacy if, for all measurable sets $S \subseteq R$ and for any two adjacent inputs $D, D' \in X$, the following holds: $\Pr[M(D) \in S] \leq \exp(\epsilon) \times \Pr[M(D') \in S] + \delta$. Here $\Pr$ denotes probability \cite{9594795}.

\noindent \textbf{Definition 2}: The privacy loss $L$ of a randomized algorithmic mechanism $M: X \longrightarrow R$ for any result $v \in R$ and for any two data samples $D, D' \in X$ is expressed as: $L(v, D, D') = \log \dfrac{\Pr[M(D) = v]}{\Pr[M(D') = v]}$ \cite{9594795}.

One of the most popular noise addition mechanisms for DP is the Gaussian Mechanism. Adding noise $n \sim \mathcal{N}(0, \sigma^2)$, where $\mathcal{N}$ is a Gaussian distribution with zero mean and variance $\sigma^2$, preserves $(\epsilon,\delta)$-DP provided that the noise scale satisfies $\sigma \geq c\Delta s/\epsilon$ with the constant $c \geq \sqrt{2\ln(1.25/\delta)}$ for $\epsilon \in (0,1)$, where $\Delta s$ is the sensitivity of the real-valued function. However, it is important to note that choosing the right amount of noise remains a significant open challenge.
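
As a minimal illustration of the bound above, the following Python sketch computes the smallest admissible noise scale and perturbs a single value. The function names \texttt{gaussian\_sigma} and \texttt{gaussian\_mechanism}, as well as the numerical values of $\epsilon$, $\delta$ and $\Delta s$, are assumptions chosen for illustration and are not taken from our experiments.

\begin{verbatim}
```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity):
    """Smallest noise scale allowed by sigma >= c * sensitivity / epsilon."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the bound is stated for epsilon in (0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # Smallest admissible constant c = sqrt(2 ln(1.25 / delta)).
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    return c * sensitivity / epsilon


def gaussian_mechanism(value, epsilon, delta, sensitivity, rng):
    """Return value plus zero-mean Gaussian noise with the scale above."""
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return value + rng.gauss(0.0, sigma)


# Illustrative parameters only; none of them are taken from the paper.
rng = random.Random(0)
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, sensitivity=1.0)
noisy_load = gaussian_mechanism(3.2, 0.5, 1e-5, 1.0, rng)
```
\end{verbatim}

Since $\sigma$ is inversely proportional to $\epsilon$, halving the privacy budget in this sketch doubles the standard deviation of the injected noise, which illustrates the privacy-utility trade-off mentioned above.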

\subsection{Federated Load Forecasting with Fed-SGD (Benchmark)}

During Fed-SGD, a distributed stochastic gradient descent algorithm is applied within a federated environment to jointly train the global model. As shown in Algorithm \ref{FedSGDalgo}, during each communication round, each client $k$ computes the gradient $g_k$ by optimizing the loss of the local model using its local dataset $D_{k}$. The local gradients $g_k$ are then sent to the control centre, where they are aggregated, and the new gradient updates are pushed back to the local models. The whole process continues until convergence.


\begin{algorithm}
\textbf{Input}: learning rate $\eta$, each client $k$, local data $D_{k}$.

Control centre initializes and distributes unanimous model $m_0$ and encrypted parameter initialization $||\hat{m}_0||$ to all clients $N$.

\For{each communication round $T_{cl} = 1,2,..., t$}{

\For{each client $k \in N$}{

Compute gradient $g_k$ by training model on local dataset $D_k$.

Send to Control Centre.

\textbf{end}
}
Control Centre aggregates the local gradient updates as $g$.

Control centre pushes updated gradients back to the local models.

\textbf{end}
}
\caption{Short-term Load Forecasting with Fed-SGD.}
\label{FedSGDalgo}
\end{algorithm}
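
The communication loop of Algorithm \ref{FedSGDalgo} can be sketched in a few lines of Python. This is a toy sketch under stated assumptions: the model is a one-parameter linear map $y = wx$ trained with a squared-error loss, the aggregation is a plain average of the client gradients, and the names \texttt{local\_gradient} and \texttt{fed\_sgd} as well as the data are hypothetical. It does not reproduce the forecasting models evaluated later.

\begin{verbatim}
```python
def local_gradient(w, data):
    """Gradient of the mean squared error of y = w * x on one client."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fed_sgd(client_data, eta, rounds, w0=0.0):
    """Clients send gradients; the centre averages and broadcasts them."""
    w = w0
    for _ in range(rounds):
        # Client side: each client computes a gradient on its own data.
        grads = [local_gradient(w, data) for data in client_data]
        # Control centre: aggregate the local gradients (plain average).
        g = sum(grads) / len(grads)
        # Every client applies the same aggregated gradient.
        w -= eta * g
    return w


# Toy data: three clients whose readings all follow y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0)],
]
w_final = fed_sgd(clients, eta=0.05, rounds=100)
```
\end{verbatim}

In this toy setting the shared weight approaches the value $2$ that generated the data, while no client ever transmits its raw samples.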


\section{Problem Definition \& Adversarial Models}
\label{sect:probdef}
Federated learning enables promising privacy-preserving data analytics for smart grids by pushing model training to devices, thus requiring no direct data sharing \cite{9084352}. Nonetheless, recent literature has revealed its failure to sufficiently guarantee privacy preservation due to update leakage \cite{bhowmick2019protection}, deep leakage \cite{geng2022general}, Byzantine attacks \cite{247652}, etc. Throughout this paper, we aim to address Byzantine threats in relation to federated learning for electrical load forecasting. Before we present our proposed defense strategy, we consider in this section three types of Byzantine threat models on federated load forecasting, as follows:
\begin{enumerate}[leftmargin=*]
    \item \textbf{Threat Model 1} \textit{(Local Data Poisoning)}: We consider the scenario where a subset of the clients $k$ are malicious or are controlled by a malicious attacker. Malicious clients may be injected into the federated learning framework through the addition of adversarially-controlled smart metering devices. The goal of the adversary is to manipulate the learnt parameters such that the global model $M$ has high indiscriminate errors, thus implying that the attack objective is: $\mathrm{Attack}(D_{k} \cup D_{k}', m_{t}^k) = \max \displaystyle \sum_{i=1}^{n} \mathbf{1}[f(x_{i}'; m_{t}^k ) = t_{i}']$, where $m_{t}^k$ represents the updated model. Each malicious client is able to stealthily alter its local training samples $D_k$ but is unable to access and manipulate the data of other participants or the model learning process.
    
    Let $D_k = \{(x_i, t_i) \mid i = 1,\ldots,n\}$ denote the pristine local training dataset with $n$ samples where $x_i$ is the time instance and $t_i$ is the corresponding electrical load consumed. Each malicious client $k$ modifies its dataset $D_k$ such that the trigger $v$ is inserted into $x_i$ whereby $x_{i}' = x_i + v$. The sign $+$ denotes the addition of the poison trigger $v$ to $x_i$, which yields the poisoned dataset $D_{k}' = \{(x_{i}', t_{i}') \mid i = 1,\ldots,n\}$. The poisoned dataset $D_{k}'$ is then used for model training. The adversary's goal is to ensure the degradation of forecasts of the auxiliary data by the global model.
    
    \item \textbf{Threat Model 2} \textit{(Model Leakage \& Poisoning)}: In this scenario, we assume that the adversary can arbitrarily manipulate the local models sent from the clients to the central aggregator for illicit purposes but cannot observe the training data of other honest clients. Similarly, during this type of threat, the ultimate adversarial goal is to manipulate the learnt global model such that it has a high error rate indiscriminately for testing examples. Such attacks directly negatively impact the usability of the model and will eventually lead to denial-of-service attacks. 
    
    
    \item \textbf{Threat Model 3} \textit{(Colluding attack)}: Lastly, we consider the cross-device scenario whereby multiple malicious clients are present during the federated training iteration. The adversaries intentionally collude with each other during a single iteration by sending the same update, i.e., each of the attackers sends the same learnt update during some of the training iterations. As with the previous threat models, the goal is to manipulate the learnt global model to induce high error rates.
    
\end{enumerate}
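
To make the three threat models concrete, the following Python sketch shows one possible way a malicious client could act in each case. All function names and values are hypothetical. In particular, leaving the targets unchanged when the trigger is added, and drawing fabricated updates from a uniform distribution, are assumptions made for illustration rather than the exact attack implementations used in our evaluation.

\begin{verbatim}
```python
import random


def poison_dataset(dataset, trigger):
    """Threat Model 1: add the trigger v to every input x_i."""
    # Assumption: the targets are left unchanged in this sketch.
    return [(x + trigger, t) for x, t in dataset]


def fabricate_update(dim, scale, rng):
    """Threat Model 2: replace a local update with arbitrary values."""
    return [rng.uniform(-scale, scale) for _ in range(dim)]


def colluding_updates(shared_update, num_attackers):
    """Threat Model 3: every colluding client submits the same update."""
    return [list(shared_update) for _ in range(num_attackers)]


rng = random.Random(0)
# (time instance, load) pairs of one hypothetical client.
clean = [(0.0, 1.2), (0.5, 1.4), (1.0, 0.9)]
poisoned = poison_dataset(clean, trigger=0.25)
fake = fabricate_update(dim=4, scale=10.0, rng=rng)
colluded = colluding_updates(fake, num_attackers=3)
```
\end{verbatim}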


\section{Proposed Method}
\label{propmethod}

Within this section, we propose a new FL framework to circumvent the aforementioned Byzantine threats on FL for short-term load forecasting. The key idea lies in sharing just the sign of the gradients to preserve privacy. We present our developed solution as follows:

\subsection{System Model Overview}

\begin{figure}
    \centering
    \includegraphics[width=8cm]{DP-Fed.png}
    \caption{An illustration of proposed approach.}
    \label{fig:proposedapp}
\end{figure}

As previously discussed, the objective of this study is to design a robust and privacy-preserving FL framework for residential short-term load forecasting. As shown in Fig. \ref{fig:proposedapp}, our proposed method consists of three components, as discussed below.

\begin{enumerate}[leftmargin=*]
    \item \textit{Electrical Appliances}: Whenever someone within a household uses one of the electrical appliances, the load consumption is collected by the smart meter.
    \item \textit{Smart Meter}: Each customer has a smart meter that is connected to a Home Area Network (HAN). Each smart meter collects energy load consumption profiles. The data collected is stored locally on the HAN of the consumer such that local models can be trained using their own dataset.
    \item \textit{Control Centre}: The control centre is responsible for broadcasting a learning model and default model parameters, aggregation of parameters after training and finally broadcasting the updated model parameters.
\end{enumerate}

\subsection{Algorithm Design}
Within a conventional federated learning setting with $N$ clients, at round $t$, a selected client $k$ performs $T_{gd}$ local gradient descent iterations using the common broadcast model $m_{t-1}$ on its local training sample $D_{k}$ such that a new updated model $m_{t}^k$ is obtained. Each client $k$ then sends its update $\Delta m_{t}^k = m_{t}^k - m_{t-1}^k$ to the central orchestrator, which in turn aggregates the model updates from all $N$ clients such that $m_{t} = m_{t-1} + \sum_{k \in N} \dfrac{|D_{k}|}{\sum_{j} |D_{j}|} \Delta m_{t}^k$. The model training continues until convergence and is subsequently terminated after a set number of rounds $T_{cl}$.
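
The weighted aggregation rule above can be written as a short Python sketch. The function name \texttt{aggregate} and the two toy clients are hypothetical and only serve to show how the dataset sizes $|D_k|$ act as weights.

\begin{verbatim}
```python
def aggregate(m_prev, deltas, sizes):
    """Return m_prev plus the size-weighted sum of the client updates."""
    total = sum(sizes)
    m_new = list(m_prev)
    for delta, size in zip(deltas, sizes):
        weight = size / total  # |D_k| / sum_j |D_j|
        for i, d in enumerate(delta):
            m_new[i] += weight * d
    return m_new


# Two hypothetical clients holding 1 and 3 samples respectively.
m_t = aggregate(
    m_prev=[0.0, 0.0],
    deltas=[[1.0, 0.0], [0.0, 4.0]],
    sizes=[1, 3],
)
```
\end{verbatim}

Here the second client holds three quarters of the data, so its update contributes with weight $0.75$ and the first client's update with weight $0.25$.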

\begin{algorithm}
\textbf{Input}: learning rate $\eta$, each client $k$ local data $D_{k}$.

Control centre distributes unanimous model $m_{0}$ and encrypted parameter initialization $||\hat{m_{0}}||$ to all clients $N$.

\For{each communication round, $T_{cl} =1,..., t$}{

\For{each client $k$}{

Compute the gradient $g_{k}= m_{t}^k - m_{t-1}^k$ by training on local dataset $D_k$.

Obtain sign vector $sign(\Delta m_{t}^k)$ from $g_k.$

Perturb $sign(\Delta m_{t}^k)$ with a random Gaussian noise $\zeta_{k}$ such that $\sum_{k \in N} sign(\Delta m_{t}^k) + \zeta_{k}$ satisfies differential privacy.

Encrypt $sign(\Delta m_{t}^k) + \zeta_{k}$ into $E_{k}[sign(\Delta m_{t}^k) + \zeta_{k}]$ and send to control centre.

\textbf{end}
}

Control Centre aggregates encrypted updates $\sum_{k} E_{N_{k}} (sign(\Delta m_{t}^k) + \zeta_{k})$.

Control Centre pushes $sign(g_N)$ to all clients, $N$. 

\textbf{end}
}
\caption{Proposed Framework}
\label{proposedalgo}
\end{algorithm}

However, in the context of smart grids, conventional federated learning settings pose several privacy risks, as discussed earlier. Therefore, we propose a novel privacy-preserving federated learning framework for electrical load forecasting through model weight quantization as in \cite{jin2021stochasticsign}. Specifically, as shown in Algorithm \ref{proposedalgo} and Fig. \ref{fig:proposedapp}, a selected client $k$ initially computes the gradient update $g_{k} = m_{t}^k - m_{t-1}^k$ from which it obtains the sign vector $sign(\Delta m_{t}^k) = sign(m_{t}^k - m_{t-1}^k)$ where $sign(\cdot)$: $\mathbb{R}^n \longrightarrow \{-1,1\}^n$. A random Gaussian noise $\zeta_{k}$ is then added to perturb $sign(\Delta m_{t}^k)$ such that $\sum_{k \in N} sign(\Delta m_{t}^k) + \zeta_{k}$ satisfies differential privacy. Furthermore, to prevent an adversary from learning $sign(\Delta m_{t}^k) + \zeta_{k}$ accurately in circumstances where $N$ is large, each client $k$ uploads the encrypted result $E_{k}[sign(\Delta m_{t}^k) + \zeta_{k}]$ to the central aggregator. The orchestrator in turn sums all the encrypted model updates from the $N$ clients to obtain $\sum_{k} E_{N_{k}} (sign(\Delta m_{t}^k) + \zeta_{k})$. This aggregation follows the selection of the median of all $N$ clients' signs at every position of the update vector. The model training continues until convergence and is subsequently terminated after a set number of rounds $T_{cl}$.
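
The sign-based pipeline described above can be sketched in Python as follows. This is a simplified sketch under several assumptions: the encryption step is omitted, the server-side rule is implemented as a majority vote (the sign of the coordinate-wise sum) standing in for the median-based selection, zero is mapped to $+1$, and all function names and values are hypothetical. The noise is disabled ($\sigma = 0$) in the example so that the outcome is deterministic.

\begin{verbatim}
```python
import random


def sign_vector(delta):
    """Map each coordinate to -1 or +1 (zero is mapped to +1)."""
    return [1 if d >= 0 else -1 for d in delta]


def perturb(signs, sigma, rng):
    """Add zero-mean Gaussian noise with std sigma to each coordinate."""
    return [s + rng.gauss(0.0, sigma) for s in signs]


def aggregate_signs(updates):
    """Majority vote: sign of the coordinate-wise sum of all updates."""
    sums = [sum(column) for column in zip(*updates)]
    return sign_vector(sums)


rng = random.Random(0)
honest_delta = [0.3, -1.2, 0.7]
honest = sign_vector(honest_delta)
# A Byzantine client inverts every sign of the honest update.
flipped = [-s for s in honest]

# 7 honest and 3 Byzantine clients; sigma = 0 disables the noise.
updates = [perturb(honest, 0.0, rng) for _ in range(7)]
updates += [perturb(flipped, 0.0, rng) for _ in range(3)]
global_sign = aggregate_signs(updates)
```
\end{verbatim}

Because every client contributes at most one unit per coordinate, the three sign-flipping clients in this toy example cannot outvote the seven honest ones, whereas a single client with a large-magnitude update could dominate a plain gradient average.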


\subsection{Convergence Analysis}
In the following, we present a formal analysis of the SIGNSGD approach through the use of refined assumptions derived from conventional SGD assumptions.

\noindent\textbf{Assumption 1} \textit{(Lower Bound)}: Given an objective/loss function $f$, at any point $x$, $f(x) \geq f(x^*)$, where $f(x^*)$ represents the optimal objective value and $x^*$ represents the global minimum of $f(x)$. This standard assumption is necessary to ensure convergence to a stationary point.

\noindent\textbf{Assumption 2} \textit{(Smoothness)}: Given an objective/loss function $f$, let $g(x)$ denote the gradient of $f$ (the derivative of the function with respect to $x$) evaluated at the point $x$. Then, for all $x, y$ and for some vector of non-negative constants $L = [L_1, \ldots, L_n]$, we require that $|f(y) - [f(x) + g(x)^T (y-x)]| \leq \frac{1}{2} \sum_{i}L_{i}(y_{i} - x_{i})^2$. This assumption is an extension of the Lipschitz continuity condition, which is essential to guarantee that $f$ is smooth and that gradient descent algorithms converge.

\noindent\textbf{Assumption 3} \textit{(Variance Bound)}: Upon receiving the query $x \in \mathbb{R}^n$, the stochastic gradient oracle returns an independent, unbiased estimate $\hat{g}$ that has bounded variance per coordinate: $\mathbb{E}[\hat{g}(x)] = g(x)$ and $\mathbb{E}[(\hat{g}(x)_{i} - g(x)_{i})^2] \leq \sigma_{i}^2$, where $\sigma_{i}^2$ is the uniform variance bound. The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, bounded variance may be violated where $f$ is strongly convex as $x \longrightarrow \infty$. However, this assumption is necessary to grasp the fundamental properties of stochastic optimisation algorithms.

\noindent\textbf{Assumption 4} \textit{(Gradient Noise)}: At any given point $x$, each component of the stochastic gradient vector, $\hat{g}(x)$, must have a unimodal distribution that is also symmetric about the mean. This assumption ensures that the addition of extra noise for the purpose of differential privacy does not skew the distribution and decrease utility.

Under these assumptions, we have the following result:

\noindent \textbf{Theorem 1} \textit{(Non-convex convergence rate of SIGNSGD)}: Run Algorithm 1 for $K$ iterations under Assumptions 1 to 3. Set the learning rate as $\delta_k = \dfrac{1}{\sqrt{\|L\|_1 K}}$ and the mini-batch size as $n_k = K$. Let $N$ be the cumulative number of stochastic gradient calls up to step $K$, i.e. $N = O(K^2)$. Then we have $\mathbb{E}\left[\dfrac{1}{K} \displaystyle \sum_{k = 0}^{K-1}\|g_k\|_1 \right]^2 \leq \dfrac{1}{\sqrt{N}}\left[\sqrt{\|L\|_1 } \left(f_0 - f_* + \dfrac{1}{2}\right) + 2\|\sigma\|_1\right]^2$.

\section{Simulation \& Results}
\label{Results}
In this section, we provide the results of the experimental evaluations of our proposed approach. We first introduce the dataset used and the settings shared by all experiments. Next, the performance of the proposed approach is presented and compared throughout different scenarios. Lastly, we discuss the overall results.

\subsection{Experimental Setup}

This research was conducted using the \textit{Solar Home Electricity Data} from Eastern Australia's largest electricity distributor, Ausgrid. The dataset comprises half-hourly electricity consumption data of 300 de-identified customers, measured using gross meters during the period from 1\textsuperscript{st} July 2012 to 30\textsuperscript{th} June 2013. We initially filter the data based on the General Consumption (GC) category, convert it to a suitable time-series format, and then split it into train (70\%) and test (30\%) subsets.

Every experiment has the following general configuration. There is a fixed number of clients (10), each holding a local subset of the data, and a server that coordinates the FL scenario. The model performance is evaluated using three metrics: \textit{Mean Squared Error (MSE)}, \textit{Root Mean Squared Error (RMSE)} and lastly, \textit{Mean Absolute Percentage Error (MAPE)}.
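
For reference, the three metrics follow their standard definitions, which can be sketched in Python as below. The function names and the two-point example are hypothetical; the MAPE is expressed in percent and assumes non-zero true values.

\begin{verbatim}
```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (true values != 0)."""
    n = len(y_true)
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / n


# Hypothetical true and forecast loads.
y_true = [2.0, 4.0]
y_pred = [1.0, 5.0]
scores = (mse(y_true, y_pred), rmse(y_true, y_pred),
          mape(y_true, y_pred))
```
\end{verbatim}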

\subsection{Comparison with Baseline (No Attack)}
Throughout this section, we present the experimental results comparing the performance of the proposed approach against the conventional Fed-SGD approach. As shown in Fig. \ref{fig:trainloss}(a), Fed-SGD reaches convergence after the 47\textsuperscript{th} communication round while the proposed approach converges after the 40\textsuperscript{th} communication round.
\begin{figure}[!h]
    \centering
    \subfloat[\centering Convergence of Federated LSTM-CNN model ]{{\includegraphics[width=4cm]{FedSGDvsProposed.png} }}%
    \qquad
    \subfloat[\centering MAPE (\%) per client ]{{\includegraphics[width=3.9cm]{ComparisonMAPEprop.png} }}%
    \caption{Comparison between Fed-SGD and proposed approach.}%
    \label{fig:trainloss}
\end{figure}
\begin{table}[!h]
\caption{Evaluation of Fed-SGD with several models \label{CompaGedSGD}}
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{Metric}    & \textbf{RNN} & \textbf{GRU} & \textbf{LSTM} & \textbf{CNN} & \textbf{LSTM-CNN} \\ \hline
\textbf{MSE}       & 0.2657       & 0.1973       & 0.1634        & 0.2567       & 0.1583            \\ \hline
\textbf{RMSE}      & 0.5346       & 0.4042       & 0.3463        & 0.5243       & 0.3008            \\ \hline
\textbf{MAPE (\%)} & 16.4         & 10.9         & 11.0          & 12.8         & 9.7               \\ \hline
\end{tabular}
\end{table}

\begin{table}[!h]
\caption{Evaluation of proposed method with several models \label{CompaSignSGDModel}}
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\multicolumn{1}{|l|}{\textbf{Metric}}    & \multicolumn{1}{l|}{\textbf{RNN}} & \multicolumn{1}{l|}{\textbf{GRU}} & \multicolumn{1}{l|}{\textbf{LSTM}} & \multicolumn{1}{l|}{\textbf{CNN}} & \multicolumn{1}{l|}{\textbf{LSTM-CNN}} \\ \hline
\multicolumn{1}{|l|}{\textbf{MSE}}       & \multicolumn{1}{l|}{0.2662}       & \multicolumn{1}{l|}{0.1864}       & \multicolumn{1}{l|}{0.1803}        & \multicolumn{1}{l|}{0.2456}       & \multicolumn{1}{l|}{0.1437}            \\ \hline
\multicolumn{1}{|l|}{\textbf{RMSE}}      & \multicolumn{1}{l|}{0.5432}       & \multicolumn{1}{l|}{0.4127}       & \multicolumn{1}{l|}{0.3890}        & \multicolumn{1}{l|}{0.5329}       & \multicolumn{1}{l|}{0.3243}            \\ \hline
\multicolumn{1}{|l|}{\textbf{MAPE (\%)}} & \multicolumn{1}{l|}{15.9}         & \multicolumn{1}{l|}{11.1}         & \multicolumn{1}{l|}{10.8}          & \multicolumn{1}{l|}{13.6}         & \multicolumn{1}{l|}{9.7}                           \\ \hline      
\end{tabular}
\end{table}

\begin{table*}[!ht]
\centering
\caption{Evaluation of proposed FL framework against Threat Model 1 \& 2 \label{EvaluationFLThreat}}
\begin{tabular}{cc|cc|cc|}
\cline{3-6}
\multicolumn{2}{l|}{\textbf{}}                                                                                          & \multicolumn{2}{c|}{\textbf{Fed-SGD}}                                  & \multicolumn{2}{c|}{\textbf{Proposed Solution}}                                             \\ \hline
\multicolumn{1}{|c|}{\textbf{\begin{tabular}[c]{@{}c@{}}\% of Compromised\\ Clients\end{tabular}}} & \textbf{Metric}    & \multicolumn{1}{c|}{\textbf{Threat Model 1}} & \textbf{Threat Model 2}  & \multicolumn{1}{l|}{\textbf{Threat Model 1}} & \multicolumn{1}{l|}{\textbf{Threat Model 2}} \\ \hline
\multicolumn{1}{|c|}{\multirow{3}{*}{\textbf{10}}}                                                 & \textbf{MSE}       & \multicolumn{1}{c|}{0.2910}                  & 0.3134                  & \multicolumn{1}{c|}{0.1621}                  & 0.1532                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{RMSE}      & \multicolumn{1}{c|}{0.4732}                  & 0.5490                  & \multicolumn{1}{c|}{0.3251}                  & 0.3029                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{MAPE (\%)} & \multicolumn{1}{c|}{18.2}                    & 20.1                    & \multicolumn{1}{c|}{10.1}                    & 9.9                                          \\ \hline
\multicolumn{1}{|c|}{\multirow{3}{*}{\textbf{20}}}                                                 & \textbf{MSE}       & \multicolumn{1}{c|}{0.4180}                  & 0.4519                  & \multicolumn{1}{c|}{0.1835}                  & 0.1642                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{RMSE}      & \multicolumn{1}{c|}{0.7893}                  & 0.9201                  & \multicolumn{1}{c|}{0.3502}                  & 0.3129                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{MAPE (\%)} & \multicolumn{1}{c|}{25.7}                    & 27.1                    & \multicolumn{1}{c|}{12.2}                    & 10.8                                         \\ \hline
\multicolumn{1}{|c|}{\multirow{3}{*}{\textbf{30}}}                                                 & \textbf{MSE}       & \multicolumn{1}{c|}{0.7319}                  & 0.8192                  & \multicolumn{1}{c|}{0.2678}                  & 0.2134                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{RMSE}      & \multicolumn{1}{c|}{1.2398}                  & 1.4576                  & \multicolumn{1}{c|}{0.4249}                  & 0.3965                                       \\ \cline{2-6} 
\multicolumn{1}{|c|}{}                                                                             & \textbf{MAPE (\%)} & \multicolumn{1}{c|}{38.9}                    & 42.2                    & \multicolumn{1}{c|}{17.3}                    & 14.1                                         \\ \hline
\end{tabular}
\end{table*}

\begin{table}[]
\centering
\caption{Evaluation of proposed FL framework against Threat Model 3\label{EvaluationFLThreat3}}
\begin{tabular}{|c|c|c|c|}
\hline
\multicolumn{1}{|l|}{\textbf{\% of Comp. Clients}} & \multicolumn{1}{l|}{{ \textbf{Metric}}} & \multicolumn{1}{l|}{{\textbf{Fed-SGD}}} & \multicolumn{1}{l|}{{\textbf{Proposed Solution}}} \\ \hline
                                                         & \textbf{MSE}                                                & 0.3103                                                       & 0.1732                                                                 \\ \cline{2-4} 
                                                         & \textbf{RMSE}                                               & 0.5321                                                       & 0.3324                                                                 \\ \cline{2-4} 
\multirow{-3}{*}{\textbf{20}}                            & \textbf{MAPE (\%)}                                          & 19.3                                                         & 11.2                                                                   \\ \hline
                                                         & \textbf{MSE}                                                & 0.5231                                                       & 0.2034                                                                 \\ \cline{2-4} 
                                                         & \textbf{RMSE}                                               & 0.8743                                                       & 0.3958                                                                 \\ \cline{2-4} 
\multirow{-3}{*}{\textbf{30}}                            & \textbf{MAPE (\%)}                                          & 34.0                                                         & 14.0                                                                   \\ \hline
                                                         & \textbf{MSE}                                                & 0.7793                                                       & 0.2901                                                                 \\ \cline{2-4} 
                                                         & \textbf{RMSE}                                               & 1.2343                                                       & 0.4302                                                                 \\ \cline{2-4} 
\multirow{-3}{*}{\textbf{40}}                            & \textbf{MAPE (\%)}                                          & 39.5                                                         & 16.4                                                                   \\ \hline
\end{tabular}
\end{table}

\begin{table}[]
\centering
\caption{Evaluation of proposed FL framework under different Privacy Budgets \label{threat3}}
\begin{tabular}{|c|c|c|c|}
\hline
\textbf{$\epsilon$-Budget}     & \textbf{Metric}    & \textbf{Fed-SGD} & \multicolumn{1}{l|}{\textbf{Proposed Solution}} \\ \hline
\multirow{3}{*}{\textbf{0.01}} & \textbf{MSE}       & 0.1583           & 0.1437                                          \\ \cline{2-4} 
                               & \textbf{RMSE}      & 0.3              & 0.3243                                          \\ \cline{2-4} 
                               & \textbf{MAPE (\%)} & 9.7              & 9.7                                             \\ \hline
\multirow{3}{*}{\textbf{0.1}}  & \textbf{MSE}       & 0.4320           & 0.1645                                          \\ \cline{2-4} 
                               & \textbf{RMSE}      & 0.8173           & 0.3192                                          \\ \cline{2-4} 
                               & \textbf{MAPE (\%)} & 26.4             & 10.5                                                      \\ \hline
\end{tabular}
\end{table}

\begin{figure*}[!h]
    \centering
    \subfloat[\centering Impact of Threat Model 1 ]{{\includegraphics[width=5cm]{Attack1FedSGD.png} }}%
    \subfloat[\centering Impact of Threat Model 2 ]{{\includegraphics[width=5cm]{ImpactT2FedSGD.png} }}%
    \subfloat[\centering Impact of Threat Model 3  ]{{\includegraphics[width=5cm]{ImpactofEPonFedSGD.png} }}%
    \caption{Impact of Attacks on Fed-SGD}%
    \label{fig:impactFedSGD}
\end{figure*}

\begin{figure*}[!h]
    \centering
    \subfloat[\centering Impact of Threat Model 1 ]{{\includegraphics[width=5cm]{Threat1Mitig.png} }}%
    \subfloat[\centering Impact of Threat Model 2 ]{{\includegraphics[width=5cm]{Threat2Mitig.png} }}%
    \subfloat[\centering Impact of Threat Model 3  ]{{\includegraphics[width=5cm]{Threat3Mitig.png} }}%
    \caption{Mitigating threat models using our proposed method}%
    \label{fig:impactSIGNSGD}
\end{figure*}

As our proposed solution converges faster than the traditional Fed-SGD one, we can conclude that the proposed approach provides fast algorithmic convergence. Furthermore, we use the three aforementioned evaluation metrics to compare the performance of the proposed solution against Fed-SGD across several models, as presented in Table \ref{CompaGedSGD} and Table \ref{CompaSignSGDModel}. The experimental results reveal that the proposed framework reaches performance similar to that of the Fed-SGD approach. Similarly, Fig. \ref{fig:trainloss}(b) contrasts the MAPE per active household within the two FL setups and shows that our proposed approach performs comparably to Fed-SGD. More specifically, we can deduce that our proposed framework reaches good generalization performance for short-term load forecasting within acceptable error ranges. Moreover, comparing the models presented in Table \ref{CompaSignSGDModel}, the LSTM-CNN model shows the best overall forecasting performance, with an average MAPE of 9.7\% in both the conventional Fed-SGD and the proposed FL framework.

\subsection{Impact of attacks on proposed framework}

In this section, we evaluate the robustness of our proposed FL framework against the adversarial threat models described in Section \ref{sect:probdef}. To discuss the impact of Byzantine attacks on the standard Fed-SGD and on our proposed approach, we divide the results into the following two sections:

\subsubsection{Impact of attacks on Fed-SGD}

In this section, we present the results of evaluating Fed-SGD under the three Byzantine threat models described in Section \ref{sect:probdef}. Table \ref{EvaluationFLThreat} reports the performance of Fed-SGD under threat models 1 \& 2. For both threat models, there is a direct relationship between the percentage of compromised active FL clients and the mean error of the Fed-SGD model: as the percentage of compromised clients increases, the mean error of the FL model increases. Specifically, for threat models 1 \& 2, at 10\% of compromised clients, the MAPE of the FL model is 18.2\% and 20.1\% respectively, and it follows an upward trend such that at 30\% of compromised clients, the MAPE reaches 38.9\% and 42.2\% respectively. It is worth noting that once roughly a third of the clients are compromised/malicious, the forecasted values deviate from the actual values by around 40\% on average. On the other hand, Table \ref{EvaluationFLThreat3} investigates the impact of the colluding attack on the Fed-SGD setup. Similarly, the mean error of the FL model increases with the number of colluding adversaries. Furthermore, based on Fig. \ref{fig:impactFedSGD}(a) and Fig. \ref{fig:impactFedSGD}(b), we can note that as the percentage of compromised clients increases, the FL model loss starts to diverge after a certain number of communication rounds under threat models 1 \& 2. Similarly, based on Fig. \ref{fig:impactFedSGD}(c), as the number of colluding adversaries increases, the FL model loss starts to diverge after a certain number of communication rounds.
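
As a quick arithmetic check on the Fed-SGD figures quoted above from Table \ref{EvaluationFLThreat} (using only the reported MAPE values, so no new results are implied), the average MAPE across threat models 1 \& 2 at 30\% of compromised clients is
\[
\frac{38.9 + 42.2}{2} = 40.55,
\]
which is the basis for the figure of around 40\%. Over the range from 10\% to 30\% of compromised clients, the MAPE rises by
\[
38.9 - 18.2 = 20.7 \quad \mathrm{and} \quad 42.2 - 20.1 = 22.1
\]
percentage points for threat models 1 \& 2 respectively, that is, it roughly doubles in both cases.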

\subsubsection{Impact of proposed FL framework on attacks}

In the previous section, we discussed the impact of attacks on the standard Fed-SGD setup. In this section, we discuss how well our proposed solution mitigates the threat models presented in Section \ref{sect:probdef}. As presented in Table \ref{EvaluationFLThreat}, when our proposed solution is attacked under threat models 1 \& 2 with 10\% of compromised clients, the MAPE of the FL model stays close to that of the model prior to any attacks. With an increasing percentage of compromised clients, there is a slight increase in the MAPE. Specifically, from 20\% to 30\% of compromised clients, the MAPE increases by 5.1 and 3.3 percentage points for threat models 1 \& 2 respectively. However, this small increase in the mean error of the FL model is still within acceptable ranges. Similarly, based on Table \ref{EvaluationFLThreat3}, there is only a slight increase (within acceptable error ranges) in the MAPE as the number of compromised clients increases. Furthermore, based on Fig. \ref{fig:impactSIGNSGD}, we notice that under all percentages of compromised clients, our proposed model converges after a certain number of communication rounds/iterations. Therefore, we can conclude that our proposed approach effectively mitigates Byzantine attacks.
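
The values of 5.1 and 3.3 quoted above can be traced back to Table \ref{EvaluationFLThreat}, assuming that its last two columns report the proposed solution under threat models 1 \& 2 respectively. Going from 20\% to 30\% of compromised clients, the MAPE of the proposed solution changes by
\[
17.3 - 12.2 = 5.1 \quad \mathrm{and} \quad 14.1 - 10.8 = 3.3
\]
percentage points. For comparison, under the same assumption about the column layout, the corresponding changes for Fed-SGD over the same interval are
\[
38.9 - 25.7 = 13.2 \quad \mathrm{and} \quad 42.2 - 27.1 = 15.1
\]
percentage points, so the growth in error under the proposed solution is less than half of that observed for Fed-SGD.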

\subsection{Results Discussion}

With increasing concerns and regulatory enforcement regarding security and privacy within the smart grid paradigm, it is crucial to develop privacy-preserving and robust short-term load forecasting solutions. FL, whilst still in its infancy, requires further improvements under different circumstances. Therefore, throughout this study, we investigate Byzantine attacks in relation to federated short-term load forecasting. Furthermore, we propose and design a robust defense solution to mitigate those threats.

From Tables \ref{CompaGedSGD} and \ref{CompaSignSGDModel}, it can be seen that our proposed approach reaches forecasting performance comparable to that of Fed-SGD when there are no attacks. Similarly, across several other time-series forecasting models, the performance of our proposed approach matches that of Fed-SGD. More specifically, we achieved the best overall performance of our proposed approach using the LSTM-CNN model, with a MAPE of 9.7\% for both FL setups. Therefore, we selected LSTM-CNN as the principal model to evaluate our proposed approach under the three threat models discussed in Section \ref{sect:probdef}.

Based on the experimental results presented in Tables \ref{EvaluationFLThreat} and \ref{EvaluationFLThreat3} as well as Figs. \ref{fig:impactFedSGD} and \ref{fig:impactSIGNSGD}, under the conventional Fed-SGD approach, we notice an overall degradation in the performance of the model with increasing intensity of attacks. For instance, an increase in the percentage of compromised clients results in an upward shift in the mean error of the model. In contrast, we notice that our proposed approach can withstand such attacks with minimal impact on the mean error of the FL model. This leads us to conclude that it is indeed a resilient and privacy-preserving FL setup for residential short-term load forecasting.
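
One way to summarize this contrast is the relative reduction in MAPE obtained by the proposed solution at the highest attack intensity reported for each threat model. This summary measure is only an illustration computed from the tabulated values, not an additional evaluation metric, and it assumes that the last two columns of Table \ref{EvaluationFLThreat} report the proposed solution under threat models 1 \& 2. At 30\% of compromised clients under threat models 1 \& 2,
\[
\frac{38.9 - 17.3}{38.9} \approx 0.555 \quad \mathrm{and} \quad \frac{42.2 - 14.1}{42.2} \approx 0.666,
\]
and at 40\% of compromised clients under threat model 3 (Table \ref{EvaluationFLThreat3}),
\[
\frac{39.5 - 16.4}{39.5} \approx 0.585 .
\]
In other words, under these settings the MAPE of the proposed solution is roughly 55\% to 67\% lower than that of Fed-SGD.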


\section{Conclusion}
\label{Conclusion}

The rapid adoption of FL within the smart grid ecosystem has piqued the interest of researchers in addressing its security and privacy issues. Byzantine attack mitigation plays a crucial role in securing and enhancing the robustness of FL for short-term load forecasting. Therefore, throughout this manuscript, we propose an FL-based approach that leverages the notions of gradient quantization and differential privacy to overcome this challenge. Furthermore, we empirically demonstrate that our proposed solution effectively mitigates popular Byzantine threats and provides performance similar to that of standard FL setups. Finally, the next steps in this research are to: (1) design and evaluate our proposed FL framework against stronger Byzantine attacks, and (2) take into consideration the existence of distributed energy resources to improve the grid model.

\bibliographystyle{IEEEtran}
