\section{Proposed Methodology: WaveDIF}\label{sec:proposal}

This section presents WaveDIF (\underline{Wave}let sub-band based \underline{D}eepfake \underline{I}dentification in \underline{F}requency Domain), the proposed methodology for deepfake detection in this research. Since WaveDIF takes wavelet sub-band energies as its feature input, the model operates strictly in the frequency domain. WaveDIF captures the subtle spectral artifacts that are introduced in videos during artificial synthesis, and classifies the videos accordingly.



\begin{algorithm}[htbp]
\caption{\label{alg1}WaveDIF}
\begin{algorithmic}[1]
\Require Labeled dataset $\{\mathbb{V}, \ell\}$, and test video $\mathcal{V} \notin \mathbb{V}$
\Ensure Predicted label, $l \in \{\texttt{ORIGINAL}, \texttt{DEEPFAKE}\}$
\Function{FeatureExtraction}{$v$}
\State $\{f_1, f_2, \dots, f_F\} \gets$ \textsc{ExtractFrames} $(v)$
\For{each frame $f_i$}
    \State $\mathcal{F}_i \gets$ \textsc{DiscreteFourierTransform} $(f_i)$ 
    \State $\mathcal{F}'_i \gets$ \textsc{GaussianLPF}$(\mathcal{F}_i)$
    \State $f'_i \gets$ \textsc{InverseDFT} $(\mathcal{F}'_i)$  
\EndFor

\State $\displaystyle{\mathcal{R} \gets \frac{1}{F} \sum_{i=1}^{F} f'_i}$ \Comment{Averaged (for all filtered frames)}

\State $\displaystyle{\left[\begin{matrix}\text{LL}&\text{LH}\\\text{HL}&\text{HH}\\\end{matrix}\right] \gets}$ \textsc{DiscreteWaveletTransform}$(\mathcal{R})$
\State Compute $\displaystyle{\mathcal{E}_\mathcal{S} = \sum_{j \in \mathcal{S}} \mathcal{S}_j^2}$, $\forall \mathcal{S} \in \{\text{LL}, \text{LH}, \text{HL}, \text{HH}\}$

\State $\displaystyle{\mathbf{F}_v \gets [\mathcal{E}_{\text{LL}}, \mathcal{E}_{\text{LH}}, \mathcal{E}_{\text{HL}}, \mathcal{E}_{\text{HH}}]}$ \Comment{Feature vector (for $v$)}
\EndFunction
\If {phase == \texttt{TRAINING}} \Comment{Training Phase}
    \For{each video $v$ in $\mathbb{V}$}
        \State $\mathbf{F}_v \gets$ \textsc{FeatureExtraction} $(v)$
        \State $\mathbf{F}_{\mathbb{V}} \gets \mathbf{F}_{\mathbb{V}} \oplus \mathbf{F}_v$ \Comment{Feature fusion}
    \EndFor
    \State Learn the model parameters (\textbf{Linear Regression})
    \[
    \mathcal{B}(\mathbf{F}_{\mathbb{V}}) = \theta_1 \cdot \mathcal{E}_{\text{LL}} + \theta_2 \cdot \mathcal{E}_{\text{LH}} + \theta_3 \cdot \mathcal{E}_{\text{HL}} + \theta_4 \cdot \mathcal{E}_{\text{HH}} + \beta
    \]
    \State Learn the threshold  \( T \) (\textbf{Logistic Regression}) 
\Else \Comment{Inference Phase}
    \State $\mathbf{F}_{\mathcal{V}} \gets$ \textsc{FeatureExtraction} $(\mathcal{V})$
    \State $\{\Theta^\text{T}, \beta\} \gets \mathcal{B}(\mathbf{F}_{\mathbb{V}})$
    \State \( f(\mathbf{F}_{\mathcal{V}}) \gets \Theta^\text{T} \mathbf{F}_{\mathcal{V}} + \beta \)
    \If{$f(\mathbf{F}_{\mathcal{V}}) \geq T$}
        \State $l\gets$ \texttt{ORIGINAL}
    \Else
        \State $l\gets$ \texttt{DEEPFAKE}
    \EndIf
    \State \Return $l$
\EndIf

\end{algorithmic}
\end{algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{sec/model_architecture.jpg}
    \caption{\label{fig:architecture}\textbf{Architectural Overview of the WaveDIF Deepfake Detection Technique.} Given a video input $\mathcal{V} \in \mathbb{R}^{H \times W \times 3 \times F}$, each frame of size $(H \times W)$ undergoes a Discrete Fourier Transform (DFT) followed by Gaussian low-pass filtering to suppress high-frequency noise artifacts. The filtered frames are averaged to generate a single aggregate representation of the input video $\mathcal{V}$. This representation is then decomposed into wavelet sub-bands (LL, LH, HL, HH) using a Haar filter. Further, the energy values $\mathcal{E}_{\text{LL}}, \mathcal{E}_{\text{LH}}, \mathcal{E}_{\text{HL}}$, and $\mathcal{E}_{\text{HH}}$ are computed for each video. At the end of the \textit{feature extraction} process (iff phase == \texttt{TRAINING}), a linear decision boundary is fitted (using linear and logistic regression). Videos are then \textit{classified} according to the side of the boundary on which they fall.}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Given an input video $v \in \mathbb{R}^{H\times W \times 3 \times F}$, the objective is to classify it as either synthetically generated or original. Here, $H\times W$ is the spatial resolution of the video, the factor of three denotes the three-channel RGB representation, and $F$ is the number of frames in the video.
The proposed WaveDIF model is a two-stage pipeline: (i) \textit{feature extraction}, which captures the frequency-domain abnormalities, and (ii) \textit{classification based on the extracted features}, which trains a deterministic model that distinguishes real from deepfake videos. The feature extraction stage is further subdivided into two parts: (a) \textit{DFT-based frequency filtration}, which enhances the discriminative patterns by suppressing noise, and (b) \textit{DWT-based feature extraction}, which extracts localized frequency variations based on wavelet sub-band energies. Note that before any input video is passed to the feature extraction stage, it is converted to grayscale (single-channel) and, to maintain uniformity across all videos, resized to a fixed preset resolution.

\subsection{Feature Extraction}
In the first stage of the WaveDIF pipeline, every input video $v$ is first decomposed into its constituent frames, $\{f_1, f_2, \dots, f_F\}$. Each frame $f_i$ is then transformed into the frequency domain using the 2D Discrete Fourier Transform (DFT), following Eqn.~\eqref{2d_dft_eq}:
\begin{equation} \label{2d_dft_eq}
\mathcal{F}_i (u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f_i(x,y) e^{-j 2\pi \left(\frac{ux}{H} + \frac{vy}{W}\right)}
 \end{equation}
where $(x,y)$ and $(u,v)$ are the coordinates in the spatial and spectral domains, respectively (the spectral coordinate $v$ is distinct from the video $v$).
As mentioned previously, deepfakes often introduce high-frequency artifacts which are irrelevant for frequency-based classification \cite{frank2020leveraging}. To filter out these noisy artifacts, Gaussian Low-Pass Filtering (GLPF)
\cite{xu2015forensic} is used (as per Eqn.~\eqref{glpf}) to suppress
the high-frequency components while preserving discriminative patterns
in the lower frequencies. The filtered spectrum is then transformed back
to the spatial domain using the inverse DFT (as per Eqn.~\eqref{idft}):
\begin{equation} \label{glpf}
\mathcal{F}'_i(u,v) = \mathcal{F}_i(u,v) \cdot e^{-\frac{(u^2 + v^2)}{2\sigma^2}}
 \end{equation}
\begin{equation} \label{idft}
f'_i(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} \mathcal{F}'_i(u,v) e^{j 2\pi \left(\frac{ux}{H} + \frac{vy}{W}\right)}
 \end{equation}
where $\sigma$ is the cutoff-frequency. In this work, we set $\sigma = 45$.
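As an illustrative reading of Eqn.~\eqref{glpf} (a sketch, assuming that $(u,v)$ are measured relative to the zero-frequency component, which is not stated explicitly above), let $D = \sqrt{u^2 + v^2}$ denote the radial frequency and $G(D) = e^{-D^2/(2\sigma^2)}$ the filter gain; $D$ and $G$ are shorthand introduced here only for this example. For $\sigma = 45$, the gain evaluates to
\[
G(0) = 1, \quad G(45) = e^{-1/2} \approx 0.607, \quad G(90) = e^{-2} \approx 0.135, \quad G(135) = e^{-9/2} \approx 0.011 .
\]
Under this assumption, the zero-frequency term passes unchanged, components at $D = \sigma$ retain roughly $61\%$ of their amplitude, and components at $D = 3\sigma$ are attenuated to about $1\%$.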

To obtain a single global representation of the video $v$, the mean of the
filtered frames is computed as $\displaystyle{\mathcal{R} = \frac{1}{F} \sum_{i=1}^{F} f'_i}$. Since the DFT captures global frequency
information but lacks spatial localization, the DWT with a Haar wavelet
filter is applied to the aggregated frame $\mathcal{R}$
\cite{sifuzzaman2009application}, which decomposes $\mathcal{R}$ into
four frequency sub-bands:
\begin{enumerate}
\item LL (Low-Low), which captures low-frequency structures.
\item LH (Low-High), which captures horizontal high-frequency details.
\item HL (High-Low), which captures vertical high-frequency details.
\item HH (High-High), which captures diagonal high-frequency details.
\end{enumerate}
For the Haar wavelet transform, the low-pass filter $\displaystyle{\left(\phi = \frac{1}{\sqrt{2}} \left[1, 1\right]\right)}$ and the high-pass filter $\displaystyle{\left(\psi = \frac{1}{\sqrt{2}} \left[1, -1\right]\right)}$ are applied separately along the rows and columns of the matrix $\mathcal{R}$. The
respective sub-bands are computed as per Eqn.~\eqref{sub_band_m_n}:
\begin{equation}\label{sub_band_m_n}
\begin{aligned}
\text{LL}(m, n) &= \frac{1}{2} \sum_{i=0}^{1} \sum_{j=0}^{1} \mathcal{R}(2m + i, 2n + j) \\
\text{LH}(m, n) &= \frac{1}{2} \sum_{i=0}^{1} \sum_{j=0}^{1} (-1)^j \mathcal{R}(2m + i, 2n + j) \\
\text{HL}(m, n) &= \frac{1}{2} \sum_{i=0}^{1} \sum_{j=0}^{1} (-1)^i \mathcal{R}(2m + i, 2n + j) \\
\text{HH}(m, n) &= \frac{1}{2} \sum_{i=0}^{1} \sum_{j=0}^{1} (-1)^{i+j} \mathcal{R}(2m + i, 2n + j)
\end{aligned}
\end{equation}
where $(m, n)$ are the coordinates in the transformed domain.
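
As a small worked illustration of Eqn.~\eqref{sub_band_m_n} (the pixel values are hypothetical and chosen only to keep the arithmetic simple), consider a single $2 \times 2$ block with $\mathcal{R}(2m, 2n) = 4$, $\mathcal{R}(2m, 2n+1) = 2$, $\mathcal{R}(2m+1, 2n) = 2$, and $\mathcal{R}(2m+1, 2n+1) = 0$. Writing $\Sigma_{\mathcal{S}}$ for the signed double sum that appears in the expression for sub-band $\mathcal{S}$ (a shorthand introduced here), we obtain
\[
\begin{aligned}
\Sigma_{\text{LL}} &= 4 + 2 + 2 + 0 = 8, & \Sigma_{\text{LH}} &= 4 - 2 + 2 - 0 = 4, \\
\Sigma_{\text{HL}} &= 4 + 2 - 2 - 0 = 4, & \Sigma_{\text{HH}} &= 4 - 2 - 2 + 0 = 0,
\end{aligned}
\]
and each sub-band coefficient is the corresponding sum multiplied by the common prefactor of Eqn.~\eqref{sub_band_m_n}. This block changes by the same amount along its rows and its columns and has no diagonal component, so the LH and HL coefficients coincide and the HH coefficient vanishes.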

Each sub-band captures a specific frequency response. Notably, deepfake videos exhibit unnatural energy distributions across these sub-bands, which serve as the basis of classification in this research. The energy of each sub-band is computed following
Eqn.~\eqref{sub_band_energies}:
\begin{equation}\label{sub_band_energies}
\mathcal{E}_\mathcal{S} = \sum_{m} \sum_{n} \mathcal{S}^2 (m, n), \quad \forall \mathcal{S} \in \{\text{LL}, \text{LH}, \text{HL}, \text{HH}\}
\end{equation}
or, in simpler terms, $\displaystyle{\mathcal{E}_\mathcal{S} = \sum_{j \in \mathcal{S}} \mathcal{S}_j^2}$, $\forall \mathcal{S} \in \{\text{LL}, \text{LH}, \text{HL}, \text{HH}\}$.
Thus, the feature vector corresponding to the input video $v$ is $\displaystyle{\mathbf{F}_v = [\mathcal{E}_{\text{LL}}, \mathcal{E}_{\text{LH}}, \mathcal{E}_{\text{HL}}, \mathcal{E}_{\text{HH}}]}$. 
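
A simple sanity check on these features can be sketched as follows, assuming that $H$ and $W$ are even and writing $\kappa$ for the common prefactor of Eqn.~\eqref{sub_band_m_n}; the symbols $\kappa$ and $r_{ij}$ are introduced here only for illustration. For any $2 \times 2$ block with entries $r_{00}, r_{01}, r_{10}, r_{11}$, the four sign patterns are mutually orthogonal, so all cross terms cancel:
\[
\begin{aligned}
&(r_{00}+r_{01}+r_{10}+r_{11})^2 + (r_{00}-r_{01}+r_{10}-r_{11})^2 \\
&\quad + (r_{00}+r_{01}-r_{10}-r_{11})^2 + (r_{00}-r_{01}-r_{10}+r_{11})^2 \\
&\quad = 4\left(r_{00}^2 + r_{01}^2 + r_{10}^2 + r_{11}^2\right).
\end{aligned}
\]
Summing over all blocks of $\mathcal{R}$ gives
\[
\mathcal{E}_{\text{LL}} + \mathcal{E}_{\text{LH}} + \mathcal{E}_{\text{HL}} + \mathcal{E}_{\text{HH}} = 4\kappa^2 \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} \mathcal{R}^2(x,y),
\]
so the four features partition a fixed multiple of the total energy of $\mathcal{R}$; for the orthonormal choice $\kappa = 1/2$, that multiple is exactly one. The features therefore describe how the energy of the aggregated frame is distributed across the sub-bands.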

The feature extraction stage of the WaveDIF pipeline is common to both the training and inference phases. In the training phase, the labeled dataset $\{\mathbb{V}, \ell\}$ is passed through feature extraction, where $\mathbb{V}$ is a vector of videos and $\ell$ is the vector of labels corresponding to the videos in $\mathbb{V}$; a deterministic linear boundary equation is then trained on the extracted features. In the inference phase, an unseen video $\mathcal{V} \notin \mathbb{V}$ is taken as input, its features are extracted, and they are passed through the learned boundary equation to predict a label $l \in \{\texttt{ORIGINAL}, \texttt{DEEPFAKE}\}$.









\subsection{Classification based on the extracted features}
In the second stage of the WaveDIF pipeline, the extracted features are used to train a deterministic model (if phase == \texttt{TRAINING}), and the trained model is used to give a verdict for any input video (if phase == \texttt{INFERENCE}). During training, the features corresponding to each video $v$ in the labeled dataset $\mathbb{V}$ are extracted and fused together to form the model feature vector, as in Eqn.~\eqref{model_fv}:
\begin{equation}\label{model_fv}
\mathbf{F}_{\mathbb{V}} = \bigoplus_{v \in \mathbb{V}} \mathbf{F}_{v}, \quad \displaystyle{\mathbf{F}_v = [\mathcal{E}_{\text{LL}}, \mathcal{E}_{\text{LH}}, \mathcal{E}_{\text{HL}}, \mathcal{E}_{\text{HH}}]} 
\end{equation}
Next, the model feature vector and the associated labels $(\mathbf{F}_{\mathbb{V}}, \ell)$ are used to train a linear regression model with a scalar output, i.e., to learn
the model weights $\Theta = [\theta_1, \theta_2, \theta_3, \theta_4]^\text{T}$ and the bias $\beta$ as per Eqn.~\eqref{boundary_eq}:
\begin{equation}\label{boundary_eq}
    \mathcal{B}(\mathbf{F}_{\mathbb{V}}) = \theta_1 \cdot \mathcal{E}_{\text{LL}} + \theta_2 \cdot \mathcal{E}_{\text{LH}} + \theta_3 \cdot \mathcal{E}_{\text{HL}} + \theta_4 \cdot \mathcal{E}_{\text{HH}} + \beta
\end{equation}

Further, using the same set of features and labels, a
logistic regression model is trained to obtain a threshold $T$.
Thus, the trained model (inclusive of the threshold) is as per
Eqn.~\eqref{final_model}:
\begin{equation}\label{final_model}
    l = \begin{cases}
\texttt{ORIGINAL}, & \text{if } \Theta^\text{T} \mathbf{F}_{\mathcal{V}} + \beta \geq T \\
\texttt{DEEPFAKE}, & \text{otherwise}.
\end{cases}
\end{equation}
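One way to read the role of the logistic regression step is sketched below. This is an interpretive illustration and not a statement of the exact training procedure; the parameters $a$ and $b$, the score $s$, and the choice of \texttt{ORIGINAL} as the positive class are assumptions introduced here. Let $s = \Theta^\text{T} \mathbf{F}_{\mathcal{V}} + \beta$ be the linear score, and suppose that a logistic model $p(s) = 1 / \left(1 + e^{-(a s + b)}\right)$ with $a > 0$ estimates the probability that a video is \texttt{ORIGINAL}. Then
\[
p(s) \geq \tfrac{1}{2} \iff a s + b \geq 0 \iff s \geq -\frac{b}{a},
\]
so, under these assumptions, thresholding the probability at $1/2$ is equivalent to the rule of Eqn.~\eqref{final_model} with $T = -b/a$.
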
In the inference phase, for an unseen video $\mathcal{V} \notin \mathbb{V}$,
the trained model (Eqn.~\eqref{final_model}) is used.
Fig.~\ref{fig:architecture} gives a pictorial illustration of
WaveDIF, and Algorithm~\ref{alg1} summarizes the workflow.














