\section{Method} \label{sec:method}

\subsection{Preliminary Study of PAG for Text Classification}
In this preliminary experiment, we investigate the application of Perceptually Aligned Gradients (PAG)~\cite{ganz2023perceptually} to sentence classification using hidden state representations from the DistilBERT language model~\cite{sanh2019distilbert}. While PAG has primarily been explored in the context of image classification, we adapt the methodology to the hidden state space of a transformer model to examine its effects on robustness and interpretability in NLP tasks. The core idea of PAG is to encourage gradients to align with semantically meaningful directions, and we hypothesize that this can also lead to more robust and interpretable text representations.

To test this hypothesis, we ran a proof-of-concept experiment using a classifier trained on top of the hidden state associated with the \texttt{[CLS]} token,
adopting the \texttt{distilbert-base-multilingual-cased} checkpoint, on the Amazon Review Multi dataset~\cite{keung2020multilingual}.
We considered 12 classes, each combining one of four languages (English, German, Spanish, and French) with one of three review ratings (1, 3, and 5 stars).


\minisection{PAG Application} To incorporate PAG, we extend the standard cross-entropy objective with a regularization term that enforces alignment between the gradient of the model output and a ``proxy'' target direction. The resulting loss for a classifier built on top of a frozen DistilBERT backbone is defined as:
%
\begin{equation}
\begin{split}
\Loss &= \Loss_{CE}(f_\theta(\bx), y) + \lambda\, \Loss_{PAG}(\bx), \\
\Loss_{PAG}(\bx) &= \frac{1}{C} \sum_{c = 1}^{C} \left(1 - 
\frac{
\nabla_{\bh} f_{\theta}(\bx)_c^\top\, g(\bx, c)
}{
\left\|\nabla_{\bh} f_{\theta}(\bx)_c\right\| \, \left\|g(\bx, c)\right\|
}
\right).
\end{split}
\label{eq:pag_classification_loss}
\end{equation}
%
Here, $\Loss_{PAG}$ penalizes misalignment via the cosine distance between the gradient of the classifier output with respect to the hidden representation and a proxy direction.

\paragraph{Notation.}
\begin{itemize}
    \item $\bx$ denotes the input sentence.
    \item $y \in \{1, \dots, C\}$ is the ground-truth class label.
    \item $f_{\theta}(\bx)$ is the classifier operating on the DistilBERT hidden representation $\bh$.
    \item $\Loss_{CE}$ denotes the standard cross-entropy loss.
    \item $\lambda \geq 0$ controls the strength of the PAG regularization.
    \item $C$ is the number of classes.
    \item $\nabla_{\bh} f_{\theta}(\bx)_c$ is the gradient of the logit corresponding to class $c$ with respect to $\bh$.
    \item $g(\bx, c)$ denotes the proxy target direction associated with class $c$.
\end{itemize}
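
As an illustrative special case (an assumption made here for exposition only, since the architecture of the classifier head is not specified above), suppose the classifier is a single linear layer with hypothetical weight matrix $W$ and bias $b$. The gradient of each logit is then independent of the input, and each summand of $\Loss_{PAG}$ reduces to the cosine distance between a row of $W$ and the proxy direction:
%
\begin{equation*}
f_{\theta}(\bx) = W \bh + b
\;\Longrightarrow\;
\nabla_{\bh} f_{\theta}(\bx)_c = \mathbf{w}_c,
\qquad
\Loss_{PAG}(\bx) = \frac{1}{C} \sum_{c = 1}^{C} \left(1 -
\frac{\mathbf{w}_c^\top\, g(\bx, c)}{\left\|\mathbf{w}_c\right\| \, \left\|g(\bx, c)\right\|}
\right),
\end{equation*}
%
where $\mathbf{w}_c^\top$ denotes the $c$-th row of $W$. Since each cosine similarity lies in $[-1, 1]$, every summand lies in $[0, 2]$, and therefore $\Loss_{PAG}(\bx) \in [0, 2]$ regardless of the classifier; the loss vanishes only when every class gradient is a positive multiple of its proxy direction.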

\minisection{Proxy Ground-Truth Gradient}
We define the proxy target direction (PAG variant) as the difference between the hidden representation $\bu_{y}$ of a randomly sampled sentence from the same class $y$ and the hidden representation $\bh_{\bx}$ of the input sentence:
%
\begin{equation}
g(\bx, y) = \bu_{y} - \bh_{\bx}.
\label{eq:pag}
\end{equation}
%
This construction encourages gradients to align with directions that connect samples within the same class in representation space, thereby promoting semantically consistent updates.
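
A first-order argument may help to see why this alignment is desirable. The following is a sketch under stated assumptions: with a slight abuse of notation we write $f_{\theta}(\bh)$ for the classifier evaluated at a hidden representation $\bh$, we introduce a hypothetical step size $t > 0$, and we neglect higher-order terms. Expanding the logit of class $y$ around $\bh_{\bx}$ gives
%
\begin{equation*}
f_{\theta}\big(\bh_{\bx} + t\, g(\bx, y)\big)_y
\;\approx\;
f_{\theta}(\bh_{\bx})_y + t\, \nabla_{\bh} f_{\theta}(\bx)_y^\top\, g(\bx, y)
\;=\;
f_{\theta}(\bh_{\bx})_y + t\, \left\|\nabla_{\bh} f_{\theta}(\bx)_y\right\| \, \left\|g(\bx, y)\right\| \cos\phi,
\end{equation*}
%
where $\phi$ is the angle between the gradient and the proxy direction. Minimizing the corresponding term of $\Loss_{PAG}$ drives $\cos\phi$ toward $1$, so that, to first order, a small step from $\bh_{\bx}$ toward the same-class sample $\bu_{y}$ increases the class-$y$ logit as much as any step of equal norm could.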

We also consider an alternative formulation, referred to as \textbf{Identity}, in which the proxy direction is defined directly in the input space:
%
\begin{equation}
g(\bx, y) = \bx.
\label{eq:pag-id}
\end{equation}
%
In this case, the model is encouraged to reconstruct the input from the induced gradients, effectively enforcing self-alignment.

As a reference, we include a baseline model trained with identical architecture and hyperparameters but without PAG regularization (i.e., $\lambda = 0$), thereby isolating the effect of $\Loss_{PAG}$.

\begin{table}
    \centering
    \caption{Robustness of classifier models with PAG variants under APGD, Square, and FGSM attacks. Higher percentages indicate stronger robustness.}
    \label{tab:pag_variants_multiclass}
    \resizebox{0.6\linewidth}{!}{
    \begin{tabular}{lccccc}
    \toprule
    attack $\rightarrow$ &  \multicolumn{2}{c}{\textbf{APGD}} & \textbf{Square} & \multicolumn{2}{c}{\textbf{FGSM}} \\
    &  \multicolumn{2}{c}{\cite{croce2020reliable}} & \cite{croce2020reliable} & \multicolumn{2}{c}{\cite{goodfellow2014explaining}} \\
    & \multicolumn{2}{c}{$\varepsilon$} &  & \multicolumn{2}{c}{$\alpha$} \\
    $g(\bx)$ $\downarrow$ & $1\text{e-}3$ & $0.5$ & &  $5\text{e-}3$ & $1\text{e-}2$ \\
    \midrule
    Baseline & 36.5\% & 31.2\% & 36.3\% &  27.3\% & 8.9\% \\
    Identity & 28.3\% & 25.0\% & 27.2\% &  25.7\% & 8.0\% \\
    \textbf{PAG} & \textbf{48.1\%} & \textbf{45.0\%} & 
    \textbf{49.3\%} & \textbf{43.5\%} & \textbf{25.7\%} \\
    \bottomrule
    \end{tabular}
    }
\end{table}

\minisection{Evaluation}
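All models were attacked with APGD, Square~\cite{croce2020reliable}, and FGSM~\cite{goodfellow2014explaining}. According to the results in Table~\ref{tab:pag_variants_multiclass}, the most robust model is the one trained with the full PAG loss using the proxy direction of Equation~\ref{eq:pag}, which encourages the gradients at the input point to align with the direction toward samples of the same class. It outperforms the baseline under every attack setting (e.g., 48.1\% versus 36.5\% under APGD with $\varepsilon = 1\text{e-}3$, and 25.7\% versus 8.9\% under FGSM with $\alpha = 1\text{e-}2$), whereas the Identity variant falls below the baseline in all settings.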

