%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent



%%%cross-reference in main.file


\newcommand{\hq}[1]{\textcolor{black}{#1}}%pink %revise 
\newcommand{\lin}[1]{\textcolor{black}{#1}}%pink %revise

\newcommand{\HQ}[1]{\textcolor{black}{#1}}%pink %revise 

\usepackage{xr}
% \usepackage{xr-hyper}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage[normalem]{ulem}
\usepackage[american]{babel}
\usepackage{algorithm}
\usepackage{comment}
\usepackage{algorithmicx}
\usepackage{bm}
\usepackage{enumitem}
\usepackage{url}
\usepackage{amsfonts,amssymb} 
\usepackage{natbib}
\usepackage{amsmath}
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography

% \usepackage[british]{babel}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\linespread{0.97}
\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}


% In your preamble

% \makeatletter
% \newcommand*{\addFileDependency}[1]{% argument=file name and extension
%   \typeout{(#1)}
%   \@addtofilelist{#1}
%   \IfFileExists{#1}{}{\typeout{No file #1.}}
% }
% \makeatother


% \newcommand*{\myexternaldocument}[1]{%
%     \externaldocument{#1}%
%     \addFileDependency{#1.tex}%
%     \addFileDependency{#1.aux}%
% }

% \myexternaldocument{uai2022-template/yan_670-supp}

\title{Addressing Token Uniformity in Transformers via Singular Value Transformation}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:Hanqi.yan@warwick.ac.uk}{Hanqi Yan}{}}
\author[1]{\href{mailto:Lin.gui@warwick.ac.uk}{Lin Gui}{}}
\author[2]{\href{mailto:cswjli@comp.polyu.edu.hk}{Wenjie Li}{}}
\author[1,3]{\href{mailto:Yulan.He@warwick.ac.uk}{Yulan He}{}}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science, University of Warwick, United Kingdom
}
\affil[2]{%
        Department of Computing, The Hong Kong Polytechnic University, China
}
  
\affil[3]{%
    The Alan Turing Institute, United Kingdom
}

\begin{document}
\maketitle
% \begin{abstract}
% Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through stacked multiple self-attention layers in a transformer. % to gather important information while smaller components become more negligible. 
% In this paper, we propose to use the distribution of singular values of outputs of each transformer layer to characterise the phenomenon of token uniformity and design a singular value transformation function to alleviate this problem. More concretely, we first show in a toy example that full rank of the transformer layer outputs does not adequately address the token uniformity problem. The skewness of the singular value distribution of layer-wise outputs plays an important role in defining the shape of the learned embedding space. % in transformer-based structures. 
% We then empirically illustrate that a less skewed singular value distribution can lead to more diverse token representations, alleviating the `token uniformity' problem. %semantic clusters, rather than a aggregated token cluster in original BERT. Next, 
% Base on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in a variety of NLP tasks including text classification, information extraction, question-answering and language modelling. 
% \end{abstract}

\begin{abstract}
Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after passing through multiple stacked self-attention layers in a transformer. 
In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity, and empirically illustrate that a less skewed singular value distribution can
alleviate the `token uniformity' problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. 
Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at \url{https://github.com/hanqi-qi/tokenUni.git}. %a variety of NLP tasks including text semantic similarity measurement, text classification, paraphrase detection, question-answering and language modelling.~
%is available at \url{https://github.com/hanqi-qi/tokenUni/}}  
\end{abstract}

\section{Introduction}
In Natural Language Processing (NLP), approaches built on the transformer architecture have achieved state-of-the-art performance in many tasks~\citep{DBLP:conf/uai/VeitchSB20}. % such as machine translation, text classification and question answering. 
% The transformer model consists of repeated blocks of multi-layer perceptrons (MLPs), skip connections and self-attention layers. 
\hq{However, recent work identified an anisotropy problem in
language representations generated by transformer-based deep models~\citep{DBLP:conf/emnlp/Ethayarajh19,DBLP:conf/iclr/GaoHTQWL19,DBLP:conf/emnlp/LiZHWYL20}, i.e., the learned embeddings occupy a
narrow cone in the representation space. Such an anisotropic shape is
very different from what would be expected in an expressive embedding space~\citep{DBLP:journals/tacl/AroraLLMR16,DBLP:conf/iclr/MuV18}}. This problem is also referred to as \textit{token uniformity} or \textit{information diffusion}, i.e., different tokens share a large proportion of similar information after passing through multiple stacked self-attention layers in a transformer. \citet{pmlr-v119-goyal20a} showed that using different transformer-encoded tokens in an input sample as the classification unit achieves almost the same result.
% (\sout{Many studies have been carried out to analyse the internal mechanism of the transformer in order to understand its limitations.)}

\hq{%\citet{DBLP:conf/iclr/GaoHTQWL19} and \citet{DBLP:conf/naacl/BisPL21} discovered that, with weight tying, i.e., by sharing the parameters of the embedding layerand the softmax layer, the learned word embeddings are positively correlated and spread in a narrow cone. 
Recently, \citet{DBLP:journals/corr/abs-2103-03404} found that the outputs of pure self-attention networks, i.e., transformers without skip-connections and MLPs, converge to a rank-one matrix, and that such rank deficiency can lead to token uniformity. They therefore concluded that skip-connections and MLPs help alleviate the token uniformity problem. However, in our experiments, we still observe the token uniformity problem in the full transformer model with self-attention layers, skip-connections and MLPs, even when its output hidden state matrices are full-rank. %found that self-attention possesses an inductive bias towards ``token uniformity'': By leveraging the cascading effects of stacking self-attention modules, they finds that without skip connections or multi-layer perceptions (MLPs), the output of pure self-attention Networks (SANs) converges doubly exponentially (with depth) to a rank-1 matrix (a.k.a. degeneration). 
}

% \hq{To tackle the \textit{token uniformity} problem, one possible method is to map the output embeddings from transformers to an isotropic distribution by %sophisticated network or simple 
% post-processing~\citep{DBLP:conf/emnlp/LiZHWYL20,DBLP:journals/corr/abs-2103-15316}. Another possible solution is to add regularization to directly increase the size of the aperture of the cone that word embeddings fall into in order to mitigate the token uniformity problem %force the output embedding fall apart is another popular line of studies~
% \citep{DBLP:conf/iclr/GaoHTQWL19}. %,DBLP:conf/iclr/Wang0HHWG20,DBLP:conf/icml/0001I20}. 
% Contrasting learning can also alleviate the anisotropy problem both theoretically and empirically~\citep{DBLP:conf/iclr/CarlssonGGHS21,DBLP:conf/emnlp/GaoYC21}.} %Some studies focus on addressing the rank deficiency issued introduced by Softmax bottleneck~\citep{DBLP:conf/iclr/YangDSC18,DBLP:journals/corr/abs-2007-00992}.}
% \hq{ %Although ~\citet{DBLP:journals/corr/abs-2103-03404} report a serious token uniformity issue induced by rank deficiency, their observation is based on an assumption that the transformer is free from MLP(s) and Skip connections, which is not the case in practice. Moreover, we still observe the token uniformity problem in the full transformer (with skip-connections and MLPs), even when its output hidden state matrices are full-rank. We conduct our analysis in a more realistic setting, and arrive at different conclusions.
% }
\hq{In this paper, we instead investigate the token uniformity problem by exploring the distribution of singular values of the transformer-encoded hidden states of input samples. %(sentence-level). 
Our analysis reveals that the learned embedding space is a high-dimensional cone-like hypersphere bounded by the singular values. Moreover, a skewed distribution of singular values %, though can be full-rank, but 
is indicative of token uniformity (see \S\ref{sec:singular_theo}). Therefore, making the distribution less skewed towards small singular values can help alleviate the token uniformity issue. Unlike existing methods \citep{DBLP:conf/iclr/GaoHTQWL19,DBLP:conf/iclr/Wang0HHWG20} that implicitly or explicitly shape the spectrum of the output embedding matrix during training by adding a regularisation term to control the singular value decay, we %defines several desirable properties of a transformation function to be applied to the singular values of the layerwise outputs in a transformer and 
propose a novel approach that addresses token uniformity by smoothing the singular value distribution (see \S\ref{sec:desirable properties}). To verify the effectiveness of our proposed singular value transformation function in transformer-based structures, we apply it to four commonly used large-scale pretrained language models (PTLMs). In particular, the singular value distribution of the final layer output from a PTLM is modified using our proposed transformation function. The transformed singular values are then used to reconstruct the hidden states in the last layer of the PTLM, which are subsequently used for prediction in downstream tasks. Our extensive experiments on a variety of NLP tasks, including semantic textual similarity evaluation and a range of General Language Understanding Evaluation (GLUE) tasks~\citep{DBLP:conf/iclr/WangSMHLB19} across thirteen datasets, show that our proposed transformation function can effectively reduce the skewness of singular value distributions in qualitative analysis and achieve noticeable performance gains.}


% \sout{One potential limitation of the transformer is \textit{token uniformity} or \textit{information diffusion}, which has been observed in many existing studies, i.e., different tokens share a large proportion of similar information after going through stacked multiple self-attention layers in a transformer. \citet{pmlr-v119-goyal20a} showed that using different transformer-encoded tokens in an input sample as a classification unit can achieve almost the same result. \citet{DBLP:journals/corr/abs-2103-03404} found that pure self-attention networks, i.e., transformers without skip-connections and MLPs, have their outputs converging to a rank one matrix, and such rank deficiency can lead to token uniformity. They therefore concluded that skip-connection and MLP help alleviate the token uniformity problem. However, in our experiments, we still observe the token uniformity problem in the full transformer model with self-attention layers, skip-connections and MLPs, even when its output hidden state matrices are full-rank.} 

% \sout{In this paper, we investigate the token uniformity problem via exploring the distribution of singular values of the transformer-encoded hidden states of input samples. Our analysis indicates that the learned embedding space is a high-dimensional cone-like hypersphere which is bounded by the singular values. %, especially small singular values. 
% Also, skewed probability distribution of singular values is indicative of token uniformity. Making the distribution less skewed towards small singular values can help alleviate this issue.} 

% \sout{Furthermore, we define several desirable properties of a transformation function to be applied to the singular values of the layerwise outputs in a transformer and propose a novel approach to address the token uniformity via smoothing the singular value distribution. }

%  \sout{In order to verify the effectiveness of our proposed singular value transformation function in transformer-based structure, we apply it to four commonly used large-scale pretrained language models (PTLMs). In particular, the singular value distribution of the final layer output from a PTLM is modified using our proposed transformation function. Then, we use the transformed singular values to reconstruct the hidden states in the last layer of the PTLM, which are subsequently used for prediction in downstream tasks. Our extensive experiments on a variety of NLP tasks including semantic textual similarity evaluation and a range of GLUE tasks across thirteen datasets show that our proposed transformation function can effectively reduce the skewness of singular value distributions in qualitative analysis and achieve noticeable performance gains.}

Our contributions can be summarised as follows:
\setlist{nolistsep}
\begin{itemize}%[noitemsep]
 \item %By empirical study on different layers of BERT feature, 
 We present both a geometric interpretation and an empirical study of the token uniformity problem. Based on our observations, we design a set of desirable properties of singular value distributions and propose a singular value transformation function to alleviate the token uniformity issue.
 \item We have proposed a range of methods to evaluate the transformed features in terms of uniformity and the preservation of the local neighbourhood structure. %The latter is firstly proposed in this paper in order to prevent a perfect uniform but less discriminative feature space.
\item  Our proposed transformation function has been applied to four widely-used PTLMs and evaluated on both unsupervised and supervised tasks. The results demonstrate the effectiveness of our proposed method in addressing the token uniformity problem while preserving the local neighbourhood structure in the original embedding space. 
%\item The visualization of transformed feature space show that a desirable feature space should make all the singular values contribute equally, instead, maintain the data manifolds learnt from PTLMs is also significant. 
\end{itemize}

\section{Related Work}
Transformer-based masked language models, such as BERT~\citep{devlin2018bert}, ALBERT~\citep{lan2019albert}, RoBERTa~\citep{liu2019roberta} and DistilBERT~\citep{sanh2019distilbert}, have achieved significant success in NLP. However, token uniformity, i.e., different tokens sharing similar representations, is commonly observed as network depth increases. Many studies \citep{DBLP:conf/icml/GaneaGBS19,DBLP:conf/nips/YangLSL19}
claimed that token uniformity is caused by rank collapse of the layer-wise outputs, because the transformer architecture learns the token representation based on the normalised weighted sum of the context representations. %(i.e., \textit{self-attention}), and they turned to softmax alternatives.
% in which weights are normalised by a softmax function \citep{devlin2018bert}.

% The distribution of the softmax function usually has only few peaks in practice, it may cause rank collapse of the outputs from the self-attention layers.
% \citet{DBLP:conf/iclr/YangDSC18} proposed to maintain the rank of the output matrices after self-attention. 
% \citep{DBLP:journals/corr/abs-2007-00992} designed an expansion layer for softmax-based weighted sum by Hadamard Product. 
% \citet{DBLP:journals/corr/abs-2103-03404} pointed out that some existing network components, including using MLP %\citep{DBLP:conf/emnlp/WangC20a} 
% and skip-connections %\citep{DBLP:conf/cvpr/HeZRS16} 
% could solve the rank collapse issue in transformers.

Another line of work, which observed token uniformity in empirical studies, argued that desirable word representations should be isotropic and focused on studying the distribution of the learned embeddings~\citep{DBLP:conf/iclr/MuV18}. \citet{DBLP:conf/iclr/GaoHTQWL19} and \citet{DBLP:conf/naacl/BisPL21} defined the problem as \textit{`representation degeneration'} and gave a theoretical analysis, which attributes this phenomenon to the frequencies of rare words.
% A similar finding was drawn in that an ideal word feature space should be isotropic so that all the singular values in different directions should be almost the same size.
\citet{DBLP:conf/iclr/Wang0HHWG20} proposed to add an exponential decay term to the training objective so as to control the singular value distribution. All the aforementioned work focused on token-level features and tasks. More recent work argued that sentence-level features can also be anisotropic due to the anisotropy in word features. 
Contrastive learning can also alleviate the anisotropy problem both theoretically and empirically~\citep{DBLP:conf/iclr/CarlssonGGHS21,DBLP:conf/emnlp/GaoYC21,DBLP:conf/uai/GuY21}.
%~\citep{DBLP:conf/emnlp/LiZHWYL20,DBLP:journals/corr/abs-2103-15316,DBLP:journals/corr/abs-2104-01767}. 
\citet{DBLP:conf/emnlp/LiZHWYL20} proposed BERT-flow to transform the representations learned by PTLMs into a standard Gaussian distribution in order to make the token/sentence representations isotropic. Other researchers turned to whitening-based post-processing strategies to normalise sentence embeddings and obtain an almost perfectly isotropic representation distribution~\citep{DBLP:journals/corr/abs-2103-15316,DBLP:journals/corr/abs-2104-01767}. 
%It proposed a \textit{Bert-flow} model that learn to transform the vanila feature space into one satisfying a standard Gaussian distribution. These work always evaluate the proposed methods on unsupervised semantic similarity task, which does not rely on task-specific classifier.

%In this paper, 
We argue that, on the one hand, addressing rank collapse does not necessarily solve the token uniformity problem, as it is still observed even with network components such as skip-connections, which can guarantee a full-rank feature space \citep{DBLP:journals/corr/abs-2103-03404}. On the other hand, while whitening methods can effectively solve the token uniformity problem, they fail to preserve the local neighbourhood structure of the original embedding space, which is important for downstream tasks. We propose a novel singular value transformation function which can alleviate token uniformity and at the same time preserve the local neighbourhood structure.

% \hq{ADD difference between our model and important baselines}

% Therefore, we propose to  use small singular values to construct $\epsilon$-net to bound the embedding space, which is full rank but with some nearly vanished dimensions. 

\section{Singular Value Distribution of Transformer Block Outputs}
\label{sec:singular}

In a typical transformer block $\ell$, assuming the input token is  $x^{l-1}$, the information propagation process is given by:
% {\small
\begin{gather*}
    v^{l} = \text{Self-Attention}(x^{l-1}), \quad
    \Phi(v^{l}) = \varphi (W^{l}v^{l}+b^{l}), \\
    x^{l} = \text{LayerNorm}(\Phi(v^{l})+x^{l-1}) 
\numberthis
\label{eqn:self-attention}
\end{gather*}
% }

where $\varphi$ is an element-wise nonlinear function applied to a feed-forward layer whose weight matrix $W^{l}\in \mathbb{R}^{n_{l}\times{n_{l-1}}}$ transforms the feature dimension from $n_{l-1}$ to $n_{l}$. The function $\mbox{Self-Attention}(x^{l-1})$ returns the weighted sum of the value vectors of all input representations, where the weights are derived by multiplying the query vector of the current input $x^{l-1}$ with the key vectors of the other inputs. Between every two transformer blocks, there is a skip-connection and a layer normalisation. The former bypasses the transformer block $\ell$ and adds the input $x^{l-1}$ directly to the output $\Phi(v^{l})$ of this block, while the latter normalises its input across the feature dimension.
%We consider the intermediate hidden state matrix $v^{l}$ (i.e., the output from the $\ell$th transformer block). %, whose squared singular value is the eigenvalue of its data covariance matrix $\chi = vv^{\intercal}$.

\noindent %In this section, we discuss how to describe the distribution of representation space for transformer based models, which is based on the linear projection based key/query/value weight matrix, and softmax based attention mechanism, to capture the feature for input text sequence recurrently. 
Taking BERT as an example, we assume that the input of the model is $X = x_1 \oplus x_2 \oplus \cdots \oplus x_m$, $X\in \mathbb{R}^{n \times m}$, where $x_i \in \mathbb{R}^{n}$, $m$ is the number of tokens in an input sentence (we do not distinguish special tokens such as $[\text{CLS}]$), and $\oplus$ is the concatenation operator. The output of a transformer block $\ell$ is denoted as $X^{l} \in \mathbb{R}^{n_{l}\times m}$, where $n_{l}$ is the dimension of the output features in layer $\ell$. Without loss of generality, we assume $n_{l}>m$ for all layers, since the embedding size for tokens is larger than the maximum sentence length in most BERT models. 

Existing work has mainly focused on the rank of the features learned by a transformer-based language model. 
% \citet{DBLP:journals/corr/abs-2103-03404} presented the following bound:}
% \begin{equation}{
%     || \texttt{res}(X^l)||_{1,\infty} \leq (\frac{4 \beta H \lambda}{\sqrt{d_{qk}}})^{\frac{3\ell-1}{2}} ||\texttt{res}(X)||^{3\ell}_{1,\infty},
%     }
%     \label{eq:bound}
% \end{equation}
% \noindent \lin{where $\beta$ is the norm boundary of the key/query/value weight matrices, $H$ is the size of multi-heads in the multi-head attention, $d_{qk}$ is the dimension of the key/query weight matrix, $\text{res}(X) = X - \bm{1}x^\intercal$ is the residual between the representation matrix and a 1-rank matrix where $x= \texttt{argmax}_x||X-\bm{1}x^\intercal||$.} 
For example, \citet{DBLP:journals/corr/abs-2103-03404} stated that, as depth grows in a pure transformer model, the rank of the output representation matrix converges to 1 exponentially. However, in practice, a position embedding matrix, %will keep adding a position embedding matrix, 
which is usually full rank, %aims to encode each position by an unique vector, 
is added to the output representation of each layer. In addition, skip connections are used. Therefore, rank collapse rarely occurs. %convergence cannot be guaranteed by only one-layer mapping in the transformer according to the bound in Equation \ref{eq:bound} when $\ell=1$. 
The empirical cumulative distribution function of singular values from different layers of BERT, derived from real-world data and shown in Figure~\ref{fig:tokenuni_sta}, confirms \HQ{that there is no zero singular value.} %whose value is zero}. 

In this section, we instead study the token uniformity problem through the singular value density of the representation matrix $X$ for a deep network with $\ell$ transformer blocks, rather than through the rank of $X$. Since the distribution of the singular values of $X^{l}$ determines the extent to which the input signals become distorted or stretched as they propagate through the network, 
we aim to understand the entire singular value spectrum of $X^{l}$ in transformers. % with randomly initialised weights and biases. 
In particular, we want to study the degree of skewness of the singular value distribution, because highly skewed distributions indicate strong anisotropy in the embedded feature space, which is the root cause of token uniformity.
%%%%%%%%%%%%%%%%%%%%%%comment by hanqiyqn%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
%\subsection{Information Flow in a Transformer-based BERT Block}




\subsection{Singular Value Vanishing in Transformer}
\label{sec:singular_theo}
%Recently, many studies pointed out the rank collapse problem in the transformer architecture, that is, the learned feature matrix of the tokens in an input sequence is not full rank, $\mbox{rank}(v^{l})<n_{l}$. Hence, the collapsed dimension shrinks the embedding space into a lower dimensional subspace, which may limit the generality of the learned model. % since token embeddings only lay in the subspace with reduced dimensions. 
%The use of skip-connection and position embeddings can ensure the output embeddings of $v^{l}$ is full rank. However, full rank of $v^{l}$ cannot adequately address the token uniformity problem. %guarantee the complete searching of the embedding space by the transformer. 
In this subsection, we give a geometric interpretation of the problem of vanishing singular values in transformers.
%However, simply increasing the rank of $v^{l}$ cannot fully solve collapsed dimension issue, since:
%\begin{itemize}
 %   \item The training tricks, such as skip-connection and position embeddings can ensure the output embeddings of $v^{l}$ is full rank.
 %   \item Full rank of $v^{l}$ cannot guarantee the completely searching of embedding space by transformer. 
%\end{itemize}
%and we give a mathematical interpretation in this subsection.
We assume that $X^{l} \in \mathbb{R}^{n_{l}\times m}$ is a full rank matrix. %with a high probability 
%and the $i$-th column vector represents the $i$-th token embedding 
%, that is, $X^{l} = x_1^{l} \oplus x_2^{l} \oplus ... \oplus x_m^{l}$, where $\oplus$ is the concatenation operator and $x_i^{l} \in \mathbb{R}^{n_{l}}$. 
 We perform SVD on $(X^{l})^\intercal$, i.e., $(X^{l})^\intercal = \mathbf{U} \bm{\Sigma} \mathbf{V}$, where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\bm{\Sigma}$ is the diagonal singular value matrix. Without loss of generality, we sort the singular values in descending order, 
%\begin{equation}
    %\sigma_1({\bf (v^{l})^T}) \geq \sigma_2({\bf (v^{l})^T}) \geq ... \geq \sigma_m({\bf (v^{l})^T}) \geq 0 \nonumber
     $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$, 
%\end{equation}
where $\lambda_k$, $k\in\{1,\cdots,m\}$, is a diagonal element of $\bm{\Sigma}$. We can choose a positive value $C$ such that $\lambda_1\geq \cdots \geq \lambda_k \geq C  \geq \lambda_{k+1}\geq \cdots \geq \lambda_m \geq 0$, which defines two subspaces, denoted as $\mathcal{S}_{[1,k]}^{l}$ and $\mathcal{S}_{[k+1,m]}^{l}$, respectively. For any token embedding $x \in X^{l}$, we can find a point $x'$ in the subspace $\mathcal{S}_{[1,k]}^{l}$ such that the Euclidean distance between them is no larger than $C$.  %separates the token embeddings into two subsets.
%Let $C$ be a positive real value in the range $[0, \sigma_1({\bf (v^{l})^T}) ]$, we see that the singular values and corresponding token embeddings can be separated into two subsets:
% \begin{equation}
%     \sigma_1({\bf (v^{l})^T}) \geq ... \geq \sigma_k({\bf (v^{l})^T}) \geq C  \geq \sigma_{k+1}({\bf (v^{l})^T}) \geq ... \geq 0 \nonumber
% \end{equation}
% \begin{gather}
%  v_{[1,k]}^{l} = v_1^{l} \oplus v_2^{l} \oplus ... \oplus v_k^{l} , \nonumber \\ 
%   v_{[k+1,m]}^{l} = v_{k+1}^{l} \oplus v_{k+2}^{l} \oplus ... \oplus v_m^{l}. \nonumber
% \end{gather}
%We denote the original embedding space spanned by $X^{l}$ as \HQ{a singular value space, denoted as} $\mathcal{X}^{l}$. \HQ{The two subspaces $X_{[1,k]}^{l}$ and $X_{[k+1,m]}^{l}$ separated by $C$ are therefore transformed to be two subspaces in the singular value space, denoted as $\mathcal{X}_{[1,k]}^{l}$, and $\mathcal{X}_{[k+1,m]}^{l}$,} respectively~\footnote{\HQ{$X_{i}$ is the $i$-th token embedding in the input sentence, $\mathcal{X}_{i}$ is the $i$-th largest singular value. $X_{i}$ is the linear combination of several $\mathcal{X}_{j}$ according to the SVD results.}}. 
That is, we are able to establish the following bound\footnote{The proof is shown in Appendix A.}:

\noindent \textbf{Theorem:} $ \forall x \in X^{l}$, $\exists x' \in \mathcal{S}_{[1,k]}^{l}$, where the subspace $\mathcal{S}_{[1,k]}^{l}$ is defined based on $\lambda_k \geq C  \geq \lambda_{k+1}$, such that $\lVert x-x' \rVert_2 \leq C$.
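
\noindent To see why a bound of this form is plausible, the following is a brief sketch under assumptions that are made here for illustration and are not a restatement of Appendix A: we take the thin SVD of $(X^{l})^\intercal$, write $u_{ij}$ for the entries of the orthogonal factor $U$ and $v_j$ for the $j$-th row of the factor $V$, and identify $\mathcal{S}_{[1,k]}^{l}$ with $\mathrm{span}\{v_1,\dots,v_k\}$. The $i$-th token embedding is the $i$-th row of $(X^{l})^\intercal$, so $x_i = \sum_{j=1}^{m} u_{ij}\lambda_j v_j$. Taking $x_i' = \sum_{j=1}^{k} u_{ij}\lambda_j v_j \in \mathcal{S}_{[1,k]}^{l}$ and using the orthonormality of the $v_j$ gives
\begin{align*}
\lVert x_i - x_i' \rVert_2^2
  &= \Bigl\lVert \sum_{j=k+1}^{m} u_{ij}\lambda_j v_j \Bigr\rVert_2^2
   = \sum_{j=k+1}^{m} u_{ij}^2 \lambda_j^2 \\
  &\leq \lambda_{k+1}^2 \sum_{j=k+1}^{m} u_{ij}^2
   \leq C^2,
\end{align*}
where the last step uses $\lambda_{k+1} \leq C$ and the fact that each row of an orthogonal matrix has unit norm.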

According to this result, the embedding space is bounded by two components: the largest singular value in the subspace $\mathcal{S}_{[1,k]}^{l}$, and the upper bound $C$ of the small singular values, which acts as the radius that extends the $k$-dimensional subspace $\mathcal{S}_{[1,k]}^{l}$ into the $m$-dimensional space. % by an $m$-dimensional hypersphere. 
Furthermore, the weights of the $m$ components %is the linear combination 
in the $\ell$-th layer are constrained by self-attention: $\sum_{i=1}^m \alpha_i^l = 1$, which indicates that the embedding space is a spherical cone-like $k$-dimensional hypersphere. %, where the radius is given by the maximum singular value from $\mathcal{V}_{[k+1,m]}^{l}$, and the height is given by the singular values from $\mathcal{V}_{[1,k]}^{l}$. 
This phenomenon has been observed in many studies \citep{DBLP:conf/iclr/GaoHTQWL19,DBLP:conf/emnlp/0004GXMYS20,DBLP:conf/iclr/Wang0HHWG20,DBLP:conf/emnlp/LiZHWYL20}. 
% We illustrate a toy example in Figure \ref{fig:simulate}.

% \begin{figure}[h!]
%     \centering
%     \includegraphics[width=0.4\textwidth]{uai2022-template/3df.png}
%     % \includegraphics[width=0.40\textwidth,trim={100 200 100 150},clip]{uai2022-template/fig1_rep.pdf}
%     \caption{A 3D simulation of 500 randomly generated samples for the illustration of token uniformity caused by small singular values: \textbf{(a)} the original embeddings from self-attention on a Gaussian random matrix. \textbf{(b)} distortion of the embedding space of (a) after one layer of transformer with $C=0.3$ and $k=2$. In both embedding spaces, we can observe the spherical cone-like distribution of embeddings as asserted in the \textit{Theorem}, and the token uniformity after one layer mapping. The two dimensions corresponding to the two least singular values are close to vanish.}
%     \label{fig:simulate}
%     % \vspace{-10pt}
% \end{figure}

%there is a positive value $C$ and corresponding $k$ dimension subspace, where the embedding space is bounded by this subspace and  the singular values $\sigma_k  \geq \sigma_{k+1} $ 
We assume the Probability Density Function (PDF) of the distribution of singular values is $f_p(\lambda)$, where $\lambda$ denotes a singular value in the learned embedding space. We can then obtain the value of $k$ based on the Cumulative Distribution Function (CDF), i.e., the probability mass of singular values larger than $C$, and the number of input tokens $m$:
$k = \left\lceil \int_{C}^{\infty} m\cdot f_p(\lambda)\, d\lambda \right\rceil$.
%\end{equation}

\noindent Therefore, the shape of the embedding space is now decided by two parameters: $C$, the radius of the hypercone, and $m-k$, the number of nearly vanished dimensions when $C$ is small. \lin{However, due to the complex network operations in transformer blocks, it is difficult to derive the exact form of $f_p(\lambda)$. Hence, we resort to an empirical study %\cite{DBLP:conf/icml/ChatterjiPB19} and our observation 
of the singular value distribution, which appears to be an exponential long-tail distribution, and use the skewness to measure the risk of dimension vanishing in the rest of this paper.}
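
\noindent As a purely illustrative case, suppose the density were exactly exponential, $f_p(\lambda)=\beta e^{-\beta\lambda}$ for $\lambda \geq 0$ with rate $\beta>0$. Both this functional form and the numbers below are assumptions made for the example, not fitted values. The expression for $k$ then has a closed form:
\begin{equation}
k = \left\lceil m\int_{C}^{\infty} \beta e^{-\beta\lambda}\, d\lambda \right\rceil = \left\lceil m\, e^{-\beta C} \right\rceil .
\nonumber
\end{equation}
For instance, with $m=128$, $\beta=10$ and $C=0.3$, we have $m\,e^{-\beta C} = 128\, e^{-3} \approx 6.37$, so $k=7$ and the remaining $m-k=121$ dimensions would be nearly vanished.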

%Hence, to guarantee a full rank search space for token embeddings, an ideal PDF $f_p(\lambda)$ should have a small value when $\lambda$ is close to zero. %, as well as the derivative function $f'(\lambda)$. However, due to the complexity of BERT layer, it is difficult to give an estimation of $f(\lambda)$. Hence, we use an empirical study instead of estimation in next section. 
%Due to the complex network operations in transformer blocks (such as position embeddings, self-attention, MLP and skip-connection), it is difficult to derive the actual form of $f_p(\lambda)$. Hence, we resort to an empirical study to understand the singular value distributions in the next subsection.



\subsection{Empirical Study of Token Uniformity in BERT}
\label{sec:tokenuni}
Existing studies have observed the token uniformity issue in PTLMs and skewed singular value distributions of outputs from the intermediate network layers~\citep{pmlr-v119-goyal20a}. Few of them, however, have explored the impact of different shapes of singular value distributions on downstream task performance. %, and existing measurement such as variance and skewness can only reflect the imbalanced distribution of singular values, \hq{far away from defining a desirable singular value distribution}.
Here, taking BERT as an example, we empirically illustrate the singular value distributions of outputs from different intermediate transformer blocks \HQ{on the Corpus of Linguistic Acceptability (CoLA) dataset}.
% ~\footnote{Throughout the paper, singular value is for the token embeddings sequence $X^{n_{\ell}\times m}$ from different model output layers.}. 
% has an impact on token representations
The empirical CDF of singular values of the hidden states (i.e., intermediate token representations) from BERT layers 2, 4, 6, 8, 10 and 12 is shown in Figure \ref{fig:cdf}. It clearly reveals that the CDF is steeper closer to the origin, which indicates that the probability of a singular value $\lambda$ being less than a small $x$ is high ($F(x) = \Pr(\lambda<x)$). 
% Hence, according to Figure \ref{fig:simulate}, even a small $C$ can cover a large number of nearly vanished dimensions.

With the increase of network depth, the CDF curve tends to be steeper, indicating that the shape of the spherical cone-like embedding space tends to be long and narrow, leading to token uniformity. 
% as previously shown in Figure \ref{fig:simulate}. 
In the last layer for the prediction (i.e., Layer 12), the representation is projected to an embedding space guided by supervised label information, therefore showing a lower degree of token uniformity. Nevertheless, simply leveraging the supervision from labels is not enough to address the problem of vanished dimensions during deep network training. %The curve of final layer is still above majority of layers.

\begin{figure}[ht]
\centering
    \includegraphics[trim=4 2 3 30,clip,width=0.35\textwidth]{cola_alllayers_noorigin.pdf}
    \caption{The empirical CDF of singular values from different layers of BERT (on the GLUE-CoLA dataset), $x$-axis: normalised singular values; $y$-axis: CDF of singular values. A flatter curve indicates a more balanced distribution of singular values. For a given $x$, a larger $F(x)$ means that a higher percentage of singular values are less than $x$. Hence, the top curve in this figure indicates more vanished dimensions in the embedding space.}
    \label{fig:cdf}
    % \vspace{-10pt}
\end{figure}

 \hq{We calculate the average cosine similarity between every token pair, and $[\text{CLS}]$ tokens from different BERT layers as a proxy measure of token uniformity. Figure~\ref{fig:tokenuni_sta} shows that the skewness of singular values and the token uniformity measurements increase as the network layers go deeper. We also observe the gradual vanishing of smaller singular values, as the median of the singular values decreases drastically (from 0.12 to 0.0397). Our results empirically show that the degree of skewness of singular value distributions is indicative of the degree of token uniformity.}
 
\begin{figure}
    \centering
    \includegraphics[width=0.35\textwidth,trim={80 256 583 130},clip]{uai2022-template/sv_differentlayers.pdf}
    \caption{\hq{Singular value distribution of the outputs from BERT layers 0, 7 and 12 (from top to bottom) on the GLUE-MRPC dataset. The skewness (third standardised moment), token uniformity and $[\text{CLS}]$ uniformity values increase as the BERT layers go deeper, while the median of singular values decreases drastically, close to vanishing.} }
    \label{fig:tokenuni_sta}
% \vspace{-5pt}
\end{figure}

% \begin{figure}[h]
% \centering
%     \includegraphics[trim=8 8 3 10,clip,width=0.35\textwidth]{mrpc_statistics.pdf}
%     \caption{The maximum singular value, distribution variance and token uniformity change as the BERT layer goes deeper on the GLUE-MRPC dataset.}
%     \label{fig:tokenuni_sta}
% \vspace{-0.6cm}
% \end{figure}



\section{Transformation Function}

Having empirically illustrated the changes in the singular value distributions of the transformer layer outputs and in the measures of token uniformity across the transformer blocks, we now provide insights into designing a desirable singular value transformation function.

\subsection{Motivation}

As in the geometric interpretation presented in Section \ref{sec:singular}, the highly skewed singular value distribution in the embedded feature space would mean that the axis of the ellipsoid is significantly longer than the corresponding axis of the sphere, leading to the token uniformity problem. A variety of techniques have been developed to alleviate this problem, %ill-conditioned data distributions, 
and the most popular ones are a series of normalisation methods that can be explained in a unified framework of constraining the contribution of every feature onto a sphere~\citep{DBLP:journals/corr/IoffeS15,wu2018group,ba2016layer}. A notable example is layer normalisation, an essential module in the Transformer architecture, which scales the variance of an individual activation across the layers to one. 
% Besides, spectral normalization~\citep{miyato2018spectral} normalizes the spectral norm (also the largest eigenvalue) of the weight matrix to maintain stability of the adversarial training. 
One common property of these normalisation methods is that they preserve the trace of the covariance matrix (i.e., the first moment of its singular value distribution), but they do not control higher moments of the distribution, such as the skewness. A consequence is that there may still be a large imbalance in singular values. %~\citep{pennington2017nonlinear}. %Meanwhile, they also ignore when the imbalanced distribution happens. That is, the higher moments caused by small singular values may be more harmful as we have shown in Figure \ref{}. Therefore, we would like to 
Here, we propose a transformation function that can adjust the skewness of singular value distributions %in higher moments, especially 
by modifying small singular values to avoid dimension vanishing.

%One common aspect of these normalisation methods is that they preserve the trace of the covariance matrix (i.e., the first moment of its eigenvalue distribution), but they do not control higher moments of the distribution. A consequence is that there may still be a large imbalance in singular values~\citep{pennington2017nonlinear}. Therefore, we would like to propose a transformation function that can adjust eigenvalue distribution in higher moments.

%-------------------delete？------------------
%\subsection{Notation}

%Given an input sample, assuming the hidden state matrix (i.e., the output) in the $l$th$-$layer is represented by $x \in\mathbb{R}^{n^{l}\times m}$, where $n_{\ell}$ is the feature dimension in layer $\ell$ and $m$ is the number of data samples. Eigendecomposition can be performed on its covariance matrix, $\chi = xx^\intercal$, to decompose it into $Q\Sigma Q^{-1}$, where $Q$ is a $ n^{l}\times n^{l}$ orthogonal matrix whose columns are the eigenvectors of $\chi$, and $\Sigma$ is a $n^{l} \times n^{l}$ diagonal matrix with non-negative real numbers, $\{\lambda\}_{i}^{n^{l}}$, on the diagonal, which are the corresponding eigenvalues. Our proposed transformation function $f$ is applied to the eigenvalues normalised in $[0,1]$ and can be regarded as an element-wise activation. The transformed eigenvalues are defined as $\hat{\Sigma} = \{\hat{\lambda}\}_{i}^{n^{l}}$. 

\subsection{Properties of Desirable Singular Value Transformation}
\label{sec:desirable properties}
We want to alleviate the token uniformity problem in PTLMs by adjusting the singular value distributions of outputs of transformer layers~(see Section~\ref{sec:tokenuni}). Since SVD is computationally expensive\footnote{Approximation methods exist which can speed up the computation of SVD. However, the time cost is still 1.5 times higher than that of the original transformer-based language models.} and a common practice is to fine-tune a PTLM on downstream tasks, rather than applying the transformation in each transformer block\footnote{We have also applied the transformation function to different layers of transformers, but have not observed significant improvements.}, we propose to only apply it 
%So what is a desirable transformation function in large pretrained Transformer blocks? Firstly, as we want to apply it to large pretrained model for various real applications, the complexity becomes an important factor. We have to do eigenvalue decomposition before applying the function, but this decomposition is time-consuming. Therefore, rather than apply this function to each transformer block, we only implement the transformation function 
in the last transformer block to modify the final output token distribution. 
% to a large extent.

%Some existing work explored singular value or eigenvalue distributions of various components in deep learning. \citep{pennington2017resurrecting} studied the the distribution of the Jacobian matrix and found that if the largest singular value  $\lambda_{max}$ and its variance $\sigma_{\chi}$ are much larger than 1, then the Jacobian is ill-conditioned and the learning is slow. Here, we focus on the singular value distribution of the output from a transformer block. 
On one hand, we do not want the singular value distribution to be too skewed towards very few large singular values. On the other hand, we do not want it to be too flat as %this would defeat the purpose. Therefore, we want to strive the balance of making the singular value distribution less skewed and 
we want to keep the relative difference between singular values learned from powerful pre-trained models. 
%However, these two properties can lead to undesirable learned feature space. Imagine a extreme situation when the variance is zero so all the singular value is the same, then model does not learn anything because it treats all the tokens equally. Therefore, in addition to make the transformed distribution more balanced, we also want to keep the discrepancy of the current eigenvalue learnt in powerful pretrained model. 
To this end, we propose the following three desirable properties of a singular value transformation function $f(\lambda)$:



%unexpected non-trivial situation.
% In practical, we use a mask~(details can be found in ) to keep the largest eigenvalue free from the transformation function.


\paragraph{1) $f'(\lambda) \geq 0$} 
The large PTLMs have achieved promising performance in a broad spectrum of NLP tasks, showing their capabilities in learning good feature representations. %indicating their abilities in feature representation learning. To 
In order to preserve the original data manifold that is mainly defined by the feature singular values, we would like to keep the original order of the singular values. As the input to $f$ is a sorted singular value sequence, $f$ should be monotonically increasing: %, that is:
\begin{equation}
    f(\lambda_{i}) > f(\lambda_{j}), \quad  \text{if}\quad \lambda_{i} > \lambda_{j}, \quad i,j \in [0,n_{l}-1].
\nonumber
\end{equation}
%the smaller eigenvalue~$\lambda_{i}$ is still smaller after such transformation.

\paragraph{2) $f''(\lambda) \leq 0$}%, Singular value  increment should be monotonically decreasing}
To make the transformed singular value distribution more balanced while keeping the largest singular value unchanged,
intuitively, we should increase the smaller singular values. The increment $\Delta_{i}$ for each singular value is defined as $f(\lambda_{i})-\lambda_{i}$ and $\Delta_{0} = 0$ (i.e., the largest singular value is kept unchanged). To reduce the gap between larger and smaller singular values, % and the smaller one, 
we propose a simple solution in which smaller singular values have an equal or larger increment than larger ones, while the original order of the singular values is preserved. That is: %This property can be written as
\begin{equation}
        \Delta_{i} \geq \Delta_{j}, \quad \text{if}\quad \lambda_i< \lambda_j, \quad \mbox{where}\quad \Delta_{i}=f(\lambda_{i})-\lambda_{i}
\nonumber
\end{equation}
i.e., $\frac{f'(\lambda_{i})-f'(\lambda_{j})}{\lambda_{i}-\lambda_{j}}\leq 0$: the first-order derivative of $f$ should be monotonically non-increasing, which corresponds to $f''(\lambda)\leq 0$.

\paragraph{3) $f(\lambda_{max})\approx \lambda_{max}$} %(Keep the largest singular value unchanged)}
%add papers explaining why the largest eigenvalue is so important. On one hand, 
To guarantee a bounded embedding space, we require the transformation to keep the largest singular value unchanged as much as possible. %On the other hand, 
Existing studies have also shown that the largest singular value of the data covariance matrix is significant to model training~\citep{DBLP:conf/nips/PenningtonW17}. %We have also previously shown in Section~\ref{sec:tokenuni} that the largest singular value is much larger than all the other singular value. We thus speculate that any modification of the largest singular value can lead to potentially detrimental results.

\subsection{SoftDecay Function}

%It is obvious that there are many transformation function satisfying these three proprieties. We would like to evaluate whether they will lead to a better transform functions than naive smooth functions.

%We explore the following transformation functions which satisfy at least two or all of the aforementioned properties.

%\paragraph{Minimize the variance} The first choice that we use the average value of the current singular value sequence as every element in the transformed distribution.
%\begin{align*}
%    f(\lambda_{i}) = ({\sum_{i=0}^{n_{\ell}-1}\lambda_{i})/{n_{\ell}}}
%\numberthis
%\label{eq:aver_f}
%\end{align*}
%This candidate function definitely decrease the singular value distribution variance and keep the original order with $f'(\lambda) = 0$ and $\f''(\lambda) = 0$, but it does not keep the largest singular value. However, since the largest singular value $\lambda_{max} \geq f(\lambda_{max})$, the transformed embedding space is still bounded.

%\paragraph{Linear transformation}
%\begin{align*}
%    f(\lambda_{i}) = \lambda_{max}-\frac{\lambda_{max}-\lambda_{i}}{K}
%\label{eq:linear_f}
%\numberthis
%\end{align*}
%This function meets all the three proprieties, and can decrease the singular value distribution variance. %Without loss generality,  we set the 
%The parameter $K$ is set to $1.1$ in order to get a small increment at each singular value position.

%\paragraph{Non-linear and trainable transformation Layer}

%Furthermore, we would like 
We develop a non-linear and trainable function built on the soft-exponential function~\citep{godfrey2015continuum}. %, we will develop our transformation layer.
\begin{equation}
    f(x) = -\frac{\ln(1-\alpha(x+\alpha))}{\alpha}
\label{eq:softdecay}
\end{equation}

Specifically, when $\alpha<0$, the curve is monotonically increasing with a decreasing slope. This is consistent with properties (1) and (2). For property (3), it can be proved that $\lambda \geq f(\lambda)$ for any $\lambda$ when $\alpha < 0$. Hence, we have $\lambda_{max} \geq f(\lambda_{max}) \geq \lambda_{2}$, where $\lambda_2$ is the second largest singular value. % as we defined previously. 
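
For reference, these statements can be checked directly from Eq.~(\ref{eq:softdecay}). The following sketch assumes $\alpha<0$ and that the argument of the logarithm is positive, i.e., $1-\alpha(x+\alpha)>0$:
\begin{equation}
f'(x) = \frac{1}{1-\alpha(x+\alpha)} > 0, \qquad f''(x) = \frac{\alpha}{\big(1-\alpha(x+\alpha)\big)^{2}} < 0 .
\nonumber
\end{equation}
Moreover, since $\ln(1+u)\leq u$ for $u>-1$, setting $u=-\alpha(x+\alpha)$ gives
\begin{equation}
f(x) = \frac{\ln(1+u)}{-\alpha} \leq \frac{u}{-\alpha} = x+\alpha < x .
\nonumber
\end{equation}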
%That is, the transformed embedding space by $f(\lambda)$ is bounded without changing the order of singular values. 

Combining the desirable properties of singular value distribution and the non-linear transformation function, we describe our proposed transformation %the complete algorithm 
in Algorithm~\ref{numpy}:
 \begin{algorithm}[ht]
  \caption{\texttt{SoftDecay} transformation}  
   {\bf Input:} 
  Original representations $\bm{X}\in \mathbb{R}^{n_l\times m}$, $m$ is the number of tokens, $n_l$ is the embedding dimension. 
  \begin{algorithmic}[1]  
    \State SVD decomposition $\bm{U}, \bm{\Sigma}, \bm{V}^{\intercal} = \text{SVD}(\bm{X})$
    \State Apply transformation $\hat{\bm{\Sigma}}=\text{SoftDecay}(\bm{\Sigma})$
    \State Rescaling factor $\mathcal{K} = \text{max}(\lambda)/\text{max}(\hat{\lambda})$
    \State Compute transformed singular value $\Tilde{\lambda} = \mathcal{K}\hat{\lambda}$
    \State Compute transformed representation $\Tilde{\bm{X}} = \bm{U}\Tilde{\bm{\Sigma}}\bm{V}^\intercal$ 
  \end{algorithmic}
  {\bf Output:} Transformed representation $\Tilde{\bm{X}}$
  \label{numpy}
\end{algorithm}

\subsection{Transformed Feature Evaluation}
\label{sec:metrics}
Existing research in text representation learning showed that the features should be roughly
isotropic (i.e., directionally uniform)~\citep{%DBLP:journals/tacl/AroraLLMR16,
DBLP:conf/iclr/MuV18,DBLP:conf/emnlp/0004GXMYS20,DBLP:conf/iclr/Wang0HHWG20,DBLP:conf/emnlp/LiZHWYL20,DBLP:journals/corr/abs-2103-15316} to prevent the feature space squeezing into a narrow
cone and preserve as much information of the data as possible. %Representation learning in contrastive learning encourages positive examples should be mapped to nearby features, while negative ones are far apart~\citep{DBLP:conf/icml/0001I20}, so that different data manifolds can maintain their unique structure information. Therefore, 
We argue that the evaluation of transformed features should consider both the uniformity and the preservation of local neighbourhood structure in the original embedding space. %structure invariance. 

\paragraph{Uniformity.} %Following the studies in alleviating the anisotropy in embedding spaces, 
We propose to measure the distribution uniformity in three different ways. First, we examine the feature similarity~(\textbf{TokenUni}): % which is exactly the form of information diffusion~\citep{DBLP:conf/icml/GoyalCRCSV20}.
\begin{equation}
    % \text{TokenUni} = \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{\bm{x}_{i}\bm{x}_{j}}{\lVert\bm{x}_{i}\rVert\lVert\bm{x}_{j}\rVert}
    %\mbox{TokenUni}=\mathbb{E}_{x_{i},x_{j}\sim U(\mathcal{O})}[\text{cos}(f(x_{i}),f(x_{j}))]
    \mbox{TokenUni}(x_i,x_j)=\text{cos}(f(x_{i}),f(x_{j}))
\end{equation}
where 
$f(\cdot)$ transforms an input feature by the \texttt{SoftDecay}.
%to a continuous representation in layer $\ell$ of model $f$. %Different from dot products or Euclidean distance measurement that are optimised by distribution with zero mean, the Radial Basis Function (RBF) kernel has shown a great potential in evaluating representation uniformity ~\citep{DBLP:conf/icml/0001I20}. Therefore, 

Second, we use the Radial Basis Function (RBF) kernel, \textbf{RBF}$_{\mathbf{dis}}$, to measure feature similarity, %in a high-dimensional space, 
as it has shown great potential in evaluating representation uniformity~\citep{DBLP:conf/icml/0001I20}.
\begin{equation}
    %\mbox{RBF}_{dis} = \text{log}\mathbb{E}_{x_{i},x_{j}\sim U(\mathcal{O})}\big[\,e^{-t{\lVert f(x_{i})-f(x_{j})\rVert}_{2}^{2}}\big]\,
    \mbox{RBF}_{dis}(x_i,x_j) = \exp\big(-\frac{\lVert f(x_{i})-f(x_{j})\rVert^{2}}{t}\big),
\end{equation}
where $t$ is a constant. We use the logarithmic value of RBF$_{dis}$ in experiments.
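
Taking the logarithm removes the exponential, so the reported quantity is the negative scaled squared distance. If, in addition, the transformed features are assumed to be normalised to unit length (an assumption made here only to relate the two measures), then $\lVert a-b\rVert^{2} = 2-2\cos(a,b)$ and
\begin{equation}
\log \mbox{RBF}_{dis}(x_i,x_j) = -\frac{\lVert f(x_{i})-f(x_{j})\rVert^{2}}{t} = -\frac{2-2\,\mbox{TokenUni}(x_i,x_j)}{t} .
\nonumber
\end{equation}
Under that assumption, a higher cosine similarity between two features corresponds to a value of $\log \mbox{RBF}_{dis}$ closer to zero.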
%Here, the model $f$ is applied to a sequence of features $\{x_{i}\}_{i=1}^{n}$ to generate their feature representations  $[ f_{\ell}(x_{1}),\dots,f_{\ell}(x_{n})]$. Since 

Finally, as a few predominant singular values will result in an anisotropic embedding space, we can check the difference of variances in different directions or singular values %, because  When applying  Where $S_{1},\dots,S_{m}$ are the $m$ singular value in a descending order. 
and use the \textbf{E}xplained \textbf{V}ariance (\textbf{EV$_{k}$})~\citep{DBLP:conf/aaai/ZhouL021}: %is defined as
\begin{equation}
    \mbox{EV}_{k}(f(\bm{X})) = \frac{\sum_{i=1}^{k}\lambda_{i}^{2}}{\sum_{j=1}^{m}\lambda_{j}^{2}},
\end{equation}
where $\lambda_{i}$ is the $i$-th singular value sorted in descending order and $m$ is the number of singular values. In the extreme case when $\mbox{EV}_{1}$ approaches 1, most of the variation concentrates in one direction, and the feature space squeezes into a narrow cone.
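
As a small worked illustration with hypothetical singular values (chosen only for this example), compare a skewed spectrum $(\lambda_1,\lambda_2,\lambda_3)=(3,1,1)$ with a perfectly balanced one $(1,1,1)$:
\begin{equation}
\mbox{EV}_{1} = \frac{3^{2}}{3^{2}+1^{2}+1^{2}} = \frac{9}{11} \approx 0.82, \qquad \mbox{EV}_{1} = \frac{1^{2}}{1^{2}+1^{2}+1^{2}} = \frac{1}{3} \approx 0.33 .
\nonumber
\end{equation}
The second value is the smallest that $\mbox{EV}_{1}$ can take when $m=3$, since $\lambda_1$ is the largest singular value.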

\paragraph{Preservation of Local Neighbourhood.}
%Considering BERT has achieved good performance in many NLP downstream tasks, we assume that the desirable data manifolds have been well exploited in the BERT representations. Therefore, 
Ideally, the transformed embeddings should preserve the local neighbourhood structure in the original embedding space. Inspired by the Locally Linear Embedding~\citep{roweis2000nonlinear}, %,saul2003think}, 
we propose the \textbf{L}ocal \textbf{S}tructure \textbf{D}iscrepancy Score (\textit{LSDS}) to measure the degree of preserving the original local neighbourhood. First, for a data point $x_i$ in the original embedding space, we choose its $k$-nearest-neighbours, %$\epsilon$-neighbourhood, i.e., its neighbouring data points which are within the distance of $\epsilon$. 
then define the weight connecting $x_i$ and its neighbour $x_j$ as the distance measured by the RBF kernel, $w_{ij}=\exp(-\lVert x_i - x_j\rVert^2/t)$. In the transformed space, the new feature $\tilde{x_i}=f(x_i)$ is supposed to be close to the linear combination of its original neighbours in the transformed space weighted by the distance computed in the original space:
\begin{equation}
   % w_{ij}=\exp\big(-\frac{\lVert x_i - x_j\rVert^2}{t}\big)\\
    \mbox{LSDS}(x_i)=\lVert f(x_i)-\sum_{j\in\mathcal{N}(x_i)} w_{ij}f(x_j)\rVert^2,
\end{equation}
\noindent where $\mathcal{N}(x_i)$ denotes the $k$-nearest-neighbours of $x_i$.

%In the original feature space $\mathcal{O}$, For a query point $\bm{q}$, the LLE uses its neighbor points to reconstruct it via weighted sum. In the transformed feature space~$\mathcal{O^{'}}$, we use the points with the same index of the $\bm{q}$ neighbor points and the corresponding weights to recover the $\bm{q^{'}}$. Then, the reconstruct loss for $\bm{q^{'}}$ is the LSDS.

% For example, the nearest data samples derived from BERT of a given data point should still be in its K-nearest set $\mathcal{N}_{k}$.
% \begin{equation}
%     \mathbb{E}_{x_{i}\sim U(\mathcal{O})} [\,|\mathcal{N}_{k}(\text{BERT}(x_{i}))\cap\mathcal{N}_{k}(f(x_{i}))|]\,
% \end{equation}


%%%%%%%Experiment%%%%%%%%%%%%%%
\section{Experiments}

We implement our proposed transformation functions on four transformer-based Pre-Trained Language Models (PTLMs), %BERT, ALBERT, RoBERTa and DistilBERT, 
BERT~\citep{devlin2018bert}, ALBERT~\citep{lan2019albert}, RoBERTa~\citep{liu2019roberta} and DistilBERT~\citep{sanh2019distilbert}, 
and evaluate on semantic textual similarity (STS) datasets and \hq{General Language Understanding Evaluation} (GLUE) tasks~\citep{DBLP:conf/iclr/WangSMHLB19}, including unsupervised and supervised comparisons.\footnote{Model training details and additional results can be found in the supplementary material.} 
%The former evaluation is based on the derived sentence representations, without relying on any supervision, \hq{i.e., \texttt{SoftDecay} is applied on top of the last layer of a PTLM without any fine-tuning on the STS datasets}. The latter uses downstream task supervision, \hq{i.e., a PTLM is modified by inserting \texttt{SoftDecay} on top of its last layer and then fine-tuned under the label supervision, along with the task-specific classification head.}


%\hq{As our evaluation is only concerned with the improvements brought by addressing token uniformity, so we use the existing methods designed to alleviate the issue as baselines, rather than the general state-of-the-art methods on specific tasks, e.g., ensemble methods can achieve better results on single sophisticated PTLM, but the improvement is irrelevant to our motivation.}

%, so the evaluation results are purely effected by the transformed features~\citep{DBLP:conf/naacl/ZhouS21}. Therefore, we use these two tasks to evaluate our transformation method with and without downstream task supervision. We use the implementation provided by huggingface\footnote{https://github.com/huggingface/transformers/}. The proposed \texttt{SoftDecay} method is applied to the last layer of these PTLMs. %, then we fine-tune these PTLMs. 


% \subsection{Models for Comparison}
% \hq{As our evaluation is only concerned with the improvements brought by addressing token uniformity, so we use the existing methods designed to alleviate the issue as baselines, rather than the general state-of-the-art methods on specific tasks, e.g., ensemble methods can achieve better results on single sophisticated PTLM, but the improvement is irrelevant to our motivation.}
%In addition to vanilla PTLMs, 

% \hq{As for unsupervised method for comparison in STS evaluation,} we compare our approach with recent methods on adjusting anisotropy in semantic similarity task, including \texttt{BERT-flow}~\citep{DBLP:conf/emnlp/LiZHWYL20}, \texttt{SBERT-WK}~\citep{DBLP:journals/taslp/WangK20}, \texttt{BERT-whitening} and \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}. 
%\texttt{BERT-flow}~\citep{DBLP:conf/emnlp/LiZHWYL20} %argued that ideal token/sentence representations should be isotropic and %when the evaluation is based on cosine similarity. Its 
%proposed to transform the representations learned by PTLMs into a standard Gaussian distribution. % and the training process can be in supervised and unsupervised manner. Following the idea of adjusting the anisotropy, 
%Similar to BERT-flow, 
%and \texttt{SBERT-WK}~\citep{DBLP:journals/taslp/WangK20} %also 
%which used Natural Language Inference datasets to train the top transformation layer while keeping parameters in the PTLM fixed, and \texttt{BERT-whitening} \citep{DBLP:journals/corr/abs-2103-15316} and \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767} which dissect BERT-based word models through geometric analysis on the feature space. 
% We use these methods for comparison in STS evaluation.
% As for supervised GLUE tasks, \hq{in additional to the four base PTLMs on top of which our \texttt{SoftDecay} is applied on, we use another two recent model focus on improveing sentence-level embedding, Sentence-BERT (S-BERT for short) ~\citep{DBLP:journals/corr/abs-1908-10084} and BERT-CT~\citep{DBLP:conf/iclr/CarlssonGGHS21} as baselines.
% S-BERT adds a pooling operation to the output
% of BERT to derive a sentence embedding and fine-tuning with a siamese network structure on sentence-pairs; while BERT-CT
% improves the PTLMs by incorporating contrastive loss in the training objective to retain a semantically distinguishable sentence representation. Both of the two methods are targeting at making the sentence-level embedding more discriminative, i.e., addressing the token uniformity}

\begin{table*}[thb]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{lllllllll}
\toprule[1pt]
\textbf{Model} &STSB & STS-12 &STS-13&STS-14&STS-15&STS-16&SICK-R&Avg. ($\Delta$\%) \\
 \hline
  %\textit{Bert-base} &&&&&&&&\\
  \multicolumn{9}{c}{\textit{Results based on Bert-base-cased~}}\\ %Model results published in ~\citep{DBLP:journals/corr/abs-2104-01767}}} \\
  \texttt{BERT} &59.05&	57.72&	58.38&	61.97&	70.28&	69.63&	63.75&	62.97\\
  \texttt{SBERT-WK}~\citep{DBLP:journals/taslp/WangK20} & 16.07&	26.66&	14.74&	24.32&	28.84&	34.32&	41.54&	26.64\\
  \texttt{BERT-flow(NLI)}~\citep{DBLP:conf/emnlp/LiZHWYL20}&58.56&	59.54&	64.69&	64.66&	72.92&	71.84&	\underline{65.44}&	65.38\\
  \texttt{BERT-whitening(NLI)}~\citep{DBLP:journals/corr/abs-2103-15316}&	68.19&	61.69&	65.70&	66.02&	\underline{75.11}&	\underline{73.11}&	63.60&	67.63\\
  \texttt{BERT-whitening(NLI)-256}~\citep{DBLP:journals/corr/abs-2103-15316}&	67.51&	61.46&	66.71&	66.17&	74.82&	72.10&	64.90&	67.67 \\
  \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}	&\underline{68.72}&	\underline{62.20}&	\underline{68.52}&	\underline{67.35}&	74.73&	72.42&	60.43&	\underline{67.77}($\uparrow$7.6)\\
%   \texttt{BERT-CT}&74.21&61.63&76.80&68.47&77.50&76.48&69.19& &\\
  \texttt{SoftDecay}&	\textbf{72.41}**&	\textbf{65.16}**&	\textbf{72.10}**&	\textbf{69.49}**&	\textbf{77.09}**&	\textbf{77.05}**&	\textbf{65.55}**&	\textbf{71.26}($\uparrow$12.0) \\
  \midrule
    \multicolumn{9}{c}{\textit{Results based on DistilBERT-base}} \\
\texttt{DistilBERT}&61.45&59.68&59.60&63.54&70.95&69.90&63.84&64.12\\
  \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}	&
69.41   &  61.82 &  66.90 &  67.69 &  74.27 &  72.81 & 59.43& 67.48($\uparrow$5.2)\\
 \texttt{SoftDecay}&	
  \textbf{71.10}** &  \textbf{63.33}** &  \textbf{70.62}**  & \textbf{68.39}**  & \textbf{76.34}**&   \textbf{75.29}** &  \textbf{63.40}** &  \textbf{69.78}($\uparrow$8.8)\\
     \midrule
      \multicolumn{9}{c}{\textit{Results based on ALBERT-base}} \\
    \texttt{ALBERT}&46.18&51.02&43.94&50.79&60.83&55.35&54.99&51.87\\
  \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}	&
61.76 &	58.33 &	62.89 &	59.92 &	68.84  &65.90 &	58.03&62.24($\uparrow$19.9)\\
\texttt{SoftDecay}&	
\textbf{63.30}**  &  \textbf{59.42}** &  \textbf{62.93}** &  \textbf{61.09}** &  \textbf{70.84}** &  \textbf{68.60}** &  \textbf{62.26}**& \textbf{64.06}($\uparrow$23.5)\\
\midrule
      \multicolumn{9}{c}{\textit{Results based on RoBERTa-base}} \\
\texttt{RoBERTa}&  57.54 &58.56&50.37&59.62&66.64&63.21&60.75&59.53\\
  \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}	&
68.18 & 62.21 &  67.13 &  67.63 &  74.78 &  71.43  & 58.80&67.17($\uparrow$12.8)\\
 \texttt{SoftDecay}&
\textbf{69.47}** & \textbf{62.97}** & \textbf{67.65}** & \textbf{68.09}** & \textbf{75.33}** & \textbf{73.26}** & \textbf{62.87}** & \textbf{68.50}($\uparrow$15.1) \\
\bottomrule[1pt]
\end{tabular}
}
\caption{Spearman's rank correlation results on STS tasks using sentence representation learning methods applied to different PTLMs. Results marked with ** are significant at $p < 0.001$ and those marked with * at $p < 0.05$, compared with the best baseline. \hq{The improvement $\Delta$\% is calculated relative to the base PTLM (first row in each PTLM group).}} 
\label{tab:sts_results}
\end{table*}

% \begin{table*}[ht]
% \centering
% \resizebox{0.72\textwidth}{!}{%
% \begin{tabular}{llllllll}
% \toprule[1pt]
%   \textbf{Model} &STSB & STS-12 &STS-13&STS-14&STS-15&STS-16&SICK-R \\
%   \hline
%         \multicolumn{8}{c}{\textit{Trained on wiki-text (unsupervised)}} \\
%     \texttt{SimCSE}~\citep{pmlr-v119-goyal20a}     & 	74.48 & 66.01	&81.48	&71.77	&77.55	&76.53	&69.36 \\
%     \texttt{SoftDecay} & 	\textbf{75.81} & 63.25&	78.67&	70.41	&\textbf{79.37}&	\textbf{77.69}&	\textbf{71.15}\\
% \hline
%       \multicolumn{8}{c}{\textit{Trained on MNLI and SNLI dataset (supervised)}} \\
%     \texttt{SimCSE}~\citep{pmlr-v119-goyal20a}     &  	82.26 & 77.37&	78.12&	77.81&	84.65&	81.10&	78.73\\
% \texttt{SoftDecay} & \textbf{83.51} & 75.31&	\textbf{81.70}&	\textbf{79.88}&	\textbf{86.33}&	\textbf{81.37}&	\textbf{79.04} \\
% \bottomrule[1pt]
%     \end{tabular}
% }
%     \caption{Comparison with contrastive learning method, \texttt{SimCSE}. Our methods demonstrate overall better results on supervised setting.}
%     \label{tab:compare_cl}
% \end{table*}



%%%%%%%%%%%%%newly add evaluation part%%%%%%%%%%
\subsection{Unsupervised Evaluation on STS}

\paragraph{Setup}
The STS task is a widely used benchmark for %predicting the similarity of two sentences. There is only test dataset, so many  
evaluating sentence representation learning. %~\citep{DBLP:journals/corr/abs-2103-15316,DBLP:conf/emnlp/ReimersG19}. 
We conduct experiments on seven STS datasets,
namely SICK-R~\citep{DBLP:conf/lrec/MarelliMBBBZ14}
and the STS tasks~(\citeauthor{DBLP:conf/semeval/AgirreCDG12}~\citeyear{DBLP:conf/starsem/AgirreCDGG13,DBLP:conf/lrec/MarelliMBBBZ14,DBLP:conf/semeval/AgirreBCCDGGLMM15,DBLP:conf/semeval/AgirreBCDGMRW16}). We compare our approach with unsupervised methods that address anisotropy on STS tasks, including \texttt{BERT-flow}~\citep{DBLP:conf/emnlp/LiZHWYL20}, \texttt{SBERT-WK}~\citep{DBLP:journals/taslp/WangK20}, \texttt{BERT-whitening}~\citep{DBLP:journals/corr/abs-2103-15316} and \texttt{WhiteBERT}~\citep{DBLP:journals/corr/abs-2104-01767}. 
\texttt{BERT-flow} argues that ideal token/sentence representations should be isotropic and proposes to transform the representations learned by PTLMs into a standard Gaussian distribution. Similar to \texttt{BERT-flow}, \texttt{SBERT-WK} also uses Natural Language Inference datasets to train the top transformation layer while keeping the parameters of the PTLM fixed. \texttt{BERT-whitening} and \texttt{WhiteBERT} dissect BERT-based word models through geometric analysis of the feature space. 
\hq{Our~\texttt{SoftDecay} is applied directly to the last layer of the original PTLMs to derive the transformed sentence representation without any fine-tuning.\footnote{Here, we empirically search for the best value of $\alpha$ in $\{-0.2,-0.4,-0.6,-0.8,-1.0\}$.}}% and use the best $\alpha$ for each dataset.}}.
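
For reference, the evaluation metric can be written out as follows. This is a sketch that assumes the standard STS protocol of scoring each sentence pair by the cosine similarity of its two embeddings; the symbols $\mathbf{u}_i$, $\mathbf{v}_i$, $s_i$ and $g_i$ are introduced only for this illustration. For the $i$-th pair with sentence embeddings $\mathbf{u}_i$ and $\mathbf{v}_i$ and gold similarity $g_i$, the predicted score $s_i$ and the Spearman's rank correlation $\rho$ over the $n$ test pairs are
\begin{gather*}
s_i = \frac{\mathbf{u}_i^{\top}\mathbf{v}_i}{\|\mathbf{u}_i\|_2\,\|\mathbf{v}_i\|_2}, \\
\rho = \frac{\operatorname{cov}\bigl(\operatorname{rank}(s),\,\operatorname{rank}(g)\bigr)}{\sigma_{\operatorname{rank}(s)}\,\sigma_{\operatorname{rank}(g)}},
\end{gather*}
where $\operatorname{rank}(\cdot)$ maps the $n$ scores to their ranks and $\sigma$ denotes the standard deviation of the ranks. Since $\rho$ depends only on the ordering of the pairs, any transformation of the embedding space affects the metric only through the change it induces in the relative order of the cosine similarities.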
% As the combination of the first and the last layer representations in a language model has shown better performance, we use the average of these two representations as the default setting.

%We follow the same evaluation procedure in previous work~\citep{DBLP:journals/corr/abs-2103-15316}. %firstly obtain the sentence features for two sentences in a pair from feature distributions. Next, we compute 
%Given a sentence, we retrieve the most similar sentences based on their learned embeddings, and % pair, the cosine similarity of their embeddings is calculated, which is 
%then compare the results with the gold-standard to derive the Spearman’s rank correlation coefficient. %between the calculated cosine similarity and the gold cosine similarity is used as the final metric. %This evaluation doesn't rely on any classifier, so The evaluation results are purely determined by the learned embeddings. %~\citep{DBLP:conf/naacl/ZhouS21}.

%The comparison baselines include vanilla BERT features, trainable neural transformation methods~(\texttt{SBERT-WK}~\citep{DBLP:journals/taslp/WangK20} and \texttt{BERT-flow}~\citep{DBLP:conf/emnlp/LiZHWYL20})~\footnote{They use extra Natural Language Inference datasets to train the top transformation layers while keep parameters in the pretrained model fixed.} and most recent \texttt{Whitening} methods used by~\citeauthor{DBLP:journals/corr/abs-2103-15316} and~\citeauthor{DBLP:journals/corr/abs-2104-01767}. Our~\texttt{Soft-Decay} is directly applied on the original BERT features without any training.

\paragraph{Results}
It can be observed from Table~\ref{tab:sts_results} that:
(1) \hq{whitening-based methods (\texttt{BERT-whitening} and \texttt{WhiteBERT}), which transform the derived representations to be perfectly isotropic, perform better than the other baselines %that decrease the anisotropy only in some extent. 
such as \texttt{BERT-flow}, which applies a flow-based approach to map sentence embeddings to a Gaussian distribution.}
% \texttt{BERT-flow}. \texttt{SBERT-WK} gives even worse results compared to the original BERT. 
(2) Our proposed \texttt{SoftDecay} gives significantly better results across all seven datasets, with up to 23.5\% improvement over the base PTLMs (on ALBERT) and about 5\% over the best baseline on BERT-based methods. %The improvement can be explained from two aspects: it does NOT require the representation to be perfectly isotropic as Whitening-based method (Section~\ref{sec:feature_eval} will explain the improvement source in details) and without Gaussian assumption on the generated representations as \texttt{BERT-flow}. 
(3) When comparing the results from %The improvement tendencies among 
different PTLMs, %are similar for the baselines and our \texttt{SoftDecay}: 
we observe more significant improvements on the ALBERT-based models (23\%), and modest improvements on the DistilBERT-based models (8\%). %are the contrast. 
This is somewhat expected as the token uniformity issue is more likely to occur in deeper models. %, it can be inferred that the promotion from anisotropy issue is limited to the shadow models (
Therefore, less pronounced improvements are found on DistilBERT, which has only 6 layers, compared to the others with 12 layers. The cross-layer parameter sharing in ALBERT could potentially lead to more serious token uniformity, so ALBERT benefits more from the mitigation strategy. % be relieved in a larger extend. To verify our hypothesis, 
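
As a worked example of how the relative improvements quoted above are obtained from the average scores in Table~\ref{tab:sts_results} (the symbols $s_{\text{base}}$ and $s_{\text{method}}$ are shorthand introduced here, not notation used elsewhere in the paper):
\begin{gather*}
\Delta\% = 100 \times \frac{s_{\text{method}} - s_{\text{base}}}{s_{\text{base}}}, \\
\text{ALBERT:}\quad 100 \times \frac{64.06 - 51.87}{51.87} \approx 23.5, \\
\text{DistilBERT:}\quad 100 \times \frac{69.78 - 64.12}{64.12} \approx 8.8 .
\end{gather*}
Because $\Delta\%$ is relative to the base score, part of the larger gain on ALBERT also reflects its lower starting point (51.87 versus 64.12 for DistilBERT).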

To further understand how \texttt{SoftDecay} alleviates token uniformity, we show the CDF of singular values from DistilBERT and ALBERT before and after applying \texttt{SoftDecay} in Figure~\ref{fig:distilbert-albert}. We can observe that, before applying \texttt{SoftDecay}, the outputs of ALBERT are very similar across layers, whereas the outputs of DistilBERT differ more from layer to layer. After applying \texttt{SoftDecay}, the singular value distribution of the last-layer output of ALBERT (red curve) is less skewed than that of DistilBERT (brown curve).
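
One way to make the plotted quantity precise is the empirical CDF below. This is a sketch under the assumption that the singular values $\sigma_1,\dots,\sigma_m$ of a layer's output are first normalised to $\tilde{\sigma}_i \in [0,1]$; both the notation and the exact choice of normalisation are assumptions made for this illustration:
\begin{equation*}
F(x) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\bigl[\tilde{\sigma}_i \le x\bigr], \qquad x \in [0,1],
\end{equation*}
where $\mathbf{1}[\cdot]$ is the indicator function. Under this reading, a curve that rises steeply near $x = 0$ means that most singular values are small, i.e., the distribution is skewed, whereas a curve shifted towards larger $x$ means that a larger share of the singular values is sizeable.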
%outperforms all the listed sentence-level representation learning methods over the seven datasets by nontrivial improvements. The great improvements lie in Albert-base models. As for datasets, the STS-15 datasets see the largest performance increases based on BERT-based models.
\begin{figure}[h]
    \centering
    \includegraphics[width=0.48\textwidth,trim={245 230 240 140},clip]{uai2022-template/distilbert_albert_sts_svalue.pdf}
    \caption{\hq{Cumulative distribution function (CDF) of singular values from DistilBERT (left column) and ALBERT (right column), before (top) and after (bottom) applying \texttt{SoftDecay} on the STS dataset.  }}
    \label{fig:distilbert-albert}
    % \vspace{-10pt}
\end{figure}

% \paragraph{Effect of scaling factor $\alpha$}
% \hq{if has enough space, then add results here.}

\begin{figure*}[htbp]
\centering
\includegraphics[trim=85 385 399 120,clip,width=0.90\textwidth]{uai2022-template/sts_tsne_sidetable2.pdf}
    \caption{Data points are tSNE mapping results of sentence (pair) representations in STS-15, %\hq{for qualitative analysis}, 
    from left to right derived from the vanilla \texttt{BERT}, \texttt{BERT+whitening} and \texttt{BERT+SoftDecay}. %\citep{DBLP:conf/semeval/AgirreBCCDGGLMM15} 
     %These representations from left to right are derived from the vanilla \texttt{BERT}, \texttt{BERT+whitening} and \texttt{BERT+SoftDecay}.%~\footnote{We will only show the visualization results for BERT-based model, other models' results can be find in appendix.} 
    The two sentences in each pair are denoted by two different colors, e.g., black and red in \texttt{BERT}. %\hq{We can see that the green one is most fitted to the circle, then follows the blue and the red one, while the green does not preserve the original local structure learnt in BERT.} 
    The metrics measuring uniformity and local neighbourhood structure (see \S{\ref{sec:metrics}}) are listed on the right. % \hq{for quantitative analysis}.
    We can see that our method preserves the local neighbourhood structure better than \texttt{Whitening}, with a lower \textit{LSDS}, and addresses token uniformity in \texttt{BERT} well, with lower scores in the first three metrics.} %clearly see three clusters in \texttt{BERT} and \texttt{BERT+SoftDecay}. Comparing with the vanilla \texttt{\textcolor{red}{BERT}} and \texttt{\textcolor{teal}{Whitening}}, our method achieves better \textit{Uniformity} and less \textit{LSDS} respectively.}
    % \vspace{-4pt}
    \label{fig:sts_tsne}
\end{figure*}


\paragraph{Feature Evaluation}
\label{sec:feature_eval}
To gain insights into the \hq{characteristics of desirable features for the STS task}, we visualise the sentence representations in STS-15 via tSNE and present the results using our proposed metrics in Figure~\ref{fig:sts_tsne}. 
\texttt{BERT-Whitening} transforms vanilla features from BERT into a perfectly isotropic distribution. This is evidenced by the uniformity measures: nearly all the features are orthogonal to each other, as~\textit{TokenUni} is zero, and they have the smallest \textit{RBF$_{dis}$}. It also has the lowest \textit{EV}$_{k}$ score for its top singular value. However, \texttt{BERT-Whitening} fails to preserve the local neighbourhood of BERT embeddings in its transformed space, as shown by its larger Local Structure Discrepancy Score (\textit{LSDS}) compared to \texttt{SoftDecay}. %we can NOT recover original structural information, i.e., relative distance among data points in the green distribution. 
By contrast, \texttt{SoftDecay} not only significantly improves the uniformity compared to the vanilla BERT feature distribution, but also maintains a similar distribution shape. \hq{Our results show that transforming learned text representations into isotropic distributions does not %, or rather, eliminate token uniformity in larger extent won't 
necessarily lead to better performance. Our proposed \texttt{SoftDecay} is better at preserving the local neighbourhood structure in the transformed embedding space, leading to superior results compared to the other methods.}\footnote{The full results of uniformity and structural evaluation of different methods over the seven STS datasets can be found in Appendix C.} 
%of our method in setting a trainable $\alpha$ to control the degree.}
% These results can further explain the large improvement of \texttt{SoftDecay} over \texttt{BERT-Whitening} on the STS-15 dataset.
\HQ{In Appendix C.3, we further discuss a comparison between \texttt{SoftDecay} and a representative contrastive learning method \texttt{SimCSE}~\citep{DBLP:conf/emnlp/GaoYC21}, which also aims to alleviate the anisotropy problem in language representations. }%. considers the uniform distribution as a desirable property of representation learning.}


\begin{table*}[tb]
    \centering
    \resizebox{0.9\textwidth}{!}{
    \begin{tabular}{l|cl|cl|cl}
\toprule[1pt]
\\
    [-1em]
  Dataset (\hq{size})  & BERT & +\texttt{SoftDecay}($\Delta$\%) & ALBERT & +\texttt{SoftDecay}($\Delta$\%) & DistilBERT & +\texttt{SoftDecay}($\Delta$\%)  \\
    [-1em]
    \\
    \hline CoLA(8.5k)&59.57&\textbf{59.84}*($\uparrow$0.45)& 46.47&\textbf{48.91}**($\uparrow$5.25)&50.60&\textbf{50.73}*($\uparrow$0.26) \\
    SST2(67k)&92.32&\textbf{93.12}**($\uparrow$0.87)&\textbf{90.02}&89.91*($\downarrow$0.12)&90.48&\textbf{91.40}**($\uparrow$1.00)\\
    \hline
MRPC-Acc(3.7k)&84.00&\textbf{85.20}**($\uparrow$1.43)&\textbf{85.54} &85.05($\downarrow$0.57)&\textbf{84.56}&84.31*($\downarrow$0.30)\\
    MRPC-F1(3.7k)&89.50&\textbf{89.65}($\uparrow$0.17)&\textbf{89.67}&89.28($\downarrow$0.43)&\textbf{89.16}&89.00($\downarrow$0.18)\\
% QQP-acc(364k) &91.06&\textbf{91.11}($\uparrow$0.05)& & & & \\
% QQP-f1(364k) &87.96&\textbf{88.07}($\uparrow$0.06)& & & \\
    \hline
QNLI(105k)&91.25&\textbf{91.98}**($\uparrow$0.80)&89.99&\textbf{90.24}*($\uparrow$0.28)&87.66&\textbf{88.81}**($\uparrow$1.31) \\
% MNLI-m(393k)& 84.39 & 84.36($\downarrow$0.03) & & &\\
% MNLI-mm(393k)& 84.70 & \textbf{84.82}($\uparrow$2.50) & & & \\
    RTE(2.5k)& 64.98&\textbf{68.23}**($\uparrow$5.00)&66.43&\textbf{68.23}**($\uparrow$2.71)&56.68&\textbf{59.21}**($\uparrow$4.46) \\
    % \hline
    % Avg.$\Delta$\%& -&$\uparrow$1.43&-&$\uparrow$1.18&-&$\uparrow$1.10\\
\bottomrule[1pt]
    \end{tabular}
    }
    \caption{Sentence-level classification results on five representative GLUE validation datasets. Matthews correlation is used to evaluate CoLA, and accuracy/F1 is used for the other datasets. $\Delta\%$ represents the relative improvement over the baseline.} %After applying \texttt{SoftDecay}, the transformed sentences features are overall better than features derived from vanilla PTLMs.}
    \label{tab:glue_results}
\end{table*}


\begin{table*}[tb]
\centering
\resizebox{0.85\textwidth}{!}{
\begin{tabular}{llllllllll}
\toprule[1pt]
 & MNLI & MNLI(mm) & QQP & QNLI & SST2 & CoLA & MRPC & RTE & Average($\Delta\%$) \\
 \hline
\texttt{S-BERT} & 83.9 & 83.1 & 71.3 & 90.5 & 90.9 & 47.0 & 85.3 & 61.6 & 76.7 \\
\texttt{BERT-CT} & 82.3 & 81.9 & 70.1 & 89.7 & 91.3 & 48.8 & 84.4 & 61.1 & 76.2 \\
\texttt{SoftDecay} & \textbf{84.6}** & \textbf{84.0}** & \textbf{71.6}* & \textbf{90.9}* & \textbf{93.3}** & \textbf{50.3}** & \textbf{86.2}** & \textbf{64.5}** & \textbf{78.2} ($\uparrow$2.6\%)\\
\bottomrule[1pt]
\end{tabular}
}
\caption{GLUE test results returned by the GLUE leaderboard. The first two rows are taken from \texttt{BERT-CT}~\citep{DBLP:conf/iclr/CarlssonGGHS21}. Our results outperform \texttt{BERT-CT} by 2.6\% on average.}
\label{tab:glue_test results}
% \vspace{-1pt}
\end{table*}

%%%%%%%commented by Hanqi June%
\begin{comment}

\subsection{Comparing with Contrastive Learning on STS}
\HQ{The objective of contrastive learning methods is to align semantically-related positive data pairs and make the whole representation space evenly distributed~\citep{DBLP:conf/icml/0001I20}. The latter also address the token uniformity issue. Therefore, we compare \texttt{Softdecay} with a representative contrastive learning method, SimCSE~\citep{DBLP:conf/emnlp/GaoYC21}, on STS settings. As SimCSE needs to train on datasets to fine-tune the model parameter, we conduct the experiments according to its original settings: (1) Unsupervised. Train the model on wiki-text dataset and use the data pairs after being applied with two different dropout masks on the same data sample as positive pairs, while the two different data samples as negative pair. (2) Supervised. Train the model on natural language inference datasets, MNLI and SNLI, and use annotated entailment and contradictory pairs as positive and negative pair, respectively. The results are shown in Table~\ref{tab:compare_cl}. The results are overall better, especially in the supervised setting.  The end goal of our approach (via increasing the weights of small singular values in the output embedding space) is similar to SimCSE (via random dropout masks) under the unsupervised setting, as both aim to learn an isotropic embedding distribution. However, in the supervised SimCSE, the contrastive loss is calculated on a subset of training pairs and their corresponding labels. As such, it is relatively difficult to achieve the universal isotropy, which is not the case in our approach.}
\end{comment}

\subsection{Supervised Evaluation on GLUE Datasets}
\paragraph{Setup}
We evaluate our method on five sentence-level classification datasets %\HQ{covering all the 3 tasks} 
in GLUE~\citep{DBLP:conf/iclr/WangSMHLB19}, including grammar acceptability assessment on the Corpus of Linguistic Acceptability (CoLA)~\citep{DBLP:journals/tacl/WarstadtSB19}, sentiment classification on the Stanford Sentiment Treebank (SST2)~\citep{DBLP:conf/emnlp/SocherPWCMNP13}, paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC)~\citep{DBLP:conf/acl-iwp/DolanB05}, and natural language inference on the Question-Answering NLI (QNLI) data and the Recognizing Textual Entailment (RTE) data.\footnote{We exclude WNLI \HQ{as it has only 634 training samples and is often excluded in previous work}~\citep{devlin2018bert}. We also exclude STS-B as it is a benchmark in the STS task.}
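
For completeness, the Matthews correlation coefficient used to score CoLA can be stated in its standard binary-classification form. The confusion-matrix counts $\mathit{TP}$, $\mathit{TN}$, $\mathit{FP}$ and $\mathit{FN}$ are generic notation introduced here, and we assume the usual convention of reporting the coefficient scaled by 100:
\begin{equation*}
\mathrm{MCC} = \frac{\mathit{TP}\cdot \mathit{TN} - \mathit{FP}\cdot \mathit{FN}}{\sqrt{(\mathit{TP}+\mathit{FP})(\mathit{TP}+\mathit{FN})(\mathit{TN}+\mathit{FP})(\mathit{TN}+\mathit{FN})}} .
\end{equation*}
The coefficient ranges from $-1$ to $1$, with $0$ corresponding to chance-level prediction, which makes it less sensitive to class imbalance than accuracy.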
%, including grammar acceptability assessment on the Corpus of Linguistic Acceptability (CoLA)~\citep{DBLP:journals/tacl/WarstadtSB19}, sentiment classification on the Stanford Sentiment Treebank (SST2)~\citep{DBLP:conf/emnlp/SocherPWCMNP13}, paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC)~\citep{DBLP:conf/acl-iwp/DolanB05}, natural language inference on the Question-Answering NLI (QNLI) data and the Recognizing Textual Entailment (RTE) data,all from the GLUE datasets~\citep{DBLP:conf/iclr/WangSMHLB19}. %We evaluate our method on GLUE tasks~\citep{DBLP:conf/iclr/WangSMHLB19}, including grammar acceptability assessment on  (CoLA)~\citep{DBLP:journals/tacl/WarstadtSB19}, sentiment classification (SST2)~\citep{DBLP:conf/emnlp/SocherPWCMNP13}, paragraphing %the Microsoft Research Paraphrase Corpus (MRPC)~\citep{DBLP:conf/acl-iwp/DolanB05},  natural language inference on the Question-Answering NLI (QNLI) data and Recognizing Textual Entailment (RTE). 

We apply our proposed \texttt{SoftDecay} on top of the last encoder layer in BERT, ALBERT and DistilBERT, and then fine-tune the PTLM weights together with $\alpha$ on each task. 
In addition to the PTLMs, we include \HQ{two more baselines, i.e., the sentence-level embedding learning models Sentence-BERT (\texttt{S-BERT} for short)~\citep{DBLP:conf/emnlp/ReimersG19} and \texttt{BERT-CT}~\citep{DBLP:conf/iclr/CarlssonGGHS21}
}.\footnote{We further compare \texttt{SoftDecay} with a method that adds regularisation during training to alleviate the anisotropy problem in language representations~\citep{DBLP:conf/iclr/Wang0HHWG20} %, which  regularizes the output embedding matrix to an exponentially decayed singular value prior distribution~\citep{DBLP:conf/iclr/Wang0HHWG20}, the results on top of BERT on the GLUE datasets are shown 
in Appendix D.1. }
\begin{itemize}
% \item \HQ{\texttt{ExpDecay} is designed for an encoder-decoder architecture in language generation. %task, in their paper, machine translation task and they derive 
% The singular value distribution of the output embedding matrix is derived from the decoder. This approach is not directly applicable to our setup since we don't use the encoder-decoder architecture here. Nevertheless, we modify our training objective by adding the singular values $\{\lambda_{i}\}_{i=1}^{m}$ of output feature $X$: $ \lambda_{e}\sum_{k=1}^{K}(\sigma_{k}-c_{1}e^{-c_{2}k^{\lambda}})$. where $\lambda_{e}$ is a hyperparameter used to adjust the weight of the added term, $c_1, c_2, and \lambda$ are hyperparameters in the desirable exponential prior term. We empirically set $c_{1},c_{2} =1, \lambda=2, \lambda_{e}=1e-4$.} % (The primary loss does not decrease until we decrease $\lambda_{e}$ to 0.0001).}
    \item \texttt{S-BERT} adds a pooling operation to the output
of BERT to derive a sentence embedding and fine-tunes a siamese BERT network structure on sentence pairs.
\item \texttt{BERT-CT} improves the PTLMs by incorporating contrastive loss in the training objective to retain a semantically distinguishable sentence representation. 
\end{itemize}
% \HQ{\texttt{ExpDecay} addresses the token uniformity issue by adjusting the singular value distribution, which has the same goal as our approach. } %that is the most similar method to our motivation.} 
Both methods aim to make the sentence-level embeddings more discriminative, which in turn alleviates the token uniformity problem. 
% \HQ{\texttt{ExpDecay} is designed for language generation task, in their paper, machine translation task and they derive the singular value distribution of the output embedding matrix in the decoder. Although it is not directly applicable to our setup, we perform the experiments of adding the singular values $\{\lambda_{i}\}_{i=1}^{m}$ of output feature $X$ to the training objective as: $ \lambda_{e}\sum_{k=1}^{K}(\sigma_{k}-c_{1}e^{-c_{2}k^{\lambda}})$
% perform the experiments of adding the singular values 
% training objective as}

% The validation results \HQ{with implementing \texttt{ExpDecay} on top of BERT } are shown in Figure~\ref{tab:glue_results}. %, and 
Since the labels of the GLUE test sets are not publicly released, test results can only be obtained by submitting predictions to the GLUE leaderboard.\footnote{\url{https://gluebenchmark.com/leaderboard}} We show the test results returned by the GLUE leaderboard in Table~\ref{tab:glue_test results}.

% Similar to \texttt{BERT-CT}~\citep{DBLP:conf/iclr/CarlssonGGHS21}, we also submit the results on GLUE validation dataset to GLUE leaderboard \footnote{\url{https://gluebenchmark.com/leaderboard}}. Therefore, the results of \texttt{BERT-CT} and \texttt{SoftDecay} are both from the GLUE leaderboard records for transparent and fair comparison.

\paragraph{Results}
%We summarise the effects of \texttt{SoftDecay} on different PTLMs and NLP tasks:
\hq{It can be observed from Table~\ref{tab:glue_results} that \texttt{SoftDecay} is more effective on BERT-based models, while giving less noticeable improvements on DistilBERT. This is similar to what we observed for the STS tasks, since % gains the smallest improvements, which demonstrates the similar trends in STS tasks as 
DistilBERT has fewer layers.
Among the vanilla PTLMs, BERT achieves the best results on the single-sentence tasks (CoLA and SST2), but not on MRPC, a sentence-pair paraphrase detection task. 
With \texttt{SoftDecay}, all three models achieve better results on the inference tasks (QNLI and RTE), especially on the smaller RTE dataset. The CDF of singular value distributions on RTE before and after applying \texttt{SoftDecay}, shown in Figure~\ref{fig:bert_cola_cdf_nocompare}, further verifies the effectiveness of our proposed transformation function. We also observe that models trained on a larger training set tend to generate more similar representations.\footnote{We investigate the impact of the training set size on model performance in Appendix D.2, Figure 3.} 
On MRPC, using \texttt{SoftDecay} is effective on BERT, but gives a slight performance drop on ALBERT and DistilBERT. One possible reason is the much smaller training set size. 
% \HQ{By comparing the results of \texttt{ExpDecay}, we don't see substantial improvement using the fixed exponential decay term. It can be explained by 1) the difficulty of balancing two losses by adding the exponential decay term; 2) the sensitivity of the hyper-parameter in the prior decay term. In our method, we only need to initialize  and its value can be automatically adjusted during training to fit the downstream tasks. Note that our method can be used under the unsupervised setting on the STS dataset (i.e., the sentence representations can be directly transformed using our method without any training).}
%is not suitable for MRPC where we see performance drop on ALBERT and DistilBERT. It is surprising because the \texttt{SoftDecay} performs much better than the base PLTMs on STS tasks that measure the semantically relatedness as well. The reason we think is from the evaluation metric, MRPC uses the accuracy/F1, while STS uses the spearman correlation to measure the sentence-pair similarity that is more vulnerable to the representation distribution/distance change, so the relative changes in STS is much larger than MRPC.
}
\hq{On the GLUE test results shown in Table~\ref{tab:glue_test results}, we observe that \texttt{SoftDecay} outperforms both \texttt{S-BERT} and \texttt{BERT-CT} across all tasks.}
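
As a check on the last column of Table~\ref{tab:glue_test results}, the reported average is consistent with an unweighted mean over the eight task scores (we assume this is how it was computed), and the relative improvement is taken with respect to the \texttt{BERT-CT} average:
\begin{gather*}
\frac{84.6 + 84.0 + 71.6 + 90.9 + 93.3 + 50.3 + 86.2 + 64.5}{8} = \frac{625.4}{8} \approx 78.2, \\
100 \times \frac{78.2 - 76.2}{76.2} \approx 2.6 .
\end{gather*}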

\begin{figure}[h!]
    \centering
    \includegraphics[trim=19 420 539 69,clip,width=0.49\textwidth]{rte_cdf_side.pdf}
    \caption{CDF of singular value distributions on RTE before (left) and after (right) applying \texttt{SoftDecay} on BERT. It is clear that \texttt{SoftDecay} can produce a set of larger singular values, as evidenced by the curves of $F(x)$.}
    % \caption{Cumulative distribution function (CDF) of singular value distributions on RTE. Different curves represent distributions derived from different model layers. The left results are from vanilla BERT, the right are from \texttt{SoftDecay}. The $x$-axis represents the normalised singular values sorted in an ascending order. The steeper the slope of the curve, the higher the relative frequency of the corresponding singular value. By comparing the red curves in the two charts, it is clear that \texttt{SoftDecay} can obtain a set of larger singular values when $F(x)$ is given.}
    \label{fig:bert_cola_cdf_nocompare}
% \vspace{-10pt}
\end{figure}

% \begin{figure}[h]
%     \centering
%     \includegraphics[width=0.48\textwidth,trim={100 350 380 120},clip]{uai2022-template/inference_bert_cdf.pdf}
%     \caption{\hq{The CDF of singular value in QNLI (left) and RTE (right) dataset derived from vanilla \texttt{BERT}}. For the same percentage 0.8, the larger dataset QNLI dataset has smaller $\Delta L_{i}$ among all the layers, refers to a more serious token uniformity issue.}
%     \label{fig:inference_bert_cdf}
%     \vspace{-10pt}
% \end{figure}
% 
\begin{comment}
Applying \texttt{SoftDecay} brings gains on both grammar acceptability assessment and natural language inference tasks. It also improves sentiment classification results using DistilBERT. % and performs similarly using ALBERT. 
But it does not help in paraphrase detection. When comparing results across tasks, 
%our proposed \texttt{SoftDecay} performs overall better on the listed datasets comparing to the three PTLMs, except for the results on MRPC dataset based ALBERT and DistilBERT. 
we observe more significant improvements of 5\% and 4.46\% in BERT and ALBERT, respectively, on the RTE dataset. % with relatively 5\% and 4.46\% increase, respectively. 
CoLA benefits the most from our proposed method based on ALBERT, with the relative improvement of 5.25\% being achieved. On the GLUE test results shown in Table~\ref{tab:glue_test results}, we observe that \texttt{SoftDecay} outperforms both \texttt{S-BERT} and \texttt{BERT-CT} across all tasks.\footnote{We exclude WNLI according to ~\citep{devlin2018bert} and STS-B as it is a benchmark in the STS task.}

\paragraph{Singular Value Distribution} 
We visualise the singular values distributions to get a better understanding of the transformed feature space. % and possible improvement reason. 
The left subfigure in Figure~\ref{fig:bert_cola_cdf_nocompare} shows the CDF of singular values in different layers derived from vanilla BERT on RTE, while the right one shows the CDF of singular values in various layers by applying \texttt{SoftDecay}. %We sort the singular values by a descending order and normalize them to one as the x-axis. The curve increase to $F(x)=1.0$ sooner have more predominant large singular values, are more easily suffer from anisotropy. 
By comparing the curves derived from layer 3 to layer 9 in the left subfigure, we learn that the larger singular values are more predominant as layers go deeper. That is, smaller singular values tend to diminish quickly.\footnote{A similar trend is observed in MRPC, QNLI and CoLA datasets, whose results are shown in Appendix D, Figure 4.}  %This situation implies that the model are learning more task-specific knowledge as layers go deeper, so the feature space tend to be less uniform. But the layer 12 curve fall behind layer 9.~We see the similar change tendency in the \texttt{Soft-Decay} (bottom row). By comparing the red curves in the two charts, we notice that the transformation method 
\texttt{SoftDecay} adjusts the anisotropy in the output layer as evidenced by the shifted CDF curve in the right subfigure, explaining its superior task performance. %We also show more details about the transformed singular value in Figure~\ref{fig:cola_hist}: it has larger smallest value and smaller variance, satisfying our desirable characteristics of singular values. 

\end{comment}

% \begin{figure}
%     \centering
%     \includegraphics[trim=5 20 5 25,clip,width=0.55\textwidth]{templates/latex/0713Layer12_originvssoft_expand_hist_normalizeTo1 (1).png}
%     \caption{Singular Value distribution on CoLA dataset derived from the last layer of BERT based model. The x-axis represents the original singular value, the y-axis is its percentage. It is clear that our \texttt{Soft-Decay} increases the smaller singular value and decrease the distribution variance.}
%     \label{fig:cola_hist}
% \vspace{-2pt}
% \end{figure}

% \begin{figure*}
%     \centering
%     \includegraphics[trim=95 305 105 15,clip,width=0.98\textwidth]{templates/latex/glue_singular_whole.pdf}
%     \caption{Cumulative distribution function (CDF) of singular value distributions. The upper ones are from vanilla \texttt{BERT}, bottom ones are from \texttt{BERT+SoftDecay}. From left to right, the evaluation datasets are SST-2, MRPC and QNLI. Different curves represent distributions derived from different model layers. The x-axis represents the normalised singular values sorted in a descending order. If a curve reaches $F(x)=1.0$ quicker means that the largest singular value is more predominant, i.e., the EV value is larger and the \textit{uniformity} is worse. %For better comparison, we add the last layer output from vanilla BERT as baseline in the bottom figures, shown in dash line. 
%     It is noticeable that our proposed \texttt{SoftDecay} adjusts the anisotropy through the starting point and its increase speed, its adjusting varies from different tasks: most noticeable in MRPC, smallest changes in QNLI.}
%     \label{fig:glue_singular}
% \end{figure*}

%%%%%or can choose separate two group figures%%%%
% \begin{figure*}[t]
%     \centering
%     \includegraphics[trim=70 245 90 215,clip,width=0.98\textwidth]{templates/latex/glue_origin_every3layers.pdf}
%     \caption{Cumulative distribution function (CDF) of singular value distributions. They are from the SST-2, MRPC and QNLI datasets evaluated on BERT.}
%     \label{fig:glue_origin}
% \end{figure*}

% \begin{figure*}[t]
%     \centering
%     \includegraphics[trim=80 394 80 30,clip,width=0.98\textwidth]{templates/latex/glue_softdecay_compare.pdf}
%     \caption{CDF of singular value distributions from the corresponding datasets in Figure~\ref{fig:glue_origin}, they are derived from \texttt{BERT+Soft-Decay}. For better comparison, we also show the baseline result in dash line.}
%     \label{fig:my_label}
% \end{figure*}


\section{Conclusion and Future Work}
In this paper, we have empirically shown that %the feature space derived from transformer-based language models mainly relies on the small singular values of the outputs, and 
the degree of skewness of singular value distributions correlates with the degree of token uniformity. To address the token uniformity problem, we have proposed a singular value transformation function that alleviates the skewness of the singular values. We have also shown that a perfectly isotropic feature space fails to capture local neighborhood information and leads to inferior performance in downstream tasks. Our proposed transformation function has been evaluated on both unsupervised and supervised tasks. Experimental results show that our method addresses token uniformity more effectively than existing approaches.

\HQ{Our paper explores the token uniformity issue that arises during information propagation in the transformer encoder, where self-attention is used. It would be interesting to extend our approach to the encoder-decoder structure and explore its performance in language generation tasks. One promising future direction is to improve generation diversity by addressing token uniformity, since it has previously been shown that anisotropy is related to word occurrence frequencies~\citep{DBLP:conf/emnlp/0004GXMYS20,DBLP:conf/naacl/BisPL21}. As such, in the decoding phase, sampling words from more isotropic word embedding distributions could potentially lead to more diverse results.}
%while better preserving the local neighborhood information compared to existing methods.

\subsubsection*{Acknowledgements}
This work was funded by the UK Engineering and Physical Sciences Research Council (grant nos.\ EP/T017112/1 and EP/V048597/1). YH is supported by a Turing AI Fellowship funded by UK Research and Innovation (grant no.\ EP/V020579/1).

%\newpage

%\clearpage
\bibliography{yan_670}
%\bibliographystyle{acl_natbib}

% \newpage
% \clearpage

% \appendix
% \setcounter{table}{0}
% \renewcommand{\thetable}{A\arabic{table}}
% \setcounter{figure}{0}
% \renewcommand{\thefigure}{A\arabic{figure}}

% \section{Model Configurations and Training Details}

% \paragraph{Unsupervised Setting}

% In unsupervised settings on the STS task, we use the datasets processed by~\citep{DBLP:journals/corr/abs-2104-01767} and follow their evaluation pipeline by replacing their Whitening function with our \texttt{SoftDecay} function in their released  code\footnote{\url{https://github.com/Jun-jie-Huang/WhiteningBERT}}. We do not use any dataset to train the transformation function, instead, we choose a fixed $\alpha$ empirically ($\alpha$ is the hyper-parameter in Eq.(\ref{eq:softdecay})). As we did not see significant changes among different $\alpha$, we set $\alpha$ to $-0.6$ for all the datasets and PTLMs. For metrics calculation, we use $t=0.5$ in RBF$_{dis}$ and we choose the nearest 12 points to reconstruct the query point in \textit{LSDS}.

% \paragraph{Supervised Setting}

% We apply \texttt{SoftDecay} to the output of the last layer of a PTLM provide by huggingface, before layer normalisation. %We evaluate our method on sentence-level classification tasks, including grammar acceptability assessment on the Corpus of Linguistic Acceptability (CoLA)~\citep{DBLP:journals/tacl/WarstadtSB19}, sentiment classification on the Stanford Sentiment Treebank (SST2)~\citep{DBLP:conf/emnlp/SocherPWCMNP13}, paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC)~\citep{DBLP:conf/acl-iwp/DolanB05}, natural language inference on the Question-Answering NLI (QNLI) data and the Recognizing Textual Entailment (RTE) data, all from the GLUE datasets~\citep{DBLP:conf/iclr/WangSMHLB19}. 
% %In supervised settings on GLUE datasets, i.e.,  text-classification (SST2, MRPC, QNLI, CoLA, RTE)~\footnote{You can access these dataset through this link: \url{https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification}}.
% We use the default parameters configured in BERT-base-uncased\footnote{%BERT configuration can be found here 
% \url{https://huggingface.co/docs/transformers/master/en/model_doc/bert#transformers.BertConfig}}, 
% ALBERT-base-v1\footnote{
% \url{https://huggingface.co/docs/transformers/master/en/model_doc/albert}}, 
% RoBERTa-base\footnote{
% \url{https://huggingface.co/docs/transformers/master/en/model_doc/roberta}} and DistilBERT-base-uncased\footnote{
% \url{https://huggingface.co/docs/transformers/master/en/model_doc/distilbert}} as the baselines. 
% For hyper-parameter setting, we search the initial alpha for different datasets from $[-0.2,-0.5,-0.8]$, and set different learning rates from $[2e-3, 2e-5]$ for the transformation layer and the pretrained models.\footnote{As SVD decomposition generates an error in the RoBERTa-base model, we exclude it in GLUE evaluation.} 

% % \begin{figure*}[t]
% %     \centering
% %     \includegraphics[trim=150 260 200 60,clip,width=0.90\textwidth]{templates/latex/appendix_bert_colarte.pdf}
% %     \caption{CDF for CoLA and RTE dataset. The upper results are from the vanilla BERT, the bottom are from \texttt{BERT+SoftDecay}. Large singular values are more predominant as layers go deeper except for the last layer. Comparing to the dash and solid red line, we notice that \texttt{SoftDecay} can greatly improve the anisotropy in BERT features.}
% %     \label{fig:bert_colarte}
% % \end{figure*}

% \begin{table*}[th]
% \centering
% \resizebox{0.78\textwidth}{!}{
% \begin{tabular}{ll|rr|rr|rr}
% \toprule[1pt]
%  &  & BERT & +SoftDecay & ALBERT & +SoftDecay & DistilBERT & +SoftDecay \\
%  \hline
%  \multirow{3}{*}{STS-B} & Evs & 0.6259 & 0.0252 & 0.6987 & 0.0326 &0.7301 & 0.0341 \\
%  & RBF$_{dis}$ & -1.4624 & -3.8534 & -1.1602 & -3.8016  & -1.0549 & -3.8052 \\
%  & TokenUni & 0.6195 & 0.0274 &0.6983 & 0.036  & 0.7282 & 0.037 \\
%  \hline
% \multirow{3}{*}{SICK} & Evs & 0.7383 & 0.0212 &  0.7711 & 0.0274  & 0.8135 & 0.0289 \\
%  &RBF$_{dis}$ & -1.0323 & -3.8671 & -0.8979 & -3.8268  & -0.7367 & -3.8241 \\
%  & TokenUni & 0.7361 & 0.023 & 0.7706 & 0.0295 &0.8130 & 0.0311 \\
%   \hline
% \multirow{3}{*}{STS-12} & Evs & 0.6219 & 0.0182 &0.7052 & 0.0247  & 0.7321 & 0.0245 \\
%  & RBF$_{dis}$ & -1.4785 & -3.8717 & -1.4785 & -1.1438 & -3.8308& -3.8381  \\
%  & TokenUni & 0.6193 & 0.0203 & 0.7058 & 0.0273& 0.7021 & 0.0329 \\
%   \hline
% \multirow{3}{*}{STS-13} & Evs & 0.5823 & 0.0221 &  0.6632 & 0.0287  & 0.7015 & 0.0302 \\
%  & RBF$_{dis}$ & -1.6189 & -3.8706 & -1.3032 & -3.8258  & -1.1594 & -3.8262 \\
%  & TokenUni & 0.5817 & 0.024 & 0.6637 & 0.031 & 0.7021 & 0.0329 \\
%   \hline
% \multirow{3}{*}{STS-14} & Evs & 0.5933 & 0.6729 & 0.0204  & 0.0151 & 0.712 & 0.0202 \\
%  & RBF$_{dis}$ & -1.593 & -3.9124 & -1.2712 &-3.8787  & -1.1288 & -3.8855 \\
%  & TokenUni & 0.5929 & 0.016 &0.6743 & 0.0217 & 0.7127 & 0.0215 \\
%   \hline
% \multirow{3}{*}{STS-15} & Evs & 0.6072 & 0.0183 &0.6827 & 0.0239& 0.7225 & 0.0248 \\
%  & RBF$_{dis}$ & -1.5177 & -3.8706 & -1.2178 & -3.8379  & -1.0772 & -3.8313 \\
%  & TokenUni & 0.6057 & 0.0216 & 0.6848 & 0.0273   & 0.7228 & 0.0291 \\
%   \hline
% \multirow{3}{*}{STS-16} & Evs & 0.6049 & 0.0267 &0.6824 & 0.0333& 0.7190 & 0.0363 \\
%  &RBF$_{dis}$ & -1.5262 & -3.8375 & -1.5262 & -1.2095 & -3.7952 & -3.7869 \\
%  & TokenUni & 0.6054 & 0.0286 & 0.6864 & 0.0360 & 0.7201 & 0.0390 \\
%  \bottomrule[1pt]
% \end{tabular}
% }
% \caption{Uniformity metrics (\textit{EVs}, \textit{TokenUni}, \textit{RBF$_\text{dis}$}) evaluates the isotropy in transformed feature space comparing to the vanilla PTLMs features. Smaller values means the features are better uniformly distributed. It can be seen that \texttt{SoftDecay} can greatly improve the uniformity. }
% \label{tab:sts_uniformity}
% \end{table*}

% \section{Feature Evaluation Results on STS Datasets}
% %To further examine the improvements source, 
% We show in Table~\ref{tab:sts_uniformity} and Figure~\ref{fig:structureloss_bert} both the uniformity and local neighborhood preservation evaluation results of different methods over the seven STS datasets. %Although the sentence features derived from \texttt{BERT} and \texttt{BERT+SoftDecay} look similar as an imperfect visualization result of TSNE algorithm, the noticeable improvements 
% The lower scores returned by \texttt{SoftDecay} in Table~\ref{tab:sts_uniformity} in comparison to the base PTLMs verify its capability of alleviating anisotropic feature space derived from \texttt{BERT}. In Figure~\ref{fig:structureloss_bert}), \texttt{SoftDecay} preserves the local neighbourhood structure better among all the datasets, which explains its performance superiority comparing with \texttt{Whitening} which ignores the original local manifold structure.


% % \begin{figure}[ht]
% %     \centering
% %     \includegraphics[trim=11 60 18 20,clip,width=0.48\textwidth]{templates/latex/structure-loss_bert_v2.pdf}
% %     \caption{Structure-Invariance for \texttt{Whitening} and \texttt{Soft-Decay} transformed Representations. Larger value structure-invariance means better preserving the original structure information learnt in pretrained model. }
% %     \label{fig:structureloss_bert}
% % \end{figure}

% \begin{figure}[thb]
%     \centering
%     \includegraphics[trim=11 30 18 20,clip,width=0.48\textwidth]{sst_lsds.pdf}
%     \caption{Local Structure Discrepancy Score (\textit{LSDS}) for \texttt{Whitening} and \texttt{SoftDecay} transformed Representations. Smaller scores are preferred as the original local neighborhood information learnt in the pretrained model is preserved better. }
%     \label{fig:structureloss_bert}
% \end{figure}

% \section{Visualisation of Features in STS Datasets}

% We show the representations of sentence pairs generated from \texttt{BERT}, with \texttt{Whitening} and with \texttt{SoftDecay} via tSNE for the rest five STS datasets in Figure~\ref{fig:appendix_bertsst}. In STSB, STS13 and STS16, the representation mapping results in \texttt{Whitening} are not unit Gaussian due to some \textit{abnormal} data point. %The Uniformity and LSDS metrics are shown in Figure~\ref{fig:structureloss_bert} and Table~\ref{tab:sts_uniformity}. 
% Our proposed method \texttt{SoftDecay} gives better uniformity score than vanilla BERT and  better \textit{LSDS} than \texttt{WhiteningBERT}, as have been shown in Figure~\ref{fig:structureloss_bert} and Table~\ref{tab:sts_uniformity}.

% %The sentence features derived from ALBERT and DistilBERT are similar in shape when mapping to the 2D plane, and it lacks quantity information, so we don't display the the TSNE results for these two models. Instead, we calculate the \textit{Uniformity} and \textit{structure-invariance} for the two models.

% \begin{figure*}[thb]
%     \centering
%     \includegraphics[trim= 75 90 75 25, clip,width=0.99\textwidth]{appendix_bert_sts.pdf}
%     \caption{The tSNE visualisation of representations of sentence pairs in datasets SICKR, STSB, STS12-16 (except STS15) in different columns. These representations from top to bottom are derived from vanilla \texttt{BERT}, \texttt{BERT+whitening} and \texttt{BERT+SoftDecay}. For each sentence pair, the two sentences are denotes by different colors, e.g., black and red in \texttt{BERT}. We can see clear clusters in \texttt{BERT} and \texttt{BERT+SoftDecay} for STS-B, STS-12 and STS-14 datasets. }
%     \label{fig:appendix_bertsst}
% \end{figure*}

% \section{Additional Results on GLUE datasets}
% \label{appe:glue_sv}
% Firstly, we highlight the different singular value distribution in QNLI and RTE, two datasets for language inference task \HQ{(See in Figure A4)}.

% \begin{figure*}[th]
%     \centering
%     \includegraphics[width=0.68\textwidth,trim={100 350 380 120},clip]{uai2022-template/inference_bert_cdf.pdf}
%     \caption{\hq{The CDF of singular value in QNLI (left) and RTE (right) dataset derived from vanilla \texttt{BERT}}. For the same percentage 0.8, the larger dataset QNLI dataset has smaller $\Delta L_{i}$ among all the layers, refers to a more serious token uniformity issue.}
%     \label{fig:inference_bert_cdf}
%     % \vspace{-10pt}
% \end{figure*}

% For BERT-based model, we show the CDF of singular values on all the evaluated datasets in Figure~\ref{fig:appendix_bert_glue}. We observe that by applying \texttt{SoftDecay} (bottom row of Figure~\ref{fig:appendix_bert_glue}), the CDF of singular values in the last layer becomes more flattened compared to that in vanilla BERT (top row of Figure~\ref{fig:appendix_bert_glue}). %the larger singular value distributions derived from the last layer suddenly become less predominant comparing to layer 9. 

% We also show the results for \texttt{ALBERT} (Figure~\ref{fig:albert_glue1} and Figure~\ref{fig:albert_glue2}) and \texttt{DistilBERT} (Figure~\ref{fig:distilbertglue1} and Figure~\ref{fig:distilbertglue2}). By comparing with the vanilla PTLMs (the top row of each figure), %curves in the first row of \texttt{ALBERT} are more gathering means that the singular values distribution derived from different ALBERT layers are more similar. By comparing the gap between red solid and dash line in the bottom row of each figure, 
% we notice that the application of \texttt{SoftDecay} has a larger impact on ALBERT compared to DistilBERT, especially on the CoLA dataset. For \texttt{DistilBERT}, its feature space becomes anisotropic gradually as layers go deeper. % without a sudden drop that are seen in BERT and ALBERT. 

% \begin{figure*}[htb]
%     \centering
%     \includegraphics[trim=30 305 135 25,clip,width=0.98\textwidth]{appendix_bert_gluecdf.pdf}
%     \caption{Cumulative distribution function (CDF) of singular value distributions. The upper ones are from vanilla \texttt{BERT}, bottom ones are from \texttt{BERT+SoftDecay}. From left to right, the evaluation datasets are SST-2, MRPC, QNLI and CoLA. Different curves represent distributions derived from different model layers. The x-axis represents the normalised singular values sorted in an ascending order. %If a curve reaches $F(x)=1.0$ quicker means that the largest singular value is more predominant, i.e., the EV value is larger and the \textit{uniformity} is worse. It is noticeable that our proposed 
%     \texttt{SoftDecay} adjusts the anisotropy of the feature space with the effect more %through the starting point and its increase speed, its adjusting varies from different tasks: most 
%     noticeable in MRPC and less obvious in QNLI.}
%     \label{fig:appendix_bert_glue}
% \end{figure*}

% \begin{figure*}[th]
%     \centering
%     \includegraphics[trim= 30 245 200 120,clip, width=0.98\textwidth]{appendix_albert_nocompare1.pdf}
%     \caption{CDF of SST-2, MRPC and QNLI datasets. The upper row results are from the vanilla \texttt{ALBERT}, the bottom ones are from \texttt{ALBERT+SoftDecay}.}
%     \label{fig:albert_glue1}
% \end{figure*}

% \begin{figure*}[b]
%     \centering
%     \includegraphics[trim= 40 280 350 30,clip, width=0.65\textwidth]{appendix_albert_gluenocompare2.pdf}
%     \caption{CDF of CoLA and RTE datasets. The upper row results are from the vanilla ALBERT, the bottom ones are from \texttt{ALBERT+SoftDecay}.}
%     \label{fig:albert_glue2}
% \end{figure*}


% \begin{figure*}[th]
%     \centering
%     \includegraphics[trim= 130 175 50 160, clip,width=0.98\textwidth]{appendix_distilbert_glue1.pdf}
%     \caption{CDF of SST-2, MRPC and QNLI datasets. The upper row results are from the vanilla DistilBERT, the bottom ones are from \texttt{DistilBERT+SoftDecay}.}
%     \label{fig:distilbertglue1}
% \end{figure*}

% \begin{figure*}[b]
%     \centering
%     \includegraphics[trim= 90 210 180 160, clip,width=0.88\textwidth]{appendix_distilbert_glue2.pdf}
%     \caption{CDF of CoLA and RTE datasets. The upper row results are from the vanilla DistilBERT, the bottom ones are from \texttt{DistilBERT+SoftDecay}.}
%     \label{fig:distilbertglue2}
% \end{figure*}
\end{document}
