% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{times}
\usepackage{soul}
\usepackage{url}
\usepackage[hidelinks]{hyperref}
\usepackage[utf8]{inputenc}
\usepackage[small]{caption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\urlstyle{same}
\usepackage{array}
\usepackage{multirow}
\usepackage{color}
\newcommand{\ysjred}[1]{\textcolor{red}{#1}}
\newcommand{\ysjgreen}[1]{\textcolor{green}{#1}}
\newcommand{\ysjblue}[1]{\textcolor{blue}{#1}}
\usepackage{CJKutf8}
\usepackage{graphics}
\newtheorem{example}{Example}
\newtheorem{theorem}{Theorem}
\usepackage{diagbox}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{CoSPA: An Improved Masked Language Model with Copy Mechanism for Chinese Spelling Correction}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:1901210554@pku.edu.cn?Subject=Your UAI 2022 paper}{Shoujian Yang}{}}
\author[1]{\href{mailto:lianyu@ss.pku.edu.cn?Subject=Your UAI 2022 paper}{Lian Yu}{}}
% \author[1,2]{Lian Yu}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    School of Software \& Microelectronics \\
    Peking University \\
    China
    
}
% \affil[2]{
%     Another Affiliation\\
%     Address\\
%     …
% }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
  \begin{document}
\maketitle

\begin{abstract}
  Existing BERT-based models for Chinese spelling correction (CSC) have three issues. 1) BERT tends to rewrite a correct low-frequency collocation into a high-frequency one, which leads to over-correcting. 2) The similarity knowledge currently learned between Chinese characters is insufficient to detect all phonic or morphological errors, so the recall rate still has room to improve. 3) The two-dimensional glyph information of Chinese characters is overlooked, so some morphologically misused characters may be difficult to detect.
  This paper proposes a hybrid approach, CoSPA, to address these issues. 1) An alterable copy mechanism alleviates over-correcting by jointly learning to copy a correct character from the input sentence or to generate a character from BERT. To the best of the authors' knowledge, no previous method has used a copy mechanism in BERT for CSC. 2) An attention mechanism is applied to the phonic and shape representations of each character at the output layer. 3) The shape representation is enhanced by mining character glyphs with ResNet and fusing them with the stroke representation via an adaptive gating unit.
The experimental results show that CoSPA outperforms the previous state-of-the-art methods on the SIGHAN2015 dataset.
\end{abstract}

\section{Introduction}\label{sec:intro}

\begin{CJK*}{UTF8}{gbsn}

Chinese spelling correction (CSC) is a challenging task in the field of natural language processing (NLP), whose goal is to detect and correct misused characters or words in Chinese sentences. Existing BERT-based models for CSC have three issues:

\paragraph{Over-correcting:}
Recently, BERT-based non-autoregressive (NAT) language models \citep{liu-etal-2021-plome,dcn_csc,2021MLM-Phonetic,huang-etal-2021-phmospell} have achieved state-of-the-art performance on the CSC task. Such NAT models predict each character over the whole vocabulary space, which can easily lead to miscorrection because the vocabulary is large. In addition, they are prone to rewrite a correct low-frequency collocation into a high-frequency one, resulting in over-correcting. As presented in Table \ref{introduce1}, PLOME \citep{liu-etal-2021-plome} changes the character '动' (dong, move) to '话' (hua, talk), since the fixed collocation '电话' (phone) has a higher frequency than '电动' (electronic games). CSC is a task in which the input and output have the same length, with character-to-character alignment. Since most input characters are correct, copying is often the right action instead of correcting. Although the copy mechanism has been used in the sequence-to-sequence framework for CSC, no method has used it in BERT for CSC. This paper proposes an alterable copy mechanism to alleviate over-correcting by jointly learning to copy a correct character from the input sentence or to generate a character from BERT.

\begin{table}[]
% \centering
\caption{Two examples of Chinese spelling correction.}
\scalebox{0.68}{
\setlength{\tabcolsep}{0.6mm}
\linespread{1.5}
\begin{tabular}{c|c}
\hline
\multicolumn{1}{l}{Over-correcting example} & \multicolumn{1}{l}{}                                                                                            \\
\hline
Input Sentence                                        & \begin{tabular}[c]{@{}c@{}}...上网打电\ysjred{动}过度不吃饭营养不足导致死亡。\end{tabular}                                               \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}...上网打电\ysjgreen{话}过度不吃饭营养不足导致死亡。\end{tabular}                                               \\
Correct Sentence                                     & \begin{tabular}[c]{@{}c@{}}...上网打电\ysjblue{动}过度不吃饭营养不足导致死亡。\end{tabular}                                               \\
Translation                                       & Excessive playing electronic games on the Internet, skipping \\
& meals, and nutritional deficiencies lead to death. \\
\hline
\multicolumn{1}{l}{Miscorrected example}   & \multicolumn{1}{l}{}                                                                                            \\
\hline
Input Sentence                                        & \begin{tabular}[c]{@{}c@{}}..., 而变成了一个吃\ysjred{香}难看的财迷, ...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 而变成了一个吃\ysjgreen{香}难看的财迷, ...\end{tabular}                                                        \\
Correct Sentence                                      & \begin{tabular}[c]{@{}c@{}}..., 而变成了一个吃\ysjblue{相}难看的财迷, ...\end{tabular}                                                        \\
Translation                                       & ..., and become an ugly looking moneygrubber, ...                                                                       \\
\hline
\end{tabular}
}
\label{introduce1}
\end{table}

\paragraph{Unable to differentiate a near-phonic and a near-visual conversion:}
There are two main types of Chinese character errors: near-phonic errors and near-shape errors. An example of a near-phonic error is the confusion of '香' (xiang, fragrant) with '相' (xiang, observe), shown in the second example of Table \ref{introduce1}. PLOME fails to correct this error with its learned knowledge of the similarity between arbitrary characters, since it may ignore the different importance or relevance of the phonic and glyph aspects of each character to the CSC task. Because such cases are not detected, the recall rate still has room to improve.

\paragraph{Overlooking the two-dimensional glyph:}
One distinctive characteristic of Chinese characters is that they carry both stroke information and two-dimensional graphic information. The same strokes may constitute different characters, such as '入' (ru, enter), '八' (ba, eight) and '人' (ren, man), while visually similar characters may have different strokes, e.g., '陪' (pei, accompany) and '部' (bu, part). However, recent Chinese pre-trained language models with misspelled knowledge overlook the two-dimensional glyph information of Chinese characters, which carries visual similarity information for near-visual misused characters.

This paper proposes a model for CSC, called CoSPA, to tackle the above issues, and the contributions are summarized as follows:

\begin{itemize}
\item Propose an alterable copy mechanism to alleviate over-correcting by increasing the generation probability of the original character, where the increase is learned automatically by the model.
\item Introduce an attention mechanism to improve the recall rate, which provides insight into which aspects of the erroneous input character are more relevant to its conversion into the correct output character.
\item Enrich shape embedding by integrating stroke embedding and glyph embedding via an adaptive gating unit.
% \item Achieve state-of-the-art performance on the SIGHAN2015 benchmark datasets using the proposed model.
\end{itemize}
\end{CJK*}

\section{Related work}
Early work on CSC followed the pipeline of error detection, candidate generation and selection \citep{dong-etal-2016-ace}.
Neural methods have since made progress in CSC. Wu et al. treated error correction as a sequence labeling task with conditional random fields (CRF) \citep{2018CYUT}, and Wang et al. also treated CSC as a sequence labeling problem and used a bidirectional LSTM to predict the correct characters \citep{2018sequence-labeling}. This section focuses on models based on BERT fine-tuning, BERT-based pre-trained models with misspelled knowledge, and copy mechanisms.

\subsection{BERT-based fine-tuning for CSC}
FASpell \citep{hong-etal-2019-faspell} first employed BERT as a denoising autoencoder (DAE) for CSC.
More recently, several studies have fine-tuned BERT-based models on CSC training data. BERT\_CRS+GAD \citep{Guo2021GlobalAD} introduced BERT with confusion set guided replacement strategies, where the confusion set consists of a number of sets of similar characters, to narrow the gap between BERT and misspelling correction. It also reformed the self-attention mechanism to learn the global relationships between the potentially correct input characters and the candidates for potentially erroneous characters, and used the resulting global contextual information to alleviate the influence of erroneous context.
Other recent studies utilized external knowledge of character similarity. PHMOSpell \citep{huang-etal-2021-phmospell} incorporated both phonological and morphological knowledge from two feature extractors into a pre-trained language model through an adaptive gating mechanism.
ReaLiSe \citep{xu-etal-2021-read} leveraged multimodal information to tackle the CSC task. It employed three encoders to learn informative representations from the textual, acoustic and visual modalities, and used a selective fusion mechanism to integrate the multimodal information.
However, these BERT-based models are pre-trained independently of the CSC task and thus do not learn any task-specific knowledge during pre-training, which has been shown to be effective \citep{liu-etal-2021-plome}.

\subsection{BERT-based pre-trained models with misspelled knowledge}
Pre-trained masked language models (MLM) such as BERT \citep{devlin-etal-2019-bert} and ALBERT \citep{ALBERT} have set state-of-the-art performance on a broad range of NLP tasks. Different mask strategies enabled models to jointly learn semantics and task-specific knowledge from large scale training data during pre-training.

PLOME \citep{liu-etal-2021-plome} proposed a pre-trained masked language model with misspelled knowledge for CSC. Its confusion set based masking strategy randomly replaces 15\% of the input characters with other characters, 75\% of which are drawn from the similar-character sets in the confusion set (60\% phonologically similar and 15\% visually similar), enabling the model to jointly learn semantics and misspelled knowledge. PLOME introduced phonic and shape GRU networks to capture phonological and visual similarity features, and was the first to introduce pronunciation prediction as an auxiliary objective.
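
As a rough illustration of these proportions, consider the expected counts for an input of 400 characters (a hypothetical length, chosen only so that the counts are whole numbers):
\begin{align*}
    n_{\text{replaced}} &= 0.15 \times 400 = 60, \\
    n_{\text{phonic}} &= 0.60 \times 60 = 36, \\
    n_{\text{visual}} &= 0.15 \times 60 = 9, \\
    n_{\text{confusion}} &= 36 + 9 = 45 = 0.75 \times 60 .
\end{align*}
The remaining 15 selected characters are not drawn from the confusion set; their treatment follows the PLOME masking strategy.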

RoBERTa-Pretrain-DCN \citep{dcn_csc} pre-trained the model with a confusion set based masking strategy that randomly replaced 15\% of the input characters with other characters, 15\% of which were drawn from the confusion set. MLM-phonetics \citep{2021MLM-Phonetic} pre-trained a masked language model with phonetic features to improve the model's ability to understand sentences with misspellings and to model the similarity between characters and pinyin tokens.
% These BERT-based error correction models can be regarded as a limited generation model. They generate a target character from the entire vocabulary space for each character in the input sequence. However, the vocabulary space is so large that the model is easy to generate characters different from input, which leads to a prominent problem of over-correcting.

These BERT-based error correction models can be regarded as restricted generation models: for each character in the input sequence, they generate a target character from the entire vocabulary space. When the vocabulary space is very large, the probability of miscorrection becomes high. Since most of the input characters in CSC are correct, this high probability leads to over-correcting if no countermeasures are adopted. In addition, some morphologically misused characters might be difficult to detect if the two-dimensional glyph information of Chinese characters is overlooked.
% overlooking the two-dimensional glyph information of Chinese characters might bring about near-visual misuses.

\subsection{Copy mechanism}
The copy mechanism is used in various language generation tasks, such as summarization \citep{2017Get}, machine translation \citep{gulcehre-etal-2016-pointing}, question answering \citep{2018The}, and dialogue \citep{2019Multi-Domain-Transferable}. It is introduced into the seq2seq decoder to improve model performance by ``copying and pasting'' words from the input to the output. Wang et al. used a sequence-to-sequence framework with a copy mechanism to copy the correction results directly from a prepared confusion set for the erroneous words \citep{wang-etal-2019-confusionset}. To the best of the authors' knowledge, this paper is the first to leverage a copy mechanism in the BERT framework for CSC.

\graphicspath{{./}}
\begin{figure}
% \centering
    \includegraphics[width=8.7cm, height=10.5cm]{main_struct.png}
    \caption{The framework of CoSPA.}
    \label{left_struct_figure}
\end{figure}

\section{Model}\label{sec:math}
\begin{CJK*}{UTF8}{gbsn}
As shown in Figure \ref{left_struct_figure}, PLOME \citep{liu-etal-2021-plome} is used as the base model in this paper. The input embedding of each character is the sum of its character embedding, position embedding, phonic embedding and shape embedding. An attention mechanism lets the hidden states produced by the last layer of the transformer encoder attach different importance to the phonic and shape embeddings, and the generation probability is then obtained from the softmax output. The final probability of character prediction is the weighted sum of the copy probability and the generation probability. The example input contains two types of errors, i.e., the near-phonic error '田' (tian, field) and the near-shape error '汽' (qi, steam). The phonic and glyph features are introduced to handle the conversion of near-phonic ('田' to '天') and near-shape ('汽' to '气') errors. Given an input sentence $X=\{x_1,x_2,\ldots,x_n\}$, the target is to generate the correct sentence $Y=\{y_1,y_2,\ldots,y_n\}$.
\end{CJK*}

\subsection{Shape Embedding}

This paper uses ResNet \citep{ResNet_2016_CVPR} to encode the character images into glyph representations. The encoder has 5 ResNet blocks followed by a layer normalization operation. The glyph representation $h_i^{g}$ of $x_i$ is defined as follows:
\begin{align}
    h_i^{g}=\mathrm{LayerNorm}(\mathrm{ResNet}(I_i))
\end{align}%
where $I_i$ is the image of the \emph{i}-th character $x_i$ in the input sentence, and $\mathrm{LayerNorm}(\cdot)$ denotes layer normalization.


\begin{CJK*}{UTF8}{gbsn}
 The character image of $x_i$ is read from preset font files, as shown in Figure \ref{embedding_figure}. The Microsoft YaHei font in Simplified Chinese is selected, and the size of each character image is set to $32\times32$ pixels. To obtain embeddings of the two-dimensional graphic structure of the font, each block in ResNet halves the width and height of the image and doubles the number of channels. Thus, the final output is a vector whose length equals the number of output channels, i.e., both height and width become 1. The number of output channels is set to the hidden size of the stroke embedding for the subsequent fusion. The glyph representation of the input sentence is denoted as $H^g=\{h_1^g,h_2^g,\ldots,h_n^g\}$.
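
As a quick check of the spatial dimensions, using only the halving rule and the five blocks stated above, the width and height of a 32-pixel image evolve as
\begin{align*}
    32 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1, \qquad \frac{32}{2^{5}} = 1 ,
\end{align*}
so exactly one spatial position remains after the fifth block, and the output reduces to a vector over channels.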

Finally, the shape embedding is obtained with an adaptive gating unit, which serves as a gate to finely control the fusion of the stroke embedding and the glyph embedding. The gate values $a_i^{st}$ and $a_i^g$ of the stroke embedding and the glyph embedding, and the shape embedding $h_i^{sh}$ of the \emph{i}-th character $x_i$, are computed as follows:
\begin{align}
    a_i^{st}=\sigma(W^{s}\cdot[h_i^{st},h_i^g]+b^{s})
\end{align}%
\begin{align}
    a_i^g=\sigma(W^g\cdot[h_i^{st},h_i^g]+b^g)
\end{align}%
\begin{align}
    h_i^{sh}=a_i^{st}\cdot h_i^{st}+a_i^g\cdot h_i^g
\end{align}%
where $h_i^{st}$ is the stroke embedding of the \emph{i}-th character $x_i$, $W^s$, $W^g$, $b^s$, $b^g$ are learnable parameters, $\sigma(\cdot)$ is the sigmoid function, and $[\cdot,\cdot]$ denotes the concatenation of vectors.
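
To illustrate the fusion with hypothetical numbers (two-dimensional embeddings and scalar gates are assumed here purely for readability), suppose that the gate pre-activations are $0$ for the stroke gate and $\ln 3$ for the glyph gate, with $h_i^{st}=(1,0)$ and $h_i^{g}=(0,2)$. Then
\begin{align*}
    a_i^{st} &= \sigma(0) = 0.5, \\
    a_i^{g} &= \sigma(\ln 3) = \frac{1}{1+e^{-\ln 3}} = 0.75, \\
    h_i^{sh} &= 0.5\cdot(1,0) + 0.75\cdot(0,2) = (0.5,\,1.5).
\end{align*}
The two gates are computed independently, so they need not sum to one; each embedding is scaled by its own gate.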
\end{CJK*}

% \graphicspath{{./}}
\begin{figure}
\centering
    \includegraphics[width=8.5cm, height=3.3cm]{embedding.png}
    \caption{Adaptive fusion of stroke and glyph embeddings}
    \label{embedding_figure}
\end{figure}

\subsection{Attention Mechanism}    

After the transformer encoder, a 768-dimensional vector representation (denoted as the last hidden state of BERT) is output for each position of the input sequence and is used to perform attention over the phonic and shape embeddings of that position. Following the use of the [CLS] token to represent the entire sentence \citep{devlin-etal-2019-bert}, this paper also uses the last hidden state of the [CLS] token to perform the same attention operation, so that the semantics of the entire sentence are taken into account. The attention vector $F_i$ of the character $x_i$ is defined as follows:
\begin{align}
    F_i=\displaystyle\sum_{k\in\{p,s\}}a_{i,k}E_{i,k}
\end{align}%
where $E_{i,k}$ is the embedding of feature $k$ (phonic $p$ or shape $s$) of the \emph{i}-th character and $a_{i,k}$ is the corresponding weight, with $(a_{i,p},a_{i,s})\in\mathbb{R}^{1\times2}$. The weight is computed by
\begin{align}
    a_{i,k}=\frac{1}{2}\displaystyle\sum_{m\in\{h,[CLS]\}}a_{i,k}^m
\end{align}%
\begin{align}
    a_{i,k}^m=\frac{\exp(E_{i,m}^TE_{i,k}/\beta)}{\displaystyle\sum_{k'\in\{p,s\}}\exp(E_{i,m}^TE_{i,{k'}}/\beta)}\label{L1}
\end{align}%
% \begin{align}
%     a_{i,k}^{[CLS]}=\frac{exp(E_{i,[CLS]}^TE_{i,k}/\beta)}{\displaystyle\sum_{k'∈\{p,s\}}exp(E_{i,[CLS]}^TE_{i,{k'}}/\beta)} \label{L2}
% \end{align}%
where $a_{i,k}^m$ denotes the weight that query $m$ (the last hidden state of the character, $h$, or of the [CLS] token) assigns to representation $k$ for the \emph{i}-th character, with $(a_{i,p}^m,a_{i,s}^m)\in\mathbb{R}^{1\times2}$. $E_{i,h}$ is the last hidden state of the \emph{i}-th character, i.e., the \emph{i}-th row of the hidden-state matrix in $\mathbb{R}^{N\times 768}$ (where $N$ is the sequence length), $E_{i,[CLS]}$ is the last hidden state of the [CLS] token, and $\beta$ is a hyper-parameter determined by experiment that controls the smoothness of the attention weights. Finally, a residual connection combines $F_i$ and $E_{i,h}$:
\begin{align}
    E_i=F_i+E_{i,h}
\end{align}%
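
The following worked example uses hypothetical dot products to show how the weights of the two queries are combined; $\beta=4$ is assumed and all numbers are purely illustrative. Suppose that $E_{i,h}^TE_{i,p}=8$ and $E_{i,h}^TE_{i,s}=4$ for the hidden-state query, and that $E_{i,[CLS]}^TE_{i,p}=E_{i,[CLS]}^TE_{i,s}=4$ for the [CLS] query. Then
\begin{align*}
    a_{i,p}^{h} &= \frac{e^{8/4}}{e^{8/4}+e^{4/4}} = \frac{1}{1+e^{-1}} \approx 0.731, \\
    a_{i,s}^{h} &\approx 1-0.731 = 0.269, \\
    a_{i,p}^{[CLS]} &= a_{i,s}^{[CLS]} = \frac{e^{1}}{e^{1}+e^{1}} = 0.5, \\
    a_{i,p} &= \tfrac{1}{2}(0.731+0.5) \approx 0.616, \\
    a_{i,s} &= \tfrac{1}{2}(0.269+0.5) \approx 0.384,
\end{align*}
so that $F_i \approx 0.616\,E_{i,p}+0.384\,E_{i,s}$ and $E_i=F_i+E_{i,h}$. A larger $\beta$ shrinks the scaled dot products and moves both weights toward $0.5$.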

During training, the representation $E_i$ is fed into a fully-connected layer for the final classification. The generation probability $P_{gen}$ of the character predicted for the \emph{i}-th character $x_i$ is defined as:
\begin{align}
    P_{gen}(y_j|X)=\mathrm{softmax}(E_iW_c+b_c)
\end{align}%
where $W_c\in\mathbb{R}^{768\times V}$ and $b_c\in\mathbb{R}^{V}$ are learnable parameters of the fully-connected layer, $V$ is the size of the vocabulary, and $y_j$ is the \emph{j}-th character in the vocabulary. For more details about the probability of pronunciation prediction, please refer to PLOME \citep{liu-etal-2021-plome}.

\subsection{Copy Mechanism}
In addition to the generation probability $P_{gen}$ produced by BERT for each character, a copy probability $P_{copy}$ is introduced, which is the probability that the model directly outputs the input character. The final probability is the weighted sum of the two distributions:
\begin{align}
    P=P_{copy}\cdot P_{input}+(1-P_{copy})\cdot P_{gen}
\end{align}%
where $P_{input}$ is the one-hot encoding of the input character. If the top-one generation probability is close to the generation probability of the input character, the model is usually uncertain whether to output the generated character or to copy the input character. In this case, the input character is given more weight during generation. The copy probability is computed as follows:
\begin{align}
    P_{copy}=\frac{\mathrm{sigmoid}(\mathrm{ReLU}(W_1\cdot h_i)\cdot W_2)}{e^{\tau(p_{top1}-p_{in})}}
\end{align}%
where $W_1$ and $W_2$ are trainable parameters of the model, $p_{top1}$ is the top-one generation probability for the character, and $p_{in}$ is the generation probability of the original input character in the vocabulary. $\tau$ is a temperature parameter determined by experiment, where increasing $\tau$ makes the distribution flatter. When the difference between the top-one generation probability and the generation probability of the original input character is not significant, BERT has low confidence in its top-one character, and the model thus tends to output the original character.
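
The effect of the denominator can be seen in a small worked example. The values are hypothetical: the sigmoid output is assumed to be $0.8$, $\tau=5$ is chosen only for illustration, and the top-one character is assumed to differ from the input character. In an uncertain case with $p_{top1}=0.50$ and $p_{in}=0.45$:
\begin{align*}
    P_{copy} &= \frac{0.8}{e^{5(0.50-0.45)}} = \frac{0.8}{e^{0.25}} \approx 0.623, \\
    P(\text{input}) &\approx 0.623+(1-0.623)\cdot 0.45 \approx 0.793, \\
    P(\text{top-1}) &\approx (1-0.623)\cdot 0.50 \approx 0.188,
\end{align*}
so the input character is kept. In a confident case with $p_{top1}=0.95$ and $p_{in}=0.01$:
\begin{align*}
    P_{copy} &= \frac{0.8}{e^{5(0.95-0.01)}} = \frac{0.8}{e^{4.7}} \approx 0.007, \\
    P(\text{input}) &\approx 0.007+0.993\cdot 0.01 \approx 0.017, \\
    P(\text{top-1}) &\approx 0.993\cdot 0.95 \approx 0.943,
\end{align*}
so the generated character is output.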

\subsection{Learning}
The learning process is driven by minimizing the negative log-likelihood of the character prediction, $L_c$, and of the phonic prediction, $L_p$:
\begin{align}
    L=\alpha\cdot L_c+(1-\alpha)\cdot L_p
\end{align}%
\begin{align}
    L_c=-\displaystyle\sum_{i=1}^n\log p_c(y_i=l_i|X)
\end{align}%
\begin{align}
    L_p=-\displaystyle\sum_{i=1}^n\log p_p(g_i=r_i|X)
\end{align}%
where $L$ denotes the overall objective, $l_i$ and $r_i$ are the true character and the true phonic of $x_i$, respectively, $y_i$ and $g_i$ are the predicted character and the predicted phonic of $x_i$, and $\alpha$ is set to 0.7.
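
For instance, consider a two-character input with hypothetical predicted probabilities: $0.9$ for each true character and $0.5$ for each true phonic. Assuming natural logarithms and $\alpha=0.7$:
\begin{align*}
    L_c &= -(\ln 0.9+\ln 0.9) \approx 0.21, \\
    L_p &= -(\ln 0.5+\ln 0.5) \approx 1.39, \\
    L &\approx 0.7\cdot 0.21+0.3\cdot 1.39 \approx 0.56 .
\end{align*}
With $\alpha=0.7$, the character prediction loss carries more than twice the weight of the phonic prediction loss.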

\section{Experiments}

\begin{table}[]
\caption{Statistics of datasets.}
\begin{tabular}{ccc}
\hline
Training Data       & \# erroneous sent / sent & Avg.length \\
\hline
SIGHAN13            & 340/700                  & 41.8       \\
SIGHAN14            & 3358/3437                & 49.3       \\
SIGHAN15            & 2273/2339                & 31.3       \\
(Wang et al., 2018) & 271009/271329            & 42.5       \\
\hline
Total               & 276980/277805            & 42.5       \\
\hline
\hline
Test Data           & \# erroneous sent / sent & Avg.length \\
\hline
SIGHAN15            & 541/1100                 & 30.6      \\
\hline
\end{tabular}
\label{table-data}
\end{table}

\begin{table*}[]
\caption{Comparisons among different models in terms of P-R-F (Precision, Recall and F1 score)}
\scalebox{1.03}{
\begin{tabular}{l|ccc|ccc|ccc|ccc} 
\hline
\multirow{3}{*}{\begin{tabular}[c]{@{}l@{}}\\Method\end{tabular}} & \multicolumn{6}{c|}{Character-level}                                         & \multicolumn{6}{c}{Sentence-level}                                            \\ 
\cline{2-13}
                                                                  & \multicolumn{3}{c|}{Detection-level} & \multicolumn{3}{c|}{Correction-level} & \multicolumn{3}{c|}{Detection-level} & \multicolumn{3}{c}{Correction-level} \\
\cline{2-13}
                                                                  & P    & R    & F                      & P    & R    & F                       & P    & R    & F                      & P    & R    & F                                              \\ 
\hline
PN                                                                & 66.8 & 73.1 & 69.8                   & 71.5 & 59.5 & 69.9                    & -    & -    & -                      & -    & -    & -                                            \\
ReaLiSe                                                           & -    & -    & -                      & -    & -    & -                       & 77.3 & 81.3 & 79.3                   & 75.9 & 79.9 & 77.8                                         \\
PHMOSpell                                                         & -    & -    & -                      & -    & -    & -                       & \bf{90.1} & 72.7 & 80.5                   & \bf{89.6} & 69.2 & 78.1                                         \\
RoBERTa-Pretrain-DCN                                              & -    & -    & -                      & -    & -    & -                       & 77.1 & 80.9 & 79.0                   & 74.5 & 78.2 & 76.3                                         \\
MLM-phonetics                                                     & -    & -    & -                      & -    & -    & -                       & 77.5 & \bf{83.1} & 80.2                   & 74.9 & \bf{80.2} & 77.5                                         \\
BERT\_CRS+GAD                                                     & 88.6 & 87.8 & 88.2                   & 96.3 & 84.6 & 90.1                    & 75.6 & 80.4 & 77.9                   & 73.2 & 77.8 & 75.4                                      \\ 
\hline
PLOME                                                             & 94.5 & 87.4 & 90.8                   & 97.2 & 84.3 & 90.3                    & 77.4 & 81.5 & 79.4                   & 75.3 & 79.3 & 77.2                                      \\
CoSPA                                                             & \bf{95.9} & \bf{88.6} & \bf{92.1}                   & \bf{98.5} & \bf{85.3} & \bf{91.4}                    & 79.0 & 82.4 & \bf{80.7}                   & 76.7 & 80.0 & \bf{78.3}                                   \\
\hline
\end{tabular}}
\label{main-results}
\end{table*}

\subsection{Datasets and Implementation Details}
Table \ref{table-data} shows the  statistics of the datasets used in the experiments \citep{zhang2020spelling}.

\paragraph{Training Data} The training data consists of 10K manually annotated samples from SIGHAN \citep{wu-etal-2013-chinese,yu-etal-2014-overview,tseng-etal-2015-introduction}, together with 271K training samples automatically generated by OCR-based and ASR-based methods, as in \citet{cheng2020spellgcn}.

\paragraph{Evaluation Data} Following \citet{zhang2020spelling}, the latest SIGHAN test dataset \citep{tseng-etal-2015-introduction} is used to evaluate the proposed model. It contains 1100 test sentences, about half of which contain at least one spelling error.
% 542 testing sentences and each of them included at least one spelling error.

\paragraph{Evaluation Metrics} Precision, recall and F1 scores are used as the evaluation metrics. Besides character-level evaluation, sentence-level metrics are also adopted for error detection and correction. These metrics are computed using the script from \citet{cheng2020spellgcn}.
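
Assuming the standard definition of the F1 score as the harmonic mean of precision $P$ and recall $R$, the reported scores can be checked directly. For the character-level detection results of CoSPA in Table \ref{main-results}:
\begin{align*}
    F_1=\frac{2PR}{P+R}=\frac{2\times 95.9\times 88.6}{95.9+88.6}=\frac{16993.48}{184.5}\approx 92.1 .
\end{align*}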

\paragraph{Training Details} CoSPA is built on the PLOME repository \citep{liu-etal-2021-plome} and uses the TensorFlow 1.14 framework. It is trained with the AdamW optimizer for 10 epochs, with a learning rate of 5e-5, a batch size of 32 and a maximum sentence length of 180. The learning rate is warmed up and then decayed linearly.

\begin{table*}[]
% \centering
\caption{Ablation results where '- Glyph', '- Attention', and '- Copy' indicate the effects of removing the corresponding mechanisms}
\scalebox{1.18}{
\setlength{\tabcolsep}{2mm}
\linespread{1.5}
\begin{tabular}{l|cccccc|cccccc}
\hline
\multicolumn{1}{c|}{\multirow{3}{*}{Method}} & \multicolumn{6}{c|}{Character-level}                                            & \multicolumn{6}{c}{Sentence-level}                                              \\ \cline{2-13}
\multicolumn{1}{c|}{}                        & \multicolumn{3}{c|}{Detection-level}    & \multicolumn{3}{c|}{Correction-level} & \multicolumn{3}{c|}{Detection-level}    & \multicolumn{3}{c}{Correction-level}                       \\ \cline{2-13}
\multicolumn{1}{c|}{}                        & P    & R    & \multicolumn{1}{c|}{F}    & P           & R          & F          & P    & R    & \multicolumn{1}{c|}{F}    & P           & R          & F                                \\ \hline
CoSPA                                        & 95.9 & 88.6 & \multicolumn{1}{c|}{92.1} & 98.5        & 85.3       & 91.4       & 79.0 & 82.4 & \multicolumn{1}{c|}{80.7} & 76.7        & 80.0       & 78.3                         \\
- Glyph                                      & 95.7 & 88.4 & \multicolumn{1}{c|}{91.9} & 98.2        & 85.1       & 91.2       & 78.9 & 81.9 & \multicolumn{1}{c|}{80.4} & 76.4        & 79.7       & 78.0                         \\
- Attention                                  & 95.6 & 88.0 & \multicolumn{1}{c|}{91.7} & 98.0        & 84.8       & 90.9       & 78.6 & 82.0 & \multicolumn{1}{c|}{80.2} & 76.0        & 79.6       & 77.7                         \\
- Copy                                       & 95.2 & 87.9 & \multicolumn{1}{c|}{91.4} & 97.7        & 84.7       & 90.7       & 78.0 & 81.7 & \multicolumn{1}{c|}{79.8} & 75.8        & 79.5       & 77.5                        \\ \hline
\end{tabular}
}
\label{ablation-table}
\end{table*}


\subsection{Comparisons with other methods}


Table \ref{main-results} shows the evaluation scores compared with all previous methods at detection and correction levels on the SIGHAN2015 test datasets, and CoSPA achieves the new state-of-the-art performance.

Compared with the best baseline method {\bf PLOME}, for character-level and sentence-level, the improvements of CoSPA are 1.3\% on detection-level F1 and 1.1\% on correction-level F1 respectively.

Both {\bf ReaLiSe} and {\bf PHMOSpell} incorporated phonological and morphological knowledge into the semantic space for CSC and achieved a relatively good performance. They leverage multimodal information and selectively fuse just to match the characteristics of Chinese characters themselves. CoSPA is more focused on the phonic and glyph aspects of each character to distinguish a near-phonic or a near-visual conversion for the CSC task. CoSPA exceeds both of them illustrates the effectiveness of attention mechanism.

{\bf MLM-phonics}, with the help of additional Pinyin tokens, integrated phonetic features into the word embedding, thus improving the generalization of the model. However, the rich shape information of Chinese characters was overlooked.
{\bf RoBERTa-Pretrain-DCN} addressed the incoherence problem by modeling the dependencies among output tokens, since BERT is a non-autoregressive language model that relies on the output independence assumption. However, it does not model the similarity between arbitrary characters.
{\bf BERT\_CRS+GAD} narrowed the gap between BERT and spelling error correction with a confusion-set-guided replacement strategy during fine-tuning, but it does not learn any task-specific knowledge during pre-training and is therefore sub-optimal.

Specifically, {\bf PN} also employed a Seq2Seq model with a copy mechanism, which generated a new sentence by considering extra candidates from the confusion set. In contrast, CoSPA increases the copy probability of the original input when generating with BERT, and improves F1 by a large margin: 22.9\% at the detection level and 22\% at the character level. This indicates that the copy mechanism has a significant effect when used with BERT.


\subsection{Ablation Study}

As Table~\ref{ablation-table} shows, when the copy mechanism is removed, both the detection and correction F1 scores decrease by about 0.8\% at the character level and the sentence level. This demonstrates that the copy mechanism makes decoding more effective for the CSC task. Whichever component is removed, the performance of CoSPA drops, which demonstrates that each part of the model is effective.

\paragraph{Effect of Character Image Resolution}
As Table~\ref{attention-table} shows, using a character image size of $64\times 64$ brings only a limited improvement; a possible reason is that the shape representation can usually be modeled well by strokes.

\paragraph{Effect of Attention Mechanism}
% Table \ref{resuts}
\begin{CJK*}{UTF8}{gbsn}
This paper investigates how best to use the attention mechanism for CSC by comparing it against simple summation (summing the hidden states with the phonic and shape embeddings) and against a residual connection, under different values of the hyper-parameter $\beta$ in formula (\ref{L1}). The results in Table~\ref{attention-table} show that simple summation performs poorly for CSC. A likely reason is that summation enhances the phonic and glyph features of characters indiscriminately, which hurts the model because CSC involves different error types. Applying attention over the character, phonic and shape embeddings is feasible, but it is surpassed by the variant that connects the hidden states, through a residual connection, with the attention vector over the phonic and shape embeddings. This indicates that the residual connection provides useful contextual semantic information. Furthermore, the hidden state already contains rich representation information. Using it as the query allows the model to learn the relationship between the phonic and glyph features and forces it to focus on controlling the fusion of the two, which helps identify the conversion type of a typo. The hyper-parameter $\beta$ is incorporated into the attention operation because the dot products may grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. Based on the experimental results, Hidden+Attention with $\beta=4$ was chosen.
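
As an illustration of how $\beta$ affects the weights, consider a minimal sketch that assumes formula (\ref{L1}) takes a scaled-softmax form. The scores $e_p$ and $e_s$ are hypothetical names, introduced only here, for the dot products between the hidden-state query and the phonic and shape embeddings; they are not notation taken from the formula itself. Under this assumption the phonic and shape weights $a_p$ and $a_s$ are
\begin{equation*}
a_p = \frac{\exp(e_p/\beta)}{\exp(e_p/\beta)+\exp(e_s/\beta)}, \qquad a_s = 1 - a_p ,
\end{equation*}
so that $a_p + a_s = 1$, and the sensitivity of the phonic weight to its score is
\begin{equation*}
\frac{\partial a_p}{\partial e_p} = \frac{a_p\, a_s}{\beta}.
\end{equation*}
When $|e_p - e_s|$ is large relative to $\beta$, one weight approaches 1, the product $a_p a_s$ approaches 0, and the gradient vanishes. A larger $\beta$ keeps the weights away from this saturated regime but also shrinks the gradient by the factor $1/\beta$, which under this reading may explain why an intermediate value performs best.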

The attention weights in Table~\ref{weight-table} show that the phonic and glyph representations address the second issue and are highly interpretable.
$a_p$ and $a_s$ denote the weights of phonic and shape, respectively. For an error that is both phonic and shape related, such as '根'(gen, root) and '跟'(gen, and), the first example assigns substantial weight to both phonic ($a_p=0.44$) and shape ($a_s=0.56$). For a phonic error, such as '师'(shi, teacher) and '书'(shu, book), the second example assigns a higher weight to phonic ($a_p=0.86$). This shows that the attention mechanism helps the model attach different importance to near-phonic and near-visual conversions. In addition, averaged over the erroneous characters in SIGHAN15, the phonic and shape weights are 0.73 and 0.27, respectively, which means that the phonic aspect is more important than the shape aspect. This is consistent with the fact that near-phonic spelling errors are more frequent than near-shape ones \citep{liu-etal-2010-visually}.
\end{CJK*}

\begin{table}[]
\centering
\caption{Character-level F1 with different strategies}
\begin{tabular}{lcc}
\hline
Strategy     & Detection F1  & Correction F1  \\
\hline
Sum                    & 86.4 & 86.5 \\
\hline
Attention ($\beta=1$)         & 91.4 & 90.9 \\
Attention ($\beta=4$)         & 91.6 & 91.0 \\
Attention ($\beta=10$)        & 91.2 & 90.7 \\
\hline
Hidden+Attention ($\beta=1$)  & 91.9 & 91.2 \\
Hidden+Attention ($\beta=4$)  & 92.1 & 91.4 \\
Hidden+Attention ($\beta=10$) & 91.8 & 91.1 \\
\hline
\hline
Copy ($\tau=0$)         & 91.5 & 91.0 \\
Copy ($\tau=1.0$)       & 91.9 & 91.2 \\
Copy ($\tau=6.0$)       & 92.1 & 91.4 \\
Copy ($\tau=12.0$)      & 91.5 & 90.9 \\
\hline
\hline
Image size $32\times 32$         & 92.1 & 91.4 \\
Image size $64\times 64$         & 92.1 & 91.5 \\
\hline
\end{tabular}
\label{attention-table}
\end{table}

\paragraph{Effect of Copy Mechanism}

The temperature $\tau$ is a hyper-parameter that controls the copy probability by scaling the difference between the generated top-one probability of each character and the generated probability of the original input character in the vocabulary. The experiments show that the best performance is achieved at $\tau=6.0$.
% On the other hand, when $\tau=0$, that is to say, the difference between them is not considered, the performance is unsatisfactory. The reason of the situation may be caused by ignoring that when the difference of the generated top one probability and the generated original input probability is not signifcant, i.e., there are multiple 'reasonable' candidate corrections, model should avoid over-correcting.

\begin{CJK*}{UTF8}{gbsn}
\begin{table}[]
\centering
\caption{Attention mechanism weights}
\scalebox{0.9}{
\setlength{\tabcolsep}{0.7mm}
\linespread{1.5}
\begin{tabular}{ccccccccccc}
\hline
\multicolumn{1}{c}{Eng.} & \multicolumn{5}{l}{Chinese and Japanese.} \\
\multicolumn{1}{c}{Input} & 华    & 语    & \ysjred{根}    & 日    & 文    \\
\multicolumn{1}{c}{Output}  & 华    & 语    & \ysjgreen{跟}    & 日    & 文    \\
$a_p$                         & 0.74 & 0.69 & 0.44 & 0.68 & 0.72  \\
$a_s$                         & 0.26 & 0.31 & \ysjblue{0.56} & 0.32 & 0.28  \\
\hline
\multicolumn{1}{c}{Eng.} & \multicolumn{10}{l}{And it's hard for teachers to teach.} \\
Input & 而    & 且    & 老    & 师    & 也    & 很    & 难    & 教    & \ysjred{师}     & 。    \\
Output  & 而    & 且    & 老    & 师    & 也    & 很    & 难    & 教    & \ysjgreen{书}     & 。    \\
$a_p$     & 0.72 & 0.69 & 0.70 & 0.75 & 0.68 & 0.70 & 0.72 & 0.66 & \ysjblue{0.86} & 0.99 \\
$a_s$     & 0.28 & 0.31 & 0.30 & 0.25 & 0.32 & 0.30 & 0.28 & 0.34 & 0.14                         & 0.01 \\
\hline
\end{tabular}
}
\label{weight-table}
\end{table}
\end{CJK*}


\begin{table}[]
\centering
\caption{Comparison of inference time (seconds)}
\begin{tabular}{cccc}
\hline
\diagbox{Method}{Sentences}       & 5000 & 10000 & 15000  \\
\hline
PLOME            & 71.97  & 142.24  & 212.45                       \\
CoSPA            & 74.60  & 147.73  & 220.80                    \\
\hline
\end{tabular}
\label{table-inference}
\end{table}

\paragraph{Comparison of Inference Time}
Table~\ref{table-inference} shows the inference times of PLOME and CoSPA on 5000, 10000 and 15000 sentences. CoSPA requires slightly more time because it adds modules such as the copy mechanism.
% It can be observed that CoSPA requires slightly increased inference time compared to PLOME, which is acceptable.
% and CoSPA requires slightly increased inference time compared to PLOME,
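
To quantify the extra cost, the relative overhead of CoSPA over PLOME can be computed directly from the values in Table~\ref{table-inference}; the symbol $n$ for the number of sentences is introduced here only for this calculation:
\begin{align*}
n = 5000:  &\quad \frac{74.60 - 71.97}{71.97} \approx 3.7\%, \\
n = 10000: &\quad \frac{147.73 - 142.24}{142.24} \approx 3.9\%, \\
n = 15000: &\quad \frac{220.80 - 212.45}{212.45} \approx 3.9\%.
\end{align*}
The overhead therefore stays below 4\% and is roughly constant in $n$; for $n = 5000$ it corresponds to about 14.9\,ms per sentence for CoSPA versus 14.4\,ms for PLOME.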

\subsection{Demonstration Examples}

\begin{CJK*}{UTF8}{gbsn}
% We conduct an error analysis on two types of incorrect cases of PLOME, namely, the false positive and the false negative case, which affect the precision and recall in CSC, respectively.

For the false positive case of PLOME, in the first example in Table~\ref{case-study1}, the ground-truth '悔'(hui, regret) is more suitable given the preceding word '错过', which means missing. However, PLOME over-corrects '悔' to '会'(hui, will). Compared with other positions, the difference between the generated top-one probability (0.839) and the generated probability of the input character '悔' (0.150) is not significant. In such circumstances, the copy probability of the input character should be weighted up during generation so that the original character is output preferentially. As a result, CoSPA ranks the input character first.

% it is observed that there are multiple 'reasonable' candidate corrections to replace the '近'(close) character, and the genarated top-one probability (0.33) of the '进'(forward) character is close to the generated probability (0.31) of the input character '近', but ground-truth '近' is more suitable because the context contains the meaning of closer. That means BERT-base model is uncertain to select which one. Under these circumstances, the input character should be weighted during generation to output the original character preferentially. After increasing the generated probability of the input character, CoSPA makes the input character rank as the first.

For the false negative case of PLOME, in the second example, PLOME changes the erroneous character '素'(su, plain) to '赢'(ying, win). In contrast, given the following word '赔偿', which means compensation, CoSPA correctly weighs the phonetic, shape and semantic correlations of the phonetic error '素' and predicts the correct result '诉'(su, suit).

% for the erroneous sentence '...素(su)取(qu)...', PLOME corrects '素(su)取' as '赢(ying)取' , which is only a reasonable match, not semantically reasonable. However, the output of CoSPA is the best correction considering the context semantics, and '诉'(su, suit) has the same prounciation with '素'(su, plain), which illustrates that the model is able to correctly learn the near-phonic conversion between erroneous character input and correct character output. Furthermore, '赢取(win for)' and '诉取(sue for)' are two related common Chinese words; it needs to incorporate more context in order to improve the recall performance. 
% More cases can be found in the Appendix A.1.

% \begin{table}[]
% % \centering
% \scalebox{0.7}{
% \setlength{\tabcolsep}{2mm}
% \linespread{1.5}
% \begin{tabular}{l|c}
% \hline
% input          & 希望您帮我\ysjred{素}取公平，得到他们适当的赔偿。                                                    \\
% \hline
% PLOME          & 希望您帮我\ysjgreen{索}取公平，得到他们适当的赔偿。                                                    \\
% CoSPA & 希望您帮我\ysjblue{诉}取公平，得到他们适当的赔偿。                                                    \\
% \hline
% translation    & I hope you can help me claim justice and get their proper compensation. \\
% \hline
% \hline
% input          & 总想让他与你更\ysjred{近}一步？吉列引力                                                    \\
% \hline
% PLOME          & 总想让他与你更\ysjgreen{进}一步？吉列引力                                                    \\
% CoSPA & 总想让他与你更\ysjblue{近}一步？吉列引力                                                    \\
% \hline
% translation    & Always want to make him one step closer to you? Gillette Gravity. \\
% \hline
% \end{tabular}
% }
% \caption{Two examples of the input and output of our CoSPA model. We highlight the \ysjred{input}/\ysjgreen{PLOME}/\ysjblue{CoSPA} characters in \ysjred{red}/\ysjgreen{green}/\ysjblue{blue} color.}
% \label{case-study1}
% \end{table}

\begin{center}
\begin{table}[]
% \centering
\caption{Two examples of inputs and outputs of CoSPA and PLOME}
\scalebox{0.7}{
\setlength{\tabcolsep}{0.5mm}
\linespread{2.0}
% \begin{tabular}{c|c|c}
% \hline
% Type        & Sentence                                                                & \begin{tabular}[c]{@{}c@{}}Probability distribution \\ of candidates\end{tabular}
\begin{tabular}{c c|c}
\hline
\multicolumn{2}{l|}{Over-correcting example}                                                                  & \begin{tabular}[c]{@{}c@{}}Probability distribution \\ of candidates\end{tabular}
\\
\hline
input       & 痛风患者错过后\ysjred{悔}哭                                                         & -                                      \\
PLOME       & 痛风患者错过后\ysjgreen{会}哭                                                         & \ysjgreen{会 0.839}, 悔 0.150, 回 0.000, ...            \\
CoSPA       & 痛风患者错过后\ysjblue{悔}哭                                                         & \ysjblue{悔 0.970}, 会 0.020, 回 0.000, ...            \\
Gold       & 痛风患者错过后\ysjred{悔}哭                                                         & -                                      \\
Trans. & Gout patients miss, regret crying        & -      \\
\hline
\hline
\multicolumn{2}{l}{Miscorrected example}    
\\
\hline
input       & 希望您帮我\ysjred{素}取公平，得到他们适当的赔偿。                                                    & -                                      \\
PLOME       & 希望您帮我\ysjgreen{赢}取公平，得到他们适当的赔偿。                                                    & -                                      \\
CoSPA       & 希望您帮我\ysjblue{诉}取公平，得到他们适当的赔偿。                                                    & -                                      \\
Gold       & 希望您帮我\ysjred{诉}取公平，得到他们适当的赔偿。                                                    & -                                      \\
Trans. & I hope you can help me claim justice and get \\
& their proper compensation. & -                                      \\
\hline
\end{tabular}
}
\label{case-study1}
\end{table}
\end{center}
\end{CJK*}

\section{Further Discussion}
Table~\ref{main-results} gives the aggregated comparison of CoSPA with PLOME and other approaches in terms of precision, recall and F1-score at the character level and the sentence level, and Table~\ref{case-study1} gives two cases in which CoSPA is superior to PLOME. A reader may want to see more cases of how these approaches behave.
This section therefore conducts an error analysis on three categories: PLOME is wrong but CoSPA is right, PLOME is right but CoSPA is wrong, and both PLOME and CoSPA are wrong, which account for 45.9\%, 30.8\% and 23.3\%, respectively. For each category, two types of incorrect cases are analysed, namely the false positive and the false negative cases, which affect precision and recall in CSC, respectively. For each type in each category, 10 examples are randomly selected; the analysis is conducted for 2 runs, and the averaged results are as follows.
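
The standard deviations reported in this section can be read as follows. This is a sketch that assumes the sample standard deviation (with $n-1$ in the denominator) is taken over the two runs; $p_1$ and $p_2$ are hypothetical names, introduced only here, for the proportion of a given cause in the first and second run. With two runs,
\begin{equation*}
\bar{p} = \frac{p_1 + p_2}{2}, \qquad
s = \sqrt{\frac{(p_1-\bar{p})^2 + (p_2-\bar{p})^2}{2-1}} = \frac{|p_1 - p_2|}{\sqrt{2}} .
\end{equation*}
Because each run contains 10 examples, $p_1$ and $p_2$ are multiples of $0.1$, so $s$ takes values such as $0$, $0.1/\sqrt{2}\approx 0.07$ and $0.2/\sqrt{2}\approx 0.14$. For instance, $p_1 = 0.1$ and $p_2 = 0.3$ give $\bar{p} = 20\%$ with $s \approx 0.14$. Under this reading, the standard deviations are expressed as fractions rather than as percentages.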

\begin{CJK*}{UTF8}{gbsn}

\paragraph{Q1: PLOME is wrong but CoSPA is right}

As Table~\ref{Appendix1} shows, for the false positive cases, 20\% arise because the BERT-base model easily rectifies a correct low-frequency collocation into a high-frequency one (standard deviation, denoted as Std Dev, 0.14).
% To improve the precision, a possible approach to handle such cases is to encourage the selection of the original input token when the model is generated through the regular term. 
25\% arise from failing to distinguish 的(de, followed by noun)/地(di, followed by verb)/得(de, followed by adjective) or 他(ta, he)/她(ta, she)/它(ta, it) (Std Dev 0.07); such cases might be handled by Chinese grammar rules. 20\% are phonetic or shape errors (Std Dev 0.14); a potential approach is to introduce glyph and pinyin features more effectively to break the limitation of artificial confusion sets. The remaining 35\% have other causes, mainly because the space of the CSC task is very large: in real scenarios any character may be miswritten as any other, while the mapping rules learned during training are limited.


As Table~\ref{Appendix1} shows, for the false negative cases, 25\% are phonetic or shape errors (Std Dev 0.07). 30\% are continuous errors (Std Dev 0.14); one potential solution is to correct a sentence incrementally through multi-round inference until the model no longer corrects any words. 15\% arise from failing to distinguish 的/地/得 or 他/她/它 (Std Dev 0.07), and 30\% have other causes.

\begin{table}[]
\centering
\caption{PLOME is wrong but CoSPA is right}
\scalebox{0.68}{
\setlength{\tabcolsep}{0.6mm}
\linespread{1.5}
\begin{tabular}{c|c}
\hline
\multicolumn{2}{l}{Type 1: False positive cases of PLOME, CoSPA is right.}                                                       \\
\hline
\hline
\multicolumn{2}{l}{20\% of Cases: BERT tends to rectify a correct low-frequency match into a high-frequency one.}                                                       \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}这个宠物死掉的时候有心的养\ysjred{主}一定会难过，...\end{tabular}                                               \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}这个宠物死掉的时候有心的养\ysjgreen{生}一定会难过，...\end{tabular}                                               \\
CoSPA                                        & \begin{tabular}[c]{@{}c@{}}这个宠物死掉的时候有心的养\ysjblue{主}一定会难过，...\end{tabular}                                               \\
Gold                                     & \begin{tabular}[c]{@{}c@{}}这个宠物死掉的时候有心的养\ysjred{主}一定会难过，...\end{tabular}                                               \\
Trans                                       & Careful owners will be sad when this pet dies, ... \\
% \hline
% \multicolumn{2}{l}{25\% of Cases: Unable to distinguish 的/地/得 or 他/她/它.}                                                                        \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}... 您没有办法好好\ysjred{的}处理这件事的话 ...\end{tabular}                                                        \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}... 您没有办法好好\ysjgreen{地}处理这件事的话 ...\end{tabular}                                                        \\
% CoSPA                                      & \begin{tabular}[c]{@{}c@{}}... 您没有办法好好\ysjblue{的}处理这件事的话 ...\end{tabular}                                                        \\
% Gold                                      & \begin{tabular}[c]{@{}c@{}}... 您没有办法好好\ysjred{的}处理这件事的话 ...\end{tabular}                                                        \\
% Trans                                       & ..., If you can't handle it properly, ...                                                                      \\
\hline
\multicolumn{2}{l}{20\% of Cases: Phonetic or shape errors.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}大家也可怕你的工厂\ysjred{把}自然被坏, ...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}大家也可怕你的工厂\ysjgreen{吧}自然破坏, ...\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}大家也可怕你的工厂\ysjblue{把}自然破坏, ...\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}大家也可怕你的工厂\ysjred{把}自然破坏, ...\end{tabular}                                                        \\
Trans                                       & Everyone is also afraid that your factory destroys nature, ...                                                                      \\
\hline
\multicolumn{2}{l}{35\% of Cases: Other reasons.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 不应该都放在\ysjred{各}各孩子的身上。\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 不应该都放在\ysjgreen{个}个孩子的身上。\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 不应该都放在\ysjblue{各}个孩子的身上。\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 不应该都放在\ysjred{各}个孩子的身上。\end{tabular}                                                        \\
Trans                                       & ..., should not be placed on every child.                                                                      \\
\hline
\hline
\multicolumn{2}{l}{Type 2: False negative cases of PLOME, CoSPA is right.}                                                       \\
\hline
% \hline
% \multicolumn{2}{l}{25\% of Cases: Phonetic or shape errors.}                                                       \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}..., 那\ysjred{点}在中友百货附近。\end{tabular}                                               \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 那\ysjgreen{点}在中友百货附近。\end{tabular}                                               \\
% CoSPA                                        & \begin{tabular}[c]{@{}c@{}}..., 那\ysjblue{店}在中友百货附近。\end{tabular}                                               \\
% Gold                                     & \begin{tabular}[c]{@{}c@{}}..., 那\ysjred{店}在中友百货附近。\end{tabular}                                               \\
% Trans                                       & ..., that store is near Zhongyou Department Store.  \\
\hline
\multicolumn{2}{l}{30\% of Cases: Continuous error.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}你的工厂机器声音是\ysjred{海曼}大声, ...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}你的工厂机器声音是\ysjgreen{海曼}大声, ...\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}你的工厂机器声音是\ysjblue{还蛮}大声, ...\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}你的工厂机器声音是\ysjred{还蛮}大声, ...\end{tabular}                                                        \\
Trans                                       & Your factory machines are pretty loud, ...                                                                       \\
\hline
\multicolumn{2}{l}{30\% of Cases: Other reasons.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 所以我就开发六学\ysjred{有}雪方面。\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 所以我就开发留学\ysjgreen{有}学方面。\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 所以我就开发留学\ysjblue{游}学方面。\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 所以我就开发留学\ysjred{游}学方面。\end{tabular}                                                        \\
Trans                                       & ..., so I developed the study abroad aspect.                                                                      \\
\hline
\end{tabular}
}
\label{Appendix1}
\end{table}

% As Table \ref{Appendix1} shows, for the false positive cases, 20\% of them are due to the fact that BERT-base model easily rectifies a correct low-frequency collocation into a high-frequency one, whose standard deviation is 0.14. To improve the precision, a possible approach to handle such cases is to encourage the selection of the original input token when the model is generated through the regular term. 25\% of them are unable to distinguish 的(de, followed by noun)/地(di, followed by verb)/得(de, followed by adjective) or 他(ta, he)/她(ta, she)/它(ta, it), whose standard deviation is 0.07, such cases might be handled by Chinese grammar rules. 20\% of them are phonetic or shape errors, whose standard deviation is 0.14, a potential approach is to introduce glyph and pinyin features more effectively to break the limitation of artificial confusion sets. 35\% of them are other reasons, mainly due to the space of CSC task is very large, the erroneous characters in the real scene are likely to be written incorrectly between any two characters, and the mapping rules between them learned during the training are limited.


% As Table \ref{Appendix1} shows, for the false negative cases, 25\% of them are phonetic or shape errors, whose standard deviation is 0.07. 30\% of them are continuous error, whose standard deviation is 0.14, one potential solution is to correct a sentence incrementally through multi-round inference until the model no longer corrects any words, 15\% of them are unable to distinguish 的/地/得 or 他/她/它, whose standard deviation is 0.07, and 30\% of them are other reasons.

\paragraph{Q2: PLOME is right but CoSPA is wrong}

As Table~\ref{Appendix3} shows, for the false positive cases, 5\% involve fixed usages such as idioms, phrases and poems (Std Dev 0.07). A possible approach to handling such cases is to use external knowledge, for example by building a collection of special Chinese usages. 20\% are phonetic or shape errors (Std Dev 0.14). 25\% arise from failing to distinguish 的/地/得 or 他/她/它 (Std Dev 0.07), and 50\% have other causes.

As Table~\ref{Appendix3} shows, for the false negative cases, 15\% are phonetic or shape errors (Std Dev 0.07). 10\% are due to a lack of world knowledge (Std Dev 0); it is still very challenging for existing models to detect and correct this kind of error. 20\% are continuous errors (Std Dev 0.14). 20\% arise from failing to distinguish 的/地/得 or 他/她/它 (Std Dev 0.14), and 35\% have other causes.

\begin{table}[]
\centering
\caption{PLOME is right but CoSPA is wrong}
\scalebox{0.68}{
\setlength{\tabcolsep}{0.6mm}
\linespread{1.5}
\begin{tabular}{c|c}
\hline
\multicolumn{2}{l}{Type 1: PLOME is right, false positive cases of CoSPA}                                                       \\
\hline
\hline
\multicolumn{2}{l}{5\% of Cases: Some fixed usages, such as idioms, phrases, and poems.}                                                       \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 不\ysjred{经}一番寒澈骨，焉得梅花扑鼻香。\end{tabular}                                               \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 不\ysjgreen{经}一番寒澈骨，焉得梅花扑鼻香。\end{tabular}                                               \\
CoSPA                                        & \begin{tabular}[c]{@{}c@{}}..., 不\ysjblue{禁}一番寒澈骨，焉得梅花扑鼻香。\end{tabular}                                               \\
Gold                                     & \begin{tabular}[c]{@{}c@{}}..., 不\ysjred{经}一番寒澈骨，焉得梅花扑鼻香。\end{tabular}                                               \\
Trans                                       & Without a cold to the bone, how can you get the fragrance of plum blossoms?\\
% \hline
% \multicolumn{2}{l}{20\% of Cases: Phonetic or shape errors.}                                                                        \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}我看火车\ysjred{到}的地图，可是我不董，...\end{tabular}                                                        \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}我看火车\ysjgreen{到}的地图，可是我不懂，...\end{tabular}                                                        \\
% CoSPA                                      & \begin{tabular}[c]{@{}c@{}}我看火车\ysjblue{道}的地图，可是我不懂，...\end{tabular}                                                        \\
% Gold                                      & \begin{tabular}[c]{@{}c@{}}我看火车\ysjred{到}的地图，可是我不懂，...\end{tabular}                                                        \\
% Trans                                       & I read the map of the train arriving, but I don't understand, ...                                                                       \\
\hline
\multicolumn{2}{l}{50\% of Cases: Other reasons.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}等了半个小时的公车结果看到一辆\ysjred{２}９７的公车, ...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}等了半个小时的公车结果看到一辆\ysjgreen{２}９７的公车, ...\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}等了半个小时的公车结果看到一辆\ysjblue{１}９７的公车, ...\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}等了半个小时的公车结果看到一辆\ysjred{２}９７的公车, ...\end{tabular}                                                        \\
Trans                                       & After waiting for the bus for half an hour, I saw a 297 bus, ...                                                                       \\
\hline
\hline
\multicolumn{2}{l}{Type 2: PLOME is right, false negative cases of CoSPA}                                                       \\
\hline
% \hline
% \multicolumn{2}{l}{15\% of Cases: Phonetic or shape errors.}                                                       \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}吃了早\ysjred{菜}以后他去上课。\end{tabular}                                               \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}吃了早\ysjgreen{餐}以后她去上课。\end{tabular}                                               \\
% CoSPA                                        & \begin{tabular}[c]{@{}c@{}}吃了早\ysjblue{菜}以后他去上课。\end{tabular}                                               \\
% Gold                                     & \begin{tabular}[c]{@{}c@{}}吃了早\ysjred{餐}以后他去上课。\end{tabular}                                               \\
% Trans                                       & After breakfast he went to class.  \\
\hline
\multicolumn{2}{l}{10\% of Cases: Lack of world knowledge.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 大家会想到生鱼片、天\ysjred{普}罗、寿司之类的东西。\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 大家会想到生鱼片、天\ysjgreen{妇}罗、寿司之类的东西。\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 大家会想到生鱼片、天\ysjblue{普}罗、寿司之类的东西。\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 大家会想到生鱼片、天\ysjred{妇}罗、寿司之类的东西。\end{tabular}                                                        \\
Trans                                       & ..., people will think of sashimi, tempura, sushi and so on.                                                                       \\
% \hline
% \multicolumn{2}{l}{20\% of Cases: Continuous error.}                                                                        \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}..., 在印尼有很多外国人来\ysjred{路性}。\end{tabular}                                                        \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 在印尼有很多外国人来\ysjgreen{旅行}。\end{tabular}                                                        \\
% CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 在印尼有很多外国人来\ysjblue{录行}。\end{tabular}                                                        \\
% Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 在印尼有很多外国人来\ysjred{旅行}。\end{tabular}                                                        \\
% Trans                                       & ..., there are many foreigners traveling in Indonesia.                                                                       \\
\hline
\multicolumn{2}{l}{35\% of Cases: Other reasons.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 可是那\ysjred{的}时候我有一个重要的考试。\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 可是那\ysjgreen{个}时候我有一个重要的考试。\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 可是那\ysjblue{的}时候我有一个重要的考试。\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 可是那\ysjred{个}时候我有一个重要的考试。\end{tabular}                                                        \\
Trans                                       & ..., but I had an important exam at that time.                                                                      \\
\hline
\end{tabular}
}
\label{Appendix3}
\end{table}

% As Table \ref{Appendix3} shows, for the false positive cases, 5\% of them are some fixed usages, such as idioms, phrases, and poems, whose standard deviation is 0.07. A possible approach to handle such cases is utilizing some external knowledge, such as building a collection of special Chinese usages. 20\% of them are phonetic or shape errors, whose standard deviation is 0.14. 25\% of them are unable to distinguish 的/地/得 or 他/她/它, whose standard deviation is 0.07, and 50\% of them are other reasons.


% As Table \ref{Appendix3} shows, for the false negative cases, 15\% of them are phonetic or shape errors, whose standard deviation is 0.07. 10\% of them are due to lack of world knowledge, whose standard deviation is 0, it is still very challenging for the existing models to detect and correct such kind of errors, 20\% of them are continuous error, whose standard deviation is 0.14. 20\% of them are unable to distinguish 的/地/得 or 他/她/它, whose standard deviation is 0.14, and 35\% of them are other reasons. 


\paragraph{Q3: Both PLOME and CoSPA are wrong}
As Table~\ref{Appendix5} shows, among the cases where both models are wrong, 30\% are continuous errors (Std Dev 0.14); 25\% are mislabeled (Std Dev 0.07), for which a potential remedy is to clean the data labels first; 10\% are phonetic or shape errors (Std Dev 0); 15\% involve confusion among 的/地/得 or 他/她/它 (Std Dev 0.07); and 35\% are due to other reasons. Many challenging problems remain for future work.

\section{Conclusion}
This paper proposes CoSPA, a hybrid model for CSC, in which an alterable copy mechanism increases the generation probability of the original input to alleviate the over-correction to which BERT-based models are prone. In addition, to better incorporate the phonetic and shape features of characters, an attention mechanism is introduced to indicate which features of the input character are most relevant to correcting the output character. Experimental results on the SIGHAN2015 datasets show that CoSPA outperforms almost all previous state-of-the-art methods, demonstrating the effectiveness of the proposed method. Three questions that readers may raise about CoSPA and PLOME are further discussed. Future work will address the multiple-error problem by improving robustness to noise.
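
For intuition, a copy mechanism of this kind is commonly written as a mixture of a generation distribution and a copy distribution. The following is only a sketch under that assumption: the symbols $\omega_i$, $p_{\mathrm{gen}}$, and $p_{\mathrm{copy}}$ are illustrative and are not necessarily the notation or the exact form used by CoSPA.
\begin{equation*}
p(y_i \mid X) = (1-\omega_i)\, p_{\mathrm{gen}}(y_i \mid X) + \omega_i\, p_{\mathrm{copy}}(y_i \mid X),
\qquad
p_{\mathrm{copy}}(y_i \mid X) =
\begin{cases}
1, & y_i = x_i,\\
0, & \text{otherwise.}
\end{cases}
\end{equation*}
Here $x_i$ denotes the $i$-th input character and $\omega_i \in [0,1]$ a hypothetical copy weight. A larger $\omega_i$ shifts probability mass toward keeping $x_i$ unchanged, which is the sense in which such a mechanism can counteract over-correction.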

\begin{table}[]
\centering
\caption{Cases where both PLOME and CoSPA are wrong.}
\scalebox{0.68}{
\setlength{\tabcolsep}{0.6mm}
\linespread{1.5}
\begin{tabular}{c|c}
\hline
\multicolumn{2}{l}{30\% of Cases: Continuous error.}                                                       \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}我的好朋友，你好！\ysjred{习惯}你的生活都很好。\end{tabular}                                               \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}我的好朋友，你好！\ysjgreen{希惯}你的生活都很好。\end{tabular}                                               \\
CoSPA                                        & \begin{tabular}[c]{@{}c@{}}我的好朋友，你好！\ysjblue{希惯}你的生活都很好。\end{tabular}                                               \\
Gold                                     & \begin{tabular}[c]{@{}c@{}}我的好朋友，你好！\ysjred{希望}你的生活都很好。\end{tabular}                                               \\
Trans                                       & Hello my good friend! Hope your life is good.  \\
\hline
\multicolumn{2}{l}{25\% of Cases: Labeled wrong.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}..., 第一年上\ysjred{办}还是没事。...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}..., 第一年上\ysjgreen{班}还是没事。...\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}..., 第一年上\ysjblue{班}还是没事。...\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}..., 第一年上\ysjred{半}还是没事。...\end{tabular}                                                        \\
Trans                                       & ..., the first year of work is fine. ...                                                                       \\
% \hline
% \multicolumn{2}{l}{10\% of Cases: Phonetic or shape errors.}                                                                        \\
% \hline
% Input                                        & \begin{tabular}[c]{@{}c@{}}我每天六\ysjred{天}半起床。\end{tabular}                                                        \\
% PLOME                                        & \begin{tabular}[c]{@{}c@{}}我每天六\ysjgreen{天}半起床。\end{tabular}                                                        \\
% CoSPA                                      & \begin{tabular}[c]{@{}c@{}}我每天六\ysjblue{天}半起床。\end{tabular}                                                        \\
% Gold                                      & \begin{tabular}[c]{@{}c@{}}我每天六\ysjred{点}半起床。\end{tabular}                                                        \\
% Trans                                       & I get up at six thirty every day.                                                                       \\
\hline
\multicolumn{2}{l}{35\% of Cases: Other reasons.}                                                                        \\
\hline
Input                                        & \begin{tabular}[c]{@{}c@{}}我们希望\ysjred{李}工厂把这件事如以下处理：...\end{tabular}                                                        \\
PLOME                                        & \begin{tabular}[c]{@{}c@{}}我们希望\ysjgreen{李}工厂把这件事如以下处理：...\end{tabular}                                                        \\
CoSPA                                      & \begin{tabular}[c]{@{}c@{}}我们希望\ysjblue{李}工厂把这件事如以下处理：...\end{tabular}                                                        \\
Gold                                      & \begin{tabular}[c]{@{}c@{}}我们希望\ysjred{您}工厂把这件事如以下处理：...\end{tabular}                                                        \\
Trans                                       & We hope your factory will deal with this matter as follows: ...                                                                      \\
\hline
\end{tabular}
}
\label{Appendix5}
\end{table}
\end{CJK*}

% \subsubsection*{Acknowledgement}
% The research was supported by the National Natural Science Foundation of China (No. 61872011).

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    The research was supported by the National Natural Science Foundation of China (No. 61872011).

    % \emph{All} acknowledgements go in this section.
\end{acknowledgements}


% \pagebreak
\bibliography{uai2022-template}

% \appendix
% % NOTE: necessary when ptmx or no mathfont class option is given
% \providecommand{\upGamma}{\Gamma}
% \providecommand{\uppi}{\pi}
% \section{Appendix}


\end{document}
