\section{Introduction}

In recent years, deep learning \cite{ref_proc1} model have shown significant improvement on natural language processing(NLP), computer vision and speech processing technologies. However, the model performance is limited by the human labeled data quality. The main reason is that the human labeled data has a certain number of noisy data. Previous work \cite{ref_proc2} has propose the simple idea to find the noisy data and correct the noisy data. In this paper, we first review the way we achieve more than 90 score in classification task, then we further illustrate our idea for sequence tagging, object detection, sequence generation, click-through rate (CTR) prediction.

\section{Background}

In previous work \cite{ref_proc2}, we illustrate our idea in these steps:

1. It is a text classification task. We have a human labeled dataset-v1.

2. We train a BERT-based \cite{ref_proc3} deep model upon dataset-v1 and get model-v1.

3. Using model-v1 to predict the classification label for dataset-v1. 

4. If the predicted labels of dataset-v1 do not equal to the human labels of dataset-v1, we think they are the noisy data.

5. We label the noisy data again by human, while given the labels of model and last label by human as reference. Then we get dataset-v2.

6. We loop this re-labeling noisy data steps and get the final dataset. Then we get model-v2. We can further loop this steps to get model-v3.

\section{Same Idea and More Applications}

\begin{figure}
\centering
\includegraphics[width=\textwidth]{fig1.png}
\caption{Re-label method applying to any machine learning application. } \label{fig1}
\end{figure}

\subsection{sequence tagging}
We take named entity recognition(NER) as example for the sequence tagging like tasks. In NER task, we extract several classes of key phrase from a sentence. Follow our idea, we view each class of NER task as a classification task. Then our steps are:

1. It is a NER task. We have a human labeled dataset-v1.

2. We train a BERT-based \cite{ref_proc3} deep model upon dataset-v1 and get model-v1.

3. Using model-v1 to predict the sequence labels of one class for dataset-v1. 

4. If the predicted labels of dataset-v1 do not equal to the human labels of dataset-v1, we think they are the noisy data.

5. We label the noisy data again by human, while given the labels of model and last label by human as reference. Then we get dataset-v2.

6. We loop this re-labeling noisy data steps for all the classes of NER and get the final dataset. Then we get model-v2.

\subsection{object detection}

Object detection is a computer vision technique that allows us to identify and locate objects in an image or video. Follow our idea, we view each kind of bounding box as a classification task. Then our steps are:

1. It is a object detection task. We have a human labeled dataset-v1.

2. We train a Swin Transformer\cite{ref_proc4} upon dataset-v1 and get model-v1.

3. Using model-v1 to predict the bounding boxes of one class for dataset-v1. 

4. If the predicted bounding box of dataset-v1 is far from the human labeled bounding box of dataset-v1, we think they are the noisy data.

5. We label the noisy data again by human, while given the bounding boxes of model and last label by human as reference. Then we get dataset-v2.

6. We loop this re-labeling noisy data steps for all the classes of object detection and get the final dataset. Then we get model-v2.

\subsection{sequence generation}

The key step of our idea is about how to judge the noisy data. For sequence generation, we can use BLEU score or other sequence similarity evaluation method. Then our steps are:

1. We take text generation task as example. We have a human labeled dataset-v1.

2. We train a Encoder-Decoder Transformer\cite{ref_proc5} upon dataset-v1 and get model-v1.

3. Using model-v1 to predict the generated sentences for dataset-v1. 

4. If the BLEU score of generated sentences of dataset-v1 is far from the human labeled generated sentences of dataset-v1, we think they are the noisy data.

5. We label the noisy data again by human, while given the generated sentences of model and last label by human as reference. Then we get dataset-v2.

6. We loop this re-labeling noisy data steps and get the final dataset. Then we get model-v2.

\subsection{click-through rate prediction}

For CTR task, we use the method of \cite{ref_proc6} that automatically set the label again for the noisy data. CTR task is a click-or-not prediction task, we choose a threshold between the predicted score and the 0/1 online label score to judge whether the data is the noisy data. In this way, we could improve the AUC in dev dataset but the online performance should test online. 



\section{Experimental Results}

We do the experiments of text classification and NER to verify our idea. The results is shown in Table 1 and Table 2. We also do a lot of other classification task and NER task of other dataset. The improvement is also significant and we do not list the detail results. 



\begin{table}
\caption{The experiment result of text classification. The model-v1 and model-v2 are corresponding to the previous section}\label{tab1}
\centering
\begin{tabular}{|l|l|l|}
\hline
model & Dev Dataset Accuracy  & Human Evaluation Accuracy \\
\hline
model-v1 & 83.3\% & 88.0\% \\ 
\hline
model-v2 & 91.7\% & 97.2\% \\
\hline
model-v3 & 93.8\% & 97.5\% \\
\hline
\end{tabular}
\end{table}


\begin{table}
\caption{The experiment result of NER. The model-v1 and model-v2 are corresponding to the previous section}\label{tab1}
\centering
\begin{tabular}{|l|l|l|}
\hline
model & Dev Dataset F1  & Human Evaluation Precision \\
\hline
model-v1 & 73.7\% & 86.0\% \\ 
\hline
model-v2 & 88.7\% & 97.0\% \\
\hline
\end{tabular}
\end{table}

\section{Analysis}

Why re-label method work? Because deep learning is statistic-based. Take classification as example. (In a broad sense, all the machine learning tasks can be viewed as classification.) If there are three \textbf{very similar} data (data-1/data-2/data-3) in total, which labels are class-A/class-A/class-B, Then the trained model will probably predict class-A for data-3. We assume that data-3 is wrong labeled to class-B by human, because more people label its similar data to class-A.

The improvement reason is also based on the better and better understanding for the specific task's labeling rule/knowledge of labeling human once by once. Human-in-the-loop here means that in each loop the labeling human leader should learn and summarize the corrected rule/knowledge based on the last loop.


\section{Related Work}

\subsection{Pseudo-label-based method}

The work\cite{ref7} proposes pseudo-label-based method to improve data quality without human re-label, which is different from our method. Our method is to improve the data quality for model of 97\% accuracy/precision/recall/BLEU/AUC by human re-label.

The work\cite{ref8} proposes label-guess-based method to improve data quality without human re-label, which is different from our method. Our method get the guess-label as reference for human re-label.

\subsection{ChatGPT}

The work\cite{ref9} of OpenAI also use model predictions for human as references to label. The work\cite{ref9} use the new human-labeled data from model predictions to train the reward model, which is named as reinforcement learning from human feedback(RLHF).

Our work do not have the reward model of RLHF, the differnce between RLHF and our re-label method is that RLHF focus on using the model predictions as training dataset of reward model, and our re-label method focus on correcting the origin training dataset of policy model. Also in RLHF, the new human-labeled data from model predictions do not conflict to our method, because the new data can be simply merged to the whole dataset of policy model, and then to re-label/re-correct by our method.

Also, the correction for policy by the reward model of RLHF is same to correct all the related data/labels that found by the patterns or substrings of the badcases. The detail pattern-based human-correct method is illustrated at \cite{ref10}.

\section{Conclusion}

We argue that the key point to improve the industry deep learning application performance is to correct the noisy data. We propose a simple method to achieve our idea and show the experimental results to verify our idea. 




