%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


%%% the packages I add
\usepackage{float}
\usepackage[caption=false]{subfig}

%\title{Soldering Defect Detection for New Components in Printed Circuit Boards with Unknown Awareness}

\title{DeepGD3: Unknown-Aware Deep Generative/Discriminative Hybrid Defect Detector for PCB Soldering Inspection}
%Keywords: Unknown Awareness, Defect Detection, Generative/Discriminative Model


% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<machingwen@nycu.edu.tw>?Subject=DeepGD3}{Ching-Wen~Ma}{}}
\author[1]{Yanwei~Liu}
% Add affiliations after the authors
\affil[1]{%
	College of Artificial Intelligence\\
	National Yang Ming Chiao Tung University\\
	Tainan, Taiwan
}
  
\begin{document}
\maketitle

\begin{abstract}
We present a novel approach for detecting soldering defects in Printed Circuit Boards (PCBs) composed mainly of Surface Mount Technology (SMT) components, using advanced computer vision and deep learning techniques. The main challenge addressed is the detection of soldering defects in new components for which only samples of good soldering are available at the model training phase. To address this, we design a system composed of generative and discriminative models that leverages the knowledge gained from the soldering samples of old components to detect the soldering defects of new components. To meet industrial quality standards, we keep the leakage rate (i.e., the missed-detection rate) low by making the system ``unknown-aware'' with a low unknown rate. We evaluated the method on a real-world dataset from an electronics company. Compared with the discriminative and generative approaches, it reduces the leakage rate from 1.827\% $\pm$ 3.063\% and 1.942\% $\pm$ 1.337\%, respectively, to 0.063\% $\pm$ 0.075\%, at an unknown rate of 3.706\% $\pm$ 2.270\%.
%It  also performs remarkably better than the generative approach in the overkill rate when unseen components are encountered.
\end{abstract}

\section{Introduction}
\label{sec:intro}
\label{section: intro}

Deep learning has advanced significantly in academia and industry thanks to abundant data and growing computational power. Industries such as manufacturing, medicine, and transportation can cut costs using neural network predictions. Deep learning-based image classification techniques for identifying defects in printed circuit boards are becoming increasingly prevalent in the electronics manufacturing sector. This success generally relies on advanced machine vision imaging systems and on abundant training examples that closely resemble the testing examples. However, this requirement can limit the use of deep learning in real-world situations where the testing examples may be novel and dissimilar to the training examples.

Let us consider a scenario where the assembly line includes both old and new components. We train the deep learning model on available examples, including the good and defective soldering samples of old components and only good soldering samples of new components. It is then applied directly to detect defective soldering in new components.
This approach is limited because it does not account for the fact that the defective soldering of new components may be dissimilar to the training samples, which can degrade the model's performance. Hence, we should consider advanced techniques such as transfer learning, domain adaptation, and meta-learning to improve performance and adapt the model to detect defective soldering in new components.
Additionally, to meet the manufacturing standard in real-world industrial applications, it is reasonable to make the model unknown-aware and balance the accuracy and unknown rate. The unknown cases can then be further examined at the next station of the assembly line.

\begin{figure*}[ht]
	\centering
	\includegraphics[width=\textwidth]{figures/deepGD3arch.pdf}
	\caption{DeepGD3: The unknown-aware \textbf{deep} \textbf{G}enerative/\textbf{D}iscriminative hybrid \textbf{D}efect \textbf{D}etector. By using the prediction converter $\Lambda(\cdot)$, two heterogeneous predictions $\hat{y}^1_{def}$ and $P(z|y_{com})$ are transformed into two homogeneous predictions $\hat{y}^1_{def}$ and $\hat{y}^2_{def}$, allowing for easy combination to produce the final prediction $\hat{y}_{def}$.}%, where
	\label{fig:DeepGDC}
\end{figure*}

We aim to achieve knowledge transfer and unknown awareness in these situations simultaneously. Prior work has combined two kinds of models, discriminative and generative~\citep{raina2003classification, fujino2005hybrid, bosch2008scene, ouyang2011indoor, kuleshov2017deep, roth2018hybrid, grcic2022densehybrid, loh2022long, cao2022deep}. \citet{raina2003classification} mainly address text categorization tasks. They describe a hybrid model in which a high-dimensional subset of the parameters is trained to maximize the generative likelihood, and another subset of parameters is discriminatively trained to maximize the conditional likelihood. Instead, we seek to use deep neural networks to combine discriminative and generative models for our goals. The combined model exchanges knowledge between these two distinct models, forming a shared embedding $z$. The knowledge exchange process shapes the embedding $z$ in a manner that enables the model to effectively detect new and defective samples that were not encountered during the training phase. Additionally, the inclusion of a generative model enables accurate uncertainty estimation, allowing the model to be aware of and handle unknown cases effectively.
%The knowledge exchange shapes the embedding $z$ in a way to make the model be able to detect new and defective samples, which are not seen in the training phase. 

The proposed deep neural network architecture consists of two branches that share a common feature extractor, as illustrated in Figure~\ref{fig:DeepGDC}. The upper branch serves as the discriminative defect detector, determining whether the input sample $x$ is good or defective, denoted as $\hat{y}^1_{def}$. The lower branch, on the other hand, acts as the generative defect detector, producing the likelihood $P(z|y_{com})$ indicating the probability of the input sample belonging to a specific component type. These two predictions, $\hat{y}^1_{def}$ and $P(z|y_{com})$, are considered heterogeneous predictions.

To ensure homogeneous predictions, we transform the likelihood $P(z|y_{com})$ into the second defectiveness prediction, denoted as $\hat{y}^2_{def}$. Consequently, we obtain two predictions, $\hat{y}^1_{def}$ and $\hat{y}^2_{def}$, both in a homogeneous format. These two homogeneous predictions are then merged using a prediction combiner, resulting in the final prediction $\hat{y}_{def}$. This final prediction can be categorized as ``good,'' ``bad,'' or ``unknown.'' By combining the predictions from both branches, the final prediction becomes more reliable and robust compared to relying solely on one branch.

The task we address here can be seen as a sub-task of zero-shot learning \citep{xian2018zero}. It is similar to compositional zero-shot learning (CZSL) \citep{mancini2022learning} but not the same. The final prediction of our task is the soldering status only, not the composition of soldering status and component types. This setting comes from the fact that we know the component types in advance in real-world applications. An algorithm targeting this setting should perform better than those targeting the setting of CZSL.
Our method exploits this setting by converting and combining two predictions into one, resulting in superior performance.
Unknown awareness also makes our setting more practical and different from CZSL.

In our experiments, the proposed method solves the task described above much better than using only the discriminative or the generative model.
We summarize our contributions as follows:
\begin{itemize}
	\item Introduction of a new task for the electronic assembly line, which involves not only detecting soldering defects in old components but also in new components that visually differ from the old ones, while maintaining a low leakage rate.
	
	\item Addressing the challenge of zero-shot learning, where samples of defective soldering for new components are not available during the training phase.
	
	\item Proposal of a hybrid model that incorporates both discriminative and generative models for detecting soldering defects in both old and new components. This ensures the low leakage rate requirement through knowledge exchange and consideration of unknown-awareness.
	
	\item Proposal of the prediction converter $\Lambda(\cdot)$, which transforms two heterogeneous predictions $\hat{y}^1_{def}$ and $P(z|y_{com})$ into two homogeneous predictions $\hat{y}^1_{def}$ and $\hat{y}^2_{def}$. This enables easy combination to produce the final prediction $\hat{y}_{def}$.
	
	\item Experimental results on a real-world dataset demonstrating the superiority of the proposed method compared to baseline methods that use only discriminative or generative models.
\end{itemize}
We organize this paper as follows. In Section~\ref{section: related work}, we discuss related work on existing PCB soldering defect detection methods, unknown awareness in defect detection, compositional zero-shot learning, hybrid generative/discriminative models, and deep metric learning. In Section~\ref{section: Methodology}, we introduce the proposed model architecture. In Section~\ref{section: experiment and results}, we describe the dataset, evaluation metrics, experiment setup, and experimental results. Finally, in Section~\ref{section:conclusions}, we summarize our work and discuss future work.

\section{Related Work}
\label{section: related work}
\textbf{Model Input and Output: }
In the field of PCB soldering defect detection using deep learning models, there are two common types of inputs to the model: 1) an image of a PCB board that contains multiple electronic components, and 2) a soldering image of a single electronic component. Our focus in this work is on the latter input type; Fig.~\ref{fig:Examples} shows examples of the input images. The output of the model in this work is ``Good,'' ``Bad,'' or ``Unknown.'' While some related works also classify the different types of defects, we concentrate on reducing the leakage and overkill rates without additional effort in categorizing the defect types. The remainder of this section reviews related work.

\begin{figure}[tb]
	\centering
	\includegraphics[height=!,width=1\linewidth, keepaspectratio=true]{figures/Examples.pdf}
	\caption{Examples of the input images. `Good' refers to good soldering. `Bad' refers to defective soldering. `Missing,' `Shift,' `Stand,' `Broken,' and `Short' refer to the defective soldering types.}
	\label{fig:Examples}
\end{figure}

\subsection{Existing PCB Soldering Defect Detection Methods}
\citet{wu2022pcbnet} proposed a lightweight CNN model called PCBNet capable of locating and classifying the type and defect of an electronic component with low computational complexity while maintaining high accuracy. \citet{liao2022solder} proposed the ConvNeXt-YOLOX model for solder joint defect detection with high accuracy and speed. \citet{bhattacharya2022end} combined the merits of both transformer \citep{vaswani2017attention} and convolutional networks. All of them focused on balancing speed and accuracy. None of them addresses the issues encountered with new components.

\citet{ulger2021anomaly} proposed a beta-variational autoencoder (beta-VAE) architecture for anomaly detection in unrestricted domains with no special lighting and without error-free reference boards. Instead, we consider the setting where defective reference examples of old components are available.

\citet{dai2020soldering} used a generic deep learning method for both defect localization and classification tasks. For the classification part, an active learning method reduces the labeling workload when an extensive labeled training database is not easily available. In contrast, our work relies on balancing ``knowledge exchange'' and ``unknown awareness'' to achieve low leakage and overkill rates for new components.

\subsection{Unknown Awareness in Defect Detection}
Predictions with low confidence should be considered as unknown. \citet{cheon2019convolutional} applied unknown detection to wafer defect detection tasks in the semiconductor industry. Their method uses a modified version of K-nearest neighbors (KNN) to determine whether the input belongs to a specific type of defect. When the model cannot determine which type an input belongs to with sufficient confidence, it declares the input unknown. The modified KNN, however, is a non-parametric model, so the training data must be kept in memory.

\citet{habibpour2021uncertainty} applied transfer learning methods and uncertainty quantification (UQ) techniques to the casting defect detection task. They believe an uncertainty-aware automatic defect detection solution will reinforce casting production's quality assurance. However, they did not discuss when to say unknown.

\citet{zhou2021semi} used a variational autoencoder (VAE) and a Gaussian mixture model (GMM) for the fabric defect detection task. They utilized VAE for feature extraction and image reconstruction and GMM for density estimation. They fitted the GMM with normal data only, which means that the GMM can learn the probability distribution of normal data. Therefore, abnormal samples tend to have a lower probability density than normal samples. A threshold can then be determined to distinguish normal and abnormal samples. We also use GMM density estimation in this work. However, we do not set thresholds for normal and abnormal samples. Instead, we set thresholds for defective, non-defective, and unknown samples. Our approach acknowledges the fact that abnormal samples are not necessarily defective samples.

\subsection{Compositional Zero-Shot Learning}
A task similar to ours is compositional zero-shot learning (CZSL), which involves recognizing unseen compositions of objects (components) and states (defectiveness). In particular, CZSL aims to recognize compositions of states and objects (e.g., red apple, where red is the state and apple is the object). Instead, we focus on recognizing the state of an object, where the object is either known in advance or not of interest.

Some CZSL methods~\citep{misra2017red, purushwalkam2019task, li2020symmetry, li2022siamese} train two classifiers for state and object, respectively. This is similar to our proposed hybrid classifier; in our task, however, the goal is to classify the defectiveness of both old and new components with unknown awareness. The task we address here differs from CZSL in the following aspects.

\begin{enumerate} 
	\item {\it We focus on predicting states only} since predicting objects is generally not critical. With this goal in mind, we can convert a component prediction of the generative model to a defectiveness prediction and use it for other purposes, such as unknown awareness.
	\item {\it We allow the model not to make any prediction;} when it does make one, the accuracy must approach 100\%, thus achieving trustable predictions for real-world applications.
	\item {\it The states in our task are only `good,' `bad,' and `unknown.'} This enables us to effectively share and exchange knowledge between the discriminative and the generative models.
\end{enumerate} 

\subsection{Hybrid Generative and Discriminative Models}
Recently, reliable machine learning models have attracted the attention of researchers. A line of research addresses this goal by combining the generative and discriminative models. \citet{grcic2022densehybrid}, \citet{loh2022long}, and \citet{cao2022deep} applied this idea for anomaly detection, uncertainty capturing, and out-of-distribution detection, respectively. Their successes come from the combination of the strength of these two models. The discriminative models often attain higher predictive accuracy, while the generative ones can deliver reliable predictions. 

As shown in Figure~\ref{fig:DeepGDC}, we use GMM models for density (or likelihood) estimation in the generative model. As shown in Figure~\ref{fig:HybridExpert_training}, we use deep metric learning (Section~\ref{subsection: Deep Metric Learning}) to make the embedding $z$ suitable for GMM modeling.

\subsection{Deep Metric Learning}
\label{subsection: Deep Metric Learning}
Deep metric learning is often applied to face recognition, person re-identification, and fine-grained image recognition. It enables the model to pull samples of the same class closer in the embedding space and push samples of different classes apart. Its loss functions fall into two types: proxy-based and pair-based.

Proxy-based loss leverages the concept of prototypes so that samples belonging to the same class aggregate around their respective proxy, while samples of different classes form separate proxies due to their low similarity~\citep{movshovitz2017no, qian2019softtriple}.

Pair-based loss calculates the distances of the paired samples in each mini-batch. 
%When the paired samples belong to the same class (positive sample), the distances between them should be smaller. On the other hand, it should be larger when belonging to a different class (negative sample)~\citep{hadsell2006dimensionality, schroff2015facenet}. 
Pair-based methods require more sampling at the training stage~\citep{hadsell2006dimensionality, schroff2015facenet} and therefore more computational resources than proxy-based methods.

Multi-similarity loss~\citep{wang2019multi} eases the sampling problem of pair-based losses by using hard sample mining. Furthermore, it weights the loss according to the relationships among anchor, positive, and negative samples, which leads to a performance boost.

We use the multi-similarity loss to train the embedding $z$ of our hybrid architecture, making it suitable for GMM modeling. 
% A detailed description of the training and inference method is described in Section~\ref{section: Methodology}.

\section{Methodology}
\label{section: Methodology}

\begin{figure*}[ht]
	\centering
	\includegraphics[width=0.95\textwidth]{figures/HybridExpert_training.pdf}
	\caption{Training procedure of DeepGD3 (Hybrid Expert). We train the defect classifier $\psi$, the component classifier $\phi$, $f_{\theta_1}$ and $f_{\theta_2}$ in stage 1. We then fit the Gaussian mixture models $\gamma$ in stage 2.}
	\label{fig:HybridExpert_training}
\end{figure*}

We train the hybrid defect detector of Figure~\ref{fig:DeepGDC} following the procedure depicted in Figure~\ref{fig:HybridExpert_training}. In stage 1, we alternately train the upper branch for detecting defectiveness and the lower branch for learning the cluster embedding $z_{com}$. In stage 2, we fit GMM models to $z_{com}$ to obtain probabilistic component-type predictions $P(z|y_{com})$. We then convert $P(z|y_{com})$ to the second defectiveness prediction $\hat{y}^2_{def}$. The conversion is controlled by the parameters $\tau_0$, $\tau_1$, and $\tau_2$, which are tuned using Bayesian optimization. Finally, we combine these two defectiveness predictions to make the unknown-aware final prediction $\hat{y}_{def}$.

\subsection{Model Architecture}

We elaborate on the blocks in Figure~\ref{fig:DeepGDC} and Figure~\ref{fig:HybridExpert_training} as follows:
\begin{enumerate}
	\item \textbf{Pre-trained Feature Extractor ${f_{\theta1}(\cdot)}$}. We use the backbone of MobileNetV3 Large pre-trained on ImageNet, with the original classification head removed, as the feature extractor. ${f_{\theta1}(\cdot)}$ maps $x$ to a vector $x' = f_{\theta1}(x) \in \mathcal{R}^{D_{\theta1}}$, with $D_{\theta1} = 960$.
	
	\item \textbf{Shared Encoder $f_{\theta2}(\cdot)$}. $f_{\theta2}(\cdot)$ is composed of a fully connected layer. The input dimension of the layer is 960, and the output dimension is 512. $f_{\theta2}(\cdot)$ maps the extracted features $x'$ to embedding vector $z$ for both the discriminative and the generative models. $z = f_{\theta2}(x')$ $\in \mathcal{R}^{D_{\theta2}},$ $D_{\theta2} = 512$. 
	
	\item \textbf{Discriminative Model $\psi(\cdot)$}. $\psi(\cdot)$ is a single fully connected layer $FC_1$ that predicts the defectiveness of a sample. $\psi(\cdot)$ maps $z$ to defectiveness prediction $\hat{y}_{def}^{1} = \psi(z)$ $\in \mathcal{R}^{D_{\psi}},$ $D_{\psi} = 2.$  
	
	\item \textbf{Generative Model $\gamma(\cdot)$}.
	\label{gmm_descirbe}
	$\gamma(\cdot)$ is a Gaussian mixture model (GMM) that predicts whether the embedding $z$ belongs to any known component type. Its output is the likelihood of a sample being a specific component type, $P(z|y_{com}) \in \{GMM_1(z), GMM_2(z), \ldots, GMM_T(z)\}$. This likelihood is later converted to the second defectiveness prediction $\hat{y}_{def}^2$. There are 23 component types in our dataset, so $T = 23$.
	
	\item \textbf{Prediction Converter $\Lambda(\cdot)$}.
	$\Lambda(\cdot)$ converts the likelihood $P(z|y_{com})$ to the second defectiveness prediction $\hat{y}_{def}^{2} \in \mathcal{R}^{D_{\Lambda}}$, $D_{\Lambda} = 3$. If $P(z|y_{com})$ is at least a threshold $h^1_{\cdot,\cdot}$, the sample is classified as good. If it is smaller than a second, lower threshold $h^2_{\cdot,\cdot}$, the sample is unknown. If it falls between these two thresholds, the sample is bad.
	There is also a parameter $\tau_0$ for adjusting $P(z|y_{com})$.
	%We use Bayesian optimization to optimize these two thresholds and the parameter. 
	A detailed discussion can be found in Section~\ref{PredictionConverter}.
	
	\item \textbf{Prediction Combiner $G(\cdot, \cdot)$}.
	If $\hat{y}^{1}_{def}$ and $\hat{y}^{2}_{def}$ do not match or $\hat{y}^{2}_{def}$ is unknown, we consider the corresponding sample to be unknown. Otherwise, we consider that $\hat{y}^{1}_{def}$, which is equal to  $\hat{y}^{2}_{def}$, is the final prediction $\hat{y}_{def}$.
	
	\item \textbf{Projection Head $\phi(\cdot)$}. $\phi(\cdot)$, in Figure~\ref{fig:HybridExpert_training}, is a fully connected layer $FC_2$ that uses the multi-similarity loss~\citep{wang2019multi} to pull features of good samples with the same component type closer and push features of good samples with different component types apart. During stage 1 of the training process, $\phi(\cdot)$ maps $z$ to $z' = \phi(z) \in \mathcal{R}^{D_{\phi}}$, with $D_{\phi} = 512$. At the end of the training, we discard $\phi(\cdot)$ as \citet{khosla2020supervised, simclr, simclrv2} did in contrastive learning settings.
	
\end{enumerate}
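
Read end to end, the blocks above compose as follows. This is only a compact restatement of Figure~\ref{fig:DeepGDC}; reducing the two-dimensional output of $\psi$ to a label with an $\arg\max$, and reusing the symbol $\hat{y}^{1}_{def}$ for that label, are assumptions made for the sake of the summary.
\begin{equation*}
	\begin{aligned}
		z &= f_{\theta2}\bigl(f_{\theta1}(x)\bigr), \\
		\hat{y}^{1}_{def} &= \arg\max_{c \in \{\text{good},\,\text{bad}\}} \psi(z)_{c}, \\
		\hat{y}^{2}_{def} &= \Lambda\bigl(P(z|y_{com})\bigr), \\
		\hat{y}_{def} &= G\bigl(\hat{y}^{1}_{def}, \hat{y}^{2}_{def}\bigr) \\
		&=
		\begin{cases}
			\hat{y}^{1}_{def}, & \hat{y}^{1}_{def} = \hat{y}^{2}_{def}, \\
			\text{unknown}, & \text{otherwise.}
		\end{cases}
	\end{aligned}
\end{equation*}
Because $\hat{y}^{1}_{def}$ only takes the values good and bad, the case $\hat{y}^{2}_{def} = \text{unknown}$ is covered by the second branch.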

\subsection{Prediction Converter $\Lambda(\cdot)$}
\label{PredictionConverter}

Many methods could convert $P(z|y_{com})$ to $\hat{y}_{def}^{2}$. We introduce one that we found to be efficient and stable. The inputs of our prediction converter $\Lambda$ are the 23 GMM models $GMM(\mu,\Sigma)$, the embedding $z$, an adjusting parameter $\tau_{0}$, and two thresholding parameters $\tau_{1}$ and $\tau_{2}$. These three parameters apply to all GMM models and are optimized by Bayesian optimization.

The parameter $\tau_{0}$ adjusts the covariance matrices of all Gaussians of all GMM models by Equation~\ref{tau0_prob}.
\begin{equation}
	\label{tau0_prob}
	P_{j,n}(z_{i}) = GMM_{j}(z_{i};\mu_{j,n},(\tau_{0})^{2}\cdot\Sigma_{j,n}),
\end{equation}
where $j\in \{1, 2, \ldots, 23\}$ indexes the component types, $n$ indexes the Gaussians in each GMM model, and $i$ is the sample index. Adjusting these covariance matrices lets the Bayesian optimization find the thresholding parameters $\tau_{1}$ and $\tau_{2}$ relatively quickly and results in better prediction performance.

The two thresholds for all Gaussians of all GMM models are $h^{1}_{j,n}$ and $h^{2}_{j,n}$ shown in Equation~\ref{tau1_prob}.
\begin{equation}
	\label{tau1_prob}
	\begin{aligned}
	h^{1}_{j,n} & := P^{1}_{j,n}(\mu_{j,n}) = GMM_{j}(\mu_{j,n};\mu_{j,n},(\tau_{1})^{2}\cdot\Sigma_{j,n}), \\
	h^{2}_{j,n} & := P^{2}_{j,n}(\mu_{j,n}) = GMM_{j}(\mu_{j,n};\mu_{j,n},(\tau_{2})^{2}\cdot\Sigma_{j,n}),
	\end{aligned}
\end{equation}
where $\mu_{j,n}$ indicates the center of a Gaussian. We can therefore adjust all thresholds by adjusting $\tau_1$ and $\tau_2$.

With the adjusted GMM models $P_{j,n}(z_{i})$ and these thresholds, the prediction converter is defined as in Eq.~\ref{predict_rule}.
% defined as Eq.~\ref{predict_rule1}~\ref{predict_rule2}~\ref{predict_rule3}.
\begin{equation}
	\label{predict_rule}
	\begin{aligned}
	\hat{y}^2_{j,n} & :=
	\begin{cases}
		\text{good}, & P_{j,n}(z_{i}) \geq h^{1}_{j,n} \\
		\text{bad}, & h^{1}_{j,n} > P_{j,n}(z_{i}) \geq h^{2}_{j,n} \\
		\text{unknown}, &  h^{2}_{j,n} > P_{j,n}(z_{i})
	\end{cases} \\
%	\end{aligned}
%\end{equation}
%\begin{equation}
%	\label{predict_rule2}
%	\begin{aligned}
	\hat{y}^2_{def,j} & := 
	\begin{cases}
		\ \text{good}, & \text{if one of $\hat{y}^2_{j,n}=$ good},\\
		\ \text{bad}, & \text{else if one of $\hat{y}^2_{j,n}=$ bad}, \\
		\ \text{unknown}, & \text{else.}
	\end{cases} \\
%	\end{aligned}
%\end{equation}
%\begin{equation}
%	\label{predict_rule3}
%	\begin{aligned}
	\hat{y}^{2}_{def} & :=
	\begin{cases}
		\ \text{good}, & \text{if one of $\hat{y}^2_{def,j}=$ good},\\
		\ \text{bad}, & \text{else if one of $\hat{y}^2_{def,j}=$ bad}, \\
		\ \text{unknown}, & \text{else.}
	\end{cases}
	\end{aligned}
\end{equation}
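
To see what these rules mean geometrically, consider a single Gaussian. The following derivation is a sketch under two assumptions that the equations above do not state explicitly: that $P_{j,n}$ and $h^{k}_{j,n}$ are densities of the single component $\mathcal{N}(\mu_{j,n}, \tau^{2}\Sigma_{j,n})$ without mixture weights, and that $\tau_0, \tau_1, \tau_2 > 0$. Let $D$ be the dimension of $z$ and let $d^{2}_{j,n}(z_i) = (z_i - \mu_{j,n})^{\top}\Sigma_{j,n}^{-1}(z_i - \mu_{j,n})$ be the squared Mahalanobis distance. Then
\begin{equation*}
	\begin{aligned}
		P_{j,n}(z_i) &= \frac{\exp\bigl(-d^{2}_{j,n}(z_i) / (2\tau_0^{2})\bigr)}{\tau_0^{D}\,(2\pi)^{D/2}\,|\Sigma_{j,n}|^{1/2}}, \\
		h^{k}_{j,n} &= \frac{1}{\tau_k^{D}\,(2\pi)^{D/2}\,|\Sigma_{j,n}|^{1/2}}, \quad k \in \{1, 2\},
	\end{aligned}
\end{equation*}
so that taking logarithms gives
\begin{equation*}
	P_{j,n}(z_i) \geq h^{k}_{j,n}
	\iff
	d^{2}_{j,n}(z_i) \leq 2 D \tau_0^{2} \ln\frac{\tau_k}{\tau_0}.
\end{equation*}
Under these assumptions, Eq.~\ref{predict_rule} labels $z_i$ as good inside a Mahalanobis ellipsoid around $\mu_{j,n}$, as bad in the shell between the two radii, and as unknown outside. The good region has positive volume only if $\tau_1 > \tau_0$, and $h^{2}_{j,n} \leq h^{1}_{j,n}$ requires $\tau_2 \geq \tau_1$.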

Visual explanations of the effect of $\tau_0$, $\tau_1$, and $\tau_2$ are available in the supplementary material. Bayesian optimization then optimizes $\tau_0$, $\tau_1$, and $\tau_2$ to maximize the harmonic score $H$ in Equation~\ref{eq: harmonicScore}.

\subsection{Training Procedure}

Figure~\ref{fig:HybridExpert_training} shows the training procedure of the hybrid generative/discriminative defect detector in Figure~\ref{fig:DeepGDC}. We use class-balanced sampling in each mini-batch to deal with the data imbalance issue. Early stopping is also applied to prevent overfitting.

Our solution trains the upper branch with the defect classifier $\psi$ and the lower branch with the projection head $\phi$ in stage 1. We use the defect type label $y_{def}$ to train the upper branch and the component type label $y_{com}$ to train the lower branch. Cross-entropy loss $\ell_{def}$ is first computed to update $\psi$, $f_{\theta2}$, and $f_{\theta1}$. Then, the lower branch is trained with multi-similarity loss $\ell_{com}$ to update $\phi$, $f_{\theta2}$, and $f_{\theta1}$, enabling knowledge exchange between both branches.
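
For reference, the two stage-1 objectives can be written out as follows. This is a sketch under assumptions: we take $\ell_{def}$ to be the usual softmax cross-entropy over the two outputs of $\psi$, and $\ell_{com}$ to be the multi-similarity loss in the form given by \citet{wang2019multi}, computed on the projected embeddings $z'_i = \phi(z_i)$ with cosine similarity $S_{ik}$. The batch size $m$, the index sets $\mathcal{P}_i$ and $\mathcal{N}_i$ of mined positives and negatives, and the hyperparameters $\alpha$, $\beta$, and $\lambda$ are notation introduced here, not values reported in this paper.
\begin{equation*}
	\begin{aligned}
		\ell_{def} &= -\frac{1}{m}\sum_{i=1}^{m} \log \frac{\exp\bigl(\psi(z_i)_{y_{def,i}}\bigr)}{\sum_{c}\exp\bigl(\psi(z_i)_{c}\bigr)}, \\
		\ell_{com} &= \frac{1}{m}\sum_{i=1}^{m}\Bigl[\frac{1}{\alpha}\log\Bigl(1 + \sum_{k\in\mathcal{P}_i} e^{-\alpha(S_{ik}-\lambda)}\Bigr) \\
		&\qquad + \frac{1}{\beta}\log\Bigl(1 + \sum_{k\in\mathcal{N}_i} e^{\beta(S_{ik}-\lambda)}\Bigr)\Bigr].
	\end{aligned}
\end{equation*}
Both losses backpropagate through $f_{\theta2}$ and $f_{\theta1}$, which is where the two branches exchange knowledge.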

We evaluate the trained models $\psi$, $\phi$, $f_{\theta2}$, and $f_{\theta1}$ on the validation set after each epoch. The final model is selected based on the lowest model selection loss, defined as $\ell_{\omega} = \ell_{com} + \ell_{def}$. After training, Gaussian mixture models are fitted using $z_{com}$ and $y_{com}$ as shown in stage 2 of Figure~\ref{fig:HybridExpert_training}.

A more detailed description of the training procedure, presented as an algorithm, can be found in the supplementary material.

\subsection{Determine the Thresholds by Bayesian Optimization}
\label{subsection: BO}
We perform Bayesian optimization using the bayes\_opt~\citep{Bayesian} package on the training and validation sets, with expected improvement as the acquisition function. The bounds for $\tau_{0}$, $\tau_{1}$, and $\tau_{2}$ are as follows:
\begin{equation}
	\begin{split}
	\{\tau_{0} | 0 \leq \tau_{0} \leq 1.0\} \\
	\{\tau_{1} | 0 \leq \tau_{1} \leq 1.0\} \\
	\{\tau_{2} | 0 \leq \tau_{2} \leq 1.0\}
	\end{split}
\end{equation}
To improve convergence, we shrink the domain around the current optimum using the domain reduction technique. We run 15 steps of random exploration followed by 25 steps of Bayesian optimization. We use a harmonic score $H$ to balance the overkill, leakage, and unknown rates, and choose the best combination using Equation~\ref{eq: harmonicScore}. The overkill rate is defined as the ratio of good samples mistakenly classified as defective to the total number of test samples. The leakage rate and unknown rate are defined similarly. A higher harmonic score indicates better overall performance. Users have the flexibility to adjust $H$ according to their specific requirements.
\begin{equation}
	\label{eq: harmonicScore}
	\begin{aligned}
		H = {} & \frac{1}{3 \exp(\text{overkill rate})}
		+ \frac{1}{3 \exp(\text{leakage rate})} \\
		& + \frac{1}{3 \exp(\text{unknown rate})}
	\end{aligned}
\end{equation}
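
Written compactly, with $r_{o}$, $r_{l}$, and $r_{u}$ denoting the overkill, leakage, and unknown rates (shorthand introduced here), Equation~\ref{eq: harmonicScore} is the mean of three exponentially discounted rates:
\begin{equation*}
	H = \frac{1}{3}\left(e^{-r_{o}} + e^{-r_{l}} + e^{-r_{u}}\right).
\end{equation*}
If the rates are expressed as fractions in $[0, 1]$, which is an assumption since the unit is not fixed above, then $H$ always lies between $e^{-1} \approx 0.368$ and $1$, with $H = 1$ attained only when all three rates are zero. As a purely illustrative example with made-up rates $r_{o} = 0.01$, $r_{l} = 0.001$, and $r_{u} = 0.04$,
\begin{equation*}
	H \approx \frac{0.9900 + 0.9990 + 0.9608}{3} \approx 0.9833.
\end{equation*}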

Finally, the optimal values of $\tau_{0}$, $\tau_{1}$, and $\tau_{2}$, obtained through Bayesian optimization, are applied to the prediction converter in Section~\ref{PredictionConverter}, making the proposed hybrid defect detector in Figure~\ref{fig:DeepGDC} fully operational. The complete inference algorithm is included in the supplementary material.

\section{Experiments and Results}
\label{section: experiment and results}

The proposed method was tested on a dataset from an electronics manufacturing company. The results of the experiment are presented in this section.

\subsection{Experimental Configuration}
A subset of 388,702 images was selected from the original dataset, taking into account both data imbalance and simulation speed. There were 23 different component types, and the soldering defect types included missing, shift, stand, broken, and short (as shown in Figure~\ref{fig:Examples}). These defect types were consolidated into a single ``bad'' type. Thus, each sample was annotated with two labels: component type and defect type. The characteristics of the resulting dataset are summarized in Table~\ref{table:dataset_in_use}.
\begin{table}[ht]
	\centering
	\caption{Summary for the dataset in use.}
	\label{table:dataset_in_use}
	\begin{tabular}{cc}
		\toprule[1.1pt]
		Number of images     & 388,702                                       \\
		\midrule
		Component types     & 23 types   \\
		\midrule
		Defect types     & good, bad           \\
		\midrule
		Image labels     & 1 component type, 1 defect type     \\
		\bottomrule[1.1pt]
	\end{tabular}
\end{table} 

\textbf{Selection of new and old components:}
We divided the images into two groups: old components and new components. In the training and validation stages, old components have two labels, component type and defect type, represented as $(x, y_{com}, y_{def})$. New components have only one label, component type, represented as $(x, y_{com})$. During the testing stage, both old and new components are used, and the goal is to detect their defectiveness, even though no defective new components were seen in the training and validation stages.

\textbf{Comparison of different approaches/experts:}
We compared three approaches: Expert 1 uses a discriminative model and disables the lower branch of Figure~\ref{fig:DeepGDC} in all training/validation/test stages; Expert 2 uses a generative model and a prediction converter and disables the upper branch of Figure~\ref{fig:DeepGDC} in all training/validation/test stages; and the Hybrid Expert uses both branches.

\begin{figure*}[!tb]
	\begin{center}
		\subfloat[Test data on top of train data with Expert 1. \\ (Discriminative model). \\ Good test samples and bad test samples are mixed.]{\includegraphics[width=0.85\columnwidth]{figures/EXP1_GMM_TSNE_1212_12-05.pdf}}
		\subfloat[Test data on top of train data with Expert 2. \\ (Generative model). \\ Good test samples and bad test samples are mixed.]{\includegraphics[width=0.85\columnwidth]{figures/seed_1212_EXP2_2_12-05.pdf}}
	\end{center} 
	\begin{center}
		\subfloat[Test data on top of train data with Hybrid Expert. (Hybrid model). Good test samples and bad test samples are separable.]{\includegraphics[width=0.75\textwidth]{figures/seed_1212_HBE_2_12-05.pdf}}
	\end{center}
	\caption{2D visualization of training and test set features. Left: (a) Expert 1, Right: (b) Expert 2, Bottom: (c) Hybrid Expert. In the foreground, red circles indicate the good samples of the test set, and green crosses indicate the bad samples of the test set. In the background, each colored dot represents a component cluster from the training set.}
	\label{fig:t-SNE_test_set}
\end{figure*}

We used t-SNE~\citep{van2008visualizing} to produce 2-D visualizations of the feature embedding $z$ for all three experts. We overlaid the test samples on top of the $\{\text{Good}, \text{Missing}, \text{Stand}\}$ training samples to see whether good and bad samples are separable under GMM modeling. If a sample is inside the GMM models of $\{\text{Good}\}$ samples, it is considered a good sample. If a sample is inside the GMM models of $\{\text{Missing}, \text{Stand}\}$ samples, it is considered a bad sample. If a sample is on the boundary of the GMM models of $\{\text{Good}\}$ samples, it may be a bad or unknown sample. If a sample is far from any GMM model, it is considered an unknown sample. Figures~\ref{fig:t-SNE_test_set}(a) and (b) show that, for both Expert 1 and Expert 2, good and bad samples are mixed. In contrast, as Figure~\ref{fig:t-SNE_test_set}(c) shows, the Hybrid Expert successfully pushes bad samples to the boundary of the GMM models.

\subsection{Quantitative Results}

\textbf{Evaluation Metrics:}
Our evaluation uses the overkill, leakage, and unknown rates as metrics because of their direct relevance to assembly-line needs. These rates are expressed as ratios to the total number of test samples, as defined in Section~\ref{subsection: BO}.
% The overkill rate represents the ratio of false negative test samples to the overall test samples. The leakage rate displays the ratio of false positive test samples to the overall test samples. The unknown rate indicates the ratio of test samples predicted as unknown to the overall test samples.
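To make these definitions concrete, the three rates can be written as below. The count symbols are shorthand introduced here for illustration only, and we assume that the leakage rate counts defective samples that are predicted as good:
\begin{equation*}
	\begin{aligned}
		\text{overkill rate} &= \frac{N_{\text{good} \rightarrow \text{bad}}}{N}, \\
		\text{leakage rate} &= \frac{N_{\text{bad} \rightarrow \text{good}}}{N}, \\
		\text{unknown rate} &= \frac{N_{\text{unknown}}}{N},
	\end{aligned}
\end{equation*}
where $N$ is the total number of test samples, $N_{\text{good} \rightarrow \text{bad}}$ is the number of good samples predicted as defective, $N_{\text{bad} \rightarrow \text{good}}$ is the number of defective samples predicted as good, and $N_{\text{unknown}}$ is the number of samples predicted as unknown.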

\textbf{Experiment Results:}
We report results for 1) all test samples, including old and new components, 2) the test samples of old components, and 3) the test samples of new components.

Table~\ref{table:overall_performance} reports the average overkill, leakage, and unknown rates for all test samples. Expert 1 has a high and unstable leakage rate ($1.827\% \pm 3.063\%$). Expert 2 also performs poorly, with overkill and leakage rates both close to $2\%$. Allowing Expert 2 to classify samples as unknown does not improve its performance: as Figure~\ref{fig:t-SNE_test_set}(b) shows, the bad samples are not on the boundary of the GMM models. In contrast, the Hybrid Expert performs best, achieving the lowest leakage rate ($0.063\% \pm 0.075\%$) with a low unknown rate of $3.7\%$.

Table~\ref{table:overall_performance_OLD} reports the average overkill, leakage, and unknown rates for the old-component test samples. Expert 1 performs as expected, with a low leakage rate of $0.021\%$. Expert 2 falls short in comparison, with a leakage rate of $2.257\%$. The Hybrid Expert's leakage rate ($0.019\%$) is slightly lower than that of Expert 1.

Table~\ref{table:overall_performance_NEW} reports the average overkill, leakage, and unknown rates for the new-component test samples. Expert 1 has a high leakage rate ($3.380\%$), and Expert 2 has a high overkill rate ($3.739\%$). The Hybrid Expert demonstrates the best overall performance, with a leakage rate of $0.090\%$.

\begin{table}[ht]
	\centering
	\caption{Comparison of Expert 1, Expert 2, and Hybrid Expert for all test samples.}
	\label{table:overall_performance}
	\scalebox{0.8}{\begin{tabular}{lccc}
			\toprule[1.1pt]
			Method               & Overkill (\%)    & Leakage (\%) & Unknown (\%)\\
			\midrule[1.1pt]
			Expert 1 & 0.015 $\pm$ 0.008    & 1.827 $\pm$ 3.063 & -\\
			\midrule
			Expert 2 & 1.954 $\pm$ 0.724    & 1.942 $\pm$ 1.337 & 0.0 $\pm$ 0.0\\
			\midrule
			\textbf{Hybrid Expert} & 0.108 $\pm$ 0.033  & \textbf{0.063 $\pm$ 0.075} & 3.7 $\pm$ 2.3\\
			\bottomrule[1.1pt]
	\end{tabular}}
\end{table}

\begin{table}[ht]
	\centering
	\caption{Comparison of Expert 1, Expert 2, and Hybrid Expert for old component test samples.}
	\label{table:overall_performance_OLD}
	\scalebox{0.8}{\begin{tabular}{lccc}
			\toprule[1.1pt]
			Method               & Overkill (\%)    & Leakage (\%) & Unknown (\%)\\
			\midrule[1.1pt]
			Expert 1 & 0.017 $\pm$ 0.007    & 0.021 $\pm$ 0.011 & -\\
			\midrule
			Expert 2 & 1.282 $\pm$ 0.192    & 2.257 $\pm$ 1.495 & 0.0 $\pm$ 0.0\\
			\midrule
			Hybrid Expert & 0.129 $\pm$ 0.110    & \textbf{0.019 $\pm$ 0.013} & 3.5 $\pm$ 3.0\\
			\bottomrule[1.1pt]
		\end{tabular}
	}
\end{table}

\begin{table}[ht]
	\centering
	\caption{Comparison of Expert 1, Expert 2, and Hybrid Expert for new component test samples.}
	\label{table:overall_performance_NEW}
	\scalebox{0.8}{\begin{tabular}{lccc}
			\toprule[1.1pt]
			Method               & Overkill (\%)    & Leakage (\%) & Unknown (\%)\\
			\midrule[1.1pt]
			Expert 1 & 0.010 $\pm$ 0.010    & 3.380 $\pm$ 5.540 & -\\
			\midrule
			Expert 2 & 3.739 $\pm$ 2.459    & 0.989 $\pm$ 1.713 & 0.0 $\pm$ 0.0\\
			\midrule
			Hybrid Expert & 0.126 $\pm$ 0.062    & \textbf{0.090 $\pm$ 0.156} & 3.3 $\pm$ 1.6\\
			\bottomrule[1.1pt]
		\end{tabular}
	}
\end{table}

\subsection{Ablation Study}
We also conducted ablation experiments and found that when ``good'' new-component samples were excluded from the training set, the performance of Expert 2 declined. In contrast, the Hybrid Expert maintained its overkill and leakage rates, although with a somewhat higher unknown rate. We also examined the effect of the prediction combiner; these results are available in the supplementary material.

\section{Conclusion}
\label{section:conclusions}
By leveraging the strengths of a discriminative model (Expert 1) and a generative model (Expert 2), the proposed hybrid defect detector (Hybrid Expert) effectively addresses the performance degradation that occurs when the test samples come from new components for which no defective sample is available during model training.

The hybrid architecture enables the shared encoder network to form a better feature embedding $z$. The discriminative model makes the first defect prediction $\hat{y}^1_{def}$. The generative model makes a probabilistic component prediction $P(z|y_{com})$ using Gaussian mixture models, which determines whether a sample belongs to any known component. The prediction converter converts the probabilistic component prediction into the second defect prediction $\hat{y}^2_{def}$. Finally, a prediction combiner combines the first and second defect predictions to make the final defect prediction $\hat{y}_{def}$. Additionally, the proposed architecture offers the option to output ``unknown''.

Compared to Expert 1 and Expert 2, the Hybrid Expert reduces the average leakage rate from 1.827\% $\pm$ 3.063\% and 1.942\% $\pm$ 1.337\%, respectively, to 0.063\% $\pm$ 0.075\%, with an unknown rate of 3.706\% $\pm$ 2.270\%. Our method strikes a balance between the overkill, leakage, and unknown rates. The proposed method significantly improves the performance of the new component defect detection task.

The success of our Hybrid Expert is attributed to three key factors. First, it leverages the knowledge gained from detecting defects in old components to improve the detection of defects in new components. Second, it uses a prediction converter to make full use of the acquired knowledge. Finally, it can output ``unknown'' when the model's confidence in its predictions is low. These factors contribute to the effectiveness of our hybrid generative/discriminative defect detector and provide a new avenue for further research.

The proposed approach has significant practical value for detecting soldering defects in new components, which is crucial for ensuring the quality and reliability of Printed Circuit Board assemblies. Its potential for application in other scenarios motivates us to continue exploring its capabilities.

\textbf{Code and data availability}

Please refer to https://github.com/machingwen/DeepGD3, where an alternative fruit dataset serves as a reliable benchmark for evaluating the proposed model's generalization and robustness.
% along with an alternative fruit dataset for evaluation. The alternative fruit dataset has been thoughtfully chosen to share similar characteristics with the PCB dataset, including multiple object types and similar defects among them. This dataset can serve as a reliable benchmark for evaluating the proposed model's generalization and robustness. Access to the PCB dataset used in the paper must be granted by the PCB manufacturer, and we can provide their contact information upon request.


\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
Thanks to H\&J global chair for supporting this project.
Thanks to Phison Electronics Corporation, Taiwan, for supporting this project and providing the dataset.
Thanks to Acer aiForge for providing computational power.
Thanks to the anonymous reviewers for helping improve the readability of this paper.
\end{acknowledgements}

%\newpage
% References
\bibliography{reference}
\end{document}
