\begin{abstract}  
Despite the success of current transformer-based segmentation models, camouflaged instance segmentation (CIS) remains challenging because the target and the background are highly similar. To overcome this problem, we propose a novel architecture called the local-feature-aware transformer ($\alpha$-Former), inspired by how humans find a camouflaged instance in a photograph. We use traditional computer vision descriptors to simulate how humans find unnatural boundaries, and the information these descriptors extract serves as prior knowledge that enhances the neural network's performance. Because traditional descriptors are not learnable, we design a learnable binary filter that simulates them. To aggregate the information from the backbone and the binary filter, we introduce an adapter that merges local features into the transformer framework. Additionally, we introduce an edge-aware feature fusion module to improve the boundaries produced by the segmentation model. With the proposed transformer-based encoder-decoder architecture, $\alpha$-Former surpasses state-of-the-art performance on the COD10K and NC4K datasets.
\end{abstract}


%%%%%%%%% BODY TEXT
\section{Introduction}
\label{sec:intro}

\noindent Camouflaged instance segmentation (CIS) benefits computer vision applications such as medical image segmentation and agriculture (\cite{fan2020camouflaged}).
However, this task is challenging compared to traditional object detection and segmentation since camouflaged objects can effectively blend in with the background, making it difficult for models to detect and annotate them accurately.

Recently, transformers have achieved outstanding performance in applications such as detection (\cite{carion2020end}), classification (\cite{chen2021crossvit}), and segmentation (\cite{strudel2021segmenter}). However, transformer models usually need a large-scale dataset for training.
Thanks to large-scale datasets and benchmarks for camouflaged object detection, including NC4K (\cite{lv2021simultaneously}), COD10K (\cite{fan2020camouflaged}), CAMO (\cite{le2019anabranch}), and CAMO++ (\cite{le2021camouflaged}), researchers can now apply transformers to CIS.
As a result, transformers have achieved state-of-the-art performance in this field (\cite{pei2022osformer}).

\begin{figure}[t]
    \centering
    \includegraphics[width=0.9\linewidth]{Figs/Fig_1_cropped.pdf}

    \caption{The $\alpha$-Former was motivated by the need to improve the performance of the camouflaged instance segmentation model. The model generates a local feature that provides precise boundary information about the target object. The input image is displayed in the top left, the prediction result without the local feature is shown in the top right, the generated local feature is displayed in the bottom left, and the prediction result with the local feature is shown in the bottom right. Incorporating the local feature into the model results in a more accurate segmentation of the target object.}
    \label{fig:1}
\end{figure}

Despite their effectiveness, current transformer models have limitations in dealing with CIS. As shown in Fig.~\ref{fig:1}, these models tend to predict multiple objects for a single target when the edge is unclear. This is because the models primarily focus on finding the target object and neglect the accurate identification of its boundary. To improve CIS performance, models need to understand the object's location better and enhance the features around the instance's boundary.


Our approach is inspired by how humans detect hidden objects in a photograph. Humans do not search for the concealed instance directly, because the object blends seamlessly into its surroundings and is hard to pinpoint. Instead, they compare local features with adjacent pixels, and recognizing an unnatural boundary raises their confidence that a concealed object is present (\cite{troscianko2009camouflage}). The question is then how to give a neural network prior knowledge about unnatural boundaries. This is where traditional descriptors come into play: their fundamental idea is to compare a pixel with its neighboring pixels. The lower-left image in Fig.~\ref{fig:1} shows the outcome of applying a traditional descriptor to a camouflaged instance, revealing that such descriptors can identify unnatural boundaries in the image. This information can then be employed as prior knowledge to enhance the neural network's performance.

To improve the boundary features of our model, we integrate traditional descriptors such as LBP (\cite{ojala1994performance}) into the transformer framework. LBP is especially sensitive to edges, which is advantageous for CIS because the foreground and the background are highly similar. As depicted in Fig.~\ref{fig:1}, LBP can accurately demarcate the boundary of the target object even when its texture and color closely resemble those of the background, which allows the model to produce a more accurate segmentation. By combining LBP with the transformer, we develop an effective framework for identifying target objects and delineating precise boundaries. We call this framework the local-feature-aware transformer, or $\alpha$-Former (pronounced ``alpha-former''). Inspired by LBP, we create a learnable module called the binary filter (BF), which compares pixel values within a local field and generates a local feature. The binary filter consists of a fixed-weight convolution layer called BCNN, which extracts features similar to those of LBP, and a learnable module.


The fixed convolution layer is able to generate local features by comparing different pairs of pixels, while the learnable module can collect and consolidate this comparison information. To effectively integrate the features extracted by the binary filter, we have developed a learnable module known as the feature aggregation adapter (FAA). The FAA can provide the local features to the backbone of the model without interfering with its performance, even if there are differences in the input distribution. Moreover, our FAA module is highly parameter-efficient and easy to train. Additionally, we have designed an edge-aware module that can accurately predict boundaries for CIS. This module includes a multi-level convolution layer that offers a wide receptive field, as well as a fixed-weight convolution layer that extracts local features. To utilize the ground truth edge as supervision, we employ a $1\times 1$ convolution layer to generate edge predictions. These edge predictions are then incorporated into the final prediction head to improve overall performance of the model.


Our model combines the binary filter (BF), the feature aggregation adapter (FAA), and the edge-aware fusion module to achieve superior performance on two popular datasets, NC4K and COD10K. Specifically, our architecture improves average precision (AP) over the previous state of the art by 1.5 points on COD10K and 0.4 points on NC4K (Table~\ref{table-1}). We also conduct comprehensive ablation studies to demonstrate the effectiveness of the proposed BF, FAA, and edge-aware fusion module, and we provide extensive qualitative results.

To summarize, our contributions are:
\begin{itemize}

\item Inspired by how humans find a camouflaged instance in a photograph, we use traditional descriptors to simulate how humans find unnatural boundaries. Moreover, because traditional descriptors are not learnable, we propose a learnable module that extracts features similar to those of the traditional descriptors.

\item We propose $\alpha$-Former, which is the first to provide local binary information to a camouflaged instance segmentation model. We also provide edge supervision to our model to improve the final mask boundary.

\item We achieve state-of-the-art camouflaged instance segmentation results on two different benchmarks. Experiments and ablation studies demonstrate the effectiveness of our proposed modules and architecture.
\end{itemize}

\begin{figure}[t]
    \centering
    \includegraphics[width=1.0\linewidth]{Figs/Fig_2_cropped.pdf}
    \caption{Examples of our BCNN layer. Left: a sample $3\times 3$ BCNN kernel. Center: the input image. Right: the output of the BCNN layer. Our BCNN is a fixed-weight binary convolution layer that provides comparison information for neighboring pixel pairs. Our results show that BCNN can provide a precise boundary for a given image.}
    \label{fig:2}
\end{figure}

\section{Related Work}
\label{sec:relate_work}
\noindent\textbf{Camouflaged Object Detection.}
Camouflaged object detection aims to find objects hidden in the background of an image and is more difficult than traditional object detection. Earlier works mainly focus on low-level features such as color (\cite{huerta2007improving}) and texture (\cite{song2010new}). As deep learning advances, an increasing number of studies employ neural networks to address the problem. These methods (including \cite{zhu2021inferring, mei2021camouflaged}) mostly use a CNN backbone for high-level feature extraction to detect and predict camouflaged objects. \cite{zhai2021mutual} proposed MGL, which is the first to use a mutual graph to detect and predict the final results. \cite{yang2021uncertainty} proposed UGTR, which mimics the human detection process by adding an uncertainty prediction for camouflaged object detection. \cite{pei2022osformer} proposed OSFormer, which uses a one-stage transformer architecture to obtain the final results. \cite{mei2021camouflaged} introduced PFNet, which is the first to add positioning and focus modules that mimic the human detection process to find the target object.

\textbf{Integrating Traditional Descriptors with CNNs.}
There is a long history of using traditional descriptors to improve the performance of CNNs. For example, some works (\cite{karanwal2021neighborhood, karanwal2021od}) use LBP (\cite{ojala1994performance}) to improve face recognition performance. HOG (\cite{dalal2005histograms}) has also been used to improve human detection (\cite{surasak2018histogram}) and action recognition (\cite{patel2020histogram}). More recently, researchers have combined SIFT (\cite{lowe1999object}) with convolutional networks (including \cite{gupta2019improved, hossein2021image,kovavc2022finger}) to extract better features for different applications. Although many works obtain performance improvements by integrating traditional descriptors into deep learning architectures, little effort has been made to apply them to camouflaged object detection. We therefore use a descriptor inspired by traditional descriptors to enhance the effectiveness of camouflaged object detection.\\
\textbf{Binary Filter.}
Traditional descriptors inspire the idea of using binary filters for convolution, and many works already achieve good performance on many datasets with binary filters. For example, BinaryConnect (\cite{courbariaux2015binaryconnect}) trains a neural network that uses only binary weights during propagation, approximating the real-valued weights with binary values. Building on BinaryConnect, \cite{courbariaux2016binarized} proposed BinaryNet, in which both the activations and the weights are constrained to $-1$ or $+1$. LBCNN (\cite{juefei2017local}) replaces the original convolution with a fixed-weight binary convolution and achieves good performance on classification tasks. These works show the feasibility of using binary filters to extract features and train neural networks.\\
\textbf{Adapter.}
The adapter was first proposed for NLP tasks (\cite{houlsby2019parameter}), where it transfers a pre-trained NLP model to different downstream tasks without introducing many new parameters. Because of this efficiency, more and more researchers have recently added adapters to computer vision models (including \cite{long2019multi, sung2022vl}). Adapters are also efficient for domain transfer, and many works (including \cite{ansell2021mad, ke2021achieving}) concentrate on this use. In our work, the input domain changes after the binary filter is applied. Hence, we use an adapter to help the pre-trained backbone extract features.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=1.0\linewidth]{Figs/architecture_crop.pdf}
    \caption{$\alpha$-Former comprises a feature extractor, an encoder-decoder, an edge-aware fusion module, and a prediction head. It takes a single RGB image as input and outputs the camouflaged object mask for that image.}
    \label{fig:3}
\end{figure}

\begin{figure*}[htbp]
    \centering
    \includegraphics[width=0.95\linewidth]{Figs/feature_extractor_cropped.pdf}
    \caption{Our feature extractor contains a preprocessing module, several binary filters, a feature aggregation adapter, and a pre-trained backbone. The binary filters extract local features of the input image. We then concatenate the original image with the local features and use our feature aggregation adapter to transfer the new image domain to the input image domain. After the feature aggregation adapter, we use a pre-trained CNN to extract high-level and low-level features.}
    \label{fig:4}
\end{figure*}



\section{Binary Filter}
\label{sec:3}
\subsection{Why use a binary filter}

\noindent We have observed that traditional camouflage segmentation models struggle to accurately determine object boundaries in ambiguous cases. For example, when presented with an image of a pipefish, as shown in Fig.~\ref{fig:1}, a standard model may predict multiple objects instead of correctly identifying the single target object. Inspired by how humans find camouflaged instances, we turn to traditional descriptors, which can provide prior knowledge to a neural network and enhance its ability. However, traditional descriptors such as LBP are not learnable, which makes it difficult for them to adapt to new input data.

To address this issue, we sought to design an architecture that can detect local binary features similar to those captured by traditional descriptors but is also learnable. The LBP descriptor compares the center pixel value with the surrounding pixel values, so we were inspired to create a binary filter using a fixed binary weight convolution (BCNN) to simulate this process.
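For reference, the basic $3\times 3$ LBP operator that motivates this design is commonly written as follows; the symbols $g_c$, $g_p$, and $s(\cdot)$ are notation introduced here for illustration and are not used elsewhere in this paper:
\[
\mathrm{LBP}(x_c) = \sum_{p=0}^{7} s(g_p - g_c)\, 2^{p}, \qquad
s(z) = \left\{ \begin{array}{ll} 1, & z \ge 0, \\ 0, & z < 0, \end{array} \right.
\]
where $g_c$ is the intensity of the center pixel $x_c$ and $g_0,\dots,g_7$ are the intensities of its eight neighbors. Each term is a single pairwise comparison, which is the operation a fixed-weight binary kernel can reproduce.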


\subsection{Architecture of the binary filter}

\noindent In this section, we describe the architecture of the proposed binary filter, which enables comparison operations that are difficult to perform with traditional convolution layers. As illustrated in Fig.~\ref{fig:2}, we can simulate the comparison operation by designing a convolution kernel in which the center value is $-1$, the left value is $1$, and all other values are $0$. After applying this convolution, we compare the output with $0$. If the output is greater than $0$, the left pixel is greater than the center pixel; otherwise, the left pixel is not greater than the center pixel. Our binary convolution layer with fixed binary weights (BCNN) can extract a precise boundary for the target object, as demonstrated in Fig.~\ref{fig:2}. To increase the robustness of BCNN, we use multiple binary convolution kernels in each BCNN layer, and for each kernel we randomly select its values from $\{-1,0,1\}$. Because the BCNN itself is not trainable, we make the binary filter trainable by adding a $1\times 1$ convolution layer after each BCNN layer to gather information. This trainable $1\times 1$ convolution layer is lightweight and easy to train compared with a traditional CNN architecture.
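The comparison described above can be sketched as a worked equation. This is a minimal illustration under assumed notation ($X$ for the preprocessed input, $K_k$ for the $k$-th fixed kernel, $w_{jk}$ and $b_j$ for the trainable $1\times 1$ convolution), and it assumes that the comparison with $0$ is realized as a hard threshold; it is not a definitive specification of the layer. Using the cross-correlation convention common in deep learning, with $u$ indexing rows and $v$ indexing columns, the left-versus-center kernel gives
\[
K = \left( \begin{array}{rrr} 0 & 0 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & 0 \end{array} \right), \qquad
(K * X)(u,v) = X(u,v-1) - X(u,v),
\]
so the sign of the response indicates whether the left pixel exceeds the center pixel. With $m$ fixed kernels $K_1,\dots,K_m$, the output of the binary filter at channel $j$ would then be
\[
B_k = \mathbf{1}\left[ K_k * X > 0 \right], \qquad
F_j = \sum_{k=1}^{m} w_{jk} B_k + b_j ,
\]
where only $w_{jk}$ and $b_j$ are learned.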

\section{Methods}
\noindent\textbf{Architecture.}
Our proposed $\alpha$-Former has five crucial modules: (1) a feature extractor comprising a binary filter that extracts features similar to LBP (\cite{ojala1994performance}), an adapter that transfers the input domain, and a backbone that extracts object features; (2) a transformer encoder that uses global and local features to generate object embeddings; (3) an edge-aware feature fusion module that generates precise boundaries; (4) a transformer decoder that extracts information from the embeddings; and (5) a mask prediction head that predicts the final instance masks. The whole architecture is shown in Fig.~\ref{fig:3}.


\begin{figure*}[htb]
    \centering
    \includegraphics[width=0.95\linewidth]{Figs/encoder_decoder_cropped.pdf}
    \caption{Our encoder contains a position encoding module, a self-attention module, and a CNN architecture. The encoder's input is the backbone features extracted from the third to fifth layers. We add a position embedding to the input features and use a self-attention module to get their local features. We then apply an add \& norm operation followed by a CNN architecture to get the final output of the encoder. Next, we restore and grid the output of the encoder into a location-aware query and input the query to the decoder. In the decoder, we use a cross-attention module to extract information, followed by the same CNN architecture as in the encoder.}
    \label{fig:5}
\end{figure*}
\subsection{Feature Extractor}


\noindent Our feature extractor consists of three parts: a learnable local binary filter (BF), a feature aggregation adapter (FAA), and a pre-trained CNN backbone. These components are shown in Fig.~\ref{fig:4}.

\subsubsection{Binary Filter (BF)}
The purpose of the binary filter is to extract local features. Given an input image $I\in \mathbb{R}^{H\times W\times 3}$, we first use a convolution layer to preprocess the image. After the preprocessing, we use a pre-defined binary filter to extract local binary features. The details of the binary filter are discussed in Sec.~\ref{sec:3}. We use multiple binary filters in every experiment to extract the local binary information. The BF module outputs a feature $F\in \mathbb{R}^{H\times W\times C}$, where $C$ is the number of channels of the final $1\times 1$ convolution. We then concatenate the feature $F$ with the original image $I$.
\subsubsection{Feature Aggregation Adapter (FAA)} 
After the BF module, the number of channels of the concatenated image differs from that of the images used to train the backbone, which makes it impractical to use the pre-trained backbone directly. To use the pre-trained backbone, we need to transfer the domain of the concatenated image to the domain of the original images. We therefore introduce a feature aggregation adapter that aligns the new image domain with the original image domain. The adapter consists of a $1\times 1$ convolution and a skip connection, as shown in Fig.~\ref{fig:4}, and outputs an image of shape $H\times W\times 3$, the same as the original image. The skip connection is needed because, at the beginning of training, it is challenging to initialize the parameters of the $1\times 1$ convolution so that the output domain matches the domain of the original image. So as not to degrade the backbone at the beginning of training, we initialize the $1\times 1$ convolution layer with very small values. For the skip connection, we directly add the first three channels, i.e., the original image, to the output. This ensures that the input of the backbone is almost the same as the original image at the beginning of training. As training proceeds, the model gradually learns to use the local binary features.
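A compact way to write the adapter, under the assumption that the $1\times 1$ convolution acts on the full concatenation $[I; F]$ and using $\hat{I}$ and $W_a$ as illustrative symbols that are not defined elsewhere in this paper, is
\[
\hat{I} = I + \mathrm{Conv}^{1}_{W_a}\left([I; F]\right), \qquad \hat{I}\in\mathbb{R}^{H\times W\times 3},
\]
where $[I;F]\in\mathbb{R}^{H\times W\times (3+C)}$ is the channel-wise concatenation. If $W_a \approx 0$ at initialization, then $\hat{I}\approx I$, so the pre-trained backbone initially sees nearly the original image.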

\subsubsection{CNN Backbone}
We use a pre-trained backbone in our experiments. To provide both high-level and low-level features to the prediction module, we use multi-scale features derived from the backbone. In most of our experiments, we use the features of the last four layers, denoted $F_2$--$F_5$ in the following. Because the backbone's input contains more local information than the original image, the extracted features carry extra information compared with feeding the original image directly to the backbone.



\subsection{Encoder-Decoder}
\noindent To speed up training and reduce the computation cost, we combine the transformer and CNN in our encoder, as shown in Fig.~\ref{fig:5}. We input the multi-scale features $F_3$--$F_5$ to the encoder to generate more informative features. Inspired by DETR (\cite{carion2020end}), which adds a position embedding to the input feature, we first calculate the position embedding of the input features and add it to the original features $F_3$--$F_5$ to obtain the updated features $F_3^{(1)}$--$F_5^{(1)}$. We then input these features to a self-attention module, which captures the local information and produces $F_3^{(2)}$--$F_5^{(2)}$. After the self-attention module, we use a CNN module to accelerate training. Specifically, we add the features $F_3^{(1)}$--$F_5^{(1)}$ and $F_3^{(2)}$--$F_5^{(2)}$, apply layer normalization to the sum, and pass the result to a $3\times 3$ convolution layer, followed by group normalization and a GELU activation. Following the GELU activation, we apply another $3\times 3$ convolution layer. After this convolution layer, we restore the outputs to the multi-scale features $T_3$--$T_5$. Finally, we flatten $T_3$--$T_5$ into a sequence and input it to the decoder. \\
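The encoder steps above can be summarized as follows for $i\in\{3,4,5\}$. This is a sketch rather than a definitive specification, with $P_i$ (position embedding), $\mathrm{Att}$ (self-attention), LN (layer normalization), GN (group normalization), and $\mathrm{Conv}^3$ ($3\times 3$ convolution) introduced as shorthand, and with the flattening and restoring of the spatial layout left implicit:
\[
\begin{array}{rcl}
F_i^{(1)} &=& F_i + P_i, \\
F_i^{(2)} &=& \mathrm{Att}\left(F_i^{(1)}\right), \\
T_i &=& \mathrm{Conv}^3\left(\mathrm{GELU}\left(\mathrm{GN}\left(\mathrm{Conv}^3\left(\mathrm{LN}\left(F_i^{(1)} + F_i^{(2)}\right)\right)\right)\right)\right).
\end{array}
\]
\par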
% The process of the encoder can be written as 
% \begin{equation}
% \begin{aligned}
%     F_i^{(2)}&= {\rm LN}((F_i+P_i)+{\rm Att}(F_i+P_i))\\
%     T_i &= {\rm Conv^3}({\rm GELU}({\rm GN}({\rm Conv^3}(F_i^{(2)}))))
% \end{aligned}
% % \nonumber
% \end{equation}
% where $F_i$ is the input feature, $P_i$ is the position embedding, Att is the self-attention module, LN is layer normalization, $\rm Conv^3$ is $3\times 3$ convolution, GELU is GELU activation, GN is group normalization. 
% \subsection{Decoder}
\noindent The decoder has a structure similar to that of the encoder and also combines the transformer and convolution. For the input sequence, we follow the same operation as in the encoder and first calculate the location embedding of the input features. After that, we grid the input sequence to the shape $S\times S\times D$ and flatten it to the query shape $L\times D$, which produces a location-aware query that provides the location information for every token. We then input the encoder feature and the location-aware query to a cross-attention layer, using the encoder feature as the key and value and the location-aware query as the query. After the cross-attention layer, we use the same normalization layer and convolution structure as in the encoder to produce the decoder embedding.
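Assuming that the cross-attention layer follows the standard scaled dot-product form (the exact variant is not specified here), and writing $Q\in\mathbb{R}^{L\times D}$ for the location-aware query and $X_e$ for the flattened encoder feature, both illustrative symbols, the layer can be sketched as
\[
\mathrm{CrossAtt}(Q, X_e) = \mathrm{softmax}\left( \frac{(Q W_Q)(X_e W_K)^{\top}}{\sqrt{d}} \right) X_e W_V ,
\]
where $W_Q$, $W_K$, and $W_V$ are learned projections and $d$ is the projected key dimension. Because the query is obtained by flattening an $S\times S$ grid, $L = S^2$, so each output row is tied to one grid cell.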


\subsection{Edge-aware feature fusion module (EAF)}

\noindent To improve the performance of boundary prediction, we added a module called edge-aware feature fusion. This module uses the ground truth edge as a guide to combine two types of features: high-level features extracted from the backbone network (called $F_2$) and low-level features extracted from the encoder (called $T_3$ to $T_5$).

The edge-aware feature fusion module processes the low-level features from $T_5$ down to $T_3$. Starting from $T_5$, a convolution layer first extracts information, a binary convolution (BCNN) layer then captures local binary features, and a $1\times 1$ convolution layer finally predicts the edge map $E_5$ from these features.

Next, we up-sample the binary features so that their size matches that of $T_4$ and concatenate the two, generating a new input feature ($I_4$). We repeat this process until we reach $F_2$.
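One way to write a single edge prediction block is sketched below. The symbols $B_l$ and $\mathrm{Up}$, the choice to start with $I_5 = T_5$, and the exact form of the convolution $\mathrm{Conv}$ are assumptions made for illustration:
\[
\begin{array}{rcl}
B_l &=& \mathrm{BCNN}\left(\mathrm{Conv}(I_l)\right), \\
E_l &=& \mathrm{Conv}^{1}(B_l), \\
I_{l-1} &=& \left[\mathrm{Up}(B_l);\, T_{l-1}\right],
\end{array}
\]
for $l = 5, 4, 3$, with $T_2$ replaced by the backbone feature $F_2$ at the last step.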

Employing the edge-aware feature fusion module enables the model to better recognize object boundaries, leading to more precise segmentation masks and avoiding the issue of predicting one object as multiple objects. The output of the final block, $O_2$, is forwarded to the mask prediction head.
\subsection{Mask prediction head}
We follow the same structure as OSFormer (\cite{pei2022osformer}). For more details, please see the supplementary materials.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.8\linewidth]{Figs/edge_fusion_cropped.pdf}
    \caption{Our edge-aware feature fusion module uses a pyramid structure. Its main component is an edge prediction block. Given the input feature, we use a multi-size convolution followed by a BCNN layer to extract features. We then up-sample the result so that its size matches that of the next input feature. We employ a $1\times 1$ convolution layer to predict the edge and use the ground truth edge as supervision.}
    \label{fig:6}
    %\vspace{-15mm}
\end{figure}
\subsection{Loss function}
\noindent Our loss function is composed of three parts: edge loss, location loss, and mask loss. We use dice loss for the edge loss and the location loss, and focal loss for the mask loss. Hence, our final loss function can be written as
\[
L = \lambda_{edge}L_{edge} + \lambda_{location}L_{location} + \lambda_{mask}L_{mask}.
\]
In our experiments, $\lambda_{edge}$ and $\lambda_{location}$ are set to 1, while $\lambda_{mask}$ is set to 3 to balance the different losses.
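For reference, commonly used forms of the two component losses are given below; the exact variants and the hyperparameters $\alpha$, $\gamma$, and $\epsilon$ are not specified here, so these should be read as assumptions. For a predicted map $p$ and a binary target $g$ over pixels $i$,
\[
L_{dice} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i^2 + \sum_i g_i^2 + \epsilon}, \qquad
L_{focal} = -\sum_i \alpha_{t,i}\,(1 - p_{t,i})^{\gamma} \log p_{t,i},
\]
where $p_{t,i} = p_i$ if $g_i = 1$ and $p_{t,i} = 1 - p_i$ otherwise, and $\alpha_{t,i}$ is defined analogously from $\alpha$.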
\section{Experiments}
\subsection{Experimental setup}
\noindent \textbf{Datasets.}
We use two benchmark datasets in our experiments: NC4K (\cite{lv2021simultaneously}) and COD10K (\cite{fan2020camouflaged}). The COD10K dataset includes 3040 training images with instance-level annotations and 2026 testing images. The NC4K dataset contains 4121 images with instance-level labels. We train our model on the COD10K training set and test it on the COD10K testing set and the NC4K dataset. To provide more training samples for the model, we resize the input images to multiple sizes, ensuring that the shorter side measures between 480 and 800 pixels while the longer side stays under 1333 pixels after resizing. \\
\textbf{Evaluation metrics.}
We use COCO-style evaluation metrics in our experiments, including $AP$, $AP_{50}$, and $AP_{75}$, with one slight difference. The original COCO evaluation uses mAP, which averages the AP over all categories. However, our camouflaged datasets are class-agnostic. Hence, we only need to calculate the AP over the whole dataset, ignoring the category. \\
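Concretely, the COCO-style $AP$ averages the precision over ten IoU thresholds, while $AP_{50}$ and $AP_{75}$ fix the threshold at $0.50$ and $0.75$. Writing $AP_{\tau}$ for the average precision at IoU threshold $\tau$ (shorthand used only here),
\[
AP = \frac{1}{10} \sum_{\tau \in \{0.50,\, 0.55,\, \dots,\, 0.95\}} AP_{\tau}.
\]
In the class-agnostic setting, each $AP_{\tau}$ is computed once over all instances rather than averaged over categories. \\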
\textbf{Implementation details.}
We implement $\alpha$-Former in PyTorch and train it on a single NVIDIA V100-SXM2 GPU. We use ResNet-50 (\cite{he2016deep}) pre-trained on ImageNet (\cite{deng2009imagenet}) as the backbone. We train the model for 90K iterations with a batch size of 2, using the SGD optimizer with an initial learning rate of $2.5\times 10^{-4}$, which is multiplied by $0.1$ at 60K and 80K iterations. The weight decay is $1\times 10^{-4}$.
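The step schedule above corresponds to the following learning rate at iteration $t$ (the symbol $\eta(t)$ is introduced here for illustration):
\[
\eta(t) = \left\{ \begin{array}{ll}
2.5\times 10^{-4}, & 0 \le t < 60\mathrm{K}, \\
2.5\times 10^{-5}, & 60\mathrm{K} \le t < 80\mathrm{K}, \\
2.5\times 10^{-6}, & 80\mathrm{K} \le t \le 90\mathrm{K}.
\end{array} \right.
\]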
\subsection{Comparison with the State of the Art}
\noindent We conduct experiments to compare our model with current state-of-the-art models. Because there are few camouflaged instance segmentation models, we also include several generic instance segmentation models, which we train and test on the camouflaged datasets. For a fair comparison, all models use a pre-trained ResNet-50 backbone. The results are shown in Table~\ref{table-1}.

\begin{table}

  \scriptsize
  \caption{Quantitative results of $\alpha$-Former and competing methods. The best results are highlighted in \textbf{bold}.}
  \centering
\resizebox{0.47\textwidth}{!}{
\begin{tabular}{c | c c c | c c c}
    \toprule
      \multirow{2}*{method} & \multicolumn{3}{c|}{COD10K} & \multicolumn{3}{c}{NC4K} \\
     ~ & AP & AP50 & AP75 & AP& AP50 & AP75\\
    \midrule
   Mask-RCNN (\cite{he2017mask}) & 25.0 & 55.5 & 20.4 & 27.7 & 58.6 & 22.7\\
   MS-RCNN (\cite{huang2019mask}) & 30.1 & 57.5 & 25.7 & 36.1 & 68.9 & 33.5\\
   Cascade RCNN (\cite{cai2019cascade}) & 25.3 & 56.1 & 21.3 & 29.5 & 60.8 & 24.8\\
   HTC (\cite{chen2019hybrid}) & 28.1 & 56.3 & 25.1 & 29.8 & 59.0 & 26.6\\
   Mask Transfiner (\cite{ke2022mask}) & 28.7 & 56.3 & 26.4 & 29.4 & 56.7 & 27.2\\
   YOLACT (\cite{bolya2019yolact})  & 24.3 & 53.3 & 19.7 & 32.1 & 65.3 & 27.9\\
   CondInst (\cite{tian2020conditional}) & 30.6 & 63.6 & 26.1 & 33.4 & 67.4 & 29.4\\
   QueryInst (\cite{fang2021instances}) & 28.5 & 60.1 & 23.1 & 33.0 & 66.7 & 29.4\\
   SOTR (\cite{guo2021sotr}) & 27.9 & 58.7 & 24.1 & 29.3 & 61.0 & 25.6\\
   SOLOv2 (\cite{wang2020solov2}) & 32.5 & 63.2 & 29.9 & 34.4 & 65.9 & 31.9\\
   OSFormer (\cite{pei2022osformer}) & 41.0 & 71.1 & 40.8 & 42.5 & 72.5 & 42.3\\
   $\alpha$-Former (Ours) & \textbf{42.5} & \textbf{72.8} & \textbf{41.8} & \textbf{42.9} & \textbf{72.9} & \textbf{43.3}\\
  \bottomrule
\end{tabular}
}
\label{table-1}
\vspace{-10pt}
\end{table}


\subsection{Ablation Study}
% Due to the space limit, we select several important ablation studies here. More ablation studies can be seen in the supplementary materials. 
\subsubsection{Comparison with the traditional descriptor}
Table~\ref{table-2} compares the performance of our binary filter with that of traditional descriptors. Here, Baseline means that no descriptor is added. Because SIFT cannot generate a feature map with the same size as the input image, we exclude it so that all variants share the same architecture, and we focus on the LBP (\cite{ojala1994performance}), HOG (\cite{dalal2005histograms}), and circle-LBP (\cite{ojala2002multiresolution}) descriptors. Apart from the local feature extractor, all other settings are identical across experiments. The traditional descriptors improve on the baseline, which has no local feature extractor, on most metrics. Our learnable binary filter performs best on every metric except $AP_{75}$ on NC4K, where HOG is higher. This experiment demonstrates the effectiveness of our binary filter and its ability to provide powerful local features that enhance the model's performance.

\begin{table}

  \scriptsize
  \caption{Comparison with traditional descriptors. The best results are highlighted in \textbf{bold}.}
  % \vspace{10pt}
  \centering
\resizebox{0.47\textwidth}{!}{
\begin{tabular}{c | c c c | c c c}
    \toprule
      \multirow{2}*{method} & \multicolumn{3}{c|}{COD10K} & \multicolumn{3}{c}{NC4K} \\
     ~ & AP & AP50 & AP75 & AP& AP50 & AP75\\
    \midrule
   Baseline & 40.244 & 69.875 & 39.422 & 41.718 & 71.640 & 41.179\\
   HOG & 40.934 & 70.887 & 40.285 & 42.765 & 71.988 & \textbf{44.226}\\
   LBP & 40.410 & 70.323 & 40.184 & 41.794 & 71.313 & 42.484\\
   Circle-LBP & 40.424 & 69.622 & 40.764 & 41.921 & 71.661 & 42.133\\
   Binary filter & \textbf{42.453} & \textbf{72.735} & \textbf{41.758} & \textbf{42.936} & \textbf{72.905} & 43.278\\
  \bottomrule
\end{tabular}
}
\label{table-2}
\end{table}

\subsubsection{Adapter}
In this section, we show the improvement obtained by adding the feature aggregation adapter to our feature extractor. The goal of the adapter is to provide the extra local features to our encoder. If we simply remove the adapter, the input domain changes, and the pre-trained backbone cannot handle an input that contains the local features. However, for a fair comparison, we still need to provide the local features to the encoder-decoder and the edge-aware fusion module. Hence, in the variant without the adapter, we concatenate the local features with the features extracted by ResNet and change the number of input channels of the encoder and the edge-aware fusion module accordingly. We also try a different setting that modifies the first layer of the pre-trained backbone and randomly initializes (RI) this layer. To better show the effectiveness of our adapter, we also test it with the traditional descriptors. The results are shown in Table~\ref{table-3}. The adapter improves every metric for the binary filter and raises AP on COD10K for all three traditional descriptors, although their remaining metrics are mixed. Randomly initializing the first layer is consistently worse than both alternatives.
\begin{table}
\setlength\tabcolsep{3pt}
%   \small
%   \footnotesize
  \scriptsize
  \caption{Ablation of the feature aggregation adapter. RI denotes the setting in which the modified first layer of the pre-trained backbone is randomly initialized.}
  % \vspace{10pt}
  \centering
\resizebox{0.47\textwidth}{!}{
\begin{tabular}{c | c c c | c c c}
    \toprule
      \multirow{2}*{method} & \multicolumn{3}{c|}{COD10K} & \multicolumn{3}{c}{NC4K} \\
     ~ & AP & AP50 & AP75 & AP& AP50 & AP75\\
    \midrule
    
%   Baseline & 40.244 & 69.875 & 39.422 & 41.718 & 71.640 & 41.179\\
%   \midrule
   HOG + RI & 36.785 & 63.585 & 37.906 & 35.474 & 64.150 & 37.246\\
   HOG w/o adapter  & 40.801 & 70.435 & 41.407 & 42.682 & 72.647 & 43.154\\
   HOG w/ adapter & 40.934 & 70.887 & 40.285 & 42.765 & 71.988 & 44.226\\
   \midrule
   LBP + RI & 33.562 & 61.623 & 34.732 & 35.631 & 64.463 & 35.462\\
   LBP w/o adapter & 39.530 & 69.419 & 39.331 & 42.288 & 71.077 & 42.162\\
   LBP w/ adapter & 40.410 & 70.323 & 40.184 & 41.794 & 71.313 & 42.484\\
   \midrule
   Circle-LBP + RI & 35.246 & 66.352 & 36.462 & 36.853 & 67.432 & 35.241 \\
   Circle-LBP w/o adapter & 40.270 & 70.550 & 40.257 & 42.668 & 73.669 & 42.172\\
   Circle-LBP w/ adapter & 40.424 & 69.622 & 40.764 & 41.921 & 71.661 & 42.133\\
   \midrule
   Binary filter + RI & 36.415 & 64.151 & 35.414 & 33.541 & 67.252 & 34.532\\
   Binary filter w/o adapter & 41.427 & 71.247 & 40.984 & 42.610 & 71.517 & 42.985\\
   Binary filter w/ adapter & 42.453 & 72.735 & 41.758 & 42.936 & 72.905 & 43.278\\
  \bottomrule
\end{tabular}
}
\label{table-3}
\end{table}

\subsubsection{Edge-aware feature fusion module}
We provide the ablation study of our edge-aware fusion module in this section. The module provides precise boundary prediction information to the final prediction heads. As in the adapter ablation, we report results with different descriptors, including the traditional descriptors and our binary filter. The results are shown in Table~\ref{table-4}. Note that the proposed edge-aware feature fusion module improves the AP by roughly 3 to 6 points over the same model without the module, depending on the descriptor and dataset (about 4.4 AP for the binary filter on COD10K). This shows the effectiveness of the module and indicates that edge prediction is crucial in camouflaged instance segmentation. Qualitative results are shown in Fig.~\ref{fig:vis}: the module handles a variety of situations and precisely predicts the edge of the target object.
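
The AP gains attributable to the module can be read directly from Table~\ref{table-4}. Writing $\Delta$ for the AP with the module minus the AP without it (a shorthand introduced only for this illustration, with CLBP standing for Circle-LBP and BF for the binary filter), the COD10K values are
\[
\begin{array}{l}
  \Delta_{\mathrm{HOG}} = 40.934 - 37.658 = 3.276, \\
  \Delta_{\mathrm{LBP}} = 40.410 - 36.128 = 4.282, \\
  \Delta_{\mathrm{CLBP}} = 40.424 - 35.254 = 5.170, \\
  \Delta_{\mathrm{BF}} = 42.453 - 38.019 = 4.434,
\end{array}
\]
with a mean of $17.162 / 4 \approx 4.29$ AP. The corresponding NC4K gains are $3.513$, $5.419$, $5.340$, and $5.853$, with a mean of $20.125 / 4 \approx 5.03$ AP.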

\begin{figure*}[htbp]
    \centering
    \includegraphics[width=0.9\linewidth]{Figs/vis_cropped.pdf}
    \caption{Qualitative results of our $\alpha$-Former. The model extracts precise boundaries and performs well in a range of challenging scenarios, suggesting that the proposed approach can handle the complexities of real-world image segmentation tasks.}
    \label{fig:vis}
\end{figure*}

\begin{table}
% \setlength\tabcolsep{3pt}
%   \small
%   \footnotesize
  \scriptsize
  \caption{Ablation of the edge-aware feature fusion (EAF) module.}
  % \vspace{10pt}
  \centering
\resizebox{0.47\textwidth}{!}{
\begin{tabular}{c | c c c | c c c}
    \toprule
      \multirow{2}*{method} & \multicolumn{3}{c|}{COD10K} & \multicolumn{3}{c}{NC4K} \\
     ~ & AP & AP50 & AP75 & AP& AP50 & AP75\\
    \midrule
   HOG w/o EAF & 37.658 & 66.584 & 35.984 & 39.252 & 67.971 & 38.756\\
   HOG w/ EAF & 40.934 & 70.887 & 40.285 & 42.765 & 71.988 & 44.226\\
   \midrule
   LBP w/o EAF & 36.128 & 67.197 & 36.725 & 36.375 & 68.258 & 37.813\\
   LBP w/ EAF & 40.410 & 70.323 & 40.184 & 41.794 & 71.313 & 42.484\\
   \midrule
   Circle-LBP w/o EAF & 35.254 & 64.741 & 36.194 & 36.581 & 66.943 & 36.135\\
   Circle-LBP w/ EAF & 40.424 & 69.622 & 40.764 & 41.921 & 71.661 & 42.133\\
   \midrule
   Binary filter w/o EAF & 38.019 & 69.765 & 36.813 & 37.083 & 68.672 & 38.731\\
   Binary filter w/ EAF & 42.453 & 72.735 & 41.758 & 42.936 & 72.905 & 43.278\\
  \bottomrule
\end{tabular}
}

\label{table-4}
\end{table}

\subsubsection{Influence of different kernel sizes in BCNN}
We investigate the impact of the kernel size on our binary filter in this section. Different kernel sizes have different receptive fields, and a larger receptive field covers more pixels in one convolution operation. In our binary filter, the kernel size therefore affects the final local binary feature. The results are shown in Table~\ref{table-5}. The $3\times 3$ kernel performs best on all metrics, and performance generally drops as the kernel grows. A possible reason is that camouflaged objects have pixel values similar to those of the background, so a larger kernel may increase the influence of the background and lower the final performance.
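
To make the receptive-field argument concrete, consider a single undilated $k\times k$ convolution (an assumption made for this illustration; $N$ is a shorthand introduced here). The number of pixels covered by one window is
\[
  N(k) = k^{2}, \qquad N(3) = 9, \quad N(5) = 25, \quad N(7) = 49, \quad N(9) = 81 .
\]
The center pixel thus accounts for $1/9 \approx 11.1\%$ of a $3\times 3$ window but only $1/81 \approx 1.2\%$ of a $9\times 9$ window, which is consistent with the hypothesis that larger kernels give more weight to the surrounding background. This count illustrates the argument and is not a measured quantity.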


\begin{table}
% \setlength\tabcolsep{3pt}
%   \small
%   \footnotesize
  \scriptsize
  \caption{Performance of $\alpha$-Former with different kernel sizes in the binary filter. The best results are highlighted in \textbf{bold}.}
  % \vspace{10pt}
  \centering
\resizebox{0.47\textwidth}{!}{
\begin{tabular}{c | c c c | c c c}
    \toprule
      \multirow{2}*{method} & \multicolumn{3}{c|}{COD10K} & \multicolumn{3}{c}{NC4K} \\
     ~ & AP & AP50 & AP75 & AP& AP50 & AP75\\
    \midrule
   $3\times 3$ & \textbf{42.453} & \textbf{72.735} & \textbf{41.758} & \textbf{42.936} & \textbf{72.905} & \textbf{43.278}\\
   $5\times 5$ & 41.308 & 70.624 & 41.707 & 42.567 & 72.075 & 43.198\\
   $7\times 7$ & 40.476 & 70.047 & 40.790 & 42.136 & 71.895 & 42.698\\
   $9\times 9$ & 40.691 & 70.116 & 40.810 & 41.164 & 71.043 & 42.580\\
  \bottomrule
\end{tabular}
}

\label{table-5}
\end{table}
\subsection{Visualizations}

\noindent We present the qualitative results of the $\alpha$-Former in this section, including the edges predicted by our edge-aware fusion module (Fig.~\ref{fig:vis}). The results show the effectiveness of our method: the module predicts precise boundaries, as in the first column of the second row, where it accurately identifies the feet of a challenging target object. Additionally, our $\alpha$-Former handles different backgrounds, such as branches, land, and aquatic plants, and precisely segments different target objects, including birds, fish, and terrestrial animals. Moreover, our model generates accurate edges even when the target object is partially occluded by the background, as in the first column of the last row. This suggests that our approach can extract semantic information from the backbone's features and recognize the object as a single entity even when its visible parts are not contiguous. Overall, these results demonstrate the robustness and effectiveness of our $\alpha$-Former in challenging scenarios.

\section{Conclusion}
\noindent In conclusion, we contribute a novel local-feature-aware transformer framework called $\alpha$-Former, targeting camouflaged instance segmentation. Motivated by the characteristics of camouflaged objects and by how humans find them, we introduce traditional descriptors into camouflaged instance segmentation and use them to simulate the process by which humans find the unnatural boundary of a camouflaged instance. Moreover, we design a novel learnable binary filter to extract the local features of the camouflaged image. To provide the local features to the encoder, we design a feature aggregation adapter that fuses the local features with the input of the pre-trained backbone. Furthermore, we create an edge-aware feature fusion module that improves the boundary prediction of camouflaged objects by combining multi-level features and using the ground-truth edge as supervision. We also provide quantitative and qualitative results for our $\alpha$-Former to show its robustness to different backgrounds. We believe that the $\alpha$-Former sets a new state of the art for camouflaged instance segmentation and that it can be transferred to applications such as medical diagnosis and photo-realistic blending.
