Keywords: Unsupervised Learning, Pig Counting, Bilinear Model, CLIP.
Abstract: Pig counting is an important task in pig marketing and farming. Most existing methods rely on supervised object detection for counting. However, supervised pig counting depends heavily on expensive data labeling, especially in dense counting scenarios. To address this issue, we propose BU-CLIP, which adapts the Contrastive Language-Image Pre-training (CLIP) model for unsupervised pig counting. We tailor CLIP's image encoder and loss function to make them more suitable for pig counting. Our method replaces CLIP's image encoder with a bilinear model combining ConvNeXt and ResNet50 backbones, and performs pooling with a multi-head attention module. We reconstruct the loss function as a multi-modal full-ranking loss, which captures the intrinsic correspondence between text and image. The proposed model is tested on a dense pig counting dataset, and extensive experiments demonstrate that our method outperforms state-of-the-art unsupervised counting methods and achieves results nearly on par with state-of-the-art supervised counting methods.
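The abstract's attention-based pooling step can be illustrated with a minimal sketch. This is not the authors' implementation: it shows a single-head, numpy-only version of attention pooling, where a global query (here, the mean of the spatial tokens from the backbone) attends over all token positions to produce one pooled image embedding. All names (`attention_pool`, the weight matrices `Wq`, `Wk`, `Wv`) and the choice of a mean-token query are illustrative assumptions; the paper's module is multi-head and learned end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, Wq, Wk, Wv):
    """Single-head attention pooling (illustrative sketch).

    tokens: (N, d) spatial feature tokens from the image backbone.
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical names).
    Returns a single (d,) pooled embedding.
    """
    q = tokens.mean(axis=0) @ Wq              # global query from the mean token
    k = tokens @ Wk                           # (N, d) keys
    v = tokens @ Wv                           # (N, d) values
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N,) attention weights
    return attn @ v                           # weighted sum over positions

# Usage: pool a 7x7 grid of 64-dim features into one 64-dim embedding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((49, 64))
Wq, Wk, Wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
pooled = attention_pool(tokens, Wq, Wk, Wv)
```

In practice this pooled embedding would be compared against text embeddings via cosine similarity, as in CLIP; extending the sketch to multiple heads means splitting the `d` dimension into per-head subspaces and concatenating the per-head pooled outputs.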
Submission Number: 3