Abstract: This paper proposes a near-duplicate product image detection system for large-scale datasets based on binary hashing. Product images are represented by global features generated by deep neural networks, and an automatic evaluation method is used to choose the best feature description. A compact feature description is then learned using discriminative metric learning, and a method called subspace learning for hashing is used to index the images. A distributed system is designed to process large-scale product image collections. It uses five strategies: removing the logo area of the product images, accelerating Hamming distance computation with SSE2, filtering results using color information, dividing the dataset into buckets, and distributing the computation across Spark clusters. The experimental results show that the system detects near-duplicate product images in large-scale datasets rapidly and accurately.
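
To illustrate the general idea of matching images by binary hash codes and Hamming distance, the following is a minimal sketch, not the system described above. It assumes each image is already reduced to a short feature vector, and it uses simple per-dimension thresholding as a stand-in for the learned hashing step. The function names `binarize`, `hamming_distance`, and `find_near_duplicates` are hypothetical. A production system would compute the XOR and bit count in native code (for example with SSE2 instructions) rather than in Python.

```python
from typing import List, Sequence


def binarize(features: Sequence[float], thresholds: Sequence[float]) -> int:
    """Pack one bit per dimension into an int.

    Bit i is set when features[i] > thresholds[i]. This thresholding is an
    illustrative assumption, not the paper's subspace learning for hashing.
    """
    code = 0
    for i, (value, threshold) in enumerate(zip(features, thresholds)):
        if value > threshold:
            code |= 1 << i
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def find_near_duplicates(
    query: int, codes: Sequence[int], max_distance: int
) -> List[int]:
    """Return indices of codes within max_distance bits of the query."""
    return [
        index
        for index, code in enumerate(codes)
        if hamming_distance(query, code) <= max_distance
    ]
```

Because each comparison is one XOR followed by a bit count, a linear scan over compact codes stays cheap, which is why dividing the dataset into buckets and distributing the scan are natural ways to scale it.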