Abstract: With the rise of the "Metaverse" and "Web 3.0", Non-Fungible Tokens (NFTs) have emerged as a pivotal class of digital assets, garnering significant attention. By the end of March 2024, more than 1.7 billion NFTs had been minted across various blockchain platforms. Locating a desired token therefore requires effective search over this vast number of NFTs. NFT retrieval is made even more challenging by the high degree of regional and semantic similarity among different NFTs. In this paper, we introduce the "NFT Top1000 Visual-Text Dataset" (NFT1000), which contains 7.56 million image-text pairs collected from the 1,000 most famous PFP NFT collections by sales volume on the Ethereum blockchain. Building on this dataset and the CLIP series of pre-trained models, we propose a dynamic-masking, fine-grained contrastive learning fine-tuning approach that yields a more performant model using only 13% of the total training data (0.79 million vs. 6.1 million pairs), improving top-1 accuracy by 7.2%. We also propose a robust metric, the Comprehensive Variance Index (CVI), to assess the similarity and retrieval difficulty of visual-text paired data. Please try our retrieval demo at https://876p9s4054.vicp.fun/
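To make the training recipe concrete, the sketch below illustrates CLIP-style symmetric contrastive fine-tuning with a placeholder dynamic-masking step. The abstract does not describe the exact masking strategy, so `dynamic_mask` (random token masking at a tunable ratio) and its `pad_id` parameter are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of CLIP-style contrastive fine-tuning with a placeholder
# "dynamic masking" step. dynamic_mask() is a hypothetical stand-in: the
# paper's actual masking strategy is not specified in the abstract.
import torch
import torch.nn.functional as F

def dynamic_mask(token_ids: torch.Tensor, mask_ratio: float, pad_id: int = 0) -> torch.Tensor:
    """Randomly replace a fraction of non-pad text tokens with pad_id (illustrative)."""
    keep = torch.rand(token_ids.shape, device=token_ids.device) >= mask_ratio
    return torch.where(keep | (token_ids == pad_id), token_ids,
                       torch.full_like(token_ids, pad_id))

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Match each image to its paired text, and each text to its paired image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a full training loop, masked captions would pass through a CLIP text encoder and this loss would be minimized over the NFT1000 image-text pairs.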
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Systems] Data Systems Management and Indexing
Relevance To Conference: Our main contributions are: (1) We construct the first NFT-related visual-text dataset in the field of computer vision. (2) We introduce the task of large-scale, high-similarity image-text retrieval. (3) We design an efficient training method for NFT data that uses less data yet trains better models. (4) We propose the Comprehensive Variance Index, a general-purpose metric for measuring the similarity between paired images and texts.
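The abstract does not define how CVI is computed; the following is a purely hypothetical illustration of a variance-style index over paired embeddings, not the paper's formula. The intuition it encodes: lower embedding variance across a collection suggests more similar items and thus harder retrieval.

```python
# Purely hypothetical illustration (the paper's CVI formula is not given in
# this abstract): mean per-dimension variance of L2-normalized image and text
# embeddings, where lower values indicate more homogeneous, harder-to-retrieve data.
import torch
import torch.nn.functional as F

def variance_index(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    val = img.var(dim=0).mean() + txt.var(dim=0).mean()
    return (0.5 * val).item()
```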
Submission Number: 859