CrypticBio: A Large Multimodal Dataset for Visually Confusing Species

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track poster
License: CC BY 4.0
Keywords: AI for biodiversity, multimodal embedding space, vision-language models, fine-grained image recognition
TL;DR: We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models for biodiversity identification using images, language and spatiotemporal data.
Abstract: We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models in the context of biodiversity applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species represented in 166 million images. Records in the dataset include research-grade image annotations: scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and associated cryptic groups. To facilitate easy subset curation from CrypticBio, we provide an open-source pipeline, CrypticBio-Curate. The multimodal design of the dataset provides complementary cues such as spatiotemporal context that support the identification of cryptic species. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art foundation models across CrypticBio subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of spatiotemporal context on vision-language zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready fine-grained species classification models for biodiversity monitoring capable of handling the nuanced challenges of species ambiguity. The data and the code are publicly available on the project website: https://georgianagmanolache.github.io/crypticbio.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/gmanolache/CrypticBio
Code URL: https://github.com/georgianagmanolache/crypticbio
Primary Area: AI/ML Datasets & Benchmarks for life sciences (e.g. climate, health, life sciences, physics, social sciences)
Flagged For Ethics Review: true
Submission Number: 1396