Abstract: The plant kingdom exhibits remarkable diversity that must be maintained for global ecosystem sustainability. However, plant life is currently disappearing at a disproportionately rapid rate, putting many essential functions (such as ecosystem production, resistance, and resilience) at risk. Plant specimen identification, the first step of plant biodiversity research, is heavily bottlenecked by a shortage of qualified experts. The botanical community has imaged large volumes of annotated physical herbarium specimens, which present a huge potential for building artificial intelligence systems that can assist researchers. In this paper, we present a novel large-scale, fine-grained dataset, NAFlora-1M, which consists of 1,050,182 herbarium images covering 15,501 North American vascular plant species (90% of the known species). Addressing gaps from previous research efforts, NAFlora-1M is the first-ever dataset to closely replicate the real-world task of herbarium specimen identification, as it is intended to cover as many of the taxa in North America as possible. We highlight some key characteristics of NAFlora-1M from a machine learning dataset perspective: high-quality labels rigorously peer-reviewed by experts; hierarchical class structure; long-tailed and imbalanced class distribution; high image resolution; and extensive image quality control for consistent scale and color. In addition, we present several baseline models, along with benchmarking results from a Kaggle competition: 134 teams benchmarked the dataset across 1,663 submissions, and the leading team achieved an 87.66% macro-F$_{1}$ score with a 1-billion-parameter ensemble model, leaving substantial room for future improvement in both performance and efficiency. We believe that NAFlora-1M is an excellent starting point to encourage the development of botanical AI applications, thereby facilitating enhanced monitoring of plant diversity and conservation efforts.
The dataset and training scripts are available at https://github.com/dpl10/NAFlora-1M .
Keywords: Biodiversity, Plant Diversity, Plant Specimen Collection, Plant Specimen Images, Digitization, Herbarium, Fine-grained Image Classification, High-resolution Images, Long-tail Distribution, Class Imbalance, Class Hierarchy, Hierarchical Label, Annotation Quality, Image Quality Control, Kaggle Competition
Code: https://github.com/dpl10/NAFlora-1M
Assigned Action Editor: ~Sergio_Escalera1
Submission Number: 36