Keywords: visual language model, prompt engineering, zero-shot, detection
TL;DR: Visual Language Models have difficulty with marine images in a zero-shot setting.
Abstract: Visual Language Models have exhibited impressive performance on new tasks in a zero-shot setting. Language queries enable these large models to classify or detect objects even when presented with a novel concept in a shifted domain. We explore the limits of this capability by presenting Grounding DINO with field images of marine and terrestrial animals and with the concepts those images contain. By manipulating the language prompts, we found that the embedding space does not necessarily encode scientific taxonomic organism names, but still yields potentially useful localizations due to a strong sense of general objectness. Grounding DINO struggled with objects in a challenging underwater setting, but improved when given expressive prompts that explicitly described morphology. These experiments suggest that large models still have room to grow in domain-specific use cases, and they illuminate avenues for strengthening the models' understanding of shape to further improve zero-shot performance. The code to reproduce these experiments is available at: https://github.com/bioinspirlab/deepsea-foundation-2023.
Submission Number: 35