Keywords: event-based vision, CLIP, few-shot learning
TL;DR: We adapt the pre-trained CLIP model to perform zero-shot and few-shot event-based object recognition.
Abstract: Recent advances in zero-shot and few-shot classification heavily rely on the success of pre-trained vision-language models (VLMs) such as CLIP.
Due to a shortage of large-scale datasets, training such models for event camera data remains infeasible.
Thus, adapting existing models across modalities is an important research challenge.
In this work, we introduce EventCLIP, a novel approach that utilizes CLIP for zero-shot and few-shot event-based object recognition.
We first generalize CLIP's image encoder to event data by converting raw events to 2D grid-based representations.
To further enhance performance, we propose a feature adapter to aggregate temporal information over event frames and refine text embeddings to better align with the visual inputs.
We evaluate EventCLIP on the N-Caltech, N-Cars, and N-ImageNet datasets, achieving state-of-the-art few-shot performance.
When fine-tuned on the entire dataset, our method outperforms all existing event classifiers.
Moreover, we explore practical applications of EventCLIP including robust event classification and label-free event recognition, where our approach surpasses previous baselines designed specifically for these tasks.
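The abstract's first step — converting raw events to 2D grid-based representations so CLIP's image encoder can consume them — is not spelled out here. Below is a minimal sketch of one standard such representation, a per-polarity event count histogram; the function name, the (x, y, t, polarity) event layout, and the two-channel design are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate raw events into a 2-channel 2D histogram.

    events: (N, 4) array of (x, y, t, polarity) tuples — an assumed layout;
            polarity > 0 is treated as the positive channel.
    Returns a (2, height, width) float32 count grid.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = (events[:, 3] > 0).astype(int)  # channel index: 0 = negative, 1 = positive
    # np.add.at handles repeated (p, y, x) indices correctly, unlike frame[p, y, x] += 1
    np.add.at(frame, (p, y, x), 1.0)
    return frame
```

Such a grid can then be normalized and replicated or colored into a 3-channel image before being fed to CLIP's image encoder; the paper's feature adapter would aggregate several of these frames over time.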
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1572