# LOCAL-DETR: Localized Open-vocabulary Contrastive Alignment via Semantic Caching

![Visual comparison of LOCAL-DETR (bottom row) and the baseline OC-OVD (top row). Novel-class objects are highlighted with bounding boxes in colors different from those used for base-class objects. Multi-scale features and VLM outputs are first processed by DHSAP and then passed through the Parametric Decoupling Transformer, which improves the understanding of feature semantics and extends recognition to novel classes.](Figure_4.png)


This repository is the official implementation of **LOCAL-DETR**, a purely vision-based framework for open-vocabulary object detection. It combines a Dynamically Hierarchical Semantic Prototype Repository (DHSAP) with a Dual-stream Decoupled Training Paradigm to achieve efficient and robust open-set object perception.

The source code is available at: `https://github.com/justin-herry/KO-DEVA.git`

## Introduction

Open-vocabulary object detection (OVOD) aims to identify novel object categories beyond a predefined, closed-world label set. Existing methods often suffer from poor efficiency and limited generalization, largely because they rely heavily on multimodal fusion. Inspired by hierarchical visual perception in cognitive science, LOCAL-DETR introduces a vision-centric framework that uses self-supervised learning to extract semantic anchors from the multi-scale attention maps generated by DETR.

The main contributions of this work are:

* **Vision-Centric Paradigm**: A pioneering, purely visual OVOD framework that uses DETR's hierarchical attention mechanism to autonomously expand the visual semantic space, enabling direct feature mapping for open-set detection.

* **Dynamic Semantic Anchoring Theory**: An attention-weighted, self-supervised prototype generation method that dynamically constructs hierarchical visual concept dictionaries, removing the dependence on predefined category definitions.

* **Parametrically Decoupled Progressive Learning**: A dual-stream gradient isolation approach that decouples optimization, preserving base-class detection accuracy while providing robust zero-shot generalization through curriculum learning.

Experiments show that LOCAL-DETR balances accuracy and generalization, providing an effective approach to open-set object perception.
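
To make the idea of attention-weighted prototype generation more concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names `attention_weighted_prototype` and `build_concept_dictionary`, the use of a normalized weighted mean, and the per-scale dictionary layout are all assumptions chosen for illustration.

```python
# Illustrative sketch only; names and the weighted-mean formulation are
# assumptions, not the actual LOCAL-DETR code.

def attention_weighted_prototype(features, attention):
    """Return the attention-weighted mean of a set of feature vectors.

    features:  list of equal-length feature vectors (one per location)
    attention: list of non-negative weights, one per feature vector
    """
    if not features or len(features) != len(attention):
        raise ValueError("features and attention must be non-empty and aligned")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")

    dim = len(features[0])
    prototype = [0.0] * dim
    for vector, weight in zip(features, attention):
        for i in range(dim):
            # Each location contributes in proportion to its attention weight.
            prototype[i] += (weight / total) * vector[i]
    return prototype


def build_concept_dictionary(levels):
    """Build one prototype per scale (hypothetical hierarchical layout).

    levels: mapping of scale name -> (features, attention)
    """
    return {
        name: attention_weighted_prototype(features, attention)
        for name, (features, attention) in levels.items()
    }
```

For example, `attention_weighted_prototype([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])` pulls the prototype toward the first vector because it carries three times the attention weight.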

## Installation

### Prerequisites

This framework requires:

* Python version 3.10

* CUDA version 12.1

* PyTorch (a build compatible with CUDA 12.1 is required)

### Environment Setup

Set up the environment as follows:

1.  Clone the repository:

    ```bash
    git clone https://github.com/justin-herry/KO-DEVA.git
    cd KO-DEVA
    ```

2.  Create a Conda environment (recommended):

    ```bash
    conda create -n localdetr python=3.10
    conda activate localdetr
    ```

3.  Install PyTorch (consult the official PyTorch documentation for the exact command for your platform):

    ```bash
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    ```

4.  Install the remaining dependencies:

    ```bash
    pip install -r requirements.txt
    ```
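
After installation, a quick check can confirm that the environment matches the prerequisites listed above. The script below is a hedged sketch and not part of the repository: the helper name `check_environment` and the report keys are hypothetical. It uses only the standard library plus `torch.version.cuda` and `torch.cuda.is_available()`, and it imports PyTorch only if the package is present.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 10), required_cuda="12.1"):
    """Report whether Python and PyTorch match the stated prerequisites.

    `check_environment` is a hypothetical helper, not a repository script.
    """
    report = {
        "python_version": "%d.%d" % sys.version_info[:2],
        "python_ok": sys.version_info[:2] == tuple(required_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch

        # torch.version.cuda is None for CPU-only builds.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    report["cuda_ok"] = report["torch_cuda_build"] == required_cuda
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_ok` is reported as `False`, the installed PyTorch build was probably not compiled against CUDA 12.1, and step 3 may need to be repeated with the matching index URL.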


