Title: VoxDet: Voxel Learning for Novel Instance Detection

Abstract: Novel instance detection, the task of identifying specific unseen objects from multi-view templates in open-world scenarios, is challenging. Traditional 2D-centric methods often struggle with substantial pose variations and occlusions, which limits their real-world applicability. To overcome these limitations, we introduce **VoxDet**, a 3D geometry-aware framework built on a 3D voxel representation and an efficient voxel matching mechanism. First, the **Template Voxel Aggregation (TVA)** module transforms multi-view 2D images into compact 3D template voxels. Using the associated camera poses, TVA aggregates 2D features into a unified 3D representation that is more resilient to occlusion and pose variation. We further show that a self-supervised 3D reconstruction objective effectively pre-trains the 2D-3D mapping within TVA, improving its geometry encoding. Second, the **Query Voxel Matching (QVM)** module converts 2D query features into a 3D voxel representation using the same pre-trained 2D-3D mapping. By explicitly estimating the relative rotation and aligning the voxels before comparison, QVM improves both matching accuracy and computational efficiency. Beyond the method, we introduce **RoboTools**, the first comprehensive instance detection benchmark. RoboTools features 20 unique instances, video-recorded with precise camera extrinsics, across 24 challenging cluttered scenarios, totaling over 9,000 box annotations. Extensive experiments on LineMod-Occlusion, YCB-Video, and RoboTools show that VoxDet outperforms existing 2D baselines while also running faster at inference.
To the best of our knowledge, VoxDet is the first framework to explicitly incorporate 3D geometric knowledge for robust 2D novel instance detection. Our code, data, raw results, and pre-trained models are publicly available at https://github.com/Jaraxxus-Me/VoxDet.

Section: Introduction
The ability to swiftly identify a specific, previously unseen instance amidst a cluttered environment, such as finding a matching sock in a laundry pile or a particular suitcase at an airport, is a fundamental human cognitive skill. Humans can form a mental picture of a novel object from a few glances and then locate it with remarkable accuracy. While state-of-the-art object detectors [1-7] excel at categorizing known classes, they fall short in this critical task of **novel instance detection**: identifying an unseen specific instance within a cluttered query image using multi-view support references. Prior approaches to this problem predominantly operate in 2D space, employing techniques like correlation [8,5], attention mechanisms [6], or similarity matching [9]. However, these 2D methods often exhibit a critical lack of robustness when confronted with significant appearance disparities, such as those caused by pose variations or occlusions, between the query and template images. Furthermore, while extensive research exists in few-shot category-level object detection [7,1,2], these class-level matching techniques are inherently insufficient for the fine-grained discernment required for instance-level identification.

Section: Related Work
**Object Detection:** Object detection methods broadly fall into two categories: two-stage and one-stage approaches. Two-stage detectors, such as R-CNN [20] and its variants [21,22], first generate regions of interest (ROI) via a region proposal network, followed by detection heads for classification and bounding box regression. In contrast, one-stage detectors like the YOLO series [23-25] and recent transformer-based methods [3,4] tackle detection as an end-to-end regression problem. Few-shot and one-shot object detection [1,2,7,27-29] aim to detect unseen classes with limited labeled support samples, making them conceptually closer to our task. These methods typically employ either transfer-learning [27,28] or meta-learning strategies [1,2,7,29] to generalize to novel categories. However, as these are primarily category-level approaches, their classification and matching designs are often optimized for metrics like Top-100 precision, assuming multiple instances of a class. This contrasts sharply with novel instance detection, where Top-1 accuracy for a specific instance is paramount, and these methods often fail. Open-world and zero-shot object detection [14,30-32] address the task of finding any object in an image, often in a class-agnostic manner. Some approaches learn objectiveness [14,30], while others [32] leverage large-scale datasets. Such methods are well-suited as a preliminary module in our pipeline to generate object proposals for subsequent template comparison. We specifically adopt [14] due to its efficient structure and robust performance.
**Novel Instance Detection:** Crucially, novel instance detection necessitates algorithms that can locate a specific unseen instance in a query image using provided multi-view templates. As discussed, prior methods [5,6,8] predominantly rely on pure 2D representations and matching techniques. For instance, DTOID [6] employs global object attention and a local pose-specific branch to predict template-guided heatmaps. However, these 2D-centric approaches inherently struggle with significant 2D appearance variations caused by occlusion or drastic pose changes. VoxDet addresses this fundamental limitation by explicitly leveraging 3D geometric knowledge extracted from multi-view templates, enabling the representation and matching of instances in a geometry-invariant manner.
**Multi-view 3D Representations:** The robust representation of 3D scenes or instances from multi-view images is a long-standing and critical problem in computer vision. Historically, multi-view geometry, particularly Structure from Motion (SfM) [33] pipelines, has been instrumental in jointly optimizing camera poses and 3D structures. More recently, neural 3D representations [10,11,34-38], including deep voxels [10,12,35,38] and implicit functions [36,37], have achieved remarkable success in tasks such as 3D reconstruction and novel view synthesis. Our VoxDet framework draws significant inspiration from the Video Autoencoder [10], which disentangles deep implicit 3D structure learning from camera trajectory estimation in videos. A key advantage of this approach, crucial for our instance detection task, is its ability to encode and synthesize test scenes efficiently without requiring further tuning or optimization.
**Instance Pose Estimation:** This field is dedicated to estimating the 6 Degrees of Freedom (6 DoF) pose of an unseen instance. Some methods [57,58] match local point features and utilize RANSAC to optimize the relative pose. Others [5,59] first select the closest template frame and then conduct pose refinement on the cropped object patch. Most of these methods typically assume perfect instance detection, meaning they crop the instance from the query image using ground-truth bounding boxes and then estimate the pose on this small, object-centered patch. Our VoxDet can serve as an effective front-end for such systems, providing robust detection in cluttered environments and thereby enhancing the reliability of the overall detection-pose estimation framework.
**Instance Retrieval:** Instance retrieval aims to retrieve a specific instance from a large database given a single reference image [60-65]. Early work extracted local point features from template and query patches for image matching [61,49], which often suffered from limited discriminative capability. More recent work leverages deep neural networks for a global representation of the instance [62-65], comparing these with features extracted from query images. However, most of these methods construct 2D template features from the reference, making their representation unaware of the instance's 3D geometry, which can lead to a lack of robustness under severe pose variations. Furthermore, instance retrieval methods often require high-resolution query images for discriminative features, whereas instances in our cluttered query images can be at low resolution, posing additional challenges for these approaches.

Section: Methodology
**Problem Formulation:** Given a training instance set $\mathcal{O}_{\mathrm{base}}$ and a disjoint unseen test instance set $\mathcal{O}_{\mathrm{novel}}$ (i.e., $\mathcal{O}_{\mathrm{base}} \cap \mathcal{O}_{\mathrm{novel}} = \emptyset$), the objective of novel instance detection is to train an instance detector on $\mathcal{O}_{\mathrm{base}}$ that can subsequently detect new instances in $\mathcal{O}_{\mathrm{novel}}$ without any further training or fine-tuning. Formally, for each target instance, the detector receives a query image $I^Q \in \mathbb{R}^{3 \times W \times H}$ and a set of $M$ support templates $I^S \in \mathbb{R}^{M \times 3 \times W \times H}$. The detector's output is the bounding box $b \in \mathbb{R}^4$ of the target instance in the query image. For this work, we assume that exactly one target instance is present in the query image and that the instance is approximately centered in the support images.

Section: Architecture
The comprehensive architecture of VoxDet, illustrated in Fig. 2, integrates three primary modules: an open-world detector, a Template Voxel Aggregation (TVA) module, and a Query Voxel Matching (QVM) module. Initially, the open-world detector processes the query image to generate universal object proposals. Subsequently, TVA aggregates multi-view support images into a compact 3D template voxel, leveraging their relative camera poses. Concurrently, QVM transforms 2D proposal features into the 3D voxel space, facilitating alignment and matching with the template voxel. To imbue the voxel representation with robust 3D geometric understanding, we employ a self-supervised 3D reconstruction objective during an initial pre-training stage. The models obtained from this reconstruction pre-training then serve as crucial initial weights for the subsequent instance detection training stage.  

Section: Open-World Detection
Given that the desired instance is unseen during training, directly regressing its precise location and scale presents a significant challenge. To address this, we initiate our pipeline with an open-world detector [14] designed to generate comprehensive object proposals. Unlike conventional detectors that identify only pre-defined classes, an open-world detector is class-agnostic, capable of localizing all potential objects within an image. As depicted in Fig. 2, for a given query image $I^Q$, a 2D feature map $f^Q$ is extracted by a backbone network $\psi(\cdot)$. A Region Proposal Network (RPN) [22] then classifies pre-defined anchors as either foreground (objects) or background, while simultaneously regressing their bounding box coordinates. Anchors with high classification scores are designated as region proposals, represented as
$$P = [p_1, p_2, \cdots, p_N] \in \mathbb{R}^{N \times 4},$$
where $N$ denotes the total number of proposals. Subsequently, to obtain features $F^Q$ for these candidates, we apply Region of Interest (ROI) pooling (specifically ROIAlign [22]), yielding
$$F^Q = \mathrm{ROIAlign}(P, f^Q) \in \mathbb{R}^{N \times C \times w \times w},$$
where $C$ is the channel dimension and $w$ is the spatial size of the proposal features. Finally, the detection head, comprising two parallel multi-layer perceptrons (MLP), takes $F^Q$ as input to output binary classification scores and bounding box regression targets. The training objective for this module combines several losses: RPN classification loss $\mathcal{L}^{\mathrm{RPN}}_{\mathrm{cls}}$, RPN regression loss $\mathcal{L}^{\mathrm{RPN}}_{\mathrm{reg}}$, head classification loss $\mathcal{L}^{\mathrm{Head}}_{\mathrm{cls}}$, and head regression loss $\mathcal{L}^{\mathrm{Head}}_{\mathrm{reg}}$. To enable the detector's open-world capability, its classification branches (both in the RPN and detection head) are supervised through objectiveness regression [14]. This means the classification score is directly defined and optimized to predict the Intersection over Union (IoU) with ground-truth objects. This approach has demonstrated a high recall rate for objects in test images, even for instances not seen during training. By learning this class-agnostic 'objectiveness,' the open-world detector reliably generates proposals that are highly likely to cover the desired novel instance. Consequently, we select the top-ranking candidates and their corresponding features as the input for our subsequent matching module.
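The IoU-based objectiveness target and the top-ranking selection described above can be sketched in a few lines. This is a minimal illustration only: the helper names `box_iou`, `objectness_targets`, and `select_top_n` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the IoU targets stand in for the scores a trained network would predict.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def objectness_targets(anchors, gt_boxes):
    """Regression target per anchor: its highest IoU with any ground-truth box."""
    return [max((box_iou(a, g) for g in gt_boxes), default=0.0) for a in anchors]


def select_top_n(boxes, scores, n):
    """Indices of the n highest-scoring boxes, best first."""
    return sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:n]


anchors = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
gt_boxes = [(0, 0, 2, 2)]
targets = objectness_targets(anchors, gt_boxes)  # what the classifier would regress
keep = select_top_n(anchors, targets, n=2)       # targets stand in for predicted scores
```

Because the score is an overlap estimate rather than a class probability, the ranking does not depend on any category label, which is what lets the same proposals be reused for unseen instances.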

Section: Template Voxel Aggregation
To learn geometry-invariant representations, the Template Voxel Aggregation (TVA) module compresses multi-view 2D templates into a compact deep voxel. Inspired by previous techniques [10] developed for unsupervised video encoding, we propose to encode our instance templates via their relative orientation in the physical 3D world. To this end, we first generate the 2D feature maps $F^S = \psi(I^S) \in \mathbb{R}^{M \times C \times w \times w}$ using a shared backbone network $\psi(\cdot)$ (also used in the query branch). These 2D features are then mapped to 3D voxels for multi-view aggregation.
**2D-3D Mapping:** To project these 2D features onto a shared 3D space, we utilize an implicit mapping function $\mathcal{M}(\cdot)$. This function translates the 2D features to 3D voxel features, denoted by $V = \mathcal{M}(F^S) \in \mathbb{R}^{M \times C_v \times D \times L \times L}$, where $V$ represents the 3D voxel feature derived from the 2D input, $C_v$ is the feature dimension, and $D$, $L$ indicate the voxel's spatial dimensions. Specifically, we first reshape the feature maps to $F'^S \in \mathbb{R}^{M \times (C/d) \times d \times w \times w}$, where $d$ is a pre-defined implicit depth, and then apply 3D inverse convolution to obtain the feature voxel.
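The reshape step of the 2D-3D mapping can be illustrated with NumPy. The sketch below covers only the reshape, not the subsequent learned 3D convolution. The sizes (10 templates, 256 channels, 16 × 16 maps, depth 8) and the helper name `lift_channels_to_depth` are assumptions for illustration, as is the choice that consecutive channels end up in the same depth column.

```python
import numpy as np


def lift_channels_to_depth(feats, d):
    """Reshape (M, C, w, w) feature maps to (M, C // d, d, w, w)."""
    M, C, h, w = feats.shape
    if C % d != 0:
        raise ValueError("channel count must be divisible by the implicit depth d")
    return feats.reshape(M, C // d, d, h, w)


# Hypothetical sizes: M = 10 templates, C = 256 channels, 16 x 16 maps, depth d = 8.
F_S = np.arange(10 * 256 * 16 * 16, dtype=np.float32).reshape(10, 256, 16, 16)
F_S_3d = lift_channels_to_depth(F_S, d=8)
print(F_S_3d.shape)  # (10, 32, 8, 16, 16)
```

With this layout, original channel `k * d + z` becomes depth slice `z` of lifted channel `k`, so no values are copied or reordered in memory.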
Note that with multi-view images, the relative camera rotation can be readily calculated via Structure from Motion (SfM) [33] or visual odometry [39]. Given that the images are object-centered and the object remains static, these relative rotations effectively represent the relative orientations between the object instances as defined in the same camera coordinate system. Unlike prior work [10] that implicitly learns camera extrinsics for unsupervised encoding, our approach explicitly embeds this geometric information. Specifically, we first transform every template voxel into a common coordinate system using its relative rotation and then aggregate the transformed voxels:
$$v^S = \frac{1}{M} \sum_{i=1}^{M} \mathrm{Conv3D}\big(\mathrm{Rot}(V_i, R_i^\top)\big), \tag{1}$$
where $V_i \in \mathbb{R}^{C_v \times D \times L \times L}$ is the previously mapped $i$-th independent voxel feature, and $R_i^\top$ denotes the relative camera rotation between the $i$-th support frame and the first frame. $\mathrm{Rot}(\cdot, \cdot)$ is the 3D transform used in [10], which first warps a unit voxel to the new coordinate system using $R_i^\top$ and then samples from the feature voxel $V_i$ with the transformed unit voxel grid. Therefore, all $M$ voxels are transformed into the same coordinate system defined by the first camera frame. These are then aggregated through average pooling to produce the compact template voxel $v^S$. By explicitly embedding the 3D rotations into individual reference features, TVA achieves a geometry-aware compact representation, which is inherently more robust to occlusion and pose variation.
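The warp-then-average structure of the aggregation can be sketched as follows. This is a simplified stand-in, not the transform of [10]: it uses nearest-neighbour sampling instead of interpolation, omits the learned Conv3D, and assumes a particular sampling convention (each output cell at centred coordinate g reads the input at R g). The function names are hypothetical.

```python
import numpy as np


def rotate_voxel(vox, R):
    """Resample a (C, D, H, W) voxel about its centre: output[g] = input[R @ g].

    Nearest-neighbour sampling; cells that map outside the grid are set to zero.
    """
    C, D, H, W = vox.shape
    zs, ys, xs = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    centre = np.array([(D - 1) / 2, (H - 1) / 2, (W - 1) / 2])
    grid = np.stack([zs, ys, xs], axis=-1).reshape(-1, 3) - centre  # centred output coords
    src = grid @ R.T + centre                                       # source coord per output cell
    idx = np.rint(src).astype(int)
    valid = np.all((idx >= 0) & (idx < np.array([D, H, W])), axis=1)
    lin = (idx[valid, 0] * H + idx[valid, 1]) * W + idx[valid, 2]
    out = np.zeros((C, D * H * W), dtype=vox.dtype)
    out[:, valid] = vox.reshape(C, -1)[:, lin]
    return out.reshape(C, D, H, W)


def aggregate_templates(voxels, rotations):
    """Rotate each V_i with R_i^T into the first frame, then average (Conv3D omitted)."""
    aligned = [rotate_voxel(v, R.T) for v, R in zip(voxels, rotations)]
    return np.mean(aligned, axis=0)


# Two views of the same content: the second is the first seen under a 90-degree rotation.
rng = np.random.default_rng(0)
v0 = rng.standard_normal((2, 4, 4, 4))
R90 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
v_template = aggregate_templates([v0, rotate_voxel(v0, R90)], [np.eye(3), R90])
```

In this toy case the two aligned voxels coincide, so the average reproduces `v0`. With real features, averaging in a shared frame lets views that see different sides of the object fill in for one another.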

Section: Query Voxel Matching
Given the proposal features $F^Q$ from the query image $I^Q$ and the template voxel $v^S$ from the support images $I^S$, the primary task of the Query Voxel Matching (QVM) module is to classify each proposal as either foreground (corresponding to the reference instance) or background. As illustrated in Fig. 2, to endow the 2D query features with 3D geometric awareness, we first apply the same implicit mapping function $\mathcal{M}(\cdot)$ to obtain query voxels:
$$V^Q = \mathcal{M}(F^Q) \in \mathbb{R}^{N \times C_v \times D \times L \times L}.$$
VoxDet then accomplishes the matching between $v^S$ and $V^Q$ through a two-step process. First, it estimates the relative rotation between the query and support, ensuring that $V^Q$ can be accurately aligned in the same coordinate system as $v^S$. Second, it learns a function to measure the geometric and semantic distance between the aligned voxels.
To achieve this, we introduce a novel **Voxel Relation** operator, $\mathcal{R}_v(\cdot, \cdot)$. Given two voxels $v_1, v_2 \in \mathbb{R}^{c \times a \times a \times a}$ (where $c$ is the channel dimension and $a$ is the spatial dimension), this function aims to discover their intricate relations across every semantic channel. This is achieved by first interleaving the voxels along their channel dimension:
$$\mathrm{In}(v_1, v_2) = [v_1^1, v_2^1, v_1^2, v_2^2, \cdots, v_1^c, v_2^c] \in \mathbb{R}^{2c \times a \times a \times a},$$
where $v_1^k, v_2^k$ denote the voxel features in the $k$-th channel. Subsequently, we apply grouped convolution:
$$\mathcal{R}_v(v_1, v_2) = \mathrm{Conv3D}(\mathrm{In}(v_1, v_2), \mathrm{group} = c).$$
Through extensive experimentation, we found that this design significantly facilitates relation learning, as each convolution kernel is specifically tasked with learning the relationship between the two feature voxels from the same semantic channel. Leveraging this voxel relation, we can then robustly estimate the rotation matrix $\hat{R}^Q \in \mathbb{R}^{N \times 3 \times 3}$ for each query voxel relative to the template as:
$$\hat{R}^Q = \mathrm{MLP}\big(\mathcal{R}_v(V^S, V^Q)\big), \tag{2}$$
where $v^S$ is replicated $N$ times to form $V^S$ for element-wise comparison. In practice, we initially predict a 6D continuous vector [40] as the network's output, which is then converted into a rotation matrix. Following this, we define the classification head using the Voxel Relation:
$$\hat{s} = \mathrm{MLP}\big(\mathcal{R}_v(V^S, \mathrm{Rot}(V^Q, \hat{R}^Q))\big), \tag{3}$$
where $\mathrm{Rot}(V^Q, \hat{R}^Q)$ performs the rotation of the query voxels into the support's coordinate system, enabling accurate and geometrically consistent matching. Practically, we also incorporate a global relation branch to capture semantic information potentially attenuated during the implicit 2D-3D mapping process, further enhancing the final score. More comprehensive details are provided in the supplementary material. During inference, we rank the generated proposals $P$ based on their matching scores and select the Top-k candidates as the predicted bounding box $b$.
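The channel interleaving and grouped convolution behind the Voxel Relation operator can be sketched in NumPy. The sketch makes two simplifying assumptions that the text does not state: the kernel is 1 × 1 × 1, and each group produces a single output channel. The names `interleave` and `voxel_relation_1x1` are hypothetical.

```python
import numpy as np


def interleave(v1, v2):
    """(c, a, a, a) x 2 -> (2c, a, a, a), ordered v1[0], v2[0], v1[1], v2[1], ..."""
    c = v1.shape[0]
    return np.stack([v1, v2], axis=1).reshape(2 * c, *v1.shape[1:])


def voxel_relation_1x1(v1, v2, weight, bias):
    """Grouped convolution with c groups and a 1x1x1 kernel, one output channel per group.

    weight has shape (c, 2) and bias has shape (c,); group k mixes only v1[k] and v2[k].
    """
    c = v1.shape[0]
    pairs = interleave(v1, v2).reshape(c, 2, *v1.shape[1:])  # group k = channels 2k, 2k + 1
    return np.einsum("kp,kpdhw->kdhw", weight, pairs) + bias[:, None, None, None]


rng = np.random.default_rng(0)
v1 = rng.standard_normal((3, 2, 2, 2))
v2 = rng.standard_normal((3, 2, 2, 2))
stacked = interleave(v1, v2)

# With weights (+1, -1) per group, the operator reduces to a per-channel difference.
diff = voxel_relation_1x1(v1, v2, np.tile([1.0, -1.0], (3, 1)), np.zeros(3))
```

The per-channel difference is only one relation such a kernel could express. Learned weights with a larger spatial extent could capture other comparisons, while the grouping keeps each kernel restricted to one semantic channel pair.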

Section: Training Objectives
As illustrated in Fig. 2, VoxDet's training regimen is divided into two distinct stages: reconstruction pre-training and instance detection fine-tuning.
**Reconstruction Pre-training:** To effectively learn 3D geometry relationships, particularly the 3D rotation between instance templates, we pre-train the implicit mapping function $\mathcal{M}(\cdot)$ using a self-supervised reconstruction objective. We partition $M$ multi-view templates $I^S$ into a set of input images $I^S_{\mathrm{i}} \in \mathbb{R}^{(M-K) \times 3 \times W \times H}$ and a set of output images $I^S_{\mathrm{o}} \in \mathbb{R}^{K \times 3 \times W \times H}$. Subsequently, we construct the voxel representation $v^S$ using $I^S_{\mathrm{i}}$ via the TVA module and employ a decoder network $\mathrm{Dec}$ to reconstruct the output images based on their relative rotations:
$$\hat{I}^S_{\mathrm{o},j} = \mathrm{Dec}\big(\mathrm{Rot}(v^S, R_j^\top)\big), \quad j \in \{1, 2, \cdots, K\}, \tag{4}$$
where $\hat{I}^S_{\mathrm{o},j}$ denotes the $j$-th reconstructed (synthetic) output image and $R_j$ is the relative rotation matrix between the first and $j$-th camera frame. The comprehensive reconstruction loss is then defined as:
$$\mathcal{L}_r = w_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}} + w_{\mathrm{gan}} \mathcal{L}_{\mathrm{gan}} + w_{\mathrm{percep}} \mathcal{L}_{\mathrm{percep}}. \tag{5}$$
Here, $\mathcal{L}_{\mathrm{recon}}$ represents the reconstruction loss, specifically the L1 distance between $I^S_{\mathrm{o}}$ and $\hat{I}^S_{\mathrm{o}}$. $\mathcal{L}_{\mathrm{gan}}$ is the generative adversarial network (GAN) loss, where an auxiliary discriminator is trained to distinguish between real ($I^S_{\mathrm{o}}$) and reconstructed ($\hat{I}^S_{\mathrm{o}}$) images. $\mathcal{L}_{\mathrm{percep}}$ denotes the perceptual loss, calculated as the L1 distance between the feature maps of $I^S_{\mathrm{o}}$ and $\hat{I}^S_{\mathrm{o}}$ at various levels of a pre-trained VGGNet [41]. Although this reconstruction is supervised solely on training instances, we empirically observe that it enables the model to roughly reconstruct novel views for entirely unseen instances. This suggests that the pre-trained voxel mapping successfully encodes the fundamental geometry of an instance.
**Instance Detection Training:** To empower $\mathcal{M}(\cdot)$ with robust geometry encoding capabilities for detection, we initialize it with the weights learned during the reconstruction pre-training stage and proceed with the instance detection training. In addition to the standard open-world detection loss [14], we introduce an instance classification loss $\mathcal{L}^{\mathrm{Ins}}_{\mathrm{cls}}$ and a rotation estimation loss $\mathcal{L}^{\mathrm{Ins}}_{\mathrm{rot}}$ to supervise VoxDet. $\mathcal{L}^{\mathrm{Ins}}_{\mathrm{cls}}$ is defined as the binary cross-entropy loss between the true labels $s \in \{0, 1\}^N$ and the predicted scores $\hat{s} \in \mathbb{R}^{N \times 2}$ from the QVM module. The rotation estimation loss is formulated as:
$$\mathcal{L}^{\mathrm{Ins}}_{\mathrm{rot}} = \big\| \hat{R}^Q R^{Q\top} - I \big\|, \tag{6}$$
where $R^Q$ is the ground-truth rotation matrix of the query voxel. It is important to note that we only supervise this loss for positive samples. The combined instance detection loss is thus defined as:
$$\mathcal{L}_d = w_1 \mathcal{L}^{\mathrm{RPN}}_{\mathrm{cls}} + w_2 \mathcal{L}^{\mathrm{RPN}}_{\mathrm{reg}} + w_3 \mathcal{L}^{\mathrm{Head}}_{\mathrm{cls}} + w_4 \mathcal{L}^{\mathrm{Head}}_{\mathrm{reg}} + w_5 \mathcal{L}^{\mathrm{Ins}}_{\mathrm{cls}} + w_6 \mathcal{L}^{\mathrm{Ins}}_{\mathrm{rot}}. \tag{7}$$
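The rotation branch, from the 6D network output to the rotation loss, can be sketched as follows. The 6D-to-matrix conversion is the standard Gram-Schmidt construction for the continuous 6D representation. The choice of the Frobenius norm in the loss is an assumption, since the text does not specify which norm is used, and both function names are hypothetical.

```python
import numpy as np


def rotation_from_6d(x):
    """Map a 6D vector to a rotation matrix by Gram-Schmidt on its two 3D halves."""
    a1, a2 = np.asarray(x[:3], dtype=float), np.asarray(x[3:], dtype=float)
    b1 = a1 / np.linalg.norm(a1)
    u2 = a2 - np.dot(b1, a2) * b1          # remove the component along b1
    b2 = u2 / np.linalg.norm(u2)
    b3 = np.cross(b1, b2)                  # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)  # basis vectors as columns


def rotation_loss(R_pred, R_gt):
    """|| R_pred R_gt^T - I ||, using the Frobenius norm (an assumption)."""
    return np.linalg.norm(R_pred @ R_gt.T - np.eye(3))


R_hat = rotation_from_6d([1.0, 2.0, 3.0, -1.0, 0.0, 2.0])
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
loss_same = rotation_loss(R_hat, R_hat)      # zero up to rounding
loss_90 = rotation_loss(np.eye(3), Rz90)     # 2.0 for a 90-degree error
```

The loss vanishes exactly when the predicted and ground-truth rotations agree, because the product of a rotation with its own transpose is the identity. Under the Frobenius norm it grows with the angle between them.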
**Synthetic Training Set:** To address the scarcity of instance detection training sets, we compiled a synthetic dataset, OWID-10k ("Open-World Instance Detection"), from 9,901 diverse 3D objects in ShapeNet [15] and ABO [16]. Each instance is rendered into a 40-frame, object-centric 360° video using BlenderProc [43]. For training, we generate 55,000 query scenes, each containing 8 to 15 randomly selected objects initialized with random orientations, yielding 180,000 bounding box annotations. An additional 500 images are reserved for evaluation. The training scenes cover 9,800 instances and the evaluation images cover the remaining 101, so the evaluation split measures the model's capacity to generalize to unseen instances.
**Synthetic-Real Test Sets:** For rigorous testing, we utilize two authoritative benchmarks: LineMod-Occlusion [17] (LM-O) and YCB-Video [18] (YCB-V). LM-O features 8 texture-less instances and 1,514 box annotations, with the primary challenge being heavy object occlusion. YCB-V contains 21 instances and 4,125 target boxes, where the main difficulty lies in significant variations in instance pose. Since these datasets provide real test images but lack corresponding reference videos, we render synthetic videos using their respective CAD models in Blender to serve as multi-view templates.

Section: Fully-Real Test Set: RoboTools
To rigorously assess the sim-to-real transfer capability of VoxDet, we introduce a more complex, fully real-world benchmark named **RoboTools**. This dataset comprises 20 unique instances, 9,109 annotations, and 24 highly challenging scenarios. The instances and example scenes are presented in Fig. 3. Compared to existing benchmarks like LineMod-Occlusion [17] and YCB-Video [18], RoboTools is significantly more challenging due to its heavily cluttered backgrounds and severe pose variations. Furthermore, the reference videos in RoboTools consist of real images, capturing realistic lighting conditions including shadows. We also provide ground-truth camera extrinsics for all recordings.
**Baselines:** Our comparative baselines include traditional template-driven instance detection methods, such as correlation-based approaches [5] and attention-based methods [6]. However, these methods typically falter in cluttered scenes, which are prevalent in LM-O, YCB-V, and RoboTools. Therefore, we have meticulously constructed several strong 2D baselines, namely OLN DINO, OLN CLIP, and OLN Corr. In these models, we first obtain open-world 2D proposals using our open-world detection module [14]. Subsequently, different 2D matching methods are employed to identify the proposal with the highest score. For OLN DINO and OLN CLIP, we leverage robust features from large-scale pre-trained backbones [9,19] and utilize cosine similarity for matching. For OLN Corr., we designed a 2D matching head that employs correlation, as suggested in [5]. These open-world detection-based 2D baselines significantly outperform previous methods [5,6]. In addition to these instance-specific methods, we also include class-level one-shot detectors, OS2D [7] and BHRL [42], for comprehensive comparison.
**Hardware and Configurations:** The reconstruction stage of VoxDet was trained on a single Nvidia V100 GPU for 6 hours, while the detection training phase utilized four Nvidia V100 GPUs for approximately 40 hours. For fair comparison, we trained the referenced methods [5-7, 9, 14, 19] primarily on the OWID dataset, adhering to their official configurations. Inference was conducted on a single V100 GPU to ensure equitable efficiency comparisons. During testing, all models were provided with the same set of $M = 10$ template images per instance, and all methods utilized the top $N = 500$ ranking proposals for matching. In the initial reconstruction training stage, VoxDet leveraged 98% of all 9,901 instances in the OWID dataset. For each instance, a random set of $K = 4$ images was designated as the output $I^S_{\mathrm{o}}$, while the remaining $M - K = 6$ images constituted the inputs $I^S_{\mathrm{i}}$. For additional configurations of VoxDet, please refer to Appendix A and our publicly available code.
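The random split of templates into reconstruction inputs and outputs can be sketched with the standard library. The function name `split_templates` and the seed argument are hypothetical, and how the actual training code draws the split is not stated here.

```python
import random


def split_templates(num_templates, num_outputs, seed=None):
    """Randomly pick num_outputs frame indices as targets; the rest are inputs."""
    rng = random.Random(seed)
    out_idx = sorted(rng.sample(range(num_templates), num_outputs))
    out_set = set(out_idx)
    in_idx = [i for i in range(num_templates) if i not in out_set]
    return in_idx, out_idx


# M = 10 templates per instance, K = 4 held out as reconstruction targets.
inputs, outputs = split_templates(num_templates=10, num_outputs=4, seed=0)
```

Drawing a fresh split per instance means the decoder is asked to synthesize views it was not given, which is what makes the reconstruction objective a test of the encoded geometry rather than of memorized frames.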
**Metrics:** Given our assumption that precisely one desired instance is present in the query image, we default to selecting the Top-1 proposal as the predicted result. We report the average recall (AR) rate [44] across various Intersection over Union (IoU) thresholds, including mAR (IoU ∈ [0.5, 0.95]), AR 50 (IoU = 0.5), AR 75 (IoU = 0.75), and AR 95 (IoU = 0.95). It is important to note that AR is equivalent to average precision (AP) in our specific evaluation context.
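Under the Top-1 protocol, the reported recalls can be computed from one IoU value per query image. The sketch below assumes that mAR averages over the ten thresholds 0.50, 0.55, ..., 0.95 (the COCO-style convention). That step size is an assumption, and the function names are hypothetical.

```python
def recall_at(ious, threshold):
    """Fraction of queries whose Top-1 box reaches the IoU threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_average_recall(ious):
    """Average recall over IoU thresholds 0.50, 0.55, ..., 0.95 (assumed step 0.05)."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(recall_at(ious, t) for t in thresholds) / len(thresholds)


# Top-1 IoU with the ground-truth box for three hypothetical query images.
top1_ious = [1.0, 0.62, 0.0]
ar50 = recall_at(top1_ious, 0.50)       # 2 of 3 queries
ar75 = recall_at(top1_ious, 0.75)       # 1 of 3 queries
mar = mean_average_recall(top1_ious)    # 13 / 30
```

With exactly one prediction and one ground-truth box per image, the number of predictions equals the number of targets, so the fraction of correct predictions (precision) and the fraction of recovered targets (recall) coincide.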

Section: Quantitative Results
**Overall Performance Comparison:** On the synthetic-real datasets (LM-O and YCB-V), our comparison with all baselines, detailed in Table 1, shows that VoxDet delivers superior performance across most settings. Notably, VoxDet surpasses the next best baseline, OLN DINO [14,9], by up to 20% in terms of average mAR, which highlights the efficacy of our 3D geometry-aware approach. Furthermore, due to its compact voxel representation, VoxDet is also markedly more efficient. On the newly introduced fully real dataset, RoboTools, we restrict comparisons to methods trained exclusively on the same synthetic dataset for fairness. As shown in Table 2, VoxDet demonstrates superior sim-to-real transfer capability compared to 2D-based methods, which we attribute to its 3D voxel representation. A more extensive comparison with models trained on real images is provided in Appendix D.
**Efficiency Comparison:** The Query Voxel Matching (QVM) module, with its inherently lower model complexity compared to OLN CLIP [14,19] and OLN DINO [14,9], achieves significantly faster inference speeds, as detailed in Table 3. When compared to correlation-based matching [5], VoxDet efficiently aggregates multi-view templates into a single compact voxel, thereby eliminating the need for exhaustive 2D correlation and achieving approximately 2× faster speed.
In addition to inference speed, VoxDet also demonstrates superior efficiency concerning the number of required templates. We evaluated the methods on the YCB-V dataset [18] using fewer templates than the default setting. As illustrated in Fig. 4, we found that 2D baselines are highly sensitive to the number of provided references; their performance may plummet by as much as 87% when the number of templates is reduced from 10 to 2. In contrast, VoxDet exhibits a degradation rate that is 2× less severe. We attribute this robustness to the learned 2D-3D mapping, which effectively incorporates 3D geometry even with very few views.
**Top-K Analysis:** Compared to the category-level method OS2D [7], VoxDet produces considerably fewer false positives among its Top-10 candidates. As depicted in Fig. 5, we considered Top-K = {1, 5, 10, 20, 30, 50, 100} proposals and compared the corresponding AR between VoxDet and OS2D. VoxDet's AR only declines by 5-10% when K decreases from 100 to 10, whereas OS2D's AR suffers a substantial drop of up to 38%. This suggests that over 90% of VoxDet's true positives are found among its Top-10 candidates, whereas this ratio is only around 60% for OS2D, highlighting VoxDet's superior precision in retrieving the target instance.
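One way to compute a Top-K recall of this kind is to count a query as a hit when any of its K highest-ranked proposals overlaps the target sufficiently. This is a sketch under that assumption; the exact protocol behind Fig. 5 is not spelled out in the text, and the function name and IoU threshold are hypothetical.

```python
def recall_at_k(ranked_ious, k, iou_thr=0.5):
    """ranked_ious[q]: IoUs of query q's proposals with its target, best score first."""
    hits = sum(any(iou >= iou_thr for iou in ious[:k]) for ious in ranked_ious)
    return hits / len(ranked_ious)


ranked = [
    [0.9, 0.1, 0.0],  # target found at rank 1
    [0.2, 0.1, 0.7],  # target found only at rank 3
    [0.0, 0.3, 0.1],  # target never found
]
r1 = recall_at_k(ranked, k=1)   # 1 of 3 queries
r3 = recall_at_k(ranked, k=3)   # 2 of 3 queries
share_in_top1 = r1 / r3         # share of all Top-3 hits already present at rank 1
```

The ratio in the last line mirrors the argument in the text: comparing recall at a small K with recall at a large K indicates how many of a method's true positives are concentrated at the top of its ranking.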
**Ablation Studies:** The results of our comprehensive ablation studies are presented in Table 4. Initially, we explored using 3D depth-wise convolution for matching (see the fourth row), but this approach proved inferior to our proposed instance-level voxel relation. Reconstruction pre-training is crucial for VoxDet's ability to learn to encode the geometry of an instance (see the last row). Additionally, we conducted an ablation on the rotation measurement module (R) within QVM and also experimented with not supervising the predicted rotation. Both alternative settings yielded inferior results compared to our default configurations, underscoring the importance of these components.

Section: Acknowledgments
This work was sponsored by SONY Corporation of America #1012409. This work used Bridges-2 at PSC through allocation cis220039p from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program which is supported by NSF grants #2138259, #2138286, #2138307, #2137603, and #213296. The authors would also like to express their sincere gratitude to the developers of BlenderProc2 [43].

Section: Deep Voxel and Activation Visualization
Figure 7: Visualization of the highly activated grid cells during matching. As the query instance rotates about a given axis, the locations of the highly activated cells rotate roughly in the corresponding direction, demonstrating geometry awareness.

Section: Qualitative Results
**Detection Visualization:** The qualitative comparison is depicted in Fig. 6, where we compare VoxDet with two of the most robust baselines, OLN DINO and OLN Corr. We observe that 2D methods can easily falter if the pose of an instance is not explicitly present in the reference images (e.g., the second query image in the first row), whereas VoxDet accurately identifies it. Furthermore, 2D matching exhibits less robustness under occlusion, where the instance's appearance can significantly differ. VoxDet effectively overcomes these challenges, thanks to its learned 3D geometry. More detailed visualizations and qualitative comparisons are provided in Appendix C.
**Deep Voxel Visualization:** To further validate the geometry-awareness of our learned voxel representation, we present a deep visualization in Fig. 7. Here, the gradient of the matching score is backpropagated to the template voxel, and we visualize the activation value of each grid. Surprisingly, we discover that as the orientation of the query instance changes, the activated regions within our voxel representations accurately mirror the true rotation. This compellingly demonstrates that the voxel representation in VoxDet is inherently aware of the instance's orientation.
**Reconstruction Visualization:** The voxel representation in VoxDet can be decoded to synthesize novel views, even for previously unseen instances, as demonstrated in Fig. 8. The voxel, pre-trained on 9,600 instances, is capable of approximately reconstructing the geometry of novel instances, providing strong evidence for the embedded geometric knowledge.

Section: Discussions and Conclusion
**Conclusion:** This work introduces VoxDet, a novel and highly effective approach for detecting novel instances using multi-view reference images. VoxDet stands as a pioneering 3D-aware framework that inherently exhibits superior robustness to occlusions and significant pose variations. VoxDet's crucial contributions and insights stem from its geometry-aware Template Voxel Aggregation (TVA) module and a meticulously designed Query Voxel Matching (QVM) module, both specifically tailored for instance-level tasks. Owing to the learned instance geometry encoded in TVA and the robust matching mechanism in QVM, VoxDet significantly outperforms various 2D baselines and offers notably faster inference speed. Beyond these methodological advancements, we also introduce the first comprehensive instance detection training set, OWID-10k, and a challenging real-world RoboTools benchmark, fostering future research in this critical area.
**Limitations:** Despite its substantial strengths, VoxDet currently presents two potential limitations. Firstly, models trained exclusively on the synthetic OWID dataset may exhibit a domain gap when deployed in real-world scenarios, with further details presented in Appendix D. Secondly, our current framework assumes that the relative rotation matrices and instance masks (bounding boxes) for the reference images are known. While obtaining these may not always be straightforward in unconstrained environments, we demonstrate that the TVA module in VoxDet does not require extremely accurate rotation or 2D appearance information to function effectively. We present further experiments addressing the robustness of VoxDet to these input imperfections in Appendix E.

Section: Supplementary Material
To ensure the full reproducibility of our model, we present complete implementation details in Appendix A. Our code library will be publicly released upon acceptance. We provide more extensive comparisons between our QVM module and various 2D matching/relation techniques [1,5,45] in Appendix B, further demonstrating QVM's superiority in instance-level 3D matching. Appendix C contains additional detection qualitative results. We also present a more in-depth discussion regarding the sim-to-real domain gap of VoxDet in Appendix D. To rigorously test the robustness of VoxDet under interference with the voxel representation, we display results obtained from intentionally flawed voxels in Appendix E. Finally, Appendix F provides extended discussions on related works, where we exhaustively compare VoxDet with existing instance-level tasks, including visual tracking, instance pose estimation, and instance retrieval.

Section: A. Implementation Details
**Model Structure:** We adopt ResNet50 [46] with a Feature Pyramid Network (FPN) [26] as our feature extractor $\psi(\cdot)$. The default multi-scale ROIAlign mechanism from [26] is leveraged to obtain the 2D proposal features, with dimensions set to $N = 500$, $C = 256$, and $w = 7$. In our 2D-3D mapping, we set $C/d = 32$ and $d = 8$, which results in a voxel feature dimension of $C_v = 256$ and spatial dimensions $D = 16$, $L = 14$. All 3D convolutions within TVA and QVM utilize a kernel size of 3 and padding of 1, ensuring that the dimensions of the voxels remain consistent throughout both modules. For the $\mathrm{Rot}(\cdot, \cdot)$ function, we follow [10] by employing `torch.nn.functional.affine_grid()` and `torch.nn.functional.grid_sample()` functionalities. While the 2D-3D mapping effectively learns rotations in the physical world, it inherently sacrifices some semantic information in the feature channels during reshaping. Therefore, in QVM, we incorporate a global matching branch to retrieve this potentially lost semantic information. Specifically, we apply global average pooling on the support features to obtain a support vector $k \in \mathbb{R}^{1 \times C \times 1 \times 1}$. We then perform depth-wise convolution between $k$ and $F^Q$ to generate a correlation map. Crucially, this correlation map preserves all semantic channels from the backbone $\psi(\cdot)$, compensating for information lost in the 2D-3D mapping. This map is then added to the voxel relation output $R_v(V^S, \mathrm{Rot}(V^Q, \hat{R}^Q))$ to compute the final score.
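The global matching branch described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names `global_average_pool` and `depthwise_correlation` are hypothetical, features are nested lists indexed `[channel][row][col]`, and the depth-wise convolution is written for the 1×1 kernel case, where it reduces to scaling each query channel by the matching entry of the support vector $k$.

```python
def global_average_pool(feat):
    """Average each channel of a C x H x W feature map (nested lists) to a scalar."""
    pooled = []
    for channel in feat:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def depthwise_correlation(kernel, query):
    """Depth-wise correlation with a 1x1 kernel per channel.

    kernel: list of C scalars (the pooled support vector k).
    query:  C x H x W query feature map.
    Each channel is filtered only by its own kernel entry, so the channel
    count C is preserved in the output.
    """
    if len(kernel) != len(query):
        raise ValueError("kernel and query must have the same number of channels")
    return [
        [[k_c * v for v in row] for row in channel]
        for k_c, channel in zip(kernel, query)
    ]


# Toy example with C = 2 channels and 2 x 2 spatial size (assumed values).
support = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
query = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]

k = global_average_pool(support)          # one scalar per channel
corr = depthwise_correlation(k, query)    # same shape as `query`
```

Because every channel is handled independently, no semantic channel is mixed away or dropped, which is the property the text relies on when adding this map to the voxel relation output.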
**Training Details:** In the first reconstruction stage, we set the loss weights as $w_{recon} = 10.0$, $w_{gan} = 0.01$, and $w_{percep} = 1.0$. The model is trained for 16 epochs on 9,600 instances from the OWID dataset. We leverage the Adam optimizer [47] with a base learning rate of $5 \times 10^{-5}$ during this training phase. In the second detection stage, we initialize the 2D-3D mapping modules in TVA and QVM with the weights pre-trained during reconstruction. VoxDet first learns the detection task without rotation estimation; specifically, the loss weights are set as $w_1 = w_2 = w_3 = w_4 = w_5 = 1.0$ and $w_6 = 0$ for the initial 10 epochs, using SGD as the optimizer with a base learning rate of 0.02. During this stage, the 2D-3D mapping part is trained with 1/10th of the base learning rate. In the final epoch, VoxDet learns rotation estimation while keeping the detection part fixed (i.e., $w_1 = w_2 = w_3 = w_4 = w_5 = 0.0$, $w_6 = 1.0$). Supervising rotation is optional for VoxDet, but it typically improves performance slightly, by 1-2%.
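The two-phase detection schedule can be written down as a small configuration sketch. The names `loss_weights`, `total_loss`, and the constants below are hypothetical, and the 0-indexed, 11-epoch layout (10 detection epochs followed by one rotation epoch) is one reading of the description above rather than a confirmed detail of the training code.

```python
BASE_LR = 0.02            # SGD base learning rate for the detection stage
MAPPING_LR_SCALE = 0.1    # the 2D-3D mapping uses 1/10th of the base rate
DETECTION_EPOCHS = 10     # epochs trained with the rotation loss switched off


def loss_weights(epoch):
    """Return (w1, ..., w6) for a 0-indexed epoch of the detection stage.

    Epochs 0..9 train detection only (w6 = 0); the following, final epoch
    trains rotation estimation only (w1..w5 = 0, w6 = 1).
    """
    if epoch < 0 or epoch > DETECTION_EPOCHS:
        raise ValueError("epoch outside the assumed 11-epoch schedule")
    if epoch < DETECTION_EPOCHS:
        return (1.0, 1.0, 1.0, 1.0, 1.0, 0.0)
    return (0.0, 0.0, 0.0, 0.0, 0.0, 1.0)


def total_loss(losses, epoch):
    """Weighted sum of the six loss terms of the detection objective."""
    return sum(w * l for w, l in zip(loss_weights(epoch), losses))


mapping_lr = BASE_LR * MAPPING_LR_SCALE
```

With this layout, the detection terms contribute nothing in the last epoch, which matches the statement that the detection part is kept fixed while rotation estimation is learned.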

Section: B. More Matching Module Comparisons
Table 5: Comparison with different types of matching modules. We compare QVM with the correlation in [5], class-level relation proposed in [1], and the class distance defined in FSDet [45].
We compare QVM with more matching techniques in Table 5, where the averaged results on the cluttered LM-O [17] and RoboTools benchmark are reported. We first ablate the Voxel Relation module in QVM, which results in QVM†. Specifically, all Voxel Relation operations in QVM† are replaced by a simple depth-wise convolution. This involves first applying global average pooling on the template voxel to obtain a feature vector, which is then used as the convolution kernel to calculate the correlation voxel from the queries. We observe that such a naive design leads to a noticeable performance drop, validating the efficacy of our Voxel Relation operator. For all other methods, we utilized the same open-world detector to generate universal proposals, which are then matched with the template images using various matching techniques. To be more specific:
*   **2D Corr. [5]:** This method constructs support vectors from every reference image. Subsequently, depth-wise convolution is performed between each support vector and the proposal patch. The resulting correlation maps are then fed into an MLP for classification score prediction.
*   **2D Relation [1]:** In this approach, we substitute the simple depth-wise convolution used in 2D Corr. with the spatial and channel relation mechanism proposed in [1].
*   **FSDet [45]:** Here, the depth-wise convolution in 2D Corr. is replaced by the distance metric defined in [45].
Since all these 2D techniques are inherently geometry-unaware, we find that they consistently perform worse than our proposed QVM module. Additionally, we designed a **Local Matching** baseline [49,48]. In Local Matching, we first extract local key points from the reference images and proposals using SuperPoint [49]. These point descriptors are then matched by SuperGlue [48]. We take the mean matching score of all points within a proposal as its classification score. We observe that such an implementation, although geometry-invariant at the local feature level, falls short in our instance detection task because it lacks a holistic semantic representation of the entire instance.
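The scoring rule of the Local Matching baseline, averaging the matching scores of the points inside a proposal, can be sketched as follows. This is an illustrative stand-in only: it does not call SuperPoint or SuperGlue, the function name `proposal_score` and the `(x, y, score)` tuple layout are hypothetical, and returning 0.0 for a proposal without key points is an assumed fallback.

```python
def proposal_score(box, keypoints):
    """Mean matching score of the key points that fall inside a proposal box.

    box:       (x_min, y_min, x_max, y_max) in pixels.
    keypoints: iterable of (x, y, score), where `score` is the confidence
               assigned to that point's match against the reference images.
    Returns 0.0 when no key point lies inside the box (an assumed fallback).
    """
    x_min, y_min, x_max, y_max = box
    inside = [
        s for x, y, s in keypoints
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]
    return sum(inside) / len(inside) if inside else 0.0


# Toy key points and proposals (assumed values).
keypoints = [(10, 10, 0.9), (12, 14, 0.5), (80, 80, 0.1)]
proposals = [(0, 0, 20, 20), (70, 70, 90, 90), (30, 30, 40, 40)]
scores = [proposal_score(b, keypoints) for b in proposals]
```

The score depends only on individual point matches, so a proposal with a few well-matched local points can outrank the true instance, which illustrates the missing holistic representation noted above.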

Section: C. More Detection Visualizations
We present more detailed detection qualitative comparisons in Fig. 9. VoxDet, highlighted in red, is compared against three prominent baselines: DTOID [6], Gen6D [5], and OLN DINO [14,9]. Compared with previous instance detectors [6,5], VoxDet demonstrates superior robustness under significant orientation variations and severe occlusions, a direct benefit of its learned geometric knowledge. For example, in the LM-O benchmark (second column), when the duck is partially occluded and the egg box is presented in different orientations, VoxDet accurately identifies them, whereas Gen6D fails. Furthermore, compared with similarity matching approaches [9], VoxDet exhibits a stronger capability to distinguish between visually similar yet geometrically distinct instances via its QVM module. For instance, in the RoboTools benchmark (third column), the desired instance might be confused with a motor that possesses similar appearances but fundamentally different geometry. Our VoxDet effectively discovers such geometric differences and makes correct classifications, while similarity matching approaches fall short, even when utilizing features from powerful backbones like DINO [9] (which is stronger than ResNet50 [46]).

Section: D. Sim-to-Real Comparison
VoxDet is exclusively trained on our synthetic dataset, OWID-10k. We observe that the model exhibits a domain gap when directly transferred to real-world images, particularly evident on the RoboTools benchmark. On the synthetic-real datasets, LM-O [17] and YCB-V [18], our model consistently outperforms methods trained on real images. However, it shows certain limitations on the fully real RoboTools test set. For instance, Gen6D [5], primarily trained on real images, reports 17.0 mAR, 35.5 AR50, and 14.3 AR75. While its AR50 is higher than VoxDet's (23.6), our model performs better on harder metrics like AR75 (20.5). When compared with cutting-edge foundation models trained on large-scale real images, our model still presents opportunities for improvement. For example, OLN CLIP [14,19] achieves 11.0 mAR, 20.8 AR50, and 9.2 AR75, which is lower than VoxDet. However, OLN DINO [14,9] can outperform VoxDet on RoboTools with over 30 mAR. We conclude that leveraging the powerful feature representations from concurrent 2D foundation models [13] could serve as a stronger backbone for VoxDet to mitigate the domain gap issue. Developing a geometry-aware, robust voxel representation learned from such foundation models represents a promising direction for our future work. It is also important to acknowledge that VoxDet currently assumes known instance masks and poses for the reference video, which may introduce noise during real-world deployment.

Section: E. Performance under Flawed Voxel Inputs
To quantitatively analyze the robustness of VoxDet under imperfect voxel inputs, we present its performance on the RoboTools benchmark when the reference video is subjected to disturbances in both appearance and geometry.
**Adding Noise to Reference Image Patches:** We introduced random shifts to the cropped areas within the reference images, leading to inaccurate instance appearance. The results on RoboTools are presented in Table 6. We conclude that even when approximately 65% of the voxel is disturbed (corresponding to a 30% shift on each 2D patch), the model maintains effective performance. This demonstrates VoxDet's inherent robustness to appearance noise.
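The 65% figure is consistent with a simple geometric reading, which the text does not spell out and which is assumed here: if a 30% shift is treated as removing 30% of the overlap along each of the three voxel axes, the undisturbed fraction is $0.7^3 \approx 0.343$. The helper name `disturbed_fraction` is hypothetical.

```python
def disturbed_fraction(shift, dims=3):
    """Fraction of a unit cube that no longer overlaps after shifting it by
    `shift` (as a fraction of its side length) along each of `dims` axes."""
    if not 0.0 <= shift <= 1.0:
        raise ValueError("shift must lie in [0, 1]")
    return 1.0 - (1.0 - shift) ** dims


fraction_3d = disturbed_fraction(0.30)          # about 0.657
fraction_2d = disturbed_fraction(0.30, dims=2)  # about 0.51
```

Under this assumption, the same 30% shift disturbs roughly half of a 2D patch but about two thirds of the corresponding voxel.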
**Adding Noise to Relative Poses:** We also introduced random angular errors to the poses of the reference images, resulting in inaccurate instance geometry. When we applied a significant angular error of up to 15 degrees, the performance (AR50) decreased from 23.6 to 20.4. This indicates that VoxDet is not overly sensitive to moderate levels of geometric noise in the input poses, further highlighting its robustness.

Section: F. Extended Related Works
**Visual Object Tracking:** Visual object tracking aims to localize a general target instance within a video, given its initial state in the first frame. Early methods adopted discriminative correlation filters [50-52], where frequency domain calculations enabled real-time performance on a single CPU. More recently, advancements have been made with methods based on Siamese Networks [53] and Transformers [54-56]. Unlike object detection, object tracking relies on a strong temporal consistency assumption; specifically, the location and appearance of the instance in subsequent frames are not expected to vary significantly from the previous frame. Consequently, these methods typically perform detection or matching within a small search region using a single 2D template, which is fundamentally unsuited for our whole-image novel instance detection setting.


References:
[1] B Li; C Wang; P Reddy; S Kim; S Scherer (2022). AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration.
[2] H Hu; S Bai; A Li; J Cui; L Wang (2021-06). Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection.
[3] Y Li; H Mao; R Girshick; K He (2022). Exploring Plain Vision Transformer Backbones for Object Detection.
[4] I Misra; R Girdhar; A Joulin (2021). An End-to-End Transformer Model for 3D Object Detection.
[5] Y Liu; Y Wen; S Peng; C Lin; X Long; T Komura; W Wang (2022). Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images.
[6] J.-P Mercier; M Garon; P Giguere; J.-F Lalonde (2021). Deep Template-based Object Instance Detection.
[7] A Osokin; D Sumin; V Lomakin (2020). OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features.
[8] P Ammirato; C.-Y Fu; M Shvets; J Kosecka; A C Berg (2018). Target Driven Instance Detection.
[9] M Caron; H Touvron; I Misra; H Jégou; J Mairal; P Bojanowski; A Joulin (2021). Emerging Properties in Self-supervised Vision Transformers.
[10] Z Lai; S Liu; A A Efros; X Wang (2021). Video Autoencoder: Self-Supervised Disentanglement of Static 3D Structure and Motion.
[11] H.-Y F Tung; R Cheng; K Fragkiadaki (2019). Learning Spatial Common Sense with Geometry-Aware Recurrent Networks.
[12] T Nguyen-Phuoc; C Li; L Theis; C Richardt; Y.-L Yang (2019). HoloGAN: Unsupervised Learning of 3D Representations from Natural Images.
[13] M Oquab; T Darcet; T Moutakanni; H Vo; M Szafraniec; V Khalidov; P Fernandez; D Haziza; F Massa; A El-Nouby (2023). DINOv2: Learning Robust Visual Features without Supervision.
[14] D Kim; T.-Y Lin; A Angelova; I S Kweon; W Kuo (2022). Learning Open-World Object Proposals without Learning to Classify. IEEE Robotics and Automation Letters
[15] A X Chang; T Funkhouser; L Guibas; P Hanrahan; Q Huang; Z Li; S Savarese; M Savva; S Song; H Su (2015). ShapeNet: An Information-Rich 3D Model Repository.
[16] J Collins; S Goel; K Deng; A Luthra; L Xu; E Gundogdu; X Zhang; T F Y Vicente; T Dideriksen; H Arora (2022). ABO: Dataset and Benchmarks for Real-World 3D Object Understanding.
[17] E Brachmann; A Krull; F Michel; S Gumhold; J Shotton; C Rother (2014). Learning 6D Object Pose Estimation using 3D Object Coordinates.
[18] B Calli; A Singh; A Walsman; S Srinivasa; P Abbeel; A M Dollar (2015). The YCB Object and Model Set: Towards Common Benchmarks for Manipulation Research.
[19] A Radford; J W Kim; C Hallacy; A Ramesh; G Goh; S Agarwal; G Sastry; A Askell; P Mishkin; J Clark (2021). Learning Transferable Visual Models from Natural Language Supervision.
[20] R Girshick; J Donahue; T Darrell; J Malik (2014). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
[21] R Girshick (2015). Fast R-CNN.
[22] S Ren; K He; R Girshick; J Sun (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence
[23] J Redmon; S Divvala; R Girshick; A Farhadi (2016). You Only Look Once: Unified, Real-time Object Detection.
[24] J Redmon; A Farhadi (2017). YOLO9000: Better, Faster, Stronger.
[25] J Redmon; A Farhadi (2018). YOLOv3: An Incremental Improvement.
[26] T.-Y Lin; P Dollár; R Girshick; K He; B Hariharan; S Belongie (2017). Feature Pyramid Networks for Object Detection.
[27] L Qiao; Y Zhao; Z Li; X Qiu; J Wu; C Zhang (2021). DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection.
[28] X Wang; T Huang; J Gonzalez; T Darrell; F Yu (2020). Frustratingly Simple Few-Shot Object Detection.
[29] Y Xiao; V Lepetit; R Marlet (2022). Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence
[30] K Joseph; S Khan; F S Khan; V N Balasubramanian (2021). Towards Open World Object Detection.
[31] A Gupta; S Narayan; K Joseph; S Khan; F S Khan; M Shah (2022). OW-DETR: Open-World Detection Transformer.
[32] A Kirillov; E Mintun; N Ravi; H Mao; C Rolland; L Gustafson; T Xiao; S Whitehead; A C Berg; W.-Y Lo (2023). Segment Anything.
[33] J L Schonberger; J.-M Frahm (2016). Structure-from-Motion Revisited.
[34] A W Harley; S K Lakshmikanth; F Li; X Zhou; H.-Y F Tung; K Fragkiadaki (2019). Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping.
[35] V Sitzmann; J Thies; F Heide; M Nießner; G Wetzstein; M Zollhofer (2019). DeepVoxels: Learning Persistent 3D Feature Embeddings.
[36] B Mildenhall; P P Srinivasan; M Tancik; J T Barron; R Ramamoorthi; R Ng (2021). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM
[37] V Sitzmann; M Zollhöfer; G Wetzstein (2019). Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations.
[38] R Yang; G Yang; X Wang (2023). Neural Volumetric Memory for Visual Locomotion Control.
[39] W Wang; Y Hu; S Scherer (2021). TartanVO: A Generalizable Learning-based VO.
[40] Y Zhou; C Barnes; J Lu; J Yang; H Li (2019). On the Continuity of Rotation Representations in Neural Networks.
[41] K Simonyan; A Zisserman (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition.
[42] H Yang; S Cai; H Sheng; B Deng; J Huang; X.-S Hua; Y Tang; Y Zhang (2022). Balanced and Hierarchical Relation Learning for One-Shot Object Detection.
[43] M Denninger; M Sundermeyer; D Winkelbauer; Y Zidan; D Olefir; M Elbadrawy; A Lodhi; H Katam (2019). BlenderProc.
[44] T.-Y Lin; M Maire; S Belongie; J Hays; P Perona; D Ramanan; P Dollár; C L Zitnick (2014). Microsoft COCO: Common Objects in Context.
[45] Y Xiao; R Marlet (2020). Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild.
[46] K He; X Zhang; S Ren; J Sun (2016). Deep Residual Learning for Image Recognition.
[47] D P Kingma; J Ba (2014). Adam: A Method for Stochastic Optimization.
[48] P.-E Sarlin; D Detone; T Malisiewicz; A Rabinovich (2020). SuperGlue: Learning Feature Matching with Graph Neural Networks.
[49] D Detone; T Malisiewicz; A Rabinovich (2018). SuperPoint: Self-Supervised Interest Point Detection and Description.
[50] J F Henriques; R Caseiro; P Martins; J Batista (2015). High-Speed Tracking with Kernelized Correlation Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence
[51] Y Li; C Fu; F Ding; Z Huang; G Lu (2020). AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization.
[52] B Li; C Fu; F Ding; J Ye; F Lin (2021). ADTrack: Target-Aware Dual Filter Learning for Real-Time Anti-Dark UAV Tracking.
[53] B Li; W Wu; Q Wang; F Zhang; J Xing; J Yan (2019). SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks.
[54] Z Cao; C Fu; J Ye; B Li; Y Li (2021). HiFT: Hierarchical Feature Transformer for Aerial Tracking.
[55] B Ye; H Chang; B Ma; S Shan; X Chen (2022). Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.
[56] Y Cui; C Jiang; L Wang; G Wu (2022). MixFormer: End-to-End Tracking with Iterative Mixed Attention.
[57] J Sun; Z Wang; S Zhang; X He; H Zhao; G Zhang; X Zhou (2022). OnePose: One-Shot Object Pose Estimation without CAD Models.
[58] Y He; Y Wang; H Fan; J Sun; Q Chen (2022). FS6D: Few-Shot 6D Pose Estimation of Novel Objects.
[59] Q Gu; B Okorn; D Held (2022). OSSID: Online Self-Supervised Instance Detection by (and for) Pose Estimation. IEEE Robotics and Automation Letters
[60] W Chen; Y Liu; W Wang; E M Bakker; T Georgiou; P Fieguth; L Liu; M S Lew (2022). Deep Learning for Instance Retrieval: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence
[61] A Babenko; V Lempitsky (2015). Aggregating Local Deep Features for Image Retrieval.
[62] R Arandjelovic; P Gronat; A Torii; T Pajdla; J Sivic (2016). NetVLAD: CNN Architecture for Weakly Supervised Place Recognition.
[63] T Ng; V Balntas; Y Tian; K Mikolajczyk (2020). SOLAR: Second-Order Loss and Attention for Image Retrieval.
[64] A Gordo; J Almazán; J Revaud; D Larlus (2016). Deep Image Retrieval: Learning Global Representations for Image Search.
[65] M Yang; D He; M Fan; B Shi; X Xue; F Li; E Ding; J Huang (2021). DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features.

Figures:
Figure fig_1: 2
Type: figure
Caption: Figure 2: Architecture of VoxDet. VoxDet mainly consists of three modules, namely, open-world detection, template voxel aggregation (TVA), and query voxel matching (QVM). We first train TVA via the reconstruction stage, where the 2D-3D mapping learns to encode instance geometry. Then the pre-trained mapping serves as initial weights in the TVA and QVM modules for detection training.
Data: 

Figure fig_2: 3
Type: figure
Caption: Figure 3: The instances and test scenes in the newly built RoboTools benchmark. The 20 unique instances are recorded as multi-view videos, where the relative camera poses between frames are provided. RoboTools consists of various challenging scenarios, where the desired instance could be under severe occlusion or in different orientations.
Data: 

Figure fig_3: 1
Type: figure
Caption: Remark 1: In both training stages, we only use the training objects, $O_{base}$. During inference, VoxDet does not need any further fine-tuning or optimization for $O_{novel}$. Our research employs datasets composed of distinct training and test sets, adhering to $O_{base} \cap O_{novel} = \emptyset$ to ensure no overlap between semantic classes of $O_{base}$ and $O_{novel}$.
Data: 

Figure fig_4: 45
Type: figure
Caption: Figure 4: Number-of-templates analysis of VoxDet and the 2D baseline, OLN DINO [14,9], on the YCB-V benchmark. Thanks to the learned geometry-aware 2D-3D mapping, VoxDet can work well with very few reference images, while the 2D method suffers in this setting, dropping up to 87%. Figure 5: Top-K analysis on the YCB-V benchmark.
Data: 

Figure fig_5: 6
Type: figure
Caption: Figure 6: Detection qualitative results comparison between VoxDet and 2D baselines on the three benchmarks. VoxDet shows better robustness under pose variance (e.g., Obj. 5@LM-O, first and second columns) and occlusion (e.g., Obj. 13@YCB-V, second column, and Obj. 9@RoboTools).
Data: 

Figure fig_6: 9
Type: figure
Caption: Figure 9: Detection qualitative results comparison between VoxDet and 2D baselines, DTOID [6], Gen6D [5], and OLN DINO [14,9], on the three benchmarks. VoxDet shows better robustness under pose variance and occlusion. These qualitative comparisons can be better visualized in our supplementary video.
Data: 

Figure tab_0:
Type: table
Caption: Table 1: Overall performance comparison on synthetic-real datasets LM-O [17] and YCB-V [18]. Compared with various 2D methods, including correlation [5], attention [6], and feature matching [9,19], our VoxDet holds superiority in both accuracy and efficiency. OLN* means the open-world object detector (OW Det.) [14] is jointly trained with the matching head while OLN denotes using fixed modules. † the model is trained on both synthetic dataset OWID and real images.
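Data:
| Method | OW Det. | Train | LM-O mAR | LM-O AR50 | LM-O AR75 | LM-O AR95 | YCB-V mAR | YCB-V AR50 | YCB-V AR75 | YCB-V AR95 | Avg. mAR | Avg. AR50 | Avg. AR75 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VoxDet | OLN* | OWID | 29.2 | 43.1 | 33.3 | 0.8 | 31.5 | 51.3 | 33.4 | 1.7 | 30.4 | 47.2 | 33.4 | 6.5 |
| OLN Corr. [14, 5] | OLN* | OWID | 22.3 | 34.4 | 24.7 | 0.5 | 24.8 | 41.1 | 26.1 | 0.7 | 23.6 | 37.8 | 25.4 | 5.5 |
| DTOID [6] | N/A | OWID | 9.8 | 28.9 | 3.7 | <0.1 | 16.3 | 48.8 | 4.2 | <0.1 | 13.1 | 38.9 | 4.0 | 2.8 |
| OS2D [7] | N/A | OWID | 0.2 | 0.7 | 0.1 | <0.1 | 5.2 | 18.3 | 1.9 | <0.1 | 2.7 | 9.5 | 1.0 | 5.3 |
| OLN CLIP [14, 19] | OLN | OWID † | 16.2 | 32.1 | 15.3 | 0.5 | 10.7 | 25.4 | 7.3 | 0.2 | 13.5 | 28.8 | 11.3 | 2.8 |
| OLN DINO [14, 9] | OLN | OWID † | 23.6 | 41.6 | 24.8 | 0.6 | 25.6 | 53.0 | 21.1 | 0.8 | 24.6 | 47.3 | 23.0 | 2.8 |
| Gen6D [5] | N/A | OWID † | 12.0 | 29.8 | 6.6 | <0.1 | 12.1 | 37.1 | 5.2 | <0.1 | 12.1 | 33.5 | 5.9 | 1.3 |
| BHRL [42] | N/A | COCO | 14.1 | 21.0 | 15.7 | 0.5 | 31.8 | 47.0 | 34.8 | 1.4 | 23.0 | 34.0 | 25.3 | N/A |

Table 2: Overall performance comparison on the newly built real image dataset, RoboTools. For fairness, we only compare with the models fully trained on synthetic dataset here; for more comparisons, see Appendix D. VoxDet shows superiority even under sim-to-real domain gap compared with other 2D representation-based methods [14, 5-7].
| Method | OW Det. | mAR | AR50 | AR75 | AR95 |
|---|---|---|---|---|---|
| VoxDet | OLN* | 18.7 | 23.6 | 20.5 | 5.1 |
| OLN Corr. [14, 5] | OLN* | 14.4 | 18.1 | 15.7 | 3.8 |
| DTOID [6] | N/A | 3.6 | 9.0 | 2.0 | <0.1 |
| OS2D [7] | N/A | 2.9 | 6.5 | 2.0 | <0.1 |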

Figure tab_1: 3
Type: table
Caption: Table 3: Per-module efficiency comparison. All four methods share the same open-world detector [14]. Compared with 2D baselines that adopt cosine similarity [9,19] or learnable correlation [5], our voxel matching is more efficient, showing ∼2× faster speed. The numbers presented below are measured in seconds.
Data: Method/Module; Open-World Det.; Matching; Total. VoxDet: 0.032, 0.154. OLN CLIP: 0.122

Figure tab_3: 4
Type: table
Caption: Table 4: Ablation study for VoxDet on the RoboTools benchmark. All three critical modules are helpful in our design. Supervising the estimated rotation achieves slightly better results. For comparisons with more matching modules, see Appendix B.
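Data:
| Recon. | R | R w/ sup. | Voxel Rel. | mAR | AR50 | AR75 |
|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 18.7 | 23.6 | 20.5 |
| ✓ | ✓ | ✗ | ✓ | 18.2 | 23.2 | 20.0 |
| ✓ | ✗ | ✗ | ✓ | 15.6 | 21.9 | 17.0 |
| ✓ | ✓ | ✓ | ✗ | 15.1 | 19.4 | 16.2 |
| ✗ | ✓ | ✓ | ✓ | 14.2 | 18.3 | 15.7 |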


Formulas:
Formula formula_0: $P = [p_1, p_2, \cdots, p_N] \in \mathbb{R}^{N \times 4}$

Formula formula_1: $F^Q = \mathrm{ROIAlign}(P, f^Q) \in \mathbb{R}^{N \times C \times w \times w}$

Formula formula_2: $v^S = \frac{1}{M} \sum_{i=1}^{M} \mathrm{Conv3D}(\mathrm{Rot}(V_i, R_i^{\top}))$, (1)

Formula formula_3: $V^Q = \mathcal{M}(F^Q) \in \mathbb{R}^{N \times C_v \times D \times L \times L}$.

Formula formula_4: $\mathrm{In}(v_1, v_2) = [v_1^1, v_2^1, v_1^2, v_2^2, \cdots, v_1^c, v_2^c] \in \mathbb{R}^{2c \times a \times a \times a}$, where $v_1^k, v_2^k$

Formula formula_5: $R_v(v_1, v_2) = \mathrm{Conv3D}(\mathrm{In}(v_1, v_2), \mathrm{group} = c)$.
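The channel interleaving $\mathrm{In}(\cdot, \cdot)$ and the grouped convolution with $c$ groups can be illustrated with a small plain-Python sketch. Several simplifications are assumed here and are not taken from the paper: each channel is flattened to a short list of voxel values, the convolution is reduced to a 1×1×1 kernel (the paper uses kernel size 3), and the names `interleave` and `grouped_pointwise_relation` are hypothetical.

```python
def interleave(v1, v2):
    """Interleave the channels of two voxels: [v1[0], v2[0], v1[1], v2[1], ...]."""
    if len(v1) != len(v2):
        raise ValueError("both voxels need the same number of channels")
    out = []
    for c1, c2 in zip(v1, v2):
        out.extend([c1, c2])
    return out


def grouped_pointwise_relation(stacked, weights):
    """Grouped convolution with c groups, reduced here to a 1x1x1 kernel.

    stacked: 2c channels from `interleave`, each a flat list of voxel values.
    weights: c pairs (a_k, b_k); group k mixes only channels 2k and 2k + 1,
             i.e. the k-th channel of each input voxel.
    """
    if len(stacked) != 2 * len(weights):
        raise ValueError("expected two stacked channels per weight pair")
    out = []
    for k, (a, b) in enumerate(weights):
        ch1, ch2 = stacked[2 * k], stacked[2 * k + 1]
        out.append([a * x + b * y for x, y in zip(ch1, ch2)])
    return out


# c = 2 channels, each flattened to 2 voxel values (toy sizes).
v1 = [[1.0, 2.0], [3.0, 4.0]]
v2 = [[10.0, 20.0], [30.0, 40.0]]
stacked = interleave(v1, v2)
relation = grouped_pointwise_relation(stacked, [(1.0, -1.0), (0.5, 0.5)])
```

Interleaving before grouping is what places the $k$-th channels of both voxels in the same group, so each output channel compares like with like.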

Formula formula_6: $\hat{R}^Q = \mathrm{MLP}(R_v(V^S, V^Q))$, (2)

Formula formula_7: $\hat{s} = \mathrm{MLP}(R_v(V^S, \mathrm{Rot}(V^Q, \hat{R}^Q)))$, (3)

Formula formula_8: $\hat{I}^S_{o,j} = \mathrm{Dec}(\mathrm{Rot}(V^S, R_j^{\top}))$, $j \in \{1, 2, \cdots, K\}$, (4)

Formula formula_9: $L_r = w_{recon} L_{recon} + w_{gan} L_{gan} + w_{percep} L_{percep}$, (5)

Formula formula_10: $L^{Ins}_{rot} = \| \hat{R}^Q R^{Q\top} - I \|$, (6)
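The rotation loss in Eq. (6) vanishes exactly when the estimated and ground-truth rotations agree, since a rotation multiplied by its own transpose is the identity. A minimal plain-Python sketch follows; the choice of the Frobenius norm is an assumption (the equation does not name the norm), and the helper names are hypothetical.

```python
import math


def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def matmul(a, b):
    """Multiply two 3x3 matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def rotation_loss(r_pred, r_true):
    """Frobenius norm of r_pred @ r_true^T - I (norm type is an assumption)."""
    prod = matmul(r_pred, transpose(r_true))
    return math.sqrt(sum(
        (prod[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(3) for j in range(3)
    ))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

loss_match = rotation_loss(rot_z_90, rot_z_90)   # prediction equals ground truth
loss_off = rotation_loss(rot_z_90, identity)     # prediction is off by 90 degrees
```

In this toy case the loss is zero for the matching pair and grows with the angular discrepancy for the mismatched pair.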

Formula formula_11: $L_d = w_1 L^{RPN}_{cls} + w_2 L^{RPN}_{reg} + w_3 L^{Head}_{cls} + w_4 L^{Head}_{reg} + w_5 L^{Ins}_{cls} + w_6 L^{Ins}_{rot}$, (7)
