['3c3', '< Abstract: Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional methodologies, which primarily rely on 2D representations and matching techniques, are often inadequate in handling pose variations and occlusions. To solve this, we introduce VoxDet, a pioneer 3D geometry-aware framework that fully utilizes the strong 3D voxel representation and reliable voxel matching mechanism. VoxDet first ingeniously proposes template voxel aggregation (TVA) module, effectively transforming multi-view 2D images into 3D voxel features. By leveraging associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation demonstrates heightened resilience to occlusion and pose variations. We also discover that a 3D reconstruction objective helps to pre-train the 2D-3D mapping in TVA. Second, to quickly align with the template voxel, VoxDet incorporates a Query Voxel Matching (QVM) module. The 2D queries are first converted into their voxel representation with the learned 2D-3D mapping. We find that since the 3D voxel representations encode the geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. In addition to method, we also introduce the first instance detection benchmark, RoboTools, where 20 unique instances are video-recorded with camera extrinsic. RoboTools also provides 24 challenging cluttered scenarios with more than 9k box annotations. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-video, and RoboTools benchmarks, where VoxDet outperforms various 2D baselines remarkably with faster speed. To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D novel instance detection tasks. 
Our code, data, raw results, and pre-trained models are public at https://github.com/Jaraxxus-Me/VoxDet.', '---', '> Abstract: Novel instance detection, the task of identifying specific unseen objects based on multi-view templates in open-world scenarios, presents significant challenges. Traditional 2D-centric methods often struggle with substantial pose variations and occlusions, limiting their real-world applicability. To overcome these limitations, we introduce **VoxDet**, a pioneering 3D geometry-aware framework. VoxDet uniquely leverages a robust 3D voxel representation and an efficient voxel matching mechanism, offering inherent resilience to appearance changes. First, our novel **Template Voxel Aggregation (TVA)** module ingeniously transforms multi-view 2D images into compact 3D template voxels. By leveraging associated camera poses, TVA aggregates 2D features into a unified 3D representation, which intrinsically provides heightened resilience to occlusion and pose variations. We further show that a self-supervised 3D reconstruction objective effectively pre-trains the critical 2D-3D mapping within TVA, enhancing its geometry-encoding capabilities. Second, the **Query Voxel Matching (QVM)** module enables rapid and accurate alignment. QVM converts 2D query features into their 3D voxel representation using the pre-trained 2D-3D mapping. By explicitly estimating relative rotation and aligning the voxels before comparison, QVM significantly improves matching accuracy and computational efficiency, directly benefiting from the geometry encoded in the 3D voxels. Beyond our methodology, we introduce **RoboTools**, the first comprehensive instance detection benchmark. RoboTools features 20 unique instances, video-recorded with precise camera extrinsics, across 24 challenging cluttered scenarios, totaling over 9,000 box annotations. 
Extensive experiments on LineMod-Occlusion, YCB-video, and our demanding RoboTools benchmarks demonstrate that VoxDet consistently and remarkably outperforms existing 2D baselines, while also achieving superior inference speeds. To the best of our knowledge, VoxDet is the first framework to explicitly incorporate 3D geometric knowledge for robust 2D novel instance detection. Our code, data, raw results, and pre-trained models are publicly available at https://github.com/Jaraxxus-Me/VoxDet.', '6c6', '< Consider the common scenarios of locating the second sock of a pair in a pile of laundry or identifying luggage amid hundreds of similar suitcases at an airport. These activities illustrate the remarkable capability of human cognition to swiftly and accurately identify a specific instance among other similar objects. Humans can rapidly create a mental picture of a novel instance with a few glances even if they see such an instance for the first time or have never seen instances of the same type. Searching for instances using mental pictures is a fundamental ability for humans, however, even the latest object detectors [1][2][3][4][5][6][7] still cannot achieve this task. We formulate the above tasks as novel instance detection, that is identification of an unseen instance in a cluttered query image, utilizing its multi-view support references. Previous attempts mainly work in 2D space, such as correlation [8,5], attention mechanisms [6], or similarity matching [9], thereby localizing and categorizing the desired instance, as depicted in Fig. 1 gray part. However, these techniques struggle to maintain their robustness when faced with significant disparities between the query and templates. In comparison to novel instance detection, there is a vast amount of work centered around few-shot category-level object detection [7,1,2]. Yet, these class-level matching techniques prove insufficient when it comes to discerning specific instance-level features. 
one-stage approaches. For the former one, RCNN [20] and its variants [21,22] serves as foundations, where the regions of interest (ROI) are first obtained by the region proposal network. Then the detection heads classify the labels of each ROI and regress the box coordinates. On the other hand, the YOLO series [23][24][25] and recent transformer-based methods [4,3] are developing promisingly as the latter stream, where the detection task is tackled as an end-to-end regression problem. Few-shot/One-shot object detection [1,27,28,2,29,7] can work for unseen classes with only a few labeled support samples, which are closer to our task. One stream focuses on transfer-learning techniques [28,27], where the fine-tuning stage is carefully designed to make the model quickly generalize to unseen classes. While the other resorts to meta-learning strategies [1,7,2,29], where various kinds of relations between supports and queries are discovered and leveraged. Since the above methods are category-level, they assume more than one desired instances exist in an image, so the classification/matching designs are usually tailored for Top-100 precision, which is not a very strict metric. However, they can easily fail in our problem, where the Top-1 accuracy is more important. Open-world/Zero-shot object detection [30][31][32]14] finds any objects on an image, which is class-agnostic and universal. Some of them learn objectiveness [30,14] and others [32] rely on large-scale high-quality training sets. These methods can serve as the first module in our pipeline, which generates object proposals for comparison with the templates. Among them, we adopt [14] with its simple structure and promising performance. Instance detection requires the algorithm to find an unseen instance in the test image with some corresponding templates. Previous methods [6,5,8] usually utilize pure 2D representations and 2D matching/relation techniques. 
For example, DTOID [6] proposed global object attention and a local pose-specific branch to predict the template-guided heatmap for detection. However, they easily fall short when the 2D appearance variates due to occlusion or pose variation. Differently, VoxDet leverages the explicit 3D knowledge in the multi-view templates to represent and match instances, which is geometry-invariant. Multi-view 3D representations Representing 3D scenes/instances from multi-view images is a long-standing problem in computer vision. Traditional methods resort to multi-view geometry, where structure from motion (SfM) [33] pipeline has enabled joint optimization of the camera pose and 3D structure. Modern methods usually adopts neural 3D representations [34, 11, 35-37, 12, 10], including deep voxels [35,12,10,38] and implicit functions [36,37], which have yielded great success in 3D reconstruction or novel view synthesis. Our framework is mainly inspired by Video Autoencoder [10], which encodes a video by separately learning the deep implicit 3D structure and the camera trajectory. One biggest advantage of [10] is that the learned Autoencoder can encode and synthesize test scenes without further tuning or optimization, which greatly satisfies the efficiency requirement of our instance detection task.', '---', '> The ability to swiftly identify a specific, previously unseen instance amidst a cluttered environment, such as finding a matching sock in a laundry pile or a particular suitcase at an airport, is a fundamental human cognitive skill. Humans can form a mental picture of a novel object from a few glances and then locate it with remarkable accuracy. While state-of-the-art object detectors [1-7] excel at categorizing known classes, they fall short in this critical task of **novel instance detection**: identifying an unseen specific instance within a cluttered query image using multi-view support references. 
Prior approaches to this problem predominantly operate in 2D space, employing techniques like correlation [8,5], attention mechanisms [6], or similarity matching [9]. However, these 2D methods often exhibit a critical lack of robustness when confronted with significant appearance disparities, such as those caused by pose variations or occlusions, between the query and template images. Furthermore, while extensive research exists in few-shot category-level object detection [7,1,2], these class-level matching techniques are inherently insufficient for the fine-grained discernment required for instance-level identification.', '7a8,14', '> Section: Related Work', '> **Object Detection:** Object detection methods broadly fall into two categories: two-stage and one-stage approaches. Two-stage detectors, such as R-CNN [20] and its variants [21,22], first generate regions of interest (ROI) via a region proposal network, followed by detection heads for classification and bounding box regression. In contrast, one-stage detectors like the YOLO series [23-25] and recent transformer-based methods [3,4] tackle detection as an end-to-end regression problem. Few-shot and one-shot object detection [1,2,7,27-29] aim to detect unseen classes with limited labeled support samples, making them conceptually closer to our task. These methods typically employ either transfer-learning [27,28] or meta-learning strategies [1,2,7,29] to generalize to novel categories. However, as these are primarily category-level approaches, their classification and matching designs are often optimized for metrics like Top-100 precision, assuming multiple instances of a class. This contrasts sharply with novel instance detection, where Top-1 accuracy for a specific instance is paramount, and these methods often fail. Open-world and zero-shot object detection [14,30-32] address the task of finding any object in an image, often in a class-agnostic manner. 
Some approaches learn objectiveness [14,30], while others [32] leverage large-scale datasets. Such methods are well-suited as a preliminary module in our pipeline to generate object proposals for subsequent template comparison. We specifically adopt [14] due to its efficient structure and robust performance.', '> **Novel Instance Detection:** Crucially, novel instance detection necessitates algorithms that can locate a specific unseen instance in a query image using provided multi-view templates. As discussed, prior methods [5,6,8] predominantly rely on pure 2D representations and matching techniques. For instance, DTOID [6] employs global object attention and a local pose-specific branch to predict template-guided heatmaps. However, these 2D-centric approaches inherently struggle with significant 2D appearance variations caused by occlusion or drastic pose changes. VoxDet addresses this fundamental limitation by explicitly leveraging 3D geometric knowledge extracted from multi-view templates, enabling the representation and matching of instances in a geometry-invariant manner.', '> **Multi-view 3D Representations:** The robust representation of 3D scenes or instances from multi-view images is a long-standing and critical problem in computer vision. Historically, multi-view geometry, particularly Structure from Motion (SfM) [33] pipelines, has been instrumental in jointly optimizing camera poses and 3D structures. More recently, neural 3D representations [10,11,34-38], including deep voxels [10,12,35,38] and implicit functions [36,37], have achieved remarkable success in tasks such as 3D reconstruction and novel view synthesis. Our VoxDet framework draws significant inspiration from the Video Autoencoder [10], which disentangles deep implicit 3D structure learning from camera trajectory estimation in videos. 
A key advantage of this approach, crucial for our instance detection task, is its ability to encode and synthesize test scenes efficiently without requiring further tuning or optimization.', '> **Instance Pose Estimation:** This field is dedicated to estimating the 6 Degrees of Freedom (6 DoF) pose of an unseen instance. Some methods [57,58] match local point features and utilize RANSAC to optimize the relative pose. Others [5,59] first select the closest template frame and then conduct pose refinement on the cropped object patch. Most of these methods typically assume perfect instance detection, meaning they crop the instance from the query image using ground-truth bounding boxes and then estimate the pose on this small, object-centered patch. Our VoxDet can serve as an effective front-end for such systems, providing robust detection in cluttered environments and thereby enhancing the reliability of the overall detection-pose estimation framework.', "> **Instance Retrieval:** Instance retrieval aims to retrieve a specific instance from a large database given a single reference image [60-65]. Early work extracted local point features from template and query patches for image matching [61,49], which often suffered from limited discriminative capability. More recent work leverages deep neural networks for a global representation of the instance [62-65], comparing these with features extracted from query images. However, most of these methods construct 2D template features from the reference, making their representation unaware of the instance's 3D geometry, which can lead to a lack of robustness under severe pose variations. 
Furthermore, instance retrieval methods often require high-resolution query images for discriminative features, whereas instances in our cluttered query images can be at low resolution, posing additional challenges for these approaches.", '> ', '8a16', "> **Problem Formulation:** Given a training instance set O base and a disjoint unseen test instance set O novel (i.e., O base ∩ O novel = ϕ), the objective of novel instance detection is to train an instance detector on O base that can subsequently detect new instances in O novel without any further training or fine-tuning. Formally, for each target instance, the detector receives a query image I Q ∈ R 3×W ×H and a set of M support templates I S ∈ R M ×3×W ×H. The detector's output is the bounding box b ∈ R 4 of the target instance in the query image. For this work, we assume that exactly one target instance is present in the query image and that the instance is approximately centered in the support images.", '10,13d17', '< ', '< Section: Problem Formulation', '< Given a training instance set O base and an unseen test instance set O novel , where O base ∩ O novel = ϕ, the task of novel instance detection (open-world detection) is to find an instance detector trained on O base and then detect new instances in O novel with no further training or finetuning. Specifically, for each instance, the input to the detector is a query image I Q ∈ R 3×W ×H and a group of M support templates I S ∈ R M ×3×W ×H of the target instance. The detector is expected to output the bounding box b ∈ R 4 of an instance on the query image. We assume there exists exactly one such instance in the query image and the instance is located near the center of the support images.', '< ', '15c19', '< The architecture of VoxDet is shown in Fig. 2, which consists of an open-world detector, a template voxel aggregation (TVA) module, and a query voxel matching (QVM) module. 
Given the query image, the open-world detector aims to generate universal proposals covering all possible objects. TVA aggregates multi-view supports into a compact template voxel via the relative camera pose between frames. QVM lifts 2D proposal features onto 3D voxel space, which is then aligned and matched with the template voxel. In order to empower the voxel representation with 3D geometry, we first resort to a reconstruction objective in the first stage. The pre-trained models serve as the initial weights for the second instance detection training stage.  ', '---', '> The comprehensive architecture of VoxDet, illustrated in Fig. 2, integrates three primary modules: an open-world detector, a Template Voxel Aggregation (TVA) module, and a Query Voxel Matching (QVM) module. Initially, the open-world detector processes the query image to generate universal object proposals. Subsequently, TVA aggregates multi-view support images into a compact 3D template voxel, leveraging their relative camera poses. Concurrently, QVM transforms 2D proposal features into the 3D voxel space, facilitating alignment and matching with the template voxel. To imbue the voxel representation with robust 3D geometric understanding, we employ a self-supervised 3D reconstruction objective during an initial pre-training stage. The models obtained from this reconstruction pre-training then serve as crucial initial weights for the subsequent instance detection training stage.  ', '18c22', '< Since the desired instance is unseen during training, directly regressing its location and scale is non-trivial. To solve this, we first use an open-world detector [14] to generate the most possible candidates. Different from standard detection that only finds out pre-defined classes, an open-world detector locates all possible objects in an image, which is class-agnostic. As shown in Fig. 2, given a query image I Q , a 2D feature map f Q is extracted by a backbone network ψ(•). 
To classify each pre-defined anchor as foreground (objects) or background, the region proposal network (RPN) [22] is adopted. Concurrently, the boundaries of each anchor are also roughly regressed. The resulting anchors with high classification scores are termed region proposals', '---', '> Given that the desired instance is unseen during training, directly regressing its precise location and scale presents a significant challenge. To address this, we initiate our pipeline with an open-world detector [14] designed to generate comprehensive object proposals. Unlike conventional detectors that identify only pre-defined classes, an open-world detector is class-agnostic, capable of localizing all potential objects within an image. As depicted in Fig. 2, for a given query image I Q , a 2D feature map f Q is extracted by a backbone network ψ(•). A Region Proposal Network (RPN) [22] then classifies pre-defined anchors as either foreground (objects) or background, while simultaneously regressing their bounding box coordinates. Anchors with high classification scores are designated as region proposals, represented as', '20c24', '< , where N is the number of proposals. Next, to obtain the features F Q for these candidates, we use region of interest pooling (ROIAlign) [22],', '---', '> , where N denotes the total number of proposals. Subsequently, to obtain features F Q for these candidates, we apply Region of Interest (ROI) pooling (specifically ROIAlign [22]), yielding', '22c26', '< , where C denotes channel dimensions and w is the spatial size of proposal features. Finally, we obtain the final classification result and bounding box by two parallel multi-layer perceptrons (MLP), known as the detection head, which takes the proposal features F Q as input, and outputs the binary classification scores and the box regression targets. 
The training loss is comprised of RPN classification loss L RPN cls , RPN regression loss L RPN reg , head classification loss L Head cls , and head regression loss L Head reg . To make the detector work for open-world objects, the classification branches (in RPN and head) are guided by objectiveness regression [14]. Specifically, the classification score is defined (supervised) by Intersection over Union (IoU), which showed a high recall rate over the objects in test images, even those unseen during training. Since they have learned the class-agnostic "objectiveness", we assume the open-world proposals probably cover the desired novel instance. Therefore, we take the top-ranking candidates and their features as the input of the subsequent matching module.', '---', "> , where C is the channel dimension and w is the spatial size of the proposal features. Finally, the detection head, comprising two parallel multi-layer perceptrons (MLP), takes F Q as input to output binary classification scores and bounding box regression targets. The training objective for this module combines several losses: RPN classification loss L RPN cls , RPN regression loss L RPN reg , head classification loss L Head cls , and head regression loss L Head reg . To enable the detector's open-world capability, its classification branches (both in the RPN and detection head) are supervised through objectiveness regression [14]. This means the classification score is directly defined and optimized to predict the Intersection over Union (IoU) with ground-truth objects. This approach has demonstrated a high recall rate for objects in test images, even for instances not seen during training. Because the detector learns this class-agnostic 'objectiveness,' we assume that its proposals are likely to cover the desired novel instance. 
Consequently, we select the top-ranking candidates and their corresponding features as the input for our subsequent matching module.", '25c29,31', '< To learn geometry-invariant representations, the Template Voxel Aggregation (TVA) module compresses multi-view 2D templates into a compact deep voxel. Inspired by previous technique [10] developed for unsupervised video encoding, we propose to encode our instance templates via their relative orientation in the physical 3D world. To this end, we first generate the 2D feature maps F S = ψ(I S ) ∈ R M ×C×w×w using a shared backbone network ψ(•) used in the query branch and then map the 2D features to 3D voxels for multi-view aggregation. 2D-3D mapping: To map these 2D features onto a shared 3D space for subsequent orientation-based aggregation, we utilize an implicit mapping function M(•). This function translates the 2D features to 3D voxel features, denoted by V = M(F S ) ∈ R M ×Cv×D×L×L , where V is the 3D voxel feature from the 2D feature, C v is the feature dimension, and D, L indicate voxel spatial size. Specifically, we first reshape the feature maps to F ′S ∈ R M ×(C/d)×d×w×w , where d is the pre-defined implicit depth, then we apply 3D inverse convolution to obtain the feature voxel. Note that with multi-view images, we can calculate the relative camera rotation easily via Structure from Motion (SfM) [33] or visual odometry [39]. Given that the images are object-centered and the object stays static in the scene, these relative rotations in fact represent the relative rotations between the object orientations defined in the same camera coordination system. Different from previous work [10] that implicitly learns the camera extrinsic for unsupervised encoding, we aim to explicitly embed such geometric information. 
Specifically, our goal is to first transform every template into the same coordinate system using their relative rotation, which is then aggregated:', '---', '> To learn geometry-invariant representations, the Template Voxel Aggregation (TVA) module compresses multi-view 2D templates into a compact deep voxel. Inspired by previous techniques [10] developed for unsupervised video encoding, we propose to encode our instance templates via their relative orientation in the physical 3D world. To this end, we first generate the 2D feature maps F S = ψ(I S ) ∈ R M ×C×w×w using a shared backbone network ψ(•) (also used in the query branch). These 2D features are then mapped to 3D voxels for multi-view aggregation.', "> **2D-3D Mapping:** To project these 2D features onto a shared 3D space, we utilize an implicit mapping function M(•). This function translates the 2D features to 3D voxel features, denoted by V = M(F S ) ∈ R M ×Cv×D×L×L, where V represents the 3D voxel feature derived from the 2D input, C v is the feature dimension, and D, L indicate the voxel's spatial dimensions. Specifically, we first reshape the feature maps to F ′S ∈ R M ×(C/d)×d×w×w, where d is a pre-defined implicit depth, and then apply 3D inverse convolution to obtain the feature voxel.", '> Note that with multi-view images, the relative camera rotation can be readily calculated via Structure from Motion (SfM) [33] or visual odometry [39]. Given that the images are object-centered and the object remains static, these relative rotations effectively represent the relative orientations between the object instances as defined in the same camera coordinate system. Unlike prior work [10] that implicitly learns camera extrinsics for unsupervised encoding, our approach explicitly embeds this geometric information. 
Specifically, our goal is to first transform every template into a common coordinate system using its relative rotation, which is then aggregated:', '27c33', '< where V i ∈ R Cv×D×L×L is the previously mapped i-th independent voxel feature, R ⊤ i denotes the relative camera rotation between the i-th support frame and the first frame. Rot(•, •) is the 3D transform used in [10], which first wraps a unit voxel to the new coordination system using R ⊤ i and then samples from the feature voxel V i with the transformed unit voxel grid. Therefore, all the M voxels are transformed into the same coordinate system defined in the first camera frame. These are then aggregated through average pooling to produce the compact template voxel v S . By explicitly embedding the 3D rotations into individual reference features, TVA achieves a geometryaware compact representation, which is more robust to occlusion and pose variation.', '---', '> where V i ∈ R Cv×D×L×L is the previously mapped i-th independent voxel feature, and R ⊤ i denotes the relative camera rotation between the i-th support frame and the first frame. Rot(•, •) is the 3D transform used in [10], which first warps a unit voxel to the new coordinate system using R ⊤ i and then samples from the feature voxel V i with the transformed unit voxel grid. Therefore, all M voxels are transformed into the same coordinate system defined by the first camera frame. These are then aggregated through average pooling to produce the compact template voxel v S . By explicitly embedding the 3D rotations into individual reference features, TVA achieves a geometry-aware compact representation, which is inherently more robust to occlusion and pose variation.', '30c36', '< Given the proposal features F Q from query image I Q and the template voxel C S from supports I S , the task of the query voxel matching (QVM) module is to classify each proposal as foreground (the reference instance) or background. As shown in Fig. 
2, in order to empower the 2D features with 3D geometry, we first use the same mapping to get query voxels,', '---', '> Given the proposal features F Q from the query image I Q and the template voxel v S from the support images I S , the primary task of the Query Voxel Matching (QVM) module is to classify each proposal as either foreground (corresponding to the reference instance) or background. As illustrated in Fig. 2, to endow the 2D query features with 3D geometric awareness, we first apply the same implicit mapping function M(•) to obtain query voxels:', '32,33c38,39', '< VoxDet next accomplishes matching v S and V Q through two steps. First, we need to estimate the relative rotation between query and support, so that V Q can be aligned in the same coordinate system as v S . Second, we need to learn a function that measures the distance between the aligned two voxels.', '< To achieve this, we define a voxel relation operator R v (•, •): Voxel Relation Given two voxels v 1 , v 2 ∈ R c×a×a×a , where c is the channel and a is the spatial dimension, this function seeks to discover their relations in every semantic channel. To achieve this, we first interleave the voxels along channels as', '---', '> VoxDet then accomplishes the matching between v S and V Q through a two-step process. First, it estimates the relative rotation between the query and support, ensuring that V Q can be accurately aligned in the same coordinate system as v S . Second, it learns a function to measure the geometric and semantic distance between the aligned voxels.', '> To achieve this, we introduce a novel **Voxel Relation** operator, R v (•, •). Given two voxels v 1 , v 2 ∈ R c×a×a×a (where c is the channel dimension and a is the spatial dimension), this function aims to discover their intricate relations across every semantic channel. This is achieved by first interleaving the voxels along their channel dimension:', '35c41', '< is the voxel feature in the k-th channel. 
Then, we apply grouped convolution as', '---', '> denote the voxel features in the k-th channel. Subsequently, we apply grouped convolution:', '37c43', '< In the experiments, we found that such a design makes relation learning easier since each convolution kernel is forced to learn the two feature voxels from the same channel. With this voxel relation, we can then roughly estimate the rotation matrix RQ ∈ R N ×3×3 of each query voxel relative to the template as:', '---', '> In our experiments, we found that this design makes relation learning easier, as each convolution kernel is forced to relate the two feature voxels from the same semantic channel. Leveraging this voxel relation, we can then roughly estimate the rotation matrix RQ ∈ R N ×3×3 for each query voxel relative to the template as:', '39c45', '< where v S is copied N times to get V S . In practice, we first predict 6D continuous vector [40] as the network outputs and then convert the vector to a rotation matrix. Next, we can define the classification haed with the Voxel Relation as:', '---', "> where v S is replicated N times to form V S , one copy per query voxel. In practice, we first predict a 6D continuous vector [40] as the network's output, which is then converted into a rotation matrix. Following this, we define the classification head using the Voxel Relation:", '41c47', '< where Rot(V Q , RQ ) rotates the queries to the support coordination system to allow for reasonable matching. In practice, we additionally introduced a global relation branch for the final score, so that the lost semantic information in implicit mapping can be retrieved. More details are available in the supplementary material. 
During inference, we rank the proposals P according to their matching score and take the Top-k candidates as the predicted box b.', '---', "> where Rot(V Q , RQ ) rotates the query voxels into the support's coordinate system so that the two can be meaningfully compared. In practice, we also incorporate a global relation branch for the final score, which recovers semantic information lost during the implicit 2D-3D mapping. More details are provided in the supplementary material. During inference, we rank the generated proposals P based on their matching scores and select the Top-k candidates as the predicted bounding box b.", '44,45c50,51', '< As illustrated in Fig. 2, VoxDet contains two training stages: reconstruction and instance detection.', '< Reconstruction To learn the 3D geometry relationships, specifically 3D rotation between instance templates, we pre-train the implicit mapping function M(•) using a reconstruction objective. We divide M multi-view templates I S into input images I S i ∈ R (M -K)×3×W ×H and outputs I S o ∈ R K×3×W ×H . Next, we construct the voxel representation V S using I S i via the TVA module and adopt a decoder network Dec to reconstruct the output images through the relative rotations:', '---', "> As illustrated in Fig. 2, VoxDet's training regimen is divided into two distinct stages: reconstruction pre-training and instance detection training.", '> **Reconstruction Pre-training:** To effectively learn 3D geometry relationships, particularly the 3D rotation between instance templates, we pre-train the implicit mapping function M(•) using a self-supervised reconstruction objective. We partition M multi-view templates I S into a set of input images I S i ∈ R (M -K)×3×W ×H and a set of output images I S o ∈ R K×3×W ×H. 
Subsequently, we construct the voxel representation V S using I S i via the TVA module and employ a decoder network Dec to reconstruct the output images based on their relative rotations:', '47c53', '< where ÎS o,j denotes the j-th reconstructed (fake) output images and R j is the relative rotation matrix between the 1-st to j-th camera frame. We finally define the reconstruction loss as:', '---', '> where ÎS o,j denotes the j-th reconstructed (synthetic) output image and R j is the relative rotation matrix between the first and j-th camera frame. The comprehensive reconstruction loss is then defined as:', '49c55,56', '< where L recon denotes the reconstruction loss, i.e., the L1 distance between I S o and ÎS o . L gan is the generative adversarial network (GAN) loss, where we additionally train a discriminator to classify I S o and ÎS o . L percep means the perceptual loss, which is the L1 distance between the feature maps of I S o and ÎS o in each level of VGGNet [41]. Even though the reconstruction is only supervised on training instances, we observe that it can roughly reconstruct novel views for unseen instances. We thus reason that the pre-trained voxel mapping can roughly encode the geometry of an instance. Detection base training : In order to empower M(•) with geometry encoding capability, we initialize it with the reconstruction pre-trained weights and conduct the instance detection training stage. In addition to the open-world detection loss [14], we introduce the instance classification loss L Ins cls and rotation estimation loss L Ins rot to supervise our VoxDet. We define L Ins cls as the binary cross entropy loss between the true labels s ∈ {0, 1} N and the predicted scores ŝ ∈ R N ×2 from the QVM module. The rotation estimation loss is defined as:', '---', '> Here, L recon represents the reconstruction loss, specifically the L1 distance between I S o and ÎS o . 
L gan is the generative adversarial network (GAN) loss, where an auxiliary discriminator is trained to distinguish between real (I S o ) and reconstructed (ÎS o ) images. L percep denotes the perceptual loss, calculated as the L1 distance between the feature maps of I S o and ÎS o at each level of a pre-trained VGGNet [41]. Although this reconstruction is supervised solely on training instances, we observe that it can roughly reconstruct novel views for unseen instances. We thus reason that the pre-trained voxel mapping can roughly encode the geometry of an instance.', '> **Instance Detection Training:** To empower M(•) with robust geometry encoding capabilities for detection, we initialize it with the weights learned during the reconstruction pre-training stage and proceed with the instance detection training. In addition to the standard open-world detection loss [14], we introduce an instance classification loss L Ins cls and a rotation estimation loss L Ins rot to supervise VoxDet. L Ins cls is defined as the binary cross-entropy loss between the true labels s ∈ {0, 1} N and the predicted scores ŝ ∈ R N ×2 from the QVM module. The rotation estimation loss is formulated as:', '51c58', '< Table 1: Overall performance comparison on synthectic-real datasets LM-O [17] and YCB-V [18]. Compared with various 2D methods, including correlation [5], attention [6], and feature matching [9,19], our VoxDet holds superiority in both accuracy and efficiency. OLN* means the open-world object detector (OW Det.) [14] is jointly trained with the matching head while OLN denotes using fixed modules. † the model is trained on both synthetic dataset OWID and real images.   [14,19] 0.248 0.370 OLNDINO [14,9] 0.235 0.357 OLNCorr. [14,5] 0.060 0.182 where R Q is the ground-truth rotation matrix of the query voxel. Note that here we only supervise the positive samples. 
Together, our instance detection loss is defined as: Synthetic Training set: In response to the scarcity of instance detection traing sets, we\'ve compiled a comprehensive synthetic dataset using 9,901 objects from ShapeNet [15] and ABO [16]. Each instance is rendered into a 40-frame, object-centric 360 o video via Blenderproc [43]. We then generate a query scene using 8 to 15 randomly selected objects from the entire instance pool, each initialized with a random orientation. This process yielded 55,000 scenes with 180,000 boxes for training and an additional 500 images for evaluation, amounting to 9,800 and 101 instances respectively. We\'ve termed this expansive training set "open-world instance detection" (OWID-10k), signifying our model\'s capacity to handle unseen instances. To our knowledge, this is the first of its kind. Synthetic-Real Test set: We utilize two authoritative benchmarks for testing. LineMod-Occlusion [17] (LM-O) features 8 texture-less instances and 1,514 box annotations, with the primary difficulty being heavy object occlusion. The YCB-Video [18] (YCB-V) contains 21 instances and 4,125 target boxes, where the main challenge lies in the variance in instance pose. These datasets provide real test images while lacks the reference videos, we thus render synthetic videos using the CAD models in Blender.', '---', '> where R Q is the ground-truth rotation matrix of the query voxel. It is important to note that we only supervise this loss for positive samples. The combined instance detection loss is thus defined as:', '52a60,61', '> **Synthetic Training Set:** Addressing the critical scarcity of instance detection training sets, we have meticulously compiled a comprehensive synthetic dataset, OWID-10k, leveraging 9,901 diverse 3D objects from ShapeNet [15] and ABO [16]. Each instance is rendered into a 40-frame, object-centric 360° video using Blenderproc [43]. 
For training, we generate 55,000 query scenes, each containing 8 to 15 randomly selected objects from the entire instance pool, initialized with random orientations, yielding 180,000 bounding box annotations. An additional 500 images are reserved for evaluation. The training and evaluation splits cover 9,800 and 101 instances, respectively. We term this training set "open-world instance detection" (OWID-10k), signifying our model\'s capacity to handle unseen instances; to our knowledge, it is the first of its kind.', '> **Synthetic-Real Test Sets:** For rigorous testing, we utilize two authoritative benchmarks: LineMod-Occlusion [17] (LM-O) and YCB-Video [18] (YCB-V). LM-O features 8 texture-less instances and 1,514 box annotations, with the primary challenge being heavy object occlusion. YCB-V contains 21 instances and 4,125 target boxes, where the main difficulty lies in significant variations in instance pose. Since these datasets provide real test images but lack corresponding reference videos, we render synthetic videos using their respective CAD models in Blender to serve as multi-view templates.', '54,55c63,67', '< Section: Fully-Real Test set:', "< To test the sim-to-real transfer capability of VoxDet, we introduced a more complex fully real-world benchmark, RoboTools, consisting of 20 instances, 9,109 annotations,   [7]. By virtue of the instance-level matching method, QVM, VoxDet can better classify the proposals, so that 90% of the true positives lie in Top-10, while for OS2D, this ratio is only 60%. and 24 challenging scenarios. The instances and scenes are presented in Fig. 3. Compared with existing benchmarks [17,18], RoboTools is much more challenging with more cluttered backgrounds and more severe pose variation. Besides, the reference videos of RoboTools are also real-images, including real lighting conditions like shadows. We also provide the ground-truth camera extrinsic. 
Baselines: Our baselines comprise template-driven instance detection methods, such as correlation [5] and attention-based approaches [6]. However, these methods falter in cluttered scenes, like those in LM-O, YCB-V, and RoboTools. Therefore, we've self-constructed several 2D baselines, namely, OLN DINO , OLN CLIP , and OLN Corr. In these models, we initially obtain open-world 2D proposals via our open-world detection module [14]. We then employ different 2D matching methods to identify the proposal with the highest score. In OLN DINO and OLN CLIP , we leverage robust features from pre-trained backbones [9,19] and use cosine similarity for matching. 1 For OLN Corr. , we designed a 2D matching head using correlation as suggested in [5]. These open-world detection based 2D baselines significantly outperform previous methods [5,6]. In addition to these instance-specific methods, we also include a class-level one-shot detector, OS2D [7] and BHRL [42] for comparison. Hardware and configurations: The reconstruction stage of VoxDet was trained on a single Nvidia V100 GPU over a period of 6 hours, while the detection training phase utilized four Nvidia V100 GPUs for a span of ∼40 hours. For the sake of fairness, we trained the methods referenced [5-7, 14, 19, 9] mainly on the OWID dataset, adhering to their official configuration. Inferences were conducted on a single V100 GPU to ensure fair efficiency comparison. During testing, we supplied each model with the same set of M = 10 template images per instance, and all methods employed the top N = 500 ranking proposals for matching. In the initial reconstruction training stage, VoxDet used 98% of all 9,901 instances in the OWID dataset. For each instance, a random set of K = 4 images were designated as output I S o , while the remaining M -K = 6 images constituted the inputs I S i . For additional configurations of VoxDet, please refer to Appendix A and our code. 
Metrics: Given our assumption that only one desired instance is present in the query image, we default to selecting the Top-1 proposal as the predicted result. We report the average recall (AR) rate [44] across different IoU, such as mAR (IoU ∈ 0.5 ∼ 0.95), AR 50 (IoU 0.5), AR 75 (IoU 0.75), and AR 95 (IoU 0.95). Note that the AR is equivalent to the average precision (AP) in our case.", '---', '> Section: Fully-Real Test Set: RoboTools', '> To rigorously assess the sim-to-real transfer capability of VoxDet, we introduce a more complex, fully real-world benchmark named **RoboTools**. This dataset comprises 20 unique instances, 9,109 annotations, and 24 highly challenging scenarios. The instances and example scenes are presented in Fig. 3. Compared to existing benchmarks like LineMod-Occlusion [17] and YCB-Video [18], RoboTools is significantly more challenging due to its heavily cluttered backgrounds and severe pose variations. Furthermore, the reference videos in RoboTools consist of real images, capturing realistic lighting conditions including shadows. We also provide ground-truth camera extrinsics for all recordings.', '> **Baselines:** Our comparative baselines include traditional template-driven instance detection methods, such as correlation-based approaches [5] and attention-based methods [6]. However, these methods typically falter in cluttered scenes, which are prevalent in LM-O, YCB-V, and RoboTools. Therefore, we have meticulously constructed several strong 2D baselines, namely OLN DINO, OLN CLIP, and OLN Corr. In these models, we first obtain open-world 2D proposals using our open-world detection module [14]. Subsequently, different 2D matching methods are employed to identify the proposal with the highest score. For OLN DINO and OLN CLIP, we leverage robust features from large-scale pre-trained backbones [9,19] and utilize cosine similarity for matching. For OLN Corr., we designed a 2D matching head that employs correlation, as suggested in [5]. 
These open-world detection-based 2D baselines significantly outperform previous methods [5,6]. In addition to these instance-specific methods, we also include class-level one-shot detectors, OS2D [7] and BHRL [42], for comprehensive comparison.', '> **Hardware and Configurations:** The reconstruction stage of VoxDet was trained on a single Nvidia V100 GPU for 6 hours, while the detection training phase utilized four Nvidia V100 GPUs for approximately 40 hours. For fair comparison, we trained the referenced methods [5-7, 14, 19, 9] primarily on the OWID dataset, adhering to their official configurations. Inferences were conducted on a single V100 GPU to ensure equitable efficiency comparisons. During testing, all models were provided with the same set of M = 10 template images per instance, and all methods utilized the top N = 500 ranking proposals for matching. In the initial reconstruction training stage, VoxDet leveraged 98% of all 9,901 instances in the OWID dataset. For each instance, a random set of K = 4 images were designated as output I S o, while the remaining M -K = 6 images constituted the inputs I S i. For additional configurations of VoxDet, please refer to Appendix A and our publicly available code.', '> **Metrics:** Given our assumption that precisely one desired instance is present in the query image, we default to selecting the Top-1 proposal as the predicted result. We report the average recall (AR) rate [44] across various Intersection over Union (IoU) thresholds, including mAR (IoU ∈ [0.5, 0.95]), AR 50 (IoU = 0.5), AR 75 (IoU = 0.75), and AR 95 (IoU = 0.95). It is important to note that AR is equivalent to average precision (AP) in our specific evaluation context.', '58,64c70,74', '< Overall Performance Comparison: On the synthetic real datasets, we comprehensively compare with all the potential baselines, the results are detailed in Table 1, demonstrating that VoxDet consistently delivers superior performance across most settings. 
Notably, VoxDet surpasses the VoxDet OLN !"#$ OLN %&\'\'.', '< VoxDet OLN !"#$ OLN %&\'\'.', '< VoxDet OLN !"#$ OLN %&\'\'.', '< Obj 5@LM-O Obj 13@YCB-V Obj. 9@RoboTools Support Images and AR Query Images 13@YCB-V second column and Obj. 9@RoboTools).', '< next best baseline, OLN DINO , by an impressive margin of up to 20% in terms of average mAR. Furthermore, due to its compact voxel representation, VoxDet is observed to be markedly more efficient. On the newly built fully real dataset, RoboTools, we only compare methods trained on the same synthetic dataset for fairness. As shown in Table 2, VoxDet demonstrates better sim2real transfering capability compared with the 2D methods due to its 3D voxel representation. We present the results comparison with the real-image trained models in Appendix D. Efficiency Comparison: As QVM has a lower model complexity than OLN CLIP and OLN DINO , it achieves faster inference speeds, as detailed in Table 3. Compared to correlation-based matching [5],', '< VoxDet leverages the aggregation of multi-view templates into a single compact voxel, thereby eliminating the need for exhaustive 2D correlation and achieving 2× faster speed.', "< In addition to inference speed, VoxDet also demonstrates greater efficiency regarding the number of templates. We tested the methods on the YCB-V dataset [18] using fewer templates than the default. As illustrated in Fig. 4, we found that the 2D baseline is highly sensitive to the number of provided references, which may plummet by 87% when the number of templates is reduced from 10 to 2. However, such a degradation rate for VoxDet is 2× less. We attribute this capability to the learned 2D-3D mapping, which can effectively incorporate 3D geometry with very few views. Top-K Analysis: Compared to the category-level method [7], VoxDet produces considerably fewer false positives among its Top-10 candidates. As depicted in Fig. 
5, we considered Top-K = 1, 5, 10, 20, 30, 50, 100 proposals and compared the corresponding AR between VoxDet and OS2D [7]. VoxDet's AR only declines by 5 ∼ 10% when K decreases from 100 to 10, whereas OS2D's AR suffers a drop of up to 38%. This suggests that over 90% of VoxDet's true positives are found among its Top-10 candidates, whereas this ratio is only around 60% for OS2D. Ablation Studies: The results of our ablation studies are presented in Table 4. Initially, we attempted to utilize the 3D depth-wise convolution for matching (see the fourth row). However, this proved to be inferior to our proposed instance-level voxel relation. Reconstruction pre-training is crucial for VoxDet's ability to learn to encode the geometry of an instance (see the last row). Additionally, we conducted an ablation on the rotation measurement module (R) in the QVM, and also tried not supervising the predicted rotation. Both are inferior to our default settings.", '---', '> **Overall Performance Comparison:** On the synthetic-real datasets (LM-O and YCB-V), our comprehensive comparison with all potential baselines, detailed in Table 1, demonstrates that VoxDet consistently delivers superior performance across most settings. Notably, VoxDet surpasses the next best baseline, OLN DINO [14,9], by an impressive margin of up to 20% in terms of average mAR. This significant improvement highlights the efficacy of our 3D geometry-aware approach. Furthermore, due to its compact voxel representation, VoxDet exhibits markedly greater efficiency. On the newly introduced fully real dataset, RoboTools, we restrict comparisons to methods trained exclusively on the same synthetic dataset for fairness. As shown in Table 2, VoxDet demonstrates superior sim-to-real transfer capability compared to 2D-based methods, a direct benefit of its robust 3D voxel representation. 
A more extensive comparison with models trained on real images is provided in Appendix D.', '> **Efficiency Comparison:** The Query Voxel Matching (QVM) module, with its inherently lower model complexity compared to OLN CLIP [14,19] and OLN DINO [14,9], achieves significantly faster inference speeds, as detailed in Table 3. When compared to correlation-based matching [5], VoxDet efficiently aggregates multi-view templates into a single compact voxel, thereby eliminating the need for exhaustive 2D correlation and achieving approximately 2× faster speed.', '> In addition to inference speed, VoxDet also demonstrates superior efficiency concerning the number of required templates. We evaluated the methods on the YCB-V dataset [18] using fewer templates than the default setting. As illustrated in Fig. 4, we found that 2D baselines are highly sensitive to the number of provided references; their performance may plummet by as much as 87% when the number of templates is reduced from 10 to 2. In contrast, VoxDet exhibits a degradation rate that is 2× less severe. We attribute this robustness to the learned 2D-3D mapping, which effectively incorporates 3D geometry even with very few views.', "> **Top-K Analysis:** Compared to the category-level method OS2D [7], VoxDet produces considerably fewer false positives among its Top-10 candidates. As depicted in Fig. 5, we considered Top-K = {1, 5, 10, 20, 30, 50, 100} proposals and compared the corresponding AR between VoxDet and OS2D. VoxDet's AR only declines by 5-10% when K decreases from 100 to 10, whereas OS2D's AR suffers a substantial drop of up to 38%. This suggests that over 90% of VoxDet's true positives are found among its Top-10 candidates, whereas this ratio is only around 60% for OS2D, highlighting VoxDet's superior precision in retrieving the target instance.", "> **Ablation Studies:** The results of our comprehensive ablation studies are presented in Table 4. 
Initially, we explored using 3D depth-wise convolution for matching (see the fourth row), but this approach proved inferior to our proposed instance-level voxel relation. Reconstruction pre-training is crucial for VoxDet's ability to learn to encode the geometry of an instance (see the last row). Additionally, we conducted an ablation on the rotation measurement module (R) within QVM and also experimented with not supervising the predicted rotation. Both alternative settings yielded inferior results compared to our default configurations, underscoring the importance of these components.", '66c76,81', '< Section: VoxDet vs DTOID', '---', '> Section: Related Work', '> **Object Detection:** Object detection methods broadly fall into two categories: two-stage and one-stage approaches. Two-stage detectors, such as R-CNN [20] and its variants [21,22], first generate regions of interest (ROI) via a region proposal network, followed by detection heads for classification and bounding box regression. In contrast, one-stage detectors like the YOLO series [23-25] and recent transformer-based methods [3,4] tackle detection as an end-to-end regression problem. Few-shot and one-shot object detection [1,2,7,27-29] aim to detect unseen classes with limited labeled support samples, making them conceptually closer to our task. These methods typically employ either transfer-learning [27,28] or meta-learning strategies [1,2,7,29] to generalize to novel categories. However, as these are primarily category-level approaches, their classification and matching designs are often optimized for metrics like Top-100 precision, assuming multiple instances of a class. This contrasts sharply with novel instance detection, where Top-1 accuracy for a specific instance is paramount, and these methods often fail. 
Open-world and zero-shot object detection [14,30-32] address the task of finding any object in an image, often in a class-agnostic manner. Some approaches learn objectiveness [14,30], while others [32] leverage large-scale datasets. Such methods are well-suited as a preliminary module in our pipeline to generate object proposals for subsequent template comparison. We specifically adopt [14] due to its efficient structure and robust performance.', '> **Novel Instance Detection:** Crucially, novel instance detection necessitates algorithms that can locate a specific unseen instance in a query image using provided multi-view templates. As discussed, prior methods [5,6,8] predominantly rely on pure 2D representations and matching techniques. For instance, DTOID [6] employs global object attention and a local pose-specific branch to predict template-guided heatmaps. However, these 2D-centric approaches inherently struggle with significant 2D appearance variations caused by occlusion or drastic pose changes. VoxDet addresses this fundamental limitation by explicitly leveraging 3D geometric knowledge extracted from multi-view templates, enabling the representation and matching of instances in a geometry-invariant manner.', '> **Multi-view 3D Representations:** The robust representation of 3D scenes or instances from multi-view images is a long-standing and critical problem in computer vision. Historically, multi-view geometry, particularly Structure from Motion (SfM) [33] pipelines, has been instrumental in jointly optimizing camera poses and 3D structures. More recently, neural 3D representations [10,11,34-38], including deep voxels [10,12,35,38] and implicit functions [36,37], have achieved remarkable success in tasks such as 3D reconstruction and novel view synthesis. Our VoxDet framework draws significant inspiration from the Video Autoencoder [10], which disentangles deep implicit 3D structure learning from camera trajectory estimation in videos. 
A key advantage of this approach, crucial for our instance detection task, is its ability to encode and synthesize test scenes efficiently without requiring further tuning or optimization.', '> **Instance Pose Estimation:** This field is dedicated to estimating the 6 Degrees of Freedom (6 DoF) pose of an unseen instance. Some methods [57,58] match local point features and utilize RANSAC to optimize the relative pose. Others [5,59] first select the closest template frame and then conduct pose refinement on the cropped object patch. Most of these methods typically assume perfect instance detection, meaning they crop the instance from the query image using ground-truth bounding boxes and then estimate the pose on this small, object-centered patch. Our VoxDet can serve as an effective front-end for such systems, providing robust detection in cluttered environments and thereby enhancing the reliability of the overall detection-pose estimation framework.', "> **Instance Retrieval:** Instance retrieval aims to retrieve a specific instance from a large database given a single reference image [60-65]. Early work extracted local point features from template and query patches for image matching [61,49], which often suffered from limited discriminative capability. More recent work leverages deep neural networks for a global representation of the instance [62-65], comparing these with features extracted from query images. However, most of these methods construct 2D template features from the reference, making their representation unaware of the instance's 3D geometry, which can lead to a lack of robustness under severe pose variations. Furthermore, instance retrieval methods often require high-resolution query images for discriminative features, whereas instances in our cluttered query images can be at low resolution, posing additional challenges for these approaches.", '67a83,84', '> Section: Acknowledgments', '> This work was sponsored by SONY Corporation of America #1012409. 
This work used Bridges-2 at PSC through allocation cis220039p from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program which is supported by NSF grants #2138259, #2138286, #2138307, #2137603, and #213296. The authors would also like to express their sincere gratitude to the developers of BlenderProc2 [43].', '69,74c86,87', '< Section: VoxDet vs Gen6D', '< VoxDet vs OLN !"#$ LM-O YCB-V Gen6D [5], OLN DINO [14,9] on the three benchmarks. VoxDet shows better robustness under pose variance and occlusion. These qualitative comparisons can be better visualized in our supplementary video.', '< RoboTools', '< correlation maps are sent to an MLP for classification score. In 2D Relation [1], we substitute the simple depth-wise convolution in 2D Corr. with the spatial and channel relation proposed in [1]. In FSDet [45], the depth-wise convolution in 2D Corr is replaced by the distance defined in [45]. Since they are geometry-unaware, we find all the 2D techniques worse than our QVM module. Additionally, we designed a Local Matching baseline [49,48]. In Local Matching, we first extract local key points from the reference images and proposals using SuperPoint [49]. Then the points descriptors are matched by SuperGlue [48]. We take the mean matching score of all the points in the proposal as their classification score. We find such an implementation, though geometry-invariant, falls short in our task since it lacks semantic representation of the whole instance.', '< known template poses. Most of these methods usually assume the instance detection is perfect, i.e., they crop the instance out of the query image with the ground truth box and estimate the pose on the small object-centered patch. 
Our VoxDet can serve as their front-end, which is robust to cluttered environments, thus making the detection-pose estimation framework more reliable.', '< Instance Retrieval hopes to retrieve a specific instance from a large database with a single reference image [60][61][62][63][64][65]. Some early work extracts local point features from template and query patch for image matching [61,49], which may suffer from poor discriminative capability. More recent work resorts to the deep neural network for a global representation of the instance [62][63][64][65], which is compared with the features from query images. However, most of them construct 2D template features from the reference, so that their representation is unaware of the 3D geometry of the instance, which may not be robust under severe pose variation. Besides, instance retrieval methods usually require high-resolution query images for the discriminative features, while the instance in our cluttered query image could be in low-resolution, which sets additional barriers to these approaches.', '---', '> Section: Deep Voxel and Activation Visualization', '> Figure 7: Visualization of the high activation grids during matching. As the query instance rotates along a certain axis, the location of the high-activated grids roughly rotates in the corresponding direction, demonstrating geometry awareness.', '76,81d88', '< Section: Acknowledgement', '< This work was sponsored by SONY Corporation of America #1012409. This work used Bridges-2 at PSC through allocation cis220039p from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program which is supported by NSF grants #2138259, #2138286, #2138307, #2137603, and #213296. The authors would also like to express the sincere gratitute on the developers of BlenderProc2 [43].', '< ', '< Section: Support Query and Voxel Activation', '< Support Query and Voxel Activation Figure 7: Visualization of the high activation grids during matching. 
As query instance rotates along a certain axis, the location of the high-activated grids roughly rotates in the corresponding direction.', '< ', '83c90,92', '< Figure 8: Reconstruct results of VoxDet on unseen instances. The voxel representation in VoxDet can be decoded with a relative rotation and synthesize novel views, which demonstrate the geometry embedded in our learned voxels.', '---', "> **Detection Visualization:** The qualitative comparison is depicted in Fig. 6, where we compare VoxDet with two of the most robust baselines, OLN DINO and OLN Corr. We observe that 2D methods can easily falter if the pose of an instance is not explicitly present in the reference images (e.g., the second query image in the first row), whereas VoxDet accurately identifies it. Furthermore, 2D matching exhibits less robustness under occlusion, where the instance's appearance can significantly differ. VoxDet effectively overcomes these challenges, thanks to its learned 3D geometry. More detailed visualizations and qualitative comparisons are provided in Appendix C.", "> **Deep Voxel Visualization:** To further validate the geometry-awareness of our learned voxel representation, we present a deep visualization in Fig. 7. Here, the gradient of the matching score is backpropagated to the template voxel, and we visualize the activation value of each grid. Surprisingly, we discover that as the orientation of the query instance changes, the activated regions within our voxel representations accurately mirror the true rotation. This compellingly demonstrates that the voxel representation in VoxDet is inherently aware of the instance's orientation.", '> **Reconstruction Visualization:** The voxel representation in VoxDet can be decoded to synthesize novel views, even for previously unseen instances, as demonstrated in Fig. 8. 
The voxel, pre-trained on 9,500 instances, is capable of approximately reconstructing the geometry of novel instances, providing strong evidence for the embedded geometric knowledge.', '85,86c94,96', '< Section: Detection Visualization', "< The qualitative comparison is depicted in Fig. 6, where we compare VoxDet with the two most robust baselines, OLNDINO and OLNCorr.. We notice that 2D methods can easily falter if the pose of an instance is not seen in the reference, e.g., 2-nd query image in the 1-st row, while VoxDet still accurately identifies it. Furthermore, 2D matching exhibits less robustness under occlusion, where the instance's appearance could significantly differ. VoxDet can effectively overcome these challenges thanks to its learned 3D geometry. More visualizations and qualitative comparisons see Appendix C. Deep Voxels Visualization To better validate the geometry-awareness of our learned voxel representation, we present the deep visualization in Fig. 7. The gradient of the matching score is backpropagated to the template voxel and we visualze the activation value of each grid. Surprisingly, we discover that as the orientation of the query instance changes, the activated regions within our voxel representations accurately mirror the true rotation. This demonstrates that the voxel representation in VoxDet is aware of the orientation of the instance. Reconstruction Visualization The voxel representation in VoxDet can be decoded to synthesize novel views, even for unseen instances, which is demonstrated in Fig. 8. The voxel, pre-trained on 9500 instances, is capable of approximately reconstructing the geometry of unseen instances.", '---', '> Section: Discussions and Conclusion', "> **Conclusion:** This work introduces VoxDet, a novel and highly effective approach for detecting novel instances using multi-view reference images. 
VoxDet stands as a pioneering 3D-aware framework that inherently exhibits superior robustness to occlusions and significant pose variations. VoxDet's crucial contributions and insights stem from its geometry-aware Template Voxel Aggregation (TVA) module and a meticulously designed Query Voxel Matching (QVM) module, both specifically tailored for instance-level tasks. Owing to the learned instance geometry encoded in TVA and the robust matching mechanism in QVM, VoxDet significantly outperforms various 2D baselines and offers notably faster inference speed. Beyond these methodological advancements, we also introduce the first comprehensive instance detection training set, OWID-10k, and a challenging real-world RoboTools benchmark, fostering future research in this critical area.", '> **Limitations:** Despite its substantial strengths, VoxDet currently presents two potential limitations. Firstly, models trained exclusively on the synthetic OWID dataset may exhibit a domain gap when deployed in real-world scenarios, with further details presented in Appendix D. Secondly, our current framework assumes that the relative rotation matrices and instance masks (bounding boxes) for the reference images are known. While obtaining these may not always be straightforward in unconstrained environments, we demonstrate that the TVA module in VoxDet does not require extremely accurate rotation or 2D appearance information to function effectively. We present further experiments addressing the robustness of VoxDet to these input imperfections in Appendix E.', '88,89c98,99', '< Section: Discussions', "< Conclusion: This work introduces VoxDet, a novel approach to detect novel instances using multiview reference images. VoxDet is a pioneering 3D-aware framework that exhibits robustness to occlusions and pose variations. 
VoxDet's crucial contribution and insight stem from its geometryaware Template Voxel Aggregation (TVA) module and an exhaustive Query Voxel Matching (QVM) specifically tailored for instances. Owing to the learned instance geometry in TVA and the meticulously designed matching in QVM, VoxDet significantly outperforms various 2D baselines and offers faster inference speed. Beyond methodological contributions, we also introduce the first instance detection training set, OWID, and a challenging RoboTools benchmark for future research. Limitations: Despite its strengths, VoxDet has two potential limitations. Firstly, the model trained on the synthetic OWID dataset may exhibit a domain gap when applied to real-world scenarios, we present details in Appendix D. Secondly, we assume that the relative rotation matrixes and instance masks (box) for the reference images are known, which may not be straightforward to calculate. However, the TVA module in VoxDet doesn't require an extremely accurate rotation and 2D appearance. We present further experiments addressing these issues in Appendix E.", '---', '> Section: Supplementary Material', "> To ensure the full reproducibility of our model, we present complete implementation details in Appendix A. Our code library will be publicly released upon acceptance. We provide more extensive comparisons between our QVM module and various 2D matching/relation techniques [1,5,45] in Appendix B, further demonstrating QVM's superiority in instance-level 3D matching. Appendix C contains additional detection qualitative results. We also present a more in-depth discussion regarding the sim-to-real domain gap of VoxDet in Appendix D. To rigorously test the robustness of VoxDet under interference with the voxel representation, we display results obtained from intentionally flawed voxels in Appendix E. 
Finally, Appendix F provides extended discussions on related works, where we exhaustively compare VoxDet with existing instance-level tasks, including visual tracking, instance pose estimation, and instance retrieval.", '91,92c101,103', '< Section: Supplementary', '< To make our model fully reproducible, we present complete implementation details in Appendix A. Besides, our code library will be released upon acceptance. We report more comparisons between our QVM module and the 2D matching/relation techniques [1,5,45] in Appendix B to demonstrate the superiority of QVM in instance-level 3D matching. In Appendix C, we present more detection qualitative results. We further present some discussions about the sim2real domain gap of VoxDet in Appendix D. To test the robustness of VoxDet under interference on the voxel representation, we display results obtained from the flawed voxel in Appendix E. Finally, we provide extended related works discussions in Appendix F, where we exhaustively compare VoxDet with the existing instance-level tasks, including visual tracking, instance pose estimation, and instance retrieval.', '---', '> Section: A. Implementation Details', '> **Model Structure:** We adopt ResNet50 [46] with a Feature Pyramid Network (FPN) [26] as our feature extractor ψ(•). The default multi-scale ROIAlign mechanism from [26] is leveraged to obtain the 2D proposal features, with dimensions set to N = 500, C = 256, and w = 7. In our 2D-3D mapping, we set C/d = 32 and d = 8, which results in a voxel feature dimension of C v = 256 and spatial dimensions D = 16, L = 14. All 3D convolutions within TVA and QVM utilize a kernel size of 3 and padding of 1, ensuring that the dimensions of the voxels remain consistent throughout both modules. For the Rot(•, •) function, we follow [10] by employing `torch.nn.functional.affine_grid()` and `torch.nn.functional.grid_sample()` functionalities. 
While the 2D-3D mapping effectively learns rotations in the physical world, it inherently sacrifices some semantic information in the feature channels during reshaping. Therefore, in QVM, we incorporate a global matching branch to retrieve this potentially lost semantic information. Specifically, we apply global average pooling on the support features to obtain a support vector k ∈ R^{1×C×1×1}. We then perform depth-wise convolution between k and F Q to generate a correlation map. Crucially, this correlation map preserves all semantic channels from the backbone ψ(•), compensating for information lost in the 2D-3D mapping. This map is then added to the voxel relation output R v (V S , Rot(V Q , RQ )) to compute the final score.', '> **Training Details:** In the first reconstruction stage, we set the loss weights as w recon = 10.0, w gan = 0.01, and w percep = 1.0. The model is trained for 16 epochs on 9,600 instances from the OWID dataset. We leverage the Adam optimizer [47] with a base learning rate of 5 × 10^-5 during this training phase. In the second detection stage, we initialize the 2D-3D mapping modules in TVA and QVM with the weights pre-trained during reconstruction. VoxDet first learns the detection task without rotation estimation; specifically, the loss weights are set as w 1 = w 2 = w 3 = w 4 = w 5 = 1.0 and w 6 = 0 for the initial 10 epochs, using SGD as an optimizer with a base learning rate of 0.02. During this stage, the 2D-3D mapping part is trained with 1/10th of the base learning rate. In the final epoch, VoxDet learns rotation estimation while keeping the detection part fixed (i.e., w 1 = w 2 = w 3 = w 4 = w 5 = 0.0, w 6 = 1.0). Supervising rotation is optional for VoxDet rather than a key requirement; it improves performance slightly, by 1-2%.', '94,96c105,110', '< Section: A Implementation Details', '< Model Structure: We adopt ResNet50 [46] with feature pyramid network [26] as our feature extractor ψ(•). 
The default multi-scale ROIAlign in [26] is leveraged to obtain the 2D proposal features, where the dimensions are N = 500, C = 256, w = 7. In our 2D-3D mapping, we set C/d = 32, d = 8, which results in the voxel feature dimension C v = 256, D = 16, L = 14. All the 3D convolutions in TVA and QVM take kernel size as 3 and the padding equals to 1, so that the dimension of the voxels remains the same throughout the two modules. For the Rot(•, •) function, we have followed [10] to use torch.nn.functional.affine_grid() and torch.nn.functional.grid_sample() functionalities. Though the 2D-3D mapping can learn the rotations in the physical world, it sacrifices some semantics information in the feature channels when reshaping. Therefore, in QVM, we have a global matching branch to retrieve the lost semantic information. To be more specific, we apply global average pooling on the support features to get a support vector k ∈ R 1×C×1×1 . Then we adopt depth-wise convolution between k and F Q to get a correlation map. Note that this correlation map preserved all the semantic channels from the backbone ψ(cot), so that the lost information in the 2D-3D mapping. The map is added to the voxel relation output R v (V S , Rot(V Q , RQ )) for the final score.', '< Training Details: In the first reconstruction stage, we set the loss weights as w recon = 10.0, w gan = 0.01, w percep = 1.0. The model is trained for 16 epoch on the 9600 instances from OWID datasets. We leveraged Adam optimizer [47] with a base learning rate of 5 × 10 -5 during training. In the second detection stage, we initialize the 2D-3D mapping modules in TVA and QVM with the reconstruction pre-trained weights. VoxDet first only learns the detection task, without learning the rotation estimation, i.e., the loss weights are set as w 1 = w 2 = w 3 = w 4 = w 5 = 1.0, w 6 = 0 in the first 10 epochs, where SGD is leveraged as an optimizer with 0.02 base learning rate. 
Note that in this stage, the 2D-3D mapping part only takes 1  10 of the base learning rate. Then in the final epoch, VoxDet learns the rotation estimation with the detection part fixed, i.e., w 1 = w 2 = w 3 = w 4 = w 5 = 0.0, w 6 = 1.0. However, supervising rotation is not the key requirements and is optional for VoxDet. It improves the performance slightly by 1 ∼ 2%.', '---', '> Section: B. More Matching Module Comparisons', '> Table 5: Comparison with different types of matching modules. We compare QVM with the correlation in [5], class-level relation proposed in [1], and the class distance defined in FSDet [45]. We compare QVM with more matching techniques in Table 5, where the averaged results on the cluttered LM-O [17] and RoboTools benchmark are reported. We first ablate the Voxel Relation module in QVM, which results in QVM †. Specifically, all Voxel Relation operations in QVM † are replaced by a simple depth-wise convolution. This involves first applying global average pooling on the template voxel to obtain a feature vector, which is then used as the convolution kernel to calculate the correlation voxel from the queries. We observe that such a naive design leads to a noticeable performance drop, validating the efficacy of our Voxel Relation operator. For all other methods, we utilized the same open-world detector to generate universal proposals, which are then matched with the template images using various matching techniques. To be more specific:', '> *   **2D Corr. [5]:** This method constructs support vectors from every reference image. Subsequently, depth-wise convolution is performed between each support vector and the proposal patch. The resulting correlation maps are then fed into an MLP for classification score prediction.', '> *   **2D Relation [1]:** In this approach, we substitute the simple depth-wise convolution used in 2D Corr. 
with the spatial and channel relation mechanism proposed in [1].', '> *   **FSDet [45]:** Here, the depth-wise convolution in 2D Corr. is replaced by the distance metric defined in [45].', '> Since all these 2D techniques are inherently geometry-unaware, we find that they consistently perform worse than our proposed QVM module. Additionally, we designed a **Local Matching** baseline [49,48]. In Local Matching, we first extract local key points from the reference images and proposals using SuperPoint [49]. These point descriptors are then matched by SuperGlue [48]. We take the mean matching score of all points within a proposal as its classification score. We observe that such an implementation, although geometry-invariant at the local feature level, falls short in our instance detection task because it lacks a holistic semantic representation of the entire instance.', '98,99c112,113', '< Section: B More Matching Module Comparison', '< Table 5: Comparison with different types of matching module. We compare QVM with the correlation in [5], class-level relation proposed in [1], and the class distance defined in FSDet [45]. We compare QVM with more matching techniques in Table 5, where the averaged results onthe cluttered LM-O [17] and RoboTools benchmark are reported. We first ablate the Voxel Relation module in QVM, which results in QVM † . Specifically, all the Voxel Relation in QVM † are replaced by a simple depth-wise convolution, i.e., we first apply global average pooling on the template voxel to get a feature vector, which is then taken as the convolution kernel to calculate the correlation voxel from the queries. We can see such a naive design will result in a performance drop. For all the rest methods, we used the same open-world detector to obtain the universal proposals, which are then matched with the template images using different matching techniques. To be more specific, 2D Corr. [5] constructs support vectors from every reference image. 
Then, depthwise convolution is conducted between each support vector and the proposal patch. The resulting', '---', '> Section: C. More Detection Visualizations', '> We present more detailed detection qualitative comparisons in Fig. 9. VoxDet, highlighted in red, is compared against three prominent baselines: DTOID [6], Gen6D [5], and OLN DINO [14,9]. Compared with previous instance detectors [6,5], VoxDet demonstrates superior robustness under significant orientation variations and severe occlusions, a direct benefit of its learned geometric knowledge. For example, in the LM-O benchmark (second column), when the duck is partially occluded and the egg box is presented in different orientations, VoxDet accurately identifies them, whereas Gen6D fails. Furthermore, compared with similarity matching approaches [9], VoxDet exhibits a stronger capability to distinguish between visually similar yet geometrically distinct instances via its QVM module. For instance, in the RoboTools benchmark (third column), the desired instance might be confused with a motor that possesses similar appearances but fundamentally different geometry. Our VoxDet effectively discovers such geometric differences and makes correct classifications, while similarity matching approaches fall short, even when utilizing features from powerful backbones like DINO [9] (which is stronger than ResNet50 [46]).', '101,102c115,116', '< Section: C More Detection Visualizations', '< We present more detection qualitative comparisons in Fig. 9. VoxDet, in red, is compared with three baselines, DTOID [6], Gen6D [5], and OLN DINO . Compared with previous instance detectors [6,5], VoxDet is more robust under orientation variation and severe occlusion by virtue of the learned geometric knowledge. For example, in the LM-O benchmark, second column, when the duck is partially occluded and the egg box is in different orientations, VoxDet can still find them while Gen6D fails. 
Compared with similarity matching [9], VoxDet can better distinguish similar instances via the QVM module. For instance, in the RoboTools benchmark, the third column, the desired instance could be distracted by the motor, which has similar appearances but different geometry. Our VoxDet can discover such geometric differences and make correct classification, while the similarity matching falls short even if the feature from DINO [9] is stronger than ResNet50 [46].", '---', '> Section: D. Sim-to-Real Comparison', "> VoxDet is exclusively trained on our synthetic dataset, OWID-10k. We observe that the model exhibits a domain gap when directly transferred to real-world images, particularly evident on the RoboTools benchmark. On the synthetic-real datasets, LM-O [17] and YCB-V [18], our model consistently outperforms methods trained on real images. However, it shows certain limitations on the fully real RoboTools test set. For instance, Gen6D [5], primarily trained on real images, reports 17.0 mAR, 35.5 AR 50, and 14.3 AR 75. While its AR 50 is higher than VoxDet's (23.6), our model performs better on harder metrics like AR 75 (20.5). When compared with cutting-edge foundation models trained on large-scale real images, our model still presents opportunities for improvement. For example, OLN CLIP [14,19] achieves 11.0 mAR, 20.8 AR 50, and 9.2 AR 75, which is lower than VoxDet. However, OLN DINO [14,13] can outperform VoxDet on RoboTools with over 30 mAR. We conclude that leveraging the powerful feature representations from concurrent 2D foundation models [13] could serve as a stronger backbone for VoxDet to mitigate the domain gap issue. Developing a geometry-aware, robust voxel representation learned from such foundation models represents a promising direction for our future work. 
It is also important to acknowledge that VoxDet currently assumes known instance masks and poses for the reference video, which may introduce noise during real-world deployment.", '104,109c118,121', '< Section: D Sim-to-Real Comparison', '< VoxDet is entirely trained on synthetic dataset, OWID. We observe that the model shows some domain gap when transferred to real-world images like RoboTools. On the synthetic-real datasets, LM-O [17] and YCB-V [18], our model easily outperforms those trained on real images, while it shows limitations in fully real test set RoboTools. For example, Gen6D [5] is mainly trained on real-images, which reports 17.0 mAR, 35.5 AR 50 , and 14.3 AR 75 . Its AR 50 is higher than VoxDet (23.6) while in harder metrics like AR 75 , our model works better (20.5). Compared with the cutting edge foundation models that are trained on large-scale real images, our model still shows spaces for improvement. For example, OLN CLIP achieves 11.0 mAR, 20.8 AR 50 , and 9.2 AR 75 , which is worse than VoxDet. Yet, OLN DIN O [13] can outperform VoxDet in RoboTools with over 30 mAR. We conclude that the feature representation from the concurrent 2D foundation model [13] could be a stronger backbone for VoxDet to overcome the domain gap issue. Learning a geometry-aware strong voxel representation from such foundation model will be one of our future work. VoxDet assumes known instance masks and poses for the reference video, which may have some noise during realworld deployment.', '< ', '< Section: E Performance under Flawed Voxel', '< To quantitatively analysis the robustness of VoxDet under flawed Voxels, we present its results on RoboTools when the reference video is disturbed in appearance and geometry.', '< Add noise on the reference image patches : We tried to add random shift on the cropped area in the reference images, resulting in inaccurate instance appearance. The results on RoboTools are shown in Table 6. 
We conclude that even when we disturb around 65% of the voxel (30% shift on each 2D patch), the model still works, which means VoxDet is robust to appearance noise. Add noise on the relative poses : We tried to add random error on the pose of the reference images, resulting in inaccurate instance geometry. When we add as large as 15 degree angular error, the performance (AR 50 ) decreased from 23.6 to 20.4. We conclude that VoxDet is not very sensitive to the geometry noise.', '---', '> Section: E. Performance under Flawed Voxel Inputs', '> To quantitatively analyze the robustness of VoxDet under imperfect voxel inputs, we present its performance on the RoboTools benchmark when the reference video is subjected to disturbances in both appearance and geometry.', "> **Adding Noise to Reference Image Patches:** We introduced random shifts to the cropped areas within the reference images, leading to inaccurate instance appearance. The results on RoboTools are presented in Table 6. We conclude that even when approximately 65% of the voxel is disturbed (corresponding to a 30% shift on each 2D patch), the model maintains effective performance. This demonstrates VoxDet's inherent robustness to appearance noise.", '> **Adding Noise to Relative Poses:** We also introduced random angular errors to the poses of the reference images, resulting in inaccurate instance geometry. When we applied a significant angular error of up to 15 degrees, the performance (AR 50) decreased from 23.6 to 20.4. This indicates that VoxDet is not overly sensitive to moderate levels of geometric noise in the input poses, further highlighting its robustness.', '111,114c123,124', '< Section: F Extended Related Works', "< Visual Object Tracking aims to localize a general target instance in a video, given its initial state in the first frame. 
Early methods adopt discriminative correlation filters [50][51][52], where the calculation in the frequency domain is so efficient that real-time speed can be achieved on a single CPU. More recently, methods are developed on Siamese Network [53] and Transformers [54][55][56]. Unlike detection, object tracking has a strong temporal consistency assumption, i.e., the location and appearance of the instance in the next frame do not largely vary from the previous frame. So that they only conduct detection/matching in the small search region with a single 2D template, which can't work for our whole image detection setting.", '< Instance Pose Estimation is developed to estimate the 6 DoF pose of an unseen instance. Some of them [57,58] match the local point features and resort to RANSAC to optimize the relative pose.', '< While others [5,59] first selects the closest template frame and then conducts pose refinement on the', '---', '> Section: F. Extended Related Works', '> **Visual Object Tracking:** Visual object tracking aims to localize a general target instance within a video, given its initial state in the first frame. Early methods adopted discriminative correlation filters [50-52], where frequency domain calculations enabled real-time performance on a single CPU. More recently, advancements have been made with methods based on Siamese Networks [53] and Transformers [54-56]. Unlike object detection, object tracking relies on a strong temporal consistency assumption; specifically, the location and appearance of the instance in subsequent frames are not expected to vary significantly from the previous frame. Consequently, these methods typically perform detection or matching within a small search region using a single 2D template, which is fundamentally unsuited for our whole-image novel instance detection setting.', '118,130c128,140', '< [b0] B Li; C Wang; P Reddy; S Kim; S Scherer (2022). AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration. 
', '< [b1] H Hu; S Bai; A Li; J Cui; L Wang (2021-06). Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection. ', '< [b2] Y Li; H Mao; R Girshick; K He (2022). Exploring Plain Vision Transformer Backbones for Object Detection. ', '< [b3] I Misra; R Girdhar; A Joulin (2021). An End-to-End Transformer Model for 3D Object Detection. ', '< [b4] Y Liu; Y Wen; S Peng; C Lin; X Long; T Komura; W Wang (2022). Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images. ', '< [b5] J.-P Mercier; M Garon; P Giguere; J.-F Lalonde (2021). Deep Template-based Object Instance Detection. ', '< [b6] A Osokin; D Sumin; V Lomakin (2020). OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features. ', '< [b7] P Ammirato; C.-Y Fu; M Shvets; J Kosecka; A C Berg (2018). Target Driven Instance Detection. ', '< [b8] M Caron; H Touvron; I Misra; H Jégou; J Mairal; P Bojanowski; A Joulin (2021). Emerging Properties in Self-supervised Vision Transformers. ', '< [b9] Z Lai; S Liu; A A Efros; X Wang (2021). Video Autoencoder: Self-Supervised Disentanglement of Static 3D Structure and Motion. ', '< [b10] H.-Y F Tung; R Cheng; K Fragkiadaki (2019). Learning Spatial Common Sense with Geometry-Aware Recurrent Networks. ', '< [b11] T Nguyen-Phuoc; C Li; L Theis; C Richardt; Y.-L Yang (2019). HoloGAN: Unsupervised Learning of 3D Representations from Natural Images. ', '< [b12] M Oquab; T Darcet; T Moutakanni; H Vo; M Szafraniec; V Khalidov; P Fernandez; D Haziza; F Massa; A El-Nouby (2023). Dinov2: Learning Robust Visual Features without Supervision. ', '---', '> [b0] B Li; C Wang; P Reddy; S Kim; S Scherer (2022). AirDet: Few-Shot Detection without Fine-tuning for Autonomous Exploration.', '> [b1] H Hu; S Bai; A Li; J Cui; L Wang (2021-06). Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection.', '> [b2] Y Li; H Mao; R Girshick; K He (2022). 
Exploring Plain Vision Transformer Backbones for Object Detection.', '> [b3] I Misra; R Girdhar; A Joulin (2021). An End-to-End Transformer Model for 3D Object Detection.', '> [b4] Y Liu; Y Wen; S Peng; C Lin; X Long; T Komura; W Wang (2022). Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images.', '> [b5] J.-P Mercier; M Garon; P Giguere; J.-F Lalonde (2021). Deep Template-based Object Instance Detection.', '> [b6] A Osokin; D Sumin; V Lomakin (2020). OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features.', '> [b7] P Ammirato; C.-Y Fu; M Shvets; J Kosecka; A C Berg (2018). Target Driven Instance Detection.', '> [b8] M Caron; H Touvron; I Misra; H Jégou; J Mairal; P Bojanowski; A Joulin (2021). Emerging Properties in Self-supervised Vision Transformers.', '> [b9] Z Lai; S Liu; A A Efros; X Wang (2021). Video Autoencoder: Self-Supervised Disentanglement of Static 3D Structure and Motion.', '> [b10] H.-Y F Tung; R Cheng; K Fragkiadaki (2019). Learning Spatial Common Sense with Geometry-Aware Recurrent Networks.', '> [b11] T Nguyen-Phuoc; C Li; L Theis; C Richardt; Y.-L Yang (2019). HoloGAN: Unsupervised Learning of 3D Representations from Natural Images.', '> [b12] M Oquab; T Darcet; T Moutakanni; H Vo; M Szafraniec; V Khalidov; P Fernandez; D Haziza; F Massa; A El-Nouby (2023). Dinov2: Learning Robust Visual Features without Supervision.', '132,138c142,148', '< [b14] A X Chang; T Funkhouser; L Guibas; P Hanrahan; Q Huang; Z Li; S Savarese; M Savva; S Song; H Su (2015). ShapeNet: An Information-Rich 3D Model Repository. ', '< [b15] J Collins; S Goel; K Deng; A Luthra; L Xu; E Gundogdu; X Zhang; T F Y Vicente; T Dideriksen; H Arora (2022). Abo: Dataset and Benchmarks for Real-World 3D Object Understanding. ', '< [b16] E Brachmann; A Krull; F Michel; S Gumhold; J Shotton; C Rother (2014). Learning 6D Object Pose Estimation using 3D Object Coordinates. 
', '< [b17] B Calli; A Singh; A Walsman; S Srinivasa; P Abbeel; A M Dollar (2015). The YCB Object and Model Set: Towards Common Benchmarks for Manipulation Research. ', '< [b18] A Radford; J W Kim; C Hallacy; A Ramesh; G Goh; S Agarwal; G Sastry; A Askell; P Mishkin; J Clark (2021). Learning Transferable Visual Models from Natural Language Supervision. ', '< [b19] R Girshick; J Donahue; T Darrell; J Malik (2014). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. ', '< [b20] R Girshick (2015). Fast R-CNN. ', '---', '> [b14] A X Chang; T Funkhouser; L Guibas; P Hanrahan; Q Huang; Z Li; S Savarese; M Savva; S Song; H Su (2015). ShapeNet: An Information-Rich 3D Model Repository.', '> [b15] J Collins; S Goel; K Deng; A Luthra; L Xu; E Gundogdu; X Zhang; T F Y Vicente; T Dideriksen; H Arora (2022). Abo: Dataset and Benchmarks for Real-World 3D Object Understanding.', '> [b16] E Brachmann; A Krull; F Michel; S Gumhold; J Shotton; C Rother (2014). Learning 6D Object Pose Estimation using 3D Object Coordinates.', '> [b17] B Calli; A Singh; A Walsman; S Srinivasa; P Abbeel; A M Dollar (2015). The YCB Object and Model Set: Towards Common Benchmarks for Manipulation Research.', '> [b18] A Radford; J W Kim; C Hallacy; A Ramesh; G Goh; S Agarwal; G Sastry; A Askell; P Mishkin; J Clark (2021). Learning Transferable Visual Models from Natural Language Supervision.', '> [b19] R Girshick; J Donahue; T Darrell; J Malik (2014). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.', '> [b20] R Girshick (2015). Fast R-CNN.', '140,145c150,155', '< [b22] J Redmon; S Divvala; R Girshick; A Farhadi (2016). You Only Look Once: Unified, Real-time Object Detection. ', '< [b23] J Redmon; A Farhadi (2017). YOLO9000: Better, Faster, Stronger. ', '< [b24]  (2018). YOLOv3: An Incremental Improvement. ', '< [b25] T.-Y Lin; P Dollár; R Girshick; K He; B Hariharan; S Belongie (2017). Feature Pyramid Networks for Object Detection. 
', '< [b26] L Qiao; Y Zhao; Z Li; X Qiu; J Wu; C Zhang (2021). DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection. ', '< [b27] X Wang; T Huang; J Gonzalez; T Darrell; F Yu (2020). Frustratingly Simple Few-Shot Object Detection. ', '---', '> [b22] J Redmon; S Divvala; R Girshick; A Farhadi (2016). You Only Look Once: Unified, Real-time Object Detection.', '> [b23] J Redmon; A Farhadi (2017). YOLO9000: Better, Faster, Stronger.', '> [b24] (2018). YOLOv3: An Incremental Improvement.', '> [b25] T.-Y Lin; P Dollár; R Girshick; K He; B Hariharan; S Belongie (2017). Feature Pyramid Networks for Object Detection.', '> [b26] L Qiao; Y Zhao; Z Li; X Qiu; J Wu; C Zhang (2021). DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection.', '> [b27] X Wang; T Huang; J Gonzalez; T Darrell; F Yu (2020). Frustratingly Simple Few-Shot Object Detection.', '147,152c157,162', '< [b29] K Joseph; S Khan; F S Khan; V N Balasubramanian (2021). Towards Open World Object Detection. ', '< [b30] A Gupta; S Narayan; K Joseph; S Khan; F S Khan; M Shah (2022). OW-DETR: Open-World Detection Transformer. ', '< [b31] A Kirillov; E Mintun; N Ravi; H Mao; C Rolland; L Gustafson; T Xiao; S Whitehead; A C Berg; W.-Y Lo (2023). Segment Anything. ', '< [b32] J L Schonberger; J.-M Frahm (2016). Structure-from-Motion Revisited. ', '< [b33] A W Harley; S K Lakshmikanth; F Li; X Zhou; H.-Y F Tung; K Fragkiadaki (2019). Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping. ', '< [b34] V Sitzmann; J Thies; F Heide; M Nießner; G Wetzstein; M Zollhofer (2019). Deepvoxels: Learning Persistent 3D Feature Embeddings. ', '---', '> [b29] K Joseph; S Khan; F S Khan; V N Balasubramanian (2021). Towards Open World Object Detection.', '> [b30] A Gupta; S Narayan; K Joseph; S Khan; F S Khan; M Shah (2022). OW-DETR: Open-World Detection Transformer.', '> [b31] A Kirillov; E Mintun; N Ravi; H Mao; C Rolland; L Gustafson; T Xiao; S Whitehead; A C Berg; W.-Y Lo (2023). 
Segment Anything.', '> [b32] J L Schonberger; J.-M Frahm (2016). Structure-from-Motion Revisited.', '> [b33] A W Harley; S K Lakshmikanth; F Li; X Zhou; H.-Y F Tung; K Fragkiadaki (2019). Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping.', '> [b34] V Sitzmann; J Thies; F Heide; M Nießner; G Wetzstein; M Zollhofer (2019). Deepvoxels: Learning Persistent 3D Feature Embeddings.', '154,166c164,176', '< [b36] V Sitzmann; M Zollhöfer; G Wetzstein (2019). Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. ', '< [b37] R Yang; G Yang; X Wang (2023). Neural Volumetric Memory for Visual Locomotion Control. ', '< [b38] W Wang; Y Hu; S Scherer (2021). Tartanvo: A Generalizable Learning-based VO. ', '< [b39] Y Zhou; C Barnes; J Lu; J Yang; H Li (2019). On the Continuity of Rotation Representations in Neural Networks. ', '< [b40] K Simonyan; A Zisserman (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ', '< [b41] H Yang; S Cai; H Sheng; B Deng; J Huang; X.-S Hua; Y Tang; Y Zhang (2022). Balanced and Hierarchical Relation Learning for One-Shot Object Detection. ', '< [b42] M Denninger; M Sundermeyer; D Winkelbauer; Y Zidan; D Olefir; M Elbadrawy; A Lodhi; H Katam (2019). Blenderproc. ', '< [b43] T.-Y Lin; M Maire; S Belongie; J Hays; P Perona; D Ramanan; P Dollár; C L Zitnick (2014). Microsoft COCO: Common Objects in Context. ', '< [b44] Y Xiao; R Marlet (2020). Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild. ', '< [b45] K He; X Zhang; S Ren; J Sun (2016). Deep Residual Learning for Image Recognition. ', '< [b46] D P Kingma; J Ba (2014). Adam: A Method for Stochastic Optimization. ', '< [b47] P.-E Sarlin; D Detone; T Malisiewicz; A Rabinovich (2020). SuperGlue: Learning Feature Matching with Graph Neural Networks. ', '< [b48] D Detone; T Malisiewicz; A Rabinovich (2018). Superpoint: Self-Supervised Interest Point Detection and Description. 
', '---', '> [b36] V Sitzmann; M Zollhöfer; G Wetzstein (2019). Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations.', '> [b37] R Yang; G Yang; X Wang (2023). Neural Volumetric Memory for Visual Locomotion Control.', '> [b38] W Wang; Y Hu; S Scherer (2021). Tartanvo: A Generalizable Learning-based VO.', '> [b39] Y Zhou; C Barnes; J Lu; J Yang; H Li (2019). On the Continuity of Rotation Representations in Neural Networks.', '> [b40] K Simonyan; A Zisserman (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition.', '> [b41] H Yang; S Cai; H Sheng; B Deng; J Huang; X.-S Hua; Y Tang; Y Zhang (2022). Balanced and Hierarchical Relation Learning for One-Shot Object Detection.', '> [b42] M Denninger; M Sundermeyer; D Winkelbauer; Y Zidan; D Olefir; M Elbadrawy; A Lodhi; H Katam (2019). Blenderproc.', '> [b43] T.-Y Lin; M Maire; S Belongie; J Hays; P Perona; D Ramanan; P Dollár; C L Zitnick (2014). Microsoft COCO: Common Objects in Context.', '> [b44] Y Xiao; R Marlet (2020). Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild.', '> [b45] K He; X Zhang; S Ren; J Sun (2016). Deep Residual Learning for Image Recognition.', '> [b46] D P Kingma; J Ba (2014). Adam: A Method for Stochastic Optimization.', '> [b47] P.-E Sarlin; D Detone; T Malisiewicz; A Rabinovich (2020). SuperGlue: Learning Feature Matching with Graph Neural Networks.', '> [b48] D Detone; T Malisiewicz; A Rabinovich (2018). Superpoint: Self-Supervised Interest Point Detection and Description.', '168,175c178,185', '< [b50] Y Li; C Fu; F Ding; Z Huang; G Lu (2020). AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization. ', '< [b51] B Li; C Fu; F Ding; J Ye; F Lin (2021). ADTrack: Target-Aware Dual Filter Learning for Real-Time Anti-Dark UAV Tracking. ', '< [b52] B Li; W Wu; Q Wang; F Zhang; J Xing; J Yan (2019). SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. 
', '< [b53] Z Cao; C Fu; J Ye; B Li; Y Li (2021). HiFT: Hierarchical Feature Transformer for Aerial Tracking. ', '< [b54] B Ye; H Chang; B Ma; S Shan; X Chen (2022). Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. ', '< [b55] Y Cui; C Jiang; L Wang; G Wu (2022). Mixformer: End-to-End Tracking with Iterative Mixed Attention. ', '< [b56] J Sun; Z Wang; S Zhang; X He; H Zhao; G Zhang; X Zhou (2022). Onepose: One-Shot Object Pose Estimation without CAD Models. ', '< [b57] Y He; Y Wang; H Fan; J Sun; Q Chen (2022). FS6D: Few-Shot 6D Pose Estimation of Novel Objects. ', '---', '> [b50] Y Li; C Fu; F Ding; Z Huang; G Lu (2020). AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization.', '> [b51] B Li; C Fu; F Ding; J Ye; F Lin (2021). ADTrack: Target-Aware Dual Filter Learning for Real-Time Anti-Dark UAV Tracking.', '> [b52] B Li; W Wu; Q Wang; F Zhang; J Xing; J Yan (2019). SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks.', '> [b53] Z Cao; C Fu; J Ye; B Li; Y Li (2021). HiFT: Hierarchical Feature Transformer for Aerial Tracking.', '> [b54] B Ye; H Chang; B Ma; S Shan; X Chen (2022). Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.', '> [b55] Y Cui; C Jiang; L Wang; G Wu (2022). Mixformer: End-to-End Tracking with Iterative Mixed Attention.', '> [b56] J Sun; Z Wang; S Zhang; X He; H Zhao; G Zhang; X Zhou (2022). Onepose: One-Shot Object Pose Estimation without CAD Models.', '> [b57] Y He; Y Wang; H Fan; J Sun; Q Chen (2022). FS6D: Few-Shot 6D Pose Estimation of Novel Objects.', '178,182c188,192', '< [b60] A Babenko; V Lempitsky (2015). Aggregating Local Deep Features for Image Retrieval. ', '< [b61] R Arandjelovic; P Gronat; A Torii; T Pajdla; J Sivic (2016). NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. ', '< [b62] T Ng; V Balntas; Y Tian; K Mikolajczyk (2020). 
Solar: Second-order loss and attention for image retrieval. ', '< [b63] A Gordo; J Almazán; J Revaud; D Larlus (2016). Deep Image Rretrieval: Learning Global Representations for Image Search. ', '< [b64] M Yang; D He; M Fan; B Shi; X Xue; F Li; E Ding; J Huang (2021). Dolg: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features. ', '---', '> [b60] A Babenko; V Lempitsky (2015). Aggregating Local Deep Features for Image Retrieval.', '> [b61] R Arandjelovic; P Gronat; A Torii; T Pajdla; J Sivic (2016). NetVLAD: CNN Architecture for Weakly Supervised Place Recognition.', '> [b62] T Ng; V Balntas; Y Tian; K Mikolajczyk (2020). Solar: Second-order loss and attention for image retrieval.', '> [b63] A Gordo; J Almazán; J Revaud; D Larlus (2016). Deep Image Retrieval: Learning Global Representations for Image Search.', '> [b64] M Yang; D He; M Fan; B Shi; X Xue; F Li; E Ding; J Huang (2021). Dolg: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features.', '215c225', '< Figure tab_0: ', '---', '> Figure tab_0:', '217c227', '< Caption: Train mAR AR 50 AR 75 AR 95 mAR AR 50 AR 75 AR 95 mAR AR 50 AR 75 Speed', '---', '> Caption: Table 1: Overall performance comparison on synthetic-real datasets LM-O [17] and YCB-V [18]. Compared with various 2D methods, including correlation [5], attention [6], and feature matching [9,19], our VoxDet holds superiority in both accuracy and efficiency. OLN* means the open-world object detector (OW Det.) [14] is jointly trained with the matching head while OLN denotes using fixed modules. † the model is trained on both synthetic dataset OWID and real images.', '255d264', '< ']
