EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

TMLR Paper7963 Authors

17 Mar 2026 (modified: 27 Mar 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve similarly strong accuracy–efficiency trade-offs, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder–decoder design. We first adapt a large DINOv3-pretrained ViT to object detection and use it as a task-specialized teacher to distill rich representations into compact student backbones. We further improve efficiency by replacing standard patch embedding with a lightweight convolutional stem and by constructing multi-scale features with simple interpolation and linear projection instead of costly feature pyramids. The resulting detection-distilled representation transfers directly to instance segmentation and human pose estimation through lightweight task-specific prediction modules. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR-Seg while using substantially fewer parameters and without additional Objects365 pretraining. For pose estimation, ECPose-X reaches 74.8 AP, 3.2 AP above YOLO26-Pose-X (71.6 AP). These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. The code and pretrained models for reproducing our results will be released upon publication.
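
As a rough illustration of the "simple interpolation and linear projection" idea mentioned in the abstract, the following minimal pure-Python sketch resizes a single-scale feature map to several resolutions and applies a per-location linear layer at each one. It is not the authors' implementation: the function names (`resize_bilinear`, `linear_project`, `build_multi_scale`), the choice of bilinear interpolation, and the scale factors are all assumptions made for the example, since the abstract does not specify them.

```python
def resize_bilinear(feat, out_h, out_w):
    """Resize an H x W x C feature map (nested lists) by bilinear interpolation."""
    in_h, in_w, channels = len(feat), len(feat[0]), len(feat[0][0])
    out = []
    for i in range(out_h):
        # Map the output pixel center back to input coordinates, clamped to the map.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * feat[y0][x0][c] + wx * feat[y0][x1][c])
                + wy * ((1 - wx) * feat[y1][x0][c] + wx * feat[y1][x1][c])
                for c in range(channels)
            ])
        out.append(row)
    return out


def linear_project(feat, weight, bias):
    """Apply one linear layer (C_in -> C_out) independently at every location."""
    return [
        [
            [sum(w * v for w, v in zip(w_row, pixel)) + b
             for w_row, b in zip(weight, bias)]
            for pixel in row
        ]
        for row in feat
    ]


def build_multi_scale(feat, scales, weights, biases):
    """Hypothetical helper: one resized and projected map per scale factor."""
    h, w = len(feat), len(feat[0])
    levels = []
    for s, weight, bias in zip(scales, weights, biases):
        resized = resize_bilinear(feat, max(1, round(h * s)), max(1, round(w * s)))
        levels.append(linear_project(resized, weight, bias))
    return levels


# Usage: a constant 4 x 4 map with 2 channels, projected to 1 channel per level.
feat = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]
scales = [2.0, 1.0, 0.5]           # assumed scale factors, not from the paper
weights = [[[1.0, 1.0]]] * 3       # each level: 2 input channels -> 1 output channel
biases = [[0.0]] * 3
levels = build_multi_scale(feat, scales, weights, biases)
```

The appeal of such a design is that each level costs only a resize plus a per-location projection, with no cross-scale fusion layers as in a conventional feature pyramid.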

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Xavier_Alameda-Pineda1

Submission Number: 7963
