Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores

ICLR 2026 Conference Submission13940 Authors

18 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: U-Net, segmentation, binary neural network, GPU Tensor Core
TL;DR: This paper introduces MBU‑Net, a cost‑aware masked‑binary U‑Net mapped to GPU Tensor Cores that delivers near full‑precision accuracy at near‑binary efficiency for segmentation on edge devices.
Abstract: Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource‑constrained edge devices. While U‑Net offers a favorable balance of accuracy and efficiency compared to large transformer‑based models, achieving real‑time performance on high‑resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware‑friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end‑to‑end implementations that deliver efficiency on general‑purpose GPUs. We make two empirical observations that guide our design. (1) An explicit zero state is essential: applying zero masking to binary U‑Net weights during training yields noticeable sparsity. (2) Quantization sensitivity is non‑uniform across layers. Motivated by these findings, we introduce Masked Binary U‑Net (MBU‑Net), obtained through a cost‑aware masking strategy that prioritizes masking where it yields the highest accuracy‑per‑cost, reconciling accuracy with near‑binary efficiency. To realize these gains in practice, we develop a GPU execution framework that maps MBU‑Net to Tensor Cores via a subtractive bit‑encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs. Across three segmentation benchmarks, MBU‑Net attains near full‑precision accuracy (3\% average drop) while delivering a 2.04$\times$ speedup and 3.54$\times$ energy reduction over a 16‑bit floating‑point U‑Net. The code will be released to the public alongside this publication.
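The abstract does not spell out the subtractive bit‑encoding, but one plausible reading is the standard decomposition of a masked‑binary (ternary) weight matrix into a difference of two binary matrices, so that the ternary matmul reduces to two binary matmuls of the kind 1‑bit Tensor Core BMMA instructions compute (AND + popcount on 0/1 operands). The NumPy sketch below is an illustration under that assumption, not the paper's implementation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary activations in {0, 1}; masked binary weights in {-1, 0, +1},
# where 0 is the explicit masked state the paper argues is informative.
A = rng.integers(0, 2, size=(4, 8))    # activations, shape (batch, in)
W = rng.integers(-1, 2, size=(8, 5))   # masked binary weights, shape (in, out)

# Assumed subtractive encoding: W = W_pos - W_neg with both factors binary.
W_pos = (W == 1).astype(np.int64)
W_neg = (W == -1).astype(np.int64)

# Each term emulates a 1-bit BMMA: for 0/1 operands, popcount(AND(row, col))
# equals an ordinary integer dot product, so A @ W_pos and A @ W_neg stand in
# for two binary Tensor Core matmuls; subtracting recovers the ternary result.
out = A @ W_pos - A @ W_neg

# Sanity check: the subtractive form matches the direct ternary matmul.
assert np.array_equal(out, A @ W)
```

In this reading, masking adds no extra instruction cost on the Tensor Core path: a masked (zero) weight is simply absent from both binary factors, which is consistent with the paper's claim of near‑binary efficiency.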
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13940