TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES
Keywords: On-device VLM, Efficient Inference, Software-Hardware Co-Design, Quantization, NPU, GPU
TL;DR: The world's smallest battery-powered device that can run VLMs
Abstract: Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware–software co-design inference framework for LMMs that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its most suitable accelerator. The key insight is that, once decomposed, these modules can be dynamically offloaded at module granularity across the accelerators of unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework on a compact, battery-powered device capable of running LMMs entirely on-device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, yet achieves higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%, enabling a battery-powered device to run LLaVA-OneVision-Qwen2-0.5B with a camera for nearly 20.8 hours.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5282