Keywords: TinyML, model compression, generative compression, microcontrollers, pointwise convolution, ECG classification, keyword spotting, efficient deep learning
TL;DR: We compress TinyML models by generating most 1×1 pointwise mixers once at load time from tiny per-layer codes, achieving a much better accuracy–flash trade-off on ECG and strong transfer to keyword spotting while preserving standard INT8 inference.
Abstract: Neural networks on microcontrollers are constrained
by kilobytes of flash/SRAM, where 1×1
pointwise (PW) mixers often dominate memory
even after INT8 quantization. We present
HYPERTINYPW, a compression-as-generation
method that replaces most stored PW weights
with generated weights: a shared micro-MLP
synthesizes PW kernels once at load time from
tiny per-layer codes, caches them, and executes
them with standard integer operators. This preserves
commodity MCU runtimes and incurs only
a one-off synthesis cost; steady-state inference
matches INT8 separable CNNs. Sharing a latent
basis across layers removes cross-layer redundancy,
while keeping PW1 in INT8 stabilizes
early, morphology-sensitive mixing. We also introduce
TinyML-faithful packed-byte accounting
(generator, heads/factorization, codes, kept PW1,
backbone) and a unified evaluation protocol with
validation-tuned thresholds and bootstrap CIs. On
three ECG benchmarks (Apnea-ECG, PTB-XL,
MIT-BIH), HYPERTINYPW improves the macro-F1
vs. flash Pareto front: at ∼225 kB it achieves near-iso
performance to a ∼1.4 MB CNN while being
6.31× smaller (84.15% fewer bytes), retaining
≥95% of large-model macro-F1. Beyond ECG,
HYPERTINYPW transfers to TinyML audio: on
Speech Commands keyword spotting it reaches
96.2% test accuracy (98.2% best validation), supporting
that generate-and-cache channel mixing
applies broadly to embedded sensing workloads
where repeated linear mixers dominate memory.
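The generate-and-cache idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer shapes, code and hidden sizes, the tanh micro-MLP, the random stand-in parameters, and the symmetric per-tensor INT8 quantization are all assumptions chosen only to show the data flow (per-layer code, shared generator, one-off synthesis, cached INT8 weights, integer 1×1 mixing).

```python
import numpy as np

# Hypothetical sizes, chosen for illustration only.
CODE_DIM, HIDDEN_DIM = 8, 16
LAYER_SHAPES = [(32, 16), (64, 32)]  # (C_out, C_in) of each generated PW layer

rng = np.random.default_rng(0)
max_out = max(c_out * c_in for c_out, c_in in LAYER_SHAPES)

# Shared micro-MLP. Random values stand in for trained generator parameters.
W1 = (0.1 * rng.standard_normal((CODE_DIM, HIDDEN_DIM))).astype(np.float32)
W2 = (0.1 * rng.standard_normal((HIDDEN_DIM, max_out))).astype(np.float32)

# One tiny code per layer; these replace the stored PW weights.
codes = [rng.standard_normal(CODE_DIM).astype(np.float32) for _ in LAYER_SHAPES]


def synthesize_pw(code, c_out, c_in):
    """Generate one PW kernel from its code and quantize it to INT8."""
    hidden = np.tanh(code @ W1)
    w = (hidden @ W2)[: c_out * c_in].reshape(c_out, c_in)
    # Assumed scheme: symmetric per-tensor quantization to [-127, 127].
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale


# Load time: synthesize every generated layer once and cache the result.
cache = [synthesize_pw(code, c_out, c_in)
         for code, (c_out, c_in) in zip(codes, LAYER_SHAPES)]


def pointwise_int8(x_q, w_q):
    """1x1 channel mixing on INT8 inputs with INT32 accumulation."""
    return w_q.astype(np.int32) @ x_q.astype(np.int32)


# Steady state: inference reads only the cached INT8 weights.
x_q = rng.integers(-127, 128, size=(16, 100)).astype(np.int8)  # (C_in, T)
acc = pointwise_int8(x_q, cache[0][0])                          # (C_out, T)
```

In this sketch the generator runs only while the cache is built, so the per-inference path is an ordinary integer matrix product over cached weights, which mirrors the abstract's claim that steady-state inference matches an INT8 separable CNN.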
Topics: Algorithms: Memory and compute-efficient optimizers, Benchmarks, Datasets, and Evaluation: Benchmarks for training, inference, and efficiency, ML for Systems: ML for systems infrastructure, Model Serving: Compression, quantization, pruning, distillation at system scale, Model Serving: Edge, mobile, and IoT systems, Model Serving: System optimizations for model serving
Submission Number: 11