QUTE: Quantifying Uncertainty in TinyML models with Early-exit-assisted ensembles for model-monitoring
TL;DR: QUTE is a resource-efficient early-exit ensemble method for on-device uncertainty quantification (UQ) and monitoring in tinyML, reducing model size by 59% and latency by 31% while detecting accuracy drops. It fits on a microcontroller with 256 KB of memory.
Abstract: Uncertainty quantification (UQ) provides a resource-efficient solution for on-device monitoring of tinyML models deployed remotely without access to true labels. However, existing UQ methods impose significant memory and compute demands, making them impractical for ultra-low-power, KB-sized tinyML devices. Prior work has attempted to reduce overhead by using early-exit ensembles to quantify uncertainty in a single forward pass, but these approaches still carry prohibitive costs. To address this, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE introduces additional output blocks at the final exit of the base network and distills early-exit knowledge into these blocks, forming a diverse yet lightweight ensemble. We show that QUTE delivers superior uncertainty quality on tiny models and comparable quality on larger models, with model sizes 59% smaller than the closest prior work. When deployed on a microcontroller, QUTE reduces latency by 31% on average. In addition, we show that QUTE excels at detecting accuracy-drop events, outperforming all prior works.
Lay Summary: Tiny AI models, known as TinyML, are being used in everything from wearable health monitors to environmental sensors. These models run directly on low-power devices — often no bigger than a coin — and help make fast decisions without needing to send data to the cloud. But there’s a problem: *how do we know when these tiny models are making a mistake*, especially when they’re deployed in remote or hard-to-reach places with no cloud access or human supervision?
That’s where **QUTE** comes in. QUTE is a new technique we’ve developed to help tiny AI models recognize when they’re uncertain — in other words, when their predictions might be wrong. This ability to estimate confidence (called uncertainty quantification) is crucial for safety, reliability, and smart decision-making in the field. Unlike other methods that are too large or slow for tiny devices, QUTE is designed to work efficiently on ultra-low-power hardware.
It works by adding small "checkpoints" inside the model and training lightweight prediction blocks that each learn something different. Together, they act like a mini-team of experts that can double-check each other’s confidence and flag when the model might be unsure — all without taking up much space or time.
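The "mini-team of experts" idea can be sketched in plain NumPy: several lightweight output heads share one backbone forward pass, and the spread of their averaged prediction serves as an uncertainty score. All sizes, weights, and names below are illustrative stand-ins, not the paper's actual architecture or training procedure (in QUTE the heads are additionally trained with distilled early-exit knowledge):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Shared backbone: a single hidden layer standing in for the base network.
D_IN, D_HID, N_CLASSES, N_HEADS = 8, 16, 3, 4
W_backbone = rng.normal(size=(D_IN, D_HID))
# Lightweight output heads attached at the final exit; each head would be
# trained to learn something different, so together they act as an ensemble.
heads = [rng.normal(size=(D_HID, N_CLASSES)) for _ in range(N_HEADS)]

def predict_with_uncertainty(x):
    h = np.tanh(x @ W_backbone)                          # one forward pass
    probs = np.stack([softmax(h @ Wh) for Wh in heads])  # (heads, classes)
    mean_p = probs.mean(axis=0)                          # ensemble average
    # Predictive entropy of the averaged distribution as the uncertainty score.
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum()
    return int(mean_p.argmax()), float(entropy)

x = rng.normal(size=D_IN)
label, unc = predict_with_uncertainty(x)
```

Because all heads reuse the same backbone activations, the ensemble costs only a few extra tiny matrix multiplies per inference rather than multiple full forward passes.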
In real-world tests, QUTE made decisions faster (31% lower delay), used less memory (59% smaller), and was better at spotting problems compared to previous approaches. This makes QUTE a strong choice for making AI on tiny devices both faster and more trustworthy.
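One way such uncertainty scores enable label-free monitoring is a simple sliding-window alarm: calibrate a baseline uncertainty on in-distribution data at deployment, then flag a possible accuracy drop when the running average rises well above it. This is a generic sketch of that monitoring pattern, not QUTE's actual detection method; the class name, window size, and threshold factor are all hypothetical:

```python
from collections import deque

class AccuracyDropMonitor:
    """Label-free monitor: flags when the model's average uncertainty
    over a sliding window rises well above its deployment baseline."""

    def __init__(self, window=50, factor=1.5):
        self.window = deque(maxlen=window)
        self.baseline = None
        self.factor = factor

    def calibrate(self, scores):
        # Uncertainty scores collected on in-distribution data at deploy time.
        self.baseline = sum(scores) / len(scores)

    def update(self, score):
        # Returns True when the windowed average exceeds the alert threshold.
        self.window.append(score)
        avg = sum(self.window) / len(self.window)
        return avg > self.factor * self.baseline

mon = AccuracyDropMonitor(window=10, factor=1.5)
mon.calibrate([0.20, 0.25, 0.22, 0.18])      # baseline ~0.21
quiet = [mon.update(s) for s in [0.20, 0.21, 0.19]]   # all False
alerts = [mon.update(0.9) for _ in range(10)]         # sustained high scores
```

Sustained high-uncertainty inputs push the windowed average past the threshold and trigger the alert, without ever needing true labels on-device.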
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: General Machine Learning->Hardware and Software
Keywords: ML Model monitoring, TinyML, Resource-efficient learning, Uncertainty quantification, Early-exit networks, Trust in ML, Lightweight neural networks, Ensemble learning, Failure detection
Submission Number: 13233