FluidEdge: Expediting Serverless Machine Learning Inference via Bottleneck-Aware Auto-Scaling on Edge SoCs

Borui Li, Tiange Xia, Weilong Wang, Jingyuan Zhang, Shuai Wang, Chenhong Cao, Zheng Dong

Published: 01 Dec 2025, Last Modified: 16 Mar 2026, Venue: IEEE Transactions on Mobile Computing, License: CC BY-SA 4.0
Abstract: Mobile applications based on machine learning (ML) increasingly rely on offloading to edge devices for low-latency, resource-efficient computation. Applying serverless computing to these ML applications on the edge offers a promising way to handle dynamic workloads while meeting user-specified latency service-level objectives (SLOs). However, existing serverless frameworks, with their coarse-grained data parallelism and rigid model partitioning, are inadequate for ML inference on widely adopted edge System-on-Chip (SoC) devices. This paper presents FluidEdge, an edge-native serverless inference framework. FluidEdge identifies bottleneck operators in ML models and addresses them through a novel fine-grained, latency-sensitive intra-function auto-scaling approach that dynamically scales inference bottlenecks during online serving. Additionally, it employs inter-function scaling to further prevent latency SLO violations and leverages the unified memory of edge SoCs for efficient data sharing during inference. Experimental results demonstrate that FluidEdge achieves a 37.4% latency improvement and a 67.3% to 87.6% reduction in SLO violations compared to the best-performing state-of-the-art serverless inference frameworks.