FluidEdge: Expediting Serverless Machine Learning Inference via Bottleneck-Aware Auto-Scaling on Edge SoCs
Abstract: Mobile applications based on machine learning (ML) increasingly rely on offloading to edge devices for low-latency, resource-efficient computation. Applying serverless computing to these ML applications on the edge offers a promising solution for handling dynamic workloads while meeting user-specified latency service-level objectives (SLOs). However, existing serverless frameworks, with their coarse-grained data parallelism and rigid model partitioning, are inadequate for ML inference on widely adopted edge System-on-Chip (SoC) devices. This paper presents FluidEdge, an edge-native serverless inference framework. FluidEdge identifies bottleneck operators in ML models and addresses them through a novel fine-grained, latency-sensitive intra-function auto-scaling approach that dynamically scales inference bottlenecks during online serving. Additionally, it employs inter-function scaling to further prevent latency SLO violations and leverages the unified memory of edge SoCs for efficient data sharing during inference. Experimental results demonstrate that FluidEdge achieves a 37.4% latency improvement and a 67.3%-87.6% reduction in SLO violations compared to the best-performing state-of-the-art serverless inference frameworks.
External IDs: doi:10.1109/tmc.2025.3592334