Abstract: Online services in modern datacenters use Remote Procedure Calls (RPCs) to communicate between different software layers. Although RPCs use just a few small functions, inefficient RPC handling can cause delays to propagate across the system and degrade end-to-end performance. Prior work has reduced RPC processing time to less than $1\,\mu\mathrm{s}$, which shifts the bottleneck to RPC scheduling. Existing RPC schedulers suffer from high overheads, cannot effectively utilize high core-count CPUs, or do not adapt to different traffic patterns. To address these shortcomings, we present ALTOCUMULUS,$^{1}$ a scalable software-hardware co-design that schedules RPCs at nanosecond scales. ALTOCUMULUS provides a proactive scheduling scheme and a low-overhead messaging mechanism on top of a decentralized user runtime. ALTOCUMULUS also offers direct access from user space to a set of simple hardware primitives that quickly migrate long-latency RPCs. We evaluate ALTOCUMULUS with synthetic workloads and an end-to-end in-memory key-value store application under real-world traffic patterns. On 16-core systems, ALTOCUMULUS improves throughput by 1.3-24.6$\times$ under a 99th-percentile latency of $\lt 300\,\mu\mathrm{s}$ and reduces tail latency by up to 15.8$\times$ over current state-of-the-art software and hardware schedulers. For 256-core systems, integrating ALTOCUMULUS with either a hardware-optimized NIC or a commodity PCIe NIC can improve throughput by $2.8\times$ or $2.7\times$, respectively, under a 99th-percentile latency of $\lt 8.5\,\mu\mathrm{s}$.

$^{1}$ Automatic Concurrent Migration Load-balancing Strategy (AutoCuMuLuS), homophonic with "altocumulus," a type of cloud in meteorology that is fragmented into separate patches or nodes.