Abstract: It is important to perform measurement and monitoring in
order to understand network performance and debug problems encountered by distributed applications. Despite many
products and much research on these topics, in the context
of data centers, performing accurate measurement at scale in
near real-time has remained elusive. There are two main approaches to network telemetry, switch-based and end-host-based, each with its own advantages and drawbacks.
In this paper, we attempt to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface
cards (NICs). We begin with a signal processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this framework,
we propose SIMON, an accurate and scalable measurement
system for data centers that reconstructs key network state
variables like packet queueing times at switches, link utilizations, and queue and link compositions at the flow level. We
use two ideas to speed up SIMON: (i) the hierarchical nature
of data center topologies, and (ii) the function approximation capability of multi-layered neural networks. The former gives a speedup of 1,000x, while the latter, implemented
on GPUs, gives a speedup of 5,000x to 10,000x, enabling
SIMON to run in real-time. We deployed SIMON in three
testbeds with different link speeds, layers of switching and
number of servers. Evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue lengths to within 3-5 KB and link utilizations to within 1% of their actual values. The accuracy and speed of SIMON enable
sensitive A/B tests, which greatly aids the real-time development of algorithms, protocols, network software and applications.