Abstract: The increasing demand for cloud computing drives the expansion of datacenters and their internal optical networks, in pursuit of higher bandwidth, higher reliability, and lower latency. Optical transceivers are essential elements of optical networks, yet their reliability has not been studied as thoroughly as that of other hardware components. In this paper, we leverage large volumes of monitoring data from optical transceivers, together with OS-level metrics, to provide statistical insights into the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns with known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.