Abstract: Database contents evolve over time as entries are added, modified, or deleted. Log-based replication is widely used to deliver these changes to downstream applications efficiently. Change data capture (CDC) systems, such as LinkedIn Databus, mine the database log to achieve data consistency and real-time transmission. However, existing approaches to processing the log suffer from two problems: (1) there is no efficient way to process the log with both high availability and low latency; (2) traditional incremental computing methods are not compatible with database log processing, which exploits the specific mechanisms and semantics of databases. In this paper, we describe a two-phase MapReduce approach for incremental replication. By combining two-phase MapReduce with an efficient, resilient result reuse mechanism, the approach can restore results rapidly. We implement the approach on Spark Streaming, and a comparison with existing solutions shows that it is more effective and more efficient.