Apache Kafka is a distributed pub/sub messaging system for collecting and delivering large amounts of event data for both real-time and offline consumption. In the 0.8 release, Kafka supports intra-cluster replication, which increases both the availability and the durability of the system. In this talk, we describe how replication works in Kafka 0.8.
1. Apache Kafka and its usage at LinkedIn

Apache Kafka is a distributed pub/sub messaging system. At LinkedIn, Kafka has been used in production for more than two years. Its applications include: (a) collecting and delivering tracking and operational data for both real-time and offline (Hadoop) consumption; (b) queuing for asynchronous event processing; and (c) delivering data generated in Hadoop to other live data centers. Each day, we collect about 10 TB of compressed data at LinkedIn.
2. Kafka replication

The most important feature added in Kafka 0.8 is intra-cluster replication. A message is replicated on multiple brokers in a cluster, which makes Kafka both highly available and durable.

2.1 Design Goal
Our goal is to support tens of thousands of partitions per broker and tens to hundreds of brokers in a cluster within the same datacenter. Replicas are strongly consistent. We handle typical failures such as a controlled rolling restart of the whole cluster and isolated uncontrolled broker failures.

2.2 Semantics
A message is considered committed only after it has been written to all in-sync replicas. A consumer only sees messages that have already been committed by the brokers. A producer can choose to wait until the produced messages are committed, or only until they have been written to a specified number of replicas.

2.3 How replication works
One of the replicas is elected as the leader and the rest of the replicas are followers. All writes go to the leader replica, and the followers copy data from the leader in the same order. The leader maintains an in-sync replica (ISR) set in ZooKeeper. When the leader fails, a new leader is selected from the ISR, which guarantees no loss of committed data. Each broker maintains a high watermark that tracks committed messages; the high watermark is used to keep replicas consistent during failover.

2.4 Leader Election
All leader elections are performed by an embedded controller. The controller registers watchers in ZooKeeper and is notified of any broker failure. The controller determines the new leaders and sends its decision to each broker through an RPC.

2.5 Performance
During controlled failures, we expect the failover time to be less than 2 seconds.
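To make the producer-side choice in 2.2 concrete, below is a minimal sketch using the Kafka 0.8 Java producer API, where the request.required.acks setting selects how many replicas must acknowledge a message before the send is considered successful. The broker hosts and topic name are hypothetical placeholders, not part of the talk.

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AckExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Brokers used to bootstrap topic metadata (hypothetical host names).
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Durability vs. latency trade-off in the 0.8 producer:
        //   0  -> do not wait for any acknowledgement
        //   1  -> wait until the leader replica has written the message
        //   N  -> wait until N replicas have written the message
        //  -1  -> wait until the message is committed, i.e. written to all in-sync replicas
        props.put("request.required.acks", "-1");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("page-views", "a tracking event"));
        producer.close();
    }
}
```

With acks set to -1, a successful send implies the message has been committed in the sense of 2.2 and will survive the failover described in 2.3; weaker settings trade that guarantee for lower latency.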