Kafka's Performance Strategy

Background

In my work, I use Kafka and maintain several Kafka clusters. Kafka is arguably the most popular message queue middleware today, so what makes it so popular? In this article I explain Kafka's popularity from the perspective of performance.

What is Kafka

First, Kafka is a distributed streaming data platform. Its key functions include:

Messaging system: publishing and subscribing to event streams
Persistent storage: fault-tolerant storage, so that messages are not lost when a consumer node fails
Stream processing: operations such as stream aggregation and joins, including handling of out-of-order or late data, windows, and state

Given the demands of big data processing, Kafka has to optimize the performance of all of the above. The key points and bottlenecks lie in:

Transmission efficiency of the data stream
Producers sending messages in bulk and consumers pulling them, along with the timeliness, reliability, and throughput of messages
Platform-level persistent storage, high fault tolerance, and multi-node backup

Transmission efficiency

Kafka's performance gains come mainly from using the operating system's IO optimizations to escape the memory limits of the JVM.

Why start with the operating system? People use an operating system every day, but its role is generally overlooked. Recall that one of its major roles is to hide hardware differences and provide a unified, standard API to user programs. As a result, most people's understanding of IO stops at calling the read/write system calls, while back-end engineers will also know NIO's epoll/kqueue. Let's look at the following two optimization strategies used by Kafka.

mmap

In fact, modern operating systems heavily optimize disk IO. On Linux, the virtual file system (VFS) layer and its page cache keep data from external storage (disk) in memory to speed up reads and writes. The techniques include, but are not limited to:

Read-ahead: a large block of disk data is loaded into memory in advance. When the data is already cached, a user program reads it at roughly the speed of copying from kernel memory into the memory the program has allocated
Write-behind: a number of small logical writes are buffered in the page cache and merged into one large physical write. The write usually happens when the operating system periodically syncs, and users can also call sync explicitly

How well this memory/disk caching works depends on the operating system's prediction strategy. Generally speaking, it works best for sequential disk access, which is significantly more efficient than random access.

Besides doing the above automatically, the operating system also provides the mmap API, which lets a program actively map a file into its address space. The system shares the page cache memory with the user program, so the program does not have to allocate a buffer in advance and can access the data directly in the page cache. Therefore, mmap is a better choice when a large file is written frequently: compared with the normal write path, it reduces the number of context switches between user mode and kernel mode and avoids repeatedly copying data between buffers.
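
As a minimal sketch of the idea, assuming Python's standard mmap module as a stand-in for the underlying system call, the snippet below writes records into a file through a shared mapping instead of calling write for each record. The helper name append_with_mmap and the fixed 4096-byte capacity are illustrative assumptions, not anything taken from Kafka.

```python
import mmap
import os
import tempfile


def append_with_mmap(path, records, capacity=4096):
    """Write byte records into a file through a memory mapping.

    Returns the number of bytes written. `capacity` is an assumed fixed
    file size chosen for this example.
    """
    with open(path, "wb") as f:
        f.truncate(capacity)              # a mapping needs a non-empty file
    written = 0
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), capacity, access=mmap.ACCESS_WRITE) as mm:
            for record in records:
                end = written + len(record)
                if end > capacity:
                    raise ValueError("mapping is full")
                mm[written:end] = record  # a memory store, no write() call per record
                written = end
            mm.flush()                    # ask the kernel to write dirty pages back
    return written


# Usage: write two records, then read them back through the normal file API.
path = os.path.join(tempfile.mkdtemp(), "segment.log")
n = append_with_mmap(path, [b"hello ", b"kafka"])
with open(path, "rb") as f:
    data = f.read(n)
```

The loop only touches memory; the kernel decides when the dirty pages reach the disk, unless flush is called explicitly.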

Zero-Copy

Above, we covered the optimization strategy for the disk cache. How can we optimize sockets, the other frequently used IO object?

Linux 2.1+ introduced the sendfile system call. With sendfile, data in the page cache is copied directly to the socket buffer. Only the file descriptors, the data offset, and the size are passed to sendfile as parameters. The DMA (Direct Memory Access) engine copies the data from the kernel buffer to the protocol engine, without passing through user mode and without the user program preparing a buffer, so the number of copies through user space is zero. This is zero-copy technology.

    #include <sys/sendfile.h>

    ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
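
To make the call above concrete, here is a small sketch that uses Python's os.sendfile, a thin wrapper over the same system call, to push a file into a socket. It assumes Linux; the helper name send_file_zero_copy is hypothetical, and a local socket pair stands in for a real network connection.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a socket with sendfile(2); return bytes sent."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # Arguments mirror the C prototype: out_fd, in_fd, offset, count.
            # The kernel moves the bytes from the page cache to the socket
            # buffer, so they never pass through a Python-level buffer.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break
            sent += n
    return sent


# Usage: a socket pair stands in for a broker-to-consumer connection.
fd, path = tempfile.mkstemp()
os.write(fd, b"message-1\nmessage-2\n")
os.close(fd)
sender, receiver = socket.socketpair()
sent = send_file_zero_copy(path, sender)
sender.close()
received = receiver.recv(4096)
receiver.close()
os.remove(path)
```

The program only passes descriptors, an offset, and a length; it never allocates a buffer for the payload itself.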

Zero-copy is a powerful tool for Java programs, because the size and speed of the cache are no longer limited by the JVM.

In Kafka's usage scenario, the same messages are pulled by multiple subscribers, that is, by several different consumers. With zero-copy, mmap is called to read the disk data into a single copy in the page cache, and sendfile is then called to copy that one page cache copy to the different socket buffers. The entire copy is completed in kernel mode, which makes the best use of the operating system. The bottleneck is then almost entirely the hardware (disk read/write speed and network speed).

Message quality

We can look at how Kafka guarantees message quality along three dimensions: the timeliness, reliability, and throughput of messages.

Space for time

To avoid issuing too many network requests, Kafka producers batch multiple messages and send them together. This reduces the frequency of network IO, sacrificing a little latency for higher throughput.

In practice, we can configure this behavior on the producer client. For example, set the maximum batch size to 64 bytes and the delay to 100 milliseconds. The effect is that if the buffered messages reach 64 bytes within 100 milliseconds, they are sent immediately. If they have not reached 64 bytes when the 100 milliseconds elapse, the messages in the buffer are sent anyway.
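
The size-or-time rule described above can be sketched as a toy accumulator. In the Kafka producer the corresponding settings are batch.size and linger.ms; the Batcher class, its method names, and the explicit now_ms clock below are illustrative assumptions, and the real client batches per partition with far more bookkeeping.

```python
class Batcher:
    """Toy accumulator: flush when enough bytes are buffered or enough time has passed."""

    def __init__(self, batch_size=64, linger_ms=100):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.buffer = []
        self.buffered_bytes = 0
        self.first_append_ms = None
        self.sent = []                 # each entry models one network request

    def append(self, message, now_ms):
        if not self.buffer:
            self.first_append_ms = now_ms
        self.buffer.append(message)
        self.buffered_bytes += len(message)
        if self.buffered_bytes >= self.batch_size:
            self.flush()               # size threshold reached: send immediately

    def tick(self, now_ms):
        # Called periodically; sends a partial batch once it has waited long enough.
        if self.buffer and now_ms - self.first_append_ms >= self.linger_ms:
            self.flush()

    def flush(self):
        self.sent.append(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0
        self.first_append_ms = None


# Usage with the numbers from the text: 64 bytes or 100 ms, whichever comes first.
batcher = Batcher(batch_size=64, linger_ms=100)
for t in range(4):
    batcher.append(b"x" * 16, now_ms=t)   # 4 * 16 bytes fills the batch at t=3
batcher.append(b"y" * 16, now_ms=10)      # starts a new, partial batch
batcher.tick(now_ms=50)                   # only 40 ms waited: nothing is sent
batcher.tick(now_ms=110)                  # 100 ms waited: the partial batch is sent
```

Five messages end up in two requests instead of five, at the cost of the last message waiting up to 100 ms.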

Distributed design

So far, we have analyzed strategies for a single machine (operating system, communication IO). In terms of horizontal scaling, what performance optimization strategies does Kafka have?

What should be considered when building a distributed messaging system?

How can the advantages of a distributed, multi-node design be used to improve the throughput, disaster tolerance, and elastic scaling of the messaging system?

Let's abstract the problem. Messages are flowing water: a single machine is one water pipe, and multiple nodes form a water supply network. To keep the flow of messages stable, every link in that flow has to be guaranteed.

First, list each link in the message flow:

Status of messages. In most message queue applications, there are three levels:
- After the producer publishes the message, it does not matter whether a consumer receives it
- After the producer publishes the message, at least one consumer must receive it
- After the producer publishes the message, exactly one consumer receives it
Cache of messages
- Traffic peak shaving. When consumers process messages more slowly than producers publish them in a given period, the message system needs to buffer the backlog
- If a consumer hits an error while processing a message, or fails after receiving it and before finishing, the message needs to be processed again
Order of messages
- Some business scenarios require messages to be processed in strict order, so consumers must receive them in sequence

How does Kafka deal with these links?

Kafka rule 1: Messages are stored along two dimensions, virtual partitions and physical brokers, with many partitions to one broker
Kafka rule 2: Within a consumer group, a partition of a topic can be pulled by only one consumer thread at a time
Kafka rule 3: The consumers' pull progress through the cached messages is maintained per partition

For horizontal scaling, Kafka stores the messages that producers publish to one topic across multiple partitions. The number of partitions is typically greater than or equal to the number of Kafka nodes. Within a consumer group, each partition is assigned to at most one consumer.

For message cache capacity and traffic distribution, Kafka can flexibly add nodes (brokers) with the help of ZooKeeper
For the status of messages, Kafka maintains an offset on each partition to record consumption progress; the offset has to be committed by consumers
- If a consumer pulls messages from a partition but does not commit the latest offset, the offset on the partition lags behind the actual pull progress. When the partition is reassigned to a consumer, messages are delivered again starting from the last committed offset
- If a consumer pulls messages and commits the offset after processing them normally, consumption simply continues from there. When a consumer fails unexpectedly while processing a message, Kafka detects that the consumer's connection has been lost and triggers a rebalance of the partitions associated with that consumer
For message ordering, each partition is assigned to at most one consumer. By maintaining the consumption progress, Kafka can guarantee the order of messages within the same partition, thus:
- When a producer publishes messages that must stay ordered for some object, it can tag each message with the object's ID. Kafka hashes the ID and assigns the message to the partition that matches the hash value. The ordered messages of one object are therefore stored in one partition and pulled by one consumer, which receives them in chronological order
  This guarantee holds only while the partition is not reassigned. If a broker node is added or fails, partitions are reassigned. Likewise, if a consumer node is added or fails, the partitions associated with that consumer are reassigned
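
The offset and ordering behavior described in the list above can be modeled in a few lines. This is a simplified in-memory sketch, not Kafka's implementation: Partition and partition_for are hypothetical names, and CRC32 stands in for whatever hash the client's partitioner actually uses.

```python
import zlib


def partition_for(key, num_partitions):
    # A stable hash of the key modulo the partition count (CRC32 is a stand-in).
    return zlib.crc32(key) % num_partitions


class Partition:
    def __init__(self):
        self.log = []        # append-only, so order within a partition is fixed
        self.committed = 0   # offset of the next message the consumer should read

    def append(self, message):
        self.log.append(message)

    def poll(self, max_records=10):
        # Always resume from the committed offset, not from what was last pulled.
        return self.log[self.committed:self.committed + max_records]

    def commit(self, offset):
        self.committed = offset


# Usage: all events for one object ID land in the same partition, in order.
partitions = [Partition() for _ in range(3)]
for i in range(3):
    partitions[partition_for(b"order-42", 3)].append("order-42 event %d" % i)

target = partitions[partition_for(b"order-42", 3)]
first = target.poll(2)   # pulled, but the consumer fails before committing
again = target.poll(2)   # the next owner receives the same two messages
target.commit(2)         # processed successfully: advance the offset
rest = target.poll(2)    # only the third event remains
```

Because redelivery restarts from the committed offset, a message can be processed more than once, so consumers that commit after processing need to tolerate duplicates.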

Intuitively, the partition and consumer will undergo the following rebalancing
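
As a rough illustration of such a rebalance, the sketch below redistributes partitions over whichever consumers are currently alive, using a plain round-robin rule. The assign function is hypothetical; Kafka ships several assignment strategies, and this is not meant to reproduce any of them exactly.

```python
def assign(partitions, consumers):
    """Spread partitions over the live consumers, one owner per partition."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


# Usage: four partitions, first with two consumers, then after one of them fails.
before = assign(range(4), ["c1", "c2"])   # {"c1": [0, 2], "c2": [1, 3]}
after = assign(range(4), ["c1"])          # {"c1": [0, 1, 2, 3]}
```

Adding more consumers than there are partitions leaves the extra consumers idle, since a partition never has more than one owner in the group.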

Backup Policy

Besides partitioning for horizontal scaling, each partition should be replicated (Replicas). Each partition has one leader and multiple followers. The leader handles the partition's requests, and followers that keep up with it are in the ISR (In-Sync Replicas) set. Once the leader fails, a new leader is elected from the in-sync followers.
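
The failover rule can be sketched as follows, assuming the replica list is kept in preference order. The elect_leader function and its parameters are hypothetical and only illustrate the idea of choosing among in-sync replicas; Kafka's controller logic is considerably more involved.

```python
def elect_leader(replicas, isr, failed):
    """Pick the first replica that is both in sync and still alive."""
    for broker in replicas:            # replicas are listed in preference order
        if broker in isr and broker not in failed:
            return broker
    return None                        # no in-sync replica is available


# Usage: broker 1 led the partition and has failed; broker 3 had fallen behind.
new_leader = elect_leader(replicas=[1, 2, 3], isr={1, 2}, failed={1})   # 2
```

Restricting the choice to the ISR means the new leader already holds the messages the old leader had acknowledged.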

Summary

IO-intensive and large-file operations can be optimized at the operating system level by making good use of mmap and sendfile
When operations are frequent and repetitive, they can be processed in batches, trading a little latency for greater throughput
Distributed systems share a widely used design: scaling horizontally along two dimensions, virtual partitions and physical nodes

This article was written by Chakhsu Lau and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as a reprint or credited to another source, all articles on this website are original or translated by this website. Please credit the author when reprinting.