Installation and configuration of Kafka

0 installation

Installation steps are omitted here; download the packages from the official website. Note that Kafka also requires ZooKeeper.

  • Kafka version: kafka_2.13-2.4.0
  • Zookeeper version: zookeeper-3.5.4-beta
  • JDK version: openjdk 8

1 Kafka configuration

Kafka’s main configuration file is /config/server.properties.

##Broker (instance) ID; the IDs of all instances in the same cluster must be different
broker.id = 0

##Address and port that the Kafka service listens on
##advertised.listeners is the address returned to clients in metadata; once it is set, clients no longer need a matching hosts-file entry
listeners = PLAINTEXT://xxxxxxx:9092
advertised.listeners = PLAINTEXT://xxxxxxx:9092

#####General configuration#####

##Number of threads processing disk IO
num.io.threads = 8

##Number of threads processing network IO
num.network.threads = 3

##Number of threads used for background tasks, such as deleting expired log files
background.threads = 4

##Number of threads per data directory used for log recovery at startup and for flushing logs at shutdown
num.recovery.threads.per.data.dir = 10

##Socket send buffer size (SO_SNDBUF)
socket.send.buffer.bytes = 102400

##Socket receive buffer size (SO_RCVBUF)
socket.receive.buffer.bytes = 102400

##Maximum size of a single socket request
socket.request.max.bytes = 10240000

##Maximum number of records a consumer fetches from the broker in a single poll
max.poll.records = 100

#####Rebalance configuration#####

##Maximum interval between consumer polls
##If the consumer does not poll again within this interval,
##it is considered dead and a rebalance is triggered
##Default 300000 ms (5 minutes)
max.poll.interval.ms = 3000

##Initial rebalance delay
##Rebalancing means that when consumers join or leave, partition ownership is reassigned
##The benefit of a delay is that if consumers join or leave in a burst within this window,
##they can all be handled in a single rebalance after the delay, instead of one less efficient rebalance per change
##Default is 0 ms, i.e. no delay
group.initial.rebalance.delay.ms = 10


##The leader of each partition maintains an ISR (in-sync replica) list
##This is the list of followers whose data is essentially caught up with the leader's
##If a follower falls too far behind the leader, it is removed from the list
##If the leader fails, a new leader is preferably chosen from the followers in the ISR
##Followers removed from the ISR are tracked in the OSR list
##This sets how long a follower may lag behind before it is dropped from the ISR
replica.lag.time.max.ms = 5000

##Whether followers outside the ISR (i.e. in the OSR) may take part in leader election
##False by default, because electing an out-of-sync replica risks losing data
##If this is false and the ISR list is empty, the partition stays unavailable until an in-sync replica recovers
unclean.leader.election.enable = true

##Whether to enable automatic preferred-leader rebalancing
##With this enabled, if a partition has replicas 0, 1 and 2 and 0 is the leader,
##then when 0 fails, 1 becomes the new leader; once 0 restarts, leadership is moved back to 0
##This keeps too many partition leaders from piling up on a single machine, which would make the system unstable
##True by default
auto.leader.rebalance.enable = true

##If a broker's leader imbalance ratio exceeds this percentage, partition leaders are rebalanced
leader.imbalance.per.broker.percentage = 10

##Time interval to check whether the leader is unbalanced
leader.imbalance.check.interval.seconds = 300






#####Data log configuration#####

##Kafka log (data) storage directories; multiple directories can be specified, separated by commas
# log.dirs = /tmp/kafka-logs-1,/tmp/kafka-logs-2
log.dirs = /tmp/kafka-logs


##Thresholds for flushing the log to disk
#Flush after a given number of messages
# log.flush.interval.messages = 10000
#Flush after a given time, in milliseconds
log.flush.interval.ms = 1000

##Interval for checking the flush thresholds
##Every so often the broker checks whether either of the flush thresholds above has been reached
log.flush.scheduler.interval.ms = 3000

##How often to checkpoint the offset of the last data flushed to disk, used mainly for recovery
##Default 60000
log.flush.offset.checkpoint.interval.ms = 60000


##Log retention time, which can be set in hours, minutes or milliseconds
##Once this time has passed, log cleanup is performed
# log.retention.minutes = 120
# log.retention.ms = 120
#The hours variant is used by default, with a default value of 168 (7 days)
log.retention.hours = 2

##If the data stored for a partition exceeds this size, log cleanup is performed
##-1 means unlimited (the default)
log.retention.bytes = 1073741824

##Interval for checking log sizes; if a limit has been reached, file deletion is triggered
log.retention.check.interval.ms = 300000


##How often the log cleaner may run. The default value is 1 minute
log.cleanup.interval.mins = 1

##Log cleanup policy, delete by default
##To use log compaction, the policy must include compact
##Note that with the compact policy enabled, the key of messages sent by clients must not be null, otherwise an error is reported
# log.cleanup.policy = delete
log.cleanup.policy = delete


##Whether to enable the log cleaner (compaction)
##Enabled by default, but compaction only takes effect when the cleanup policy contains compact
##Compaction works on keys: for each key, only the most recent value is kept
log.cleaner.enable = true

##Number of threads used for log compaction; only takes effect when the cleaner is enabled
log.cleaner.threads = 8

##Memory used for log cleaner I/O buffers; more memory makes cleaning more efficient
##Unit: bytes, default 524288
log.cleaner.io.buffer.size = 524288

##Minimum ratio of dirty records before a log is eligible for compaction
##A lower value means more frequent cleaning, which keeps the log more compact but consumes more resources
log.cleaner.min.cleanable.ratio = 0.7


##Maximum size of a single log segment file, 1073741824 by default
log.segment.bytes = 1073741824

##Delay before a log segment is physically deleted
##Once the retention time is reached, the segment is first deleted only logically (it can no longer be indexed) but still remains on disk
##This parameter sets how long after the logical deletion the file is actually removed from disk
log.segment.delete.delay.ms = 60000


#####Topic configuration#####

##Whether topics may be created automatically
##If false, topics can only be created via the command-line tools
##Default true
auto.create.topics.enable = true

##Default number of partitions per topic; it can be set at topic creation time, otherwise this value is used
##The number of partitions directly limits how many consumers in a group can consume in parallel
num.partitions = 1

##Replication factors for the internal topics; the more replicas, the less likely data is lost because of a problem on an individual broker
##More replicas mean higher availability, but each write takes longer to synchronize
offsets.topic.replication.factor = 1
transaction.state.log.replication.factor = 1
transaction.state.log.min.isr = 1



#####Zookeeper configuration#####

##ZooKeeper addresses; multiple addresses can be specified, separated by commas
# zookeeper.connect = localhost:2181,localhost:2182
zookeeper.connect = localhost:2101
##Connection timeout to the ZooKeeper cluster
zookeeper.connection.timeout.ms = 6000

2 create topic

Automatic topic creation is enabled in the configuration above, but a topic can also be created manually:

./bin/kafka-topics.sh --create \
--zookeeper localhost:2101 \
--replication-factor 1 \
--partitions 1 \
--topic test1

View topic:

./bin/kafka-topics.sh --zookeeper localhost:2101 --list
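
Topics can also be created and listed programmatically with Kafka's AdminClient. The following is a minimal sketch, not part of the original setup; the broker address localhost:9092 and the standalone TopicAdmin class are assumptions for illustration only.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class TopicAdmin {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with the advertised listener of your cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic with 1 partition and replication factor 1, matching the CLI example above
            NewTopic topic = new NewTopic("test1", 1, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();

            // List existing topics, equivalent to the --list command
            admin.listTopics().names().get().forEach(System.out::println);
        }
    }
}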

3 start Kafka

./bin/kafka-server-start.sh -daemon ./config/server.properties &

Spring Boot Kafka configuration code

1 pom

Spring Boot version:

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.2.2.RELEASE</version>
    <relativePath/>
</parent>

Import the required jar package:

<!-- necessary package for Kafka -->
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
    <version>2.4.1.RELEASE</version>
    <exclusions>
        <!-- if spring-boot-starter has already been introduced elsewhere, the Spring packages can be excluded here -->
        <exclusion>
            <groupId>org.springframework</groupId>
            <artifactId>spring-*</artifactId>
        </exclusion>
    </exclusions>
</dependency>

2 Spring Boot yaml configuration

spring:
  kafka:
    ##Producer configuration. If this instance is only a consumer, this part can be omitted
    producer:
      #Client ID, optional; must be unique
      client-id: boot-producer
      #Kafka service address, IP/host + port
      bootstrap-servers: aliyun-ecs:9092
      #Serializer classes
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
      #Number of retries when sending a message fails
      retries: 1
      #batch-size is the maximum size of a batch in bytes; buffer-memory is the total memory for buffering unsent records
      batch-size: 10000
      buffer-memory: 300000
      #How many replica acknowledgements to wait for before a send counts as successful; possible values are 0, 1, -1/all
      #0 returns immediately without waiting for any acknowledgement
      #1 waits only for the leader to persist the record
      #all (-1) waits for all in-sync replicas, which is safest but slower
      acks: 1

    ##Consumer configuration. If this instance is only a producer, this part can be omitted
    consumer:
      #Client ID, optional; must be unique
      client-id: boot-consumer
      #Consumer group ID; consumers in the same group share the topic's partitions between them
      group-id: consumer-group-1
      #Kafka service address, IP/host + port
      bootstrap-servers: aliyun-ecs:9092
      #Deserializer classes
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      #Automatically commit offsets
      enable-auto-commit: true
      #If enable-auto-commit is true, the offset is committed at this interval
      #Time in milliseconds, default 5000 (5 s)
      auto-commit-interval: 1000
      #Where to start consuming when there is no committed offset
      #earliest means consuming from the beginning, latest means consuming only newly produced messages
      auto-offset-reset: earliest

3 code

Producer wrapper code:

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import javax.annotation.Resource;

/**
 * Kafka MQ producer wrapper class
 */
@Component
public class MQProducer {

    @Resource
    KafkaTemplate<String,String> kt;

    /**
     * Send a message
     * @param topic     name of an existing topic
     * @param partition partition number within the topic, starting from 0
     * @param key       message key, used mainly for partitioning and for log compaction; it may repeat and may be empty
     * @param message   message body
     */
    public void send(String topic,Integer partition,String key,String message) {
        kt.send(topic,partition,key,message);
    }

    public void send(String message) {
        send("test-topic",0,"",message);
    }
}
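
A minimal usage sketch of the wrapper above; the MessageService class and its publish method are hypothetical names introduced here only for illustration.

import org.springframework.stereotype.Service;
import javax.annotation.Resource;

/**
 * Illustrative caller of the producer wrapper (hypothetical class).
 */
@Service
public class MessageService {

    @Resource
    MQProducer producer;

    public void publish(String payload) {
        // Sends to the default topic "test-topic", partition 0, as defined in the wrapper above
        producer.send(payload);
    }
}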

Consumer wrapper code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

/**
 * Kafka MQ consumer wrapper class
 */
@Component
public class MQListener {

    /**
     * Listener method; it can listen to several topics at once
     * @param cr message wrapper (ConsumerRecord) returned by Kafka
     */
    @KafkaListener(topics = {"test-topic" /*,"test-topic-2"*/ })
    public void consume(ConsumerRecord<String,String> cr) {
        // value
        String value = cr.value();
        System.out.println(value);
        // key
        String key = cr.key();
        System.out.println(key);
        //Read pointer
        long offset = cr.offset();
        System.out.println(offset);
        //Partition number read
        int partition = cr.partition();
        System.out.println(partition);
        // topic
        String topic = cr.topic();
        System.out.println(topic);
    }
}

4 log masking

Kafka’s listener client continuously polls the server for new data and prints a lot of useless logs while doing so.
In the log4j2 configuration file, entries inside the <Loggers></Loggers> tag can be used to mask this part of the logging:

<!-- log4j.xml -->
<Loggers>

    <!-- other configuration omitted -->

    <!-- the log output of the Kafka listener classes is restricted here -->
    <!-- only ERROR-level logs are printed; INFO/DEBUG-level logs are suppressed -->
    <Logger name="org.apache.kafka.clients.FetchSessionHandler" level="ERROR"/>
    <Logger name="org.springframework.kafka.listener.KafkaMessageListenerContainer" level="ERROR"/>
</Loggers>

Thoughts on Kafka’s underlying principles

Why Kafka?

The core function of MQ is peak shaving and valley filling. Because business traffic naturally has peaks and valleys, data with loose real-time requirements can be parked in MQ first and processed later, when the servers have spare computing capacity.
In addition, asynchronous MQ is more loosely coupled than a real-time RPC connection: even if either the producer or the consumer has problems, the other can keep working normally.
In fact, as long as the consumer can keep up, MQ latency is not much worse than an RPC connection; Kafka is essentially an excellent trade of network bandwidth for computing power and of disk for memory.
Kafka is a good choice when the service needs high throughput, does not need very low latency, and can tolerate a limited amount of data loss; otherwise an MQ with stronger reliability guarantees may be preferable.

How is Kafka efficient?

Kafka’s efficiency comes from two areas: network IO and disk IO.
In terms of network IO, Kafka’s network layer is built directly on Java NIO and defines its own compact application-layer protocol, making full use of network bandwidth and of the machine's multithreading capability. Clients generally send messages in batches to reduce the number of network round trips.
In terms of disk IO, Kafka writes sequentially. In the source code, messages are wrapped in a ByteBuffer and written through a FileChannel obtained from a RandomAccessFile; FileChannel.force(...) can be used to flush data sitting in the page cache precisely to disk.
When reading data, Kafka achieves zero copy by calling FileChannel.transferTo(...) or by mapping the file with FileChannel.map(...).
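
As a rough illustration of the zero-copy call mentioned above (this is not Kafka's actual source; the file path and target address are assumptions, and something must be listening on the target port), FileChannel.transferTo can hand file bytes to a socket inside the kernel without copying them into user-space buffers:

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopyDemo {
    public static void main(String[] args) throws Exception {
        // Assumed log file and destination; replace with real values
        try (FileChannel file = new FileInputStream("/tmp/kafka-logs/demo.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo pushes the bytes to the socket inside the kernel (zero copy)
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}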

How does Kafka not lose data?

In a Kafka cluster, data is stored in multiple replicas: each partition has several replicas, each on a different machine, and the replica count is configurable.
On the client side, Kafka exposes retry and ack configuration options.
The retry setting controls how many times the client retransmits a message that failed to reach the server;
the ack setting controls how many replicas must have synchronized the latest data before the send is confirmed as successful; otherwise it is reported as a failure.
Note the trade-off: the stricter these settings, the lower the cluster's throughput, but the more reliable the data and the lower the loss rate, at the cost of a possible risk of duplicate uploads; the looser they are, the higher the throughput but the higher the loss rate.
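
As a concrete example of leaning toward reliability, a plain Kafka producer can be configured with acks=all and retries. A minimal sketch, where the broker address and topic name are assumptions:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each write
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient send failures instead of dropping the record
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "reliable message"));
            producer.flush();
        }
    }
}

This combination trades throughput for reliability, and a retried send whose original acknowledgement was lost is exactly the duplicate-upload case described above.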

How can Kafka avoid repeated consumption of data?

From Kafka’s own point of view, within a consumer group a partition can only be consumed by one consumer, so partitions and consumers are roughly one-to-one. Kafka maintains an offset pointer internally to ensure that, under normal circumstances, a consumer does not consume the same message twice. If you do need to consume messages repeatedly, the consumers can simply join different groups.
From the producer’s perspective, if a message was actually written but Kafka’s success response is lost for network reasons, the producer may resend it and create a duplicate.
From the consumer’s perspective, if a message has been processed but the offset update fails to be committed for network reasons, the message will be consumed again.
Both special cases can be handled with the same idea used to make an interface idempotent: for example, attach a token to each record and invalidate the token once the message has been consumed.
Besides that, duplicate consumption can also be reduced by manually controlling when Kafka’s offset is committed.
In data transmission there are three delivery semantics: at most once, at least once, and exactly once.
Exactly once is hard to achieve in practice. With the default auto-committed offsets, Kafka can be understood as roughly at least once: the offset is only advanced after a message has been consumed correctly, so if consumption fails the offset is not updated and the data is consumed again. With manual commits, the offset can instead be advanced as soon as the message is pulled but before it is processed; then even if the subsequent logic fails, the data will not be consumed again (at most once).
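
A minimal sketch of manual offset control with the plain Kafka consumer (the broker address, group and topic are assumptions): committing after processing gives at-least-once behavior, while committing immediately after poll would give at-most-once.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-group-1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Turn off auto commit so the offset only advances when we say so
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("test-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value()); // process the record
                }
                // Commit after processing: a crash before this line causes re-consumption,
                // never silent loss (at-least-once)
                consumer.commitSync();
            }
        }
    }
}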

How does Kafka ensure data consistency?

  • 1. The replica mechanism in Kafka is only used to back up data; replicas do not serve reads. In other words, consumers can only pull data from the leader partition.
  • 2. Replication works by the follower partitions actively fetching new data from the leader partition.
  • 3. The leader partition maintains an ISR list of in-sync replicas. If a follower lags too far behind the leader, it is removed from the list. If the leader crashes, a new leader is normally chosen from the followers in the ISR list; only if the ISR list is empty are other followers considered.

What is Kafka’s rebalancing?

When a consumer suddenly joins or leaves, Kafka has to reassign the partitions among the remaining consumers. This process is called rebalancing.
The default assignment strategy is range: partitions are divided among consumers in order, and if the division is uneven, the first consumers in the list each receive one extra partition.
The newer sticky strategy tries to preserve the existing consumer-to-partition assignments as much as possible.
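
The assignment strategy is a consumer-side setting. A minimal sketch that switches from the default range assignor to the sticky assignor (the broker address and group ID are assumptions):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.StickyAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class StickyAssignorDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-group-1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Keep existing consumer-to-partition assignments where possible during a rebalance
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as usual
        consumer.close();
    }
}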