We are pleased to announce that StreamNative and OVHcloud have open-sourced KoP (Kafka on Pulsar). KoP introduces a Kafka protocol handler plug-in into the Pulsar broker, allowing Apache Pulsar to support the native Apache Kafka protocol. After adding the KoP protocol handler to an existing Pulsar cluster, users can migrate existing Kafka applications and services to Pulsar without modifying any code. Kafka applications can then take advantage of powerful Pulsar features, such as:
- Simplified operations through enterprise-grade multi-tenancy.
- No data migration, which further simplifies operations.
- Durable event-stream storage with Apache BookKeeper and tiered storage.
- Serverless event processing with Pulsar Functions.
What is Apache Pulsar?
Apache Pulsar is an event streaming platform. From the beginning, Pulsar adopted a cloud-native, layered, segment-centric architecture that separates serving from storage and makes the system more container-friendly. Pulsar's cloud-native architecture is highly scalable, consistent, and elastic, enabling companies to grow their business with real-time data solutions. Pulsar has been widely adopted since it was open-sourced in 2016, and it became a top-level Apache project in 2018.
The demand for KoP
Pulsar provides a unified messaging model for both queuing and streaming workloads. Pulsar uses its own protobuf-based binary protocol to ensure high performance and low latency, and protobuf makes it easy to implement Pulsar clients. The project supports official Java, Go, Python, and C++ clients, as well as third-party clients provided by the community. However, applications written against other messaging protocols must be rewritten before they can adopt Pulsar's unified messaging protocol.
To address this, the Pulsar community has developed tools for migrating applications from other messaging systems to Pulsar. For example, Pulsar provides a Kafka wrapper on top of the Kafka Java API, which lets users switch Kafka Java client applications from Kafka to Pulsar without code changes. Pulsar also provides a rich connector ecosystem for connecting Pulsar with other data systems. Still, there was strong demand from users who wanted to move their other Kafka applications to Pulsar.
StreamNative and OVHcloud
StreamNative receives many inbound requests for help migrating from other messaging systems to Pulsar, and it also recognized the need to support other messaging protocols (such as AMQP and Kafka) on Pulsar. Therefore, StreamNative set out to introduce a general protocol handler framework into Pulsar, which allows developers using other messaging protocols to use Pulsar.
OVHcloud has run Apache Kafka for many years. Despite their experience operating multiple Kafka clusters that process millions of messages per second, they still faced formidable operational challenges. For example, without multi-tenancy, it was difficult for them to put thousands of topics from thousands of users into a single cluster.
As a result, OVHcloud moved away from Kafka and decided to migrate its topic-as-a-service product, ioStream, to Pulsar and build the product on top of Pulsar. Compared with Kafka, Pulsar supports multi-tenancy, and its overall architecture includes Apache BookKeeper, which helps simplify operations for users.
After initial experiments, OVHcloud decided to build a proof-of-concept proxy that translates the Kafka protocol to Pulsar. In the process, OVHcloud noticed that StreamNative was working on bringing the Kafka protocol natively into Pulsar, so the two teams joined forces to develop KoP.
KoP aims to provide a simple and comprehensive solution by leveraging Pulsar and BookKeeper's event-stream storage architecture and Pulsar's pluggable protocol handler framework. KoP is implemented as a protocol handler plug-in named "kafka", which is bundled with and runs alongside the Pulsar broker.
Logs
Pulsar and Kafka use very similar data models for publish/subscribe messaging and event streams: both are built on distributed logs. The main difference between the two systems is how they implement the distributed log. Kafka adopts a partition-centric architecture and stores distributed logs (the logs of Kafka partitions) on a set of brokers. Pulsar adopts a segment-centric architecture and stores distributed logs in Apache BookKeeper, using BookKeeper as its scale-out segment storage layer. Pulsar's segment-based architecture helps avoid data migration, achieves high scalability, and stores event streams durably. For more information on the main differences between Pulsar and Kafka, refer to the Splunk blog and the BookKeeper project blog.
Because Pulsar and Kafka are built on a similar data model (the distributed log), and because Pulsar provides both distributed log storage and a pluggable protocol handler framework (introduced in version 2.5.0), Pulsar can easily implement a Kafka-compatible protocol handler.
Comparing Pulsar and Kafka reveals many similarities. Both systems include the following operations:
- Topic lookup: clients connect to any broker to look up the metadata of a topic (i.e., its owner broker). After obtaining the metadata, the client establishes a persistent TCP connection with the owner broker.
- Publish: clients talk to the owner broker of a topic partition to append messages to a distributed log.
- Consume: clients talk to the owner broker of a topic partition to read messages from the distributed log.
- Offset: messages published to a topic partition are assigned an offset. In Pulsar, the offset is called a MessageId. Consumers can use offsets to seek to a given position in the log and read messages.
- Consumption state: both systems maintain the consumption state of consumers (Kafka calls them consumer groups) in a subscription. Kafka stores the consumption state in the __offsets topic, while Pulsar stores the consumption state in cursors.
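The operations above can be modeled with a handful of primitives over an append-only log. The following minimal in-memory sketch is purely illustrative (all names are invented for this example; real systems back this with Kafka partitions or BookKeeper ledgers):

```python
# Minimal in-memory model of the log primitives shared by Kafka and Pulsar.
# Illustrative only: a real implementation stores entries in a distributed log.

class LogPartition:
    """A single topic partition backed by an append-only log."""

    def __init__(self):
        self.entries = []    # the log itself (here: a plain Python list)
        self.committed = {}  # consumer group / subscription -> next offset

    def publish(self, message):
        """Append a message and return the offset it was assigned."""
        self.entries.append(message)
        return len(self.entries) - 1

    def consume(self, offset, max_messages=1):
        """Read up to max_messages starting at the given offset."""
        return self.entries[offset:offset + max_messages]

    def commit(self, group, offset):
        """Record the consumption state for a group (a Pulsar subscription)."""
        self.committed[group] = offset

    def committed_offset(self, group):
        return self.committed.get(group, 0)
```

A scale-out log store such as BookKeeper provides exactly these append, read, and position-tracking primitives, which is why the Kafka concepts map onto Pulsar's components so directly.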
As the list shows, these are all primitive operations provided by a scale-out distributed log store such as Apache BookKeeper. Pulsar's core features are implemented on top of Apache BookKeeper, so we can implement the Kafka concepts simply and directly with the existing components Pulsar has built on BookKeeper.
The figure below shows how we add Kafka protocol support to Pulsar: we introduce a new protocol handler that implements the Kafka wire protocol using Pulsar's existing components, such as topic discovery, the distributed log library (ManagedLedger), cursors, and so on.
Kafka stores all topics in a flat namespace, while Pulsar stores topics in a hierarchical, multi-tenant namespace. We added a kafkaNamespace setting so that administrators can map Kafka topics to Pulsar topics.
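The mapping itself is simple string composition. This sketch assumes a kafkaNamespace-style setting with a `tenant/namespace` value; the helper name and default are inventions for illustration, not KoP's exact code:

```python
def to_pulsar_topic(kafka_topic, kafka_namespace="public/default"):
    """Map a flat Kafka topic name into Pulsar's tenant/namespace hierarchy.

    `kafka_namespace` stands in for a kafkaNamespace-style broker setting;
    the exact configuration key and default value are assumptions here.
    """
    if kafka_topic.startswith("persistent://"):
        return kafka_topic  # already a fully qualified Pulsar topic name
    return f"persistent://{kafka_namespace}/{kafka_topic}"
```

So a Kafka client producing to `orders` would, under the default mapping assumed above, write to the Pulsar topic `persistent://public/default/orders`.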
To make it easy for Kafka users to use Apache Pulsar's multi-tenancy feature, Kafka clients that authenticate with the SASL mechanism can specify a Pulsar tenant and namespace as their SASL username.
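In other words, the SASL username doubles as a routing hint. A hedged sketch of how such a username might be interpreted (the parsing rules and fallback below are illustrative assumptions, not KoP's exact behavior):

```python
def tenant_and_namespace_from_sasl(username, default="public/default"):
    """Interpret a SASL username of the form "tenant/namespace".

    Illustrative assumption: a well-formed username carries exactly one '/';
    anything else falls back to a default tenant/namespace pair.
    """
    parts = username.split("/")
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    tenant, namespace = default.split("/")
    return tenant, namespace
```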
Message ID and offset
Kafka assigns an offset to each message successfully published to a topic partition, while Pulsar assigns each message a MessageId, which consists of a ledger-id, an entry-id, and a batch-index. We use the same approach as in the Pulsar Kafka wrapper to convert a Pulsar MessageId to an offset and vice versa.
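One way to perform such a conversion is to pack the MessageId components into a single 64-bit integer. The bit widths below are illustrative assumptions for this sketch, not KoP's or the Kafka wrapper's exact layout:

```python
# Hedged sketch: pack a Pulsar MessageId (ledger-id, entry-id, batch-index)
# into one 64-bit Kafka-style offset and unpack it again.
# The 28/24/12 field split is an assumption chosen for illustration.

LEDGER_BITS, ENTRY_BITS, BATCH_BITS = 28, 24, 12  # 64 bits total

def message_id_to_offset(ledger_id, entry_id, batch_index):
    """Encode the three MessageId components as a single integer offset."""
    assert 0 <= ledger_id < (1 << LEDGER_BITS)
    assert 0 <= entry_id < (1 << ENTRY_BITS)
    assert 0 <= batch_index < (1 << BATCH_BITS)
    return ((ledger_id << (ENTRY_BITS + BATCH_BITS))
            | (entry_id << BATCH_BITS)
            | batch_index)

def offset_to_message_id(offset):
    """Decode an offset back into (ledger_id, entry_id, batch_index)."""
    batch_index = offset & ((1 << BATCH_BITS) - 1)
    entry_id = (offset >> BATCH_BITS) & ((1 << ENTRY_BITS) - 1)
    ledger_id = offset >> (ENTRY_BITS + BATCH_BITS)
    return ledger_id, entry_id, batch_index
```

Because the encoding is a bijection over the assumed ranges, seeking by Kafka offset and seeking by Pulsar MessageId become interchangeable.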
Both Kafka and Pulsar messages contain a key, a value, a timestamp, and headers (called "properties" in Pulsar). We automatically convert these fields between Kafka messages and Pulsar messages.
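Conceptually, the conversion is a one-to-one field mapping in each direction. The dictionary-based sketch below is illustrative only (KoP does this inside the broker on real message objects; the field names here are assumptions):

```python
def kafka_record_to_pulsar(record):
    """Map a Kafka record's fields onto a Pulsar message (illustrative).

    Kafka headers become Pulsar 'properties'; the timestamp becomes
    the Pulsar event time.
    """
    return {
        "key": record.get("key"),
        "value": record.get("value"),
        "event_time": record.get("timestamp"),
        "properties": dict(record.get("headers") or {}),
    }

def pulsar_message_to_kafka(message):
    """Reverse mapping, used on the consume path."""
    return {
        "key": message.get("key"),
        "value": message.get("value"),
        "timestamp": message.get("event_time"),
        "headers": dict(message.get("properties") or {}),
    }
```

Because the mapping is lossless for these four fields, a record can round-trip between the two representations unchanged.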
We reuse the same topic lookup for the Kafka request handler as for Pulsar. The request handler discovers the topic, finds the ownership of the requested topic partitions, and then returns the ownership information to the Kafka client in a TopicMetadata response.
When the Kafka request handler receives messages from a Kafka client, it converts each Kafka message into a Pulsar message by mapping the fields (key, value, timestamp, and headers) one by one, and then uses the ManagedLedger append API to store the converted messages in BookKeeper. Because the Kafka request handler converts Kafka messages into Pulsar messages, existing Pulsar applications can consume messages published by Kafka clients.
When the Kafka request handler receives a consume request from a Kafka client, it opens a non-durable cursor and reads entries starting from the requested offset. After the handler converts the Pulsar messages back into Kafka messages, existing Kafka applications can consume messages published by Pulsar clients.
Group Coordinator & offset management
The biggest challenge was implementing the group coordinator and offset management. Pulsar has no centralized group coordinator that assigns partitions to the consumers of a consumer group or manages offsets for each consumer group. Instead, the Pulsar broker manages partition assignment on a per-partition basis, and the owner broker of a partition manages offsets by storing acknowledgement information in cursors.
It is difficult to align the Pulsar model with the Kafka model. Therefore, for full compatibility with Kafka clients, we implemented the Kafka group coordinator by storing coordinator group changes and offsets in a Pulsar system topic named public/kafka/__offsets. This builds a bridge between Pulsar and Kafka and allows users to manage subscriptions and monitor Kafka consumers with existing Pulsar tools and policies. We added a background thread to the coordinator implementation that periodically syncs offset updates from the system topic to Pulsar cursors. As a result, a Kafka consumer group is effectively treated as a Pulsar subscription, and all existing Pulsar tools can be used to manage Kafka consumer groups.
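The flow described above can be sketched as two stores and a sync step. This is a deliberately simplified model (a dict stands in for the public/kafka/__offsets system topic and another for Pulsar cursors; the class and method names are inventions for this sketch):

```python
# Hedged sketch of KoP's offset bookkeeping: Kafka commits land in a system
# topic, and a periodic background step mirrors them into Pulsar cursors so
# a Kafka consumer group behaves like a Pulsar subscription.

class OffsetStore:
    def __init__(self):
        self.system_topic = {}  # (group, topic, partition) -> committed offset
        self.cursors = {}       # same key -> cursor position seen by Pulsar tools

    def commit(self, group, topic, partition, offset):
        """Handle an offset commit from a Kafka client (write to system topic)."""
        self.system_topic[(group, topic, partition)] = offset

    def sync_to_cursors(self):
        """Background step: mirror committed offsets into Pulsar cursors.

        Only moves a cursor forward, so a stale sync never rewinds
        acknowledged progress.
        """
        for key, offset in self.system_topic.items():
            if self.cursors.get(key, -1) < offset:
                self.cursors[key] = offset
```

Between sync runs, the system topic is the source of truth for Kafka clients, while the cursors give Pulsar tooling an eventually consistent view of the same progress.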
Connecting two popular messaging ecosystems
Both StreamNative and OVHcloud value customer success. We believe that offering the native Kafka protocol on Apache Pulsar helps Pulsar users achieve business success faster. KoP integrates two popular event streaming ecosystems and unlocks new use cases. Customers can combine the advantages of both ecosystems to build a truly unified event streaming platform with Apache Pulsar and accelerate the development of real-time applications and services.
For example, KoP lets a log collector keep collecting log data from its sources and publish messages to Apache Pulsar through its existing Kafka integrations, while downstream applications use Pulsar Functions to process the arriving events, enabling serverless event streaming.
KoP is licensed under Apache License v2, and the project is hosted at https://github.com/streamnati… . The StreamNative Platform ships with KoP built in, and you can download it to try out all of KoP's features. If you already run a Pulsar cluster and want it to support the Kafka protocol, you can install the KoP protocol handler into the existing cluster; see the instructions for details.
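Enabling a protocol handler is done through the broker configuration. The fragment below is a hedged sketch of what such a configuration might look like; check KoP's own documentation for the exact keys and values, since the listener address and directory here are assumptions:

```properties
# broker.conf (illustrative fragment, assumed values)
# Load the Kafka protocol handler alongside the Pulsar protocol.
messagingProtocols=kafka
# Directory where the protocol handler .nar file is installed (assumed path).
protocolHandlerDirectory=./protocols
# Address on which the broker accepts Kafka client connections (assumed).
kafkaListeners=PLAINTEXT://127.0.0.1:9092
```

After restarting the brokers with such a configuration, Kafka clients would point their bootstrap servers at the configured listener instead of a Kafka cluster.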
If you want to learn more about KoP, refer to KoP's code and documentation. We look forward to your questions and PRs. You can also join the #kop channel on Pulsar Slack to talk about everything related to Kafka on Pulsar.
StreamNative and OVHcloud will host a webinar about KoP on March 31. For details, click Register. We look forward to meeting you online.
StreamNative started the KoP project, and the OVHcloud team later joined to develop it together. Thanks to Pierre Zemb and Steven Le Roux from OVHcloud for their contributions to the project!