Kafka - The basics you need to know¶
Kafka is a distributed streaming platform which uses logs as the unit of storage for messages passed within the system. It is horizontally scalable, fault-tolerant, fast, and runs in production in thousands of companies. Likely your business is already using it in some form.
What you must know about Apache Kafka to use Faust¶
Topics
A topic is a stream name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. Topics are also the base abstraction of where data lives within Kafka. Each topic is backed by logs which are partitioned and distributed.
Faust uses the abstraction of a topic to both consume data from a stream as well as publish data to more streams represented by Kafka topics. A Faust application needs to be consuming from at least one topic and may create many intermediate topics as a by-product of your streaming application. Every topic that is not created by Faust internally can be thought of as a source (for your application to process) or sink (for other systems to pick up).
Partitions
Partitions are the fundamental unit within Kafka where data lives. Every topic is split into one or more partitions. These partitions are represented internally as logs and messages always make their way to one partition (log). Each partition is replicated and has one leader at any point in time.
Faust uses the notion of a partition to maintain order amongst messages and as a way to split the processing of data to increase throughput. A Faust application uses the notion of a “key” to make sure that messages that should appear together and be processed on the same box do end up on the same box.
Fault Tolerance
Every partition is replicated with the total copies represented by the In Sync Replicas (ISR). Every ISR is a candidate to take over as the new leader should the current leader fail. The maximum number of faulty Kafka brokers that can be tolerated is the number of ISR - 1. I.e., if every partition has three replicas, all fault tolerance guarantees hold as long as at least one replica is functional
Faust has the same guarantees that Kafka offers with regards to fault tolerance of the data.
Distribution of load/work
For every partition, all reads and writes are routed to the leader of that partition. For a specific topic, the load is as distributed as the number of partitions. Note: Since the partition is the lowest degree of parallel processing of messages, the number of partitions control how much many parallel instances of the consumers can operate on messages.
Faust uses parallel consumers and therefore is also limited by the number of partitions to dictate how many concurrent Faust application instances can run to distribute work. Extra Faust application instances beyond the source topic partition count will be idle and not improve message processing rates.
- Offsets
For every <topic, partition> combination Kafka keeps track of the offset of messages in the log to know where new messages should be appended. On a consumer level, offsets are maintained at the <group, topic, partition> level for consumers to know where to continue consuming for a given “group”. The group acts as a namespace for consumers to register when multiple consumers want to share the load on a single topic.
Kafka maintains processing guarantees of at least once by committing offsets after message consumption. Once an offset has been committed at the consumer level, the message at that offset for the <group, topic, partition> will not be reread.
Faust uses the notion of a group to maintain a namespace within an app. Faust commits offsets after when a message is processed through all of its operations. Faust allows a configurable commit interval which makes sure that all messages that have been processed completely since the last interval will be committed.
Log Compaction
Log compaction is a methodology Kafka uses to make sure that as data for a key changes it will not affect the size of the log such that every state change is maintained for all time. Only the most recent value is guaranteed to be available. Periodic compaction removes all values for a key except the last one.
Tables in Faust use log compaction to ensure table state can be recovered without a large space overhead.
This summary and information about Kafka is adapted from original documentation on Kafka available at https://kafka.apache.org/