Broker #1 | Kafka

2024-05-01 2. Data Ingestion

Introduction

Kafka is an open-source distributed event streaming platform for:
- high-performance data pipelines,
- streaming analytics,
- data integration,
- mission-critical applications.
Contents
- 1. Kafka Cluster Components
- 2. Kafka does not delete consumed messages
- 3. Consumer group rebalancing
- References

1. Kafka Cluster Components

Cluster : A Kafka cluster is a distributed system composed of multiple Kafka brokers working together to handle the storage and processing of real-time streaming data.
Broker : A broker is a Kafka server that stores topic partitions and serves requests from producers and consumers. Each broker in the cluster is identified by a unique integer ID. Connecting to any one broker gives a client access to the whole cluster. In a cluster with more than one broker, a topic's partitions are spread across the brokers, so a single broker need not hold all of the data for a given topic.
Topic & Partition : A topic is a named category or feed to which a stream of messages is published.
- Data in Kafka is organized into topics. Producers write their data to topics, and consumers read the data from these topics.
- Topics in Kafka are divided into a configurable number of parts, which are known as partitions. Partitions allow several consumers to read data from a particular topic in parallel.
Producer : Producers publish messages to one or more topics.
Consumer & Consumer Group : Consumers read data from the Kafka cluster. Kafka uses a pull model: a consumer requests data from the broker when it is ready to receive it. A consumer group is a set of consumers that cooperate to read from the same topic or set of topics; each partition is assigned to only one consumer within the group.
Replication : Replicas are copies of a partition kept on different brokers. They prevent data loss when a broker fails or is shut down for maintenance.
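The relationship between topics, partitions, and consumer groups described above can be sketched in a few lines of Python. This is a toy model, not Kafka's implementation: `choose_partition` and `assign_partitions` are hypothetical helper names, the CRC32 hash is only a stand-in for the hash a real producer uses, and the round-robin spread is one of several possible assignment strategies.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Assumption: CRC32 is used purely for illustration; it is not the
    hash function a real Kafka producer applies to keys.
    """
    return zlib.crc32(key) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Every partition gets exactly one owner inside the group, so adding
    consumers beyond the partition count leaves some of them idle.
    """
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# Records with the same key always land in the same partition.
print(choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6))

# Six partitions shared by two consumers of the same group.
print(assign_partitions(range(6), ["c1", "c2"]))
```

With two partitions and three consumers, the third consumer receives an empty list, which is why the partition count caps the useful parallelism of a single group.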

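2. Kafka does not delete consumed messages

Kafka does not remove a message when a consumer reads it. Messages stay in the log until the topic's retention policy removes them. Each consumer group keeps its own offsets, so several groups can read the same topic independently and in parallel. Within a group, each partition is read by one consumer, which sees that partition's messages in order.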
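A minimal sketch of this idea, assuming a single partition and a hypothetical `TopicLog` class: reading only advances a per-group offset and never removes records. As a simplification, the offset is committed as part of `poll()`, whereas a real consumer commits offsets as a separate step.

```python
class TopicLog:
    """Toy single-partition log with one read position per consumer group."""

    def __init__(self):
        self.records = []    # append-only; reads never remove entries
        self.committed = {}  # group id -> offset of the next record to read

    def append(self, value):
        """Add a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group, max_records=500):
        """Return the next batch for `group` and advance only that group's offset."""
        start = self.committed.get(group, 0)
        batch = self.records[start:start + max_records]
        self.committed[group] = start + len(batch)
        return batch


log = TopicLog()
for value in ["a", "b", "c"]:
    log.append(value)

print(log.poll("billing"))    # the billing group reads everything
print(log.poll("analytics"))  # the analytics group still sees the same records
print(len(log.records))       # nothing was deleted by either read
```

The group names `billing` and `analytics` are made up for the example; the point is that one group's progress has no effect on what another group can read.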

3. Consumer group rebalancing

session.timeout.ms (default: 45s since Kafka 3.0; 10s in earlier versions)
- This is how long the group coordinator (one of the brokers) waits without hearing from a consumer before it decides that the consumer has failed.
- The Kafka consumer sends heartbeats to the group coordinator to prove it is still alive.
- If the coordinator receives no heartbeat before session.timeout.ms elapses, it removes the consumer from the group and triggers a rebalance, and the partitions that consumer was handling are reassigned to the remaining consumers.
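The heartbeat check can be pictured with a small simulation. This is only a sketch of the timeout logic: `GroupCoordinator` here is a hypothetical toy class, time is passed in explicitly as milliseconds, and the 10,000 ms timeout is an illustrative value rather than a recommendation.

```python
class GroupCoordinator:
    """Toy coordinator that remembers the last heartbeat time of each consumer."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat_ms = {}  # consumer id -> time of last heartbeat

    def heartbeat(self, consumer_id, now_ms):
        """Record that `consumer_id` was alive at `now_ms`."""
        self.last_heartbeat_ms[consumer_id] = now_ms

    def expire(self, now_ms):
        """Drop consumers whose session timed out and return their ids.

        A non-empty result is the point at which a rebalance would start.
        """
        dead = sorted(
            consumer
            for consumer, last in self.last_heartbeat_ms.items()
            if now_ms - last > self.session_timeout_ms
        )
        for consumer in dead:
            del self.last_heartbeat_ms[consumer]
        return dead


coordinator = GroupCoordinator(session_timeout_ms=10_000)
coordinator.heartbeat("c1", now_ms=0)
coordinator.heartbeat("c2", now_ms=0)
coordinator.heartbeat("c1", now_ms=8_000)  # c1 keeps sending heartbeats, c2 goes silent

print(coordinator.expire(now_ms=11_000))   # c2 has been silent for longer than the timeout
```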
max.poll.interval.ms (default: 5m)
- This setting defines the maximum interval between consecutive poll() calls by the consumer.
- If the consumer does not call poll() again within max.poll.interval.ms, it is considered failed and leaves the group, which triggers a rebalance of the consumer group.
- If your processing task takes a long time, you may need to increase max.poll.interval.ms to avoid triggering rebalancing due to lengthy processing times.
max.poll.records (default: 500)
- This defines the maximum number of records (messages) returned in a single call to poll().
- Increasing this value can improve throughput by reducing the number of poll() calls needed to process messages, but each batch then uses more memory and takes longer to process, which brings the consumer closer to the max.poll.interval.ms limit.
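The interaction between max.poll.records and max.poll.interval.ms comes down to simple arithmetic: a full batch must be processed before the next poll() is due. The helpers below are a rough sketch under the assumption that every record takes the same, known processing time; the function names and the 700 ms figure are hypothetical, and real workloads also spend time outside record processing.

```python
def worst_case_batch_ms(max_poll_records, per_record_ms):
    """Time needed to process one full batch, assuming a fixed cost per record."""
    return max_poll_records * per_record_ms


def risks_rebalance(max_poll_records, per_record_ms, max_poll_interval_ms):
    """True if a full batch would not finish before the next poll() is due."""
    return worst_case_batch_ms(max_poll_records, per_record_ms) >= max_poll_interval_ms


def max_safe_records(per_record_ms, max_poll_interval_ms):
    """Largest batch size whose processing time stays strictly under the interval."""
    return (max_poll_interval_ms - 1) // per_record_ms


# Defaults from the settings above: 500 records per poll, 5 minutes between polls.
MAX_POLL_RECORDS = 500
MAX_POLL_INTERVAL_MS = 300_000

# At 700 ms per record, a full batch takes 350,000 ms, which exceeds the interval.
print(risks_rebalance(MAX_POLL_RECORDS, 700, MAX_POLL_INTERVAL_MS))

# Either raise max.poll.interval.ms or lower max.poll.records to this bound.
print(max_safe_records(700, MAX_POLL_INTERVAL_MS))
```

In practice it is safer to leave headroom below this bound, since per-record processing time is rarely constant.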