Kafka Connect


Overview

Kafka is a distributed event-streaming platform designed for high-throughput, low-latency, and fault-tolerant data pipelines. It is used to publish, store, process, and consume streams of records in real time, and is commonly deployed for log aggregation, metrics, messaging, stream processing, and event-driven microservices at scale.

This documentation provides a concise reference for Kafka architecture, configuration, scalability, and operational best practices.

Components

Each Kafka topic is split into multiple partitions: ordered, append-only logs stored across different brokers. Spreading a topic's data across partitions is what lets Kafka scale horizontally and handle large workloads across many machines.
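The append-only log described above can be sketched as a minimal model: each record appended to a partition receives the next sequential offset, and consumers read forward from an offset. This is an illustrative toy (the `Partition` class and its methods are hypothetical), not Kafka's actual storage implementation.

```python
# Minimal sketch of one partition as an ordered, append-only log.
# Illustrative only -- Kafka's real log is segmented files on disk.

class Partition:
    """Records get monotonically increasing offsets; existing records are never mutated."""

    def __init__(self):
        self._log = []

    def append(self, record):
        offset = len(self._log)  # next offset = current log length
        self._log.append(record)
        return offset

    def read_from(self, offset):
        """Consumers read sequentially, starting at a chosen offset."""
        return self._log[offset:]


p = Partition()
p.append("order-created")   # offset 0
p.append("order-paid")      # offset 1
print(p.read_from(0))       # records come back in append order
```

Because offsets are per-partition, a consumer's position in the log is just an integer it can commit and later resume from.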

Partitions enable parallel processing: consumers in a group each read a subset of partitions, increasing throughput. Partitioning also lets a topic grow beyond the capacity of a single broker. If one broker fails, partitions hosted on the remaining brokers stay available, and replication preserves data durability. Ordering is guaranteed only within a partition: records that share a key are routed to the same partition, so per-key ordering is strict.
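The per-key ordering guarantee above follows from deterministic partition assignment: the same key always hashes to the same partition. The sketch below uses `zlib.crc32` as a stand-in hash (Kafka's default partitioner uses murmur2); `partition_for` and `NUM_PARTITIONS` are hypothetical names for illustration.

```python
# Sketch of key-based partition assignment.
# Assumption: zlib.crc32 stands in for Kafka's murmur2 hash.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Same key -> same partition, so all records for that key
    land in one ordered log and per-key ordering is preserved."""
    return zlib.crc32(key) % num_partitions

# Every event for user-42 maps to the same partition, in send order;
# different keys may land on different partitions and are processed in parallel.
print(partition_for(b"user-42") == partition_for(b"user-42"))
```

Note the trade-off this implies: records with no key are spread across partitions for balance, at the cost of any cross-record ordering.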