
Kafka Fundamentals for Beginners

3.3 Topics and Partitions

How Kafka distributes and stores data using topics and partitions.


Topics and Partitions: Data Distribution and Storage

Overview

This lesson explores how Kafka distributes data using topics and partitions, how message keys enable ordering, and how Kafka uses log storage and offsets for efficient data management.

Key-Based Message Distribution

The Role of Message Keys

When producers send messages to Kafka, they can include a key with each message. This key determines which partition receives the message.

Example: Payment Transactions

  • Use user ID as the message key
  • All transactions from the same user go to the same partition
  • Message order is preserved per user

Without a Key:

  • Messages are spread across partitions (round-robin, or sticky batching in newer Kafka versions)
  • Good for load balancing
  • No ordering guarantee across messages
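The routing rule above can be sketched in a few lines of Python. This is illustrative only: real Kafka producers hash the key bytes with murmur2; CRC32 stands in here, and the partition count and keys are made up.

```python
# Simplified sketch of Kafka's default partitioning behavior.
# (Illustrative: real Kafka hashes key bytes with murmur2.)
import zlib
from itertools import count
from typing import Optional

NUM_PARTITIONS = 2
_round_robin = count()  # counter used when no key is given

def choose_partition(key: Optional[str]) -> int:
    """Hash the key to a partition if present, else rotate."""
    if key is not None:
        return zlib.crc32(key.encode()) % NUM_PARTITIONS
    return next(_round_robin) % NUM_PARTITIONS

# The same key always lands in the same partition:
assert choose_partition("user-1") == choose_partition("user-1")

# Keyless messages are spread across partitions:
print([choose_partition(None) for _ in range(4)])  # [0, 1, 0, 1]
```

The key insight is that the mapping is deterministic: no lookup table is needed, because the hash of the key alone decides the partition.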

Step-by-Step Example: Payment Topic

Consider a Payment topic with 2 partitions processing user transactions:

#### Step 1: User 1 Sends Transaction 1.1

  • User 1's ID used as key
  • Routed to Partition 1
  • Partition 1: Transaction 1.1
  • Partition 2: Empty

#### Step 2: User 2 Sends Transaction 2.1

  • User 2's ID routes to Partition 2
  • Partition 1: Transaction 1.1
  • Partition 2: Transaction 2.1

#### Step 3: User 1 Sends Transaction 1.2

  • Same user ID routes to same partition (Partition 1)
  • Partition 1: Transaction 1.1, Transaction 1.2
  • Partition 2: Transaction 2.1

#### Step 4: User 2 Sends Transaction 2.2

  • Routes to Partition 2
  • Partition 1: Transaction 1.1, Transaction 1.2
  • Partition 2: Transaction 2.1, Transaction 2.2

#### Step 5: User 3 Sends Transaction 3.1

  • User 3's ID happens to hash to Partition 1
  • Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1
  • Partition 2: Transaction 2.1, Transaction 2.2

#### Step 6: User 3 Sends Transaction 3.2

  • Routes to Partition 1 (same as Transaction 3.1)
  • Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1, Transaction 3.2
  • Partition 2: Transaction 2.1, Transaction 2.2
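The six steps above can be replayed as a small simulation. The key-to-partition mapping is hard-coded to match the example; in real Kafka the hash of each key decides.

```python
# Replaying Steps 1-6: each message is appended to the partition
# its key maps to, so per-user order is preserved.
partitions = {1: [], 2: []}
# Hard-coded mapping chosen to match the example above:
key_to_partition = {"user-1": 1, "user-2": 2, "user-3": 1}

events = [
    ("user-1", "Transaction 1.1"),
    ("user-2", "Transaction 2.1"),
    ("user-1", "Transaction 1.2"),
    ("user-2", "Transaction 2.2"),
    ("user-3", "Transaction 3.1"),
    ("user-3", "Transaction 3.2"),
]

for key, message in events:
    partitions[key_to_partition[key]].append(message)

print(partitions[1])  # ['Transaction 1.1', 'Transaction 1.2', 'Transaction 3.1', 'Transaction 3.2']
print(partitions[2])  # ['Transaction 2.1', 'Transaction 2.2']
```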

Key Insight: Ordering Guarantee

All messages from the same user:

  • Go to the same partition
  • Are stored in the order they were sent
  • Are read by consumers in the same order

This is critical for:

  • Financial transactions
  • User action tracking
  • Any sequential data processing

Benefits of Key-Based Distribution

1. Guaranteed Ordering

  • Within a partition, order is preserved
  • Multiple consumers can process different users in parallel
  • Each user's data maintains correct sequence

2. Efficient Processing

  • Related messages stay together
  • Easier offset tracking
  • Simplified consumer progress management

3. Simplified Failover

  • Replicas contain all messages for specific users
  • If a broker fails, replica has complete user history
  • Seamless takeover without data gaps

Log Storage Structure

Partition Storage

Partitions are stored as folders on brokers. Each folder contains log files (segments) that store actual data.

#### Example: Partition Folder Structure

Kafka names each partition directory `<topic>-<partition>`:

```text
/var/lib/kafka/data/Payment_Topic-1/
├── 00000000000000000000.log
├── 00000000000000001000.log
└── 00000000000000002000.log
```

Key Points:

  • Each log file is a segment
  • The file name is the starting (base) offset of the segment
  • Segments created as data grows
  • Old segments can be deleted when no longer needed

Managing Log Segments

Kafka divides partition data into segments for efficient management:

  • Each segment has a maximum size
  • New segment created when limit reached
  • Easier to handle and delete old data
  • Optimized for sequential access
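A rough sketch of segment rolling, assuming a cap of 3 records per segment purely for illustration (real Kafka rolls segments on size and time, via settings such as `log.segment.bytes` and `log.roll.ms`):

```python
# Sketch: a partition log rolled into segments, each named after the
# base offset of its first record. The 3-record cap is illustrative.
MAX_RECORDS_PER_SEGMENT = 3

def segment_files(total_records: int) -> list:
    """File names for a partition holding total_records messages."""
    return [
        f"{base:020d}.log"  # 20-digit zero-padded base offset
        for base in range(0, total_records, MAX_RECORDS_PER_SEGMENT)
    ]

for name in segment_files(7):
    print(name)
# 00000000000000000000.log
# 00000000000000000003.log
# 00000000000000000006.log
```

Because each file name encodes its base offset, Kafka can locate any offset by scanning only the one segment that contains it, and can expire old data by deleting whole files.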

Offsets and Data Tracking

What Are Offsets?

An offset is a unique identifier for each message in a partition. Offsets enable:

  • Tracking message position
  • Resuming from last read position
  • Reliable message delivery

Consumer Group Tracking

Kafka tracks offsets per consumer group, not per individual consumer:

  • Each group maintains its own offset
  • Offset stored in __consumer_offsets topic
  • Groups can recover from failures
  • Multiple groups can read same data independently
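Group-level tracking can be mimicked with a small in-memory map playing the role of the `__consumer_offsets` topic. The group names and offset values below are illustrative:

```python
# An in-memory stand-in for the __consumer_offsets topic:
# one committed offset per (group, topic, partition).
offsets = {}

def commit(group, topic, partition, next_offset):
    offsets[(group, topic, partition)] = next_offset

def resume_position(group, topic, partition):
    """Where a consumer in this group should (re)start reading."""
    return offsets.get((group, topic, partition), 0)

# Two groups read the same partition independently:
commit("Payment_Group", "Payment_Topic", 0, 2)
commit("Analytics_Group", "Payment_Topic", 0, 5)

print(resume_position("Payment_Group", "Payment_Topic", 0))    # 2
print(resume_position("Analytics_Group", "Payment_Topic", 0))  # 5
print(resume_position("New_Group", "Payment_Topic", 0))        # 0
```

Note that the two groups hold different positions over the same data, and a brand-new group starts from the beginning.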

Example: Single Partition Tracking

Initial State:

  • Consumer A has processed the message at Offset 0 in Partition 0
  • The committed offset stores the next position to read

| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 1 |

After Processing:

  • Consumer A has processed the message at Offset 1
  • The committed offset advances to 2
  • After any failure, the consumer resumes from this position

| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 2 |

Example: Two Partitions with Consumer Group

Consider a Payment_Group with two consumers reading from two partitions:

Initial State:

Messages in Partition 0:

  • Offset 0: Message_0
  • Offset 1: Message_1
  • Offset 2: Message_2

Messages in Partition 1:

  • Offset 0: Message_3
  • Offset 1: Message_4
  • Offset 2: Message_5

Offset Tracking:

| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 1 |
| Payment_Topic | 1 | Payment_Group | 1 |

  • Consumer A has processed Offset 0 in Partition 0; the next message to read is at Offset 1
  • Consumer B has processed Offset 0 in Partition 1; the next message to read is at Offset 1

After Progression:

| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 2 |
| Payment_Topic | 1 | Payment_Group | 2 |

  • Consumer A processed up to Offset 1 in Partition 0
  • Consumer B processed up to Offset 1 in Partition 1
  • Group-level tracking enables seamless failover
  • If a consumer fails, another can resume from committed offset
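Failover on committed offsets can be sketched as follows. The data mirrors the two-partition example, and the takeover logic is deliberately simplified:

```python
# Failover sketch: a surviving consumer takes over Partition 1 from
# the group's committed offset, so nothing is skipped or re-read.
partition_1 = ["Message_3", "Message_4", "Message_5"]  # offsets 0..2
committed = {("Payment_Group", 1): 2}  # offsets 0 and 1 already processed

def take_over(group, partition, log):
    start = committed.get((group, partition), 0)
    return log[start:]  # messages still to process

print(take_over("Payment_Group", 1, partition_1))  # ['Message_5']
```

Because the offset belongs to the group rather than to the failed consumer, any member of the group can pick up exactly where processing stopped.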

Data Distribution and Fault Tolerance

Core Principles

Kafka distributes data across brokers using:

  • Partitions: Enable parallel processing
  • Replicas: Ensure fault tolerance
  • Leaders: Handle all reads and writes for a partition
  • Followers: Replicate leader data

Example Architecture

#### Diagram 1: Partition Leaders

Two brokers managing two partitions:

  • Broker 1: Partition 0 Leader
  • Broker 2: Partition 1 Leader
  • Producer sends data to respective leaders

#### Diagram 2: Replication

Each partition replicated across brokers:

  • Partition 0 Leader on Broker 1, Replica on Broker 2
  • Partition 1 Leader on Broker 2, Replica on Broker 1
  • Leaders replicate to followers
  • Data accessible from replicas if broker fails

#### Diagram 3: Leader Election on Failure

When Broker 1 fails:

  • The cluster controller elects a new leader (coordinated via ZooKeeper in older Kafka versions, or the KRaft controller quorum in newer ones)
  • Partition 0 replica on Broker 2 becomes new leader
  • Producer redirects to new leader
  • Service continues uninterrupted
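A minimal sketch of the failover in Diagram 3, using the replica assignments from Diagram 2. The election logic here is simplified; real Kafka elects the new leader from the partition's in-sync replica set:

```python
# Leader failover sketch with the assignments from Diagram 2.
replicas = {
    0: {"leader": "broker-1", "follower": "broker-2"},
    1: {"leader": "broker-2", "follower": "broker-1"},
}

def handle_broker_failure(dead_broker):
    """Promote the follower wherever the dead broker led a partition."""
    for partition, r in replicas.items():
        if r["leader"] == dead_broker:
            r["leader"], r["follower"] = r["follower"], None

handle_broker_failure("broker-1")
print(replicas[0])  # {'leader': 'broker-2', 'follower': None}
print(replicas[1])  # unchanged: broker-2 still leads Partition 1
```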

Summary

Kafka's data distribution model provides:

Ordering:

  • Messages with same key go to same partition
  • Order preserved within partitions
  • Critical for sequential processing

Storage:

  • Partitions stored as log segments
  • Efficient management and cleanup
  • Optimized for high throughput

Tracking:

  • Offsets identify message positions
  • Consumer groups track progress
  • Enables reliable delivery and recovery

Fault Tolerance:

  • Replication ensures data availability
  • Automatic leader election
  • No data loss on broker failure (given sufficient replication and acknowledgement settings)

This architecture makes Kafka ideal for mission-critical data streaming applications that require both high performance and reliability.