3.3 Topics and Partitions
How Kafka distributes and stores data using topics and partitions.
Topics and Partitions: Data Distribution and Storage
Overview
This lesson explores how Kafka distributes data using topics and partitions, how message keys enable ordering, and how Kafka uses log storage and offsets for efficient data management.
Key-Based Message Distribution
The Role of Message Keys
When producers send messages to Kafka, they can include a key with each message. This key determines which partition receives the message.
Example: Payment Transactions
- Use user ID as the message key
- All transactions from the same user go to the same partition
- Message order is preserved per user
Without a Key:
- Messages distributed round-robin across partitions
- Good for load balancing
- No ordering guarantee
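The routing rule can be sketched as "hash the key, take it modulo the partition count." Note this is a simplification: Kafka's default partitioner actually applies murmur2 to the serialized key bytes, and keyless messages are batched with a sticky partitioner in recent versions. The md5-based hash below is a stand-in for illustration only.

```python
import hashlib

NUM_PARTITIONS = 2

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition for a message key.

    Kafka's real default partitioner uses murmur2 on the serialized
    key; md5 here is just a deterministic stand-in.
    """
    if key is None:
        # Keyless messages are spread across partitions by the producer
        # (round-robin or sticky batching); this sketch just returns 0.
        return 0
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition:
assert choose_partition("user-1") == choose_partition("user-1")
```

Because the mapping depends only on the key and the partition count, every message from a given user lands on the same partition, which is what makes per-user ordering possible.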
Step-by-Step Example: Payment Topic
Consider a Payment topic with 2 partitions processing user transactions:
#### Step 1: User 1 Sends Transaction 1.1
- User 1's ID used as key
- Routed to Partition 1
- Partition 1: Transaction 1.1
- Partition 2: Empty
#### Step 2: User 2 Sends Transaction 2.1
- User 2's ID routes to Partition 2
- Partition 1: Transaction 1.1
- Partition 2: Transaction 2.1
#### Step 3: User 1 Sends Transaction 1.2
- Same user ID routes to same partition (Partition 1)
- Partition 1: Transaction 1.1, Transaction 1.2
- Partition 2: Transaction 2.1
#### Step 4: User 2 Sends Transaction 2.2
- Routes to Partition 2
- Partition 1: Transaction 1.1, Transaction 1.2
- Partition 2: Transaction 2.1, Transaction 2.2
#### Step 5: User 3 Sends Transaction 3.1
- User 3's ID happens to hash to Partition 1 (the partitioner spreads different keys across partitions, but each key's destination is fixed by its hash, not by current load)
- Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1
- Partition 2: Transaction 2.1, Transaction 2.2
#### Step 6: User 3 Sends Transaction 3.2
- Routes to Partition 1 (same as Transaction 3.1)
- Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1, Transaction 3.2
- Partition 2: Transaction 2.1, Transaction 2.2
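The six steps above can be replayed with a small simulation. The key-to-partition mapping is hard-coded here to mirror the walkthrough (in a real cluster it would come from hashing each key, as described earlier).

```python
# Fixed key -> partition assignments, mirroring the example:
ROUTING = {"user-1": 1, "user-2": 2, "user-3": 1}

partitions = {1: [], 2: []}

events = [
    ("user-1", "Transaction 1.1"),
    ("user-2", "Transaction 2.1"),
    ("user-1", "Transaction 1.2"),
    ("user-2", "Transaction 2.2"),
    ("user-3", "Transaction 3.1"),
    ("user-3", "Transaction 3.2"),
]

# Each message is appended to the partition its key routes to,
# in send order:
for key, txn in events:
    partitions[ROUTING[key]].append(txn)

print(partitions[1])  # ['Transaction 1.1', 'Transaction 1.2', 'Transaction 3.1', 'Transaction 3.2']
print(partitions[2])  # ['Transaction 2.1', 'Transaction 2.2']
```

The final state matches Step 6: each partition holds its users' transactions in the exact order they were sent.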
Key Insight: Ordering Guarantee
All messages from the same user:
- Go to the same partition
- Are stored in the order they were sent
- Are read by consumers in the same order
This is critical for:
- Financial transactions
- User action tracking
- Any sequential data processing
Benefits of Key-Based Distribution
1. Guaranteed Ordering
- Within a partition, order is preserved
- Multiple consumers can process different users in parallel
- Each user's data maintains correct sequence
2. Efficient Processing
- Related messages stay together
- Easier offset tracking
- Simplified consumer progress management
3. Simplified Failover
- Replicas contain all messages for specific users
- If a broker fails, replica has complete user history
- Seamless takeover without data gaps
Log Storage Structure
Partition Storage
Partitions are stored as folders on brokers. Each folder contains log files (segments) that store actual data.
#### Example: Partition Folder Structure
```
/var/lib/kafka/data/Payment_Topic-Partition_1/
├── 00000000000000000000.log
├── 00000000000000001000.log
└── 00000000000000002000.log
```
Key Points:
- Each log file is a segment
- Numbers indicate starting offset
- Segments created as data grows
- Old segments can be deleted when no longer needed
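Because each file name encodes the first offset it contains, locating the segment for a given offset is a simple sorted lookup. The sketch below assumes the three segment names from the listing above:

```python
import bisect

# Segment file names encode their base (starting) offset,
# zero-padded to 20 digits:
segments = [
    "00000000000000000000.log",
    "00000000000000001000.log",
    "00000000000000002000.log",
]

base_offsets = [int(name.split(".")[0]) for name in segments]

def segment_for(offset):
    """Return the segment whose base offset range covers `offset`."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    return segments[i]

print(segment_for(1500))  # 00000000000000001000.log
```

This is why segment naming matters: the broker can binary-search file names to find any offset without scanning data.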
Managing Log Segments
Kafka divides partition data into segments for efficient management:
- Each segment has a maximum size
- New segment created when limit reached
- Easier to handle and delete old data
- Optimized for sequential access
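A minimal sketch of segment rolling, assuming a size-only trigger (real Kafka also rolls segments on time and index-size limits, controlled by settings such as `log.segment.bytes`):

```python
SEGMENT_BYTES = 100  # illustrative limit; real defaults are ~1 GB

class PartitionLog:
    def __init__(self):
        self.segments = [[]]   # each segment is a list of records
        self.segment_size = 0  # bytes in the active (last) segment

    def append(self, record):
        # Roll to a new segment once the active one would exceed the limit:
        if self.segment_size + len(record) > SEGMENT_BYTES and self.segments[-1]:
            self.segments.append([])
            self.segment_size = 0
        self.segments[-1].append(record)
        self.segment_size += len(record)

log = PartitionLog()
for _ in range(10):
    log.append(b"x" * 30)  # ten 30-byte records

print(len(log.segments))  # 4  (segments of 3, 3, 3, and 1 records)
```

Rolling into fixed-size segments is what makes retention cheap: expiring old data is just deleting whole closed files, never rewriting the active one.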
Offsets and Data Tracking
What Are Offsets?
An offset is a unique identifier for each message in a partition. Offsets enable:
- Tracking message position
- Resuming from last read position
- Reliable message delivery
Consumer Group Tracking
Kafka doesn't track individual consumers - it tracks consumer groups:
- Each group maintains its own offset
- Offsets stored in the internal `__consumer_offsets` topic
- Groups can recover from failures
- Multiple groups can read the same data independently
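Group-level tracking can be sketched as a map keyed by (group, topic, partition), which is essentially what Kafka persists in `__consumer_offsets`. The group and topic names below are illustrative:

```python
# (group, topic, partition) -> next offset to read
committed = {}

def commit(group, topic, partition, next_offset):
    committed[(group, topic, partition)] = next_offset

def fetch_position(group, topic, partition):
    # A group with no committed offset starts at 0 here; a real
    # consumer's start point depends on its auto.offset.reset policy.
    return committed.get((group, topic, partition), 0)

commit("Payment_Group", "Payment_Topic", 0, 2)
commit("Analytics_Group", "Payment_Topic", 0, 5)  # independent group

print(fetch_position("Payment_Group", "Payment_Topic", 0))    # 2
print(fetch_position("Analytics_Group", "Payment_Topic", 0))  # 5
```

Because the key includes the group, two groups reading the same partition keep entirely separate positions, which is how multiple applications consume the same data independently.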
Example: Single Partition Tracking
Initial State:
- Consumer A reads from Partition 0 and processes the message at Offset 0
- The committed offset always points to the *next* message to read, so after this first message it is 1
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 1 |
After Processing:
- Consumer A has also processed the message at Offset 1
- The committed offset advances to 2, the next message to read
- Consumer A can resume from this position after any failure
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 2 |
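The progression in the two tables can be traced in a few lines, assuming the commit-after-process pattern described above:

```python
# Two messages in Partition 0; the committed offset always points to
# the NEXT message to read, matching the tables above.
messages = ["Message_0", "Message_1"]
committed_offset = 0

for offset, msg in enumerate(messages):
    # ... process msg here ...
    committed_offset = offset + 1  # commit the next offset to read

print(committed_offset)  # 2
```

After processing the message at Offset 0 the commit is 1; after Offset 1 it is 2, exactly as the tables show.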
Example: Two Partitions with Consumer Group
Consider a Payment_Group with two consumers reading from two partitions:
Initial State:
Messages in Partition 0:
- Offset 0: Message_0
- Offset 1: Message_1
- Offset 2: Message_2
Messages in Partition 1:
- Offset 0: Message_3
- Offset 1: Message_4
- Offset 2: Message_5
Offset Tracking:
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 1 |
| Payment_Topic | 1 | Payment_Group | 1 |
- Consumer A reads from Partition 0; it has processed the message at Offset 0, so the committed offset is 1
- Consumer B reads from Partition 1; it has processed the message at Offset 0, so the committed offset is 1
After Progression:
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 2 |
| Payment_Topic | 1 | Payment_Group | 2 |
- Consumer A has processed through Offset 1 in Partition 0, so the committed offset is 2
- Consumer B has processed through Offset 1 in Partition 1, so the committed offset is 2
- Group-level tracking enables seamless failover
- If a consumer fails, another can resume from committed offset
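A sketch of the two-consumer scenario, with commits keyed exactly as in the tables (topic, partition, group):

```python
partitions = {
    0: ["Message_0", "Message_1", "Message_2"],
    1: ["Message_3", "Message_4", "Message_5"],
}
committed = {
    ("Payment_Topic", 0, "Payment_Group"): 0,
    ("Payment_Topic", 1, "Payment_Group"): 0,
}

def poll_and_commit(topic, partition, group):
    """Process one message from the partition, then commit the next offset."""
    key = (topic, partition, group)
    offset = committed[key]
    _msg = partitions[partition][offset]  # ... process the message ...
    committed[key] = offset + 1

# Consumer A owns Partition 0, Consumer B owns Partition 1;
# each processes two messages:
for _ in range(2):
    poll_and_commit("Payment_Topic", 0, "Payment_Group")  # Consumer A
    poll_and_commit("Payment_Topic", 1, "Payment_Group")  # Consumer B

print(committed[("Payment_Topic", 0, "Payment_Group")])  # 2
print(committed[("Payment_Topic", 1, "Payment_Group")])  # 2
```

Because progress lives in the group-level map rather than inside either consumer, a replacement consumer can pick up any partition and continue from its committed offset.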
Data Distribution and Fault Tolerance
Core Principles
Kafka distributes data across brokers using:
- Partitions: Enable parallel processing
- Replicas: Ensure fault tolerance
- Leaders: Handle all reads and writes for a partition
- Followers: Replicate leader data
Example Architecture
#### Diagram 1: Partition Leaders
Two brokers managing two partitions:
- Broker 1: Partition 0 Leader
- Broker 2: Partition 1 Leader
- Producer sends data to respective leaders
#### Diagram 2: Replication
Each partition replicated across brokers:
- Partition 0 Leader on Broker 1, Replica on Broker 2
- Partition 1 Leader on Broker 2, Replica on Broker 1
- Leaders replicate to followers
- Data accessible from replicas if broker fails
#### Diagram 3: Leader Election on Failure
When Broker 1 fails:
- ZooKeeper (or the KRaft controller in newer Kafka versions) coordinates leader election
- Partition 0 replica on Broker 2 becomes new leader
- Producer redirects to new leader
- Service continues uninterrupted
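The failover in Diagram 3 can be sketched as promoting the surviving replica. The broker names and the single-follower layout mirror the diagrams above; real clusters can have more replicas and elect only from in-sync ones:

```python
# Per-partition replica assignment, matching Diagrams 1-2:
replicas = {
    0: {"leader": "broker-1", "follower": "broker-2"},
    1: {"leader": "broker-2", "follower": "broker-1"},
}

def handle_broker_failure(failed_broker):
    """Promote the follower of every partition the failed broker led."""
    for assignment in replicas.values():
        if assignment["leader"] == failed_broker:
            assignment["leader"] = assignment["follower"]
            assignment["follower"] = None  # no replica until broker returns

handle_broker_failure("broker-1")

print(replicas[0]["leader"])  # broker-2 (promoted)
print(replicas[1]["leader"])  # broker-2 (unchanged)
```

After the election, producers and consumers simply redirect to the new leader, so Partition 0 stays available throughout.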
Summary
Kafka's data distribution model provides:
Ordering:
- Messages with same key go to same partition
- Order preserved within partitions
- Critical for sequential processing
Storage:
- Partitions stored as log segments
- Efficient management and cleanup
- Optimized for high throughput
Tracking:
- Offsets identify message positions
- Consumer groups track progress
- Enables reliable delivery and recovery
Fault Tolerance:
- Replication ensures data availability
- Automatic leader election
- No data loss on broker failure
This architecture makes Kafka ideal for mission-critical data streaming applications that require both high performance and reliability.