3.3 Topics and Partitions
How Kafka distributes and stores data using topics and partitions.
Topics and Partitions: Data Distribution and Storage
Overview
This lesson explores how Kafka distributes data using topics and partitions, how message keys enable ordering, and how Kafka uses log storage and offsets for efficient data management.
Key-Based Message Distribution
The Role of Message Keys
When producers send messages to Kafka, they can include a key with each message. This key determines which partition receives the message.
Example: Payment Transactions
- Use user ID as the message key
- All transactions from the same user go to the same partition
- Message order is preserved per user
Without a Key:
- Messages distributed round-robin across partitions
- Good for load balancing
- No ordering guarantee
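The routing rule can be sketched as "hash the key, take it modulo the partition count." Note this is a simplification: Kafka's default partitioner actually applies murmur2 to the serialized key bytes, and keyless messages are batched with a sticky partitioner in recent versions. The md5-based hash below is a stand-in for illustration only.

```python
import hashlib

NUM_PARTITIONS = 2

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition for a message key.

    Kafka's real default partitioner uses murmur2 on the serialized
    key; md5 here is just a deterministic stand-in.
    """
    if key is None:
        # Keyless messages are spread across partitions by the producer
        # (round-robin or sticky batching); this sketch just returns 0.
        return 0
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition:
assert choose_partition("user-1") == choose_partition("user-1")
```

Because the mapping depends only on the key and the partition count, every message from a given user lands on the same partition, which is what makes per-user ordering possible.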
Step-by-Step Example: Payment Topic
Consider a Payment topic with 2 partitions processing user transactions:
#### Step 1: User 1 Sends Transaction 1.1
- User 1's ID used as key
- Routed to Partition 1
- Partition 1: Transaction 1.1
- Partition 2: Empty
#### Step 2: User 2 Sends Transaction 2.1
- User 2's ID routes to Partition 2
- Partition 1: Transaction 1.1
- Partition 2: Transaction 2.1
#### Step 3: User 1 Sends Transaction 1.2
- Same user ID routes to same partition (Partition 1)
- Partition 1: Transaction 1.1, Transaction 1.2
- Partition 2: Transaction 2.1
#### Step 4: User 2 Sends Transaction 2.2
- Routes to Partition 2
- Partition 1: Transaction 1.1, Transaction 1.2
- Partition 2: Transaction 2.1, Transaction 2.2
#### Step 5: User 3 Sends Transaction 3.1
- User 3's ID happens to hash to Partition 1 (the partitioner spreads different keys across partitions, but each key's destination is fixed by its hash, not by current load)
- Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1
- Partition 2: Transaction 2.1, Transaction 2.2
#### Step 6: User 3 Sends Transaction 3.2
- Routes to Partition 1 (same as Transaction 3.1)
- Partition 1: Transaction 1.1, Transaction 1.2, Transaction 3.1, Transaction 3.2
- Partition 2: Transaction 2.1, Transaction 2.2
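The six steps above can be replayed with a small simulation. The key-to-partition mapping is hard-coded here to mirror the walkthrough (in a real cluster it would come from hashing each key, as described earlier).

```python
# Fixed key -> partition assignments, mirroring the example:
ROUTING = {"user-1": 1, "user-2": 2, "user-3": 1}

partitions = {1: [], 2: []}

events = [
    ("user-1", "Transaction 1.1"),
    ("user-2", "Transaction 2.1"),
    ("user-1", "Transaction 1.2"),
    ("user-2", "Transaction 2.2"),
    ("user-3", "Transaction 3.1"),
    ("user-3", "Transaction 3.2"),
]

# Each message is appended to the partition its key routes to,
# in send order:
for key, txn in events:
    partitions[ROUTING[key]].append(txn)

print(partitions[1])  # ['Transaction 1.1', 'Transaction 1.2', 'Transaction 3.1', 'Transaction 3.2']
print(partitions[2])  # ['Transaction 2.1', 'Transaction 2.2']
```

The final state matches Step 6: each partition holds its users' transactions in the exact order they were sent.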
Key Insight: Ordering Guarantee
All messages from the same user:
- Go to the same partition
- Are stored in the order they were sent
- Are read by consumers in the same order
This is critical for:
- Financial transactions
- User action tracking
- Any sequential data processing
Benefits of Key-Based Distribution
1. Guaranteed Ordering
- Within a partition, order is preserved
- Multiple consumers can process different users in parallel
- Each user's data maintains correct sequence
2. Efficient Processing
- Related messages stay together
- Easier offset tracking
- Simplified consumer progress management
3. Simplified Failover
- Replicas contain all messages for specific users
- If a broker fails, replica has complete user history
- Seamless takeover without data gaps
Log Storage Structure
Partition Storage
Partitions are stored as folders on brokers. Each folder contains log files (segments) that store actual data.
#### Example: Partition Folder Structure
```
/var/lib/kafka/data/Payment_Topic-Partition_1/
├── 00000000000000000000.log
├── 00000000000000001000.log
└── 00000000000000002000.log
```
Key Points:
- Each log file is a segment
- Numbers indicate starting offset
- Segments created as data grows
- Old segments can be deleted when no longer needed
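Because each file name encodes the first offset it contains, locating the segment for a given offset is a simple sorted lookup. The sketch below assumes the three segment names from the listing above:

```python
import bisect

# Segment file names encode their base (starting) offset,
# zero-padded to 20 digits:
segments = [
    "00000000000000000000.log",
    "00000000000000001000.log",
    "00000000000000002000.log",
]

base_offsets = [int(name.split(".")[0]) for name in segments]

def segment_for(offset):
    """Return the segment whose base offset range covers `offset`."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    return segments[i]

print(segment_for(1500))  # 00000000000000001000.log
```

This is why segment naming matters: the broker can binary-search file names to find any offset without scanning data.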
Managing Log Segments
Kafka divides partition data into segments for efficient management:
- Each segment has a maximum size
- New segment created when limit reached
- Easier to handle and delete old data
- Optimized for sequential access
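A minimal sketch of segment rolling, assuming a size-only trigger (real Kafka also rolls segments on time and index-size limits, controlled by settings such as `log.segment.bytes`):

```python
SEGMENT_BYTES = 100  # illustrative limit; real defaults are ~1 GB

class PartitionLog:
    def __init__(self):
        self.segments = [[]]   # each segment is a list of records
        self.segment_size = 0  # bytes in the active (last) segment

    def append(self, record):
        # Roll to a new segment once the active one would exceed the limit:
        if self.segment_size + len(record) > SEGMENT_BYTES and self.segments[-1]:
            self.segments.append([])
            self.segment_size = 0
        self.segments[-1].append(record)
        self.segment_size += len(record)

log = PartitionLog()
for _ in range(10):
    log.append(b"x" * 30)  # ten 30-byte records

print(len(log.segments))  # 4  (segments of 3, 3, 3, and 1 records)
```

Rolling into fixed-size segments is what makes retention cheap: expiring old data is just deleting whole closed files, never rewriting the active one.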
Offsets and Data Tracking
What Are Offsets?
An offset is a unique identifier for each message in a partition. Offsets enable:
- Tracking message position
- Resuming from last read position
- Reliable message delivery
Consumer Group Tracking
Kafka doesn't track individual consumers - it tracks consumer groups:
- Each group maintains its own offset
- Offsets stored in the internal `__consumer_offsets` topic
- Groups can recover from failures
- Multiple groups can read the same data independently
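Group-level tracking can be sketched as a map keyed by (group, topic, partition), which is essentially what Kafka persists in `__consumer_offsets`. The group and topic names below are illustrative:

```python
# (group, topic, partition) -> next offset to read
committed = {}

def commit(group, topic, partition, next_offset):
    committed[(group, topic, partition)] = next_offset

def fetch_position(group, topic, partition):
    # A group with no committed offset starts at 0 here; a real
    # consumer's start point depends on its auto.offset.reset policy.
    return committed.get((group, topic, partition), 0)

commit("Payment_Group", "Payment_Topic", 0, 2)
commit("Analytics_Group", "Payment_Topic", 0, 5)  # independent group

print(fetch_position("Payment_Group", "Payment_Topic", 0))    # 2
print(fetch_position("Analytics_Group", "Payment_Topic", 0))  # 5
```

Because the key includes the group, two groups reading the same partition keep entirely separate positions, which is how multiple applications consume the same data independently.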
Example: Single Partition Tracking
Initial State:
- Consumer A reads from Partition 0 and processes the message at Offset 0
- The committed offset always points to the *next* message to read, so after this first message it is 1
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 1 |
After Processing:
- Consumer A has also processed the message at Offset 1
- The committed offset advances to 2, the next message to read
- Consumer A can resume from this position after any failure
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Consumer_A | 2 |
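The progression in the two tables can be traced in a few lines, assuming the commit-after-process pattern described above:

```python
# Two messages in Partition 0; the committed offset always points to
# the NEXT message to read, matching the tables above.
messages = ["Message_0", "Message_1"]
committed_offset = 0

for offset, msg in enumerate(messages):
    # ... process msg here ...
    committed_offset = offset + 1  # commit the next offset to read

print(committed_offset)  # 2
```

After processing the message at Offset 0 the commit is 1; after Offset 1 it is 2, exactly as the tables show.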
Example: Two Partitions with Consumer Group
Consider a Payment_Group with two consumers reading from two partitions:
Initial State:
Messages in Partition 0:
- Offset 0: Message_0
- Offset 1: Message_1
- Offset 2: Message_2
Messages in Partition 1:
- Offset 0: Message_3
- Offset 1: Message_4
- Offset 2: Message_5
Offset Tracking:
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 1 |
| Payment_Topic | 1 | Payment_Group | 1 |
- Consumer A reads from Partition 0; it has processed the message at Offset 0, so the committed offset is 1
- Consumer B reads from Partition 1; it has processed the message at Offset 0, so the committed offset is 1
After Progression:
| Topic | Partition | Consumer Group | Offset |
|-------|-----------|----------------|--------|
| Payment_Topic | 0 | Payment_Group | 2 |
| Payment_Topic | 1 | Payment_Group | 2 |
- Consumer A has processed through Offset 1 in Partition 0, so the committed offset is 2
- Consumer B has processed through Offset 1 in Partition 1, so the committed offset is 2
- Group-level tracking enables seamless failover
- If a consumer fails, another can resume from committed offset
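A sketch of the two-consumer scenario, with commits keyed exactly as in the tables (topic, partition, group):

```python
partitions = {
    0: ["Message_0", "Message_1", "Message_2"],
    1: ["Message_3", "Message_4", "Message_5"],
}
committed = {
    ("Payment_Topic", 0, "Payment_Group"): 0,
    ("Payment_Topic", 1, "Payment_Group"): 0,
}

def poll_and_commit(topic, partition, group):
    """Process one message from the partition, then commit the next offset."""
    key = (topic, partition, group)
    offset = committed[key]
    _msg = partitions[partition][offset]  # ... process the message ...
    committed[key] = offset + 1

# Consumer A owns Partition 0, Consumer B owns Partition 1;
# each processes two messages:
for _ in range(2):
    poll_and_commit("Payment_Topic", 0, "Payment_Group")  # Consumer A
    poll_and_commit("Payment_Topic", 1, "Payment_Group")  # Consumer B

print(committed[("Payment_Topic", 0, "Payment_Group")])  # 2
print(committed[("Payment_Topic", 1, "Payment_Group")])  # 2
```

Because progress lives in the group-level map rather than inside either consumer, a replacement consumer can pick up any partition and continue from its committed offset.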
Data Distribution and Fault Tolerance
Core Principles
Kafka distributes data across brokers using:
- Partitions: Enable parallel processing
- Replicas: Ensure fault tolerance
- Leaders: Handle all reads and writes for a partition
- Followers: Replicate leader data
Example Architecture
#### Diagram 1: Partition Leaders
Two brokers managing two partitions:
- Broker 1: Partition 0 Leader
- Broker 2: Partition 1 Leader
- Producer sends data to respective leaders
#### Diagram 2: Replication
Each partition replicated across brokers:
- Partition 0 Leader on Broker 1, Replica on Broker 2
- Partition 1 Leader on Broker 2, Replica on Broker 1
- Leaders replicate to followers
- Data accessible from replicas if broker fails
#### Diagram 3: Leader Election on Failure
When Broker 1 fails:
- ZooKeeper (or the KRaft controller in newer Kafka versions) coordinates leader election
- Partition 0 replica on Broker 2 becomes new leader
- Producer redirects to new leader
- Service continues uninterrupted
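The failover in Diagram 3 can be sketched as promoting the surviving replica. The broker names and the single-follower layout mirror the diagrams above; real clusters can have more replicas and elect only from in-sync ones:

```python
# Per-partition replica assignment, matching Diagrams 1-2:
replicas = {
    0: {"leader": "broker-1", "follower": "broker-2"},
    1: {"leader": "broker-2", "follower": "broker-1"},
}

def handle_broker_failure(failed_broker):
    """Promote the follower of every partition the failed broker led."""
    for assignment in replicas.values():
        if assignment["leader"] == failed_broker:
            assignment["leader"] = assignment["follower"]
            assignment["follower"] = None  # no replica until broker returns

handle_broker_failure("broker-1")

print(replicas[0]["leader"])  # broker-2 (promoted)
print(replicas[1]["leader"])  # broker-2 (unchanged)
```

After the election, producers and consumers simply redirect to the new leader, so Partition 0 stays available throughout.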
Summary
Kafka's data distribution model provides:
Ordering:
- Messages with same key go to same partition
- Order preserved within partitions
- Critical for sequential processing
Storage:
- Partitions stored as log segments
- Efficient management and cleanup
- Optimized for high throughput
Tracking:
- Offsets identify message positions
- Consumer groups track progress
- Enables reliable delivery and recovery
Fault Tolerance:
- Replication ensures data availability
- Automatic leader election
- No data loss on broker failure
This architecture makes Kafka ideal for mission-critical data streaming applications that require both high performance and reliability.