4.7 In-Sync Replicas and Acknowledgements
Ensuring data reliability with ISR and acknowledgement modes.
In-Sync Replicas (ISR) and Acknowledgements
Overview
ISR and acknowledgement modes are crucial for data reliability in Kafka's distributed system.
What is ISR?
The in-sync replica set (ISR) is the set of replicas for a partition, the leader included, that have kept up with the leader's log within a configured time window (replica.lag.time.max.ms).
Purpose
- Ensures data consistency
- Provides durability
- Enables fault tolerance
- Maintains data safety during failures
Example
For a topic partition with replication factor 3:
- ISR typically includes: Leader + 2 followers
- All replicas fully synchronized
- Redundancy built into the system
Scenario: Leader Failure
Initial State
Payment Topic (Replication Factor 3):
- Broker 1: Partition 0 Leader
- Broker 2: Partition 0 Replica 1
- Broker 3: Partition 0 Replica 2
- ISR: [Leader, Replica_1, Replica_2]
Producer sends data → Leader → both followers fetch the data from the leader
When Leader Crashes
- Failure Detected: The cluster controller detects that the leader's broker is down
- Leader Election: The controller picks Replica_1 from the ISR as the new leader
- ISR Updated: ISR = [Replica_1 (new leader), Replica_2]
- Traffic Redirected: Producer sends data to Replica_1
- Replication Continues: Replica_1 replicates to Replica_2
Result: Clients refresh their metadata and continue against the new leader after a brief interruption
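The failover steps above can be sketched as a small toy model. This is not Kafka's controller code: the `Partition` class and `handle_broker_failure` function are hypothetical names, broker ids stand in for the replicas, and promoting the first surviving in-sync replica is a simplification of the real election.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition's replication state (not a Kafka class)."""
    leader: int                                   # broker id of the current leader
    isr: list[int] = field(default_factory=list)  # in-sync broker ids, leader included


def handle_broker_failure(partition: Partition, failed_broker: int) -> Partition:
    """Drop the failed broker from the ISR and, if it was the leader,
    promote the first surviving in-sync replica."""
    surviving = [b for b in partition.isr if b != failed_broker]
    if not surviving:
        raise RuntimeError("no in-sync replica left; partition is offline")
    if partition.leader == failed_broker:
        new_leader = surviving[0]
    else:
        new_leader = partition.leader
    return Partition(leader=new_leader, isr=surviving)


# Payment topic, partition 0: broker 1 leads, brokers 2 and 3 follow.
before = Partition(leader=1, isr=[1, 2, 3])
after = handle_broker_failure(before, failed_broker=1)
print(after)  # Partition(leader=2, isr=[2, 3])
```

Only replicas in the ISR are candidates in this sketch, which is why an acknowledged write that reached every in-sync replica is still present on the new leader.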
Handling Partition Lag
Scenario
Initial ISR: [Leader, Replica_1, Replica_2]
Why Replicas Fall Behind
Network Latency:
- Congestion between brokers
- Temporary slowdown in communication
- Packet loss
Resource Contention:
- High CPU usage
- Disk I/O saturation
- Memory pressure
Detection Mechanism
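Kafka tracks replica lag in two ways, but only the first decides ISR membership:
Time-based (replica.lag.time.max.ms):
- A follower is removed from the ISR if it has not caught up to the leader's log end offset within this window
- Default: 30 seconds since Kafka 2.5 (10 seconds in earlier versions)
Offset-based (message offset lag):
- Compares replica offset to leader offset
- Useful as a monitoring signal for a replica that is not catching up
- No longer drives ISR membership: the replica.lag.max.messages setting was removed in Kafka 0.9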
ISR Update Process
- Lag Detected: Replica_2 falls behind
- ISR Updated: ISR = [Leader, Replica_1] (Replica_2 removed)
- Issue Resolved: Network/resource stabilizes
- Catch-Up: Replica_2 syncs with leader
- Rejoin ISR: Replica_2 added back to ISR
This dynamic mechanism maintains resilience during temporary disruptions.
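The shrink-and-rejoin cycle can be sketched as a time-based check. This is a minimal model, not broker code: `Follower`, `update_isr`, and the timestamps are hypothetical, and the 10-second threshold is simply the value used in the Configuration section of this lesson.

```python
from dataclasses import dataclass

REPLICA_LAG_TIME_MAX_MS = 10_000  # assumed value, mirrors replica.lag.time.max.ms


@dataclass
class Follower:
    broker_id: int
    last_caught_up_ms: int  # last time this follower had fetched up to the leader's log end


def update_isr(leader_id: int, followers: list[Follower], now_ms: int) -> list[int]:
    """Return the ISR: the leader plus every follower that was caught up
    within the last REPLICA_LAG_TIME_MAX_MS milliseconds."""
    in_sync = [
        f.broker_id
        for f in followers
        if now_ms - f.last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS
    ]
    return [leader_id] + in_sync


followers = [Follower(2, last_caught_up_ms=0), Follower(3, last_caught_up_ms=0)]

# t = 12 s: broker 2 kept up, broker 3 has been stalled since t = 0.
followers[0].last_caught_up_ms = 12_000
isr_lagging = update_isr(1, followers, now_ms=12_000)

# t = 20 s: broker 3 has caught up again and rejoins.
followers[0].last_caught_up_ms = 20_000
followers[1].last_caught_up_ms = 20_000
isr_recovered = update_isr(1, followers, now_ms=20_000)

print(isr_lagging, isr_recovered)  # [1, 2] [1, 2, 3]
```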
Adding a New Replica
Scenario
Current State:
- ISR: [Leader, Replica_1]
- Missing: One replica due to network issue
- Desired: Three in-sync replicas again (replication factor 3)
Process
- Add Broker: Introduce new broker to cluster
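- Assign Replica: Replica_3 is placed on the new broker through a partition reassignment (a new broker receives no existing partitions automatically)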
- Initial Sync: Replica_3 starts replicating from leader
- Catch-Up: Replicates all data to match leader
- Full Sync: Replica_3 fully synchronized
- ISR Update: ISR = [Leader, Replica_1, Replica_3]
Result: Three in-sync replicas again, redundancy complete
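The replica assignment in this process is usually requested with a reassignment plan. The sketch below builds such a plan in the JSON layout read by the kafka-reassign-partitions tool. The topic name `payments` and the broker ids 1, 2, and 4 are assumptions made for illustration, and `build_reassignment` is a hypothetical helper.

```python
import json


def build_reassignment(topic: str, partition: int, replicas: list[int]) -> dict:
    """Build a reassignment plan in the JSON layout used by the
    kafka-reassign-partitions tool. The first broker id listed is the
    preferred leader."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": replicas}
        ],
    }


# Hypothetical ids: brokers 1 and 2 hold the in-sync replicas, broker 4 is new.
plan = build_reassignment("payments", 0, [1, 2, 4])
print(json.dumps(plan, indent=2))
```

Saved to a file, a plan like this can be passed to the tool with its --reassignment-json-file and --execute options. The new replica joins the ISR only after it has caught up with the leader.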
Acknowledgement Modes
acks=0 (No Acknowledgement)
Behavior: Producer doesn't wait for broker confirmation
Performance: Fastest
Reliability: Lowest
Use Case: Non-critical logs, metrics where some loss is acceptable
Risk: Messages can be lost without the producer noticing, for example if the broker is down or the request fails in transit
acks=1 (Leader Acknowledgement)
Behavior: Producer waits for leader confirmation only
Performance: Moderate
Reliability: Moderate
Use Case: Balanced performance and reliability
Risk: Data loss if leader fails before replication
acks=all or -1 (All ISR Acknowledgement)
Behavior: Producer waits until the leader confirms that every replica in the current ISR has the record
Performance: Slowest
Reliability: Highest
Use Case: Critical data (financial transactions, user data)
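Risk: Higher latency. Acknowledged data survives as long as at least one in-sync replica survives, so pair this mode with min.insync.replicas ≥ 2 to keep the ISR from shrinking to the leader alone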
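The difference between the three modes comes down to when a write counts as acknowledged. The toy function below models that decision for a single record. It is a sketch under stated assumptions, not the broker's logic: `leader_response` and its return strings are hypothetical, and retries and timeouts are ignored.

```python
def leader_response(acks: str, isr_size: int, replicated_to: int,
                    min_insync_replicas: int = 1) -> str:
    """Toy model of what the producer observes for one write.

    isr_size       number of in-sync replicas, leader included
    replicated_to  how many ISR members (leader included) hold the record so far
    """
    if acks == "0":
        return "sent"  # fire and forget: no response is awaited
    if acks == "1":
        return "ack" if replicated_to >= 1 else "pending"
    if acks in ("all", "-1"):
        if isr_size < min_insync_replicas:
            # Kafka reports this case to the producer as NotEnoughReplicasException.
            return "error: not enough in-sync replicas"
        return "ack" if replicated_to >= isr_size else "pending"
    raise ValueError(f"unknown acks value: {acks!r}")


print(leader_response("1", isr_size=3, replicated_to=1))    # ack
print(leader_response("all", isr_size=3, replicated_to=1))  # pending
print(leader_response("all", isr_size=3, replicated_to=3))  # ack
print(leader_response("all", isr_size=1, replicated_to=1, min_insync_replicas=2))
```

The last call shows why min.insync.replicas matters: with acks=all alone, an ISR that has shrunk to the leader would still acknowledge writes backed by a single copy.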
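Configuration
Producer Configuration
acks=all
Broker Configuration
replica.lag.time.max.ms=10000
min.insync.replicas=2
Note: min.insync.replicas is a broker-level or topic-level setting. The producer does not read it, so it belongs here rather than in the producer configuration.
Best Practices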
ISR Management
- Monitor ISR size
- Alert on ISR shrinkage
- Investigate lag causes
- Maintain adequate replication
Acknowledgement Strategy
- Use acks=all for critical data
- Set min.insync.replicas ≥ 2
- Balance latency vs reliability
- Monitor producer metrics
Replication
- Replication factor ≥ 3 for production
- Spread replicas across availability zones
- Monitor replica lag
- Plan for broker failures
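How replication factor and min.insync.replicas combine can be worked out with simple arithmetic: with acks=all, writes to a partition keep succeeding as long as at least min.insync.replicas replicas remain in sync. The helper below is a hypothetical illustration of that rule.

```python
def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers that can fail while acks=all writes to a partition still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


for rf, min_isr in [(3, 2), (3, 3), (2, 2), (3, 1)]:
    print(f"RF={rf}, min.insync.replicas={min_isr}: "
          f"tolerates {tolerated_failures(rf, min_isr)} broker failure(s)")
```

Replication factor 3 with min.insync.replicas=2 tolerates one broker failure while every acknowledged write still exists on at least two brokers. Setting min.insync.replicas=1 tolerates two failures for writes, but an acknowledgement may then be backed by a single copy.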
Monitoring
Key Metrics
- ISR size per partition
- Replica lag (time and offset)
- Under-replicated partitions
- Producer acknowledgement latency
Alerts
- ISR shrinkage
- Replica lag exceeding threshold
- Under-replicated partitions
- Producer errors
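The ISR-related alerts can be expressed as two comparisons per partition. The sketch below assumes the partition state has already been collected (for example from cluster metadata); `PartitionState`, `find_alerts`, and the alert wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: list[int]  # assigned broker ids
    isr: list[int]       # broker ids currently in sync


def find_alerts(states: list[PartitionState], min_insync_replicas: int) -> list[str]:
    """Flag partitions whose ISR has shrunk below the assigned replica count,
    and escalate when it drops below min.insync.replicas."""
    alerts = []
    for s in states:
        name = f"{s.topic}-{s.partition}"
        if len(s.isr) < min_insync_replicas:
            alerts.append(f"CRITICAL {name}: ISR below min.insync.replicas")
        elif len(s.isr) < len(s.replicas):
            alerts.append(f"WARNING {name}: under-replicated")
    return alerts


states = [
    PartitionState("payments", 0, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState("payments", 1, replicas=[1, 2, 3], isr=[2, 3]),
    PartitionState("payments", 2, replicas=[1, 2, 3], isr=[3]),
]
alerts = find_alerts(states, min_insync_replicas=2)
for line in alerts:
    print(line)
```

A partition below min.insync.replicas is treated as more severe here because producers using acks=all can no longer write to it.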
Summary
ISR and acknowledgements provide:
- Data durability: Through replication
- Fault tolerance: Automatic failover
- Flexibility: Choose reliability vs performance
- Resilience: Dynamic ISR management
Understanding these concepts is essential for building reliable Kafka-based systems that meet your data consistency and availability requirements.