In-Sync Replicas (ISR) and Acknowledgements

Overview

ISR membership and the producer's acknowledgement mode (acks) together determine how durable a write is in Kafka and how much latency the producer pays for that durability.

What is ISR?

The in-sync replica set (ISR) is the subset of a partition's replicas, the leader included, that have caught up with the leader's log within the allowed lag (replica.lag.time.max.ms).

Purpose

Ensures data consistency
Provides durability
Enables fault tolerance
Maintains data safety during failures

Example

For a topic partition with replication factor 3:

ISR typically includes: Leader + 2 followers
All replicas fully synchronized
Redundancy built into the system
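
As a rough illustration of the example above, a partition can be modelled as a leader, its assigned replicas, and the current ISR. The Partition class and its field names are hypothetical and chosen for this sketch; they are not part of any Kafka client API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Partition:
    """Toy model of one topic partition (illustrative names, not a Kafka API)."""
    leader: int                                  # broker id of the leader
    replicas: List[int]                          # all assigned replicas, leader included
    isr: Set[int] = field(default_factory=set)   # replicas currently in sync

    def is_under_replicated(self) -> bool:
        # Fewer in-sync replicas than assigned replicas
        return len(self.isr) < len(self.replicas)


# Replication factor 3: leader on broker 1, followers on brokers 2 and 3
payments_p0 = Partition(leader=1, replicas=[1, 2, 3], isr={1, 2, 3})
print(payments_p0.is_under_replicated())  # False
```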

Scenario: Leader Failure

Initial State

Payment Topic (Replication Factor 3):

Broker 1: Partition 0 Leader
Broker 2: Partition 0 Replica 1
Broker 3: Partition 0 Replica 2
ISR: [Leader, Replica_1, Replica_2]

Producer sends data → Leader → Replicates to both replicas

When Leader Crashes

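Failure Detected: The cluster controller detects that the leader's broker (Broker 1) is down
Leader Election: The controller elects a new leader from the remaining ISR members, here Replica_1
ISR Updated: ISR = [Replica_1 (new leader), Replica_2]
Traffic Redirected: Producers refresh their metadata and send data to Replica_1
Replication Continues: Replica_1 replicates to Replica_2
Result: Failover completes after a brief window in which producers retry their requests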
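
The failover steps above can be sketched as a small function. This is a simplified model, not the controller's actual code: it assumes unclean leader election is disabled, so only ISR members are eligible, and the function name elect_new_leader is made up for the example.

```python
def elect_new_leader(replicas, isr, failed_broker):
    """Return (new_leader, new_isr) after failed_broker goes down.

    replicas: assigned broker ids in preference order
    isr: set of broker ids that are currently in sync
    """
    # Only replicas that are in the ISR and still alive may take over
    surviving = [b for b in replicas if b in isr and b != failed_broker]
    if not surviving:
        raise RuntimeError("no in-sync replica available; partition is offline")
    # Pick the first eligible replica in assignment order
    return surviving[0], set(surviving)


# Broker 1 (leader) crashes; brokers 2 and 3 hold Replica_1 and Replica_2
new_leader, new_isr = elect_new_leader([1, 2, 3], {1, 2, 3}, failed_broker=1)
print(new_leader, sorted(new_isr))  # 2 [2, 3]
```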

Handling Partition Lag

Scenario

Initial ISR: [Leader, Replica_1, Replica_2]

Why Replicas Fall Behind

Network Latency:

Congestion between brokers
Temporary slowdown in communication
Packet loss

Resource Contention:

High CPU usage
Disk I/O saturation
Memory pressure

Detection Mechanism

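Kafka decides ISR membership with a time-based check; offset lag is a monitoring signal only:

Time-based: replica.lag.time.max.ms
- A follower is removed from the ISR if it has not caught up to the leader's log end offset within this window
- Default: 30 seconds since Kafka 2.5 (10 seconds in earlier versions)
Offset-based: Message offset lag
- Compares replica offset to leader offset
- Useful for monitoring, but it no longer affects ISR membership (the replica.lag.max.messages setting was removed in Kafka 0.9)

ISR Update Process

Lag Detected: Replica_2 falls behind
ISR Updated: ISR = [Leader, Replica_1] (Replica_2 removed)
Issue Resolved: Network/resource stabilizes
Catch-Up: Replica_2 syncs with leader
Rejoin ISR: Replica_2 added back to ISR

This dynamic mechanism maintains resilience during temporary disruptions.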
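
The shrink-and-rejoin cycle can be sketched with a time-based check. This is a toy model under stated assumptions: compute_isr and the per-follower "last caught up" timestamps are invented for the example, and the real broker logic has more conditions than shown here.

```python
REPLICA_LAG_TIME_MAX_MS = 10_000  # same value as the broker configuration in this section


def compute_isr(leader, last_caught_up_ms, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the leader plus every follower that caught up recently enough.

    last_caught_up_ms maps follower name -> timestamp (ms) at which it was
    last fully caught up with the leader.
    """
    isr = {leader}
    for follower, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(follower)
    return isr


# Replica_2 last caught up 15 s ago, so it drops out of the ISR
before = compute_isr("Leader", {"Replica_1": 99_000, "Replica_2": 85_000}, now_ms=100_000)
print(sorted(before))  # ['Leader', 'Replica_1']

# After the network recovers, Replica_2 catches up and rejoins
after = compute_isr("Leader", {"Replica_1": 119_500, "Replica_2": 119_800}, now_ms=120_000)
print(sorted(after))  # ['Leader', 'Replica_1', 'Replica_2']
```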

Adding a New Replica

Scenario

Current State:

ISR: [Leader, Replica_1]
Missing: One replica due to network issue
Desired: Restore replication factor to 3

Process

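Add Broker: Introduce new broker to cluster
Assign Replica: Reassign the partition so that Replica_3 is placed on the new broker (Kafka does not move replicas onto a new broker automatically; this takes a partition reassignment, for example with the kafka-reassign-partitions tool)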
Initial Sync: Replica_3 starts replicating from leader
Catch-Up: Replicates all data to match leader
Full Sync: Replica_3 fully synchronized
ISR Update: ISR = [Leader, Replica_1, Replica_3]
Result: Replication factor restored, redundancy complete
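
The catch-up phase can be illustrated with a small simulation. It assumes the leader receives no new writes while Replica_3 is syncing and that each fetch returns a fixed number of records; both are simplifications, and catch_up and fetch_batch are hypothetical names.

```python
def catch_up(leader_log_end_offset, replica_offset, fetch_batch=500):
    """Simulate fetch rounds until the replica reaches the leader's log end offset.

    Returns (number_of_fetch_rounds, final_replica_offset).
    """
    rounds = 0
    while replica_offset < leader_log_end_offset:
        # Each fetch copies at most fetch_batch records from the leader
        replica_offset = min(replica_offset + fetch_batch, leader_log_end_offset)
        rounds += 1
    return rounds, replica_offset


isr = {"Leader", "Replica_1"}
rounds, offset = catch_up(leader_log_end_offset=2_000, replica_offset=0)
if offset == 2_000:
    # Only a fully caught-up replica is added to the ISR
    isr.add("Replica_3")
print(rounds, sorted(isr))  # 4 ['Leader', 'Replica_1', 'Replica_3']
```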

Acknowledgement Modes

acks=0 (No Acknowledgement)

Behavior: Producer doesn't wait for broker confirmation

Performance: Fastest

Reliability: Lowest

Use Case: Non-critical logs, metrics where some loss is acceptable

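Risk: Messages are lost silently if the broker fails or the request never arrives; the producer is not told, so retries have no effect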

acks=1 (Leader Acknowledgement)

Behavior: Producer waits for leader confirmation only

Performance: Moderate

Reliability: Moderate

Use Case: Balanced performance and reliability

Risk: Data loss if the leader fails after acknowledging the write but before followers have replicated it

acks=all or -1 (All ISR Acknowledgement)

Behavior: Producer waits until every replica currently in the ISR has the record

Performance: Slowest

Reliability: Highest

Use Case: Critical data (financial transactions, user data)

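Risk: Higher latency. An acknowledged write survives as long as at least one ISR member survives, but if the ISR shrinks to the leader alone and min.insync.replicas is 1, durability is no better than acks=1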
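
The three modes can be compared with a toy decision function. This is a sketch of the semantics described above, not producer or broker code; is_acknowledged and its parameters are hypothetical names.

```python
def is_acknowledged(acks, leader_written, isr_written, isr_size, min_insync_replicas=1):
    """Decide whether a send counts as successful from the producer's view.

    acks: "0", "1", or "all"
    leader_written: True if the leader appended the record
    isr_written: number of ISR members (leader included) that hold the record
    isr_size: current size of the ISR
    """
    if acks == "0":
        return True  # fire-and-forget: the producer never waits for a response
    if acks == "1":
        return leader_written  # the leader's append is enough
    if acks == "all":
        if isr_size < min_insync_replicas:
            # The broker rejects the write (a NotEnoughReplicas error)
            return False
        return leader_written and isr_written == isr_size
    raise ValueError(f"unsupported acks value: {acks!r}")


print(is_acknowledged("0", leader_written=False, isr_written=0, isr_size=3))   # True
print(is_acknowledged("1", leader_written=True, isr_written=1, isr_size=3))    # True
print(is_acknowledged("all", leader_written=True, isr_written=2, isr_size=3))  # False
```

The last call returns False because one ISR member has not yet replicated the record, so the producer keeps waiting.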

Configuration

Producer Configuration

properties

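acks=all

min.insync.replicas is not a producer setting; it is configured on the broker or per topic (see Broker Configuration) and is enforced only for writes sent with acks=all.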

Broker Configuration

properties

replica.lag.time.max.ms=10000
min.insync.replicas=2
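
A short arithmetic sketch shows what these settings imply for availability: with acks=all, writes keep succeeding while the ISR has at least min.insync.replicas members. The function name below is made up for illustration.

```python
def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Number of replicas that can be lost while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


print(tolerated_broker_failures(3, 2))  # 1
print(tolerated_broker_failures(3, 3))  # 0
```

With replication factor 3 and min.insync.replicas=2, one broker can fail without blocking writes; setting min.insync.replicas equal to the replication factor leaves no headroom.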

Best Practices

ISR Management

Monitor ISR size
Alert on ISR shrinkage
Investigate lag causes
Maintain adequate replication

Acknowledgement Strategy

Use acks=all for critical data
Set min.insync.replicas ≥ 2
Balance latency vs reliability
Monitor producer metrics

Replication

Replication factor ≥ 3 for production
Spread replicas across availability zones
Monitor replica lag
Plan for broker failures

Monitoring

Key Metrics

ISR size per partition
Replica lag (time and offset)
Under-replicated partitions
Producer acknowledgement latency

Alerts

ISR shrinkage
Replica lag exceeding threshold
Under-replicated partitions
Producer errors
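
The ISR-related alerts can be sketched as a check over partition metadata. The dictionaries below are hand-written sample data and find_alerts is a hypothetical helper; in practice the replica and ISR lists would come from your monitoring or admin tooling.

```python
def find_alerts(partitions, min_insync_replicas=2):
    """Return (partition_name, reason) pairs for partitions that need attention."""
    alerts = []
    for p in partitions:
        name = f"{p['topic']}-{p['partition']}"
        if len(p["isr"]) < min_insync_replicas:
            # acks=all writes to this partition are being rejected
            alerts.append((name, "below min.insync.replicas"))
        elif len(p["isr"]) < len(p["replicas"]):
            # Still writable, but redundancy is reduced
            alerts.append((name, "under-replicated"))
    return alerts


sample = [
    {"topic": "payments", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "payments", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 2]},
    {"topic": "payments", "partition": 2, "replicas": [1, 2, 3], "isr": [3]},
]
for alert in find_alerts(sample):
    print(alert)
# ('payments-1', 'under-replicated')
# ('payments-2', 'below min.insync.replicas')
```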

Summary

ISR and acknowledgements provide:

Data durability: Through replication
Fault tolerance: Automatic failover
Flexibility: Choose reliability vs performance
Resilience: Dynamic ISR management

Understanding these concepts is essential for building reliable Kafka-based systems that meet your data consistency and availability requirements.

CentralMesh.io

4.7 In-Sync Replicas and Acknowledgements
