6.3 Kafka Streams
Real-time stream processing with the Kafka Streams library. Covers stateless vs. stateful operations, exactly-once processing, and event-time windowing.
Kafka Streams: Introduction
What is Kafka Streams?
Kafka Streams is a client library for building real-time stream processing applications that consume from and produce to Kafka topics.
Key Characteristics
- No Separate Cluster Required: Runs as a simple Java application
- Easy Integration: Fits into existing architectures seamlessly
- Flexible Processing: Supports both stateless and stateful operations
- Built-in Scalability: Horizontal scaling by running multiple instances
Why Use Kafka Streams?
Simple Yet Powerful
- Easy-to-use API for stream processing
- Minimal setup required
- Designed for real-time data processing
Scalability and Reliability
- Distributed and fault-tolerant architecture
- Handles high volumes of data
- Automatic failure recovery
Exactly-Once Processing
- Guarantees messages processed without duplicates or loss
- Critical for financial transactions and analytics
- Ensures data consistency
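Exactly-once semantics is enabled with a single configuration property. A minimal sketch using plain `java.util.Properties` (the application id and broker address are hypothetical placeholders; since Kafka 3.x the recommended value is `exactly_once_v2`, while older versions used `exactly_once`):

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        // Required identifiers for any Kafka Streams application (placeholder values)
        props.put("application.id", "payments-processor");
        props.put("bootstrap.servers", "localhost:9092");
        // Enable exactly-once processing; "exactly_once_v2" is the
        // recommended setting for Kafka Streams 3.x and later
        props.put("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig().getProperty("processing.guarantee"));
    }
}
```

These properties would be passed to the `KafkaStreams` constructor alongside the topology.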
Event-Time Processing
- Uses actual event timestamps (not arrival time)
- Accurate handling of late or out-of-order events
- Better processing semantics
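To see why event time matters, here is a plain-Java sketch (not the Kafka Streams API) that buckets events into 5-minute windows by their embedded timestamps; a late arrival still lands in the window it belongs to, not the window in which it arrived:

```java
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindows {
    static final long WINDOW_MS = 5 * 60 * 1000; // 5-minute windows

    // Bucket each event by its own timestamp, not by arrival order
    static Map<Long, Integer> countByWindow(long[] eventTimestamps) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimestamps) {
            long windowStart = (ts / WINDOW_MS) * WINDOW_MS;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Third event arrives last but its timestamp puts it in the first window
        long[] arrivals = {0, 300_000, 60_000};
        System.out.println(countByWindow(arrivals)); // {0=2, 300000=1}
    }
}
```

Kafka Streams applies the same idea via timestamp extractors, so out-of-order records are assigned to the correct window automatically.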
Windowing and Aggregation
- Count, sum, and group data within time frames
- Supports time-based operations
- Flexible aggregation capabilities
Seamless Kafka Integration
- Natural choice for Kafka-based applications
- Direct topic consumption and production
- Leverages existing Kafka infrastructure
Stateless vs Stateful Processing
Stateless Processing
- Independent Event Handling: Each event processed in isolation
- No Memory Required: Doesn't store past data
- Common Operations:
  - Filtering messages
  - Transforming data
  - Mapping values
- Characteristics: Lightweight and fast
Stateful Processing
- Depends on History: Relies on previous data for decisions
- State Storage Required: Uses state stores for fault tolerance
- Common Operations:
  - Aggregations
  - Windowing operations
  - Joins between streams
- Characteristics: Provides deeper insights but requires more resources
Kafka Streams Architecture
Core Components
#### Kafka Topics
- Act as input and output data streams
- Foundation of stream processing
#### Stream Processors
- Transform, filter, and aggregate data
- Each instance runs as independent processing node
- Core building blocks of processing logic
#### State Stores
- Enable stateful operations
- Track previous data for aggregations, joins, windowing
- Backed by Kafka for fault tolerance
- Ensure recovery capability
Horizontal Scalability
- Multiple instances run in parallel
- Each instance processes portion of data
- Automatic workload redistribution on failure
- Highly scalable and fault-tolerant
Kafka Streams API
High-Level DSL (Domain-Specific Language)
#### KStream API
- Represents continuous stream of records
- Processes events as they arrive
- Use cases:
  - Filtering events
  - Mapping transformations
  - Joining streams
#### KTable
- Represents changelog stream
- Keeps latest value for each key
- Ideal for:
  - Aggregated data
  - Deduplicated data
  - Tracking current state (e.g., latest order status)
#### GlobalKTable
- Global, partition-independent view of data
- Stored across all instances
- Useful for lookups requiring access to all partitions
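Conceptually, a KStream is the full sequence of events while a KTable keeps only the latest value per key. A plain-Java sketch of that changelog compaction (not the Kafka Streams API; the order keys and statuses are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChangelogView {
    record Event(String key, String value) {}

    // KTable semantics: replaying the changelog keeps the latest value per key
    static Map<String, String> latestByKey(List<Event> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Event e : changelog) {
            table.put(e.key(), e.value()); // later events overwrite earlier ones
        }
        return table;
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
            new Event("order-1", "CREATED"),
            new Event("order-2", "CREATED"),
            new Event("order-1", "SHIPPED"));
        System.out.println(latestByKey(stream)); // {order-1=SHIPPED, order-2=CREATED}
    }
}
```

The `stream` list plays the role of a KStream (every event retained); the resulting map plays the role of a KTable (current state only).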
Low-Level Processor API
- Available for advanced use cases
- Provides fine-grained control
Stream Processing Operations
Stateless Operations
#### Filtering
```java
stream.filter((key, value) -> value.contains("error"));
```
- Remove records not matching a condition
- Keep only specific messages
#### Mapping
```java
stream.mapValues(value -> value.toUpperCase());
```
- Transform message values
- Apply functions to data
Stateful Operations
#### Windowing
```java
stream.windowedBy(TimeWindows.of(Duration.ofMinutes(5)));
```
- Group records within a fixed time frame
- Process events together in windows
#### Aggregation
```java
stream.groupByKey().count();
```
- Count records sharing the same key
- Maintain running totals
Kafka Streams Topology
A topology defines how data flows through processing steps.
Example Topology
```java
// Create StreamsBuilder
StreamsBuilder builder = new StreamsBuilder();

// Read from input topic
KStream<String, String> stream = builder.stream("input-topic");

// Apply transformations
stream
    .filter((key, value) -> value.length() > 5) // Keep values > 5 chars
    .mapValues(value -> value.toUpperCase())    // Convert to uppercase
    .to("output-topic");                        // Write to output topic
```
Processing Flow
- Read from input topic
- Filter records (value length > 5)
- Map values to uppercase
- Write to output topic
Stateless Processing Example
Processing each event independently without storing past data:
```
Input: "apple", "banana", "cherry"

Filter (length > 5):
  "apple"  → dropped
  "banana" → kept
  "cherry" → kept

MapValues (uppercase):
  "banana" → "BANANA"
  "cherry" → "CHERRY"

Output: "BANANA", "CHERRY"
```
No state store required: each record is transformed in isolation.
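The same pipeline can be reproduced with plain Java streams, which makes the record-by-record independence obvious (a conceptual sketch, not the Kafka Streams API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class StatelessPipeline {
    // Same filter + map chain as above: record by record, no shared state
    static List<String> process(List<String> input) {
        return input.stream()
            .filter(v -> v.length() > 5)  // "apple" (5 chars) is dropped
            .map(String::toUpperCase)     // each surviving record transformed in isolation
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(process(List.of("apple", "banana", "cherry"))); // [BANANA, CHERRY]
    }
}
```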
Stateful Processing with State Stores
Why State Stores?
Stateful processing requires remembering past data to produce correct results. State Stores:
- Store intermediate results
- Enable aggregations, joins, windowing
- Provide fault tolerance and recovery
Example: Running Sum Per Key
#### Phase 1: Incoming Events
```
user1 → 5
user2 → 3
user1 → 2
user2 → 4
```
#### Phase 2: Processing with State Store
```
Group by key and aggregate:
  user1: 5 (initial)
  user2: 3 (initial)
  user1: 5 + 2 = 7 (update)
  user2: 3 + 4 = 7 (update)
```
The state store maintains running totals per user.
#### Phase 3: Final Output
```
user1 → 7
user2 → 7
```
Consolidated totals are written to the output topic.
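The running-sum logic above boils down to a keyed merge. A plain-Java sketch of what the state store computes (in Kafka Streams this would be expressed with `groupByKey()` plus a reduce or aggregate, backed by a state store):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RunningSum {
    // The "state store": current total per key, updated on every event
    static Map<String, Integer> aggregate(String[][] events) {
        Map<String, Integer> store = new LinkedHashMap<>();
        for (String[] e : events) {
            // e[0] is the key, e[1] the value; merge adds to any existing total
            store.merge(e[0], Integer.parseInt(e[1]), Integer::sum);
        }
        return store;
    }

    public static void main(String[] args) {
        String[][] events = {{"user1", "5"}, {"user2", "3"}, {"user1", "2"}, {"user2", "4"}};
        System.out.println(aggregate(events)); // {user1=7, user2=7}
    }
}
```

Because the intermediate totals live in the map (the state store), the computation survives only as long as the map does; Kafka Streams gains fault tolerance by backing this state with a changelog topic.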
Monitoring Kafka Streams
JMX Metrics
- Processing rates, latency, errors
- Tools: Prometheus, Grafana
- Real-time monitoring and visualization
State Change Logs
- Stored in Kafka topics
- Track updates to state stores
- Useful for debugging
Application Logs
- Check with `tail -f` on the application log file
- Detect errors, rebalances, and processing delays
- Essential for troubleshooting
Summary
- Kafka Streams: Real-time stream processing without separate cluster
- Processing Types:
- Stateless: Filter, map (events in isolation)
- Stateful: Aggregations, joins (requires state stores)
- Key Abstractions: KStream, KTable, GlobalKTable
- Fault-Tolerant: Built-in recovery mechanisms
- Scalable: Horizontal scaling with multiple instances
- Seamless Integration: Works directly with existing Kafka deployments