Kafka Streams: Introduction

What is Kafka Streams?

Kafka Streams is a powerful client library for real-time stream processing directly from Kafka topics.

Key Characteristics

No Separate Cluster Required: Runs as a simple Java application
Easy Integration: Fits into existing architectures seamlessly
Flexible Processing: Supports both stateless and stateful operations
Built-in Scalability: Horizontal scaling by running multiple instances

Why Use Kafka Streams?

Simple Yet Powerful

Easy-to-use API for stream processing
Minimal setup required
Designed for real-time data processing

Scalability and Reliability

Distributed and fault-tolerant architecture
Handles high volumes of data
Automatic failure recovery

Exactly-Once Processing

Guarantees messages processed without duplicates or loss
Critical for financial transactions and analytics
Ensures data consistency

Event-Time Processing

Uses actual event timestamps (not arrival time)
Accurate handling of late or out-of-order events
Better processing semantics

Windowing and Aggregation

Count, sum, and group data within time frames
Supports time-based operations
Flexible aggregation capabilities

Seamless Kafka Integration

Natural choice for Kafka-based applications
Direct topic consumption and production
Leverages existing Kafka infrastructure

Stateless vs Stateful Processing

Stateless Processing

Independent Event Handling: Each event processed in isolation
No Memory Required: Doesn't store past data
Common Operations:

- Filtering messages

- Transforming data

- Mapping values

Characteristics: Lightweight and fast

Stateful Processing

Depends on History: Relies on previous data for decisions
State Storage Required: Uses state stores for fault tolerance
Common Operations:

- Aggregations

- Windowing operations

- Joins between streams

Characteristics: Provides deeper insights but requires more resources

Kafka Streams Architecture

Core Components

#### Kafka Topics

Act as input and output data streams
Foundation of stream processing

#### Stream Processors

Transform, filter, and aggregate data
Each instance runs as independent processing node
Core building blocks of processing logic

#### State Stores

Enable stateful operations
Track previous data for aggregations, joins, windowing
Backed by Kafka for fault tolerance
Ensure recovery capability

Horizontal Scalability

Multiple instances run in parallel
Each instance processes portion of data
Automatic workload redistribution on failure
Highly scalable and fault-tolerant

Kafka Streams API

High-Level DSL (Domain-Specific Language)

#### KStream API

Represents continuous stream of records
Processes events as they arrive
Use cases:

- Filtering events

- Mapping transformations

- Joining streams

#### KTable

Represents changelog stream
Keeps latest value for each key
Ideal for:

- Aggregated data

- Deduplicated data

- Tracking current state (e.g., latest order status)

#### GlobalKTable

Global, partition-independent view of data
Stored across all instances
Useful for lookups requiring access to all partitions

Low-Level Processor API

Available for advanced use cases
Provides fine-grained control

Stream Processing Operations

Stateless Operations

#### Filtering

java

1stream.filter((key, value) -> value.contains("error"));

Remove records not matching condition
Keep only specific messages

#### Mapping

java

1stream.mapValues(value -> value.toUpperCase());

Transform message values
Apply functions to data

Stateful Operations

#### Windowing

java

1stream.windowedBy(TimeWindows.of(Duration.ofMinutes(5)));

Group records within fixed time frame
Process events together in windows

#### Aggregation

java

1stream.groupByKey().count();

Count records sharing same key
Maintain running totals

Kafka Streams Topology

A topology defines how data flows through processing steps.

Example Topology

java

1// Create StreamsBuilder
2StreamsBuilder builder = new StreamsBuilder();
3
4// Read from input topic
5KStream<String, String> stream = builder.stream("input-topic");
6
7// Apply transformations
8stream
9  .filter((key, value) -> value.length() > 5)  // Keep values > 5 chars
10  .mapValues(value -> value.toUpperCase())      // Convert to uppercase
11  .to("output-topic");                          // Write to output topic

Processing Flow

Read from input topic
Filter records (value length > 5)
Map values to uppercase
Write to output topic

Stateless Processing Example

Processing each event independently without storing past data:

text

1Input: "apple", "banana", "cherry"
2
3Filter (length > 5):
4  "apple" → dropped
5  "banana" → kept
6  "cherry" → kept
7
8MapValues (uppercase):
9  "banana" → "BANANA"
10  "cherry" → "CHERRY"
11
12Output: "BANANA", "CHERRY"

No state store required - each record transformed in isolation.

Stateful Processing with State Stores

Why State Stores?

Stateful processing requires remembering past data to produce correct results. State Stores:

Store intermediate results
Enable aggregations, joins, windowing
Provide fault tolerance and recovery

Example: Running Sum Per Key

#### Phase 1: Incoming Events

text

1user1 → 5
2user2 → 3
3user1 → 2
4user2 → 4

#### Phase 2: Processing with State Store

text

1Group by key and aggregate:
2  user1: 5 (initial)
3  user2: 3 (initial)
4  user1: 5 + 2 = 7 (update)
5  user2: 3 + 4 = 7 (update)

State Store maintains running totals per user.

#### Phase 3: Final Output

text

1user1 → 7
2user2 → 7

Consolidated totals written to output topic.

Monitoring Kafka Streams

JMX Metrics

Processing rates, latency, errors
Tools: Prometheus, Grafana
Real-time monitoring and visualization

State Change Logs

Stored in Kafka topics
Track updates to state stores
Useful for debugging

Application Logs

Check using tail -f
Detect errors, rebalances, delays
Essential for troubleshooting

Summary

Kafka Streams: Real-time stream processing without separate cluster
Processing Types:

- Stateless: Filter, map (events in isolation)

- Stateful: Aggregations, joins (requires state stores)

Key Abstractions: KStream, KTable, GlobalKTable
Fault-Tolerant: Built-in recovery mechanisms
Scalable: Horizontal scaling with multiple instances
Seamless Integration: Works directly with existing Kafka deployments

CentralMesh.io

6.3 Kafka Streams

Kafka Streams: Introduction

What is Kafka Streams?

Key Characteristics

Why Use Kafka Streams?

Simple Yet Powerful

Scalability and Reliability

Exactly-Once Processing

Event-Time Processing

Windowing and Aggregation

Seamless Kafka Integration

Stateless vs Stateful Processing

Stateless Processing

Stateful Processing

Kafka Streams Architecture

Core Components

Horizontal Scalability

Kafka Streams API

High-Level DSL (Domain-Specific Language)

Low-Level Processor API

Stream Processing Operations

Stateless Operations

Stateful Operations

Kafka Streams Topology

Example Topology

Processing Flow

Stateless Processing Example

Stateful Processing with State Stores

Why State Stores?

Example: Running Sum Per Key

Monitoring Kafka Streams

JMX Metrics

State Change Logs

Application Logs

Summary