CentralMesh.io

Kafka Fundamentals for Beginners
AdSense Banner (728x90)

6.3 Kafka Streams

Real-time stream processing with Kafka Streams library. Covers stateless vs stateful operations, exactly-once processing, and event-time windowing.

Video Coming Soon

Kafka Streams: Introduction

What is Kafka Streams?

Kafka Streams is a powerful client library for real-time stream processing directly from Kafka topics.

Key Characteristics

  • No Separate Cluster Required: Runs as a simple Java application
  • Easy Integration: Fits into existing architectures seamlessly
  • Flexible Processing: Supports both stateless and stateful operations
  • Built-in Scalability: Horizontal scaling by running multiple instances

Why Use Kafka Streams?

Simple Yet Powerful

  • Easy-to-use API for stream processing
  • Minimal setup required
  • Designed for real-time data processing

Scalability and Reliability

  • Distributed and fault-tolerant architecture
  • Handles high volumes of data
  • Automatic failure recovery

Exactly-Once Processing

  • Guarantees messages processed without duplicates or loss
  • Critical for financial transactions and analytics
  • Ensures data consistency

Event-Time Processing

  • Uses actual event timestamps (not arrival time)
  • Accurate handling of late or out-of-order events
  • Better processing semantics

Windowing and Aggregation

  • Count, sum, and group data within time frames
  • Supports time-based operations
  • Flexible aggregation capabilities

Seamless Kafka Integration

  • Natural choice for Kafka-based applications
  • Direct topic consumption and production
  • Leverages existing Kafka infrastructure

Stateless vs Stateful Processing

Stateless Processing

  • Independent Event Handling: Each event processed in isolation
  • No Memory Required: Doesn't store past data
  • Common Operations:

- Filtering messages

- Transforming data

- Mapping values

  • Characteristics: Lightweight and fast

Stateful Processing

  • Depends on History: Relies on previous data for decisions
  • State Storage Required: Uses state stores for fault tolerance
  • Common Operations:

- Aggregations

- Windowing operations

- Joins between streams

  • Characteristics: Provides deeper insights but requires more resources

Kafka Streams Architecture

Core Components

#### Kafka Topics

  • Act as input and output data streams
  • Foundation of stream processing

#### Stream Processors

  • Transform, filter, and aggregate data
  • Each instance runs as independent processing node
  • Core building blocks of processing logic

#### State Stores

  • Enable stateful operations
  • Track previous data for aggregations, joins, windowing
  • Backed by Kafka for fault tolerance
  • Ensure recovery capability

Horizontal Scalability

  • Multiple instances run in parallel
  • Each instance processes portion of data
  • Automatic workload redistribution on failure
  • Highly scalable and fault-tolerant

Kafka Streams API

High-Level DSL (Domain-Specific Language)

#### KStream API

  • Represents continuous stream of records
  • Processes events as they arrive
  • Use cases:

- Filtering events

- Mapping transformations

- Joining streams

#### KTable

  • Represents changelog stream
  • Keeps latest value for each key
  • Ideal for:

- Aggregated data

- Deduplicated data

- Tracking current state (e.g., latest order status)

#### GlobalKTable

  • Global, partition-independent view of data
  • Stored across all instances
  • Useful for lookups requiring access to all partitions

Low-Level Processor API

  • Available for advanced use cases
  • Provides fine-grained control

Stream Processing Operations

Stateless Operations

#### Filtering

java
1stream.filter((key, value) -> value.contains("error"));
  • Remove records not matching condition
  • Keep only specific messages

#### Mapping

java
1stream.mapValues(value -> value.toUpperCase());
  • Transform message values
  • Apply functions to data

Stateful Operations

#### Windowing

java
1stream.windowedBy(TimeWindows.of(Duration.ofMinutes(5)));
  • Group records within fixed time frame
  • Process events together in windows

#### Aggregation

java
1stream.groupByKey().count();
  • Count records sharing same key
  • Maintain running totals

Kafka Streams Topology

A topology defines how data flows through processing steps.

Example Topology

java
1// Create StreamsBuilder
2StreamsBuilder builder = new StreamsBuilder();
3
4// Read from input topic
5KStream<String, String> stream = builder.stream("input-topic");
6
7// Apply transformations
8stream
9  .filter((key, value) -> value.length() > 5)  // Keep values > 5 chars
10  .mapValues(value -> value.toUpperCase())      // Convert to uppercase
11  .to("output-topic");                          // Write to output topic

Processing Flow

  1. Read from input topic
  2. Filter records (value length > 5)
  3. Map values to uppercase
  4. Write to output topic

Stateless Processing Example

Processing each event independently without storing past data:

text
1Input: "apple", "banana", "cherry"
2
3Filter (length > 5):
4  "apple" → dropped
5  "banana" → kept
6  "cherry" → kept
7
8MapValues (uppercase):
9  "banana" → "BANANA"
10  "cherry" → "CHERRY"
11
12Output: "BANANA", "CHERRY"

No state store required - each record transformed in isolation.

Stateful Processing with State Stores

Why State Stores?

Stateful processing requires remembering past data to produce correct results. State Stores:

  • Store intermediate results
  • Enable aggregations, joins, windowing
  • Provide fault tolerance and recovery

Example: Running Sum Per Key

#### Phase 1: Incoming Events

text
1user1 → 5
2user2 → 3
3user1 → 2
4user2 → 4

#### Phase 2: Processing with State Store

text
1Group by key and aggregate:
2  user1: 5 (initial)
3  user2: 3 (initial)
4  user1: 5 + 2 = 7 (update)
5  user2: 3 + 4 = 7 (update)

State Store maintains running totals per user.

#### Phase 3: Final Output

text
1user1 → 7
2user2 → 7

Consolidated totals written to output topic.

Monitoring Kafka Streams

JMX Metrics

  • Processing rates, latency, errors
  • Tools: Prometheus, Grafana
  • Real-time monitoring and visualization

State Change Logs

  • Stored in Kafka topics
  • Track updates to state stores
  • Useful for debugging

Application Logs

  • Check using tail -f
  • Detect errors, rebalances, delays
  • Essential for troubleshooting

Summary

  • Kafka Streams: Real-time stream processing without separate cluster
  • Processing Types:

- Stateless: Filter, map (events in isolation)

- Stateful: Aggregations, joins (requires state stores)

  • Key Abstractions: KStream, KTable, GlobalKTable
  • Fault-Tolerant: Built-in recovery mechanisms
  • Scalable: Horizontal scaling with multiple instances
  • Seamless Integration: Works directly with existing Kafka deployments