Stream Processing

Stream processing involves ingesting, analyzing, and taking action on data as it is produced. This near-real-time or real-time methodology is helpful for applications that need to respond quickly to continuously updating information, such as IoT sensor readings, financial transactions, or social media data. By processing events or records as they arrive, systems can quickly generate insights, trigger alerts, or update dashboards without waiting for a complete batch.

ASCII DIAGRAM: Stream Processing Overview

+-----------------+      +--------------------------------+      +-----------------+
|                 |      |       Stream Processing        |      |                 |
|  Data Sources   +----->+  (Real-time/Near-real-time)    +----->+  Final Output   |
| (Sensors, etc.) |      |   +----+   +----+   +----+     |      |  (Dashboards,   |
+-----------------+      |   | P1 |   | P2 |   | P3 |     |      |   Alerts, etc.) |
                         |   +----+   +----+   +----+     |      +-----------------+
                         +--------------------------------+
  • A data stream is continuously produced by various sources, like logs, sensors, or user activities.
  • A stream processing system consumes these events as they arrive, applying transformations, aggregations, or filtering.
  • Final results, such as alerts, dashboards, or enriched data records, are emitted with minimal delay.

Message Brokers

Message brokers provide asynchronous communication between producing and consuming services, often serving as a backbone for stream processing pipelines.

  • Load Balancing: Distributes incoming messages across multiple consumers.
  • Fan-out: Broadcasts messages so every subscribed consumer can receive a copy.
  • Queue vs. Pub-Sub: A queue model is often used for point-to-point communication (one consumer), while pub-sub broadcasts messages to multiple subscribers.

ASCII DIAGRAM: Message Broker in the Pipeline

  Publisher(s)            Message Broker              Consumer(s)
+-------------+  (Push)   +------------+   (Pull)   +-------------+
|  Service A  | --------> |   Topic    | <--------- |  Service B  |
+-------------+           +-----+------+            +-------------+
                                |          (Pull)   +-------------+
                                +<----------------- |  Service C  |
                                                    +-------------+
  • Brokers that persist messages durably (for example, in an append-only log) can preserve message ordering and offer replay capabilities to consumers.
  • Topic partitioning is often used to scale throughput and improve fault tolerance.
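
To make the queue vs. pub-sub distinction concrete, the sketch below models a broker in memory. The ToyBroker class, topic, and group names are invented for illustration and mirror no real broker's API: consumers that share a group balance load over one queue, while each subscribed group receives its own copy of every message.

CODE SKETCH (Python): Queue vs. Pub-Sub Semantics

import queue
from collections import defaultdict

class ToyBroker:
    """Toy broker: consumers in one group share a queue (load balancing);
    every subscribed group receives its own copy (fan-out)."""

    def __init__(self):
        self.topics = defaultdict(dict)   # topic -> {group name: shared queue}

    def subscribe(self, topic, group):
        # Members of the same group share one queue, so each message
        # is handled by exactly one of them (queue / point-to-point).
        return self.topics[topic].setdefault(group, queue.Queue())

    def publish(self, topic, message):
        # Pub-sub fan-out: each subscribed group gets its own copy.
        for shared_queue in self.topics[topic].values():
            shared_queue.put(message)

broker = ToyBroker()
alerts = broker.subscribe("orders", group="alerting")
dash_a = broker.subscribe("orders", group="dashboard")
dash_b = broker.subscribe("orders", group="dashboard")  # same shared queue

broker.publish("orders", {"order_id": 42})
print(alerts.get())   # the alerting group's copy
print(dash_a.get())   # the dashboard group's copy; only one worker gets it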

Log-Based Message Brokers

Log-based brokers (like Apache Kafka) store messages in an immutable, append-only log. Consumers read from an offset that tracks their position, allowing for replaying messages or consuming them at different rates.

  • Sequential reads let consumers receive messages in the order they were appended within a partition, which is crucial for maintaining consistency.
  • Partitioning topics across multiple logs provides scalability; each partition can be processed by a separate consumer or consumer group.

ASCII DIAGRAM: Log-Based Broker (Conceptual)

               +-------------+
Producer --->  | Partition 0 | ---> Consumer Group 1
               +-------------+
Producer --->  | Partition 1 | ---> Consumer Group 2
               +-------------+
Producer --->  | Partition 2 | ---> Consumer Group 2
               +-------------+
  • Each partition is independent, so reading from one partition does not affect the offsets in another.
  • If a consumer fails, another consumer in the group can take over reading from the partition where it left off.
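
The offset mechanics can be sketched in a few lines. The PartitionLog class below is a simplified single-partition illustration (its names and methods are invented, not Kafka's API): the log is append-only, each consumer tracks its own offset, and rewinding an offset replays history.

CODE SKETCH (Python): Offsets on an Append-Only Log

class PartitionLog:
    """Minimal append-only log partition with per-consumer offsets,
    in the spirit of a log-based broker."""

    def __init__(self):
        self.log = []        # append-only sequence of messages
        self.offsets = {}    # consumer/group name -> next offset to read

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1   # offset of the new message

    def poll(self, consumer, max_records=10):
        # Each consumer tracks its own position; reading does not
        # remove messages, so others can consume at their own pace.
        start = self.offsets.get(consumer, 0)
        records = self.log[start:start + max_records]
        self.offsets[consumer] = start + len(records)
        return records

    def seek(self, consumer, offset):
        # Rewinding the offset replays previously consumed messages.
        self.offsets[consumer] = offset

p0 = PartitionLog()
for event in ["e1", "e2", "e3"]:
    p0.append(event)

print(p0.poll("group-1"))       # ['e1', 'e2', 'e3']
print(p0.poll("group-2", 2))    # ['e1', 'e2'] -- independent position
p0.seek("group-1", 0)
print(p0.poll("group-1"))       # replayed from the beginning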

Change Data Capture (CDC)

CDC captures and streams row-level database changes in real time. By converting inserts, updates, and deletes into event streams, CDC simplifies synchronizing data across systems.

  • Event Generation: Whenever a record in the database changes, an event is appended to the stream.
  • Data Synchronization: Downstream services or microservices can subscribe to these change events, ensuring they have up-to-date data without heavy polling.

ASCII DIAGRAM: CDC Flow

   Database          CDC Engine              Message Broker           Consumers
+------------+   +---------------+          +---------------+     +----------------+
|  Table(s)  |-->| Binlog Reader | -------> |   Topic(s)    | --> | Microservices  |
+------------+   +---------------+          +-------+-------+     +----------------+
                                                    |
                                                    v
                                            +---------------+
                                            |   Analytics   |
                                            +---------------+

This approach is especially popular for maintaining read replicas, caches, or downstream analytics systems in near real time.
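
A downstream consumer might apply such change events to a local cache roughly as follows. The event shape is a loose, Debezium-style assumption (field names vary by CDC tool): "op" codes mark creates, updates, and deletes, and replaying them in order keeps the cache in sync with the source table.

CODE SKETCH (Python): Applying CDC Events to a Cache

# Hypothetical change events; the exact field layout is an assumption.
change_events = [
    {"op": "c", "table": "users", "key": 1, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "table": "users", "key": 1, "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "table": "users", "key": 1, "after": None},
]

def apply_change(cache, event):
    """Keep a downstream cache in sync by replaying row-level changes."""
    key = (event["table"], event["key"])
    if event["op"] in ("c", "u"):        # create or update: upsert the row
        cache[key] = event["after"]
    elif event["op"] == "d":             # delete: remove the row
        cache.pop(key, None)

cache = {}
for event in change_events:
    apply_change(cache, event)
    print(cache)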

Event Sourcing

Event Sourcing maintains an append-only log of all state changes in an application, rather than storing only the latest state. By replaying these events, the application’s state can be reconstructed at any point in time, providing excellent auditability and historical insight.

  • Event Log: All changes are appended as discrete events, forming a chronological record.
  • State Reconstruction: The current or past state is obtained by replaying every event from the start, or from a known checkpoint.
  • Debugging & Auditing: Full history of changes can be helpful for diagnosing issues or analyzing user actions.

Streams and Time

Time management in distributed stream systems can be complex due to network delays and clock drift. Common windowing strategies address how events are grouped over time:

  1. Hopping Windows

    • Fixed-size windows that advance by a set “hop” interval; a hop smaller than the window size makes consecutive windows overlap, while a hop larger than the size leaves gaps in coverage.
    • Example: A window of size 6 minutes with a new window starting every 10 minutes leaves part of the timeline uncovered.
  2. Sliding Windows

    • Overlapping windows, each capturing a fixed duration but starting at smaller intervals.
    • Example: A 6-minute window slides every 2 minutes, allowing continuous coverage of data.
  3. Tumbling Windows

    • Consecutive, non-overlapping windows of a fixed size.
    • Example: Every 6 minutes forms a new batch of events with no overlap or gap.

ASCII DIAGRAM: Time Window Examples (Conceptual)

Hopping (Size=6, Hop=10)
[0---6]             [10--16]            [20--26]  

Sliding (Size=6, Slide=2)
[0---6]
   [2---8]
      [4---10]
         [6---12]  

Tumbling (Size=6)
[0---6][6---12][12--18][18--24]
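
To make window assignment concrete, the sketch below maps an event timestamp to its window(s), assuming windows are aligned to time zero (the function names are invented). Tumbling is the special case of hopping where the hop equals the window size.

CODE SKETCH (Python): Assigning Timestamps to Windows

def tumbling_window(ts, size):
    """Each timestamp belongs to exactly one window [start, start + size)."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """All fixed-size windows, advancing by `hop`, that contain ts."""
    windows = []
    # Earliest window start that could still cover ts, clamped to 0.
    start = max(0, ((ts - size) // hop + 1) * hop)
    while start <= ts:
        if ts < start + size:
            windows.append((start, start + size))
        start += hop
    return windows

print(tumbling_window(7, size=6))          # (6, 12)
print(hopping_windows(7, size=6, hop=2))   # [(2, 8), (4, 10), (6, 12)] -- overlap
print(hopping_windows(7, size=6, hop=10))  # [] -- hop > size: t=7 falls in a gap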

Stream Joins

Joining data in a streaming environment allows correlation of events from multiple sources or combining real-time events with reference data.

  • Stream-Stream Join: Correlates events arriving on two (or more) streams, requiring stateful tracking of unjoined events until a corresponding match arrives.
  • Stream-Table Join: Combines a stream of incoming events with a static or slowly changing table (often a cached copy).
  • Table-Table Join: Involves joining two changing datasets, typically kept in sync via CDC or event sourcing.

ASCII DIAGRAM: Stream-Stream Join (Conceptual)

Stream A (Orders)         Join on OrderID          Stream B (Payments)
     (A1)  ------------------>   +--------------+   <---------------- (B1)
     (A2)  ------------------>   |  Join State  |   <---------------- (B2)
     (A3)  ------------------>   +--------------+   <---------------- (B3)

Intermediate state must be maintained to match events from each stream within a specified time window.
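
A stream-stream join can be sketched as follows, assuming event timestamps arrive roughly in order (the class and method names are invented): each side buffers its events by key, probes the other side's buffer for matches, and drops state older than the join window.

CODE SKETCH (Python): Windowed Stream-Stream Join

from collections import defaultdict

class StreamStreamJoin:
    """Buffer events from each side by key, probe the other side for
    matches, and expire state older than the join window."""

    def __init__(self, window):
        self.window = window
        self.left = defaultdict(list)    # key -> [(timestamp, event), ...]
        self.right = defaultdict(list)

    def _expire(self, buffer, now):
        for key in list(buffer):
            buffer[key] = [(ts, e) for ts, e in buffer[key] if now - ts <= self.window]
            if not buffer[key]:
                del buffer[key]

    def on_event(self, side, key, ts, event):
        own, other = (self.left, self.right) if side == "A" else (self.right, self.left)
        self._expire(own, ts)      # assumes timestamps arrive roughly in order
        self._expire(other, ts)
        own[key].append((ts, event))
        # Emit one joined pair per buffered match on the other side.
        return [(event, match) for _, match in other.get(key, [])]

join = StreamStreamJoin(window=10)
join.on_event("A", key="order-1", ts=1, event={"order": "order-1"})
print(join.on_event("B", key="order-1", ts=5, event={"payment": "p-9"}))
# [({'payment': 'p-9'}, {'order': 'order-1'})]
print(join.on_event("B", key="order-1", ts=20, event={"payment": "p-10"}))
# [] -- the order fell outside the 10-unit join window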

Fault Tolerance

Ensuring a streaming system remains reliable in the face of failures is critical:

  • Micro-batching can simplify fault tolerance by treating short spans of data as a single batch.
  • Checkpointing periodically saves the progress (offsets, partial aggregations) to stable storage.
  • Idempotent operations allow reprocessing events without duplicating side effects.

ASCII DIAGRAM: Checkpointing Example

        Stream
          |
          v
+---------+---------+
| Stream Processing |
+---------+---------+
          |
          | (Periodic Save State)
          v
+------------------+
| Checkpoint Store |
+------------------+

When a node fails, it can recover from the latest checkpoint, replaying only the events since that point.
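
A minimal checkpointing sketch, with a local JSON file standing in for durable checkpoint storage (the file name and state layout are invented): progress is saved as an offset plus partial state, and a restart resumes from the last checkpoint rather than from the beginning.

CODE SKETCH (Python): Checkpoint and Resume

import json, os

CHECKPOINT = "checkpoint.json"   # stands in for replicated, stable storage

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"offset": 0, "count": 0}   # fresh state: nothing processed yet

def save_checkpoint(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def process(events, checkpoint_every=3):
    state = load_checkpoint()
    # Resume from the last saved offset; earlier events were already
    # counted, so rerunning after a crash does not double-count them.
    for offset in range(state["offset"], len(events)):
        state["count"] += events[offset]
        state["offset"] = offset + 1
        if state["offset"] % checkpoint_every == 0:
            save_checkpoint(state)
    save_checkpoint(state)
    return state["count"]

events = [1, 2, 3, 4, 5]
print(process(events))   # 15; a rerun after a mid-stream crash resumes
                         # from the last checkpoint instead of from zero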

Key Stream Processing Technologies

  • Apache Kafka: A widely used log-based message broker with support for streams via Kafka Streams or ksqlDB.
  • Apache Flink: Offers low-latency, high-throughput stream processing with advanced state management.
  • Apache Spark Streaming: Built on Spark’s micro-batch model, helpful for near-real-time processing of large datasets.
  • Apache Pulsar: Combines a scalable message queue with stream processing capabilities, including geo-replication.

When to Use Stream Processing

  • Real-time Analytics: Dashboards showing up-to-the-second metrics (e.g., IoT sensor readings, site traffic).
  • Fraud Detection: Financial transactions monitored as they occur, triggering immediate alerts on anomalies.
  • Monitoring & Alerting: Infrastructure logs and events analyzed on the fly, responding to performance issues quickly.
  • Personalization: Recommending items or content to users in near real time based on new interactions.