As promised last time, we talk about the first of two blog posts by Tyler Akidau, in which he explains the concepts behind ‘streaming’ and why you should call it something different.

Shownotes

The links to the two blog posts we are talking about. The second is quite nice, as it has some video visualisations embedded.

Sum-up

As we go through the blog posts, you could just read them. :) For those willing to follow along while listening, we provide a bullet-point-ish outline here.

Background

First we make clear that we have to be precise about the terminology and the capabilities, as well as the time domains.

Streaming is not about how you run it, it’s about how the system is designed: a streaming system is a processing engine that is designed for infinite data sets.

Streaming? WTF

Streaming is a mouthful, and Tyler points out the terminology should be way clearer:

  • Unbounded Data, which is an ever-growing, essentially infinite set of data
  • Unbounded Data Processing, the way to continuously deal with the aforementioned unbounded data…
  • Low-latency, approximate, and/or speculative results: Historically streaming was sold as low-latency, but with the drawback that it won’t give you correct results. That is not true anymore.

Limit of Streaming

We discuss how unbounded data and batches could work together by using the Lambda Architecture and, in the same breath, why this is a sub-optimal idea.

Segueing somehow into why tools for reasoning about time and correct results are much cooler, and how to put this together using something like Kafka…

Correctness

You need to be able to store persistently and replay the stream if necessary.
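
To make "store and replay" a bit more tangible, here is a minimal sketch of a log that can be re-read from any offset. It only illustrates the idea: `ReplayableLog`, `append` and `read_from` are names invented for this example, the log lives in memory rather than on disk, and none of it is Kafka's actual API.

```python
class ReplayableLog:
    """Hypothetical in-memory stand-in for a durable, replayable log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset=0):
        """Yield (offset, record) pairs, starting at the given offset."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


log = ReplayableLog()
for event in ["a", "b", "c"]:
    log.append(event)

# A consumer reads everything once ...
first_pass = [record for _, record in log.read_from(0)]
# ... and can later replay from an earlier offset, e.g. after a crash.
replayed = [record for _, record in log.read_from(1)]
```

Because records are never removed when they are read, a consumer that failed after offset 0 can simply start again at offset 1 and recompute its results.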

Papers:

  • MillWheel
  • Spark Streaming

Event vs. Processing Time

Yeah, this topic was touched on in our very first (and very German) podcast, and it will be talked about here as well.

  • Data processing Patterns
    • Bounded Data
    • Unbounded Data - batch
      • Fixed Windows
      • Sessions
    • Unbounded Data - streaming
      • Time agnostic
        • Filtering
        • Inner-Joins
      • Approximation Algorithms
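
As a small illustration of the time-agnostic case: filtering looks at every record on its own, so it works on unbounded input without any notion of event or processing time. The sketch below assumes a made-up source `traffic()` that yields `(domain, bytes)` records forever; both the source and the helper `only_domain` are hypothetical.

```python
import itertools


def traffic():
    """Hypothetical unbounded source: yields (domain, bytes) records forever."""
    domains = ["example.com", "example.org"]
    for i in itertools.count():
        yield domains[i % 2], 100 + i


def only_domain(records, domain):
    """Time-agnostic filter: each record is kept or dropped on its own."""
    for record in records:
        if record[0] == domain:
            yield record


# The source never ends, so we only take the first three matches.
first_three = list(itertools.islice(only_domain(traffic(), "example.org"), 3))
```

The filter is a generator, so it never needs to see the "end" of the data, which is exactly why this pattern is unproblematic for streaming.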

  • Windowing
    • Fixed Windows
    • Sliding Windows
    • Sessions
  • Windowing by Processing Time
  • Windowing by Event Time
    • Buffering / Completeness
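
The windowing strategies in the outline above can be sketched in a few lines of plain Python. This is only an illustration under assumed parameters (60-second windows, a 30-second slide, a 30-second session gap, integer timestamps in seconds); all function names are made up for this example and are not taken from any streaming framework.

```python
from collections import defaultdict


def fixed_window(ts, size=60):
    """The single [start, end) fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size=60, period=30):
    """All [start, end) windows of length size, starting every period, that contain ts."""
    last_start = ts - ts % period
    starts = range(last_start, ts - size, -period)
    return [(s, s + size) for s in sorted(starts)]


def sessions(timestamps, gap=30):
    """Merge timestamps into sessions; a new session starts after gap seconds of silence."""
    merged = []
    for ts in sorted(timestamps):
        if merged and ts - merged[-1][1] < gap:
            merged[-1][1] = ts           # extend the current session
        else:
            merged.append([ts, ts])      # start a new session
    return [tuple(s) for s in merged]


def sum_by_window(events, time_index):
    """Sum values per fixed window, keyed by the chosen time domain."""
    sums = defaultdict(int)
    for event in events:
        sums[fixed_window(event[time_index])] += event[2]
    return dict(sums)


# (event_time, processing_time, value); the second event arrives late.
events = [
    (10, 12, 1),
    (50, 130, 2),   # happened in the first minute, observed in the third
    (70, 75, 3),
]

by_event_time = sum_by_window(events, 0)
by_processing_time = sum_by_window(events, 1)
```

With the late event above, windowing by event time puts the value 2 into the first window, where it happened, while windowing by processing time puts it into the window in which it was observed. The price for the event-time result is that the first window has to be kept open (buffered) until the late record shows up, which is the buffering and completeness problem from the last bullet.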