IB9 Beyond Batch 31 August 2016
As promised last time, we talk about the first of two blog posts by Tyler Akidau, in which he explains the concepts behind ‘streaming’ and why you should call it something different.
Shownotes
The links to the two blog posts we are talking about. The second is quite nice, as it has some video visualisations embedded.
Sum-up
As we go through the blog posts, you could just read them. :) For those willing to follow along while listening, we provide a bullet-point-ish outline here.
Background
First we make clear that we have to be precise about the terminology and the capabilities, as well as the time domains.
It’s not about how you do it, it’s about how you design it: a streaming system is a processing engine that is designed for infinite data sets.
Streaming? WTF
‘Streaming’ is an overloaded term, and Tyler points out the terminology should be way clearer:
- Unbounded Data, which is an ever-growing, essentially infinite set of data
- Unbounded Data Processing, the way to continuously deal with the aforementioned unbounded data…
- Low-latency, approximate, and/or speculative results: Historically streaming was sold as low-latency, but with the drawback that it won’t give you correct results. That is not true anymore.
Limit of Streaming
We cover how unbounded data and batch processing can work together in the Lambda Architecture and, in the same breath, why this is a sub-optimal idea.
From there we segue somehow into why tools for reasoning about time and correct results are much cooler, and how to put this together using something like Kafka…
Correctness
You need to be able to store persistently and replay the stream if necessary.
Papers:
- MillWheel
- Spark Streaming
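To make the replay idea a bit more tangible, here is a minimal sketch in Python. It is not how Kafka or MillWheel do it; `ReplayableLog` and `running_sum` are hypothetical names invented for illustration, and the log lives in memory only, standing in for a durable log. The point is just that a consumer which remembers an offset and its state at that offset can re-read the log after a failure and arrive at the same result.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class ReplayableLog:
    """Append-only log kept in memory; a stand-in for a durable log (assumption)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, record) pairs starting at the given offset."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


def running_sum(log: ReplayableLog, start_offset: int = 0, initial: int = 0) -> int:
    """Consume the log from start_offset, continuing from a checkpointed sum."""
    total = initial
    for _, value in log.read_from(start_offset):
        total += value
    return total


log = ReplayableLog()
for v in [3, 5, 7, 11]:
    log.append(v)

# Uninterrupted run over the whole log.
first_run = running_sum(log)

# Simulated recovery: the consumer checkpointed the sum 8 after offsets 0 and 1,
# crashed, and now replays the log from offset 2.
replayed = running_sum(log, start_offset=2, initial=8)

print(first_run, replayed)
```

Both runs end at the same total, which is the property you want: as long as the input can be replayed from a known position, a restart does not change the answer.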
Event vs. Processing Time
Yeah, this topic was touched on in our very first (and very German) podcast and it will be talked about here as well.
- Data Processing Patterns
  - Bounded Data
  - Unbounded Data - batch
    - Fixed Windows
    - Sessions
  - Unbounded Data - streaming
    - Time agnostic
      - Filtering
      - Inner-Joins
    - Approximation Algorithms
    - Windowing
      - Fixed Windows
      - Sliding Windows
      - Sessions
    - Windowing by Processing Time
    - Windowing by Event Time
      - Buffering / Completeness
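For those who prefer code to bullet points, here is a small Python sketch of a few of the windowing ideas from the outline above. Everything in it is an assumption made for illustration: the sample events, the 10-second window size, the 3-second session gap, and the helper names `fixed_window_sums` and `session_windows`. It works on a finished list rather than a live stream, so it sidesteps the buffering and completeness questions entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (event_time, processing_time, value), times in seconds. Made-up sample data.
Event = Tuple[int, int, int]

EVENTS: List[Event] = [
    (1, 2, 10),
    (4, 5, 20),
    (12, 13, 30),
    (8, 21, 40),  # happened at t=8 but only arrived at t=21 (late data)
]

EVENT_TIME, PROCESSING_TIME = 0, 1


def fixed_window_sums(events: Iterable[Event], size: int, time_index: int) -> Dict[int, int]:
    """Sum values per fixed window, keyed by window start, using the chosen time domain."""
    sums: Dict[int, int] = defaultdict(int)
    for event in events:
        window_start = (event[time_index] // size) * size
        sums[window_start] += event[2]
    return dict(sums)


def session_windows(event_times: Iterable[int], gap: int) -> List[Tuple[int, int]]:
    """Group timestamps into sessions; a pause longer than `gap` starts a new session."""
    sessions: List[Tuple[int, int]] = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], t)  # extend the current session
        else:
            sessions.append((t, t))  # start a new session
    return sessions


by_event = fixed_window_sums(EVENTS, size=10, time_index=EVENT_TIME)
by_proc = fixed_window_sums(EVENTS, size=10, time_index=PROCESSING_TIME)
sessions = session_windows((e[EVENT_TIME] for e in EVENTS), gap=3)

print("event time:     ", by_event)
print("processing time:", by_proc)
print("sessions:       ", sessions)
```

The late event is the interesting one: windowing by event time puts it into the window in which it actually happened, while windowing by processing time puts it wherever it happened to arrive. Sorting the timestamps up front is only possible because the input is bounded here; on an unbounded stream you would have to buffer and decide when a window is complete.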