Tyler Akidau published Streaming 101: The world beyond batch on O'Reilly in 2015. It lays out the fundamental theory behind Google Cloud Dataflow, whose API has since become Apache Beam, and Apache Flink shares the same model of how streaming applications are processed. In this post I'm going to talk about how to use two critical concepts in Flink: Time and Window. For the detailed theory, please refer to Streaming 101: The world beyond batch.


Event Time vs Processing Time

  • Event time: the time at which events actually occurred.
  • Processing time: the time at which events are observed in the system.

Some applications may not care about the difference between event time and processing time, and this type of application is easier to handle, but many applications do need to handle it. Think about an application that maintains a game player's rank in real time. While a player is playing the game, he generates a series of events that are pushed to the data processing system. In a perfect world the events arrive seamlessly, so event time and processing time follow the same trend. In the real world it can be different: imagine the player is playing the game on a train, and the train passes through a tunnel with no 4G signal. The events are buffered on the player's device, and after the train exits the tunnel the device sends all the buffered events to the data processing system, so those events are processed late. When we calculate results based on event time, this player's score will be wrong while he is in the tunnel, because the events for that period have not arrived yet.

Flink supports both event-time and processing-time streams; we can set the application's time characteristic like below:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//or
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
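The tunnel scenario above can be sketched in plain Scala (no Flink API; the Event class, scores, and timestamps are all illustrative): grouping the same events by processing time and by event time produces different buckets once some events arrive late.

```scala
// Illustrative sketch (plain Scala, not Flink): each event carries the time it
// occurred (eventTime) and the time the system observed it (processingTime).
case class Event(player: String, score: Int, eventTime: Long, processingTime: Long)

// The "tunnel" scenario: events at t=2 and t=3 are buffered and only observed at t=10.
val events = Seq(
  Event("alice", 5, eventTime = 1, processingTime = 1),
  Event("alice", 7, eventTime = 2, processingTime = 10), // buffered in the tunnel
  Event("alice", 9, eventTime = 3, processingTime = 10)  // buffered in the tunnel
)

// Bucket events into 5-unit windows by each notion of time and sum the scores.
def bucket(ts: Long, size: Long): Long = ts - (ts % size)

val byEventTime =
  events.groupBy(e => bucket(e.eventTime, 5)).map { case (w, es) => w -> es.map(_.score).sum }
val byProcessingTime =
  events.groupBy(e => bucket(e.processingTime, 5)).map { case (w, es) => w -> es.map(_.score).sum }

// byEventTime puts all three scores in window 0; byProcessingTime splits them,
// crediting the buffered scores to a later window.
```

The event-time view reconstructs what actually happened; the processing-time view reflects only what the system has seen so far.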

Window

A streaming application is an endless data stream; if you want to do any statistics or aggregation, you have to do the calculation over a boundary, e.g. what's the average score every 5 minutes, or what's the max score every 5 minutes. Window is designed to support such features. Flink supports both keyed windows and non-keyed windows: keyed windows are evaluated in parallel, while non-keyed windows are evaluated in a single thread. For keyed windows, Flink supports 3 types of windows for both event time and processing time.

TumblingWindows

Events are placed into fixed-size time windows side by side. We define the window interval and the framework assigns each event to the right window. There is no overlap between windows, so every event is assigned to exactly one window for processing. The example below assigns events to a window every 1 minute, and it triggers the processing function once event time or processing time has passed the end of the window:

stream.keyBy(_.year)
      .timeWindow(Time.minutes(1))
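The assignment rule behind tumbling windows is simple arithmetic. Here is a plain-Scala sketch of the idea (it mirrors how a tumbling assigner buckets a timestamp, not Flink's actual code):

```scala
// Sketch of tumbling-window assignment: a timestamp belongs to exactly one
// window, the one starting at the largest multiple of `sizeMs` not after it.
def tumblingWindowStart(timestampMs: Long, sizeMs: Long): Long =
  timestampMs - (timestampMs % sizeMs)

val oneMinute = 60 * 1000L

// An event at 90s falls into the [60s, 120s) window; an event at 59.999s
// still falls into the [0s, 60s) window. No overlap, exactly one window each.
val start = tumblingWindowStart(90 * 1000L, oneMinute)
```

Because each timestamp maps to exactly one window start, every event is counted exactly once.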

SlidingWindows

A sliding window defines a window size and a slide interval. SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(15)) defines a 1-hour window that slides every 15 minutes, so every event within that hour is placed into 4 overlapping windows as time passes.
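To make the overlap concrete, here is a plain-Scala sketch of sliding-window assignment (illustrative, not Flink's implementation): each event belongs to every window whose range covers its timestamp, which for size/slide = 1 hour / 15 minutes is 4 windows.

```scala
// Sketch of sliding-window assignment: collect all window starts whose
// window [start, start + size) contains the timestamp, stepping back by `slide`.
def slidingWindowStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val lastStart = ts - (ts % slide) // latest window that can contain ts
  Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toSeq
}

val hour   = 60 * 60 * 1000L
val quarter = 15 * 60 * 1000L

// An event at the 1-hour mark lands in 4 windows: those starting at
// 60min, 45min, 30min, and 15min.
val starts = slidingWindowStarts(hour, hour, quarter)
```

In general an event lands in size/slide windows, so an aggregate over a sliding window is re-emitted that many times for each event.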

SessionWindows

Each event is placed into a session window, and the session window defines a gap time: the window is triggered when no event has arrived for longer than the defined gap. EventTimeSessionWindows.withGap(Time.minutes(1)) will trigger the session window if no more events arrive for the given key for more than 1 minute.
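The grouping rule can be sketched in plain Scala (illustrative, not Flink's merging-window implementation): consecutive timestamps stay in the same session while the gap between them is within the limit, and a larger gap starts a new session.

```scala
// Sketch of session-window grouping: split sorted timestamps into sessions
// wherever the gap between consecutive events exceeds `gapMs`.
def sessions(timestamps: Seq[Long], gapMs: Long): Seq[Seq[Long]] =
  timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) {
    case (acc, ts) if acc.nonEmpty && ts - acc.last.last <= gapMs =>
      acc.init :+ (acc.last :+ ts) // within the gap: extend the current session
    case (acc, ts) =>
      acc :+ Vector(ts)            // gap exceeded (or first event): new session
  }

// With a 1-minute gap, events at 0s, 30s, 50s form one session; the 150-second
// silence before 200s starts a second session containing 200s and 210s.
val result = sessions(Seq(0L, 30000L, 50000L, 200000L, 210000L), 60000L)
```

Unlike tumbling and sliding windows, session windows have no fixed size: their boundaries are determined entirely by the data.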