Wednesday, January 29, 2020

Introduction to Apache Beam and the runners

🕥 7 min.

This is a first post introducing Apache Beam before going into more detail about the Beam runners in the next posts.


What is Apache Beam?


To define Apache Beam, let's start with a quote:
"Batch / streaming? Never heard of either."
(Batch is nearly always part of the more general streaming problem.)

Beam is a programming model for creating big data pipelines. Its particularity is that its API is unified between batch and streaming: the only difference between a batch pipeline and a streaming pipeline is that a batch pipeline deals with finite data whereas a streaming pipeline deals with infinite data. Batch can therefore be seen as a sub-part of the streaming problem, and Beam provides windowing features that divide infinite data into finite chunks.
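For instance, a tiny fragment of the Java SDK shows the idea (unboundedEvents is a placeholder name for an unbounded PCollection, and the one-minute window size is arbitrary):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Divide an infinite (unbounded) stream of events into fixed one-minute
// windows; downstream aggregations then operate on finite chunks.
PCollection<String> windowed =
    unboundedEvents.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));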

Let's take a look at the Beam stack:

The pipeline user code is written using one of the several language SDKs (Java, Python or Go) that Beam provides. An SDK is a set of libraries of transforms and IOs that let the user input data into the pipeline, do some computation on that data and then output it to a given destination. Once the pipeline is written, the user chooses a Big Data execution engine such as Apache Flink, Apache Spark or others to run it. When the pipeline is run, the runner first translates it to native code for the chosen Big Data engine. It is this native code that is executed by the Big Data cluster.
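As a minimal sketch (assuming this code sits in a main(String[] args) method), the execution engine is typically selected through the pipeline options when the pipeline is created:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// The runner is chosen through the pipeline options, e.g. by launching the
// program with --runner=FlinkRunner or --runner=SparkRunner (the corresponding
// runner dependency must be on the classpath). Without a flag, the local
// DirectRunner is used by default.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline pipeline = Pipeline.create(options);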

Primitive transforms


The Beam SDK contains a set of transforms that are the building blocks of the pipelines the user writes. There are only 3 primitives:


ParDo: it is the good old flatMap; it processes a collection element by element in parallel, applying to each element a user-supplied function called a DoFn.


GroupByKey: this one groups the elements that have a common key. The groups can then be processed in parallel by the next transforms downstream in the pipeline.


Read: the read transform is the way to input data into the pipeline by reading an external source, which can be a batch source (like a database) or a streaming (continuously growing) source (such as a Kafka topic).
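To make these three primitives concrete, here is a minimal Java sketch (the gs:// path is a placeholder and the word-splitting regex is just an example) that reads lines from a text file, uses a ParDo to emit one (word, 1) pair per word, and then groups those pairs by key:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Read: input data into the pipeline from an external (here batch) source.
PCollection<String> lines =
    pipeline.apply(TextIO.read().from("gs://my-bucket/input.txt"));

// ParDo: process the collection element by element; a DoFn may emit
// zero, one or many output elements per input element (flatMap behaviour).
PCollection<KV<String, Integer>> wordOnes =
    lines.apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(@Element String line, OutputReceiver<KV<String, Integer>> out) {
        for (String word : line.split("[^\\p{L}]+")) {
          if (!word.isEmpty()) {
            out.output(KV.of(word, 1));
          }
        }
      }
    }));

// GroupByKey: group together all the values that share the same key.
PCollection<KV<String, Iterable<Integer>>> grouped =
    wordOnes.apply(GroupByKey.<String, Integer>create());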

Composite transforms



All the other transforms available in the SDK are actually implemented as composites of the 3 primitives above.

As an example, Beam's equivalent of a Reduce, which is called Combine and which performs a computation on data spread across the cluster, is itself built from these primitives.
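Conceptually (this is a simplified sketch, not the actual SDK code, and input is a placeholder PCollection of key/value pairs), a per-key Combine such as Combine.perKey(Sum.ofLongs()) can be expressed as a GroupByKey followed by a ParDo that folds each group:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Simplified view of a per-key sum: group the values per key, then reduce
// each group inside a ParDo.
PCollection<KV<String, Long>> sums =
    input
        .apply(GroupByKey.<String, Long>create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(@Element KV<String, Iterable<Long>> kv,
                                     OutputReceiver<KV<String, Long>> out) {
            long sum = 0;
            for (long value : kv.getValue()) {
              sum += value;
            }
            out.output(KV.of(kv.getKey(), sum));
          }
        }));

In practice, runners typically also apply part of the combining before the shuffle (combiner lifting) to reduce the amount of data moved across the cluster.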


Other examples of composite transforms provided by the SDK are FlatMapElements, MapElements or Count, which we will see in the next section. But the user can also create his own composites, and composites of composites.
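Creating your own composite boils down to subclassing PTransform and chaining existing transforms in its expand() method. A minimal sketch (the class name CountWords is hypothetical):

import java.util.Arrays;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// A user-defined composite transform: splits lines into words and counts them.
public class CountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Count.perElement());
  }
}

It is then applied like any built-in transform, e.g. lines.apply(new CountWords()).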

A simple Beam pipeline


Let's take a look at a first simple Beam pipeline:
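Something along these lines, written with the Java SDK (a minimal sketch; the gs:// paths are placeholders):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Read the input text file from Google Storage (placeholder path).
        .apply(TextIO.read().from("gs://my-bucket/input.txt"))
        // Split each line into words (a composite built on ParDo).
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // Count the occurrences of each word (a Combine under the hood).
        .apply(Count.perElement())
        // Format each (word, count) pair as a line of text (a composite built on ParDo).
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // Write the result back to Google Storage (placeholder path).
        .apply(TextIO.write().to("gs://my-bucket/wordcounts"));

    pipeline.run().waitUntilFinish();
  }
}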


This is the usual Hello World type big data pipeline that counts the occurrences of the different words in a text file. It reads a text file from Google Storage and outputs the result of the count to Google Storage. It is a very simple, straight pipeline. Not all pipelines have to be straight like that; this one is there to serve as a baseline example for the rest of the blog and to illustrate some key concepts of Beam:
  • Pipeline: the user interaction object.
  • PCollection: Beam abstraction of the collection of elements spread across a cluster.
  • TextIO: this IO is used to read from and write to the text file. The reading part of the IO is in reality a Read transform and the writing part is in reality a Write transform.
  • Count: a Combine transform that counts the occurrences of each element.
  • FlatMapElements and MapElements: composite transforms built on ParDo that are there for convenience.

The resulting DAG


When Beam executes the above pipeline code, the SDK first creates a graph to represent the pipeline. It is known as the DAG (Directed Acyclic Graph). For each transform of the pipeline code, a node of the DAG is created.
  • The Read node corresponds to the Read transform of TextIO (the input of the pipeline).
  • The Write node corresponds to the Write transform of TextIO (the output of the pipeline).
  • All the other nodes are a direct representation of the transforms of the pipeline.
Please note that only the Read transform is a primitive transform as described in the paragraph above; all the others are composite transforms.

But what about the runner?


This is when the runner enters the scene. The job of the runner is simply to translate the pipeline DAG into native pipeline code for the targeted Big Data platform. It is this code that will be executed by a Big Data cluster such as Apache Spark or Apache Flink.

Want to know more about the runner? The next article describes what a runner is, using the Spark runner as an example.