Tuesday 11 April 2017

Introducing Apache Beam

As part of the Google Cloud ecosystem, Google created the Dataflow SDK. Now, as a joint effort between Google, Talend, Cask, data Artisans, PayPal, and Cloudera, the project is incubating at Apache as Apache Beam.

Architecture and Programming model

Just imagine you have a Hadoop cluster where you run MapReduce jobs. Now you want to “migrate” these jobs to Spark: you have to refactor all your jobs, which requires a lot of work and cost. And after that, consider the effort and cost again if you want to move to a new platform like Flink: you have to refactor your jobs once more.

Dataflow aims to provide an abstraction layer between your code and the execution runtime.
The SDK gives you a unified programming model: you implement your data processing logic with the Dataflow SDK, and the same code runs on different back ends (Spark, Flink, etc.). You don’t need to refactor or change the code anymore!
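
As an illustration, here is a minimal sketch of such a pipeline written with the Java Dataflow SDK (1.x package names; the input and output paths are placeholders, and names may differ in later Apache Beam releases). Note that nothing in this code refers to a specific back end:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class UppercasePipeline {
      public static void main(String[] args) {
        // The runner (direct, Spark, Flink, ...) is chosen through the options,
        // not in the pipeline code itself.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply(TextIO.Read.from("/tmp/input.txt"))      // source
            .apply(ParDo.of(new DoFn<String, String>() {    // transform
              @Override
              public void processElement(ProcessContext c) {
                c.output(c.element().toUpperCase());
              }
            }))
            .apply(TextIO.Write.to("/tmp/output"));         // sink

        pipeline.run();
      }
    }

The exact same class can be submitted to a local runner, to Spark, or to Flink: only the pipeline options change.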

If your target back end is not yet supported by Dataflow, you can implement your own runner for it; again, the code written with the Dataflow SDK doesn’t change.

Dataflow is able to deal not only with batch processing jobs, but also with streaming jobs.
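
For streaming, the model stays the same; you typically assign the elements to windows before aggregating them. Here is a small fragment as a sketch, assuming an unbounded PCollection<String> named "events" read from some streaming source, with fixed one-minute windows chosen purely as an example:

    // "events" is assumed to be an unbounded PCollection<String> coming from
    // a streaming source; one-minute fixed windows are just an example.
    PCollection<KV<String, Long>> counts = events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.<String>perElement());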

Pipelines, translators and runners

Using this SDK, your jobs are actually designed as pipelines. A pipeline is a chain of processes applied to the data.

It’s basically the only part that you have to write.

Dataflow reads the pipeline definition and translates it for a target runner such as Spark or Flink. A translator is responsible for adapting the pipeline code to the target runner. For instance, the MapReduce translator transforms pipelines into MapReduce jobs, the Spark translator transforms pipelines into Spark jobs, and so on.

The runners are the “execution” layer. Once a pipeline has been “translated” by a translator, it can be run directly on a runner. The runner “hides” the actual back end: a MapReduce/YARN cluster, a Spark cluster (running on YARN or Mesos), etc.
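
For example, switching the back end can be just a matter of options. The following sketch assumes the Dataflow SDK 1.x API: DirectPipelineRunner ships with the SDK, while SparkPipelineRunner refers to the Spark runner contributed at the time by Cloudera (class names may differ in current releases):

    // Nothing changes in the pipeline code itself, only the options.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(DirectPipelineRunner.class);    // local, in-process execution
    // options.setRunner(SparkPipelineRunner.class);  // same pipeline on a Spark cluster
    Pipeline pipeline = Pipeline.create(options);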

While Dataflow comes with ready-to-use translators and runners, you can also create your own.
For instance, you can implement your own runner by creating a class extending PipelineRunner. You will have to implement the different runner behaviors (the transform evaluators, the supported options, the apply/main transform hook method, etc.).
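
As a very rough illustration, here is a skeleton of such a runner for an imaginary "Acme" back end. It is only a sketch written against the Dataflow SDK 1.x classes (PipelineRunner, PipelineResult); everything "Acme" is hypothetical:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.PipelineResult;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.runners.PipelineRunner;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.values.PInput;
    import com.google.cloud.dataflow.sdk.values.POutput;

    // Hypothetical runner for an imaginary "Acme" back end.
    public class AcmePipelineRunner extends PipelineRunner<PipelineResult> {

      private final PipelineOptions options;

      private AcmePipelineRunner(PipelineOptions options) {
        this.options = options;
      }

      // Conventional static factory used by the SDK to instantiate runners.
      public static AcmePipelineRunner fromOptions(PipelineOptions options) {
        return new AcmePipelineRunner(options);
      }

      // Hook called for each transform applied in the pipeline: a real runner
      // can override it to substitute back-end specific implementations.
      @Override
      public <OutputT extends POutput, InputT extends PInput> OutputT apply(
          PTransform<InputT, OutputT> transform, InputT input) {
        return super.apply(transform, input);
      }

      // Entry point: traverse (translate) the pipeline and submit it to the back end.
      @Override
      public PipelineResult run(Pipeline pipeline) {
        throw new UnsupportedOperationException("Not implemented in this sketch");
      }
    }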

SDK:

The SDK is composed of four parts:

Pipelines are the streaming and processing logic that you want to implement. A pipeline is a chain of processes. Basically, in a pipeline, you read data from a source, you apply transformations to the data, and eventually you send the data to a destination (called a sink in Dataflow wording).

PCollection is the object transported inside a pipeline. It’s the container of the data, flowing between each step of the pipeline.

Transform is a step of a pipeline. It takes an incoming PCollection and produces an outgoing PCollection. You can implement your own transform functions (see the sketch below).

Sink and Source are used to retrieve data as the input (first step) of a pipeline, and eventually to send data outside of the pipeline (last step).
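
To make these four parts concrete, here is a small sketch of a custom transform (the class name CleanLines is hypothetical): a composite PTransform that consumes a PCollection of lines and produces a PCollection of trimmed, non-empty lines. It uses the Dataflow SDK 1.x API, where composite transforms override apply() (renamed expand() in later Apache Beam releases):

    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    // Hypothetical transform: keeps only trimmed, non-empty lines.
    public class CleanLines extends PTransform<PCollection<String>, PCollection<String>> {
      @Override
      public PCollection<String> apply(PCollection<String> lines) {
        return lines.apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            String trimmed = c.element().trim();
            if (!trimmed.isEmpty()) {
              c.output(trimmed);
            }
          }
        }));
      }
    }

It would then sit between a source and a sink in a pipeline, for instance: pipeline.apply(TextIO.Read.from("/tmp/input.txt")).apply(new CleanLines()).apply(TextIO.Write.to("/tmp/output")).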
