{"id":1617,"date":"2020-06-29T06:00:55","date_gmt":"2020-06-29T11:00:55","guid":{"rendered":"https:\/\/www.bigdatainrealworld.com\/?p=1617"},"modified":"2023-02-19T07:32:13","modified_gmt":"2023-02-19T13:32:13","slug":"batch-processing-with-google-cloud-dataflow-and-apache-beam","status":"publish","type":"post","link":"https:\/\/www.bigdatainrealworld.com\/batch-processing-with-google-cloud-dataflow-and-apache-beam\/","title":{"rendered":"Batch Processing with Google Cloud DataFlow and Apache Beam"},"content":{"rendered":"

In this post, we will see how to implement a batch processing pipeline that moves data from Google Cloud Storage to Google BigQuery using Cloud Dataflow.

Cloud Dataflow is a fully managed data processing service on Google Cloud Platform. The Apache Beam SDK lets us develop both batch and stream processing pipelines: we program our ETL/ELT flow in Beam, and Beam runs it on Cloud Dataflow using the Dataflow Runner.

In this post, we will code the pipeline in Apache Beam and run it on Google Cloud Dataflow.

Code for this post can be found here.

Dataflow vs Apache Beam

People often confuse Apache Beam with Cloud Dataflow. Before writing a pipeline, it is important to understand the difference between the two.

Apache Beam is an open-source framework for creating data processing pipelines (batch as well as stream processing). A pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
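
To make that portability concrete, here is a minimal sketch of a Beam pipeline skeleton in Java. The class name is ours, and the body is a placeholder; the point is that the execution back-end is chosen through the options object, not in the pipeline code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MinimalBeamPipeline {
  public static void main(String[] args) {
    // The runner (DirectRunner, FlinkRunner, SparkRunner, DataflowRunner, ...)
    // is just a pipeline option, e.g. --runner=DataflowRunner on the command
    // line; the pipeline code itself does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // pipeline.apply(...) transforms would go here.
    pipeline.run().waitUntilFinish();
  }
}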

Interested in getting into Big Data? Check out our Hadoop Developer In Real World course for interesting use cases and real-world projects just like what you are reading.

Benefits of Cloud Dataflow

1. Horizontal autoscaling of worker nodes
2. Fully managed service
3. Monitoring of the pipeline at any time during its execution
4. Reliable and consistent processing

What is Google Cloud Storage?

Google Cloud Storage is a service for storing your objects. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets, and all buckets are associated with a project. You can compare GCS buckets with Amazon S3 buckets.
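
For example, Beam's TextIO addresses objects in a bucket with gs:// URIs, much as s3:// URIs address S3 objects. A minimal sketch, assuming a Pipeline object like the one created later in this post and a hypothetical bucket name (classes used: org.apache.beam.sdk.io.TextIO, org.apache.beam.sdk.values.PCollection):

// Read every JSON file under a hypothetical bucket prefix; TextIO yields
// one String element per line of each matched object.
PCollection<String> jsonLines = pipeline.apply("ReadFromGCS",
    TextIO.read().from("gs://my-example-bucket/input/*.json"));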

What is BigQuery?

BigQuery is a highly scalable, cost-effective data warehouse solution on Google Cloud Platform.

Benefits of BigQuery

1. Analyze petabytes of data using ANSI SQL queries (see the sketch after this list).
2. Access data and share insights with ease.
3. A more secure platform that scales with your needs.

Batch processing from Google Cloud Storage to BigQuery

Architecture Design

      \"google<\/a><\/p>\n

This is what the pipeline flow looks like. The source is a Google Cloud Storage bucket and the sink is BigQuery, the data warehouse offering on Google Cloud Platform.

      \"google-cloud-storage-bucket\"<\/a><\/p>\n

As you can see in the screenshot above, the data in the Google Cloud Storage bucket consists of JSON files, which we will push into BigQuery.
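
Each JSON line will eventually have to become a BigQuery TableRow. A minimal sketch of that conversion, assuming the Gson library for parsing; the field names "id" and "name" are hypothetical, since the post does not show the files' schema:

import com.google.api.services.bigquery.model.TableRow;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Convert one JSON line into a BigQuery TableRow.
// The "id" and "name" fields are hypothetical examples.
static TableRow toTableRow(String json) {
  JsonObject obj = JsonParser.parseString(json).getAsJsonObject();
  return new TableRow()
      .set("id", obj.get("id").getAsLong())
      .set("name", obj.get("name").getAsString());
}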

Initialize and Configure the Pipeline

The very first step is to set up the pipeline configuration: the machine type the pipeline will use, the region in which it will execute, and so on.

We can program our pipeline in Java or Python. First, we set up the DataflowPipelineOptions object, which defines the configuration of our pipeline.

We use the DirectRunner to execute and test the pipeline locally.

options.setRunner(DirectRunner.class);

Once we have tested it locally, we can replace DirectRunner with DataflowRunner. That is all we need to deploy our pipeline on Cloud Dataflow.

options.setRunner(DataflowRunner.class);

Apart from the runner, we also need to pass other configurations to the pipeline: project ID, maximum number of worker nodes, temp location, staging location, worker machine type, the region where the pipeline will be deployed, and so on.
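
These settings can be hard-coded, as in the snippet below, or supplied from the command line; PipelineOptionsFactory maps flags such as --project and --region onto the options object. A sketch of the command-line variant (the flag values are placeholders):

// Build options from command-line flags, for example:
//   --runner=DataflowRunner --project=<project-id> --region=us-central1
//   --tempLocation=gs://<bucket>/temp --stagingLocation=gs://<bucket>/staging
//   --maxNumWorkers=1 --workerMachineType=n1-standard-1
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);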

Create Pipeline

After passing all the configurations to the DataflowPipelineOptions object, we create our Pipeline object.

Refer to the snippet below for a closer look.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StorageToBQBatchPipeline {

  public static void main(String[] args) {

    // Initialize pipeline configurations. The empty strings are placeholders
    // for your own project ID, GCS locations, and region.
    DataflowPipelineOptions options = PipelineOptionsFactory
        .as(DataflowPipelineOptions.class);
    options.setRunner(DirectRunner.class); // swap in DataflowRunner.class to deploy
    options.setProject("");
    options.setStreaming(false); // this is a batch pipeline, so streaming stays off
    options.setTempLocation("");
    options.setStagingLocation("");
    options.setRegion("");
    options.setMaxNumWorkers(1);
    options.setWorkerMachineType("n1-standard-1");

    Pipeline pipeline = Pipeline.create(options);
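
The snippet stops after creating the Pipeline object. Here is a hedged sketch of how the rest of the source-to-sink flow could look; this is not the exact code from the linked repository: the bucket path and table reference are hypothetical, toTableRow is the conversion sketched earlier, and the target table is assumed to already exist:

// Read JSON lines from GCS, convert each line to a TableRow,
// and append the rows to an existing BigQuery table.
pipeline
    .apply("ReadFromGCS",
        TextIO.read().from("gs://my-example-bucket/input/*.json"))
    .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
      @Override
      public TableRow apply(String json) {
        return toTableRow(json);
      }
    }))
    .apply("WriteToBQ", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run().waitUntilFinish();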

Like what you are reading? You would like our free live webinars too. Sign up and get notified when we host webinars.