{"id":1617,"date":"2020-06-29T06:00:55","date_gmt":"2020-06-29T11:00:55","guid":{"rendered":"https:\/\/www.bigdatainrealworld.com\/?p=1617"},"modified":"2023-02-19T07:32:13","modified_gmt":"2023-02-19T13:32:13","slug":"batch-processing-with-google-cloud-dataflow-and-apache-beam","status":"publish","type":"post","link":"https:\/\/www.bigdatainrealworld.com\/batch-processing-with-google-cloud-dataflow-and-apache-beam\/","title":{"rendered":"Batch Processing with Google Cloud DataFlow and Apache Beam"},"content":{"rendered":"

In this post, we will see how to implement a batch processing pipeline that moves data from Google Cloud Storage to Google BigQuery using Cloud Dataflow.

Cloud Dataflow is a fully managed data processing service on Google Cloud Platform. The Apache Beam SDK lets us develop both batch and stream processing pipelines: we program our ETL/ELT flow in Beam, and Beam runs it on Cloud Dataflow using the Dataflow Runner.

In this post, we will code the pipeline in Apache Beam and run it on Google Cloud Dataflow.

Code for this post can be found here.

Dataflow vs Apache Beam

People often confuse Apache Beam with Cloud Dataflow. Before writing a pipeline, it is important to understand the difference between the two.

Apache Beam is an open-source framework for creating data processing pipelines (batch as well as stream processing). A pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
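
To make that portability concrete, here is a minimal sketch of a Beam pipeline skeleton in Java. The class name is ours, and the body is a placeholder; the point is that the execution back-end is chosen through the options object, not in the pipeline code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MinimalBeamPipeline {
  public static void main(String[] args) {
    // The runner (DirectRunner, FlinkRunner, SparkRunner, DataflowRunner, ...)
    // is just a pipeline option, e.g. --runner=DataflowRunner on the command
    // line; the pipeline code itself does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // pipeline.apply(...) transforms would go here.
    pipeline.run().waitUntilFinish();
  }
}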

Interested in getting into Big Data? Check out our Hadoop Developer In Real World course for interesting use cases and real-world projects just like what you are reading.

Benefits of Cloud Dataflow

1. Horizontal autoscaling of worker nodes
2. Fully managed service
3. Monitoring of the pipeline at any time during its execution
4. Reliable and consistent processing

What is Google Cloud Storage?

Google Cloud Storage is a service for storing your objects. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets, and all buckets are associated with a project. You can compare GCS buckets with Amazon S3 buckets.
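
For example, Beam's TextIO addresses objects in a bucket with gs:// URIs, much as s3:// URIs address S3 objects. A minimal sketch, assuming a Pipeline object like the one created later in this post and a hypothetical bucket name (classes used: org.apache.beam.sdk.io.TextIO, org.apache.beam.sdk.values.PCollection):

// Read every JSON file under a hypothetical bucket prefix; TextIO yields
// one String element per line of each matched object.
PCollection<String> jsonLines = pipeline.apply("ReadFromGCS",
    TextIO.read().from("gs://my-example-bucket/input/*.json"));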

What is BigQuery?

BigQuery is a highly scalable, cost-effective data warehouse solution on Google Cloud Platform.

Benefits of BigQuery

1. Analyze petabytes of data using ANSI SQL queries (see the sketch after this list).
2. Access data and share insights with ease.
3. A more secure platform that scales with your needs.

Batch processing from Google Cloud Storage to BigQuery

Architecture Design

      \"google<\/a><\/p>\n

This is what the pipeline flow looks like. The source is a Google Cloud Storage bucket and the sink is BigQuery, the data warehouse offering on Google Cloud Platform.

      \"google-cloud-storage-bucket\"<\/a><\/p>\n

As you can see in the screenshot above, the data in the Google Cloud Storage bucket consists of JSON files, which we will push into BigQuery.
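
Each JSON line will eventually have to become a BigQuery TableRow. A minimal sketch of that conversion, assuming the Gson library for parsing; the field names "id" and "name" are hypothetical, since the post does not show the files' schema:

import com.google.api.services.bigquery.model.TableRow;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Convert one JSON line into a BigQuery TableRow.
// The "id" and "name" fields are hypothetical examples.
static TableRow toTableRow(String json) {
  JsonObject obj = JsonParser.parseString(json).getAsJsonObject();
  return new TableRow()
      .set("id", obj.get("id").getAsLong())
      .set("name", obj.get("name").getAsString());
}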

Initialize and Configure the Pipeline

The very first step is to set up the pipeline configuration: the machine type the pipeline will use, the region in which it will execute, and so on.

We can program our pipeline in Java or Python. First, we set up the DataflowPipelineOptions object, which defines the configuration of our pipeline.

We use the DirectRunner to execute and test the pipeline locally.

options.setRunner(DirectRunner.class);

Once we have tested it locally, we can replace DirectRunner with DataflowRunner. That is all we need to deploy our pipeline on Cloud Dataflow.

options.setRunner(DataflowRunner.class);

Apart from the runner, we also need to pass other configurations to the pipeline: project ID, maximum number of worker nodes, temp location, staging location, worker machine type, the region where the pipeline will be deployed, and so on.
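
These settings can be hard-coded, as in the snippet below, or supplied from the command line; PipelineOptionsFactory maps flags such as --project and --region onto the options object. A sketch of the command-line variant (the flag values are placeholders):

// Build options from command-line flags, for example:
//   --runner=DataflowRunner --project=<project-id> --region=us-central1
//   --tempLocation=gs://<bucket>/temp --stagingLocation=gs://<bucket>/staging
//   --maxNumWorkers=1 --workerMachineType=n1-standard-1
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);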

Create Pipeline

After passing all the configurations to the DataflowPipelineOptions object, we create our Pipeline object.

Refer to the snippet below for a closer look.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StorageToBQBatchPipeline {

  public static void main(String[] args) {

    // Initialize pipeline configurations. The empty strings are placeholders
    // for your own project ID, GCS locations, and region.
    DataflowPipelineOptions options = PipelineOptionsFactory
        .as(DataflowPipelineOptions.class);
    options.setRunner(DirectRunner.class); // swap in DataflowRunner.class to deploy
    options.setProject("");
    options.setStreaming(false); // this is a batch pipeline, so streaming stays off
    options.setTempLocation("");
    options.setStagingLocation("");
    options.setRegion("");
    options.setMaxNumWorkers(1);
    options.setWorkerMachineType("n1-standard-1");

    Pipeline pipeline = Pipeline.create(options);
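
The snippet stops after creating the Pipeline object. Here is a hedged sketch of how the rest of the source-to-sink flow could look; this is not the exact code from the linked repository: the bucket path and table reference are hypothetical, toTableRow is the conversion sketched earlier, and the target table is assumed to already exist:

// Read JSON lines from GCS, convert each line to a TableRow,
// and append the rows to an existing BigQuery table.
pipeline
    .apply("ReadFromGCS",
        TextIO.read().from("gs://my-example-bucket/input/*.json"))
    .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
      @Override
      public TableRow apply(String json) {
        return toTableRow(json);
      }
    }))
    .apply("WriteToBQ", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run().waitUntilFinish();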

Like what you are reading? You would like our free live webinars too. Sign up and get notified when we host webinars.