{"id":1590,"date":"2020-06-15T06:00:34","date_gmt":"2020-06-15T11:00:34","guid":{"rendered":"https:\/\/www.bigdatainrealworld.com\/?p=1590"},"modified":"2023-02-19T07:32:13","modified_gmt":"2023-02-19T13:32:13","slug":"building-a-data-pipeline-with-apache-nifi","status":"publish","type":"post","link":"https:\/\/www.bigdatainrealworld.com\/building-a-data-pipeline-with-apache-nifi\/","title":{"rendered":"Building a Data Pipeline with Apache NiFi"},"content":{"rendered":"

Apache NiFi is an open source data flow framework that automates the movement of data between systems. It acts as a transporter between a data producer (the system that generates data) and a data consumer (the system that consumes it). NiFi is designed to address the complexity, scalability, maintainability and other major challenges of a Big Data pipeline.

NiFi is used extensively in Energy and Utilities, Financial Services, Telecommunications, Healthcare and Life Sciences, Retail Supply Chain, Manufacturing and many other industries.

Commonly used sources include data repositories, flat files, XML, JSON, SFTP locations, web servers, HDFS and many others.

Destinations can be S3, NAS, HDFS, SFTP, web servers, RDBMS, Kafka, etc.

Why NiFi?

A primary use of NiFi is data ingestion. In any Big Data project, the biggest challenge is bringing different types of data from different sources into a centralized data lake. NiFi can ingest many kinds of data from a wide range of sources and deliver them to a wide range of destinations. It ships with 280+ built-in processors for transporting data between systems.

NiFi is an easy-to-use tool that favors configuration over coding.

However, NiFi is not limited to data ingestion. It can also handle data provenance, data cleaning, schema evolution, data aggregation, transformation, job scheduling and more. We will discuss these in more detail in a future post, using a real world data flow pipeline.

In short, NiFi is a highly automated framework for gathering, transporting, maintaining and aggregating data of various types from various sources to a destination in a data flow pipeline.

A sample NiFi data flow pipeline looks something like this:

[Image: a sample NiFi data flow pipeline]

Seems too complex, right? This is the beauty of NiFi: we can build complex pipelines with just some basic configuration. So always remember that NiFi favors configuration over coding.

Step by step instructions to build a data pipeline in NiFi

Before we move ahead, a quick word on NiFi components. To create a NiFi pipeline, a developer configures individual processors, groups them into a processor group (labelled a process group in the NiFi UI), and connects these groups to form the pipeline.

Let us understand these components using a real-time pipeline.

Suppose flat files are continuously arriving in a source directory. We will design and configure a pipeline that checks these files and reads their name, type and other properties. This procedure is known as listing. After listing the files, we will ingest them into a target directory.
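
To make the list-then-fetch idea concrete before we build it in the NiFi UI, here is a minimal Python sketch of the same two steps. It is only an illustration of the concept, not how NiFi implements its processors. The function names, the attribute keys and the sample files (`orders.csv`, `users.txt`) are assumptions made up for this example.

```python
import shutil
import tempfile
from pathlib import Path


def list_files(source_dir: Path) -> list[dict]:
    """Listing step: describe each file without reading its content."""
    listing = []
    for path in sorted(source_dir.iterdir()):
        if path.is_file():
            listing.append({
                "filename": path.name,
                "absolute_path": str(path.parent),
                "extension": path.suffix,
                "size_bytes": path.stat().st_size,
            })
    return listing


def fetch_files(listing: list[dict], target_dir: Path) -> list[Path]:
    """Fetch step: move every listed file into the target directory."""
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in listing:
        source = Path(entry["absolute_path"]) / entry["filename"]
        destination = target_dir / entry["filename"]
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved


def demo() -> tuple[list[str], list[str]]:
    """Run both steps on a throwaway directory; returns (listed, moved) names."""
    with tempfile.TemporaryDirectory() as tmp:
        source, target = Path(tmp) / "source", Path(tmp) / "target"
        source.mkdir()
        # Hypothetical sample files standing in for the incoming flat files.
        (source / "orders.csv").write_text("id,amount\n1,10\n")
        (source / "users.txt").write_text("alice\n")
        listing = list_files(source)
        moved = fetch_files(listing, target)
        return [e["filename"] for e in listing], [p.name for p in moved]


if __name__ == "__main__":
    print(demo())
```

The listing step only collects metadata, and the fetch step acts on that metadata. This is the same separation of duties that the two processors in our pipeline will have.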

We will create a processor group named “List-Fetch” by dragging the processor group icon from the toolbar at the top right onto the canvas and giving it a name.

Now double-click the processor group to enter “List-Fetch”, then drag the processor icon onto the canvas to create a processor. A pop-up opens; search for the required processor (ListFile in this example) and add it.

The processor is added, but it shows a warning (⚠) because it is not configured yet. Right-click it and go to Configure. Here we can add or update the scheduling, settings, properties and comments for the processor. For now, we will only set the source path for our processor in the Properties tab. Every field marked in bold is mandatory, and each field has a question mark next to it that explains its usage.

Similarly, add another processor, “FetchFile”. Hover over the ListFile processor and drag the arrow that appears from ListFile to FetchFile. A pop-up informs you that the connection from ListFile to FetchFile uses the success relationship, meaning files are passed on when ListFile executes successfully. Once the connection is established, the warnings on ListFile are resolved and ListFile is ready for execution. This is confirmed by a thick red square on the processor, which indicates that it is valid but stopped.

Next, open FetchFile to configure it. In the Settings tab, select all four options under “Automatically Terminate Relationships”. Since FetchFile is the last processor in this pipeline, this tells NiFi to end the flow at FetchFile: a file routed to any of these relationships is dropped from the flow instead of waiting for a downstream connection.
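
The rule behind these warnings can be sketched in a few lines of Python: every relationship of a processor must either be connected to something downstream or be automatically terminated. This is a simplified model and not NiFi code. The relationship names below are assumptions used for illustration, so check the Settings tab of your own processor for the exact list.

```python
def unhandled_relationships(relationships, connected, auto_terminated):
    """Return the relationships that would leave a processor flagged with a warning."""
    return sorted(set(relationships) - set(connected) - set(auto_terminated))


def route(flowfile, relationship, connections, auto_terminated):
    """Route one flowfile: queue it downstream, or drop it if auto-terminated."""
    if relationship in connections:
        connections[relationship].append(flowfile)
        return "queued"
    if relationship in auto_terminated:
        return "dropped"
    raise ValueError(f"unhandled relationship: {relationship}")


# Assumed relationship names, used for illustration only.
LIST_FILE = ["success"]
FETCH_FILE = ["success", "not.found", "permission.denied", "failure"]

if __name__ == "__main__":
    # Nothing handled yet: all four relationships are reported.
    print(unhandled_relationships(FETCH_FILE, connected=[], auto_terminated=[]))
    # All four auto-terminated: nothing left to warn about.
    print(unhandled_relationships(FETCH_FILE, connected=[], auto_terminated=FETCH_FILE))
```

In this model, connecting ListFile's success relationship to FetchFile clears ListFile's warning, and auto-terminating FetchFile's relationships clears FetchFile's.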

Next, on the Properties tab, leave the File to fetch field as it is, because its default value refers to the files handed over by ListFile on the success relationship. Change Completion Strategy to Move File and enter the target directory accordingly. Choose the other options as per your use case, then apply and close.
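
Why can the File to fetch field stay untouched? To our knowledge, its default in a stock NiFi install is an expression along the lines of `${absolute.path}/${filename}`, which is filled in from attributes that ListFile attaches to each listed file. The Python sketch below mimics only that simple placeholder substitution. It is not NiFi's Expression Language, and the attribute values are hypothetical.

```python
import re

# Matches placeholders of the form ${name}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, attributes: dict[str, str]) -> str:
    """Replace each ${name} in the template with the matching attribute value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in attributes:
            raise KeyError(f"no attribute named {name!r}")
        return attributes[name]

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Hypothetical attributes, as the listing step might set them for one file.
    attributes = {"absolute.path": "/data/source", "filename": "orders.csv"}
    print(resolve("${absolute.path}/${filename}", attributes))
```

Because the listing step supplies these attributes for every file, the same expression works for all of them and nothing needs to be hard-coded in FetchFile.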

The pipeline is ready, with the warnings resolved. Let’s execute it.

To execute a single processor, right-click it and choose Start. To run the complete pipeline in a processor group, go to the processor group by clicking the processor group name in the navigation bar at the bottom left, then right-click and choose Start.

A green icon indicates that the pipeline is running, and a red one that it is stopped. Files move from one processor to the next through a queue. If one processor completes its work but its successor is stuck, stopped or failed, the processed data waits in the queue. Other details such as execution history, summary, data provenance and flow configuration history can be accessed either by right-clicking a processor or processor group, or by clicking the button with three horizontal lines at the top right.
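
The queue behaviour described above can be modelled with a small Python sketch. It is a toy model under our own assumptions (the `Processor` class and the file names are invented for this example) and not a description of NiFi internals. It shows how files wait in the queue while the downstream processor is stopped and drain once it is started.

```python
from collections import deque


class Processor:
    """A toy processor that reads from an inbound queue and writes to an outbound one."""

    def __init__(self, name, inbound=None, outbound=None):
        self.name = name
        self.running = False
        self.inbound = inbound    # queue to read from (None for a source)
        self.outbound = outbound  # queue to write to (None for a sink)
        self.processed = []

    def run_once(self, item=None):
        """Process one item if running. Sources take `item`; others pull from inbound."""
        if not self.running:
            return False
        if self.inbound is not None:
            if not self.inbound:
                return False
            item = self.inbound.popleft()
        elif item is None:
            return False
        self.processed.append(item)
        if self.outbound is not None:
            self.outbound.append(item)
        return True


def demo():
    """Return (files waiting while downstream is stopped, files fetched, final queue size)."""
    queue = deque()
    lister = Processor("ListFile", outbound=queue)
    fetcher = Processor("FetchFile", inbound=queue)

    lister.running = True
    for name in ["a.csv", "b.csv"]:  # hypothetical file names
        lister.run_once(name)

    fetcher.run_once()               # fetcher is stopped, so nothing is consumed
    waiting = list(queue)

    fetcher.running = True           # start the downstream processor
    while fetcher.run_once():
        pass
    return waiting, fetcher.processed, len(queue)


if __name__ == "__main__":
    print(demo())
```

Nothing is lost while the downstream processor is stopped. The data simply stays in the queue until the processor is started again.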

This is a real world example of building and deploying a NiFi pipeline.
