BLOG – Page 9 – Big Data In Real World

Published by Big Data In Real World on December 30, 2020

Categories

Kafka

How to delete a Kafka topic?

In this post we will see how to delete a Kafka topic and get the details of the topic before deleting it. Fetch current details of […]

Published by Big Data In Real World on December 28, 2020

Categories

Kafka

How to purge or delete messages in a Kafka topic?

The easiest way to purge or delete messages in a Kafka topic is to set retention.ms to a low value. The retention.ms configuration controls how long […]

Published by Big Data In Real World on December 25, 2020

Categories

Hadoop
HDFS

What is the difference between hadoop fs, hadoop dfs and hdfs dfs commands?

If you are trying to access HDFS, you might have come across the three types of commands below. In this post we will see the difference […]

Published by Big Data In Real World on December 23, 2020

Categories

Spark

How to convert List to a JavaRDD in Spark?

This is a very common use case if you are working on a Spark project and writing code in Java. You have a list and now […]

Published by Big Data In Real World on December 21, 2020

Categories

Spark

What is the difference between repartition and coalesce in Spark?

The repartition and coalesce functions in Spark change the number of partitions of a DataFrame. Repartition: increases or decreases partitions. Repartition always […]

Published by Big Data In Real World on December 18, 2020

Categories

Apache Hive

How to fix the Hive metastore database is not initialized issue?

So you were installing Hive and ran into the issue below when Hive was trying to set up the metastore database. Exception in thread "main" java.lang.RuntimeException: […]

Published by Big Data In Real World on December 16, 2020

Categories

Apache Hive

How to update or drop a Hive Partition?

It is a common use case in your production jobs or Hive scripts to update or drop a Hive partition from your table. Let’s see how […]

Published by Big Data In Real World on December 14, 2020

Categories

Apache Hive

How to set variables in Hive scripts?

Hive scripts that are scheduled to run in production always take in variables. These variables are set dynamically and you would need to pass the […]

Published by Big Data In Real World on December 11, 2020

Categories

Apache Hive

What is the difference between hivevar and hiveconf?

In this post we have shown how to set and refer to a variable in the Hive prompt and in Hive scripts. In the post, we have used […]

Published by Big Data In Real World on December 9, 2020

Categories

Hadoop
Spark

How to convert RDD to DataFrame in Spark?

You have an RDD in your code and now you want to work with the data using DataFrames in Spark. Spark provides you with functions to convert RDD to […]

Published by Big Data In Real World on December 7, 2020

Categories

Spark

How to show full column content in a Spark DataFrame?

Often, when we are working with data in Spark, we want to preview the data or the solution in Spark shell right […]

Published by Big Data In Real World on September 20, 2020

Categories

Hadoop

Try Hadoop In 5 Minutes

Let’s do this! We are going to do 3 things and trust us, it is going to take less than 5 minutes. 1. Log in to our […]

Published by Big Data In Real World on July 13, 2020

Categories

Hadoop

Calculate Resource Allocation for Spark Applications

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires […]

Published by Big Data In Real World on June 29, 2020

Batch Processing with Google Cloud DataFlow and Apache Beam

In this post we will see how to implement a batch processing pipeline by moving data from Google Cloud Storage to Google BigQuery using Cloud […]

Published by Big Data In Real World on June 15, 2020

Building a Data Pipeline with Apache NiFi

NiFi is an open-source data flow framework. It highly automates the flow of data between systems. It works as a data transporter between data […]

Published by Big Data In Real World on June 1, 2020

Categories

Kafka

Building Stream Processing Applications using KSQL

ksqlDB is an event streaming database that enables creating powerful stream processing applications on top of Apache Kafka by using the familiar SQL syntax, which is […]

Published by Big Data In Real World on May 18, 2020

Categories

Spark

How To Catch Malware Using Spark

Those in the industry know that one of the best techniques for catching malware using Machine Learning (ML) is only possible with distributed computing. Before showing […]

Published by Big Data In Real World on May 4, 2020

Building Big Data Streaming Pipelines – architecture, concepts and tool choices

Considering building a big data streaming application? You have come to the right place. This is a comprehensive post on the architecture and orchestration of big […]

Published by Big Data In Real World on January 11, 2020

Categories

Kafka

Schema Registry & Schema Evolution in Kafka

In this post we are going to look at schema evolution and compatibility types in Kafka with Kafka schema registry. With a good understanding of compatibility […]

Published by Big Data In Real World on March 25, 2019

Improving Performance In Spark Using Partitions

In this blog post we are going to show how to optimize your Spark job by partitioning the data correctly. To demonstrate this we are going […]