BLOG – Page 7 – Big Data In Real World

Published by Big Data In Real World on April 2, 2021

Categories: Hadoop, HDFS

How to recursively list files and directories in HDFS?

A common problem with a pretty simple solution. Solution: when listing a directory, use the -R option to list its contents recursively. If you […]

Published by Big Data In Real World on March 31, 2021

Categories: Apache Hive

Different ways to insert data into Hive table

There are several ways to insert or load data into Hive tables. This post will cover 3 broad ways to […]

Published by Big Data In Real World on March 29, 2021

Categories: Elasticsearch

How to get specific fields from a document in Elasticsearch?

When you search for or look up a document, Elasticsearch by default returns all the fields in the document. $ curl -X GET "localhost:9200/account/_doc/954?pretty" { […]

Published by Big Data In Real World on March 26, 2021

Categories: Spark

What is the difference between cache and persist in Spark?

The cache() and persist() functions are used to cache intermediate results of an RDD, DataFrame or Dataset. You can mark an RDD, DataFrame or Dataset to […]

Published by Big Data In Real World on March 24, 2021

Categories: Hadoop, HDFS

Where does HDFS store files locally on DataNodes?

HDFS divides files into blocks and stores the blocks locally on the DataNodes. The location varies from cluster to cluster based on the configuration in hdfs-site.xml […]

Published by Big Data In Real World on March 22, 2021

Categories: Kafka

How to fix the LEADER_NOT_AVAILABLE error in Kafka?

If you are running Kafka on a public cloud such as AWS or Azure, or on Docker, you are more likely to see the error below. WARN […]

Published by Big Data In Real World on March 19, 2021

Categories: Spark

When to use cache and persist functions in Spark?

Here are three high-level situations in which using the cache() or persist() functions would be appropriate and helpful. Performing multiple actions on the same RDD: numbersRDD is used […]

Published by Big Data In Real World on March 17, 2021

How to get a specific version of a file from S3 using AWS CLI?

In this post we are going to see how to enable versioning on a bucket and then how to get a specific version of a […]

Published by Big Data In Real World on March 15, 2021

Categories: Apache Hive

What is the difference between INNER JOIN and LEFT SEMI JOIN in Hive?

Both INNER JOIN and LEFT SEMI JOIN return matching records between two tables, with a subtle difference. Let’s consider 2 tables, employee and employee_department_mapping, with […]

Published by Big Data In Real World on March 12, 2021

Categories: Spark

Definitive guide on Spark join algorithms

Over time we have written several posts on Spark joins and join algorithms, explaining the internal workings of these algorithms. Here are all the posts […]

Published by Big Data In Real World on March 10, 2021

Categories: HDFS

What does hadoop namenode -format do and is it safe to run?

Don’t run the command until you fully understand what you are trying to do. hadoop namenode -format will format or delete all files and folders in […]

Published by Big Data In Real World on March 8, 2021

Categories: Kafka

Differences between RabbitMQ and Kafka

Ever wondered about the differences between RabbitMQ and Kafka, and why Kafka is needed when RabbitMQ already exists? Read through the post and you will […]

Published by Big Data In Real World on March 5, 2021

Categories: Spark

Why does Cartesian Product Join, aka Shuffle-and-Replication Nested Loop Join, not cause a shuffle?

Shuffle-and-Replication does not mean a “true” shuffle, in which records with the same keys are sent to the same partition. Instead, the entire partition of the […]

Published by Big Data In Real World on March 3, 2021

Difference between EBS, S3 and Glacier in AWS?

EBS, S3 and Glacier are different storage options available in Amazon Web Services. They differ in cost, use case and purpose. Here is a super quick […]

Published by Big Data In Real World on March 1, 2021

Categories: Apache Hive

How to skip the first line or header when reading a file in Hive?

This is a common problem because most data files that come from legacy systems contain a header in the first row. This […]

Published by Big Data In Real World on February 26, 2021

Categories: Spark

How to avoid a Broadcast Nested Loop join in Spark?

Broadcast Nested Loop join works by broadcasting one of the datasets in its entirety and performing a nested loop to join the data. So essentially every record from […]

Published by Big Data In Real World on February 24, 2021

Categories: Hadoop, HDFS

What is the difference between hadoop fs put and copyFromLocal?

The answer to this question depends on the version of Hadoop you are using. Older versions of Hadoop (1.x.x): the first argument of copyFromLocal is restricted […]

Published by Big Data In Real World on February 22, 2021

Categories: Elasticsearch

What is the difference between Query and Filter in Elasticsearch?

Elasticsearch offers 2 different contexts in which you can query or filter documents in an index: Query context and Filter context. Query context: In the […]

Published by Big Data In Real World on February 19, 2021

Categories: Spark

How to force Spark to use Shuffle Hash Join when it defaults to Sort Merge Join?

Take a look at the execution plan below. Currently, when you print the executed plan, you see that Spark is using Sort Merge Join. scala> dfJoined.queryExecution.executedPlan […]

Published by Big Data In Real World on February 17, 2021

How to migrate an Amazon S3 bucket from one region to another?

The short answer is that you can’t migrate an S3 bucket from one region to another, but there is a workaround. Workaround: create a new […]