BLOG – Page 5 – Big Data In Real World

You are getting the error below during DataNode startup. This post explains how to fix the issue. 2013-04-11 16:25:50,515 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting […]

Published by Big Data In Real World at August 30, 2021

Categories

Kafka

How to change the number of replicas of a Kafka topic?

Changing the number of replicas on an existing topic is a 3-step process: get the current information about the topic, configure the new replica assignment in […]

Published by Big Data In Real World at August 27, 2021

Categories

Apache Hive

How to calculate median in Hive?

You are trying to calculate a median and couldn’t find a function for it in Hive. No worries, you have come to the right place. […]

Published by Big Data In Real World at August 25, 2021

Categories

Spark

How to subtract or see differences between two DataFrames in Spark?

Pretty simple. Use except() to subtract or find the difference between two DataFrames. […]

Published by Big Data In Real World at August 23, 2021

Categories

Hadoop
HDFS

LeaseExpiredException: No lease error on HDFS

So you got the exception below and are wondering how to fix it. In this post we explain the possible causes of this exception, which […]

Published by Big Data In Real World at August 20, 2021

Categories

Spark

Error initializing SparkContext with java.net.BindException

java.net.BindException is a common exception when Spark tries to initialize SparkContext. It is especially common when you run Spark locally. 16/01/04 […]

Published by Big Data In Real World at August 18, 2021

Categories

Kafka

How to get a list of consumers connected to a Kafka topic?

In Kafka, consumers connected to a topic are grouped into consumer groups, so each consumer group has a list of consumers. So to get the […]

Published by Big Data In Real World at August 16, 2021

Categories

Spark

What is the difference between RDD, Dataframe and Dataset in Spark?

RDD, DataFrame and Dataset are all Spark APIs, introduced at different points in time. The goal of these APIs is to help us work […]

Published by Big Data In Real World at August 13, 2021

Categories

Hadoop
HDFS

Why is Namenode stuck in safe mode and how to fix it?

When the NameNode is started or restarted, it stays in safe mode for a period of time. During this time you will not be able to use […]

Published by Big Data In Real World at August 11, 2021

Categories

How to download an entire bucket from S3?

Downloading an entire bucket, or a folder inside a bucket, is quite straightforward with the AWS CLI. Install the AWS CLI if you don’t have it already. […]

Published by Big Data In Real World at August 9, 2021

Categories

Spark

What does “Stage Skipped” mean in Apache Spark web UI?

Sometimes you might see a stage being skipped in the DAG visualization in the Spark web UI. In this post we are going to discuss a couple of […]

Published by Big Data In Real World at August 6, 2021

Categories

Apache Hive

How to export a Hive table into a CSV file?

It is a pretty common use case to export the contents of a Hive table into a CSV file. It’s pretty simple if you are using […]

Published by Big Data In Real World at August 4, 2021

Categories

Spark

How does Spark decide the number of tasks and number of tasks to execute in parallel?

In this post we will see how Spark decides the number of tasks in a job and how many of those tasks run in parallel. Let’s see […]

Published by Big Data In Real World at May 31, 2021

Categories

How to save Spark DataFrame directly to a Hive table?

It is a very common use case to process data in Spark and save the resulting DataFrame directly into a Hive table. […]

Published by Big Data In Real World at May 28, 2021

Categories

HDFS

What is the difference between NameNode and Secondary NameNode?

NameNode: the NameNode is the heart of HDFS. It maintains the metadata of HDFS: files, list of blocks, directories, permissions, etc. The metadata is persisted on […]

Published by Big Data In Real World at May 26, 2021

Categories

Kafka

Can multiple Kafka consumers read the same message from a partition?

Let’s redefine the question a little bit. Can multiple Kafka consumers from a consumer group read the same message from a partition? The short answer to […]

Published by Big Data In Real World at May 24, 2021

Categories

Spark

What are broadcast variables in Spark and when to use them?

Broadcast variables are variables that are available on all executors running the Spark application. These variables are already cached and ready to be used by tasks […]

Published by Big Data In Real World at May 21, 2021

Categories

Apache Hive

What is the difference between Apache Pig and Hive?

Apache Pig was created by Yahoo. Apache Hive was created by Facebook. Both tools aim to hide the complexities of writing MapReduce jobs. Pig is similar […]