BLOG – Page 6 – Big Data In Real World

Both explode and posexplode are User Defined Table generating Functions. UDTFs operate on single rows and produce multiple rows as output. explode() There are 2 flavors […]

Published by Big Data In Real World on May 12, 2021

Categories

Kafka

How to change or reset consumer offset in Kafka?

This is a very common use case. Consumer offset is used to track the messages that are consumed by consumers in a consumer group. A topic […]

Published by Big Data In Real World on May 10, 2021

Categories

Spark

How to read multiple files into a single RDD or DataFrame in Spark?

We get this question a lot, so we wrote a short post to answer it. Spark leverages Hadoop’s FileInputFormat to read files […]

Published by Big Data In Real World on May 7, 2021

Categories

Apache Hive

What is the difference between HiveServer1 and HiveServer2?

HiveServer (or HiveServer1) was introduced when Hive first came out and it had several limitations. HiveServer2 was later introduced with Hive 0.11 and aimed to solve […]

Published by Big Data In Real World on May 5, 2021

Categories

Elasticsearch

How to perform range filtering like greater than, less than in Elasticsearch?

Filtering on a range (greater than, less than, greater than or equal to, etc.) is a pretty common requirement when you work with data. In this post […]

Published by Big Data In Real World on May 3, 2021

Categories

Spark

What is the difference between map and mapPartitions in Spark?

Both map and mapPartitions are narrow transformations, so neither triggers a shuffle. Let’s say our RDD has 5 partitions and 10 elements in each […]

Published by Big Data In Real World on April 30, 2021

Categories

Apache Hive

What is the difference between UDF, UDAF and UDTF in Hive?

Hive has 3 different types of functions – User Defined Function (UDF), User Defined Aggregate Function (UDAF) and User Defined Table generating Function (UDTF). User Defined […]

Published by Big Data In Real World on April 28, 2021

Categories

Hadoop

How to get a list of YARN applications that are currently running in a Hadoop cluster?

A common question with a simple solution. Solution: use the YARN command below to list all applications that are running in YARN. yarn application -list -appStates RUNNING […]

Published by Big Data In Real World on April 26, 2021

Categories

Kafka

What is consumer offset and the purpose of consumer offset in Kafka?

There is a lot of confusion about consumer offsets: what their purpose is and how to change them. The goal […]

Published by Big Data In Real World on April 23, 2021

Categories

Hadoop

How to fix “Could not locate executable winutils.exe” issue in Hadoop?

It is pretty common in certain Hadoop distributions to get the error below when you attempt to start the ResourceManager service or other services. ERROR [main] […]

Published by Big Data In Real World on April 21, 2021

Categories

Apache Hive

How to alter the type of the column in a Hive table?

Let’s say you created a table with one of the columns as STRING and after a couple of days you realized the column should actually be […]

Published by Big Data In Real World on April 19, 2021

Categories

Spark

What is the difference between reduceByKey and aggregateByKey in Spark?

reduceByKey(): reduceByKey() has the properties below. The result of the combination (e.g. a sum) is of the same type as the values. The operation when combined […]

Published by Big Data In Real World on April 16, 2021

Categories

Hadoop

How to properly remove a node from a Hadoop cluster?

This post will list out the steps to properly remove a node from a Hadoop cluster. It is not advisable to just shut down the node […]

Published by Big Data In Real World on April 14, 2021

Categories

Elasticsearch

How to specify conditional expressions (OR, AND, NOT) when searching documents in Elasticsearch?

We can specify conditional expressions like OR, AND using the Query expression during search in Elasticsearch. We have an index named account and in the index […]

Published by Big Data In Real World on April 12, 2021

Categories

How to see the first few lines from a file in S3 using AWS CLI?

This is a very common requirement and it has an easy solution. Solution: use aws s3api get-object to get the object, and use the --range option to […]

Published by Big Data In Real World on April 9, 2021

Categories

Apache Hive

How to solve word count problem in Hive?

If you have read about MapReduce you know what a word count problem is. Word count is simply counting the number of words in a dataset. […]

Published by Big Data In Real World on April 7, 2021

Categories

Spark

What is the difference between groupByKey and reduceByKey in Spark?

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference between reduceByKey and groupByKey is that reduceByKey does […]

Published by Big Data In Real World on April 5, 2021

Categories

Kafka

How to send large messages in Kafka?

By default, the messages you can send and manage in Kafka should be less than 1 MB. To increase this limit there are a few properties you […]