How to read multiple files into a single RDD or DataFrame in Spark?


We get this question a lot, so we thought we would write a short post to answer it.

Spark leverages Hadoop’s FileInputFormat to read files, so the same path options that are available when reading files with Hadoop also apply in Spark.


Solution

Here is how we read from multiple directories plus a specific file, by passing a comma-separated list of paths:

sc.textFile("/home/hirw/dir1,/home/hirw/dir2,/home/hirw/specific/file")

You can also use wildcards to match files and directories:

sc.textFile("/home/hirw/dir1,/home/hirw/dir-10[0-5]*")


Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms, and along the way we have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
