This is a very common use case if you are working on a Spark project and writing code in Java: you have a List and you want to convert it into a JavaRDD.
Solution
List<String> l = new ArrayList<>();
l.add("Red");
l.add("Green");
l.add("Blue");
Simply use JavaSparkContext's parallelize method, which returns a JavaRDD.
JavaSparkContext jsc = new JavaSparkContext();
JavaRDD<String> rdd = jsc.parallelize(l);
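For reference, here is a minimal, self-contained sketch of the whole flow. The SparkConf setup with a local master and the application name are illustrative assumptions added so the example runs on its own, not part of the original snippet:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ListToJavaRDD {
    public static void main(String[] args) {
        // local[*] and the app name are assumptions for a runnable local example
        SparkConf conf = new SparkConf().setAppName("ListToJavaRDD").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<String> l = new ArrayList<>();
        l.add("Red");
        l.add("Green");
        l.add("Blue");

        // parallelize distributes the in-memory list as a JavaRDD
        JavaRDD<String> rdd = jsc.parallelize(l);
        System.out.println(rdd.collect());   // [Red, Green, Blue]

        jsc.stop();
    }
}

If the list contents are fixed, Arrays.asList("Red", "Green", "Blue") is a shorter way to build the input list.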
JavaPairRDD
Sometimes it is convenient to work with a JavaPairRDD, and here is how you can create one.
First, create a list of Tuple2 objects. In the example below, each tuple holds an Integer and a String.
List<Tuple2<Integer, String>> pair = new ArrayList<>();
pair.add(new Tuple2<>(0, "Red"));
pair.add(new Tuple2<>(1, "Blue"));
Once you have the list, use the parallelizePairs method on JavaSparkContext to create the JavaPairRDD.
JavaSparkContext jsc = new JavaSparkContext();
JavaPairRDD<Integer, String> rdd = jsc.parallelizePairs(pair);
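As before, a minimal end-to-end sketch may help. The SparkConf settings and the lookup/collectAsMap calls at the end are illustrative assumptions to show the pair RDD in use; they are not part of the original snippet:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ListToJavaPairRDD {
    public static void main(String[] args) {
        // local[*] and the app name are assumptions for a runnable local example
        SparkConf conf = new SparkConf().setAppName("ListToJavaPairRDD").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Tuple2<Integer, String>> pair = new ArrayList<>();
        pair.add(new Tuple2<>(0, "Red"));
        pair.add(new Tuple2<>(1, "Blue"));

        // parallelizePairs keys the JavaPairRDD by the first element of each tuple
        JavaPairRDD<Integer, String> rdd = jsc.parallelizePairs(pair);
        System.out.println(rdd.lookup(1));        // [Blue]
        System.out.println(rdd.collectAsMap());   // {0=Red, 1=Blue}

        jsc.stop();
    }
}

Working with a JavaPairRDD gives you access to key-based operations such as lookup, reduceByKey, and join, which plain JavaRDDs do not offer.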