This is a very common use case if you are working on a Spark project and writing code in Java: you have a List and you want to convert it into a JavaRDD.
Solution
List<String> l = new ArrayList<>();
l.add("Red");
l.add("Green");
l.add("Blue");
Simply use JavaSparkContext's parallelize method, which returns a JavaRDD.
JavaSparkContext jsc = new JavaSparkContext();
JavaRDD<String> rdd = jsc.parallelize(l);
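For reference, here is a minimal, self-contained sketch of the whole flow. The SparkConf setup with a local master and the application name are illustrative assumptions added so the example runs on its own, not part of the original snippet:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ListToJavaRDD {
    public static void main(String[] args) {
        // local[*] and the app name are assumptions for a runnable local example
        SparkConf conf = new SparkConf().setAppName("ListToJavaRDD").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<String> l = new ArrayList<>();
        l.add("Red");
        l.add("Green");
        l.add("Blue");

        // parallelize distributes the in-memory list as a JavaRDD
        JavaRDD<String> rdd = jsc.parallelize(l);
        System.out.println(rdd.collect());   // [Red, Green, Blue]

        jsc.stop();
    }
}

If the list contents are fixed, Arrays.asList("Red", "Green", "Blue") is a shorter way to build the input list.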
JavaPairRDD
Sometimes it is convenient to work with a JavaPairRDD, and here is how you can create one.
First, create a list of Tuple2 objects. In the example below, each tuple holds an Integer and a String.
List<Tuple2<Integer, String>> pair = new ArrayList<>();
pair.add(new Tuple2<>(0, "Red"));
pair.add(new Tuple2<>(1, "Blue"));
Once you have the list, use the parallelizePairs method on JavaSparkContext to create the JavaPairRDD.
JavaSparkContext jsc = new JavaSparkContext();
JavaPairRDD<Integer, String> rdd = jsc.parallelizePairs(pair);
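As before, a minimal end-to-end sketch may help. The SparkConf settings and the lookup/collectAsMap calls at the end are illustrative assumptions to show the pair RDD in use; they are not part of the original snippet:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ListToJavaPairRDD {
    public static void main(String[] args) {
        // local[*] and the app name are assumptions for a runnable local example
        SparkConf conf = new SparkConf().setAppName("ListToJavaPairRDD").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Tuple2<Integer, String>> pair = new ArrayList<>();
        pair.add(new Tuple2<>(0, "Red"));
        pair.add(new Tuple2<>(1, "Blue"));

        // parallelizePairs keys the JavaPairRDD by the first element of each tuple
        JavaPairRDD<Integer, String> rdd = jsc.parallelizePairs(pair);
        System.out.println(rdd.lookup(1));        // [Blue]
        System.out.println(rdd.collectAsMap());   // {0=Red, 1=Blue}

        jsc.stop();
    }
}

Working with a JavaPairRDD gives you access to key-based operations such as lookup, reduceByKey, and join, which plain JavaRDDs do not offer.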