March 20, 2023

In this post we are going to see how to work with Excel files in Spark. We will be using the spark-excel package created by Crealytics.
Reading an Excel file in Spark
For both reading and writing Excel files we will use the spark-excel package, so we have started the spark-shell with the --packages flag.
spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1
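Note that the artifact name encodes the Scala binary version (2.11 above), which must match the Scala version your Spark build uses. If you bundle the dependency in a project instead of passing --packages, a minimal build.sbt sketch could look like this (the Spark version shown is an illustrative assumption; adjust both versions to your setup):

```scala
// build.sbt (sketch) — %% appends the project's Scala binary version
// automatically, producing e.g. spark-excel_2.11 on a Scala 2.11 build.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.8" % Provided, // assumed Spark version
  "com.crealytics"   %% "spark-excel" % "0.13.1"
)
```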
We have provided two options with the read – sheetName and header. There are several other options, and you can see them all here – https://github.com/crealytics/spark-excel
val df = spark.read.format("com.crealytics.spark.excel")
  .option("sheetName", "employee")
  .option("header", "true")
  .load("file:///C:/Hadoop/employee.xlsx")

scala> df.show
+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|   Arya|           2000|
|Analytics|   Arya|          10000|
|Analytics| Donald|           1000|
|Ingestion|  Emily|           3000|
|Analytics|  Emily|           8000|
|Ingestion|  Jerry|           1000|
|Ingestion|   John|           1000|
|       BI| Martin|           5000|
|       ML| Mickey|           8000|
|       BI| Mickey|          12000|
|       ML|Patrick|           1000|
|       ML|  Riley|           9000|
+---------+-------+---------------+
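Besides sheetName and header, the options page linked above lists several more. As a hedged sketch (the cell range below is made up for illustration), newer versions of spark-excel let you target a specific sheet and cell range with dataAddress and infer column types instead of reading everything as strings:

```scala
// Sketch: read only cells A1:C13 of the "employee" sheet and let
// spark-excel infer column types (e.g. Cost_To_Project as a numeric type).
val dfTyped = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'employee'!A1:C13") // sheet + range in one option
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file:///C:/Hadoop/employee.xlsx")
```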
Writing an Excel file in Spark
Let’s add an index column to the DataFrame and write the DataFrame back to Excel. Here is the code.
import org.apache.spark.sql.functions.monotonically_increasing_id

val dfidx = df.withColumn("index", monotonically_increasing_id)

dfidx.show
+---------+-------+---------------+-----+
|  Project|   Name|Cost_To_Project|index|
+---------+-------+---------------+-----+
|Ingestion|   Arya|           2000|    0|
|Analytics|   Arya|          10000|    1|
|Analytics| Donald|           1000|    2|
|Ingestion|  Emily|           3000|    3|
|Analytics|  Emily|           8000|    4|
|Ingestion|  Jerry|           1000|    5|
|Ingestion|   John|           1000|    6|
|       BI| Martin|           5000|    7|
|       ML| Mickey|           8000|    8|
|       BI| Mickey|          12000|    9|
|       ML|Patrick|           1000|   10|
|       ML|  Riley|           9000|   11|
+---------+-------+---------------+-----+

dfidx.write.format("com.crealytics.spark.excel")
  .option("sheetName", "Daily")
  .option("header", "true")
  .mode("overwrite")
  .save("file:///home/osboxes/employee_idx.xlsx")
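A related hedged sketch: instead of overwriting the whole file, spark-excel can append into an existing workbook. The sheet name below is made up for illustration, and the exact behavior depends on the package version, so treat this as a sketch rather than a guaranteed recipe:

```scala
// Sketch: add a second sheet to the workbook written above,
// keeping the existing "Daily" sheet intact.
dfidx.write
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Summary'!A1") // target sheet and top-left cell
  .option("header", "true")
  .mode("append")                        // append to the existing file
  .save("file:///home/osboxes/employee_idx.xlsx")
```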
Here is the Excel file created by Spark.