April 3, 2023

We will be using the spark-xml package from Databricks to read and write XML files with Spark.
Here is how we start the Spark shell with the spark-xml package. The _2.11 suffix in the artifact name must match the Scala version your Spark build uses.
spark-shell --packages com.databricks:spark-xml_2.11:0.10.0
Read XML files in Spark
Here is the content of our XML file.
<employees>
  <employee>
    <name>Arya</name>
    <project>Ingestion</project>
    <cost-to-project>2000</cost-to-project>
  </employee>
  <employee>
    <name>Donald</name>
    <project>ML</project>
    <cost-to-project>3000</cost-to-project>
  </employee>
  <employee>
    <name>Emily</name>
    <project>BI</project>
    <cost-to-project>8000</cost-to-project>
  </employee>
  <employee>
    <name>Riley</name>
    <project>ETL</project>
    <cost-to-project>30000</cost-to-project>
  </employee>
  <employee>
    <name>Donald</name>
    <project>BI</project>
    <cost-to-project>5000</cost-to-project>
  </employee>
</employees>
Here is the code to read the above file. Each record in the XML is wrapped in an <employee> tag, so we set the rowTag option to employee.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "employee")
  .load("file:///home/osboxes/employee.xml")

scala> df.show
+---------------+------+---------+
|cost-to-project|  name|  project|
+---------------+------+---------+
|           2000|  Arya|Ingestion|
|           3000|Donald|       ML|
|           8000| Emily|       BI|
|          30000| Riley|      ETL|
|           5000|Donald|       BI|
+---------------+------+---------+
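spark-xml infers the schema from the data, so each <employee> element becomes one typed row. As a rough mental model, the pure-Scala sketch below (the case class and field names are ours, not part of spark-xml) mirrors that mapping and shows the kind of aggregation the resulting DataFrame supports, e.g. df.groupBy("project").sum("cost-to-project"):

```scala
// One <employee> element maps to one record with typed fields
// (spark-xml would infer cost-to-project as a numeric column).
case class Employee(name: String, project: String, costToProject: Long)

val employees = Seq(
  Employee("Arya", "Ingestion", 2000L),
  Employee("Donald", "ML", 3000L),
  Employee("Emily", "BI", 8000L),
  Employee("Riley", "ETL", 30000L),
  Employee("Donald", "BI", 5000L)
)

// Plain-collections equivalent of df.groupBy("project").sum("cost-to-project")
val costByProject: Map[String, Long] =
  employees.groupBy(_.project).map { case (p, es) =>
    p -> es.map(_.costToProject).sum
  }
```

This is only a model of the row structure; the real DataFrame is distributed and lazily evaluated.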
Write XML files in Spark
Let’s now add an index column to the DataFrame and save the contents back to the XML file. Here is the code.
val dfidx = df.withColumn("index", monotonically_increasing_id)

dfidx.show
+---------------+------+---------+-----+
|cost-to-project|  name|  project|index|
+---------------+------+---------+-----+
|           2000|  Arya|Ingestion|    0|
|           3000|Donald|       ML|    1|
|           8000| Emily|       BI|    2|
|          30000| Riley|      ETL|    3|
|           5000|Donald|       BI|    4|
+---------------+------+---------+-----+

dfidx.write
  .format("com.databricks.spark.xml")
  .option("rowTag", "employee")
  .mode("overwrite")
  .save("file:///home/osboxes/employee_idx.xml")
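We get the consecutive IDs 0 through 4 here only because this small file is read into a single partition; in general monotonically_increasing_id guarantees unique, monotonically increasing IDs, not consecutive ones. The documented layout puts the partition index in the upper 31 bits and the record position within the partition in the lower 33 bits, which this small sketch reproduces (the function name is ours, for illustration only):

```scala
// Sketch of how monotonically_increasing_id composes an ID:
// upper 31 bits = partition index, lower 33 bits = row position in partition.
def sketchId(partition: Int, rowInPartition: Long): Long =
  (partition.toLong << 33) | rowInPartition

// Partition 0 yields 0, 1, 2, ...; partition 1 starts at 2^33,
// which is why IDs jump across partition boundaries.
val p0 = (0L until 3L).map(sketchId(0, _))
val p1 = (0L until 2L).map(sketchId(1, _))
```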
Here is the output XML. The root element ROWS is added automatically; it can be changed with the rootTag write option.
<ROWS>
  <employee>
    <cost-to-project>2000</cost-to-project>
    <name>Arya</name>
    <project>Ingestion</project>
    <index>0</index>
  </employee>
  <employee>
    <cost-to-project>3000</cost-to-project>
    <name>Donald</name>
    <project>ML</project>
    <index>1</index>
  </employee>
  <employee>
    <cost-to-project>8000</cost-to-project>
    <name>Emily</name>
    <project>BI</project>
    <index>2</index>
  </employee>
  <employee>
    <cost-to-project>30000</cost-to-project>
    <name>Riley</name>
    <project>ETL</project>
    <index>3</index>
  </employee>
  <employee>
    <cost-to-project>5000</cost-to-project>
    <name>Donald</name>
    <project>BI</project>
    <index>4</index>
  </employee>
</ROWS>