
How to read and write XML files with Spark?


We will be using the spark-xml package from Databricks to read and write XML files with Spark.

Here is how we start spark-shell with the spark-xml package.

spark-shell --packages com.databricks:spark-xml_2.11:0.10.0
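
The --packages flag works the same way with spark-submit. If you are building a standalone application with sbt instead, the equivalent dependency declaration would look roughly like this (assuming Scala 2.11 to match the shell command above).

// build.sbt - resolves to spark-xml_2.11 when scalaVersion is set to 2.11.x
libraryDependencies += "com.databricks" %% "spark-xml" % "0.10.0"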

Read XML files in Spark

Here is the content of our XML file.

    <employees>
        <employee>
              <name>Arya</name>
              <project>Ingestion</project>
              <cost-to-project>2000</cost-to-project>
        </employee>
        <employee>
              <name>Donald</name>
              <project>ML</project>
              <cost-to-project>3000</cost-to-project>
         </employee>
        <employee>
              <name>Emily</name>
              <project>BI</project>
              <cost-to-project>8000</cost-to-project>
         </employee>
        <employee>
              <name>Riley</name>
              <project>ETL</project>
              <cost-to-project>30000</cost-to-project>
         </employee>
        <employee>
              <name>Donald</name>
              <project>BI</project>
              <cost-to-project>5000</cost-to-project>
         </employee>
    </employees>

Here is the code to read the above file. Each record in the XML is wrapped in an <employee> tag, so we set rowTag to employee.

val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "employee").load("file:///home/osboxes/employee.xml")

scala> df.show
+---------------+------+---------+
|cost-to-project|  name|  project|
+---------------+------+---------+
|           2000|  Arya|Ingestion|
|           3000|Donald|       ML|
|           8000| Emily|       BI|
|          30000| Riley|      ETL|
|           5000|Donald|       BI|
+---------------+------+---------+
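
spark-xml infers the schema from the data by default, which is what produced the column types above. If you want to skip inference or pin down the types yourself, you can pass an explicit schema to the reader. Below is a minimal sketch, assuming the same file path and the column names shown above.

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Explicit schema matching the <employee> elements
val employeeSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("project", StringType, nullable = true),
  StructField("cost-to-project", LongType, nullable = true)
))

val dfWithSchema = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "employee")
  .schema(employeeSchema)
  .load("file:///home/osboxes/employee.xml")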

Write XML files in Spark

Let’s now add an index column to the DataFrame and write the result out as a new XML file. Here is the code.

val dfidx = df.withColumn("index", monotonically_increasing_id)

dfidx.show

+---------------+------+---------+-----+
|cost-to-project|  name|  project|index|
+---------------+------+---------+-----+
|           2000|  Arya|Ingestion|    0|
|           3000|Donald|       ML|    1|
|           8000| Emily|       BI|    2|
|          30000| Riley|      ETL|    3|
|           5000|Donald|       BI|    4|
+---------------+------+---------+-----+
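
Note that monotonically_increasing_id guarantees unique, increasing values, but not consecutive ones: with more than one partition the generated ids will have gaps. Also, if you run this in a standalone application rather than spark-shell, the function needs to be imported explicitly, roughly like this.

import org.apache.spark.sql.functions.monotonically_increasing_id

// Unique, increasing (but not necessarily consecutive) id per row
val dfidx = df.withColumn("index", monotonically_increasing_id())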


dfidx.write.format("com.databricks.spark.xml").option("rowTag", "employee").mode("overwrite").save("file:///home/osboxes/employee_idx.xml")

Here is the output XML. spark-xml adds ROWS as the root element by default.

<ROWS>
    <employee>
        <cost-to-project>2000</cost-to-project>
        <name>Arya</name>
        <project>Ingestion</project>
        <index>0</index>
    </employee>

    <employee>
        <cost-to-project>3000</cost-to-project>
        <name>Donald</name>
        <project>ML</project>
        <index>1</index>
    </employee>

    <employee>
        <cost-to-project>8000</cost-to-project>
        <name>Emily</name>
        <project>BI</project>
        <index>2</index>
    </employee>

    <employee>
        <cost-to-project>30000</cost-to-project>
        <name>Riley</name>
        <project>ETL</project>
        <index>3</index>
    </employee>

    <employee>
        <cost-to-project>5000</cost-to-project>
        <name>Donald</name>
        <project>BI</project>
        <index>4</index>
    </employee>

</ROWS>
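
If you want a different root element, spark-xml lets you set it with the rootTag option on the writer (rowTag controls the per-row element, as before). A sketch using the same paths as above:

dfidx.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "employees")
  .option("rowTag", "employee")
  .mode("overwrite")
  .save("file:///home/osboxes/employee_idx.xml")

Also note that, like any Spark write, save produces a directory at the given path containing one or more part files rather than a single XML file.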