
How to read and write XML files with Spark?


We will be using the spark-xml package from Databricks to read and write XML files with Spark.

Here is how we start spark-shell with the spark-xml package.

spark-shell --packages com.databricks:spark-xml_2.11:0.10.0
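
The --packages flag works the same way with spark-submit. If you are building a standalone application with sbt instead, the equivalent dependency declaration would look roughly like this (assuming Scala 2.11 to match the shell command above).

// build.sbt - resolves to spark-xml_2.11 when scalaVersion is set to 2.11.x
libraryDependencies += "com.databricks" %% "spark-xml" % "0.10.0"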

Read XML files in Spark

Here is the content of our XML file.

    <employees>
        <employee>
              <name>Arya</name>
              <project>Ingestion</project>
              <cost-to-project>2000</cost-to-project>
        </employee>
        <employee>
              <name>Donald</name>
              <project>ML</project>
              <cost-to-project>3000</cost-to-project>
         </employee>
        <employee>
              <name>Emily</name>
              <project>BI</project>
              <cost-to-project>8000</cost-to-project>
         </employee>
        <employee>
              <name>Riley</name>
              <project>ETL</project>
              <cost-to-project>30000</cost-to-project>
         </employee>
        <employee>
              <name>Donald</name>
              <project>BI</project>
              <cost-to-project>5000</cost-to-project>
         </employee>
    </employees>

Here is the code to read the above file. Each record in the XML is wrapped in an <employee> tag, so we set rowTag to employee.

val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "employee").load("file:///home/osboxes/employee.xml")

scala> df.show
+---------------+------+---------+
|cost-to-project|  name|  project|
+---------------+------+---------+
|           2000|  Arya|Ingestion|
|           3000|Donald|       ML|
|           8000| Emily|       BI|
|          30000| Riley|      ETL|
|           5000|Donald|       BI|
+---------------+------+---------+
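
spark-xml infers the schema from the data by default, which is what produced the column types above. If you want to skip inference or pin down the types yourself, you can pass an explicit schema to the reader. Below is a minimal sketch, assuming the same file path and the column names shown above.

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Explicit schema matching the <employee> elements
val employeeSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("project", StringType, nullable = true),
  StructField("cost-to-project", LongType, nullable = true)
))

val dfWithSchema = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "employee")
  .schema(employeeSchema)
  .load("file:///home/osboxes/employee.xml")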

Write XML files in Spark

Let’s now add an index column to the DataFrame and write the result out as a new XML file. Here is the code.

val dfidx = df.withColumn("index", monotonically_increasing_id)

dfidx.show

+---------------+------+---------+-----+
|cost-to-project|  name|  project|index|
+---------------+------+---------+-----+
|           2000|  Arya|Ingestion|    0|
|           3000|Donald|       ML|    1|
|           8000| Emily|       BI|    2|
|          30000| Riley|      ETL|    3|
|           5000|Donald|       BI|    4|
+---------------+------+---------+-----+
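
Note that monotonically_increasing_id guarantees unique, increasing values, but not consecutive ones: with more than one partition the generated ids will have gaps. Also, if you run this in a standalone application rather than spark-shell, the function needs to be imported explicitly, roughly like this.

import org.apache.spark.sql.functions.monotonically_increasing_id

// Unique, increasing (but not necessarily consecutive) id per row
val dfidx = df.withColumn("index", monotonically_increasing_id())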


dfidx.write.format("com.databricks.spark.xml").option("rowTag", "employee").mode("overwrite").save("file:///home/osboxes/employee_idx.xml")

Here is the output XML. spark-xml adds ROWS as the root element by default.

<ROWS>
    <employee>
        <cost-to-project>2000</cost-to-project>
        <name>Arya</name>
        <project>Ingestion</project>
        <index>0</index>
    </employee>

    <employee>
        <cost-to-project>3000</cost-to-project>
        <name>Donald</name>
        <project>ML</project>
        <index>1</index>
    </employee>

    <employee>
        <cost-to-project>8000</cost-to-project>
        <name>Emily</name>
        <project>BI</project>
        <index>2</index>
    </employee>

    <employee>
        <cost-to-project>30000</cost-to-project>
        <name>Riley</name>
        <project>ETL</project>
        <index>3</index>
    </employee>

    <employee>
        <cost-to-project>5000</cost-to-project>
        <name>Donald</name>
        <project>BI</project>
        <index>4</index>
    </employee>

</ROWS>
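
If you want a different root element, spark-xml lets you set it with the rootTag option on the writer (rowTag controls the per-row element, as before). A sketch using the same paths as above:

dfidx.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "employees")
  .option("rowTag", "employee")
  .mode("overwrite")
  .save("file:///home/osboxes/employee_idx.xml")

Also note that, like any Spark write, save produces a directory at the given path containing one or more part files rather than a single XML file.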