Wednesday, December 2, 2020

Stack Abuse: Reading and Writing XML Files in Python with Pandas

Introduction

XML (Extensible Markup Language) is a markup language used to store structured data. The Pandas data analysis library provides functions to read/write data for most of the file types.

For example, it includes read_csv() and to_csv() for interacting with CSV files. However, Pandas does not include any methods to read and write XML files.

In this article, we will take a look at how we can use other modules to read data from an XML file, and load it into a Pandas DataFrame. We'll also take data from a Pandas DataFrame and write it to an XML file.

Reading XML with Pandas

Let's have a look at a few ways to read XML data and put it in a Pandas DataFrame.

For this section, we'll use one set of input data for every script. Save the following XML in a file called properties.xml:

<?xml version="1.0" encoding="UTF-8"?>
<root>
        <bathrooms>
                <n35237 type="number">1.0</n35237>
                <n32238 type="number">3.0</n32238>
                <n44699 type="number">nan</n44699>
        </bathrooms>
        <price>
                <n35237 type="number">7020000.0</n35237>
                <n32238 type="number">10000000.0</n32238>
                <n44699 type="number">4128000.0</n44699>
        </price>
        <property_id>
                <n35237 type="number">35237.0</n35237>
                <n32238 type="number">32238.0</n32238>
                <n44699 type="number">44699.0</n44699>
        </property_id>
</root>

Reading with xml.etree.ElementTree

The xml.etree.ElementTree module comes built-in with Python. It provides functionality for parsing and creating XML documents. ElementTree represents the XML document as a tree. We can move across the document using nodes which are elements and sub-elements of the XML file.

In this approach, we read the file content in a variable and use ET.XML() to parse the XML document from the string constant. We will loop across each child and sub child maintaining a list of data they contain. Meanwhile, writing child tags for the DataFrame column. Then we write this data into a DataFrame.

Note: When reading data from XML, we have to transpose the DataFrame, as the data list's sub-elements are written in columns.

Let's look at the code to demonstrate use of xml.etree.ElementTree:

import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('properties.xml', 'r').read()  # Read file
root = ET.XML(xml_data)  # Parse XML

data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)

df = pd.DataFrame(data).T  # Write in DF and transpose it
df.columns = cols  # Update column names
print(df)

The above code will produce this output (varies on the input file used):

  bathrooms       price property_id
0       1.0   7020000.0     35237.0
1       3.0  10000000.0     32238.0
2       nan   4128000.0     44699.0

Reading with lxml

The lxml library is a Python binding for the C libraries libxml2 and libxslt. It also extends the native ElementTree module. As this is a third-party module, you'll need to install it with pip like this:

$ pip install lxml

Unlike ElementTree, we don't read the file data and parse it. We can directly use objectify.parse() and give it the path to XML file. To get the root element, we will use getroot() on the parsed XML data.

Now we can loop through the children elements of the root node and write them into a Python list. Like before, we'll create a DataFrame using the data list, and transpose the DataFrame.

Let's look at the code to create a Pandas DataFrame using lxml:

from lxml import objectify
import pandas as pd

xml_data = objectify.parse('properties.xml')  # Parse XML data
root = xml_data.getroot()  # Root element

data = []
cols = []
for i in range(len(root.getchildren())):
    child = root.getchildren()[i]
    data.append([subchild.text for subchild in child.getchildren()])
    cols.append(child.tag)

df = pd.DataFrame(data).T  # Create DataFrame and transpose it
df.columns = cols  # Update column names
print(df)

If we run this on Python interpreter, we see the following output:

  bathrooms       price property_id
0       1.0   7020000.0     35237.0
1       3.0  10000000.0     32238.0
2       nan   4128000.0     44699.0

Reading with xmltodict

The xmltodict module converts the XML data into a Python dictionary as the name suggests. Like lxml, this is a third-party module we need to install with pip:

$ pip install xmltodict

Like we've done before, we read the XML contents into a variable. We give this data in parse() method which returns a dictionary of the XML data. It will be a nested dictionary that has elements and sub-elements of the XML file. We can loop through the elements and write them into a data list that we use to create a DataFrame.

Let's have a look at the code to parse XML data to create a DataFrame using xmltodict:

import xmltodict
import pandas as pd

xml_data = open('properties.xml', 'r').read()  # Read data
xmlDict = xmltodict.parse(xml_data)  # Parse XML

cols = xmlDict['root'].keys()
data = []

for i in xmlDict['root']:
    child = xmlDict['root'][i]
    data.append([child[subchild]['#text'] for subchild in child])

df = pd.DataFrame(data).T  # Create DataFrame and transpose it.
df.columns = cols
print(df)

If we run the above code, we can see the output as:

  bathrooms       price property_id
0       1.0   7020000.0     35237.0
1       3.0  10000000.0     32238.0
2       nan   4128000.0     44699.0

Note: The xmltodict library is not recommended for huge XML files as many developers have observed performance drops. The lxml library is considered to be the fastest at working with XML, even faster than the included xml.etree.ElementTree.

Use what's best for your project, and if performance is critical you should run tests with each library.

Writing XML with Pandas

Let's look at various ways to write Pandas DataFrame in XML file. Every script we use below will create a new file called coordinates.xml with the following contents:

<root>
  <A>
    <X>1.3</X>
    <Y>2.6</Y>
    <Z>2.1</Z>
  </A>
  <B>
    <X>1.4</X>
    <Y>1.4</Y>
    <Z>5.6</Z>
  </B>
  <C>
    <X>5.2</X>
    <Y>4.6</Y>
    <Z>4.6</Z>
  </C>
</root>

Writing with Built-in write() Function

We can use the included write() function for files to write a DataFrame as an XML file. To accomplish this, we'll keep a list of the XML data, such that each item represents a line in XML. We will then iterate over the DataFrame and write the data with appropriate opening and closing tags of XML in the data list.

Once that's complete, we iterate over the list once more to write the data into the XML file. Here's the code that shows the use of write():

import pandas as pd

df = pd.DataFrame([[1.3, 1.4, 5.2],
                   [2.6, 1.4, 4.6],
                   [2.1, 5.6, 4.6]],
                  columns=['A', 'B', 'C'],
                  index=['X', 'Y', 'Z'])

xml_data = ['<root>']
for column in df.columns:
    xml_data.append('<{}>'.format(column))  # Opening element tag
    for field in df.index:
        # writing sub-elements
        xml_data.append('<{0}>{1}</{0}>'.format(field, df[column][field]))
    xml_data.append('</{}>'.format(column))  # Closing element tag
xml_data.append('</root>')

with open('coordinates.xml', 'w') as f:  # Writing in XML file
    for line in xml_data:
        f.write(line)

Running this code will produce a file named coordinates.xml in the current directory.

Writing XML Files with xml.etree.ElementTree

The default xml.etree.ElementTree module can be used to store data as XML and convert it to a string so it can be written to a file.

Our first step is to create the root element. We then iterate over the columns and rows of the DataFrame, adding them as elements and sub-elements in the ElementTree. Then we convert the ElementTree object's data into a binary string using the tostring() method.

As the XML data is a binary string, we decode it into UTF-8 before writing it into the file.

The following code uses xml.etree.ElementTree to write a DataFrame as an XML file:

import xml.etree.ElementTree as ET
import pandas as pd

df = pd.DataFrame([[1.3, 1.4, 5.2],
                   [2.6, 1.4, 4.6],
                   [2.1, 5.6, 4.6]],
                  columns=['A', 'B', 'C'],
                  index=['X', 'Y', 'Z'])
header = df.columns

root = ET.Element('root')  # Root element

for column in df.columns:
    entry = ET.SubElement(root, column)  # Adding element
    for row in df.index:
        schild = row
        child = ET.SubElement(entry, schild)  # Adding sub-element
        child.text = str(df[column][schild])

xml_data = ET.tostring(root)  # binary string
with open('coordinates.xml', 'w') as f:  # Write in file as utf-8
    f.write(xml_data.decode('utf-8'))

Like before, running this script will create a coordinates.xml file with the expected output.

Writing XML Files with lxml

Using lxml is similar to how we used xml.etree.ElementTree. We begin by creating an etree object with the root element of the file we're creating. We then iterate over the DataFrame, adding columns and rows as elements and sub-elements of the tree. Lastly, we use the tostring() method to get the etree as a binary string. We write the file after decoding the binary string into UTF-8.

Here is the code to write DataFrame as XML using lxml:

from lxml import etree as et
import pandas as pd

root = et.Element('root')  # Create root element
df = pd.DataFrame([[1.3, 1.4, 5.2],
                   [2.6, 1.4, 4.6],
                   [2.1, 5.6, 4.6]],
                  columns=['A', 'B', 'C'],
                  index=['X', 'Y', 'Z'])

for column in df.columns:
    entry = et.SubElement(root, column)  # Writing element
    for row in df.index:
        schild = row
        child = et.SubElement(entry, schild)  # Writing sub-elements
        child.text = str(df[column][schild])

xml_data = et.tostring(root)  # binary string
with open('coordinates.xml', 'w') as f:  # Write in XML file as utf-8
    f.write(xml_data.decode('utf-8'))

Upon successful completion, you will see the coordinates.xml with the XML coordinates.

Conclusion

This tutorial shows various ways we can read and write XML data with Pandas DataFrames. You can read data with the built-in xml.etree.ElementTree module, as well as two third-party modules: lxml and xmltodict.

For writing a Pandas DataFrame to an XML file, we have used conventional file write() with lists, the xml.etree.ElementTree module, and lxml. Given that manipulating XML strings directly to write a file is more prone to human error, xml.etree.ElementTree and lxml are the preferable solutions for exporting a DataFrame to XML.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...