Introduction
Extensible Markup Language (XML) is a markup language that's popular because of the way it structures data. It is widely used for data transmission (representing serialized objects) and for configuration files.
Despite JSON's rising popularity, you can still find XML in Android's manifest file, in Java/Maven build tools, and in SOAP APIs on the web. Parsing XML is therefore still a common task for developers.
In Python, we can read and parse XML by leveraging two libraries: BeautifulSoup and LXML.
In this guide, we’ll take a look at extracting and parsing data from XML files with BeautifulSoup and LXML, and store the results using Pandas.
Setting up LXML and BeautifulSoup
We first need to install both libraries. Create a new folder in your workspace, set up a virtual environment, and install the libraries:
$ mkdir xml_parsing_tutorial
$ cd xml_parsing_tutorial
$ python3 -m venv env # Create a virtual environment for this project
$ . env/bin/activate # Activate the virtual environment
$ pip install lxml beautifulsoup4 # Install both Python packages
Now that we have everything set up, let's do some parsing!
Parsing XML with lxml and BeautifulSoup
Parsing always depends on the structure of the underlying file, so there's no silver bullet that works for all files. BeautifulSoup builds the parse tree automatically, but which elements you extract from it depends on the task.
Thus, it's best to learn parsing with a hands-on approach. Save the following XML into a file named teachers.xml in your working directory:
<?xml version="1.0" encoding="UTF-8"?>
<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
    <teacher>
        <name>Cassie Stone</name>
        <age>24</age>
        <subject>Science</subject>
    </teacher>
    <teacher>
        <name>Derek Brandon</name>
        <age>32</age>
        <subject>History</subject>
    </teacher>
</teachers>
The <teachers> tag is the root of the XML document. Each <teacher> tag is a child (sub-element) of <teachers> and holds information about a single person. The <name>, <age>, and <subject> tags are children of the <teacher> tag and grandchildren of the <teachers> tag.
The first line of the sample document above, <?xml version="1.0" encoding="UTF-8"?>, is called the XML prolog. When present, it must come at the very beginning of the XML file, but including it is optional.
The XML prolog shown above indicates the version of XML used and the type of character encoding. In this case, the characters in the XML document are encoded in UTF-8.
Now that we understand the structure of the XML file, we can parse it. Create a new file called teachers.py in your working directory, and import the BeautifulSoup library:
from bs4 import BeautifulSoup
Note: As you may have noticed, we didn't import lxml! BeautifulSoup uses lxml behind the scenes when it is asked for its XML parser, so importing it separately isn't necessary. It isn't installed as part of BeautifulSoup, though, which is why we installed it explicitly.
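Because lxml is a separate install, it can be worth confirming that BeautifulSoup is able to find it. The sketch below is one possible check, assuming bs4 is installed; the tiny XML string and the xml_parser_available name are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    # Asking for the 'xml' feature makes BeautifulSoup look for lxml.
    BeautifulSoup("<root><child>ok</child></root>", "xml")
    xml_parser_available = True
except FeatureNotFound:
    # Raised when no installed parser provides the requested feature.
    xml_parser_available = False

print(f"XML parser available: {xml_parser_available}")
```

If this reports False, installing lxml into the active virtual environment is the likely fix.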
Now let’s read the contents of the XML file we created and store it in a variable called soup so we can begin parsing:
with open('teachers.xml', 'r') as f:
    file = f.read()

# 'xml' is the parser used. For HTML files, which BeautifulSoup is typically used for, it would be 'html.parser'.
soup = BeautifulSoup(file, 'xml')
The soup variable now has the parsed contents of our XML file. We can use this variable and the methods attached to it to retrieve the XML information with Python code.
Let’s say we want to view only the names of the teachers from the XML document. We can get that information with a few lines of code:
names = soup.find_all('name')
for name in names:
    print(name.text)
Running python teachers.py would give us:
Sam Davies
Cassie Stone
Derek Brandon
The find_all() method returns a list of all the tags that match the name passed in as an argument. As shown in the code above, soup.find_all('name') returns all the <name> tags in the XML file. We then iterate over these tags and print their text property, which contains the tags' values.
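Since every <teacher> groups its own fields, another option is to iterate over the <teacher> tags and call find() on each one, which keeps each person's values together. The following is a minimal sketch of that idea, assuming bs4 and lxml are installed; it embeds the sample data as a string (without the optional prolog) so it runs on its own, and the teachers list is an illustrative name.

```python
from bs4 import BeautifulSoup

# The same sample data as teachers.xml, inlined so the sketch is self-contained.
xml_doc = """
<teachers>
    <teacher><name>Sam Davies</name><age>35</age><subject>Maths</subject></teacher>
    <teacher><name>Cassie Stone</name><age>24</age><subject>Science</subject></teacher>
    <teacher><name>Derek Brandon</name><age>32</age><subject>History</subject></teacher>
</teachers>
"""

soup = BeautifulSoup(xml_doc, "xml")

teachers = []
for teacher in soup.find_all("teacher"):
    # find() searches only inside the current <teacher> element.
    teachers.append({
        "name": teacher.find("name").text,
        "age": int(teacher.find("age").text),
        "subject": teacher.find("subject").text,
    })

for record in teachers:
    print(record)
```

Grouping by parent element avoids relying on separate lists staying in the same order, which matters if some element is missing a child tag.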
Display Parsed Data in a Table
Let's take things one step further: we'll parse all the contents of the XML file and display them in a tabular format.
Let's rewrite the teachers.py file with:
from bs4 import BeautifulSoup

# Opens and reads the xml file we saved earlier
with open('teachers.xml', 'r') as f:
    file = f.read()

# Initializing soup variable
soup = BeautifulSoup(file, 'xml')

# Storing <name> tags and elements in 'names' variable
names = soup.find_all('name')

# Storing <age> tags and elements in 'ages' variable
ages = soup.find_all('age')

# Storing <subject> tags and elements in 'subjects' variable
subjects = soup.find_all('subject')

# Displaying data in tabular format
print('-'.center(35, '-'))
print('|' + 'Name'.center(15) + '|' + ' Age ' + '|' + 'Subject'.center(11) + '|')
for i in range(0, len(names)):
    print('-'.center(35, '-'))
    print(
        f'|{names[i].text.center(15)}|{ages[i].text.center(5)}|{subjects[i].text.center(11)}|')
print('-'.center(35, '-'))
The output of the code above would look like this:
-----------------------------------
|      Name     | Age |  Subject  |
-----------------------------------
|   Sam Davies  |  35 |   Maths   |
-----------------------------------
|  Cassie Stone |  24 |  Science  |
-----------------------------------
| Derek Brandon |  32 |  History  |
-----------------------------------
Congrats! You just parsed your first XML file with BeautifulSoup and LXML! Now that you're more comfortable with the theory and the process, let's try a more real-world example.
We've formatted the data as a table as a precursor to storing it in a more versatile data structure: in the upcoming mini-project, we'll store the data in a Pandas DataFrame.
If you aren't already familiar with DataFrames - read our Python with Pandas: Guide to DataFrames!
Parsing an RSS Feed and Storing the Data to a CSV
In this section, we'll parse an RSS feed of The New York Times News, and store that data in a CSV file.
RSS is short for Really Simple Syndication. An RSS feed is a file, written in XML, that contains a summary of updates from a website. In this case, the RSS feed of The New York Times contains a summary of daily news updates on their website. This summary contains links to news releases, links to article images, descriptions of news items, and more. Website owners also offer RSS feeds as a courtesy, so that people can get the data without scraping the website.
[Figure: a snapshot of an RSS feed from The New York Times]
You can gain access to different New York Times RSS feeds of different continents, countries, regions, topics and other criteria via this link.
It's important to see and understand the structure of the data before you can begin parsing it. The data we would like to extract from the RSS feed about each news article is:
- Globally Unique Identifier (GUID)
- Title
- Publication Date
- Description
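To make the target structure concrete, here is a small sketch that parses a hand-written RSS fragment containing the four fields listed above. The fragment is invented for illustration and is not taken from the actual New York Times feed, which contains more tags per item; the sketch assumes bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified RSS fragment with a single <item>.
sample_rss = """
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <guid>https://example.com/articles/1</guid>
      <title>Example headline</title>
      <pubDate>Mon, 01 Jan 2024 12:00:00 +0000</pubDate>
      <description>A one-sentence summary of the article.</description>
    </item>
  </channel>
</rss>
"""

soup = BeautifulSoup(sample_rss, "xml")
item = soup.find("item")

# Searching from the <item> tag skips the channel-level <title>.
wanted = ("guid", "title", "pubDate", "description")
fields = {tag_name: item.find(tag_name).text for tag_name in wanted}

for key, value in fields.items():
    print(f"{key}: {value}")
```

Note that the feed itself also has a <title>, so the lookup starts from the <item> tag rather than from the whole document.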
Now that we're familiar with the structure and have clear goals, let's kick off our program! We'll need the requests library and the pandas library to retrieve the data and easily convert it to a CSV file.
If you haven't worked with requests before, read our Guide to Python's requests Module!
With requests, we can make HTTP requests to websites and read the responses. In this case, we can use it to retrieve the RSS feed (in XML) so BeautifulSoup can parse it. With pandas, we will be able to format the parsed data in a table, and finally store the table's contents in a CSV file.
In the same working directory, install requests and pandas (your virtual environment should still be active):
$ pip install requests pandas
In a new file, nyt_rss_feed.py, let's import our libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Then, let's make an HTTP request to The New York Times' server to get their RSS feed and retrieve its contents:
url = 'https://rss.nytimes.com/services/xml/rss/nyt/US.xml'
xml_data = requests.get(url).content
With the code above, we send the HTTP request and store the body of the response in the xml_data variable. The content attribute of a requests response holds that body as bytes.
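A slightly more defensive variant of the download step might set a timeout and fail early on HTTP errors. The sketch below is one way to do that; fetch_feed and its default timeout of 10 seconds are illustrative choices, not part of the original script.

```python
import requests


def fetch_feed(url, timeout=10):
    """Return the raw bytes of the feed at `url` (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # .content is bytes, which BeautifulSoup accepts directly.
    return response.content
```

Calling fetch_feed(url) with the feed URL would then stand in for the one-line requests.get(url).content.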
Now, create the following function to parse the XML data into a table in Pandas, with the help of BeautifulSoup:
def parse_xml(xml_data):
    # Initializing soup variable
    soup = BeautifulSoup(xml_data, 'xml')

    # Iterating through item tags and extracting elements
    all_items = soup.find_all('item')
    items_length = len(all_items)
    rows = []

    for index, item in enumerate(all_items):
        guid = item.find('guid').text
        title = item.find('title').text
        pub_date = item.find('pubDate').text
        description = item.find('description').text

        # Adding extracted elements as a row of the table.
        # Rows are collected in a list because DataFrame.append()
        # was removed in pandas 2.0.
        rows.append({
            'guid': guid,
            'title': title,
            'pubDate': pub_date,
            'description': description
        })
        print(f'Appending row {index + 1} of {items_length}')

    # Creating the table with one column per extracted element
    df = pd.DataFrame(rows, columns=['guid', 'title', 'pubDate', 'description'])
    return df
The function above parses the XML data from the HTTP response with BeautifulSoup, storing the parsed contents in a soup variable. The Pandas DataFrame holding the rows and columns for the data we parse is referenced via the df variable.
We then search the XML for all <item> tags. By iterating over them, we can extract each one's child tags: <guid>, <title>, <pubDate>, and <description>. Note how we use the find() method to get only the first matching object. The values of each child tag become a row in the Pandas table.
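One thing to keep in mind is that find() returns None when an item has no matching child, in which case accessing .text raises an AttributeError. A small helper can guard against that. The sketch below assumes bs4 and lxml are installed; child_text, its default argument, and the sample fragment are illustrative only.

```python
from bs4 import BeautifulSoup


def child_text(parent, tag_name, default=""):
    """Return the text of the first matching child tag, or `default`."""
    tag = parent.find(tag_name)
    # find() returns None when nothing matches, so check before reading text.
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical item that is missing its <description> tag.
sample = "<item><title>Example headline</title><guid>abc-123</guid></item>"
item = BeautifulSoup(sample, "xml").find("item")

title = child_text(item, "title")
description = child_text(item, "description")

print(repr(title))
print(repr(description))
```

Whether an empty string or some other default is appropriate depends on how the resulting table will be used.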
Now, at the end of the file after the function, add these two lines of code to call the function and create a CSV file:
df = parse_xml(xml_data)
df.to_csv('news.csv')
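By default, to_csv() also writes the DataFrame's index as an unnamed first column. If that column isn't wanted, passing index=False leaves it out. The sketch below illustrates this with a single made-up row written to a temporary directory, assuming pandas is installed.

```python
import os
import tempfile

import pandas as pd

columns = ["guid", "title", "pubDate", "description"]

# One hypothetical row, standing in for the parsed feed.
df = pd.DataFrame(
    [{
        "guid": "abc-123",
        "title": "Example headline",
        "pubDate": "Mon, 01 Jan 2024 12:00:00 +0000",
        "description": "A one-sentence summary of the article.",
    }],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "news.csv")
    # index=False omits the row index, so only the four columns are written.
    df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(list(loaded.columns))
```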
Run python nyt_rss_feed.py to create a new CSV file in your present working directory:
Appending row 1 of 24
Appending row 2 of 24
...
Appending row 24 of 24
The CSV file contains an index column followed by the guid, title, pubDate, and description columns, with one row per news item.
Note: Downloading data may take a bit depending on your internet connection and the RSS feed. Parsing data may take a bit depending on your CPU and memory resources as well. The feed we've used is fairly small so it should process quickly. Please be patient if you don't see results immediately.
Congrats, you've successfully parsed an RSS feed from The New York Times News and converted it to a CSV file!
Conclusion
In this guide, we learned how we can set up BeautifulSoup and LXML to parse XML files. We first got practice by parsing a simple XML file with teacher data, and then we parsed The New York Times's RSS feed, converting their data to a CSV file.
You can use these techniques to parse other XML you may encounter, and convert it into whatever format you need!
from Planet Python