Thursday, December 24, 2020

Stack Abuse: Reading and Writing HTML Tables with Pandas

Introduction

Hypertext Markup Language (HTML) is the standard markup language for building web pages. We can render tabular data using HTML's <table> element. The Pandas data analysis library provides functions like read_html() and to_html() so we can import and export data to DataFrames.

In this article, we will learn how to read tabular data from an HTML file and load it into a Pandas DataFrame. We'll also learn how to write data from a Pandas DataFrame and to an HTML file.

Note: In this article, we will read and write HTML <table> elements. This article does not include parsing the entire HTML file.

Reading HTML

We can read tables of an HTML file using the read_html() function. This function read tables of HTML files as Pandas DataFrames. It can read from a file or a URL.

Let's have a look at each input source one by one.

Reading HTML Data From a File

For this section, we'll use one set of input data. One table contains programming languages and the year of their creation. The other table has land sizes and their cost in USD.

Save the following HTML content in a file called table_data.html:

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <title>Table Data</title>
</head>

<body>
  <table>
    <thead>
      <tr>
        <th>Programming Language</th>
        <th>Creator</th>
        <th>Year</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>C</td>
        <td>Dennis Ritchie</td>
        <td>1972</td>
      </tr>
      <tr>
        <td>Python</td>
        <td>Guido Van Rossum</td>
        <td>1989</td>
      </tr>
      <tr>
        <td>Ruby</td>
        <td>Yukihiro Matsumoto</td>
        <td>1995</td>
      </tr>
    </tbody>
  </table>
  <table>
    <thead>
      <tr>
        <th>
          Area (sq.ft)
        </th>
        <th>
          Price (USD)
        </th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>
          12000
        </td>
        <td>
          500
        </td>
      </tr>
      <tr>
        <td>
          32000
        </td>
        <td>
          700
        </td>
      </tr>

    </tbody>
  </table>
</body>

</html>

Pandas needs help from another library, called lxml to parse HTML and XML files. For the read_html() function to work you need to install lxml:

$ pip install lxml

Once lmxl is installed, we can use the read_html() function. It returns a list of DataFrames, where each DataFrame is an entire table element of the given HTML file. We access each table as a DataFrame by indexing the list.

The code below demonstrates use of read_html() function to read tables from HTML file:

import pandas as pd

tables = pd.read_html('table_data.html')
print('Tables found:', len(tables))
df1 = tables[0]  # Save first table in variable df1
df2 = tables[1]  # Saving next table in variable df2

print('First Table')
print(df1)
print('Another Table')
print(df2)

Note: While you need lxml installed, you don't need to import it in your program for Pandas to work.

Running the above code on a Python interpreter will produce the following output:

Tables found: 2
First Table
  Programming Language             Creator  Year
0                    C      Dennis Ritchie  1972
1               Python    Guido Van Rossum  1989
2                 Ruby  Yukihiro Matsumoto  1995
Another Table
   Area (sq.ft)  Price (USD)
0         12000          500
1         32000          700

Reading HTML Data From URL

Just as we read table elements from an HTML file, we can also read table elements from an HTML web page into a DataFrame with read_html(). However, in place of the file name, we will provide a URL like this:

read_html('https://en.wikipedia.org/wiki/Python_(programming_language)')

And it will return a list of DataFrames where each DataFrame represents a table element from the given URL.

Here is an example code to read table elements from a website URL using Pandas:

import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)')
print('Tables found:', len(tables))
df1 = tables[0]  # Save first table in variable df1
print('First Table')
print(df1.head())  # To print first 5 rows

If we successfully run the above code, we can see the output as:

Tables found: 10
First Table
                0                                                  1
0             NaN                                                NaN
1        Paradigm  Multi-paradigm: functional, imperative, object...
2     Designed by                                   Guido van Rossum
3       Developer                         Python Software Foundation
4  First appeared                              1991; 29 years ago[1]

Reading HTML Data From URL That Requires Authentication

We know that we can read table elements from a website. However, when the site requires authentication, the code runs into the following exception:

raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: UNAUTHORIZED

To read data from such URLs we will use the requests module. You can install it with pip:

$ pip install requests

Now, we will use the get() method from the requests library to make a request to a website URL while providing the optional auth parameter if the site requires authentication.

This method returns a response object from the web page. We can check the status code (to make sure the content is definitely there) and get the text from the response object, and then convert the table to a DataFrame.

Let's see an example of using requests to get data that requires authentication. For this purpose, we are using https://httpbin.org:

import requests

r = requests.get('https://httpbin.org/basic-auth/john/johnspassword', auth=('john', 'johnspassword'))

print(r.status_code)
print(r.text)

On executing the above code, we can see the following output:

200
{
  "authenticated": true, 
  "user": "john"
}

This shows that we successfully accessed the web page content of an authenticated URL. However, this website only contains JSON data and we need HTML table elements as DataFrames.

Let's stick to the earlier URL and use requests to read HTML tables as DataFrames. While the previous site was public, the steps to access authenticated content is the same.

Once we get a response, we can pass the r.text to read_html() method. And as usual, we'll get a list of tables it contains as DataFrames:

import pandas as pd
import requests

# Can use auth parameter for authenticated URLs
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)',
                 auth=('john', 'johnspassword'))
tables = pd.read_html(r.text)
print('Tables found:', len(tables))
df1 = tables[0]
print('First Table')
print(df1.head())

Running this code will generate the following output:

Tables found: 10
First Table
                0                                                  1
0             NaN                                                NaN
1        Paradigm  Multi-paradigm: functional, imperative, object...
2     Designed by                                   Guido van Rossum
3       Developer                         Python Software Foundation
4  First appeared                              1991; 29 years ago[1]

Writing HTML Tables with Python's Pandas

We have successfully read data from HTML tables. Let's write Pandas DataFrame in an HTML file. This can be achieved by using the to_html() method.

The to_html() takes the path of the file you want the data exported to. If you don't provide an absolute path, it would save a file relative to the current directory.

You can export a DataFrame to an HTML table like this:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html')

This code will produce the following file write_html.html in the current directory:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>3</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>4</td>
    </tr>
  </tbody>
</table>

Note that the export isn't an entire HTML document, but just the HTML table itself.

Writing Styled HTML Tables with Python's Pandas

As we can see, by default the table border is 1, alignment is right and it has also DataFrame indices in <th> tags. We can change this default structure by providing a few optional parameters.

Hide Index

If we don't want to include the index in the table output, we can set index=False in the to_html():

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', index=False)

This code produces write_html.html file with the following content:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>3</td>
    </tr>
    <tr>
      <td>2</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
Changing Table Border

The default table border is 1 pixel. To change this default, we can set the border parameter to a value in pixels.

The following code changes the border to 3 pixels:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', border=3)

The generated file now sets the border attribute of the table to "3":

<table border="3" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>3</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
Justify Text

By default, the table's head text is right-aligned. We change this alignment with the justify parameter. For example, doing justify="center" will add style="text-align: center;" in the <tr> tag of the <thead> tag.

Let's try aligning the head text to center and see results:

import pandas as pd

df = pd.DataFrame({'A': [1, 'AAA'], 'B': ['BBB', 4]})
df.to_html('write_html.html', justify='center')

The table created by above code looks like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>BBB</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AAA</td>
      <td>4</td>
    </tr>
  </tbody>
</table>

The text of the table head is now beautifully aligned to the center.

Conclusion

In this tutorial, we have learned how to import and export HTML table data with Pandas DataFrames. We loaded HTML table data from files, as well as from webpage URLs. In the case of authenticated URLs, we used the requests module to authenticate and retrieve the site's data, and then pass the response text into the read_html() function.

We have also written a Pandas DataFrame as an HTML file using the to_html() function. We then styled the generated table by passing a few optional parameters like index, border and justify. This makes it easy to write the data of DataFrame in a presentational manner.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...