Monday, October 21, 2019

Erik Marsja: Converting HTML to a Jupyter Notebook

In this short post, we are going to learn how to turn the code from blog posts into Jupyter notebooks.

In this post, we are going to use the Python packages BeautifulSoup4, json, and urllib. We are going to use these packages to scrape code from webpages that wrap their code in <code></code> tags.

Note, this code is not intended to steal other people’s code. I created this script to scrape my own code and save it to Jupyter notebooks because I noticed that my code sometimes did not work as intended.

Install the Needed Packages

Now, we need to install BeautifulSoup4 before we continue converting HTML to Jupyter notebooks. Furthermore, we need to install lxml, which we will later use as the parser.

How to Install Python Packages using conda

In this section, we are going to learn how to install the needed packages using the package manager conda. First, open up the Anaconda Powershell Prompt.

Now, we are ready to install BeautifulSoup4 and lxml.

conda install -c anaconda beautifulsoup4 lxml

How to Install Python Packages using Pip

It is, of course, possible to install the packages using pip as well:

pip install beautifulsoup4 lxml

How to Convert HTML to a Jupyter Notebook

Now, when we have installed the Python packages, we can continue with scraping the code from a web page. In the example below, we will start by importing BeautifulSoup from bs4, as well as json and urllib.request. Next, we have the URL to the webpage that we want to convert to a Jupyter notebook.

from bs4 import BeautifulSoup
import json
import urllib.request

url = 'https://www.marsja.se/python-manova-made-easy-using-statsmodels/'

Setting a Custom User-Agent

In the next code chunk, we create the dictionary headers, which holds a custom User-Agent along with some other typical browser headers. This is because many websites (including the one you are reading now) will block requests that identify themselves as scripts, and sending browser-like headers will prevent that from happening.


headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                         '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

In the next code chunk, we are going to create a Request object. This object represents the HTTP request we are making.

Simply put, we create a Request object that specifies the URL we want to retrieve. Furthermore, we are calling urlopen using the Request object. This will, in turn, return a response object for the requested URL.

Finally, we call .read() on the response:

req = urllib.request.Request(url, headers=headers)
page = urllib.request.urlopen(req)
text = page.read()

We are now going to use BeautifulSoup4 to make it easier to parse the scraped HTML:

soup = BeautifulSoup(text, 'lxml')
soup
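
As a quick sanity check, we can count how many code tags BeautifulSoup found on the page. This step is an extra check, not part of the conversion itself:

# Number of <code> chunks found on the page
print(len(soup.find_all('code')))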

Jupyter Notebook Metadata

Now we’re ready to convert HTML to a Jupyter Notebook (this code was inspired by this code example). First, we start by creating some metadata for the Jupyter notebook.


In the code below, we start by creating a dictionary in which we will, later, store our scraped code elements. This is going to be the metadata for the Jupyter notebook we will create. Note, .ipynb files are simple JSON files containing text, code, rich media output, and metadata. The metadata is not required, but here we will add which language we are using (i.e., Python 3).

create_nb = {'nbformat': 4, 'nbformat_minor': 2,
             'cells': [], 'metadata':
             {"kernelspec":
              {"display_name": "Python 3",
               "language": "python", "name": "python3"}}}
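
Each code cell in the finished notebook is just another dictionary appended to the cells list. To make this concrete, here is a minimal sketch of an example code cell from a Jupyter Notebook, written as a Python dictionary (the source line is an illustrative placeholder, not from the scraped page):

# One code cell in .ipynb form; the source content is a made-up example
example_cell = {'cell_type': 'code',
                'execution_count': None,
                'metadata': {},
                'outputs': [],
                'source': ['import pandas as pd']}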

More information about the format of Jupyter notebooks can be found here.

Getting the Code Elements from the HTML

Second, we are creating a Python function called get_data. This function will take two arguments: first, the BeautifulSoup object we created earlier, and second, the content_class to search for content in. In the case of this particular WordPress blog, this will be post-content.

Next, we are looping through all div tags in the soup object, looking only at the post content. We then get all the code chunks by searching for all code tags.

In the final loop, we are going through each code chunk and creating a new dictionary (cell) in which we are going to store the code. The important part is where we add the text using the get_text method. Here we are getting our code from the code chunk and adding it to the dictionary.

Finally, we append each cell to the dictionary, create_nb, which will contain the data that we are going to save as a Jupyter notebook (i.e., the blog post we have scraped).

def get_data(soup, content_class):
    # Find the div(s) holding the post content
    for div in soup.find_all('div',
                             attrs={'class': content_class}):

        # All code chunks within the post content
        code_chunks = div.find_all('code')

        for chunk in code_chunks:
            # Each code chunk becomes one notebook code cell
            cell = {}
            cell['metadata'] = {}
            cell['outputs'] = []
            cell['source'] = [chunk.get_text()]
            cell['execution_count'] = None
            cell['cell_type'] = 'code'
            create_nb['cells'].append(cell)

get_data(soup, 'post-content')

with open('Python_MANOVA.ipynb', 'w') as jynotebook:
    jynotebook.write(json.dumps(create_nb))

Note, create_nb is the dictionary from which we will create our notebook. In the final two rows of the code chunk, we open a file (i.e., Python_MANOVA.ipynb) and write to this file using the json.dumps method.
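
To verify that the conversion worked, we can load the saved file back with json and inspect the cells. This is a quick, optional check:

import json

# Read the notebook back in and look at the scraped cells
with open('Python_MANOVA.ipynb') as jynotebook:
    nb = json.load(jynotebook)

print(len(nb['cells']))                   # number of code cells
print(''.join(nb['cells'][0]['source']))  # code in the first cell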

Here’s a Jupyter notebook containing all code above.
