Monday, November 11, 2019

Erik Marsja: Tutorial: How to Read Stata Files in Python with Pandas

The post Tutorial: How to Read Stata Files in Python with Pandas appeared first on Erik Marsja.

In this post, we are going to learn how to read Stata (.dta) files in Python.

As previously described (in the post on reading .sav files in Python), Python is a general-purpose language that can also be used for data analysis and data visualization. This post includes one example of data visualization.

One potential downside, however, is that Python is not really user-friendly for data storage. As a result, our data are often stored using Excel, SPSS, SAS, or similar software. See, for instance, the posts about reading .sav and SAS files in Python.

Can I Open a Stata File in Python?

We will soon answer, in practice, the question of how to open a Stata file in Python. There are two useful Python packages, Pyreadstat and Pandas, that enable us to open .dta files. If we are working with Pandas, the read_stata method will import a .dta file into a Pandas dataframe. Furthermore, the package Pyreadstat, which depends on Pandas, will also create a Pandas dataframe from a .dta file.

How to install Pyreadstat:

First, before learning how to read .dta files using Python and Pyreadstat, we need to install it. Like many Python packages, it can be installed using pip or conda:

  1. Install Pyreadstat using pip:
    Open up the Windows Command Prompt and type pip install pyreadstat
  2. Install using Conda:
    Open up the Anaconda Prompt, and type conda install -c conda-forge pyreadstat
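
After installing, it can be worth confirming that both packages are visible to the Python environment we intend to use. The following is a minimal sketch using only the standard library; the helper name check_packages is made up for this example and is not part of Pandas or Pyreadstat.

```python
from importlib import metadata


def check_packages(names=("pandas", "pyreadstat")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report what is installed in the current environment
for package, version in check_packages().items():
    print(f"{package}: {version or 'not installed'}")
```

If a package is reported as not installed, the pip or conda command above was most likely run in a different environment than the one running the script.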

How to Open a Stata file in Python

In this section, we are finally ready to learn how to read a .dta file in Python using the Python packages Pyreadstat and Pandas.

How to Load a Stata File in Python Using Pyreadstat

In this section, we are going to use pyreadstat to import a .dta file into a Pandas dataframe. First, we import pyreadstat:

import pyreadstat

Second, we are ready to import Stata files using the method read_dta. Note that when we pass a relative path to Pyreadstat, it is resolved against Python’s working directory. In the example below, FifthDayData.dta is located in a subdirectory (i.e., “SimData”).

dtafile = './SimData/FifthDayData.dta'
df, meta = pyreadstat.read_dta(dtafile)

In the code chunk above, two variables were created: df and meta. If we use the Python function type, we can see that “df” is a Pandas dataframe, while “meta” holds the metadata of the file.

This means that we can use all the available methods for Pandas dataframe objects. In the next line of code, we use the Pandas head method to print the first 5 rows.

df.head()
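
To see what read_dta returns without needing the tutorial's data file, here is a small self-contained sketch. It writes a made-up two-column dataframe (the column names and values are invented for illustration) to a temporary .dta file, reads it back, and inspects both returned objects. The try/except around the import is only there so the snippet does nothing where pyreadstat is missing.

```python
import os
import tempfile

import pandas as pd

try:
    import pyreadstat
except ImportError:  # pyreadstat is optional for this sketch
    pyreadstat = None


def inspect_read_dta():
    """Write a tiny made-up dataframe to .dta and read it back with pyreadstat."""
    if pyreadstat is None:
        return None
    toy = pd.DataFrame({"ID": [1, 2, 3], "Score": [0.5, 1.5, 2.5]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.dta")
        pyreadstat.write_dta(toy, path)
        df, meta = pyreadstat.read_dta(path)
    return df, meta


result = inspect_read_dta()
if result is not None:
    df, meta = result
    print(type(df))            # the data: a Pandas dataframe
    print(meta.column_names)   # metadata: variable names stored in the file
    print(meta.number_rows)    # metadata: number of rows in the file
```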

How to Read a Stata file with Python Using Pandas

In this section, we are going to read the same Stata file into a Pandas dataframe. However, this time we will use the Pandas read_stata method. This has the advantage that we can load the Stata file from a URL.

Before we continue, we need to import Pandas:

import pandas as pd

Now that we have done that, we can read the .dta file into a Pandas dataframe using the read_stata method. In this example, we are importing the same data file as in the previous example.

After we have loaded the Stata file using Python Pandas, we print the last 5 rows of the dataframe with the tail method.

dtafile = './SimData/FifthDayData.dta'

df = pd.read_stata(dtafile)
df.tail()
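
Unlike read_dta, read_stata returns only the dataframe. If the variable labels stored in the file are needed as well, one option is to ask read_stata for a reader object with iterator=True. The sketch below uses made-up data and labels written to a temporary file, so it runs without the tutorial's data file.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the tutorial's file
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                    "score": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]})

with tempfile.TemporaryDirectory() as tmp:
    dtafile = os.path.join(tmp, "toy.dta")
    toy.to_stata(dtafile, write_index=False,
                 variable_labels={"id": "Participant ID",
                                  "score": "Test score"})

    # read_stata returns only the dataframe
    df = pd.read_stata(dtafile)

    # iterator=True returns a StataReader, which exposes the variable labels
    with pd.read_stata(dtafile, iterator=True) as reader:
        labels = reader.variable_labels()

print(df.tail())   # last 5 of the 6 rows
print(labels)
```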

How to Read .dta Files from URL

In this section, we are going to use the Pandas read_stata method again. However, this time we will read the Stata file from a URL.

url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'

df = pd.read_stata(url)
df.head()

Note that the only thing we changed was the input: we passed a URL (url) instead of a local path, and Pandas read_stata imports the .dta file that the URL points to.
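
The Pandas documentation also lists file as a valid URL scheme for read_stata, which makes it possible to try the URL code path without network access. The following sketch assumes a made-up dataframe written to a temporary local file; the file:// URL stands in for a remote address.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data; the tutorial's real file lives on a remote server
toy = pd.DataFrame({"year": [1950, 1951, 1952],
                    "pchick": [69.5, 72.9, 73.1]})

with tempfile.TemporaryDirectory() as tmp:
    local_path = Path(tmp) / "toy.dta"
    toy.to_stata(local_path, write_index=False)

    # Path.as_uri() turns the absolute path into a file:// URL string
    url = local_path.as_uri()
    df = pd.read_stata(url)

print(df.head())
```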

Pandas Scatter Plot

Here, we will create a scatter plot in Python using the Pandas scatter method. This illustrates how we can work with data imported from .dta files.

df.plot.scatter(x='pchick', y='cpi')
[Figure: Scatter plot in Python]

How to Read Specific Columns from a Stata file

Both pyreadstat read_dta and Pandas read_stata enable us to read specific columns from a Stata file. Note that read_dta has the argument usecols, whereas Pandas read_stata has the argument columns.

Reading Specific Columns using Pyreadstat

In this example, we use the argument usecols, which takes a list of column names.

import pyreadstat

dtafile = './SimData/FifthDayData.dta'
df, meta = pyreadstat.read_dta(dtafile,
                               usecols=['index', 'Name', 'ID',
                                        'Gender'])
df.head()
[Figure: Dataframe created from Stata file]

Reading Specific Columns using Pandas read_stata

Here, we are going to use Pandas read_stata method and the argument columns. This argument, as in the example above, takes a list as input.

import pandas as pd
url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'

df = pd.read_stata(url,
                   columns=['year', 'pchick', 'time',
                            'meatex'])
df.head()
[Figure: Pandas read_stata to dataframe]

Note the behavior of Pandas read_stata: in the resulting dataframe, the columns appear in the same order as in the list we passed in.
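
The column ordering can be checked with a small sketch. The data here are made up and written to a temporary file; only the column names are borrowed from the example above.

```python
import os
import tempfile

import pandas as pd

# Made-up values; the file stores the columns as year, pchick, time, meatex
toy = pd.DataFrame({
    "year": [1950, 1951],
    "pchick": [69.5, 72.9],
    "time": [1, 2],
    "meatex": [0.1, 0.2],
})

with tempfile.TemporaryDirectory() as tmp:
    dtafile = os.path.join(tmp, "toy.dta")
    toy.to_stata(dtafile, write_index=False)

    # Request a subset of columns in a different order than in the file
    wanted = ["meatex", "year"]
    df = pd.read_stata(dtafile, columns=wanted)

print(list(df.columns))
```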

How to Save a Stata file

In this section of the Python Stata tutorial, we are going to save the dataframe as a .dta file. This is easily done: we just have to use the write_dta method when using Pyreadstat, or the dataframe method to_stata in Pandas.

Saving a dataframe as a Stata file using Pyreadstat

In the example below, we take the dataframe we created in the previous section and write it to a .dta file.

pyreadstat.write_dta(df, 'broilerdata_edited.dta')

Now, between the parentheses is where the important stuff happens. The first argument is our dataframe and the second is the file path. Note that giving only the filename, as in the example above, makes write_dta write the Stata file to the current working directory.

How to Save a dataframe as .dta with Pandas to_stata

In this example, we are going to save the same dataframe using Pandas to_stata:

df.to_stata('broilerdata_edited.dta')

As can be seen in the code above, the dataframe object has the to_stata method. Within the parentheses, we put the file path.
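
One detail worth knowing is that to_stata, by default, also writes the dataframe's index to the file as an extra column (named index when the index itself has no name). The sketch below, using made-up data and temporary files, compares the default with write_index=False.

```python
import os
import tempfile

import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({"year": [1950, 1951], "pchick": [69.5, 72.9]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.dta")
    without_index = os.path.join(tmp, "without_index.dta")

    toy.to_stata(with_index)                        # default: write_index=True
    toy.to_stata(without_index, write_index=False)  # skip the index column

    cols_default = list(pd.read_stata(with_index).columns)
    cols_no_index = list(pd.read_stata(without_index).columns)

print(cols_default)    # includes the extra 'index' column
print(cols_no_index)   # only the original columns
```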

Save a CSV file as a Stata File

In this section, we are going to use Pandas read_csv to read a CSV file containing data. After we have imported the CSV into a dataframe, we are going to save it as a .dta file using Pandas to_stata:

df = pd.read_csv('./SimData/FifthDayData.csv')
df.to_stata('./SimData/FifthDayData.dta')
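
To check that such a conversion preserved the data, we can read the new .dta file back and compare it with the original dataframe. This is a minimal sketch with a made-up, numeric-only CSV written to a temporary directory; the file names and values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up CSV content standing in for a real data file
csv_text = "id,score\n1,0.5\n2,1.5\n3,2.5\n"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    dta_path = os.path.join(tmp, "toy.dta")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(csv_text)

    df = pd.read_csv(csv_path)
    df.to_stata(dta_path, write_index=False)

    # Read the Stata file back and compare the values column by column
    roundtrip = pd.read_stata(dta_path)
    same = roundtrip.to_dict("list") == df.to_dict("list")

print(same)
```

The comparison is on values only, because to_stata may store integer columns in a smaller integer type than the one Pandas used in memory.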

Export an Excel file as a Stata File

In the final example, we are going to use Pandas read_excel to import an .xlsx file and then save this dataframe as a Stata file using Pandas to_stata:

df = pd.read_excel('./SimData/example_concat.xlsx')
df.to_stata('./SimData/example_concat.dta')

Note that in both of the last two examples we save the data to a folder called SimData. If we want to save the resulting .dta files to the current directory instead, we simply remove the “./SimData/” part of the string.

Note, all the files we have read using read_dta, read_stata, read_csv, and read_excel can be found here. It is, of course, possible to open SPSS and SAS files using Pandas and save them as .dta files as well.

Summary: Read Stata Files using Python

In this post, we have learned how to read Stata files in Python. Furthermore, we have learned how to write Pandas dataframes to Stata files.
