pip install pandas
. Next step is to load the package by running the following command. pd
is an alias of pandas package. We will use it instead of full name "pandas".
import pandas as pd
The sample data looks like below -
dt = {'ID': [11, 12, 13, 14, 15],
'first_name': ['David', 'Jamie', 'Steve', 'Stevart', 'John'],
'company': ['Aon', 'TCS', 'Google', 'RBS', '.'],
'salary': [74, 76, 96, 71, 78]}
mydt = pd.DataFrame(dt, columns = ['ID', 'first_name', 'company', 'salary'])
ID first_name company salary
0 11 David Aon 74
1 12 Jamie TCS 76
2 13 Steve Google 96
3 14 Stevart RBS 71
4 15 John . 78
Incase you want to change the working directory, you can specify it in under
import os
os.getcwd()
os.chdir( )
function. Single backslash does not work in Python so use 2 backslashes while specifying file location.
The following command tells python to write data in CSV format in your working directory.
os.chdir("C:\\Users\\DELL\\Documents\\")
mydt.to_csv('workingfile.csv', index=False)
Example 1 : Read CSV file with header row
It's the basic syntax of read_csv() function. You just need to mention the filename. It assumes you have column names in first row of your CSV file.It stores the data the way It should be as we have headers in the first row of our datafile. It is important to highlight that
mydata = pd.read_csv("workingfile.csv")
header=0
is the default value. Hence we don't need to mention the header= parameter. It means header starts from first row as indexing in python starts from 0. The above code is equivalent to this line of code. pd.read_csv("workingfile.csv", header=0)
It returns 5 number of rows and 4 number of columns. Column Names are
mydata.shape
mydata.columns
mydata.dtypes
['ID', 'first_name', 'company', 'salary']
See the column types of data we imported. first_name and company are character variables. Remaining variables are numeric ones.
ID int64
first_name object
company object
salary int64
Example 2 : Read CSV file with header in second row
Suppose you have column or variable names in second row. To read this kind of CSV file, you can submit the following command.mydata = pd.read_csv("workingfile.csv", header = 1)
header=1
tells python to pick header from second row. It's setting second row as header. It's not a realistic example. I just used it for illustration so that you get an idea how to solve it. To make it practical, you can add random values in first row in CSV file and then import it again.
11 David Aon 74
0 12 Jamie TCS 76
1 13 Steve Google 96
2 14 Stevart RBS 71
3 15 John . 78
skiprows = 1 means we are ignoring first row and names= option is used to assign variable names manually.
mydata0 = pd.read_csv("workingfile.csv", skiprows=1, names=['CustID', 'Name', 'Companies', 'Income'])
CustID Name Companies Income
0 11 David Aon 74
1 12 Jamie TCS 76
2 13 Steve Google 96
3 14 Stevart RBS 71
4 15 John . 78
Example 3 : Skip rows but keep header
In this case, we are skipping second and third rows while importing. Don't forget index starts from 0 in python so 0 refers to first row and 1 refers to second row and 2 implies third row.
mydata = pd.read_csv("workingfile.csv", skiprows=[1,2])
ID first_name company salary
0 13 Steve Google 96
1 14 Stevart RBS 71
2 15 John . 78
Instead of [1,2] you can also write range(1,3)
. Both means the same thing but range( ) function is very useful when you want to skip many rows so it saves time of manually defining row position.
When skiprows = 4, it means skipping four rows from top. skiprows=[1,2,3,4] means skipping rows from second through fifth. It is because when list is specified in skiprows= option, it skips rows at index positions. When a single integer value is specified in the option, it considers skip those rows from top
Example 4 : Read CSV file without header row
If you specify "header = None", python would assign a series of numbers starting from 0 to (number of columns - 1) as column names. In this datafile, we have column names in first row.mydata0 = pd.read_csv("workingfile.csv", header = None)See the output shown below-
Output |
mydata0 = pd.read_csv("workingfile.csv", header = None, prefix="var")In this case, we are setting
var
as prefix which tells python to include this keyword before each column name.
var0 var1 var2 var3
0 ID first_name company salary
1 11 David Aon 74
2 12 Jamie TCS 76
3 13 Steve Google 96
4 14 Stevart RBS 71
5 15 John . 78
Example 5 : Specify missing values
Thena_values=
options is used to set some values as blank / missing values while importing CSV file.
mydata00 = pd.read_csv("workingfile.csv", na_values=['.'])
ID first_name company salary
0 11 David Aon 74
1 12 Jamie TCS 76
2 13 Steve Google 96
3 14 Stevart RBS 71
4 15 John NaN 78
Example 6 : Set Index Column
mydata01 = pd.read_csv("workingfile.csv", index_col ='ID')
As you can see in the above output, the column ID has been set as index column.
first_name company salary
ID
11 David Aon 74
12 Jamie TCS 76
13 Steve Google 96
14 Stevart RBS 71
15 John . 78
Example 7 : Read CSV File from External URL
You can directly read data from the CSV file that is stored on a web link. It is very handy when you need to load publicly available datasets from github, kaggle and other websites.This DataFrame contains 2311 rows and 8 columns. Using
mydata02 = pd.read_csv("http://winterolympicsmedals.com/medals.csv")
mydata02.shape
, you can generate this summary.
Example 8 : Skip Last 5 Rows While Importing CSV
In the above code, we are excluding bottom 5 rows using skip_footer= parameter.
mydata04 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skip_footer=5)
Example 9 : Read only first 5 rows
Using nrows= option, you can load top K number of rows.
mydata05 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5)
Example 10 : Interpreting "," as thousands separator
mydata06 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", thousands=",")
Example 11 : Read only specific columns
The above code reads only columns based on index positions which are second, sixth and eighth position.
mydata07 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=[1,5,7])
Example 12 : Read some rows and columns
In the above command, we have combined usecols= and nrows= options. It will select only first 5 rows and selected columns.
mydata08 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=[1,5,7], nrows=5)
Example 13 : Read file with semi colon delimiter
Using sep= parameter in read_csv( ) function, you can import file with any delimiter other than default comma. In this case, we are using semi-colon as a separator.
mydata09 = pd.read_csv("file_path", sep = ';')
Example 14 : Change column type while importing CSV
Suppose you want to change column format from int64 to float64 while loading CSV file into Python. We can use dtype = option for the same.
mydf = pd.read_csv("workingfile.csv", dtype = {"salary" : "float64"})
Example 15 : Measure time taken to import big CSV file
With the use ofverbose=True
, you can capture time taken for Tokenization, conversion and Parser memory cleanup.
mydf = pd.read_csv("workingfile.csv", verbose=True)
from Planet Python
via read more
No comments:
Post a Comment