Monday, April 5, 2021

Data Analysis With PySpark DataFrame

Install PySpark

!pip install pyspark

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
import pyspark
from pyspark.rdd import RDD
from pyspark.sql import Row, DataFrame, SparkSession, SQLContext, functions
from pyspark.sql.functions import (lit, desc, col, size, array_contains,
                                   isnan, udf, hour, array_min, array_max,
                                   countDistinct, mean, split,
                                   regexp_extract, when)
from pyspark.sql.types import *

from pyspark.ml import Pipeline
PySpark Example

For this exercise, I will use the purchases data. Let us take a look at this data using the Unix head command. We can run Unix commands in a Jupyter notebook by prefixing each command with !.

In [3]:
!head -1 purchases.csv
12-29      11:06   Fort Wayne      Sporting Goods  199.82  Cash

First, we need to create a Spark session by calling SparkSession. This step is necessary before doing anything else.

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# create a Spark session (the entry point for DataFrame operations)
spark = SparkSession.builder \
    .appName("purchases") \
    .getOrCreate()


