Monday, April 5, 2021

Data Analysis With PySpark DataFrame

Install PySpark

!pip install pyspark

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
import pyspark
from pyspark.rdd import RDD
from pyspark.sql import Row, DataFrame, SparkSession, SQLContext, functions
from pyspark.sql.functions import (lit, desc, col, size, array_contains,
                                   isnan, udf, hour, array_min, array_max,
                                   countDistinct, mean, split,
                                   regexp_extract, when)
from pyspark.sql.types import *

from pyspark.ml import Pipeline
PySpark Example

For this exercise, I will use the purchases data. Let us take a look at this data using the Unix head command. We can run Unix commands in a Jupyter notebook by prefixing each command with !.

In [3]:
!head -1 purchases.csv
12-29      11:06   Fort Wayne      Sporting Goods  199.82  Cash

First, we need to create a Spark session by calling SparkSession. This step is necessary before doing anything else.

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# create a Spark session (the entry point for DataFrame operations)
spark = SparkSession.builder \
    .appName("purchases") \
    .getOrCreate()


