Wednesday, July 31, 2019

PSF GSoC students blogs: Weekly Check-in #8

In the past week, I was working on setting up Hadoop and trying to import data from it. I got my PRs reviewed by my mentor and working on the changes he suggested.

What did I do this week?

I had initially set up Hadoop in my Ubuntu system. But setting this up would be difficult in Travis CI. So I was exploring other options. The easy way to do this is through docker but there is no official Hadoop distribution in docker. I was checking out cloudera's quick start VM but when I was trying to set this up my laptop started to hang. I will continue to look into other options. Also my mentor had reviewed my HDFS source PR and guided me on how to proceed further. 

What is coming up next?

I'll have to work on the docker set up for Hdfs source. I'll probably have to write a script or a docker-compose script. MySQL PR had an issue while my mentor was adding a merge test. Will work on that as well. We'll be preparing for a release soon.

Did you get stuck anywhere?

I struggled a bit with the Hadoop set up. My mentor gave me some input on this and hopefully I'll be able to create a docker set up by next week.

 



from Planet Python
via read more

PSF GSoC students blogs: Coding week #9

What did I do this week?

After a productive discussion with my mentors last week, we agreed to proceed coding the local scoring algorithm for Binomial MGWR and testing the results from that. After familiarizing myself with the literature on the local scoring procedure, I coded it in the context of local models for MGWR. After multiple iterations, the model is converging and the bandwidth results are looking as expected. Though the parameter coefficients have values close to expected but not as accurate as needed. There could be possible issues with the weights associated in the model, and some adjustments need to be made for the coefficients which need to be figured out.

What is coming up next?

In the coming week I will work on resolving the coefficient value issue discussed above and design and implement a Monte-Carlo design for the Binomial model as was done for the Poisson MGWR model.

Did I get stuck anywhere?

The modeling of the binary response variable with MGWR is still not resolved and issues have been encountered continuously in it, though that is expected from research. Hoping to resolve these final issues soon and work further on the predictions with GWR and MGWR.

Looking forward to the progress update next week!



from Planet Python
via read more

Will Kahn-Greene: crashstats-tools v1.0.1 released! cli for Crash Stats.

What is it?

crashstats-tools is a set of command-line tools for working with Crash Stats (https://crash-stats.mozilla.org/).

crashstats-tools comes with two commands:

  • supersearch: for performing Crash Stats Super Search queries
  • fetch-data: for fetching raw crash, dumps, and processed crash data for specified crash ids

v1.0.1 released!

I extracted two commands we have in the Socorro local dev environment as a separate Python project. This allows anyone to use those two commands without having to set up a Socorro local dev environment.

The audience for this is pretty limited, but I think it'll help significantly for testing analysis tools.

Say I'm working on an analysis tool that looks at crash report minidump files and does some additional analysis on it. I could use supersearch command to get me a list of crash ids to download data for and the fetch-data command to download the requisite data.

$ export CRASHSTATS_API_TOKEN=foo
$ mkdir crashdata
$ supersearch --product=Firefox --num=10 | \
    fetch-data --raw --dumps --no-processed crashdata

Then I can run my tools on the dumps in crashdata/upload_file_minidump/.

Be thoughtful about using data

Make sure to use these tools in compliance with our data policy:

https://crash-stats.mozilla.org/documentation/memory_dump_access/

Where to go for more

See the project on GitHub which includes a README which contains everything about the project including examples of usage, the issue tracker, and the source code:

https://github.com/willkg/crashstats-tools

Let me know whether this helps you!



from Planet Python
via read more

PyPI now supports uploading via API token

We're further increasing the security of the Python Package Index with another new beta feature: scoped API tokens for package upload. This is thanks to a grant from the Open Technology Fund, coordinated by the Packaging Working Group of the Python Software Foundation.

Over the last few months, we've added two-factor authentication (2FA) login security methods. We added Time-based One-Time Password (TOTP) support in late May and physical security device support in mid-June. Now, over 1600 users have started using physical security devices or TOTP applications to better secure their accounts. And over the past week, over 7.8% of logins to PyPI.org have been protected by 2FA, up from 3% in the month of June.

Add API token screen, with textarea for token name and dropdown menu to choose token scope
PyPI interface for adding an
API token for package upload
Now, we have another improvement: you can use API tokens to upload packages to PyPI and Test PyPI! And we've designed the token to be a drop-in replacement for the username and password you already use (warning: this is a beta feature that we need your help to test).

How it works: Go to your PyPI account settings and select "Add API token". When you create an API token, you choose its scope: you can create a token that can upload to all the projects you maintain or own, or you can limit its scope to just one project.


API token management interface displays each token's name, scope, date/time created, and date/time last used, and the user can view each token's unique ID or revoke it
PyPI API token management interface
The token management screen shows you when each of your tokens were created, and last used. And you can revoke one token without revoking others, and without having to change your password on PyPI and in configuration files.

Uploading with an API token is currently optional but encouraged; in the future, PyPI will set and enforce a policy requiring users with two-factor authentication enabled to use API tokens to upload (rather than just their password sans second factor). Watch our announcement mailing list for future details.

A successful API token creation: a long string that only appears once, for the user to copy
Immediately after creating the API token,
PyPI gives the user one chance to copy it

Why: These API tokens can only be used to upload packages to PyPI, and not to log in more generally. This makes it safer to automate package upload and store the credential in the cloud, since a thief who copies the token won't also gain the ability to delete the project, delete old releases, or add or remove collaborators. And, since the token is a long character string (with 32 bytes of entropy and a service identifier) that PyPI has securely generated on the server side, we vastly reduce the potential for credential reuse on other sites and for a bad actor to guess the token.


Help us test: Please try this out! This is a beta feature and we expect that users will find minor issues over the next few weeks; we ask for your bug reports. If you find any potential security vulnerabilities, please follow our published security policy. (Please don't report security issues in Warehouse via GitHub, IRC, or mailing lists. Instead, please directly email security@python.org.) If you find an issue that is not a security vulnerability, please report it via GitHub.

We'd particularly like testing from:
  • Organizations that automate uploads using continuous integration
  • People who save PyPI credentials in a .pypirc file
  • Windows users
  • People on mobile devices
  • People on very slow connections
  • Organizations where users share an auth token within a group
  • Projects with 4+ maintainers or owners
  • People who usually block cookies and JavaScript
  • People who maintain 20+ projects
  • People who created their PyPI account 6+ years ago
What's next for PyPI: Next, we'll move on to working on an advanced audit trail of sensitive user actions, plus improvements to accessibility and localization for PyPI (some of which have already started). More details are in our progress reports on Discourse.

Thanks to the Open Technology Fund for funding this work. And please sign up for the PyPI Announcement Mailing List for future updates.


from Python Software Foundation News
via read more

Erik Marsja: How to Read and Write JSON Files using Python and Pandas

In this post we will learn how to read and write JSON files using Python. In the first, part we are going to use the Python package json to create a JSON file and write a JSON file. In the next part we are going to use Pandas json method to load JSON files into Pandas dataframe. Here, we will learn how to read from a JSON file locally and from an URL as well as how to read a nested JSON file using Pandas.

Finally, as a bonus, we will also learn how to manipulate data in Pandas dataframes, rename columns, and plot the data using Seaborn.

What is a JSON File?

JSON, short for JavaScript Object Notation, is a compact, text based format used to exchange data. This format that is common for downloading, and storing, information from web servers via so-called Web APIs. JSON is a text-based format and  when opening up a JSON file, we will recognize the structure. That is, it is not so different from Python’s structure for a dictionary.

Example JSON file

In the first example we are going to use the Python module json to create a JSON file. After we’ve done that we are going to load the JSON file. In this Python JSON tutorial, we start by create a dictionary for our data:

import json

data = {"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731", 
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012", 
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Manegement", "IT", "HR", 
                      "Finance", "IT", "Manegement", "IT"],
        "Sex":[ "M", "M", "M", 
              "M", "M", "F", "F", "F"]}

print(data)
Python dictionary

Saving to a JSON file

In Python, there is the module json that enables us read and write content to and from a JSON file. This module converts the JSONs format to Python’s internal format for Data Structures. So we can work with JSON structures just as we do in the usual way with Python’s own data structures.

Python JSON Example:

In the example code below, we start by importing the json module. After we’ve done that, we open up a new file and use the dump method to write a json file using Python.

import json
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

How to Use Pandas to Load a JSON File

Now, if we are going to work with the data we might want to use Pandas to load the JSON file into a Pandas dataframe. This will enable us to manipulate data, do summary statistics, and data visualization using Pandas built-in methods. Note, we will cover this briefly later in this post also.

Pandas Read Json Example:

In the next example we are going to use Pandas read_json method to read the JSON file we wrote earlier (i.e., data.json). It’s fairly simple we start by importing pandas as pd:

import pandas as pd

df = pd.read_json('data.json')

df

The output, when working with Jupyter Notebooks, will look like this:

Data Manipulation using Pandas

Now that we have loaded the JSON file into a Pandas dataframe we are going use Pandas inplace method to modify our dataframe. We start by setting the Sub_ID column as index.

df.set_index('Sub_ID', inplace=True)
df

Pandas JSON to CSV Example

Now when we have loaded a JSON file into a dataframe we may want to save it in another format. For instance, we may want to save it as a CSV file and we can do that using Pandas read_csv method. It may be useful to store it in a CSV, if we prefer to browse through the data in a text editor or Excel.

In the Pandas JSON to CSV example below, we carry out the same data manipulation method.

df.to_csv("data.csv")

Learn more about working with CSV files using Pandas in the  Pandas Read CSV Tutorial

How to Load JSON from an URL

We have now seen how easy it is to create a JSON file, write it to our hard drive using Python, and, finally, how to read it using Pandas. However, as previously mentioned, many times the data in stored in the JSON format are on the web.

Thus, in this section of the Python json guide, we are going to learn how to use Pandas read_json method to read a JSON file from an URL. Most often, it’s fairly simple we just create a string variable pointing to the URL:

url = "https://api.exchangerate-api.com/v4/latest/USD"
df = pd.read_json(url)
df.head()

Load JSON from an URL Second Example

When loading some data, using Pandas read_json seems to create a dataframe with dictionaries within each cell. One way to deal with these dictionaries, nested within dictionaries, is to work with the Python module request. This module also have a method for parsing JSON files. After we have parsed the JSON file we will use the method json_normalize to convert the JSON file to a dataframe.

Pandas Dataframe from JSON
import requests
from pandas.io.json import json_normalize

url = "https://think.cs.vt.edu/corgis/json/airlines/airlines.json"
resp = requests.get(url=url)

df = json_normalize(resp.json())
df.head()

As can be seen in the image above, the column names are quite long. This is quite impractical when we are going to create a time series plot, later, using Seaborn. We are now going to rename the columns so they become a bit easier to use.

In the code example below, we use Pandas rename method together with the Python module re. That is, we are using a regular expression to remove “statistics.# of” and “statistics.” from the column names. Finally, we are also replacing dots (“.”) with underscores (“_”) using the str.replace method:

import re

df.rename(columns=lambda x: re.sub("statistics.# of","",x), 
          inplace=True)
df.rename(columns=lambda x: re.sub("statistics.","",x), 
          inplace=True)

df.columns = df.columns.str.replace("[.]", "_")
df.head()

 

Time Series Plot from JSON Data using Seaborn

In the last example, in this post, we are going to use Seaborn to create a time series plot. The data we loaded from JSON to a dataframe contains data about delayed and canceled flights. We are going to use Seaborns lineplot method to create a time series plot of the number of canceled flights throughout 2003 to 2016, grouped by carrier code.

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(10, 7))
g = sns.lineplot(x="timeyear", y="flightscancelled", ci=False,
             hue="carriercode", data=df)

g.set_ylabel("Flights Cancelled",fontsize=20)
g.set_xlabel("Year",fontsize=20)


plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Note, we changed the font size as well as the x- and y-axis’ labels using the methods set_ylabel and set_xlabel. Furthermore, we also moved the legend using the legend method from matplotlib.

For more about exploratory data analysis using Python:

Conclusion

In this post we have learned how to write a JSON file from a Python dictionary, how to load that JSON file using Python and Pandas. Furthermore, we have also learned how to use Pandas to load a JSON file from an URL to a dataframe, how to read a nested JSON file to a dataframe.

Here’s a link to a Jupyter Notebook containing all code examples in this post.

The post How to Read and Write JSON Files using Python and Pandas appeared first on Erik Marsja.



from Planet Python
via read more

PyCharm: Jupyter, PyCharm and Pizza

Hi there! Have you tried Jupyter Notebooks integration in PyCharm 2019.2? Not yet? Then let me show you what it looks like!

In this blog post, we’re going to explore some data using PyCharm and its Jupyter Notebook integration. First, of course, we’ll need said data. Whenever I need a new dataset to play with, I typically head to Kaggle where I’m sure to find something interesting to toy with. This time a dataset called “Pizza Restaurants and the Pizza They Sell” caught my attention. Who doesn’t love pizza? Let’s analyze these pizza restaurants and try to learn a thing or two from it.

Since this data isn’t a part of any of my existing PyCharm projects, I’ll create a new one.
Make sure to use PyCharm Professional Edition, the Community Edition does not include Jupyter Notebooks integration.

Create new PyCHarm project

Tip: When using Jupyter notebooks in the browser, I tend to create multiply temporary notebooks just for experiments. It would be quite tedious to create a PyCharm project for each of them, so instead, you can have a single project for such experiments.

I like my things organized, so once the project is created, I’ll add some structure to it – a directory for the data where I’ll move the downloaded dataset, and another directory for the notebooks.

Once I create my first pizza.ipynb notebook, PyCharm suggests to install Jupyter package and provides a link in the upper right corner to do that.

Install Jupyter package

Once the Jupyter package is installed, we’re ready to go!

The first thing that probably 90% of data scientists do in their Jupyter notebooks is type import pandas as pd. At this point, PyCharm will suggest installing pandas in this venv and you can do it with a single click:

Install Pandas

Once we have pandas installed, we can read the data from the csv into a pandas DataFrame:
df = pd.read_csv("../data/Datafiniti_Pizza_Restaurants_and_the_Pizza_They_Sell_May19.csv")

To execute this cell, hit Shift+Enter, or click the green arrow icon in the gutter next to the cell.
When you run a cell for the first time, PyCharm will launch a local Jupyter server to execute the code in it – you don’t need to manually do this from your terminal.

Let’s get to know the data. First, we’ll learn the basic things about this dataset – how many rows does it have? What are the columns? What does the data look like?

First look at the data

I have a suspicion that this data contains information only on restaurants in the US. To confirm this, let’s count the values in the country column:

Count unique values in country column

Yep, the only country presented in this dataset is US, so it’s safe to drop the country column altogether. Same goes for menus.currency and priceRangeCurrency, those values too are all the same – USD. I’ll also drop menuPageURL as it doesn’t add much value to the analysis, and key as it duplicates the information from other columns (country, state, city, etc.).

Another cleanup that I’ll do here is rename province column into states as it makes more sense in this context, and for better readability, I’ll replace the state acronyms with full names of the states.

Data cleanup

Once we’re done with cleaning the data, how about we plot it? As humans, we are better at understanding information when it’s presented visually.

First, let’s see what are the most common types of pizza we have in this dataset. Given the theme, it feels appropriate to visualize this as a pie with matplotlib :)

Pizza pie plot

Oops, where’s my pie? To have it displayed, I need to add %matplotlib inline magic command for IPython, and while I’m at it, I’ll add another magic command to let IPython know to render the plots appropriately for retina screen.

I could add these lines to the same cell and run it again, but I prefer to have this type of magic commands defined at the very beginning of the notebook.

To navigate to the very beginning of the notebook, you can use Cmd+[ (Ctrl+Alt+Left on Windows). Inserting a new cell is as easy as typing #%% (if you prefer a shortcut to insert a cell above your current one, there’s one! Option+Shift+Aon mac, or Alt+Shift+A on Windows). Now all I need to do is add the magic commands and run all cells below:

Run Below

And voila! Now we know that the most common type of pizza is Cheese Pizza closely followed by White Pizza.

Pie plot

What about the restaurants? We have their geographical locations in the dataset, so we can easily see where they are located.

Each restaurant has a unique id and can have multiple entries in the dataset, each entry representing a pizza from that restaurant’s menu. So to plot the restaurants and not the pizza, we’ll need to group the entries by restaurant id.

Unique restaurants

Now we can plot them on a map. For geographical plotting, I like to use plotly. Make sure to grab the latest version of it (4.0.0) to have plotly outputs rendered nicely in PyCharm.

Pizza restaurants on a map

What else can we learn from this data? Let’s try something a little more complicated. Let’s see what states have the most pizza restaurants in them. To make this comparison fair, we’ll count the restaurants per capita (per 100 000 residents). You can get the population data for the US and multiple other datasets at https://www.census.gov/.

Pizza restaurants per capita

And the winner is… New York!

One can think of a number of questions we can try to get answered with this dataset, like, what city has the most/least expensive Veggie Pizza? Or what are the most common pizza restaurant chains? If you want to toy with this dataset and answer these or other questions, you can grab it on kaggle and run your own analysis. The notebook used in this blog post is available on GitHub. And if you want to try it with PyCharm, make sure you’re using PyCharm 2019.2 Professional Edition.



from Planet Python
via read more

Real Python: First Steps With PySpark and Big Data Processing

It’s becoming more common to face situations where the amount of data is simply too big to handle on a single machine. Luckily, technologies such as Apache Spark, Hadoop, and others have been developed to solve this exact problem. The power of those systems can be tapped into directly from Python using PySpark!

Efficiently handling datasets of gigabytes and more is well within the reach of any Python developer, whether you’re a data scientist, a web developer, or anything in between.

In this tutorial, you’ll learn:

  • What Python concepts can be applied to Big Data
  • How to use Apache Spark and PySpark
  • How to write basic PySpark programs
  • How to run PySpark programs on small datasets locally
  • Where to go next for taking your PySpark skills to a distributed system

Free Bonus: Click here to get access to a chapter from Python Tricks: The Book that shows you Python's best practices with simple examples you can apply instantly to write more beautiful + Pythonic code.

Big Data Concepts in Python

Despite its popularity as just a scripting language, Python exposes several programming paradigms like array-oriented programming, object-oriented programming, asynchronous programming, and many others. One paradigm that is of particular interest for aspiring Big Data professionals is functional programming.

Functional programming is a common paradigm when you are dealing with Big Data. Writing in a functional manner makes for embarrassingly parallel code. This means it’s easier to take your code and have it run on several CPUs or even entirely different machines. You can work around the physical memory and CPU restrictions of a single workstation by running on multiple systems at once.

This is the power of the PySpark ecosystem, allowing you to take functional code and automatically distribute it across an entire cluster of computers.

Luckily for Python programmers, many of the core ideas of functional programming are available in Python’s standard library and built-ins. You can learn many of the concepts needed for Big Data processing without ever leaving the comfort of Python.

The core idea of functional programming is that data should be manipulated by functions without maintaining any external state. This means that your code avoids global variables and always returns new data instead of manipulating the data in-place.

Another common idea in functional programming is anonymous functions. Python exposes anonymous functions using the lambda keyword, not to be confused with AWS Lambda functions.

Now that you know some of the terms and concepts, you can explore how those ideas manifest in the Python ecosystem.

Lambda Functions

lambda functions in Python are defined inline and are limited to a single expression. You’ve likely seen lambda functions when using the built-in sorted() function:

>>>
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(sorted(x))
['Python', 'awesome!', 'is', 'programming']
>>> print(sorted(x, key=lambda arg: arg.lower()))
['awesome!', 'is', 'programming', 'Python']

The key parameter to sorted is called for each item in the iterable. This makes the sorting case-insensitive by changing all the strings to lowercase before the sorting takes place.

This is a common use-case for lambda functions, small anonymous functions that maintain no external state.

Other common functional programming functions exist in Python as well, such as filter(), map(), and reduce(). All these functions can make use of lambda functions or standard functions defined with def in a similar manner.

filter(), map(), and reduce()

The built-in filter(), map(), and reduce() functions are all common in functional programming. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.

It’s important to understand these functions in a core Python context. Then, you’ll be able to translate that knowledge into PySpark programs and the Spark API.

filter() filters items out of an iterable based on a condition, typically expressed as a lambda function:

>>>
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(filter(lambda arg: len(arg) < 8, x)))
['Python', 'is']

filter() takes an iterable, calls the lambda function on each item, and returns the items where the lambda returned True.

Note: Calling list() is required because filter() is also an iterable. filter() only gives you the values as you loop over them. list() forces all the items into memory at once instead of having to use a loop.

You can imagine using filter() to replace a common for loop pattern like the following:

def is_less_than_8_characters(item):
    return len(item) < 8

x = ['Python', 'programming', 'is', 'awesome!']
results = []

for item in x:
    if is_less_than_8_characters(item):
        results.append(item)

print(results)

This code collects all the strings that have less than 8 characters. The code is more verbose than the filter() example, but it performs the same function with the same results.

Another less obvious benefit of filter() is that it returns an iterable. This means filter() doesn’t require that your computer have enough memory to hold all the items in the iterable at once. This is increasingly important with Big Data sets that can quickly grow to several gigabytes in size.

map() is similar to filter() in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The new iterable that map() returns will always have the same number of elements as the original iterable, which was not the case with filter():

>>>
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(map(lambda arg: arg.upper(), x)))
['PYTHON', 'PROGRAMMING', 'IS', 'AWESOME!']

map() automatically calls the lambda function on all the items, effectively replacing a for loop like the following:

results = []

x = ['Python', 'programming', 'is', 'awesome!']
for item in x:
    results.append(item.upper())

print(results)

The for loop has the same result as the map() example, which collects all items in their upper-case form. However, as with the filter() example, map() returns an iterable, which again makes it possible to process large sets of data that are too big to fit entirely in memory.

Finally, the last of the functional trio in the Python standard library is reduce(). As with filter() and map(), reduce()applies a function to elements in an iterable.

Again, the function being applied can be a standard Python function created with the def keyword or a lambda function.

However, reduce() doesn’t return a new iterable. Instead, reduce() uses the function called to reduce the iterable to a single value:

>>>
>>> from functools import reduce
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(reduce(lambda val1, val2: val1 + val2, x))
Pythonprogrammingisawesome!

This code combines all the items in the iterable, from left to right, into a single item. There is no call to list() here because reduce() already returns a single item.

Note: Python 3.x moved the built-in reduce() function into the functools package.

lambda, map(), filter(), and reduce() are concepts that exist in many languages and can be used in regular Python programs. Soon, you’ll see these concepts extend to the PySpark API to process large amounts of data.

Sets

Sets are another common piece of functionality that exist in standard Python and is widely useful in Big Data processing. Sets are very similar to lists except they do not have any ordering and cannot contain duplicate values. You can think of a set as similar to the keys in a Python dict.

Hello World in PySpark

As in any good programming tutorial, you’ll want to get started with a Hello World example. Below is the PySpark equivalent:

import pyspark
sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

Don’t worry about all the details yet. The main idea is to keep in mind that a PySpark program isn’t much different from a regular Python program.

Note: This program will likely raise an Exception on your system if you don’t have PySpark installed yet or don’t have the specified copyright file, which you’ll see how to do later.

You’ll learn all the details of this program soon, but take a good look. The program counts the total number of lines and the number of lines that have the word python in a file named copyright.

Remember, a PySpark program isn’t that much different from a regular Python program, but the execution model can be very different from a regular Python program, especially if you’re running on a cluster.

There can be a lot of things happening behind the scenes that distribute the processing across multiple nodes if you’re on a cluster. However, for now, think of the program as a Python program that uses the PySpark library.

Now that you’ve seen some common functional concepts that exist in Python as well as a simple PySpark program, it’s time to dive deeper into Spark and PySpark.

What Is Spark?

Apache Spark is made up of several components, so describing it can be difficult. At its core, Spark is a generic engine for processing large amounts of data.

Spark is written in Scala and runs on the JVM. Spark has built-in components for processing streaming data, machine learning, graph processing, and even interacting with data via SQL.

In this guide, you’ll only learn about the core Spark components for processing Big Data. However, all the other components such as machine learning, SQL, and so on are all available to Python projects via PySpark too.

What Is PySpark?

Spark is implemented in Scala, a language that runs on the JVM, so how can you access all that functionality via Python?

PySpark is the answer.

The current version of PySpark is 2.4.3 and works with Python 2.7, 3.3, and above.

You can think of PySpark as a Python-based wrapper on top of the Scala API. This means you have two sets of documentation to refer to:

  1. PySpark API documentation
  2. Spark Scala API documentation

The PySpark API docs have examples, but often you’ll want to refer to the Scala documentation and translate the code into Python syntax for your PySpark programs. Luckily, Scala is a very readable function-based programming language.

PySpark communicates with the Spark Scala-based API via the Py4J library. Py4J isn’t specific to PySpark or Spark. Py4J allows any Python program to talk to JVM-based code.

There are two reasons that PySpark is based on the functional paradigm:

  1. Spark’s native language, Scala, is functional-based.
  2. Functional code is much easier to parallelize.

Another way to think of PySpark is a library that allows processing large amounts of data on a single machine or a cluster of machines.

In a Python context, think of PySpark has a way to handle parallel processing without the need for the threading or multiprocessing modules. All of the complicated communication and synchronization between threads, processes, and even different CPUs is handled by Spark.

PySpark API and Data Structures

To interact with PySpark, you create specialized data structures called Resilient Distributed Datasets (RDDs).

RDDs hide all the complexity of transforming and distributing your data automatically across multiple nodes by a scheduler if you’re running on a cluster.

To better understand PySpark’s API and data structures, recall the Hello World program mentioned previously:

import pyspark
sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

The entry-point of any PySpark program is a SparkContext object. This object allows you to connect to a Spark cluster and create RDDs. The local[*] string is a special string denoting that you’re using a local cluster, which is another way of saying you’re running in single-machine mode. The * tells Spark to create as many worker threads as logical cores on your machine.

Creating a SparkContext can be more involved when you’re using a cluster. To connect to a Spark cluster, you might need to handle authentication and a few other pieces of information specific to your cluster. You can set up those details similarly to the following:

conf = pyspark.SparkConf()
conf.setMaster('spark://head_node:56887')
conf.set('spark.authenticate', True)
conf.set('spark.authenticate.secret', 'secret-key')
sc = SparkContext(conf=conf)

You can start creating RDDs once you have a SparkContext.

You can create RDDs in a number of ways, but one common way is the PySpark parallelize() function. parallelize() can transform some Python data structures like lists and tuples into RDDs, which gives you functionality that makes them fault-tolerant and distributed.

To better understand RDDs, consider another example. The following code creates an iterator of 10,000 elements and then uses parallelize() to distribute that data into 2 partitions:

>>>
>>> big_list = range(10000)
>>> rdd = sc.parallelize(big_list, 2)
>>> odds = rdd.filter(lambda x: x % 2 != 0)
>>> odds.take(5)
[1, 3, 5, 7, 9]

parallelize() turns that iterator into a distributed set of numbers and gives you all the capability of Spark’s infrastructure.

Notice that this code uses the RDD’s filter() method instead of Python’s built-in filter(), which you saw earlier. The result is the same, but what’s happening behind the scenes is drastically different. By using the RDD filter() method, that operation occurs in a distributed manner across several CPUs or computers.

Again, imagine this as Spark doing the multiprocessing work for you, all encapsulated in the RDD data structure.

take() is a way to see the contents of your RDD, but only a small subset. take() pulls that subset of data from the distributed system onto a single machine.

take() is important for debugging because inspecting your entire dataset on a single machine may not be possible. RDDs are optimized to be used on Big Data so in a real world scenario a single machine may not have enough RAM to hold your entire dataset.

Note: Spark temporarily prints information to stdout when running examples like this in the shell, which you’ll see how to do soon. Your stdout might temporarily show something like [Stage 0:> (0 + 1) / 1].

The stdout text demonstrates how Spark is splitting up the RDDs and processing your data into multiple stages across different CPUs and machines.

Another way to create RDDs is to read in a file with textFile(), which you’ve seen in previous examples. RDDs are one of the foundational data structures for using PySpark so many of the functions in the API return RDDs.

One of the key distinctions between RDDs and other data structures is that processing is delayed until the result is requested. This is similar to a Python generator. Developers in the Python ecosystem typically use the term lazy evaluation to explain this behavior.

You can stack up multiple transformations on the same RDD without any processing happening. This functionality is possible because Spark maintains a directed acyclic graph of the transformations. The underlying graph is only activated when the final results are requested. In the previous example, no computation took place until you requested the results by calling take().

There are multiple ways to request the results from an RDD. You can explicitly request results to be evaluated and collected to a single cluster node by using collect() on a RDD. You can also implicitly request the results in various ways, one of which was using count() as you saw earlier.

Note: Be careful when using these methods because they pull the entire dataset into memory, which will not work if the dataset is too big to fit into the RAM of a single machine.

Again, refer to the PySpark API documentation for even more details on all the possible functionality.

Installing PySpark

Typically, you’ll run PySpark programs on a Hadoop cluster, but other cluster deployment options are supported. You can read Spark’s cluster mode overview for more details.

Note: Setting up one of these clusters can be difficult and is outside the scope of this guide. Ideally, your team has some wizard DevOps engineers to help get that working. If not, Hadoop publishes a guide to help you.

In this guide, you’ll see several ways to run PySpark programs on your local machine. This is useful for testing and learning, but you’ll quickly want to take your new programs and run them on a cluster to truly process Big Data.

Sometimes setting up PySpark by itself can be challenging too because of all the required dependencies.

PySpark runs on top of the JVM and requires a lot of underlying Java infrastructure to function. That being said, we live in the age of Docker, which makes experimenting with PySpark much easier.

Even better, the amazing developers behind Jupyter have done all the heavy lifting for you. They publish a Dockerfile that includes all the PySpark dependencies along with Jupyter. So, you can experiment directly in a Jupyter notebook!

Note: Jupyter notebooks have a lot of functionality. Check out Jupyter Notebook: An Introduction for a lot more details on how to use notebooks effectively.

First, you’ll need to install Docker. Take a look at Docker in Action – Fitter, Happier, More Productive if you don’t have Docker setup yet.

Note: The Docker images can be quite large so make sure you’re okay with using up around 5 GBs of disk space to use PySpark and Jupyter.

Next, you can run the following command to download and automatically launch a Docker container with a pre-built PySpark single-node setup. This command may take a few minutes because it downloads the images directly from DockerHub along with all the requirements for Spark, PySpark, and Jupyter:

$ docker run -p 8888:8888 jupyter/pyspark-notebook

Once that command stops printing output, you have a running container that has everything you need to test out your PySpark programs in a single-node environment.

To stop your container, type Ctrl+C in the same window you typed the docker run command in.

Now it’s time to finally run some programs!

Running PySpark Programs

There are a number of ways to execute PySpark programs, depending on whether you prefer a command-line or a more visual interface. For a command-line interface, you can use the spark-submit command, the standard Python shell, or the specialized PySpark shell.

First, you’ll see the more visual interface with a Jupyter notebook.

Jupyter Notebook

You can run your program in a Jupyter notebook by running the following command to start the Docker container you previously downloaded (if it’s not already running):

$ docker run -p 8888:8888 jupyter/pyspark-notebook
Executing the command: jupyter notebook
[I 08:04:22.869 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[I 08:04:25.022 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
[I 08:04:25.022 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 08:04:25.027 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 08:04:25.028 NotebookApp] The Jupyter Notebook is running at:
[I 08:04:25.029 NotebookApp] http://(4d5ab7a93902 or 127.0.0.1):8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437
[I 08:04:25.029 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:04:25.037 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://(4d5ab7a93902 or 127.0.0.1):8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437

Now you have a container running with PySpark. Notice that the end of the docker run command output mentions a local URL.

Note: The output from the docker commands will be slightly different on every machine because the tokens, container IDs, and container names are all randomly generated.

You need to use that URL to connect to the Docker container running Jupyter in a web browser. Copy and paste the URL from your output directly into your web browser. Here is an example of the URL you’ll likely see:

$ http://127.0.0.1:8888/?token=80149acebe00b2c98242aa9b87d24739c78e562f849e4437

The URL in the command below will likely differ slightly on your machine, but once you connect to that URL in your browser, you can access a Jupyter notebook environment, which should look similar to this:

Jupyter notebook homepage

From the Jupyter notebook page, you can use the New button on the far right to create a new Python 3 shell. Then you can test out some code, like the Hello World example from before:

import pyspark
sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
print(txt.count())

python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

Here’s what running that code will look like in the Jupyter notebook:

PySpark Hello World in Jupyter notebook

There is a lot happening behind the scenes here, so it may take a few seconds for your results to display. The answer won’t appear immediately after you click the cell.

Command-Line Interface

The command-line interface offers a variety of ways to submit PySpark programs including the PySpark shell and the spark-submit command. To use these CLI approaches, you’ll first need to connect to the CLI of the system that has PySpark installed.

To connect to the CLI of the Docker setup, you’ll need to start the container like before and then attach to that container. Again, to start the container, you can run the following command:

$ docker run -p 8888:8888 jupyter/pyspark-notebook

Once you have the Docker container running, you need to connect to it via the shell instead of a Jupyter notebook. To do this, run the following command to find the container name:

$ docker container ls
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS                    NAMES
4d5ab7a93902        jupyter/pyspark-notebook   "tini -g -- start-no…"   12 seconds ago      Up 10 seconds       0.0.0.0:8888->8888/tcp   kind_edison

This command will show you all the running containers. Find the CONTAINER ID of the container running the jupyter/pyspark-notebook image and use it to connect to the bash shell inside the container:

$ docker exec -it 4d5ab7a93902 bash
jovyan@4d5ab7a93902:~$

Now you should be connected to a bash prompt inside of the container. You can verify that things are working because the prompt of your shell will change to be something similar to jovyan@4d5ab7a93902, but using the unique ID of your container.

Note: Replace 4d5ab7a93902 with the CONTAINER ID used on your machine.

Cluster

You can use the spark-submit command installed along with Spark to submit PySpark code to a cluster using the command line. This command takes a PySpark or Scala program and executes it on a cluster. This is likely how you’ll execute your real Big Data processing jobs.

Note: The path to these commands depends on where Spark was installed and will likely only work when using the referenced Docker container.

To run the Hello World example (or any PySpark program) with the running Docker container, first access the shell as described above. Once you’re in the container’s shell environment you can create files using the nano text editor.

To create the file in your current folder, simply launch nano with the name of the file you want to create:

$ nano hello_world.py

Type in the contents of the Hello World example and save the file by typing Ctrl+X and following the save prompts:

Example using Nano Text Editor

Finally, you can run the code through Spark with the pyspark-submit command:

$ /usr/local/spark/bin/spark-submit hello_world.py

This command results in a lot of output by default so it may be difficult to see your program’s output. You can control the log verbosity somewhat inside your PySpark program by changing the level on your SparkContext variable. To do that, put this line near the top of your script:

sc.setLogLevel('WARN')

This will omit some of the output of spark-submit so you can more clearly see the output of your program. However, in a real-world scenario, you’ll want to put any output into a file, database, or some other storage mechanism for easier debugging later.

Luckily, a PySpark program still has access to all of Python’s standard library, so saving your results to a file is not an issue:

import pyspark
sc = pyspark.SparkContext('local[*]')

txt = sc.textFile('file:////usr/share/doc/python/copyright')
python_lines = txt.filter(lambda line: 'python' in line.lower())

with open('results.txt', 'w') as file_obj:
    file_obj.write(f'Number of lines: {txt.count()}\n')
    file_obj.write(f'Number of lines with python: {python_lines.count()}\n')

Now your results are in a separate file called results.txt for easier reference later.

Note: The above code uses f-strings, which were introduced in Python 3.6.

PySpark Shell

Another PySpark-specific way to run your programs is using the shell provided with PySpark itself. Again, using the Docker setup, you can connect to the container’s CLI as described above. Then, you can run the specialized Python shell with the following command:

$ /usr/local/spark/bin/pyspark
Python 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 23:01:00)
SparkSession available as 'spark'.

Now you’re in the Pyspark shell environment inside your Docker container, and you can test out code similar to the Jupyter notebook example:

>>>
>>> txt = sc.textFile('file:////usr/share/doc/python/copyright')
>>> print(txt.count())
316

Now you can work in the Pyspark shell just as you would with your normal Python shell.

Note: You didn’t have to create a SparkContext variable in the Pyspark shell example. The PySpark shell automatically creates a variable, sc, to connect you to the Spark engine in single-node mode.

You must create your own SparkContext when submitting real PySpark programs with spark-submit or a Jupyter notebook.

You can also use the standard Python shell to execute your programs as long as PySpark is installed into that Python environment. The Docker container you’ve been using does not have PySpark enabled for the standard Python environment. So, you must use one of the previous methods to use PySpark in the Docker container.

Combining PySpark With Other Tools

As you already saw, PySpark comes with additional libraries to do things like machine learning and SQL-like manipulation of large datasets. However, you can also use other common scientific libraries like NumPy and Pandas.

You must install these in the same environment on each cluster node, and then your program can use them as usual. Then, you’re free to use all the familiar idiomatic Pandas tricks you already know.

Remember: Pandas DataFrames are eagerly evaluated so all the data will need to fit in memory on a single machine.

Next Steps for Real Big Data Processing

Soon after learning the PySpark basics, you’ll surely want to start analyzing huge amounts of data that likely won’t work when you’re using single-machine mode. Installing and maintaining a Spark cluster is way outside the scope of this guide and is likely a full-time job in itself.

So, it might be time to visit the IT department at your office or look into a hosted Spark cluster solution. One potential hosted solution is Databricks.

Databricks allows you to host your data with Microsoft Azure or AWS and has a free 14-day trial.

After you have a working Spark cluster, you’ll want to get all your data into that cluster for analysis. Spark has a number of ways to import data:

  1. Amazon S3
  2. Apache Hive Data Warehouse
  3. Any database with a JDBC or ODBC interface

You can even read data directly from a Network File System, which is how the previous examples worked.

There’s no shortage of ways to get access to all your data, whether you’re using a hosted solution like Databricks or your own cluster of machines.

Conclusion

PySpark is a good entry-point into Big Data Processing.

In this tutorial, you learned that you don’t have to spend a lot of time learning up-front if you’re familiar with a few functional programming concepts like map(), filter(), and basic Python. In fact, you can use all the Python you already know including familiar tools like NumPy and Pandas directly in your PySpark programs.

You are now able to:

  • Understand built-in Python concepts that apply to Big Data
  • Write basic PySpark programs
  • Run PySpark programs on small datasets with your local machine
  • Explore more capable Big Data solutions like a Spark cluster or another custom, hosted solution

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]



from Planet Python
via read more

Python Insider: PyPI now supports uploading via API token

We're further increasing the security of the Python Package Index with another new beta feature: scoped API tokens for package upload. This is thanks to a grant from the Open Technology Fund, coordinated by the Packaging Working Group of the Python Software Foundation.

Over the last few months, we've added two-factor authentication (2FA) login security methods. We added Time-based One-Time Password (TOTP) support in late May and physical security device support in mid-June. Now, over 1600 users have started using physical security devices or TOTP applications to better secure their accounts. And over the past week, over 7.8% of logins to PyPI.org have been protected by 2FA, up from 3% in the month of June.

Now, we have another improvement: you can use API tokens to upload packages to PyPI and Test PyPI! And we've designed the token to be a drop-in replacement for the username and password you already use (warning: this is a beta feature that we need your help to test).

Add API token screen, with textarea for token name and dropdown menu to choose token scope
PyPI interface for adding an
API token for package upload
How it works: Go to your PyPI account settings and select "Add API token". When you create an API token, you choose its scope: you can create a token that can upload to all the projects you maintain or own, or you can limit its scope to just one project.


The token management screen shows you when each of your tokens were created, and last used. And you can revoke one token without revoking others, and without having to change your password on PyPI and in configuration files.
API token management interface displays each token's name, scope, date/time created, and date/time last used, and the user can view each token's unique ID or revoke it
PyPI API token management interface

Uploading with an API token is currently optional but encouraged; in the future, PyPI will set and enforce a policy requiring users with two-factor authentication enabled to use API tokens to upload (rather than just their password sans second factor). Watch our announcement mailing list for future details.

A successful API token creation: a long string that only appears once, for the user to copy
Immediately after creating the API token,
PyPI gives the user one chance to copy it

Why: These API tokens can only be used to upload packages to PyPI, and not to log in more generally. This makes it safer to automate package upload and store the credential in the cloud, since a thief who copies the token won't also gain the ability to delete the project, delete old releases, or add or remove collaborators. And, since the token is a long character string (with 32 bytes of entropy and a service identifier) that PyPI has securely generated on the server side, we vastly reduce the potential for credential reuse on other sites and for a bad actor to guess the token.


Help us test: Please try this out! This is a beta feature and we expect that users will find minor issues over the next few weeks; we ask for your bug reports. If you find any potential security vulnerabilities, please follow our published security policy. (Please don't report security issues in Warehouse via GitHub, IRC, or mailing lists. Instead, please directly email security@python.org.) If you find an issue that is not a security vulnerability, please report it via GitHub.

We'd particularly like testing from:
  • Organizations that automate uploads using continuous integration
  • People who save PyPI credentials in a .pypirc file
  • Windows users
  • People on mobile devices
  • People on very slow connections
  • Organizations where users share an auth token within a group
  • Projects with 4+ maintainers or owners
  • People who usually block cookies and JavaScript
  • People who maintain 20+ projects
  • People who created their PyPI account 6+ years ago
What's next for PyPI: Next, we'll move on to working on an advanced audit trail of sensitive user actions, plus improvements to accessibility and localization for PyPI (some of which have already started). More details are in our progress reports on Discourse.

Thanks to the Open Technology Fund for funding this work. And please sign up for the PyPI Announcement Mailing List for future updates.

Written by Sumana Harihareswara, published initially to https://ift.tt/2ypln5v


from Planet Python
via read more

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...