Monday, November 23, 2020

Real Python: Split Your Dataset With scikit-learn's train_test_split()

One of the key aspects of supervised machine learning is model evaluation and validation. When you evaluate the predictive performance of your model, it’s essential that the process be unbiased. Using train_test_split() from the data science library scikit-learn, you can split your dataset into subsets that minimize the potential for bias in your evaluation and validation process.

In this tutorial, you’ll learn:

  • Why you need to split your dataset in supervised machine learning
  • Which subsets of the dataset you need for an unbiased evaluation of your model
  • How to use train_test_split() to split your data
  • How to combine train_test_split() with prediction methods

In addition, you’ll get information on related tools from sklearn.model_selection.


The Importance of Data Splitting

Supervised machine learning is about creating models that precisely map the given inputs (independent variables, or predictors) to the given outputs (dependent variables, or responses).

How you measure the precision of your model depends on the type of problem you’re trying to solve. In regression analysis, you typically use the coefficient of determination, root-mean-square error, mean absolute error, or similar quantities. For classification problems, you often apply accuracy, precision, recall, F1 score, and related indicators.
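As a rough illustration of the measures named above, the sketch below computes them with functions from sklearn.metrics. The arrays are made-up numbers chosen only for the example, not data from the tutorial.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Hypothetical regression targets and predictions (illustrative values only).
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.5, 5.0, 7.5, 9.0])

r2 = r2_score(y_true_reg, y_pred_reg)  # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # root-mean-square error
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean absolute error

# Hypothetical binary class labels and predictions (illustrative values only).
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)

print(f"R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these made-up numbers, the coefficient of determination is 0.975, the mean absolute error is 0.25, and precision, recall, and F1 score are all 0.75.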

The acceptable numeric values that measure precision vary from field to field. You can find detailed explanations from Statistics By Jim, Quora, and many other resources.

What’s most important to understand is that you usually need unbiased evaluation to properly use these measures, assess the predictive performance of your model, and validate the model.

This means that you can’t evaluate the predictive performance of a model with the same data you used for training. You need to evaluate the model with fresh data that the model hasn’t seen before. You can accomplish that by splitting your dataset before you use it.
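A minimal sketch of this idea, assuming synthetic data and a plain linear regression (both chosen here only for illustration): the model is fitted on one part of the data and scored on a part it has never seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=100)

# Hold out 25% of the samples; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit only on the training set.
model = LinearRegression().fit(x_train, y_train)

# .score() returns the coefficient of determination for regressors.
train_r2 = model.score(x_train, y_train)
test_r2 = model.score(x_test, y_test)  # evaluated on unseen data

print(f"train R2: {train_r2:.3f}, test R2: {test_r2:.3f}")
```

The score on the test set is the one that estimates how the model behaves on new data; the training score alone would tend to be too optimistic.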

Training, Validation, and Test Sets

Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into three subsets:

  1. The training set is used to train, or fit, your model. For example, you use the training set to find the optimal weights, or coefficients, for linear regression, logistic regression, or neural networks.

  2. The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each considered setting of hyperparameters, you fit the model with the training set and assess its performance with the validation set.

  3. The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.

In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets.
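One way to obtain the three subsets described above is to call train_test_split() twice. The sketch below is an illustration under stated assumptions: the data is synthetic, and Ridge regression with a handful of candidate alpha values stands in for whatever model and hyperparameters you would actually tune.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for this sketch).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

# Step 1: set aside 20% of the data as the test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Step 2: split the remainder into training and validation sets.
# 25% of the remaining 80 samples gives a 60/20/20 split overall.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

# Hyperparameter tuning: fit on the training set, compare on the validation set.
candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical search grid
val_scores = {}
for alpha in candidate_alphas:
    candidate = Ridge(alpha=alpha).fit(x_train, y_train)
    val_scores[alpha] = candidate.score(x_val, y_val)

best_alpha = max(val_scores, key=val_scores.get)

# The test set is used once, to evaluate the final model.
final_model = Ridge(alpha=best_alpha).fit(x_train, y_train)
test_r2 = final_model.score(x_test, y_test)

print(f"best alpha: {best_alpha}, test R2: {test_r2:.3f}")
```

The test set plays no part in choosing best_alpha, so the final score is not influenced by the tuning step.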

Underfitting and Overfitting

Splitting a dataset might also be important for detecting if your model suffers from one of two very common problems, called underfitting and overfitting:

  1. Underfitting is usually the consequence of a model being unable to encapsulate the relations among data. For example, this can happen when trying to represent nonlinear relations with a linear model. Underfitted models will likely have poor performance with both training and test sets.

  2. Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations among data and noise. Such models often have bad generalization capabilities. Although they work well with training data, they usually yield poor performance with unseen (test) data.

You can find a more detailed explanation of underfitting and overfitting in Linear Regression in Python.
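Both problems can be made visible by comparing training and test scores. The sketch below is an illustration, not the tutorial's own example: it assumes synthetic quadratic data, uses a straight-line model to provoke underfitting, and uses a decision tree with no depth limit to provoke overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data with noise (an assumption for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2 + rng.normal(scale=1.0, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Underfitting: a straight line cannot represent the quadratic relation,
# so the score is poor on both the training and the test set.
linear = LinearRegression().fit(x_train, y_train)
linear_train_r2 = linear.score(x_train, y_train)
linear_test_r2 = linear.score(x_test, y_test)

# Overfitting: an unrestricted tree memorizes the training points,
# noise included, and does noticeably worse on unseen data.
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
tree_train_r2 = deep_tree.score(x_train, y_train)
tree_test_r2 = deep_tree.score(x_test, y_test)

print(f"linear: train R2={linear_train_r2:.3f}, test R2={linear_test_r2:.3f}")
print(f"tree:   train R2={tree_train_r2:.3f}, test R2={tree_test_r2:.3f}")
```

A low score on both sets points to underfitting, while a near-perfect training score paired with a clearly lower test score points to overfitting.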

Prerequisites for Using train_test_split()

Now that you understand the need to split a dataset in order to perform unbiased model evaluation and identify underfitting or overfitting, you’re ready to learn how to split your own datasets.

You’ll use version 0.23.1 of scikit-learn, or sklearn. It has many packages for data science and machine learning, but for this tutorial you’ll focus on the model_selection package, specifically on the function train_test_split().

You can install sklearn with pip install:

$ python -m pip install -U "scikit-learn==0.23.1"
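After installing, a quick way to confirm which version is active in your environment is to read the package's __version__ attribute from Python:

```python
import sklearn

# The tutorial targets 0.23.1; other versions may behave slightly differently.
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```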

If you use Anaconda, then you probably already have it installed. However, if you want to use a fresh environment, ensure that you have the specified version, or use Miniconda, then you can install sklearn from Anaconda Cloud with conda install:

Read the full article at https://realpython.com/train-test-split-python-data/ »





from Planet Python
