In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
In this 28-minute video, you'll learn:

- How to use `OneHotEncoder` and `ColumnTransformer` to encode your categorical features and prepare your feature matrix in a single step
- How to include this step within a `Pipeline` so that you can cross-validate your model and preprocessing steps simultaneously
- Why you should use scikit-learn (rather than pandas) for preprocessing your dataset
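The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook from the video: the tiny DataFrame and its column names are made-up assumptions, and the model choice (logistic regression) is just an example.

```python
# Minimal sketch: encode categorical columns with OneHotEncoder inside a
# ColumnTransformer, chain it with a model in a Pipeline, and cross-validate.
# The toy data and column names below are illustrative, not the video's dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Fare': [7.25, 71.28, 7.92, 8.05, 8.46, 26.55],
    'Survived': [0, 1, 1, 0, 0, 1],
})
X = df[['Embarked', 'Sex', 'Fare']]
y = df['Survived']

# One-hot encode the categorical columns; pass the numeric column through.
# handle_unknown='ignore' keeps predictions from failing on unseen categories.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Embarked', 'Sex'])],
    remainder='passthrough')

# Chain preprocessing and model into a single estimator
pipe = Pipeline([('preprocess', ct), ('model', LogisticRegression())])

# Cross-validation refits the encoder on each training fold,
# so the preprocessing is validated along with the model (no leakage)
scores = cross_val_score(pipe, X, y, cv=3)
print(scores.mean())
```

Because the encoder lives inside the Pipeline, each cross-validation fold fits it only on that fold's training data, which is exactly why this beats encoding the whole dataset up front with pandas.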
If you want to follow along with the code, you can download the Jupyter notebook from GitHub.
Click on a timestamp below to jump to a particular section:
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
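The last steps in the outline (fitting the Pipeline and making predictions on new data) can be sketched like this. Again, the toy data and column names are illustrative assumptions, not the video's dataset.

```python
# Sketch of "making predictions on new data": fit the Pipeline on all
# training data, then pass raw (unencoded) new rows straight to predict().
# Toy data and column names are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                  'Fare': [7.25, 71.28, 7.92, 8.05]})
y = pd.Series([0, 1, 1, 0])

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Sex'])],
    remainder='passthrough')
pipe = Pipeline([('preprocess', ct), ('model', LogisticRegression())])

# fit() trains both the encoder and the model; predict() reuses the fitted
# encoder, so new data never has to be preprocessed by hand
pipe.fit(X, y)
X_new = pd.DataFrame({'Sex': ['female', 'male'], 'Fare': [30.0, 8.0]})
print(pipe.predict(X_new))
```

This is the practical payoff of the Pipeline: new data arrives in the same raw form as the training data, and the fitted preprocessing is applied automatically and consistently.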
Related Resources
- scikit-learn documentation for OneHotEncoder, ColumnTransformer, and Pipeline
- My video series: Introduction to Machine Learning in Python
- My videos on cross-validation and grid search
- My lesson notebook on StandardScaler
P.S. Want to master Machine Learning in Python? Enroll in my online course, Machine Learning with Text in Python!