Monday, November 11, 2019

Chris Moffitt: Book Review: Machine Learning Pocket Reference

Introduction

This article is a review of O’Reilly’s Machine Learning Pocket Reference by Matt Harrison. Since Machine Learning can cover a lot of topics, I was very interested to see what content a “Pocket Reference” would contain. Overall, I really enjoyed this book and think it deserves a place on many data science practitioners’ bookshelves. Read on for more details about what is included in this reference and who should consider purchasing it.

Physical Size

I purchased this book from Amazon shortly after it was released. Since I was interested in the content and the price was relatively low for a new O’Reilly book ($24.99), I impulsively purchased it without any research. When it showed up, I laughed a little. I did not realize that the book was as small as it was. Obviously I should not have been surprised. It is a “Pocket Reference” and the product dimensions are listed on the page, but I never put 2 and 2 together.

Just for comparison, here’s a picture comparing this book to Chris Albon’s book:

Book Comparison

I bring up the size for two reasons. First, the small size means I would not hesitate to carry it around in my laptop bag. I realize many people like electronic copies but I like the idea of a paper reference book. From this perspective, the portability aspect is a positive consideration for me, though it might not be for you.

The second point is that the small size means there is not a lot of real estate on the pages. For short code snippets, this is not an issue. However, for longer code sections or large visualizations it is not optimal. For example, on page 205 there is a complex decision tree that is really tiny. There are a handful of other places in the book where the small physical size makes the visuals difficult to see.

However, I don’t view the size as a huge negative issue. The author graciously includes Jupyter notebooks in his GitHub repo so it is easy to see the details if you need to. Since most readers will likely buy this without seeing it in person, I wanted to specifically mention this aspect so you could keep it in mind.

Who is this for?

There are many aspects of this book that I really like. One of the decisions that I appreciate is that Matt explicitly narrows down the Machine Learning topics he covers. This book’s subtitle is “Working with Structured Data in Python” which means that there is no discussion of deep learning libraries like TensorFlow or PyTorch nor is there any discussion about Natural Language Processing (NLP). This specific decision is smart because it focuses the content and gives the author the opportunity to go deeper in the topics he does choose to cover.

The other aspect of this book that I enjoy is that the author expects the reader to have basic Python familiarity including a base level understanding of scikit-learn and pandas. Most of the code samples are relatively short and use consistent and idiomatic Python. Therefore, anyone who has done a little bit of work in the Python data science space should be able to follow along with the examples.

There is no discussion of how to program with Python and there is only a very brief intro to using pip or conda to get libraries installed. I appreciate the fact that he does not try to cram in a Python introduction and instead focuses on teaching the data science concepts in a crisp and clear manner.

The final point I want to mention is that this is truly a practical guide. There is almost no discussion about the mathematical theory behind the algorithms. In addition, this is not a book solely about scikit-learn. Matt chooses to highlight many libraries that a practitioner would use for real world problems.

Throughout the book, he introduces about 36 different Python data science libraries including familiar ones like seaborn, NumPy, pandas and scikit-learn as well as other libraries like Yellowbrick, mlxtend, pyjanitor, missingno and many others. In many cases, he shows how to perform similar functions in two different libraries. For example, in Chapter 6 there are examples of similar plots done with both seaborn and Yellowbrick.

Some may think it is not necessary to show more than one way to solve a problem. However, I really enjoyed seeing how to use multiple approaches to solving a problem and the relative merits of the different approaches.

Book Organization

The Machine Learning Pocket Reference contains 19 chapters but is only 295 pages long (excluding indices and intro). For the most part, the chapters are very concise. For instance, chapter 2 is only 1 page and chapter 5 is 2 pages. Most chapters are 8-10 pages of clear code and explanation.

Chapter 3 is a special case in that it is the longest chapter and serves as a road map for the rest of the book. It provides a comprehensive walkthrough of working with the Titanic data set to solve a classification problem. The step-by-step process includes cleaning the data, building features, and normalizing data, then using this data to build, evaluate and deploy a machine learning model. The rest of the book breaks down these steps, with each chapter going into more detail on its respective data analysis topic. Here is how the chapters are laid out:

  1. Introduction
  2. Overview of the Machine Learning Process
  3. Classification Walkthrough: Titanic Dataset
  4. Missing Data
  5. Cleaning Data
  6. Exploring
  7. Preprocess Data
  8. Feature Selection
  9. Imbalanced Classes
  10. Classification
  11. Model Selection
  12. Metrics and Classification Evaluation
  13. Explaining Models
  14. Regression
  15. Metrics and Regression Evaluation
  16. Explaining Regression Models
  17. Dimensionality Reduction
  18. Clustering
  19. Pipelines
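
To make the Chapter 3 workflow described above more concrete, here is a minimal sketch of the same sequence of steps (clean, build a feature, normalize, fit, evaluate) using pandas and scikit-learn. It is not code from the book. The data is synthetic, and the column names, the is_alone feature and the choice of logistic regression are all assumptions made for illustration. The deployment step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a passenger table (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "family_size": rng.integers(0, 5, n),
})
# Knock out some ages so there is something to clean.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
# Hypothetical binary target loosely tied to fare.
y = (df["fare"] + rng.normal(0, 10, n) > 25).astype(int)

# Build a feature: flag rows with no family members.
df["is_alone"] = (df["family_size"] == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Clean (impute), normalize (scale) and fit a classifier in one pipeline.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Wrapping the imputer and scaler in a Pipeline means they are fit only on the training rows, which avoids leaking information from the test set into the preprocessing.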

Chapter 13 is a good illustrative example of the overall approach of the book. Model interpretability is a timely and constantly evolving topic with many advancements over the past couple of years. This chapter starts with a short discussion of regression coefficients, then moves on to discuss more recent tools like treeinterpreter, lime and SHAP. It also includes a discussion about how to use surrogate models in place of models that do not lend themselves to the interpretive approaches shown in the chapter. All of this content is discussed with code examples, output visualizations and guidance on how to interpret the results.
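
For readers unfamiliar with the surrogate model idea mentioned above, here is a minimal sketch of the general technique, assuming scikit-learn. It is not taken from the book. The random forest standing in for the hard-to-interpret model, the shallow decision tree used as the surrogate, and the synthetic data are all my own choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, used only to have something to fit.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=42
)

# The "black box" model whose behavior we want to explain.
black_box = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
black_box_preds = black_box.predict(X)

# The surrogate is trained on the black box's predictions, not the true
# labels, so it approximates what the black box does.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=42)
surrogate.fit(X, black_box_preds)

# Fidelity: how often the surrogate agrees with the black box.
fidelity = accuracy_score(black_box_preds, surrogate.predict(X))
print(f"Surrogate agrees with the black box on {fidelity:.0%} of rows")

# The shallow tree can be read directly as a set of rules.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
print(export_text(surrogate, feature_names=feature_names))
```

The fidelity score matters here: a surrogate that rarely agrees with the original model explains very little about it.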

How to Read

When I received the book, I read through it in a couple of sittings. As I read through it, I pulled out lots of interesting notes and insights. Some of them were related to new libraries and some were clever code snippets for analyzing data. The other benefit of going through cover to cover is that I had a good feel for what was in the book and how to reference it in the future when I find myself trying to solve a data science problem.

The pocket reference nature of this book means that it can be helpful for a quick refresher of a topic that is difficult or new to you. A quick review of the chapter may be enough to get you through the problem. It can also be useful for pointing out some of the challenges and trade-offs with different approaches. Finally, the book can be a good jumping off point for further in-depth research when needed.

Other Thoughts

I did not run much of the code from the book but I did not notice any glaring syntax issues. The code uses modern and idiomatic Python, pandas and scikit-learn. As mentioned earlier, there is a brief introduction and some caveats about using pip or conda for installation. There is a reference to pandas 0.24 and the new Int64 data type so the book is as up to date as can be expected for a book published in September 2019.
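
As a quick illustration of why the Int64 data type is worth mentioning (this is my own small example, not one from the book), the nullable integer type introduced in pandas 0.24 lets a column hold integers and missing values at the same time. The sample values below are made up.

```python
import numpy as np
import pandas as pd

# Traditional behavior: a missing value forces the integers to float64.
ages_float = pd.Series([22, 38, np.nan])

# The nullable "Int64" extension dtype (note the capital I) keeps the
# values as integers and stores the gap as a missing value.
ages_int = pd.Series([22, 38, None], dtype="Int64")

print(ages_float.dtype)
print(ages_int.dtype)
print(ages_int.isna().sum())
```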

In the interest of full disclosure, I purchased this book on my own and received no compensation for this review. I am an Amazon affiliate so if you choose to buy this book through a link, I will receive a small commission.

Summary

It is clear that Matt has a strong understanding of practical approaches to using Python data science tools to solve real world problems. I can definitely recommend Machine Learning Pocket Reference as a book to have at your side when you are dealing with structured data in Python. Thank you to Matt for creating such a useful resource. I have added it to my recommended resources list.



from Planet Python