I’ve had the pleasure of speaking at the first PyDataCambridge conference (2019), this is the second PyData conference in the UK after PyDataLondon (which colleagues and I co-founded 6 years back). I’m super proud to see PyData spread to 6 regional meetups and now 2 UK conferences.
I spoke on Higher Performance Python with a focus towards making Pandas operations go faster and an eye on the upcoming Second Edition of our High Performance Python (O’Reilly) book. The talk covers:
- Using line_profiler to evaluate sklearn’s LinearRegression vs NumPy’s lstsq (spoiler – lstsq is much faster but that’s due to sklearn being much safer, the slow-down is all due to safety code in sklearn that helps keep your productivity higher overall)
- Using Pandas for line-by-line iteration (slow) vs apply (faster) and apply with raw=True to expose NumPy arrays (fastest)
- Using Numba to JIT compile lstsq using apply with raw=True for a huge speed-up
- Using Dask to parallelise the Numba solution for further speed-ups
- Advice on being a “highly performant data scientist”
The last point is important – going “compiler happy” and writing highly efficient code may well slow down your team and your overall velocity. Amongst other items I recommended profiling first, maybe introducing Dask & Numba only with a team’s consent and looking at tools like Bulwark to add tests to DataFrames to avoid being derailed by strange data bugs.
Right now Micha and I are busily working to complete the second edition of our book, all going well it’ll be in for Christmas with a publication date around April 2020.
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.
The post “Higher Performance Python” at PyDataCambridge 2019 appeared first on Entrepreneurial Geekiness.
from Planet Python
via read more
No comments:
Post a Comment