Wednesday, August 19, 2020

Real Python: Data Version Control With Python and DVC

Machine learning and data science come with a set of problems that are different from what you’ll find in traditional software engineering. Version control systems help developers manage changes to source code. But data version control, managing changes to models and datasets, isn’t so well established.

It’s not easy to keep track of all the data you use for experiments and the models you produce. Accurately reproducing experiments that you or others have done is a challenge. Many teams are actively developing tools and frameworks to solve these problems.

In this tutorial, you’ll learn how to:

  • Use a tool called DVC to tackle some of these challenges
  • Track and version your datasets and models
  • Share a single development computer between teammates
  • Create reproducible machine learning experiments

This tutorial includes several examples of data version control techniques in action. To follow along, you can get the repository with the sample code by clicking the following link:

Get the Source Code: Click here to get the source code you'll use to learn about data version control with DVC in this tutorial.

What Is Data Version Control?

In standard software engineering, many people need to work on a shared codebase and handle multiple versions of the same code. This can quickly lead to confusion and costly mistakes.

To address this problem, developers use version control systems, such as Git, that help keep team members organized.

In a version control system, there’s a central repository of code that represents the current, official state of the project. A developer can make a copy of that project, make some changes, and request that their new version become the official one. Their code is then reviewed and tested before it’s deployed to production.

These quick feedback cycles can happen many times per day in traditional development projects. But similar conventions and standards are largely missing from commercial data science and machine learning. Data version control is a set of tools and processes that tries to adapt the version control process to the data world.

Having systems in place that allow people to work quickly and pick up where others have left off would increase the speed and quality of delivered results. It would enable people to manage data transparently, run experiments effectively, and collaborate with others.

Note: An experiment in this context means either training a model or running operations on a dataset to learn something from it.

One tool that helps researchers govern their data and models and run reproducible experiments is DVC, which stands for Data Version Control.

What Is DVC?

DVC is a command-line tool written in Python. It mimics Git commands and workflows to ensure that users can quickly incorporate it into their regular Git practice. If you haven’t worked with Git before, then be sure to check out Introduction to Git and GitHub for Python Developers. If you’re familiar with Git but would like to take your skills to the next level, then check out Advanced Git Tips for Python Developers.

DVC is meant to be run alongside Git. In fact, the git and dvc commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.

Git can store code locally and also on a hosting service like GitHub, Bitbucket, or GitLab. Likewise, DVC uses a remote repository to store all your data and models. This is the single source of truth, and it can be shared amongst the whole team. You can get a local copy of the remote repository, modify the files, then upload your changes to share with team members.

The remote repository can be on the same computer you’re working on, or it can be in the cloud. DVC supports most major cloud providers, including AWS, GCP, and Azure. But you can set up a DVC remote repository on any server and connect it to your laptop. There are safeguards to keep members from corrupting or deleting the remote data.

When you store your data and models in the remote repository, a .dvc file is created. A .dvc file is a small text file that points to your actual data files in remote storage.

The .dvc file is lightweight and meant to be stored with your code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.

The best way to understand DVC is to use it, so let’s dive in. You’ll explore the most important features by working through several examples. Before you start, you’ll need to set up an environment to work in and then get some data.

Set Up Your Working Environment

In this tutorial, you’ll learn how to use DVC by practicing on examples that work with image data. You’ll play around with lots of image files and train a machine learning model that recognizes what an image contains.

To work through the examples, you’ll need to have Python and Git installed on your system. You can follow the Python 3 Installation and Setup Guide to install Python on your system. To install Git, you can read through Installing Git.

Read the full article at https://realpython.com/python-data-version-control/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...