Data Version Control

One of the resources that matcha provisions by default is a storage container for data. The storage container is tool-agnostic, so whether you want to use the dvc package, LakeFS or some other data version control package is up to you.

You can find all the specifications for the data version control resource in the infrastructure/data_version_control_storage folder inside matcha_ml.

Using the provisioned resources with the dvc package

Let's work through a short example of provisioning a set of resources using matcha, then setting up data version control with the popular dvc package such that it versions and stores data on the provisioned resources. The documentation for the dvc package can be found here.

First, let's provision a set of resources:

$ matcha provision

Once that's finished, we can ask matcha for the connection string to our storage container.

$ matcha get --show-sensitive data-version-control

This will print something like the following to your terminal:

Below are the resources provisioned.
Data version control
    - flavor: FLAVOR
    - connection-string: CONNECTION_STRING
    - account-name: ACCOUNT_NAME
    - container-name: CONTAINER_NAME
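If you would rather not copy these values by hand, a small shell helper can pull a single field out of the printed text. This is only a sketch: `matcha_field` is a hypothetical name, not part of matcha, and it assumes the output has exactly the `- key: value` layout shown above.

```shell
# Hypothetical helper (not part of matcha): read matcha's output on stdin and
# print the value of the first "- key: value" line whose key matches $1.
# Assumes the plain-text layout shown above.
matcha_field() {
    awk -v key="$1" '
        {
            line = $0
            # Drop the leading indentation and the "- " bullet.
            sub(/^[[:space:]]*-[[:space:]]*/, "", line)
            prefix = key ": "
            if (!found && index(line, prefix) == 1) {
                print substr(line, length(prefix) + 1)
                found = 1
            }
        }
    '
}

# Example usage (requires matcha and provisioned resources):
#   output=$(matcha get --show-sensitive data-version-control)
#   container=$(printf '%s\n' "$output" | matcha_field container-name)
```

Because the connection string is sensitive, avoid echoing the captured output into logs or shell history.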

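Now that we have our connection string (keep it secret), we can tell the dvc package where to look for historic data and where to push new, versionable data. This assumes you have followed the steps in the dvc docs for initializing dvc and adding files for it to track, and that a remote named my_dvc already exists in your dvc configuration, since dvc remote modify only changes an existing remote. Substitute the values printed above for the placeholders: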

$ dvc remote modify --local my_dvc connection_string CONNECTION_STRING
$ dvc remote modify my_dvc url azure://CONTAINER_NAME
$ dvc remote modify my_dvc account_name ACCOUNT_NAME
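The three commands above can be wrapped in one function so the values are passed in once. This is a minimal sketch, not something matcha or dvc provides: `configure_dvc_remote` and the `DVC_BIN` variable are hypothetical names introduced here, and the remote is assumed to exist already.

```shell
# Hypothetical wrapper around the three commands above.
# Usage: configure_dvc_remote REMOTE CONTAINER_NAME ACCOUNT_NAME CONNECTION_STRING
configure_dvc_remote() {
    if [ "$#" -ne 4 ]; then
        echo "usage: configure_dvc_remote REMOTE CONTAINER_NAME ACCOUNT_NAME CONNECTION_STRING" >&2
        return 1
    fi
    # DVC_BIN is an assumption of this sketch: set DVC_BIN=echo for a dry run
    # that prints the arguments instead of calling dvc.
    dvc_bin="${DVC_BIN:-dvc}"
    # --local keeps the secret out of the shared dvc config.
    "$dvc_bin" remote modify --local "$1" connection_string "$4" || return 1
    # The remaining settings are not secret and go to the shared config.
    "$dvc_bin" remote modify "$1" url "azure://$2" || return 1
    "$dvc_bin" remote modify "$1" account_name "$3" || return 1
}
```

Running it with `DVC_BIN=echo` first is a cheap way to confirm the arguments are in the right order before touching the real configuration.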

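It's important to make sure your connection string is stored in .dvc/config.local, which is where the --local flag above writes it, so that it never appears in any public repository. dvc excludes config.local from Git by default, whereas .dvc/config is meant to be committed.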
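One way to guard against an accidental leak is to check the shared config before committing. The sketch below assumes the default dvc layout, where the shared config is `.dvc/config` and the local one is `.dvc/config.local`. The function name `check_no_secret_in_shared_config` is hypothetical.

```shell
# Hypothetical check: succeed only if the shared dvc config contains no
# connection_string entry. $1 is the dvc directory and defaults to .dvc.
check_no_secret_in_shared_config() {
    dvc_dir="${1:-.dvc}"
    if grep -q 'connection_string' "$dvc_dir/config" 2>/dev/null; then
        echo "connection_string found in $dvc_dir/config; move it to config.local" >&2
        return 1
    fi
    return 0
}

# Example usage from the repository root:
#   check_no_secret_in_shared_config && git add .dvc/config
```

This only looks for the setting name in one file, so treat it as a convenience check and not a substitute for reviewing what you commit.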

That's it! Whenever you, or anyone you grant access to, run dvc push, dvc pull or dvc checkout, the data is stored in and read from the Azure storage container that matcha provisioned specifically for data version control.