Creating a DVC Pipeline

Now let’s use the power of DVC. Once we have some of our pipeline built in the nbs/ directory, we can use DVC to automate it using a DVC pipeline.

Introducing the Pipeline

A DVC pipeline can be initialized from the command line using this documentation, or by following the steps below.
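For reference, DVC can also generate stage entries from the command line with dvc stage add. A minimal sketch, using the data_split stage from the example below as an illustration:

pdm run dvc stage add -n data_split \
    -d src/data_split.py -d data/pool_data \
    -o data/train_data -o data/test_data \
    python src/data_split.py

This appends the corresponding entry to dvc.yaml; for the rest of this page we'll edit dvc.yaml by hand instead.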

First, create a dvc.yaml file in the project root directory. This file will serve as the map of our pipeline, which DVC uses to automate it. Here's an example three-stage pipeline taken from the iterative/example-get-started-experiments repo. We'll explain what everything means below.

stages:
  data_split:
    cmd: python src/data_split.py
    deps:
    - data/pool_data
    - src/data_split.py
    params:
    - base
    - data_split
    outs:
    - data/test_data
    - data/train_data
  train:
    cmd: python src/train.py
    deps:
    - data/train_data
    - src/train.py
    params:
    - base
    - train
    outs:
    - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - data/test_data
    - models/model.pkl
    - src/evaluate.py
    params:
    - base
    - evaluate

As you can see, pipelines are defined by stages. Each stage requires a command (cmd:) to run the stage, the necessary dependencies (deps:) for that stage, and the outputs (outs:) from that stage. In addition, we can add params: as seen above, and plots:, which will be explained later.

While the pipeline above works perfectly fine, the frameworks team has made a few changes to fit our toolset and best practices. Let’s go over the process for adding a stage now.

How it Works

DVC pipelines are run using:

pdm run dvc repro

This ‘reproduces’ the entire pipeline top to bottom, tracking all inputs and outputs in a dvc.lock file. This file looks very similar to the example above, except it also contains the hash and file size for each input and output.

Here’s an example of one input entry:

path: data/pool_data
md5: 14d187e749ee5614e105741c719fa185.dir
size: 18999874

This allows DVC to track, with git, exactly what goes in and what comes out. Furthermore, on execution of dvc repro, DVC checks this lock file: if the current hashes of all inputs and outputs match those listed in the file, it will not rerun the pipeline; if any are missing or different, it reruns the affected stages.
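To see which stages DVC currently considers out of date, or to force a rerun regardless of the lock file, the usual DVC commands work through pdm as well (the stage name below is just an example from the pipeline above):

pdm run dvc status
pdm run dvc repro --force data_split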

The Framework’s Approach

In an empty dvc.yaml file, add the stages: line at the top and the name of your first stage, like the following:

stages:
  data_collection:

Now let’s add the cmd: entry.

Running Notebooks with Papermill

If you are using .py files for each stage, adding python stage.py as the command works perfectly fine. However, if you are using .ipynb files for each stage (as described in the previous pages), we need a few extra tricks.

We’ll use papermill, a tool for running (and parameterizing) notebooks. Papermill can be installed with the following:

pdm add papermill

Papermill works by taking an input notebook, running all cells from top to bottom, then writing all output to a new notebook. Usage is papermill <input_nb> <output_nb>. Since DVC will be running this from our project root, let’s add a new directory, scripts/, to hold all the output notebooks. This directory can be added to .gitignore.
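Before wiring this into DVC, it can help to try the command once by hand from the project root. A sketch, using the notebook from the example below:

mkdir -p scripts
echo "scripts/" >> .gitignore
pdm run papermill nbs/00_Collecting_Data.ipynb scripts/00_Collecting_Data.ipynb --cwd nbs/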

Here’s an example using papermill for our data_collection stage:

stages:
  data_collection:
    cmd: >
        papermill
        nbs/00_Collecting_Data.ipynb
        scripts/00_Collecting_Data.ipynb
        --cwd nbs/

The --cwd nbs/ part executes the notebook from the nbs/ directory. This ensures any imports/exports with relative paths are routed properly.
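For example, a notebook cell like the following (the raw input file name is hypothetical) resolves its relative paths from nbs/, so it behaves the same whether the notebook is run interactively inside nbs/ or by DVC via papermill:

import pandas as pd

# With --cwd nbs/, these paths resolve relative to nbs/, just as they do
# when the notebook is run interactively from that directory.
df = pd.read_csv("../data/raw-data/example.csv")  # hypothetical raw file
df.to_csv("../data/Concatenated_Orig_data.csv", index=False)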

Adding Dependencies

Now let’s add dependencies for this stage. Obviously we need to add the notebook itself to deps:, but we’ll also add pdm.lock, .pdm.toml, and pyproject.toml. We do this for every stage to track the project dependencies, their versions, and the sub-dependencies the notebook requires. Finally, we also add whatever our data source is.

Here’s an example:

stages:
  data_collection:
    cmd: >
        papermill
        nbs/00_Collecting_Data.ipynb
        scripts/00_Collecting_Data.ipynb
        --cwd nbs/
    deps:
      - nbs/00_Collecting_Data.ipynb
      - pdm.lock
      - pyproject.toml
      - data/raw-data
      - .pdm.toml

Adding Outputs

Finally, we add all outputs from this stage. We’ll add the output notebook from the papermill command and whatever data we output. Here’s an example continuing the above:

stages:
  data_collection:
    cmd: >
        papermill
        nbs/00_Collecting_Data.ipynb
        scripts/00_Collecting_Data.ipynb
        --cwd nbs/
    deps:
      - nbs/00_Collecting_Data.ipynb
      - pdm.lock
      - pyproject.toml
      - data/raw-data
      - .pdm.toml
    outs:
      - scripts/00_Collecting_Data.ipynb
      - data/Concatenated_Orig_data.csv

This completes one basic stage. We’ll repeat this process for every stage to fully automate the pipeline.
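For example, a hypothetical second stage (the notebook and file names below are placeholders) would sit alongside data_collection under the same stages: key. Because its deps: include data/Concatenated_Orig_data.csv, an output of data_collection, DVC infers that data_collection must run first:

  data_cleaning:
    cmd: >
        papermill
        nbs/01_Cleaning_Data.ipynb
        scripts/01_Cleaning_Data.ipynb
        --cwd nbs/
    deps:
      - nbs/01_Cleaning_Data.ipynb
      - pdm.lock
      - pyproject.toml
      - .pdm.toml
      # output of the data_collection stage, which links the two stages
      - data/Concatenated_Orig_data.csv
    outs:
      - scripts/01_Cleaning_Data.ipynb
      - data/Cleaned_data.csv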

Adding Parameters

In the introduction above, we discussed the idea of adding parameters to our pipeline. This is an incredibly powerful way to quickly tune hyperparameters of our training script and other stages to compare results. This becomes especially evident once DVC experiments are introduced to the mix. This video demonstrates the motivation here very well.

Like above, the frameworks team uses a few tricks to parameterize Jupyter Notebooks. This article covers the topic in depth; we’ll cover the highlights here.

We’ll use papermill to parameterize our notebooks. To get started, we’ll add a parameters cell to 03a_Training_Model.ipynb and tag that cell with the parameters tag. Here’s an example:

# default values; papermill overrides these at run time
KFOLD: int = 1
TOKENIZER: str = "bert-base-cased"
LEARNING_RATE: float = 5e-5
BATCH_SIZE: int = 8
EPOCHS: int = 2

We can now use these parameters anywhere in our code as variables.
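For instance, a later notebook cell might gather them into a training config (a sketch; the config dict itself is just an illustration):

# downstream cells reference the injected parameters directly
config = {
    "tokenizer": TOKENIZER,
    "learning_rate": LEARNING_RATE,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "kfold": KFOLD,
}
print(f"Training for {config['epochs']} epochs at lr={config['learning_rate']}")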

Next, we’ll create a params.yaml file in our project root. This allows us to set parameters outside of our notebook to be used in our pipeline. With this, we can rerun the pipeline with new parameters without editing our notebook.

Here’s an example of params.yaml:

tokenizer: bert-base-cased
learning_rate: 5e-05
batch_size: 8
epochs: 2
kfold: 5

Finally, let’s connect params.yaml to 03a_Training_Model.ipynb using our DVC pipeline. We do this by adding a few parameter (-p) flags to our papermill command in dvc.yaml. Here’s an example of a training stage:

train_nn:
  cmd: >
    papermill
    nbs/03a_Training_Model.ipynb
    scripts/03a_Training_Model.ipynb
    -p TOKENIZER ${tokenizer}
    -p LEARNING_RATE ${learning_rate}
    -p BATCH_SIZE ${batch_size}
    -p EPOCHS ${epochs}
    -p KFOLD ${kfold}
    --cwd nbs/

For each parameter we add -p <name of parameter in notebook> ${<name of parameter in params.yaml>}. Be sure to also add params.yaml to the deps: section.
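Putting it together, the full stage might look like the following sketch (the training-data path and model output are placeholders; the pdm files mirror the earlier stages):

train_nn:
  cmd: >
    papermill
    nbs/03a_Training_Model.ipynb
    scripts/03a_Training_Model.ipynb
    -p TOKENIZER ${tokenizer}
    -p LEARNING_RATE ${learning_rate}
    -p BATCH_SIZE ${batch_size}
    -p EPOCHS ${epochs}
    -p KFOLD ${kfold}
    --cwd nbs/
  deps:
    - nbs/03a_Training_Model.ipynb
    - pdm.lock
    - pyproject.toml
    - .pdm.toml
    - params.yaml
    - data/Cleaned_data.csv   # placeholder for the training data
  outs:
    - scripts/03a_Training_Model.ipynb
    - models/model.pkl        # placeholder for the trained model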

When reproducing the pipeline, papermill will now overwrite the notebook parameters with the corresponding values in params.yaml. Any outputs will reflect these changes.