Framework Components

The framework consists of several tools that work together to promote Trust. This document describes the specific roles needed and how each role fits with the others. While we have selected a group of software tools that work well together, it is possible to swap out components so long as the functions required by each role are fulfilled. For example, different tools may be required when applying the framework to a legacy environment or code base.

For a more conceptual discussion of Trust in AI, see the framework description. For a more hands-on walkthrough, see the [tutorial].

The roles are:

  1. Source Code Control and Data Control (Control over source code and data)
  2. Dependency Management (Control over third-party software libraries)
  3. Build and Training Management (Control over the training process)

Source Code Control and Data Control

The first component of the framework is software source control. Source control tracks changes to the model code over time. Tracking code changes is essential to tracking the provenance of each trained model. Source control also makes it possible to coordinate changes across a team and to develop automated workflows.

The current state of the art is a distributed version control system, such as Git or Mercurial. We recommend Git if you do not already have a source control system in place. It is very widely used, and while there are public platforms providing Git hosting, using one is not essential; private hosting is easy to set up.
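As a minimal sketch, putting model code under source control with Git might look like the following (the file names here are hypothetical placeholders):

    # create a repository and record the initial model code (hypothetical file names)
    git init
    git add train.py model.py
    git commit -m "Initial model training code"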

In addition to model code, it is also important to track the training data for model provenance. Since training data is often very large, it is not a good fit for source control systems. A storage overlay system is software that integrates with the version control system to handle large files. These overlays track the versions of the data files in the source code repository, but store the large files themselves in a separate location. We recommend the [Data Version Control] (DVC) system since, in addition to providing the storage overlay, it can also perform some of the other roles we will discuss.

As part of the overlay, one can add and remove files and push them to an independent storage location. This makes it possible to track the data used for each training run and to make sure every member of the team is using the same data.
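As a hedged sketch of how this might look with DVC (the data path and remote URL are hypothetical placeholders):

    # track a large data file with DVC instead of Git (hypothetical path)
    dvc add data/train.csv
    git add data/train.csv.dvc data/.gitignore
    git commit -m "Track training data with DVC"

    # configure an independent storage location and push the data there
    dvc remote add -d storage s3://example-bucket/dvc-store
    dvc push

    # a teammate retrieves exactly the same data version
    git pull
    dvc pull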

Dependency management

All software these days is built on third-party libraries, and managing these libraries is essential for any serious project. One needs to track the versions of the libraries used, as well as any libraries they use in turn. This gets complicated quickly, and most programming languages and frameworks have a tool for this dependency management. Using a dependency manager is essential for tracking the provenance of any output files.

Since most neural network models use Python, we recommend using PDM for Python dependency management. The files generated by the dependency manager should then be tracked in the source control system.
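For example, a minimal sketch with PDM might look like this (the package names are only examples):

    # set up the project and record its dependencies (example packages)
    pdm init
    pdm add torch numpy

    # commit the generated project and lock files so the versions are reproducible
    git add pyproject.toml pdm.lock
    git commit -m "Pin project dependencies with PDM"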

Aside: Python dependency management is not simple to figure out. There are two general mechanisms for providing the isolation needed between packages, the [virtualenv] method and the [PEP 582] method. The PEP 582 method has many nice benefits, such as self-contained project directories, but support for it is not quite mature yet. So we recommend using the virtualenv way. Fortunately PDM can handle both, and defaults to using virtualenv. (To configure PDM to use the PEP 582 mode instead, use the command

    pdm config --local python.use_venv False

)

Build tools

As a machine learning project evolves, the training steps will invariably grow and the commands will get more complicated. But even before then, it is important to record the commands used for each training step. A build tool is essential for reproducible builds and model training. Such a tool lets one save all the steps in a file and store it in the source control system. A build tool tracks the commands needed for each task, as well as which tasks depend on others, so that time is not wasted reprocessing things that have not changed.

We recommend using DVC for this role as well, through its pipeline feature.

You will want a system that lets you organize your commands into independent stages and that tracks the inputs and outputs of each stage, so that only the pieces that have changed will be rerun.
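As a hedged sketch, defining a two-stage pipeline with DVC might look like the following (the script names, data paths, and stage names are hypothetical):

    # declare a preprocessing stage with its dependencies and outputs (hypothetical files)
    dvc stage add -n prepare -d prepare.py -d data/train.csv -o data/prepared \
        python prepare.py

    # declare a training stage that depends on the prepared data
    dvc stage add -n train -d train.py -d data/prepared -o model.pkl \
        python train.py

    # run the pipeline; only stages whose inputs changed are re-executed
    dvc repro

    # commit the generated pipeline files
    git add dvc.yaml dvc.lock
    git commit -m "Define training pipeline"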

Experiment management

As models become more complicated, there will usually be parameters that can be adjusted between model trainings. Since parameters tune a model without modifying its code, their settings are not tracked by the source control system. An experiment manager lets us formally declare the parameters and track their settings across model trainings.

DVC is our recommendation for this role since its experiments are integrated with its pipeline build tool. DVC can also run parameter sweeps for model tuning and can plot output metrics for each model. These output files can be saved back to the independent data storage location for documentation and sharing with others.
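As a sketch of what this can look like (assuming a pipeline like the one above and a hypothetical parameter named train.learning_rate declared in params.yaml):

    # run the pipeline as a tracked experiment, overriding a declared parameter
    dvc exp run --set-param train.learning_rate=0.01

    # compare parameters and metrics across experiments
    dvc exp show

    # push the resulting data and models to the shared storage location
    dvc push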