
The Journey from Development to Production Part 2: Project Governance

Project Governance & compliance is the most important factor in developing any problem-solving ML system.


Akash Desarda

3 years ago | 6 min read

This is Part 2 of the series The Journey from Development to Production. If you have not read Part 1, I highly suggest you read it first, as it explains the approach & mindset we need for effective Project Governance.

What is Project Governance? Why is it so important? It seems so untechnical, doesn't it?

Indeed, Project Governance is not related to tech, but trust me, adopting it will bring great gains to any ML project. (It certainly did at my work… :) )

Project Governance is

  • An oversight function that encompasses the project life cycle by providing the project manager and team with structure, processes, decision-making models, and tools for managing and controlling the project, while also ensuring its successful delivery. It is a crucial element, especially for complicated and risky projects.
  • It defines, documents, and communicates consistent project practices to provide a comprehensive method of controlling the project and ensuring its success. It contains a framework for making decisions about the project, defines roles, responsibilities, and liabilities for the accomplishment of the project, and governs the effectiveness of the project manager.

It has proven its effectiveness in other domains, so the Machine Learning domain is no exception.

How to adopt effective Project Governance?

1. Governance Models:

  • Based on the project’s scope, timeline, complexity, risk, stakeholders, and importance to the organization, the organization should formulate a baseline of critical elements needed for project governance.
  • There should be a primary tool that, based on some of the above indicators, decides what changes your governance framework needs and which components are compulsory.

2. Accountability and Responsibilities:

  • Defining accountability and responsibilities is at the core of the project manager's tasks. Improper distribution of accountabilities and responsibilities will have a negative impact on the effectiveness of the organization's operations.
  • While defining both factors, the project manager needs to define not only who is accountable, but also who is responsible, consulted, and notified for each of the project's deliverables.

3. Project Initialization:

  • One of the scariest traps in collaboration is the 'confusion trap'. It occurs when everyone uses a different project structure. At least 30% of the time then gets wasted just understanding the project rather than on actual implementation. Every team member must follow the same project structure.
  • Use Cookiecutter Data Science. It is a logical, reasonably standardized, but flexible project structure for doing and sharing data science work. It is pretty simple to use & they have good documentation too (a trimmed view of the generated layout is shown below the snippet).
# Install
pip install cookiecutter

# Init is as simple as
cookiecutter https://github.com/drivendata/cookiecutter-data-science
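
For a rough sense of what you get, this is a trimmed view of the kind of layout the template generates (exact folder names may differ between template versions):

my_project/
├── data/          <- raw, interim & processed data (kept out of Git)
├── models/        <- trained model artifacts
├── notebooks/     <- exploratory notebooks
├── reports/       <- generated analysis & figures
├── src/           <- importable source code for the project
└── Makefile, README.md, requirements.txt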

4. Git Ops & compliance:

  • Git is the most important innovation of the 21st century in Software Development. Use Git (& a remote like GitHub or GitLab, etc.) to track your project. (Note: I am assuming you understand Git or have at least a basic sense of it; if not, then please drop everything & learn it.)
  • As I mentioned in Part 1, Git enables us to accommodate the microservice approach easily.

Eg: You can have a separate branch for training, inference, serving, etc., so that each team member works in isolation but at the same time in collaboration, as sketched below.
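
A minimal sketch of that branch-per-service setup (the branch names here are only illustrative):

# Create one long-lived branch per service
git branch training
git branch inference
git branch serving

# Each team member then works on their own service branch
git checkout training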

Some of the best practices that should be followed:

  • Everyone must specify their identity:
    → git config user.name "Your Name"
    → git config user.email "email@something.com"
  • Large files (e.g. weight files, large CSVs, images, etc.) should not be included in the repo (see the .gitignore sketch after this list).
  • The master branch should always be bug-free and ready to run.
  • Logging of every change or commit should be intuitive.
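
A simple .gitignore is usually enough to keep large artifacts out of Git; the patterns below are only illustrative and should be adapted to your project (DVC, covered later, also manages its own entries):

# Illustrative .gitignore entries for large artifacts
cat >> .gitignore <<EOF
*.h5
*.pt
*.ckpt
data/raw/
EOF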

Convention for git logging:
→ Every commit must include an informative message.

  • Convention to write a git commit message (concrete examples follow after the special cases below):

→ {Process logic/code/folder structure refactor/change/(any similar kind of initiative message)}<not more than five words> for {class: class name/method: method name/file}<mention the full hierarchy> in {branch}

Special cases:

  • Merging branches: the branch must pass some kind of test.
    → Convention for the merge commit message: {summary of one or two major pieces of logic}
  • Resolving an issue: {brief summary} for {issue no/issue name}

In all cases:
    → The commit message should not exceed 50 characters (including spaces)
    → Use imperative language
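
To make the convention concrete, here are a couple of example messages (the class, branch, and issue names are purely hypothetical):

# A regular change following the convention
git commit -m "Refactor loader for class:DataLoader in training"

# Resolving an issue
git commit -m "Fix NaN bug for issue #42"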

Extra: Read this awesome blog on writing effective commit messages.

5. Data VCS using DVC:

  • Data is one of the key components of any ML system, so it is very important to version it just like our code. But data is large (GBs, TBs, etc.), which makes it impractical to version-control with Git. So here comes DVC to our rescue. I suggest you go through their documentation first; it is one of the best out there.
  • DVC works very well (& easily) with cloud storage like AWS S3, GCP Storage, etc., as well as with local storage.

Tutorial for DVC (needless to say, this covers only the basics; here I aim to give you a brief idea. You can always find more advanced topics & tutorials in their docs).

# Step 0: Install
pip install dvc
pip install "dvc[gs]" # if installing for GCP storage, or likewise for other backends

# Step 1: Setup
# DVC works inside a Git-enabled project
git init .
dvc init

# Step 2: Add a remote; a remote is just a backend to store DVC-versioned data
dvc remote add -d myremote /location/on/disk # local remote
dvc remote add -d myremote gs://some/bucket  # GCP storage

# Step 3: Add data to DVC
dvc add ./data
# additional step for cloud storage
dvc push
  • We can also share DVC-versioned data with others:
# Sharing data
git clone some-repo-having-dvc@someremote.git
dvc fetch    # download the data into the local DVC cache
dvc checkout # place the fetched data into the workspace (dvc pull does both in one go)
  • The biggest benefit of data VCS is time travel. Maintaining different folders for different versions of data is just so painful, I know. Separating & merging them for every variation, much like brute-force hyperparameter tuning, is tedious. But this can be done in a snap using data time travel.

Some Pro tips to enable Data Time Travel

There are multiple ways to do it, but I will share the one that worked best for us.

→ We will be using Git & DVC

→ Let us assume you are adding new data every two weeks (or at any other interval; the exact timing is not important). I am also assuming you understand 'git tag'.

# Step 1: Add data to DVC
dvc add ./data
# git tag to version the data
git tag <date> # <date> can be replaced by anything; a date just makes it easy to backtrack

# Step 2: Assuming you have added new data to the same ./data
dvc add ./data
git tag <new date>

# Step 3: Assuming the new data didn't work for you & the older data was good
git checkout <date>
dvc checkout
# That's it, you just did a time travel in just two steps...

→ In place of git tag, we could also use git branch, but that would get a lot messier considering you already have lots of branches for your microservices.

Optional: Always try to use DVC in the master branch. If you want to use some kind of experimental data, create a new branch & use it there. Do not disturb the good data of the master branch.

Note: There is a lot more to discuss with regard to DVC, so I will post a separate blog dedicated just to DVC.

6. ML Job Tracking using MLflow:

  • This part is optional but can prove very vital if you incorporate it into your project.
  • MLflow is not exactly part of Project Governance; it is more about ML job tracking (e.g. tracking training runs), comparing results, reproducing an older result, etc. So I feel it may not strictly fall under Project Governance, but it will play a significant role in overall development.
  • So I would insist you go to their docs & explore it. Integrating it into your project is very easy, and they too have very intuitive documentation and good tutorials. A minimal way to get a feel for it is sketched below.
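
A minimal sketch of getting started, assuming a local setup; the example project URL and its alpha parameter come from MLflow's own quickstart, so adjust them to your project:

# Step 0: Install
pip install mlflow

# Launch the tracking UI locally (by default at http://localhost:5000)
mlflow ui

# Run an example project straight from a Git repo; the run is logged & can be compared in the UI
mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=0.5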

So let's recap the story so far from both Part 1 & Part 2, shall we?

  1. Before getting your hands dirty, first devise a general life cycle involving all stakeholders.
  2. Developing with a microservice approach.
  3. Packaging each service (using container technology like Docker).
  4. Governing the project for rapid development but with effective collaboration.
  5. Tracking code with Git, data with DVC & experiments with MLflow.
