Components of a Machine Learning Pipeline
A brief discussion of the components of a standard ML pipeline
A machine learning pipeline makes it easy to produce and deploy models repeatably. An end-to-end machine learning production pipeline involves several steps.
Let's take a look at those steps ↓
The main components of the pipeline are:
- Data Ingestion
- Data Validation
- Data Preprocessing
- Model Training and Tuning
- Model Analysis
- Model Versioning
- Model Deployment
1. Data Ingestion
• Data ingestion is the very first step in the pipeline and does not involve any engineering on the data itself.
• Its main function is to convert incoming data into a format that the downstream components can easily digest.
• Data ingestion can receive data in several formats and performs the required conversion for an efficient run through the pipeline.
• It supports multiple data sources, including cloud storage.
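As a rough sketch of this idea (not any particular library's API), an ingestion step can accept a couple of input formats and normalize them into one record layout for the rest of the pipeline; the formats and field names here are illustrative:

```python
import csv
import io
import json

def ingest(raw: str, fmt: str) -> list:
    """Normalize raw input in a supported format into a list of records
    that downstream components can consume uniformly."""
    if fmt == "csv":
        # DictReader yields one dict per row, keyed by the header line.
        return list(csv.DictReader(io.StringIO(raw)))
    if fmt == "json":
        return json.loads(raw)
    raise ValueError(f"unsupported format: {fmt}")

csv_raw = "age,income\n34,52000\n41,61000"
json_raw = '[{"age": "34", "income": "52000"}]'
print(ingest(csv_raw, "csv"))
print(ingest(json_raw, "json"))
```

Both inputs come out as the same list-of-dicts shape, which is what lets the next components stay format-agnostic.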
2. Data Validation
• Over time, as the data drifts, we retrain the model on new data under a newer version.
• But data quality matters enormously to the model, so we want to make sure the data meets our expectations.
• We check for anomalies, validate the schema, and compute several statistics on the data.
• This lets us recognize problems in the data, if any, and deal with them before training the model on it.
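A minimal sketch of these checks, assuming a hand-written schema of expected columns and value ranges (production systems infer a much richer schema from statistics):

```python
def validate(records, schema):
    """Check each record against the expected columns and simple range rules,
    returning a list of (row_index, anomaly) pairs."""
    anomalies = []
    for i, rec in enumerate(records):
        if set(rec) != set(schema):
            anomalies.append((i, "schema mismatch"))
            continue
        for col, (lo, hi) in schema.items():
            if not (lo <= float(rec[col]) <= hi):
                anomalies.append((i, f"{col} out of range"))
    return anomalies

# Hypothetical schema: column name -> allowed (min, max) range.
schema = {"age": (0, 120), "income": (0, 1e7)}
records = [{"age": 34, "income": 52000}, {"age": -5, "income": 61000}]
print(validate(records, schema))  # [(1, 'age out of range')]
```

Training would only proceed when the anomaly list is empty (or when a human has reviewed and accepted the flagged rows).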
3. Data Preprocessing
• More often than not, we need to do a lot of feature engineering and transformation on the data.
• Transformations such as normalization and one-hot encoding are applied to the data.
• Data preprocessing performs all of these steps, heavy or light, in a scalable manner over large datasets.
• It also ensures there is no inconsistency between preprocessing at training time and at inference time (training/serving skew).
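One way to picture the consistency point: fit the transformation's parameters once on the training data, save them, and reuse the same parameters at inference time. A toy min-max normalization sketch:

```python
def fit_scaler(values):
    """Learn min-max parameters from the training data only."""
    return {"min": min(values), "max": max(values)}

def transform(value, params):
    """Apply the saved parameters; the same function runs at training
    and at inference time, so there is no skew between the two."""
    span = (params["max"] - params["min"]) or 1.0  # guard against constant columns
    return (value - params["min"]) / span

params = fit_scaler([10, 20, 30])   # fitted once, during training
print(transform(20, params))        # 0.5 - identical result at serving time
```

Recomputing the min and max on serving traffic instead of reusing `params` is exactly the kind of inconsistency this component is meant to prevent.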
4. Model Training and Tuning
• Model training is the core step of the pipeline: the model is trained on the newly processed and validated data and then saved.
• The model architecture, data generators, and hyperparameters are all defined within this training component.
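Tuning is often a search over the hyperparameters this component defines. A bare-bones grid-search sketch, with a stand-in `train` function whose loss formula is purely illustrative:

```python
import itertools

def train(data, lr, epochs):
    """Stand-in for real training: returns a pretend validation loss
    that happens to favor lr=0.01 and more epochs."""
    return abs(lr - 0.01) + 1.0 / epochs

# Hypothetical search space over two hyperparameters.
grid = {"lr": [0.1, 0.01], "epochs": [5, 50]}
best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda hp: train(None, **hp),
)
print(best)  # {'lr': 0.01, 'epochs': 50}
```

In a real pipeline each grid point would be a full training run, and the winning configuration (plus its saved model) is what moves on to model analysis.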
5. Model Analysis
• When working with sequential deployments, we train newer versions of our models on new data.
• Once we train a new model, it is not fair to replace the existing deployment without first finding out whether the new model is actually better, and that it is in no way biased.
• In model analysis, we calculate several metrics on slices of the data to make sure the model performs as expected on the validation set.
• We can also compare the metrics we care about across multiple models.
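A sketch of the comparison logic, with made-up metric values: the candidate is only promoted if it matches or beats the current model on every slice, not just overall, which is how a per-slice regression (a possible sign of bias) blocks promotion.

```python
# Hypothetical accuracy metrics for the deployed model and a new candidate,
# computed overall and on one data slice.
baseline = {"overall": 0.81, "slice_a": 0.80}
candidate = {"overall": 0.84, "slice_a": 0.70}

# Promote only if the candidate is at least as good on every tracked slice.
promote = all(candidate[k] >= baseline[k] for k in baseline)
print(promote)  # False: better overall, but it regresses on slice_a
```

This is why slice-level metrics matter: an overall improvement can hide a regression on a subgroup of the data.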
6. Model Versioning
• With all this experimentation and the subsequent deployments, it is essential to keep track of every detail of training as well as of the data.
• Versioning involves maintaining all the gory details: hyperparameters, data, and architecture.
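A toy in-memory registry sketch (real systems persist this in a database or model registry): each version records its hyperparameters and architecture, and the training data is captured by a content hash so a version can later be tied back to exactly the data it saw.

```python
import hashlib
import json

registry = {}

def register(name, version, hyperparams, data, architecture):
    """Record everything needed to reproduce or audit this model version."""
    data_hash = hashlib.sha256(
        json.dumps(data, sort_keys=True).encode()
    ).hexdigest()
    registry[(name, version)] = {
        "hyperparams": hyperparams,
        "data_hash": data_hash,      # fingerprint of the training data
        "architecture": architecture,
    }

# Hypothetical model and training data, for illustration only.
register("churn", "v2", {"lr": 0.01}, [[1, 2], [3, 4]], "2-layer MLP")
print(registry[("churn", "v2")]["hyperparams"])  # {'lr': 0.01}
```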
7. Model Deployment
• Model deployment lets us serve the model without writing much web-app code, and generally exposes REST or RPC endpoints.
• We can also host multiple models and run A/B tests to get better insight into how the models behave in comparison.
• Feedback can be a fully or semi-automated process that captures essential details about the current deployment.
• It tells us how the deployment is performing and how it can be improved.
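The A/B testing piece can be sketched as a deterministic router: hash each request id into a bucket so the same user always hits the same model version while traffic still splits across the two arms (the id format and split are assumptions for the example):

```python
import zlib

def ab_route(request_id: str, split: float = 0.5) -> str:
    """Deterministically assign a request to model arm 'A' or 'B'
    by hashing its id, so repeat requests stay on the same arm."""
    bucket = (zlib.crc32(request_id.encode()) % 100) / 100
    return "A" if bucket < split else "B"

arms = [ab_route(f"user-{i}") for i in range(1000)]
print(arms.count("A"), arms.count("B"))  # roughly a 50/50 split
```

Using `zlib.crc32` rather than Python's built-in `hash` keeps the assignment stable across processes and restarts.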
Before we end this, there are two more components that we should discuss.
Pipeline Orchestration & Metadata Store
• We have seen all the components in the pipeline, but there also needs to be an orchestrator that runs the entire pipeline.
• None of the components communicate with each other directly; instead, they exchange artifacts through a common metadata store that records everything.
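The two ideas together can be sketched in a few lines: the orchestrator is just the loop that runs components in order, and the shared store is the only channel between them (the step names and the dict-as-store are simplifications of real orchestrators and metadata databases):

```python
metadata_store = {}

def ingest_step(store):
    store["raw"] = [1, 2, 3, 4]          # pretend ingested data

def validate_step(store):
    store["valid"] = [x for x in store["raw"] if x > 0]

def train_step(store):
    # Toy "model": just the mean of the validated data.
    store["model"] = sum(store["valid"]) / len(store["valid"])

# The orchestrator runs the components in order; each one reads and
# writes only the shared store, never another component directly.
for step in (ingest_step, validate_step, train_step):
    step(metadata_store)
print(metadata_store["model"])  # 2.5
```

Because every artifact lands in the store, the orchestrator can also cache results, resume failed runs, and record lineage between steps.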
These are some of the steps and components in a standard machine learning pipeline.
Any comments, suggestions, or corrections are welcome.
Thanks for reading!!
Data Scientist @ToTheNew
I study stats, make machine learning models and share all about it.