A less discussed, yet important, topic in the development of machine learning pipelines

Many companies are using the power of machine learning to provide predictions, recommendations, or classifications on both the front end and back end of their applications. A report from The Verge states that “Eventually, pretty much everything will have [machine learning] somewhere inside.” However, building and deploying machine learning models to a pipeline is not the end of the development cycle. Machine learning models need to be monitored and refitted over time to make sure that they don’t encounter unexpected issues and remain effective. The flowchart below illustrates the development cycle of a machine learning model, how performance monitoring fits into the picture, and how humans in the loop can feed learning back into the model. A full introduction to “What is Machine Learning?” can be found in the article linked below.

Source: http://towardsdatascience.com/not-yet-another-article-on-machine-learning-e67f8812ba86

There are many steps involved in the development of a machine learning model, and each step requires different skills and knowledge to execute. This article focuses on a less-discussed area of machine learning development: performance monitoring of the machine learning pipeline.

Machine Learning Pipeline monitoring can be broken down into two parts:

  1. ETL Monitoring
  2. Score Monitoring

In short, ETL Monitoring focuses on the data before it goes into the model, while Score Monitoring looks at the output of the model. Together, they let data scientists see the correlation between the data and the behaviour of their models. This understanding is crucial for identifying areas of improvement and detecting breakdowns in the pipeline. This article focuses on ETL Monitoring.

ETL Monitoring of Machine Learning Pipeline

The first area to monitor in a machine learning pipeline is the feature extraction process, where input data is transformed into numerical features before it is fed into a machine learning model for classification. However, a deep learning model may not have feature extraction as a separate process, as seen in the figure below. To avoid confusion, this part focuses only on traditional machine learning models, such as random forests, that have feature extraction as a separate step in the process.

Source: https://towardsdatascience.com/cnn-application-on-structured-data-automated-feature-extraction-8f2cd28d9a7e

Since we look at the data before it goes into the model, ETL monitoring is less about actual performance monitoring and more a safeguard that keeps bad data from getting into the model. Corrupted data or unusual patterns can throw off the model and result in unreliable classifications. In other words, “garbage in, garbage out.”

Some data science projects take their data from internal operations, which makes the input clean, structured, and explainable. However, machine learning projects are often applied to wild data sources precisely because simple rule-based algorithms cannot handle them. Examples of such sources are log files, user-generated content, third-party suppliers, and open data. Companies have little control over how this data is generated, when it will be available, or what changes might happen to its schema.

The risk of using external data sources is that the schema or pattern of the data may change over time without notification or the team’s awareness. For example, someone may vandalize the open data, the method of content generation may change, or suppliers may alter the schema of the data without telling downstream users. Any of these can threaten the performance of the ML model and create critical issues if left unnoticed for too long.

In order to monitor the ETL process effectively and prevent critical data issues from affecting the model and its performance, there are three aspects to consider when selecting metrics for ETL monitoring: integrity, consistency, and availability of the data.

Integrity

There are many reasons that the integrity of the data can be jeopardized. Some of them are:

  • Bugs in the data pipeline
  • Software updates in the infrastructure
  • Routine maintenance at the data source
  • Data vandalism from an external party
  • Changes in the schema of the data at the source

The integrity of the data can be monitored by creating comprehensive validation rules to catch and report exceptions. Validation rules are usually developed during exploratory data analysis and model training. Metrics that can be used include, for instance, the number of missing or unseen features, or the ratio of data points that fail each validation rule.
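As an illustration, here is a minimal sketch of how such integrity metrics might be computed for one batch of data. It assumes a pandas DataFrame and two hypothetical validation rules; the column names and allowed values are made up for the example, not taken from any particular pipeline.

```python
# A minimal sketch of integrity metrics, assuming a pandas DataFrame of raw
# features with hypothetical columns "age" and "country".
import pandas as pd

# Hypothetical validation rules developed during exploratory data analysis.
VALIDATION_RULES = {
    "age_in_range": lambda df: df["age"].between(0, 120),
    "country_known": lambda df: df["country"].isin({"CA", "US", "UK"}),
}

def integrity_metrics(df: pd.DataFrame, expected_columns: set) -> dict:
    """Report missing/unseen features and the failure ratio per rule."""
    metrics = {
        "missing_features": sorted(expected_columns - set(df.columns)),
        "unseen_features": sorted(set(df.columns) - expected_columns),
    }
    for name, rule in VALIDATION_RULES.items():
        passed = rule(df).fillna(False)  # treat nulls as failures
        metrics[f"fail_ratio_{name}"] = 1.0 - passed.mean()
    return metrics

batch = pd.DataFrame({"age": [34, 290, None], "country": ["CA", "US", "??"]})
print(integrity_metrics(batch, expected_columns={"age", "country", "income"}))
```

Logging the output of a function like this for every batch gives a running picture of how often, and how badly, the source violates expectations.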

Even though data scientists would exclude corrupted data from going into the model, it is imperative to monitor the magnitude and severity of data issues in order to gauge the reliability of the data source and to keep developing new rules that validate data integrity.

Consistency

The second principle for ETL monitoring metrics is the consistency of the data. Consistency is important because a data point may pass the validation rules while its feature values are vastly different from the historical data on which the model was trained. The causes that impact the integrity of the data also affect its consistency. In the case of user-generated data, a change in consistency can also happen organically due to a shift in how users interact with the platform.

Inconsistent data affects the performance of ML models because it changes the patterns of the data on which the model was trained. The visualization below clarifies why inconsistency in the data could ruin the performance of the ML model.

In order to monitor consistency, the volume, rate of change, or individual values of the data can be aggregated into metrics to be monitored. Each aggregated value then becomes a data point in a time series, where the X-axis is time and the Y-axis is the measurement used to aggregate the data from each batch or time window (depending on whether the data arrives in batches or as a stream).
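As a sketch of this idea, the snippet below turns each batch of an assumed “income” feature into one point of a monitoring time series (volume, mean, and standard deviation). The feature name and the daily batch schedule are illustrative assumptions, not prescriptions.

```python
# A minimal sketch of turning per-batch feature values into a monitoring
# time series, assuming daily batches and a hypothetical "income" feature.
import pandas as pd

def batch_metrics(batch: pd.DataFrame, batch_date: str) -> dict:
    """Aggregate one batch into a single point on the monitoring time series."""
    return {
        "date": pd.Timestamp(batch_date),       # X-axis: time
        "row_count": len(batch),                 # volume of the batch
        "income_mean": batch["income"].mean(),   # Y-axis candidates:
        "income_std": batch["income"].std(),     # aggregated measurements
    }

# Each incoming batch appends one row; plotting any column against "date"
# gives the time series to watch for drift.
history = pd.DataFrame([
    batch_metrics(pd.DataFrame({"income": [52_000, 61_000, 48_000]}), "2018-01-01"),
    batch_metrics(pd.DataFrame({"income": [50_500, 95_000, 47_000]}), "2018-01-02"),
])
print(history)
```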

Time Series Data Example
Source: https://towardsdatascience.com/almost-everything-you-need-to-know-about-time-series-860241bdc578

There are many measurements that can be used to represent the distribution and skewness of the data, such as variance and percentiles. From my experience, mean and standard deviation are the most effective combination: the mean shows the centre of the distribution, while the standard deviation shows its spread. The graph below shows how means and standard deviations represent different distributions.

Source: https://www.varsitytutors.com/hotmath/hotmath_help/topics/normal-distribution-of-data

Let’s explain this with a real-world example. Suppose we build an ML model to predict the life expectancy of a population based on socio-economic data. One of the features is income, and the data arrives in yearly batches. We can compute the mean and standard deviation of the population’s income and compare them from year to year in order to analyze the consistency of the income distribution. If the mean or standard deviation of income in 2018 is vastly different from 2017, and the model was trained on 2017 data, it is an indicator that the pattern has changed and the predictions might not be as accurate as when the model was trained and tested. It would also be a good idea to refit the model with 2018 data, a topic I will touch on in a future post.
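A minimal sketch of this year-over-year comparison might look like the following; the 20% tolerance and the income figures are arbitrary assumptions used only for illustration.

```python
# Compare the mean and standard deviation of a feature between the training
# year and a new year, and flag the feature as inconsistent when either
# shifts by more than a chosen tolerance (20% is an arbitrary assumption).
import numpy as np

def consistency_check(train_values, new_values, tolerance=0.20):
    train_mean, train_std = np.mean(train_values), np.std(train_values)
    new_mean, new_std = np.mean(new_values), np.std(new_values)
    mean_shift = abs(new_mean - train_mean) / abs(train_mean)
    std_shift = abs(new_std - train_std) / train_std
    return {
        "mean_shift": mean_shift,
        "std_shift": std_shift,
        "consistent": mean_shift <= tolerance and std_shift <= tolerance,
    }

income_2017 = [42_000, 51_000, 38_500, 60_000, 47_500]   # training batch
income_2018 = [44_000, 83_000, 39_000, 120_000, 49_000]  # new yearly batch
print(consistency_check(income_2017, income_2018))
# If "consistent" is False, the model trained on 2017 data likely needs a refit.
```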

Availability

Arguably, the availability of the data is the most important aspect to consider when monitoring the pipeline. This is especially true when the pipeline takes data from sources the team has no control over, such as a third-party supplier or open data. Without data coming in, the ML pipeline is like an abandoned railway track, and the investment made to build it is wasted.

The availability of the data may not affect the precision of the ML model, but it does affect the performance of the ML pipeline as a whole, since the business objectives can’t be achieved when the data doesn’t come in (or comes in late). Examples of KPIs that can be used to monitor the availability of the data are:

  • Amount of delay time between each batch of data
  • Total minutes, hours, or days, without data in the pipeline
  • Number of times that data is delivered behind schedule

Knowing the above metrics allows the data science team to identify areas of improvement and be notified when the data is unavailable. If multiple data sources feed into the pipeline, the metrics also help the team understand the reliability of each source, so that issues or concerns can be communicated to the right group.
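For illustration, here is a minimal sketch of how the availability KPIs above might be computed from a list of batch arrival timestamps, assuming a daily delivery schedule; the schedule and timestamps are hypothetical.

```python
# A minimal sketch of availability KPIs, assuming a daily delivery schedule
# and a list of timestamps at which batches actually arrived.
from datetime import datetime, timedelta

EXPECTED_INTERVAL = timedelta(days=1)  # assumed delivery schedule

def availability_kpis(arrival_times):
    arrivals = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    delays = [gap - EXPECTED_INTERVAL for gap in gaps if gap > EXPECTED_INTERVAL]
    return {
        # longest stretch without data in the pipeline
        "max_gap_hours": max(gaps).total_seconds() / 3600 if gaps else 0.0,
        # total delay time accumulated across late batches
        "total_delay_hours": sum(d.total_seconds() for d in delays) / 3600,
        # number of times data was delivered behind schedule
        "late_deliveries": len(delays),
    }

arrivals = [
    datetime(2019, 1, 1, 6), datetime(2019, 1, 2, 6),
    datetime(2019, 1, 4, 9),  # one batch skipped, next one late
]
print(availability_kpis(arrivals))
```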

This is the end of the article! I hope that you now know how to monitor the integrity, consistency, and availability of your data, and why they are crucial to the performance of a machine learning pipeline.

The second part of Machine Learning Pipeline Monitoring, covering Score Monitoring, will follow in the next article. You can follow me for more articles on performance monitoring and data visualization. Thank you for reading this far, and I hope you enjoyed this article!