Data Drift And Its Effect on Models’ Performance

November 8, 2022

Introduction

Just like cars, data drifts. However, in order to fully comprehend data drift, you must first understand model drift. Model drift is a change in a model’s behavior. These changes are primarily the result of two factors. The first is a shift in concept, also known as concept drift.
Concept drift occurs when the statistical properties of the target variable that the model predicts, or its relationship to the input features, change over time. Assume you built a model to predict the ratio of in-store shoppers to online shoppers for your company’s supermarket using customer data from 2010 to 2016. The model, if properly trained, would produce accurate results from 2017 to 2019, just before the COVID-19 outbreak.
From 2020 onwards, the model would no longer be accurate enough, because the context in which it was trained had changed. Due to global movement restrictions, customers preferred shopping online in 2020, which significantly shifted consumer behavior. A change in concept like this degrades the model’s performance.
The second factor that contributes to model drift is data drift. Data drift occurs when the distribution of the input data a model receives in production moves away from the distribution of the data used for training and validation. The following sections cover the causes of data drift, its effects, and how to resolve and prevent it.
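To make the idea of a shifting input distribution concrete, here is a minimal sketch that compares a training sample of one numeric feature against newer samples using the Population Stability Index (PSI). PSI is one common drift score, not a method this article prescribes. The function name, the bin count, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def population_stability_index(reference, current, bins=10):
    """Return the PSI between two numeric samples.

    Bin edges come from the reference sample's range. A small epsilon
    keeps empty bins from causing log(0) or division by zero.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins
    eps = 1e-6

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((x - lo) / width)
            idx = max(0, min(bins - 1, idx))
            counts[idx] += 1
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data (an assumption): one feature drawn from the training
# distribution, a fresh draw from the same distribution, and a shifted draw.
rng = random.Random(0)
training = [rng.gauss(50, 10) for _ in range(5000)]
same = [rng.gauss(50, 10) for _ in range(5000)]
shifted = [rng.gauss(65, 10) for _ in range(5000)]

psi_same = population_stability_index(training, same)
psi_shifted = population_stability_index(training, shifted)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

A widely quoted rule of thumb treats a PSI below roughly 0.1 as stable and above roughly 0.25 as a significant shift. These cut-offs are conventions, so tune them to your own data.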

Causes of Data Drift

Data drift can be attributed to various factors, including:
1. Poor Data Integrity: Data integrity is the practice of keeping data accurate and consistent throughout its life cycle. When organizations or teams fail to keep an eye on the data used for machine learning operations, the combined modifications that different teams make to it introduce errors. These errors can cause the data to deviate from its original form, resulting in drift.
2. Poor Data Engineering: Data engineers are responsible for extracting, transforming, and loading data from various sources and delivering it to engineering teams through the data pipeline. When a pipeline step is done incorrectly, the data gets mixed up and its properties shift. These shifts introduce data errors, which affect the model’s performance.
3. Errors in Data Collection: Data errors can sometimes be traced back to the point of collection, for example a bug in a form’s frontend or a broken API used to retrieve the data. Any of these can cause data to drift.
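A broken form or API often first shows up as a jump in missing values. The sketch below compares per-field missing rates between a reference batch and a newer batch. It assumes records arrive as dictionaries; the function names, field names, and the 5% tolerance are hypothetical.

```python
def missing_rates(records, fields):
    """Fraction of records in which each field is absent, None, or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def flag_missing_spikes(reference, current, fields, tolerance=0.05):
    """Return fields whose missing rate rose by more than `tolerance`."""
    ref = missing_rates(reference, fields)
    cur = missing_rates(current, fields)
    return {f: (ref[f], cur[f]) for f in fields if cur[f] - ref[f] > tolerance}


# Hypothetical batches: the newer one mimics a form that stopped sending "age".
reference_batch = [
    {"age": 34, "country": "NG"},
    {"age": 41, "country": "US"},
    {"age": 29, "country": "GB"},
    {"age": 52, "country": "NG"},
]
current_batch = [
    {"age": None, "country": "NG"},
    {"age": 38, "country": "US"},
    {"country": "GB"},
    {"age": "", "country": "NG"},
]

flagged = flag_missing_spikes(reference_batch, current_batch, ["age", "country"])
print(flagged)
```

The result maps each flagged field to its reference and current missing rates, which makes it easier to trace the problem back to a specific form field or API response.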

The Effect of Data Drift

The primary effect of data drift is a decrease in model performance or accuracy. While this appears to be an engineering issue, it also affects the business’s bottom line. Fiddler AI explained in this video how model drift (which data drift can cause) cost a client $500,000 over a single weekend. Another customer mentioned that it took their data science team two weeks to resolve a model drift issue.
There are numerous other examples, but the point stands: data drift can have serious consequences for a company’s bottom line. Executives and leaders should therefore make it a priority to address data drift as soon as it is discovered.

How To Solve Data Drift

Solving data drift within an organization requires data scientists and machine learning experts, as well as access to people with domain insight who can explain what has changed in the data and why. Once this has been determined, the engineering team can proceed with the following:
1. Check Data Pipelines: If a data pipeline is broken or flawed, inspecting and repairing it will help reduce the skew in your data. This goes a long way toward resolving data drift and restoring your data to its original distribution.
2. Proper Data Integrity: While data integrity does not directly resolve data drift, it helps prevent it by ensuring that the data used by teams is consistent and accurate across the board.
3. Correct Collection of Data: As previously discussed, data distortion can occur at the point of collection. Ensuring that the correct data is collected at each stage of the process therefore helps reduce data drift and bring the data back to normal. The fault could lie in your frontend application, incorrect data entry, or a broken API used to fetch data, to name a few possibilities. Finding and fixing the points where incorrect data was collected will save the team hours of future debugging or model rebuilding, and save the organization money.
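One way to act on the collection step above is to validate each record before it enters the pipeline. This is a minimal sketch, assuming records are plain dictionaries. The schema, field names, and allowed ranges are hypothetical and would need to reflect your own data.

```python
# Hypothetical schema: field name -> (expected type, validity check).
SCHEMA = {
    "age": (int, lambda v: 0 <= v <= 120),
    "basket_total": (float, lambda v: v >= 0),
    "channel": (str, lambda v: v in {"in_store", "online"}),
}


def validate_record(record, schema):
    """Return a list of problems found in `record`; an empty list means it passed."""
    problems = []
    for field, (expected_type, check) in schema.items():
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not check(value):
            problems.append(f"{field}: value {value!r} failed its check")
    return problems


good = {"age": 34, "basket_total": 25.5, "channel": "online"}
bad = {"age": "34", "basket_total": -3.0}

good_problems = validate_record(good, SCHEMA)
bad_problems = validate_record(bad, SCHEMA)
print(good_problems)
print(bad_problems)
```

Rejecting or quarantining records that fail checks like these keeps a frontend bug or a broken API from quietly changing the distribution of the data the model sees.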

Conclusion

This article provided an overview of data drift: its causes, effects, and solutions. While resolving data drift is important, preventing it is preferable. To do so, consistently monitor your models and data for unusual changes, and work to understand how they came about and how to fix them. To learn more about data drift and related concepts, you can review:
“My data drifted. What’s next?” How to handle ML model drift in production.
Don’t let your model’s quality drift away
What is Data Integrity and How Can You Maintain it?