
A Working Model is Only the Start of the ML Life Cycle

Why companies need to think ahead about infrastructure for data science.

After months of hard, careful work exploring data and training models, your data team is finally ready to push their best model to production. You may think the ML project is done, but deploying an ML model is just the start of the ML Life Cycle.

To sustain the benefits of a deployed ML solution, a company must keep monitoring and improving it to maintain performance and optimize value. This may surprise executives who start with the assumption that “we just need an algorithm to plug into our software/hardware system.”

For proper planning and execution of an ML initiative, it’s vital that executives understand:

  • The ML Life Cycle is an ongoing process that continues long after deployment
  • Effectively managing this ML Life Cycle requires a well-designed, cloud-based data architecture (ML Ops)


FIGURE 1: The ML Life Cycle is a Continuous Process


WHY THE ML LIFE CYCLE DOESN’T END AT DEPLOYMENT

Continual data science activity is required, even after deploying a highly accurate and robust model, because:


Maintaining competitive advantage often requires more data and model changes

Once a company is ahead of the competition, it will want to stay there by gathering more data to improve the model over time (the flywheel effect: better product -> more data -> better product). More data may require not only retraining the model but also re-evaluating the underlying data science techniques to determine whether a different model would work better.
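As a rough illustration of what "determining whether a new model could work better" can look like in practice, the sketch below compares a retrained candidate against the current model on the same held-out data and switches only if the gain clears a margin. The function names, the toy models, and the `min_gain` value are all hypothetical; a real evaluation would use the metrics that matter to the business.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def choose_model(current, candidate, holdout, min_gain=0.01):
    """Keep the current model unless the candidate beats it by min_gain.

    min_gain is an illustrative threshold, not a recommended value.
    """
    current_score = accuracy(current, holdout)
    candidate_score = accuracy(candidate, holdout)
    promote = candidate_score - current_score >= min_gain
    return (candidate if promote else current), current_score, candidate_score


# Toy stand-ins for a deployed model and a retrained one (assumptions).
holdout = [(x, int(x > 5)) for x in range(10)]
current_model = lambda x: int(x > 3)
candidate_model = lambda x: int(x > 5)

chosen, old_score, new_score = choose_model(current_model, candidate_model, holdout)
print("promote candidate:", chosen is candidate_model, old_score, new_score)
```

Scoring both models on identical held-out data keeps the comparison fair, and the margin guards against replacing a model because of noise.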

Subtle input changes can degrade model performance over time

Input data can change (or “drift”) over time and degrade the performance of brittle models, eventually making them obsolete. The need for constant monitoring and retraining can surprise CEOs and CIOs, because in traditional software development there is no expectation that otherwise static software will decay after deployment. Understanding this point is vital, since many ML model implementations have a sizable impact on some combination of the value, safety, and efficiency of the business. To make matters worse, the “black box” nature of many models makes unexpected changes much more difficult to detect and diagnose.

Consider a mobile app-based medical device that uses cell phone camera images as inputs to a deep learning model that predicts patient health. If phone manufacturers suddenly push new camera software that “improves” images (fewer shadows, different color temperatures, etc.), the model may begin to generate faulty predictions.
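One common way to watch for this kind of input drift is to compare the distribution of a simple input statistic at training time with the same statistic in production. The sketch below uses the Population Stability Index (PSI), which is one technique among several and not one this article prescribes. The "mean image brightness" feature, the simulated values, and the 0.1 / 0.25 alert levels (an often-quoted rule of thumb) are assumptions for illustration.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Simulated mean image brightness (hypothetical feature and values).
rng = random.Random(0)
training = [rng.gauss(0.45, 0.05) for _ in range(5000)]
last_week = [rng.gauss(0.45, 0.05) for _ in range(5000)]
after_update = [rng.gauss(0.55, 0.05) for _ in range(5000)]  # brighter images

for name, sample in [("last_week", last_week), ("after_update", after_update)]:
    psi = population_stability_index(training, sample)
    # Rule-of-thumb levels (assumption): < 0.1 stable, > 0.25 investigate.
    status = "investigate" if psi > 0.25 else "ok"
    print(f"{name}: PSI={psi:.3f} -> {status}")
```

A check like this does not explain why the inputs changed, but it can raise a flag before faulty predictions reach users.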


Racial, gender, and other biases become apparent as models encounter more data

ML models should not only perform well but also perform equitably. As the ML-based product is launched and its use cases are gradually expanded, exposure to populations and situations not adequately represented in the initial training data often begins to reveal poor performance. However, since model performance is often measured in aggregate across many examples, it is easy to miss that a model is underperforming on particular groups. Businesses have to monitor ML models after deployment to detect (and correct) racial, gender and other biases.

For medical applications, racial biases can directly impact patient care. Early training sets may not cover all demographic or racial groups equally, and some demographic or racial differences may not become apparent until more data is available (or until new markets and groups are represented in that data).
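The point about aggregate metrics hiding group-level problems can be made concrete with a small sketch. It computes accuracy overall and per group, then flags any group that trails the aggregate by more than a chosen gap. The function names, the group labels, the toy data, and the `max_gap` value are hypothetical, and accuracy is only one of several fairness-relevant metrics a team might track.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per group label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


def underperforming_groups(per_group, overall, max_gap=0.05):
    """List groups whose accuracy trails the aggregate by more than max_gap."""
    return sorted(g for g, acc in per_group.items() if overall - acc > max_gap)


# Toy data (assumption): group "B" is small and poorly served.
y_true = [1] * 100
y_pred = [1] * 87 + [0] * 3 + [1] * 6 + [0] * 4
groups = ["A"] * 90 + ["B"] * 10

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print("overall:", overall)
print("per group:", per_group)
print("flagged:", underperforming_groups(per_group, overall))
```

In this toy data the overall number looks healthy because the larger group dominates it, which is exactly why the per-group breakdown is worth monitoring after deployment.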

A CONTINUOUS ML LIFE CYCLE STRATEGY PROVIDES ADDITIONAL BENEFITS

Beyond these considerations, there are a number of follow-on benefits to treating ML as a continuous process and building proper infrastructure to facilitate this strategic perspective.  A well-designed architecture will include versioning, traceability, modularity and other features that allow the company to:


Meet Evolving Regulatory Requirements

In regulated industries, such as medical devices, regulations around documentation and quality control for the ML Life Cycle are already being discussed. The FDA has proposed a workflow concept that includes the need to log, track, and evaluate performance [see this FDA article].

Operate in a Transparent, Scalable and Efficient Manner

Each transformation of data should be tracked to avoid hidden errors, allowing larger teams to work together on the same data and models.  Also, modularity reduces the risk of accumulating hidden errors over time [see this paper on tech debt]. Overall, as a result of implementing the right architecture, model maintenance and retraining will be much more efficient.
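To illustrate what tracking each transformation can mean at the smallest scale, the sketch below records a content hash of the data before and after every step, so anyone can later confirm which inputs produced which outputs. It is a toy under stated assumptions: the helper names and example records are hypothetical, and production teams would normally rely on a dedicated data-versioning or experiment-tracking tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable SHA-256 digest of a list of JSON-serializable records."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def tracked_step(name, transform, rows, lineage):
    """Apply one transformation and append an audit record to the lineage log."""
    input_hash = fingerprint(rows)  # hash first, in case transform mutates rows
    result = transform(rows)
    lineage.append({
        "step": name,
        "input_hash": input_hash,
        "output_hash": fingerprint(result),
        "rows_in": len(rows),
        "rows_out": len(result),
    })
    return result


# Hypothetical records and steps, for illustration only.
lineage = []
raw = [{"age": 34, "label": 1}, {"age": None, "label": 0}, {"age": 51, "label": 1}]
clean = tracked_step(
    "drop_missing_age", lambda rs: [r for r in rs if r["age"] is not None], raw, lineage
)
scaled = tracked_step(
    "age_in_decades", lambda rs: [{**r, "age": r["age"] / 10} for r in rs], clean, lineage
)

for record in lineage:
    print(record["step"], record["rows_in"], "->", record["rows_out"])
```

Because each step's output hash matches the next step's input hash, a break in the chain shows that data was changed outside the tracked pipeline.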

Mitigate Turnover Risk

Data science and data engineering skill sets are in high demand, and we’ve seen companies lose entire teams suddenly. A deployed model may run for years and be maintained by a succession of data scientists, so infrastructure that enforces process and provides historical traceability is the best way to avoid a disastrous loss of institutional knowledge.


Enhance Data Security and Access Control

A well-designed architecture will support better processes for the company to control and track data access and transformation, minimizing the risk that malicious or well-intentioned actions will degrade model performance or expose confidential information.


IMPLEMENTING THE ARCHITECTURE SHOULD BE A DELIBERATE PROCESS

To realize the benefits outlined above and to optimize overall model performance, a company needs well-designed processes and infrastructure (a.k.a. “ML Ops”) to manage data handling (storage, processing, serving), data and model versioning (experiment tracking), and continuous integration and deployment (CI/CD) of models and processing pipelines. Architectures involving these components can get quite complicated and even look something like this:

FIGURE 2: General Data Architecture Example (source)


While few companies need an infrastructure this complex, building the optimal infrastructure can be daunting because of the ever-increasing number of cloud and software tools available. Without a deliberate decision to build the appropriate ML Ops architecture, the infrastructure can grow in an ad hoc manner, pulling valuable time from data science efforts and potentially putting the company at risk of accumulating technical debt or missing future revenue and savings opportunities.

Even companies with a savvy data science team can easily find themselves struggling to administer each of these components, which are in constant flux as technologies mature. At some point early in the ML product development project, the company will benefit from bringing additional ML Ops expertise, whether internal or external, into the effort.

Depending on the pre-deployment modeling activities, it can be beneficial to have part or all of the ML Ops architecture in place before deployment.  Establishing process control over the ML life cycle early avoids having the architecture grow in an ad hoc manner that can create technical debt. [see more on Hidden Technical Debt in ML Systems here]


FIGURE 3: When to Design and Deploy ML Ops


THE BOTTOM LINE

  • Company management implementing a strategically important ML initiative should plan for a continuous ML Life Cycle.  Doing so helps manage some major risks and will lower the total cost of the ML initiative.
  • Companies need more than data science expertise to build a sustainable, scalable, and maintainable architecture (that makes the data scientist’s job easier and more impactful).  The key expertise required will be familiarity with cloud tools (like AWS SageMaker, Kubeflow, Azure ML, etc.) and experience in ML Ops.
  • Building out this infrastructure needs to be a deliberate process to avoid overly complex combinations of tools and processes, which can ultimately require future refactoring efforts and, on occasion, rebuilding entirely from scratch.

Written by: Michael Bell (VP, Product) and Anthony Scott (Director, Engineering)
Published on: February 2, 2021