Investing in MLOps is crucial for modern data-driven organizations
We hear it every day: an MLOps or model team leader believes they have AI governance handled because they have:
- Logging
- Monitoring
- Alerting
They should 100% celebrate their investments in these elements - they are essential for well-built modeling systems. But logging, monitoring, and alerting are only a small subset of the many pieces of a well-governed system. Governance is not another stack component; it is a broad, additional layer that drives alignment, provides risk context, consumes information from existing processes, and delivers a real-time, contextually appropriate, aggregate view of the business decisions, systems, and models that result in a prediction. This is the governance gap in MLOps system design.
MLOps is not governance
MLOps is the set of practices and components required to bring an ML model from idea to production - built from a technical perspective, for technical stakeholders. ML platforms are designed and built from several components working in series to deliver a model into production for inference and to monitor its performance. Observability and understanding of the process and its components are an afterthought and, where present, accessible and intelligible only to technical folks. From a strictly technical perspective, the platform's goal is speed, efficiency, and performance. The practice and tooling in MLOps are drawn from software engineering and DevOps, particularly the ‘automate everything’ mindset.
Governance is the counterweight to speed, efficiency, and automation - not because any of those qualities are inherently bad, but because well-built systems should mitigate undesirable outcomes. And sometimes going slower is the best way to get further. Good model governance embeds controls, visibility, and objectivity to achieve the most robust, performant, compliant, secure, and fair system possible. What the initial slowdown conceals is the streamlining and enablement of downstream processes. When governance is done well, our DS and ML leaders don't need to solve it themselves. We can deliver context and tools that complement the model development lifecycle and remove the friction and painful administrative burden of future model reviews.
We all want the same thing - the best possible models in production as quickly as possible. But to date, we’ve not provided our business partners with the information they need to make decisions as quickly as we develop new models. As data scientists, we may not always be aware of broader system engineering or architectural considerations. Governance helps protect us from delivering projects with unforeseen risks, regardless of whether our business partners require this information.
Can MLOps cover the needs of AI governance?
Or vice versa? Not at all. The two play complementary roles: they speak to different audiences, they meet different needs, and neither is a subset of the other. Returning to our data engineering example, this is like asking whether hiring an MLE or an MLInfra person obviates the need for a data engineer. While there is overlap, expanded needs require specialist tools and skills. Let's examine some examples of the relationship between MLOps and governance.
Logging
Logging is generally concerned with creating a record of what happens in a system, providing a way to check on errors and understand system performance post-hoc. There are many similarities between the items we want to log from the MLOps and AI governance perspectives. The primary differences lie in the needs of the consuming audience, and thus in the access patterns and presentation layer that must be considered. A DS or MLE can easily query a database or interact with a command-line tool and visualize the results using their favorite data visualization library. Risk, compliance, and legal stakeholders, however, are unlikely to have the skills (or access) to do so.
Next is the breadth of what gets captured. In an MLOps system, model feature vectors and outcomes are logged, along with many details about the individual components that come together to allow the productive use of the modeling system: build logs, system logs, records of health checks, and various other quantities. All of these are essential to capture - but they do not begin to meet the documentation requirements of a robust AI governance program, from either a presentation or a completeness perspective.
In a well-governed system, logging - in the sense of recording critical information about a model - begins well before the feature vector of a prediction is captured. The gathering of information starts at project inception and flows through model decommissioning - and all of it must be legible and accessible to our non-technical stakeholders.
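To make the contrast concrete, here is a minimal sketch of the two kinds of records side by side. The field names, the emit() helper, and the JSONL file store are illustrative assumptions, not any particular platform's schema.

```python
# Minimal sketch: an MLOps-level prediction log vs. governance-level context.
# Field names, emit(), and the JSONL file store are illustrative assumptions.
import json
import uuid
from datetime import datetime, timezone


def emit(record: dict, path: str = "governance_log.jsonl") -> None:
    """Append a record as one JSON line; in practice this goes to a durable store."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


# What a typical MLOps stack captures at prediction time: enough to debug,
# monitor, and alert on the running system.
prediction_record = {
    "prediction_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model_version": "underwriting-v1.2",
    "feature_vector": {"debt_to_income": 0.31, "loan_amount": 25000},
    "score": 0.82,
    "latency_ms": 41,
}

# What a governance layer additionally needs, gathered from project inception
# through decommissioning and linked to every prediction so non-technical
# reviewers can reconstruct the decision years later.
governance_context = {
    "model_version": "underwriting-v1.2",
    "approved_use_case": "consumer underwriting",
    "training_data_source": "warehouse.loans_2023_q1",
    "data_permission_reference": "DPA-2023-017",
    "risk_review_ticket": "RISK-482",
    "owner_team": "underwriting-ds",
    "decommission_date": None,
}

emit(prediction_record)
emit(governance_context)
```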
Let’s look at a concrete workflow as an example. A regulatory body has inquired about an underwriting decision made in July of 2023. It is December of 2026. Your company’s legal, audit, compliance, and risk teams have been working together to prepare the documents required to respond to the inquiry. They need precise information on the model.
As a leader of the data science organization, you and your people are prepared to answer questions about your model’s and system’s performance, and you are confident your data scientists made the right decisions during model build and training. You ask the line manager in underwriting data science to pull the build logs and relevant model documentation for review before the official ask comes through.
The first issue is that the former manager in Underwriting DS has taken a new role in a new division and has a major project due.
Beyond this initial people-and-priorities problem, several of the requested items prove difficult to obtain once the ask comes through.
A few examples:
- Which version of the model was live across a date range, and if more than one, how were they different?
- What data source was used to build the training data set, and how was it permissioned at the time of training?
- Where is the original training data set, and how did the feature engineering process address the issue of potential proxy bias?
- How many issues were flagged with the model while this version was live? Where is the record of issue resolution, how many were high priority, and who resolved them?
- How many decisions were processed by the model across this period? What was the outcome distribution? How were outcomes distributed across feature vectors similar to the decision in question? And if the current running model is different, how are the distributions of the subsequent models the same or different?
- How often have the underlying data sources and feature transformation pipelines changed from the model in question to the currently running model?
- What is the version history of all production data pipelines that handle pre- and post-processing logic across the model's history?
- Please stand the model in question back up and run the historic feature vector against the model to audit the performance against our logged records. What is the output?
- Can you explain the model selection process? How was model complexity weighed - e.g., if the model is a GBM, why was it selected over a linear model?
Unfortunately, the data scientist who initially built the model has left the company. This leaves the Underwriting DS manager digging through old emails, slide decks, and system logs for clues.
The result is that many unplanned hours are spent on document retrieval and on building context around the raw answers (in system logs, monitoring systems, etc.). The VP in the manager's new area is neither happy nor amused by the delay and the deprioritization of their recent work, and the manager is burnt out and overwhelmed.
Now the fun starts. The auditor (or regulator) makes the following statement:
“We’d like to see evidence of your data selection criteria.”
Can that be done, given staffing changes and the passage of time? Where was it captured, and how has present practice prepared your organization for the inquiry?
In a mature AI governance environment, the answers to the questions posed above are self-serviceable by business partners. Given an exceptionally robust system, a business user can also spin up an arbitrarily old model and re-run the transaction against it to verify that the decision is as recorded. The business user can likewise spin up models that have since been commissioned or decommissioned and run the historic feature vector against them - what prediction comes from model v0.3 vs. v1.2 vs. v2.3?
If an old model is found problematic, do the models since (including the present one) share the same issues? Can we demonstrate that to a regulator?
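As a sketch of what that replay might look like, assume each model version was archived at release (here with joblib) and that the original feature vector and score were logged at decision time. The archive layout, file names, and version list below are assumptions about your own storage, not any specific product's API.

```python
# Sketch: replay a logged 2023 feature vector against archived model versions.
# Archive layout, log format, and the version list are illustrative assumptions.
import json
from pathlib import Path

import joblib
import pandas as pd

ARCHIVE = Path("model_archive/underwriting")   # e.g. .../v1.2/model.joblib
logged = json.loads(Path("decision_logs/2023-07_case-8841.json").read_text())

features = pd.DataFrame([logged["feature_vector"]])

for version in ["v0.3", "v1.2", "v2.3"]:
    model = joblib.load(ARCHIVE / version / "model.joblib")
    score = model.predict_proba(features)[0, 1]
    print(
        f"{version}: replayed score = {score:.4f} "
        f"(logged at decision time: {logged['score']:.4f})"
    )
```

If the replayed score for the version that was live in July 2023 matches the logged score, the audit record holds; if it doesn't, that discrepancy is itself a finding.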
Monitoring
The monitoring components of MLOps and AI governance systems are the most similar elements we'll discuss, though even here, strong MLOps practices do not obviate good governance. Both are concerned with model performance now and over time, with training/serving skew, and with catching anomalies that could flow through the system and produce undesirable outcomes. Depending on the maturity of the individual MLOps practice, monitoring may be done ad hoc in a notebook without third-party evaluation, or propagated into a dashboard with various visualizations.
Table stakes for these platforms (and there are many) are feature and outcome distributions, the ability to run statistical tests against the collected data, and the ability to view model performance against sub-segments of the feature vectors.
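Under the hood, those statistical tests are often as simple as a two-sample comparison between a feature's training distribution and its recent serving distribution. The sketch below uses scipy's Kolmogorov-Smirnov test; the file paths, feature names, and p-value threshold are illustrative assumptions.

```python
# Sketch: per-feature drift check comparing training data to recent serving logs.
# File paths, feature names, and the 0.01 threshold are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("training_snapshot.parquet")        # assumed training export
recent = pd.read_parquet("serving_logs_last_30d.parquet")   # assumed logged requests

for feature in ["debt_to_income", "loan_amount", "credit_utilization"]:
    stat, p_value = ks_2samp(train[feature], recent[feature])
    status = "possible drift" if p_value < 0.01 else "ok"
    print(f"{feature:20s} KS={stat:.3f} p={p_value:.4f} -> {status}")
```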
The consumer of these dashboards is typically a data scientist tasked with monitoring the performance of the models they built, or a team member in an allied role - perhaps a data engineer or analyst. The notion of a non-technical user needing (or wanting) to use these tools is rarely contemplated. Model outcome and performance questions asked by business leaders are filtered through the technical folks, usually via IT tickets, who must synthesize an answer from the data and craft a story legible to a business user.