“Bad Data Can Kill Good AI”
A good painter will tell you that surface preparation means everything. A great paint job on a poorly prepared surface will look shabby and will not last. Likewise, a good data scientist will tell you that data preparation is critical to any AI system’s success. Even the best, most sophisticated analytics technique applied to low quality, poorly integrated, sloppily engineered, or largely irrelevant data will be, at best, unreliable.
Much ink has been spilled describing AI and machine learning and their uses in banking. But less has been written on the essential foundation of AI: robust data. In this article I will propose five rules that, if followed, will help ensure that your AI efforts are not foiled by problems with your data.
Rule #1. Less is more.
Because it is hard to know which data is important for your purpose, it is sometimes tempting to cast a wide net. Build your AI using as much data as you can get and you will find the signal wherever it may lie hidden, right? Maybe. But put AI into use with a lot of extraneous data and your AI is ripe for failure. According to one famous paper (“Hidden Technical Debt in Machine Learning Systems”), underutilized data, meaning data that provides little or no incremental benefit, makes AI vulnerable to change, “sometimes catastrophically so”. Weeding out extraneous data makes AI easier to test, run, scale, and maintain.
The question, of course, is: which subset of the data is the parsimonious dataset with all or substantially all of the signal you need? This may not be easy to determine. There are data science techniques for determining how important a particular data feature is, but these don’t necessarily tell the whole story. Data may be very important, but only in certain situations. Or data might be important only in combination with other data. Figuring out which data to use and which data to eliminate is a tricky but critical step that, in the rush to get AI implemented, is too often skipped.
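The point that data might matter only in combination with other data can be illustrated with a toy example. The sketch below uses made-up values and hypothetical feature names (`is_new_account`, `is_large_transfer`); it is not drawn from any real model. Screened one at a time, each feature shows no correlation with the outcome, yet the two together reproduce it exactly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy, made-up data: the outcome is 1 only when exactly one flag is set.
is_new_account    = [0, 0, 1, 1]
is_large_transfer = [0, 1, 0, 1]
outcome           = [0, 1, 1, 0]

# Screened individually, both features look useless.
corr_a = pearson(is_new_account, outcome)
corr_b = pearson(is_large_transfer, outcome)

# Combined into one derived feature, they match the outcome exactly.
combined = [int(a != b) for a, b in zip(is_new_account, is_large_transfer)]
corr_combined = pearson(combined, outcome)

print(corr_a, corr_b, corr_combined)
```

A feature-by-feature screen would discard both inputs here, which is why importance scores alone should not decide what gets dropped.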
Rule #2. Source your data carefully.
When prototyping an AI solution or creating an experiment, data is often cobbled together from different places. Data integration and preparation may be ad hoc. There is nothing wrong with this, as long as you make that ad hoc process industrial strength, with all the automation and controls appropriate for your purpose, once your AI is ready to implement.
In some cases, AI developers grab data from the most expedient source, not necessarily the best source. In one case, I discovered a model sourced from unreconciled loss data which turned out to be entirely inconsistent with published financials and thus wrong by a mile. For proofs of concept, this might be acceptable. But before implementation data should be sourced from well supported platforms, where data controls are strong, and reliability and availability are high.
Often data must be pulled together from disparate sources, and then integrated and synthesized for AI purposes. Think about assembling client data, product data, and transactional data from different systems. Putting these together for modeling is a nontrivial step. This may be done in “quick and dirty” fashion for experimentation, but before implementation the process needs to be properly designed, engineered, and tested.
Unfortunately, the temptation to let the business use experimental AI before it has been well engineered is strong, and it must be resisted. Of course, AI intended for use only periodically (for a monthly report, for example) may require less data engineering than AI to be used continuously in business operations. And higher risk uses of AI (e.g. loss forecasting, credit approval, fraud detection) require more rigor than lower risk uses (e.g. marketing campaigns, client segmentation, lead prioritization).
Rule #3. Decide how good is good enough.
Perfect data in a business setting is rare. Inevitably there are holes in the data that need to be filled, errors in the data that need to be cleaned up, or inconsistencies that need to be mitigated prior to use.
The cost of 100% accuracy can be high, sometimes unattainably so. In some cases, complete accuracy is a requirement. If you are using AI to forecast loan losses, for example, the loss history used to build your AI had better be 100% complete. But for fraud detection models, perhaps 98% accuracy is good enough if that extra 2% would hold up implementation of models that will save you a bundle in fraud losses. And for marketing purposes, maybe 85% accuracy is good enough.
You need to decide how accurate the data needs to be for your purpose. But keep in mind that this knife cuts both ways. I’ve seen AI built on poor quality data that ended up being unreliable in use, but I’ve also seen model risk management hold up models from implementation over data quality concerns that are insufficiently worrisome in comparison to the value of the AI in question.
Each use case is different, so there must be a step in the process where target data quality levels are explicitly declared. And controls, such as data quality metrics, need to be put in place that either prove that the threshold has been attained or warn you when some minimum quality standard has been breached.
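As a minimal sketch of what explicitly declaring a target and checking against it might look like: the use case names, field names, and sample records below are hypothetical, and the thresholds simply echo the illustrative figures above. Completeness (the share of records with every required field populated) stands in here for data quality more broadly; a real control would likely track several metrics.

```python
# Hypothetical declared targets per use case, echoing the examples in the text.
QUALITY_TARGETS = {
    "loan_loss_forecasting": 1.00,
    "fraud_detection": 0.98,
    "marketing": 0.85,
}

def completeness(records, required_fields):
    """Share of records in which every required field is populated."""
    if not records:
        return 0.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )
    return complete / len(records)

def meets_target(use_case, records, required_fields):
    """Return (measured, target, passed) for a declared use case."""
    target = QUALITY_TARGETS[use_case]
    measured = completeness(records, required_fields)
    return measured, target, measured >= target

# Made-up sample: one of four records is missing its balance.
sample = [
    {"account_id": "A1", "balance": 120.0},
    {"account_id": "A2", "balance": None},
    {"account_id": "A3", "balance": 75.5},
    {"account_id": "A4", "balance": 10.0},
]
measured, target, passed = meets_target("marketing", sample, ["account_id", "balance"])
print(f"measured={measured:.2f} target={target:.2f} passed={passed}")
```

Keeping the targets in one declared table makes the "good enough" decision visible and reviewable, rather than leaving it implicit in each model's code.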
Rule #4. Manage change.
I’ve seen more models fail because of poor data change management than for any other reason. Something unexpected changes in the data or the data preparation breaks somewhere and your AI produces unreliable results, sometimes wildly so.
Most organizations have a data change management process in place. This process is designed to communicate and analyze the impact of data changes. But AI makes this more difficult. AI and data management are intertwined; data can impact AI, and AI can impact data. Often data change management sits under the Chief Data Officer, and AI change management sits under the Chief Analytics Officer, or in Model Risk Management under the Chief Risk Officer. In too many cases, these organizations do not communicate sufficiently or effectively.
Data change management and model change management must be strongly linked together. Data changes must be analyzed for their potential impact to AI, and AI changes must be analyzed for potential impact to downstream consumers of their data outputs. Neither can be managed in isolation.
There are two ways to handle this. One option is to merge data change management and model change management into one organization. But I suspect this is too heavy handed for most organizations. The second option is to create the right policies and procedures so that, for example, any data change is analyzed for potential impact to AI prior to implementation.
For this to work, there must be a complete registry of all your AI, including information (model metadata) on which data the AI consumes. Without that, determining which AI is impacted by a data change is a near impossibility. Again, model risk should be a factor: data changes that affect higher risk models should get more rigorous impact analysis than changes that affect only lower risk models.
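A registry of this kind can be sketched in a few lines. Everything named below (the `ModelRecord` structure, the model names, the field names, and the two-level risk tier) is a hypothetical illustration, not a reference to any particular tool; the idea is only that each model records which data it consumes, so a proposed data change can be mapped to the models it touches, highest risk first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRecord:
    name: str
    risk_tier: str            # "high" or "low"; the tiers are an assumption
    input_fields: frozenset   # data fields the model consumes

# Hypothetical registry entries.
REGISTRY = [
    ModelRecord("loss_forecast", "high",
                frozenset({"loan.balance", "loan.charge_off_amount"})),
    ModelRecord("fraud_score", "high",
                frozenset({"txn.amount", "client.segment"})),
    ModelRecord("lead_priority", "low",
                frozenset({"client.segment", "client.region"})),
]

def impacted_models(changed_fields, registry=REGISTRY):
    """Models consuming any changed field, high risk first, then by name."""
    changed = set(changed_fields)
    hits = [m for m in registry if m.input_fields & changed]
    return sorted(hits, key=lambda m: (m.risk_tier != "high", m.name))

# Which models need impact analysis if client.segment is redefined?
for model in impacted_models(["client.segment"]):
    print(model.name, model.risk_tier)
```

The lookup is only as good as the metadata behind it, which is why the registry has to be complete before the policy can work.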
Rule #5. Build in alarm bells.
Data problems may not be obvious, and data flows may by all appearances seem to be functioning normally. Without well designed alarm systems, data problems can go unnoticed. At one bank, credit models ran for weeks with a key piece of data missing before anyone realized.
Even subtle, but unexpected, changes in data can lead to degradation in model performance. More significant problems, like suddenly missing data due to a process failure, are all too common.
Like plumbing problems, data problems that go unnoticed cause more damage than those found quickly. A securities pricing error, for example, can propagate through multiple systems and become harder and more costly to fix the longer it persists. Building in the right alerting systems, so that data problems are noticed and acted on, mitigates the risk of AI failure due to data processing breaks.
This requires the right instrumentation. Existing application health monitoring systems are generally not adequate for monitoring data flows. They may even provide a false sense of security, indicating that everything is working fine when in fact there is a major hole, or something highly anomalous, in your data. Start with simple data quality metrics like completeness checks (did I get all the data?) and consistency checks (does it match systems of record like the general ledger or underlying loan or deposit systems?).
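The two starter checks can be sketched as follows. The function names, the absolute tolerance, and the sample figures are assumptions made for illustration; in practice the expected record count and the ledger total would come from the upstream system and the system of record.

```python
def completeness_alerts(received_count, expected_count):
    """Did I get all the data? Flag a feed that is short of records."""
    if received_count < expected_count:
        return [f"completeness: received {received_count} of {expected_count} records"]
    return []

def consistency_alerts(feed_total, ledger_total, tolerance=0.01):
    """Does it match the system of record? Flag totals that differ by
    more than an absolute tolerance (an assumed threshold)."""
    diff = abs(feed_total - ledger_total)
    if diff > tolerance:
        return [f"consistency: feed total differs from ledger by {diff:.2f}"]
    return []

def run_checks(balances, expected_count, ledger_total):
    """Run both checks on one feed and collect any alerts."""
    alerts = completeness_alerts(len(balances), expected_count)
    alerts += consistency_alerts(sum(balances), ledger_total)
    return alerts

# Made-up feed: one record is missing, so the total no longer reconciles.
alerts = run_checks([100.0, 250.0, 50.0], expected_count=4, ledger_total=500.0)
for alert in alerts:
    print("ALERT:", alert)
```

An application health monitor would report this feed as delivered successfully; only checks on the data itself reveal the gap. What happens when the alert list is non-empty is a policy decision.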
When something doesn’t look right, have a failover plan. Do you, for example, take your AI offline temporarily while the error is researched and corrected?
Summary
Data is the unglamorous but essential foundation of good AI. Data scientists are not necessarily skilled at, or inclined toward, taking on all of the above by themselves. They need the right support from data governance, data engineering, and IT. Building in the right controls will help banks avoid nasty surprises when their AI goes horribly wrong because of unforeseen or undiscovered data problems.
About the Author:
H.P. Bunaes is founder of AI Powered Banking, a consultancy helping banks and fintechs with their data and analytics. H.P. has 30 years of experience in banking, data, and analytics, holding senior leadership positions at top 10 US banks. H.P. is a graduate of M.I.T., where he earned an M.S. in Information Technology. More information can be found at https://aipoweredbanking.net
This is a Sponsored Feature