By Alvin Tan, principal consultant at Capco
Data sourcing and cleansing are frequently cited as being among the most critical, yet most time-consuming, aspects of data science. Enhanced data management not only reduces the burden of data sourcing and preparation, but also improves data quality and serves to foster greater trust in the insights that are delivered via data science.
Robust data management capabilities ensure that less time is spent wrangling data into an analytics model and more is devoted to the actual modelling and identification of actionable business insights. Organisations that build analytics data pipelines upon solid data management foundations can extract greater business value from data science.
This delivers not only competitive advantage through newly-identified insights, but also a comparative advantage via a virtuous circle of data culture improvements.
Lies, damn lies and statistics
Data science is only effective if it ultimately delivers positive value: the analytics must serve a clear business purpose. ‘Garbage out’ – incorrect, misleading, meaningless or otherwise unusable data science output – leads to sub-par strategies and misguided business decisions at best, and financial and reputational damage at worst.
This diminishes the business value of the results at hand and moreover erodes the faith that decision makers might place in future results. Trust is key. No matter how powerful, accurate or statistically sound the results, a data science capability must itself be trusted if decision makers are to transform those results into business strategies. Establishing, retaining and nurturing such trust requires business outcomes to be consistently aligned with expectations.
Trust requires not only the adoption of sound scientific methodologies, but also a cost-effective mechanism for ensuring data issues are flagged, managed and resolved. This can be boiled down into two key data management requirements for analytics processes: understanding and obtaining the right data; and fixing the data that is obtained.
- Understanding and obtaining the right data
In the case of model-led analytics (for example, machine learning) data is input into an existing analytical model to ascertain its accuracy and viability. In this paradigm, the data scientist must first understand the semantics of the data to be sourced, so that the conceptual and contextual specifics of the required data can be identified. The data scientist must then determine from where to source the specified data. That requires an understanding of data provenance if the data is to be sourced appropriately.
Similarly, an understanding of what data to source, and where to source it from, is necessary to ensure that the outcomes of data-led analytics (such as data mining) have strong foundations. An understanding of data semantics and data provenance is therefore critical to ensuring that any analytics draws upon the right data:
• The data required must be properly and unambiguously defined. This involves identifying and establishing a shared understanding with potential data providers as to what is required. If the data scientist wants ‘customer name’, for example, then an agreement must be made with the provider as to whether ‘name of account holder’ means the same thing semantically. In this example, there are many hidden nuances: does customer name include prospective or former customers? Does name of account holder cover mortgages, or current accounts, or both? Arriving at a mutual understanding is no simple task without a commonly agreed understanding of the definition, taxonomy, and ontology of the data.
• The data that is obtained must be representative of the population. An unrepresentative sample, for example one in which the data obtained represents only specific subsets of the required population, biases analytics outputs and should be avoided. As an example, if retail banking customer names are required, then it is important to ensure that the data is sourced from a provider that aggregates customers for all retail banking products, and not just, say, mortgages.
Satisfactorily resolving this sourcing challenge requires not only an accurate semantic articulation of the required data, but also an understanding of where this data can be reliably obtained.
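The representativeness point can be illustrated with a minimal sketch in Python. It assumes a hypothetical extract in which each record carries a `product` field, and a hypothetical list of the retail banking products the analysis is meant to cover; neither the field names nor the product list come from any particular system.

```python
from collections import Counter


def coverage_gaps(records, required_products, product_key="product"):
    """Return the required product types that have no records in the sample."""
    seen = Counter(record[product_key] for record in records)
    return sorted(p for p in required_products if seen[p] == 0)


# Hypothetical extract from a provider that only aggregates mortgage customers.
sample = [
    {"customer_name": "A. Jones", "product": "mortgage"},
    {"customer_name": "B. Singh", "product": "mortgage"},
]

# Hypothetical set of products the population is supposed to span.
required = {"mortgage", "current_account", "savings"}

missing = coverage_gaps(sample, required)
print(missing)  # products with no representation in the sample
```

A non-empty result suggests the provider does not cover the whole population, and that a different or additional source is needed before any analytics is run.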
- Fixing the obtained data
Once sourced, data may still contain data quality issues that must be properly understood and resolved prior to any analytics. Resolving and correcting for data quality issues is a data cleansing process that constitutes a key element of analytics preparation.
Poor quality data inputs can manifest in a variety of ways:
- Data may contain gaps, which if not corrected at source, accurately imputed, or omitted entirely, will result in a biased output;
- Similarly, data may contain duplicate elements, which if not omitted will also lead to biases;
- Data may not conform to an expected format, which if not corrected may break the analytics model;
- Data may contain errors, which reduce the accuracy of the results;
- Data may be out of date, hence the relationships inferred may no longer be applicable;
- Data may not be sufficiently granular, or sample size may likewise be insufficient; both scenarios weaken explanatory power and the significance of outcomes.
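Several of the issues listed above can be detected mechanically before data reaches a model. The following is a minimal sketch rather than a complete data quality tool: the record layout, the `account_id` pattern and the one-year staleness threshold are all assumptions chosen for illustration. It counts gaps, duplicates, format violations and out-of-date records.

```python
import re
from datetime import date

# Assumed identifier format, for illustration only: two letters and six digits.
ACCOUNT_ID = re.compile(r"^[A-Z]{2}\d{6}$")


def profile(rows, as_of, max_age_days=365):
    """Count simple data quality issues in a list of dict records."""
    issues = {"gaps": 0, "duplicates": 0, "bad_format": 0, "stale": 0}
    seen = set()
    for row in rows:
        # Gaps: any field that is None or an empty string.
        if any(value is None or value == "" for value in row.values()):
            issues["gaps"] += 1
        # Duplicates: the same account_id appearing more than once.
        key = row.get("account_id")
        if key in seen:
            issues["duplicates"] += 1
        seen.add(key)
        # Format: account_id must match the expected pattern.
        if not ACCOUNT_ID.match(key or ""):
            issues["bad_format"] += 1
        # Timeliness: last update older than the allowed age.
        updated = row.get("updated")
        if updated is not None and (as_of - updated).days > max_age_days:
            issues["stale"] += 1
    return issues


# Hypothetical records: one clean, one exact duplicate, one with several problems.
rows = [
    {"account_id": "AB123456", "name": "A. Jones", "updated": date(2024, 1, 10)},
    {"account_id": "AB123456", "name": "A. Jones", "updated": date(2024, 1, 10)},
    {"account_id": "12-XY", "name": "", "updated": date(2021, 6, 1)},
]

report = profile(rows, as_of=date(2024, 6, 1))
print(report)
```

Checks such as these only flag issues. Whether a gap is corrected at source, imputed or omitted, and whether a duplicate is dropped, remain decisions for the data owner and the analyst.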
‘Garbage in’ – incorrectly defined, inaccurate, incomplete or otherwise poor quality data entered into an analytics process – is a primary limiting factor on the usefulness and reliability of analytics results.
Managing the inputs
To avoid misleading data science outcomes that might drive bad business decisions, while also minimising the marginal cost of implementing such remedies, organisations must implement an effective data management capability, one that delivers the scale economies required to ensure additional data science projects are cost-effective.
A data management capability provides a set of centralised, scalable services for describing what data means; for understanding and recording where the data comes from; for maintaining good quality data; and for ensuring the roles and responsibilities for data management are effectively discharged.
- Semantics: data is given commonly agreed and understood definitions, is placed in a commonly known taxonomy and ontology so it can be categorised accordingly, and semantic relationships between data are clear;
- Provenance: the sources of data, and the paths to where it is consumed, are identified and documented;
- Quality: various quality dimensions such as completeness, conformity, consistency, validity, accuracy and timeliness of data are measured and published/reported on a regular basis;
- Governance: the decision-making bodies, policies, processes, accountabilities and responsibilities by which effective data management is defined, monitored, and enforced.
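One way to picture how these four services fit together is as a single catalogue record per data element. The sketch below is purely illustrative: the `CatalogueEntry` class, its field names, the example definition, the quality scores and the 0.95 threshold are all hypothetical and do not describe any specific data management product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    # Semantics: what the data means and what else it is called.
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    # Provenance: where the data comes from and where it is consumed.
    trusted_source: str = ""
    consumers: list = field(default_factory=list)
    # Quality: measured dimension -> score between 0 and 1.
    quality: dict = field(default_factory=dict)
    # Governance: who is accountable and who is responsible.
    owner: str = ""
    steward: str = ""

    def is_fit_for_use(self, threshold=0.95):
        """True if an owner is assigned and every measured dimension meets the threshold."""
        return (
            bool(self.owner)
            and bool(self.quality)
            and all(score >= threshold for score in self.quality.values())
        )


# Hypothetical entry for the 'customer name' example used earlier.
entry = CatalogueEntry(
    name="customer_name",
    definition="Full legal name of a current retail banking customer",
    synonyms=["name of account holder"],
    trusted_source="customer_master",
    consumers=["churn_model"],
    quality={"completeness": 0.99, "validity": 0.97, "timeliness": 0.91},
    owner="Head of Retail Banking",
)

print(entry.is_fit_for_use())      # timeliness falls below the default threshold
print(entry.is_fit_for_use(0.90))  # passes with a more lenient threshold
```

Keeping the definition, source, quality scores and owner in one place means an analytics project can look up an existing entry instead of rediscovering the same facts independently.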
Without a vision for streamlining how these requirements are met, an organisation’s data science efforts can all too swiftly devolve into a web of hit-and-miss, fact-finding engagements between analytics projects and potential providers, with each project independently trying to surface the right data from the right sources. Analytics projects may even start ‘sourcing’ data from other analytics projects, a practice that becomes ingrained in ways of working and reinforces the bad habits and cultural norms that prevent the development of a mature data-driven organisation.
Conversely, a centralised data management capability provides a hub of data services and expertise that allows all processes – whether analytics or not – to outsource their data management requirements effectively, and helps foster a strong data culture.
There are several benefits here. Firstly, the data semantics (definition, taxonomy, ontology, and modelling) and data provenance (lineage and trusted sources) services on offer not only free up valuable time and effort, allowing data scientists to focus on the actual analytics, but also ensure more reliable and explicable analytics results.
Secondly, the hub serves as a governing body for all data management within the organisation, ensuring that the outcomes are available across all processes. This allows for incremental gains, with the knowledge (semantics, provenance, and quality) built for one project adding to the existing organisational body of knowledge in an accretive fashion.
Thirdly, a centralised data management capability allows analytics processes and models to be defined within a globally accepted semantic model. This allows analytics results to be defined and communicated in a common business language, which in turn enables better interpretation and understanding of results across different decision makers.
Improving data culture
Regardless of the trustworthiness of analytics results, decision makers do not habitually act on these insights. This is particularly the case with data mining insights, which are often produced in financial services with little business sponsorship and poorly defined or planned business implementation.
What is often missing is not just trust, but also a willingness among decision makers to put insights into practice. This reflects an inherent preference to stick with subjective, opinion-based decision making. This aversion to relying on data science outcomes can be countered by having decision makers actively drive the data science process, and so be invested and interested in the outcomes.
Data governance is a key service that ensures the effective discharge of roles and responsibilities in relation to the management of data. Crucially, data owners and stewards must be identified and also engaged in the governance and management of data. These data owners are typically the same decision makers to whom analytics projects provide insights. In this way, the effective implementation of a data management service helps to drive cultural improvements by ensuring decision makers actively participate in the governance of the organisation’s data.
Effective data management
A data management capability helps foster a data culture that places decision makers, rather than data scientists and data managers, at the forefront of data-driven decision making. It requires data owners to be involved in the governance of the data from which they draw their insights.
Trust is built by ensuring that business outcomes are consistently in line with expectations. This requires expectations to be properly set, which in turn requires the semantics, provenance, and quality of data inputs to the analytics to be defined and known – ‘good’ inputs.
While meeting these data management requirements is very time-consuming and resource-intensive when done for each data project in a silo, outsourcing them to a centralised data management function makes economies of scale achievable. Successful data management is the foundational layer for good data science and data-driven decision making.