While the vision for data lakes has always been to make data available more quickly, few companies have managed to make the needs of business end users a central focus.
The reality for many banks is that the data contained within most data lakes is not accessible to the average business user, who relies on central engineering teams to construct queries to extract the data sets.
There are two stark realities when it comes to serving business users (or not, as is often the case):
One bank had built its data lake over a number of years, ingesting many data sources along the way.
However, instead of building analytics and improving the quality of the data lake, the central engineering team spent all of its time dealing with requests to ingest new data sources, doing little data improvement and building only the first few layers of the transformation.
The bank’s analysts and data scientists had become frustrated because the central engineering team had become a bottleneck. To get any new data source in, they had to go through the central data lake engineering team and an approvals process. Ingesting new data sources involved so much manual work that it slowed down the entire analytics journey.
At a second bank, which had custom-engineered its data lakes, senior IT executives dedicated extensive budget to building a huge team with excellent capacity. The team then spent months bringing in data sources without really consulting the business about what the use cases should be.
Ultimately, IT didn’t get the buy-in because the business didn’t understand how to get value from the data. Because it took so long to ingest a new data source, productionise new use cases and manage data quality and governance, the business got bored and went away to find another tool.
The end result was a userless data lake with no subscribers and no stakeholders involved.
If data lakes stand in the way of business users easily accessing data, or of use cases articulating a clear path to decisions that lead to ROI, banks have a problem. Here, we discuss three changes businesses are making to data lakes to ensure that relevant use cases see the light of day and that business users can put them to the test.
Rising to the Self-Service Challenge
From the outset, the main purpose of data lakes has been to give business users immediate access to all data, freeing them from relying on data warehousing teams to model that data, or simply to give them access.
The point is that nothing was meant to stand between business users and data, but the reality is much more complicated than this. Business users often struggle with self-service on data lakes and ultimately end up relying on engineers to construct complex queries to extract that data, which slows release cycles. This is largely because open source tools typically do not ship with self-service capability, so it has to be built.
For use cases to be timely, relevant and useful, business users need to be able to get to and query their data. Data discovery needs to be made simple, allowing users to build queries that access the data and to build data products that support analysis.
One of the important differentiators in the next generation of data lake platforms is that they feature self-service capabilities for non-technical users. Many feature Google-like search functionality against both data and metadata, allowing users to quickly scan the schema catalog for relevant resources.
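As a rough illustration of what search against metadata can look like, the following minimal sketch ranks catalog entries by how many search terms appear in their names, descriptions, columns or tags. It is not taken from any particular platform: the `CatalogEntry` structure, the `search_catalog` function and the sample datasets are all hypothetical, and real platforms use far richer indexing and ranking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One dataset in a hypothetical data lake catalog."""
    name: str
    description: str
    columns: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def search_catalog(entries, query):
    """Return entries ranked by how many query terms match their metadata."""
    terms = query.lower().split()
    scored = []
    for entry in entries:
        # Flatten all searchable metadata into one lowercase string.
        haystack = " ".join(
            [entry.name, entry.description, *entry.columns, *entry.tags]
        ).lower()
        score = sum(1 for term in terms if term in haystack)
        if score > 0:
            scored.append((score, entry))
    # Highest score first; ties keep catalog order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Illustrative catalog contents (invented for this example).
catalog = [
    CatalogEntry("card_transactions", "Daily card payment records",
                 ["account_id", "amount", "merchant"], ["payments"]),
    CatalogEntry("customer_profile", "Customer master data",
                 ["customer_id", "segment"], ["crm"]),
    CatalogEntry("loan_book", "Outstanding loans by customer",
                 ["customer_id", "balance"], ["lending"]),
]

results = search_catalog(catalog, "customer balance")
print([entry.name for entry in results])  # ['loan_book', 'customer_profile']
```

A business user typing "customer balance" is pointed first at the loan book, which matches both terms, without needing to know table names or write a query.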
Fixing Data Quality Control Issues
Simply put, business users must be able to rely on the quality of data in a data lake, which is something many companies cannot guarantee.
Data in custom-engineered data lakes tends to degrade over time. Without a proper approvals process and data quality tools in place, people often continue to engineer in new data sources and integrate these into the data lake as one-offs. This ends up being a quickly forgotten process that does not focus on the quality of data.
So what can companies do to ensure data quality is monitored and maintained? Automation must be put in place to ensure data is refreshed regularly, and to monitor the quality of data. Without this automation, data quality degrades over time until the data becomes useless for analysis.
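To make the idea of automated monitoring concrete, here is a minimal sketch of a check that a scheduler could run after each refresh. It tests two things: freshness (was the data refreshed recently enough?) and completeness (are required fields mostly populated?). The function name `check_quality`, the thresholds and the sample rows are assumptions made for illustration, not features of any specific product.

```python
from datetime import datetime, timedelta


def check_quality(rows, required_fields, last_refreshed, now,
                  max_age=timedelta(days=1), max_null_rate=0.05):
    """Return a list of readable issues; an empty list means the data passed."""
    issues = []
    # Freshness: flag data that has not been refreshed within max_age.
    if now - last_refreshed > max_age:
        issues.append("stale: last refreshed " + last_refreshed.isoformat())
    # Completeness: flag fields whose share of missing values is too high.
    for field_name in required_fields:
        missing = sum(1 for row in rows if row.get(field_name) is None)
        null_rate = missing / len(rows) if rows else 1.0
        if null_rate > max_null_rate:
            issues.append(f"{field_name}: {null_rate:.0%} missing")
    return issues


# Illustrative sample data and timestamps (invented for this example).
rows = [
    {"account_id": "A1", "amount": 120.0},
    {"account_id": "A2", "amount": None},
    {"account_id": "A3", "amount": 75.5},
    {"account_id": "A4", "amount": 10.0},
]
now = datetime(2024, 1, 3, 9, 0)
issues = check_quality(rows, ["account_id", "amount"],
                       last_refreshed=datetime(2024, 1, 1, 9, 0), now=now)
print(issues)
# ['stale: last refreshed 2024-01-01T09:00:00', 'amount: 25% missing']
```

In practice the returned issues would feed an alert or a dashboard, so that degradation is caught when it starts rather than when an analyst stumbles on it.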
Get Governance in Place
Companies often don’t have the experience, capability or skills to fully enable the governance or security they need to safely and productively maintain a data lake. While the flexibility of data lakes is one of their top selling points, without governance a data lake can quickly become a data swamp.
Metadata management, data cataloging and indexing are essential if users are to be able to query and use data in data lakes.
To build the features that excite the business and solve real problems, banks need to put the next generation of data lake platforms to the test. Addressing self-service, data quality and governance is a great step towards making data lakes user-centric and straightforward, and towards solving banks’ biggest analytics dilemmas.
Maurizio Colleluori is Principal Data Engineer for Kylo™, Think Big Analytics