For over 15 years, I have been working with small to very large-scale data warehouses and have become quite comfortable with their vernacular and approaches for design, development, and implementation. Over the last few years, there has been a shift in the marketplace around the traditional concept of data warehousing. Many organizations’ data warehouses have served them for years, but after years of general maintenance and the inevitable changes in business, companies are now ready to replace their current technology with an architecture that will meet both current and future data needs. Many of the initial conversations around developing a new, upgraded data warehouse are scoffed at and immediately dismissed. The term “data warehouse” has been somewhat relegated to the “old school” bucket of terms next to “client-server”, “host computing” and “structured programming”. Organizations are looking for new, cutting-edge data platforms.
I believe the shift began with the advent and mainstreaming of “Big Data”. Big Data was originally the domain of data scientists who developed predictive and prescriptive models based on previously unusable or inaccessible mountains of data. After companies like Google, Yahoo, Facebook, and Amazon became poster children for unlocking value from Big Data, suddenly every organization considered itself Google-like in need and sprinted down the Big Data path. After all, if Google can leverage its Big Data for business value, then ‘my’ organization should be able to do the same - forget that Google is primarily a data company, and ‘my’ organization is in manufacturing, insurance, or retail. This trend whetted companies’ appetites to gather all possible data in one place, often without knowing why; it was just the thing to do.
Some pundits even promoted the idea that since Big Data brings so much value to the business, organizations should put their nice, clean, structured data into a file system, e.g., Hadoop, so it could all be used together - what company wouldn’t love to have all that business value at its fingertips? The thought was well intended but has not worked out as planned.
In my Big Data or data lake conversations with consultants and employees across different organizations, there is a consistent focus on how much data is in their data lake. “We have 27PB of data in our healthcare data lake!” “We have 4PB of data in our manufacturing data lake!” During these discussions I have never heard mention of business value or how the business actually uses its data. Some organizations have lost sight of the real purpose of a data lake and instead focus solely on loading as much data as possible. It’s as if the goal is to have the biggest data lake on the block.
I think one of the current challenges to realizing data lake value is the need to understand what is in the data lake. It’s difficult to use something so vast without knowing its contents or structure - you’re staring down the abyss and wondering what’s there. Few people are willing to rappel down to find out; most are content to wait for the movie to come out.
Where data lake expansion has gone generally uncontrolled, documenting what’s in the data lake is now a focus area. Fortunately, as companies realize the importance of capturing data lake metadata, they can now use data cataloging as a tool to discover and communicate data content and structure. In the past, the “new shiny object” (i.e., the data lake) got the attention, while common sense and data management were ignored or put on the shelf for later. Of course, it is better late than never.
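To make the cataloging idea concrete, here is a minimal sketch of a catalog entry and a keyword search over a catalog. It is an illustration only, not any particular catalog product: the `CatalogEntry` fields, the `search` function, and the sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one dataset in the lake."""
    name: str
    owner: str
    description: str
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> type


def search(catalog: List[CatalogEntry], keyword: str) -> List[str]:
    """Return names of datasets whose description or column names mention keyword."""
    kw = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if kw in entry.description.lower()
        or any(kw in column.lower() for column in entry.columns)
    ]


# Made-up sample entries, for illustration only.
catalog = [
    CatalogEntry(
        "claims_raw", "claims-team", "Raw insurance claims as received",
        {"claim_id": "string", "member_id": "string", "paid_amount": "decimal"},
    ),
    CatalogEntry(
        "plant_sensors", "manufacturing-it", "Sensor readings from plant equipment",
        {"sensor_id": "string", "reading": "double"},
    ),
]

matches = search(catalog, "claim")
```

Even a record this small answers the questions users ask first: what the dataset is, who owns it, and what it contains.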
The introduction of new terms and concepts and the morphing definitions of long-standing terms have further blurred the data landscape. Use the term “data lake” and people nod in agreement. However, ask for a definition and you will likely get a different response from each person. Sometimes the definition differences are extreme. Are we referring to a single repository of data, a collection of data repositories, raw data, cleansed data, or both? Similarly, the term “analytics” has somewhat morphed from traditional business intelligence and trend analysis to also include data mining, ad-hoc analysis, and data transformation processes. Speaking of data transformations, there’s a new buzzword in this area: “data wrangling”. Can someone tell me the real difference between data wrangling and data transformation? My thoughts are the following:
- Data wrangling: a generally difficult-to-share set of data transformation steps, created by and for an individual user
- Data transformation: a planned, designed, documented, and repeatable process that operates on data once, producing results that are easily shared with the masses
Why the new term? Data transformation is data transformation regardless of who or what does the transforming. With the advent of the data lake we see more references to ELT (extract, load, transform) than to traditional ETL (extract, transform, load) processing. The term “ETL” may eventually join “data warehouse” in the “old school” bucket of terms.
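The ETL and ELT orderings can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a real pipeline: the `extract`, `transform`, `run_etl`, and `run_elt` functions and the sample rows are hypothetical, and in-memory lists stand in for the warehouse and the lake.

```python
from typing import Dict, List

Row = Dict[str, str]


def extract() -> List[Row]:
    # Hypothetical raw source rows: untrimmed and inconsistently cased.
    return [{"customer": "  Acme "}, {"customer": "globex"}]


def transform(rows: List[Row]) -> List[Row]:
    # One documented, repeatable cleansing rule: trim whitespace, upper-case.
    return [{"customer": row["customer"].strip().upper()} for row in rows]


def run_etl(warehouse: List[Row]) -> None:
    # ETL: transform before loading, so only cleansed rows are ever stored.
    warehouse.extend(transform(extract()))


def run_elt(lake: List[Row]) -> List[Row]:
    # ELT: load the raw rows first, then transform the stored copy afterwards.
    lake.extend(extract())
    return transform(lake)


warehouse: List[Row] = []
run_etl(warehouse)

lake: List[Row] = []
curated = run_elt(lake)
```

The same `transform` function serves both orderings. The difference is what gets stored: with ETL the store holds only cleansed rows, while with ELT the lake keeps the raw rows and the curated result is derived from them.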
Modern Data Platform
For at least the near future, the data lake is here to stay. As analytics grows in scope, some traditional terms are viewed as increasingly archaic. A modern data platform must account for these and inevitable future concept shifts. In spite of the ongoing changes, the goal remains consistent - provide users with timely, high-quality, trusted, and consistent business data. Users don’t want to have to reformat, restructure, cleanse, standardize, or develop their own processes for combining and deriving meaning from data.
The scope of the modern data platform includes a data landing zone, data lake, curated data, governed “sandbox”, data transformation processes, and data wrangling functions. Data visualization tools may or may not be included. The modern data platform allows data scientists to perform discovery using raw data or some degree of transformed data in a governed sandbox environment. They can continue to live in this world, defining new and better models or otherwise identifying data patterns.
Business users want to access cleansed, standardized, curated data integrated from business applications and the data lake – it’s impractical to expect the majority of data users to directly access the data lake and perform the necessary activities to make the data usable. To provide user value, there must be data transformation processes in place.
Most IT departments are divided into groups by function (network, cloud, data integration, data architecture, data governance, etc.). Because the lines between these IT groups are increasingly blurred, their members must be more vigilant in coordinating their activities when developing a modern data platform – there is no rationale for duplicate efforts or integration mismatches.
The success and value of a modern data platform should not be contingent on the size of the data lake, but rather on how much business value the data lake (and its associated, curated data) provides. A focus on business value and an IT team ready to deliver on this vision are the keys to a successful modern data platform.