Why do relatively few companies have success with big data?
Ultimately, it’s because they prioritize technology, not business value, in their big data strategies. To understand why this is the case, it’s helpful to distinguish between big data in the abstract – e.g., the concept or definition – and big data as something that’s alive for us, something that’s both a challenge and an opportunity for us, in our own environments. The former is an umbrella term for a class of problems that (broadly) apply to storing, managing, and analyzing large volumes of data at unpredictable intervals. The latter is less a definition than a specific – or, to be precise, a personal – understanding of what “big data” means for us.
The formal definition views big data as, first and foremost, a technological problem. The personal understanding, by contrast, is pragmatic: it views big data through the lens of the business use cases that it alone makes possible, the business requirements that it alone addresses. Unfortunately, most of us tend to fixate on the technology-centric definition. At best, we have a vague sense that big data has the potential to change everything – once we’ve got the right technology bits in place. Chances are, most of us have fallen into this trap.
Try asking yourself a simple question: “When I think about ‘big data’, what do I actually have in mind?” What does your gut say? Do you mean data that’s “big” in relation to normative metrics or standards, e.g., its size in terabytes, petabytes, etc.? Or do you mean data that’s big for your organization, with its known resources, competencies, and limitations? (Bear in mind that depending on a company’s size or industry, “big” data volumes could total less than 100 gigabytes – uncompressed.) When you talk about the “bigness” of data, do you really mean something about its technological complexity – e.g., the challenges you’ll face in ingesting, persisting, managing, and, above all, retrieving and using it? Finally, when you think of “big” data, do you think of it as primarily a technological issue? Is technology, not business, your go-to frame for big data? Not surprisingly, this last is the most important question of all.
It’s the most neglected question of all, too. That’s because of the technological bias that’s basically baked into the way we think about big data and all things related to it.
The upshot is that we rarely think of big data as a prerequisite for one or more business-specific use cases, or as a business-oriented innovation that enables new concepts, tools, and techniques for modeling, analyzing, synthesizing, and interpreting information. It has the potential to change the way we run our businesses, the products and services we develop, the markets in which we compete.
Slouching Towards Big Data
This lack of business-centric focus has consequences. For instance, over the last few years, I’ve worked with several client organizations that decided to move to “big data” platforms so they could reap the (usually vendor-hyped) benefits they’d heard so much about.
This is a textbook example of what I like to call the “if-you-build-it-they-will-come” fallacy. (“They” being users.) It’s what happens when an organization tries to rationalize the decision to adopt a new technology by linking it to a vague estimate of potential business value: e.g., “Big data will increase profits/improve marketing spend/accelerate new product development, etc.” In painful point of fact, the organization makes little or no effort to identify concrete business use cases – an improved sales forecasting app, or new apps or services that make it possible to improve manufacturing yields – for which the new technology is a prerequisite.
If-you-build-it-they-will-come worked in the film classic Field of Dreams; in business, however, it is all too frequently a prelude to frustration, fecklessness, and, inevitably, failure.
In practice, a lack of business-centric focus can also have surreal – to outsiders, at least – consequences. One client I worked with had very unrealistic expectations about the big-ness of their data storage and processing requirements. When I took the time to estimate their average (daily) data volume, my calculation yielded a shockingly low figure: less than 10 GB of relational data! Now, to me, 10 GB is not big-data scale.
Still, maybe 10 GB really is big-data scale for this specific client. That’s a legitimate point, although it proved to be irrelevant in this case. In the first place, the client had made a technology decision without having a clear idea of their data storage and processing requirements. Their driving motivation was to do big data. That’s not only a business no-no, but a technology no-no too. As for potential business applications, these were, in effect, mere afterthoughts: the client had not identified a single concrete business use case and had no plan or vision for what they would actually do with their big data platform once it was up and running. In the second place, the client had made a particularly poor technology decision. For perspective, consider that although my laptop is not a “beefy” laptop, it can hold nearly 10 GB of data in memory. The same amount can easily be stored on my off-the-shelf mobile phone. From a pure technology perspective, the client would have been better served by moving to a traditional symmetric multiprocessing (SMP) relational database or maybe even a massively parallel processing (MPP) database instead of a self-styled “big data” platform.
These anecdotes aren’t anomalies. They’re part of a pattern. I frequently encounter clients that say they’re doing “big data” when all they’ve really done is load a bunch of their data into a platform such as Hadoop. Ask them what they’re actually doing with this data – or how they’re using it – and many will sheepishly admit that they aren’t doing much. Maybe they’re having trouble accessing their data in a timely, efficient manner. Maybe they haven’t even gotten this far. Maybe they aren’t able to use their data; to be sure, their big data platforms enable them to ingest data at massive scale – to the tune of tens, hundreds, even thousands of terabytes – but they’re unable to get the same data, the right data, back out again. Maybe – you’d be surprised how often this turns out to be the case – they aren’t actually exploiting any of the big data-specific features or capabilities of their new big data platform.
That’s because they’re still struggling with the basics.
The situation is analogous to the Dead Sea, which is one of the most inhospitable marine environments in the world. Just how inhospitable? With an average salinity of 34.2 percent, the Dead Sea is 9.6 times saltier than ocean water, which makes it ill-suited for most forms of flora and fauna. One of the reasons the Dead Sea is so salty is that it’s a “closed basin.” It’s replenished via a fresh-water tributary – the Jordan River – but no water flows back out of it again. That said, water does escape from the Dead Sea, chiefly via evaporation. This is why (in spite of a continuous flow of fresh water from the Jordan) its salinity levels remain unusually high. They’re likely going to get even higher, in fact, because the Dead Sea is … dying.
The fate of the data that’s moved to these platforms is analogous, in a way, to that of life in and around the Dead Sea. When it’s cut off from any means of egress, the data in a big data platform remains inhospitable to – because it’s unavailable for – use. What you’re left with is something that is neither living nor dynamic nor life-giving. You’re left with something inert.
Succeeding with Big Data
Why, then, are so many organizations moving to big data platforms? One reason is the “Google-does-it” phenomenon: the idea that Google’s or Facebook’s or Amazon’s success with big data, predictive analytics, machine learning, and more advanced types of analytics (this article hasn’t even touched on deep learning or artificial intelligence, for example) is a model for organizations of all kinds. After all, these companies developed many of the core big data technologies – not just as potential proofs of concept, but as production-ready systems.
The thing to keep in mind is that your organization is not Google, Facebook, Twitter, or Instagram. Their business is data. Yours is … something else, irrespective of how important data and analytics are to it. In all likelihood, in fact, your organization manufactures, sells, and/or provides goods and services to consumers or other businesses. Think of big data as a business-oriented innovation. It’s a resource for visualizing your business – a means of obtaining a rich understanding of what your business is, of how it operates, of its strengths and weaknesses, and of the other “inhabitants” – customers, suppliers, partners, competitors – in its world. You can’t afford to take an if-you-build-it-they-will-come approach to big data and business value: e.g., amassing, managing, and analyzing data without concrete purpose or intent. You should focus on identifying business-specific requirements, applications, use cases, etc. that are enabled by or will benefit from what big data technologies do best. (These include: faster processing of more and different types of data; more rapid delivery of more and different types of data; high-value predictive insights that require more and different types of data; or some combination of all three.) A variation on a catchphrase from another movie – “Show me the money!” – seems appropriate here: “Show me the business value! Show me the use cases!” This isn’t a new philosophy. It’s the way we should’ve been doing it all along.
The term “big data” is a source of ongoing confusion and frustration for many clients.
With this in mind, I’ve tried to break it down into categories that make it easier to conceptualize it, as well as to situate it in understandable contexts. The most common way of categorizing big data is to see it as a function of the so-called “Three Vs” – volume, variety, and velocity – but I don’t think this approach is particularly helpful, especially from a business perspective. It’s much more helpful to view big data through the lens of use – i.e., the purposes for which it is actually used. Yes, the Three Vs – as well as, notionally, dozens of other Vs (veracity, viability, volatility, etc.) – are implicit in all of these, but the purposes themselves are prior and fundamental. Based on my experience, big data usage can generally be categorized into one or more of the following purposes:
- Structured Data Analysis – The sheer volume of our existing data and/or the frequency at which we need to ingest this data means that even traditional structured data tasks can potentially become “big data” workloads. This raises an obvious question: “At what volume and/or frequency does structured data become ‘big data’?” There is no hard-and-fast rule. In the first place, if or when data volumes (or data delivery frequencies) far outstrip what is normal for you, they’re by definition “big,” quite aside from how they compare with the canonical definition. In the second place, there is no canonical definition. The term “big data” is non-specific enough to encompass anything from a few terabytes to petabytes, exabytes, zettabytes, and beyond. (That would be yottabytes, for the record.) As for the frequency or velocity at which structured data analysis starts to become a big-data scale problem, there’s no hard-and-fast rule here, either. Frequency typically only becomes a problem if or when data volumes also increase: real-time delivery and ingest, by itself, isn’t a big-data scale problem. Real-time delivery and ingest of gigabytes or terabytes of data, on the other hand, poses significant challenges, even for an MPP database platform. Suffice it to say, if you’re ingesting large volumes of data at frequent intervals, you’re “doing” big data.
Another factor to consider is data sparsity, a term that’s used to describe very wide records that contain very little actual data. Imagine, for example, that you have 500 data elements in a record but that only 10 percent of each record’s elements are ever populated. (Well-known examples include manufacturing test data as well as some kinds of telemetry data.) If you find that you’re ingesting more and more sparse data, it might make sense to consider a NoSQL platform. Does use of a NoSQL platform mean that you’re working in the data big leagues? Not necessarily – again, data’s big-ness is always a function of scale and context.
- Unstructured/Semi-Structured Data Analysis – When big data first became a thing, a number of vendors, analysts, and thought-leaders talked about it primarily in connection with what they called “unstructured” data. After some time had passed, a new crop of vendors, analysts, and thought-leaders – along with a few first-wave champions – noted that much of what we were calling “unstructured” data actually had some kind of structure to it. For example, web logs will often include conventional data elements (such as log-record types) along with seemingly “unstructured” – or “multi-structured,” to use the current term – data that (depending on what’s happening) is also captured. At some point, however, it is necessary to impose some kind of structure on data of any type. Absent structure, data isn’t intelligible to human beings, which means it can’t be explored, profiled, interpreted, and – if necessary – integrated. (This is all preparatory to analysis itself, in which data – joined, blended, or combined with other data – is explored, analyzed, and interpreted.) Just when to derive structure is a fraught question, however. Speaking for myself, the most successful big data implementation for multi-structured data I’ve seen adhered to the mantra: “Impose structure on unstructured data as soon as possible.”
Most RDBMS platforms can accommodate semi-structured (e.g., text) data, at least to an extent. They’re less adept at storing and managing – or analyzing – multi-structured data. A NoSQL platform can provide a good deal of flexibility for this use case, serving as both a repository-of-record for multi-structured information and as a context in which to prepare (impose structure on) multi-structured data prior to extracting, moving, and loading it into a destination platform for analysis. In this way, you can rapidly make multi-structured data available for analysis and retain it, unchanged, in a repository-of-record.
- Machine Learning – This involves using machine learning technologies to identify likely correlations between disparate (and often disconnected) phenomena – events that are distributed across both space and time. (A predictive analytics solution for security penetration testing, for example, must connect different events on different systems that occurred at different, sometimes widely spaced, points in time.) Correlations are used as a basis for a prediction that says “if this then that.” By applying algorithms and functions to specific, pre-determined data elements, it’s possible to identify non-obvious patterns and to predict outcomes based on probability. Machine learning makes use of structured, semi-structured, and multi-structured data. It’s an ideal candidate for big data: generally speaking, the more data you feed a predictive model, the better – or more accurate – the resulting predictions.
- Data Science/Discovery – At last: the “Holy Grail” of big data-specific use cases – or, more precisely, the thing everyone’s talking about. But what does a data scientist actually do? Remember back in middle school when you learned about the scientific method? You were taught how to formulate a hypothesis, how to devise an experiment to test this hypothesis, how to conduct the experiment, and – finally! – how to analyze and interpret your results. A data scientist does basically the same thing. Scientists always work in specific contexts, such as chemistry, biology, physics, and – yes – business. Think of each of these contexts as its own little world: a scientist devises her experiment in order to generate new data about this world. When she analyzes and interprets this data, she hopes she’ll discover something new about her world. As with most traditional scientific experiments, the vast majority of data science experiments will fail. “Failure” is a misleading term, however, because even in failure, the data scientist excludes certain possibilities – hypotheses about connections between events, or about the probable cause of certain anomalies – and generates knowledge about her world. Successful data science generally requires patience, creativity, and tenacity. It likewise requires large volumes of data of different types. Some data science experiments will combine data of different frequency characteristics: streaming data, for example, that is correlated (in time, space, etc.) with transactional data, or web log data. In data science, as in machine learning, the more data, the better.
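As a rough illustration of the “impose structure as soon as possible” mantra from the unstructured/semi-structured category above, here is a minimal Python sketch. It assumes a hypothetical web-log layout (timestamp, level, service name, then free-form text with optional key=value pairs); the function name, field names, and sample line are invented for illustration and don’t come from any particular platform. Each raw line is kept unchanged alongside the fields derived from it, echoing the repository-of-record idea.

```python
import re
from datetime import datetime, timezone

# Assumed (hypothetical) layout: "<ISO timestamp> <LEVEL> <service> <free-form text>"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s*(?P<text>.*)$"
)
# key=value pairs embedded in the free-form part of the line
PAIR_PATTERN = re.compile(r"(\w+)=(\S+)")


def impose_structure(raw_line):
    """Derive a structured record from one log line, keeping the raw line unchanged."""
    unparsed = {"raw": raw_line, "parsed": False}
    match = LINE_PATTERN.match(raw_line)
    if match is None:
        # Lines that don't fit the assumed layout are retained, not discarded.
        return unparsed
    try:
        timestamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return unparsed
    text = match["text"]
    return {
        "raw": raw_line,  # the untouched original, for the repository-of-record
        "parsed": True,
        "timestamp": timestamp.replace(tzinfo=timezone.utc),
        "level": match["level"],
        "service": match["service"],
        # Structured attributes pulled out of the free-form text
        "attributes": dict(PAIR_PATTERN.findall(text)),
        # Whatever remains once the key=value pairs are removed
        "message": " ".join(PAIR_PATTERN.sub(" ", text).split()),
    }


sample = (
    "2024-01-15T10:32:07Z ERROR payment-service "
    "Timeout contacting gateway order_id=8841 retries=3"
)
record = impose_structure(sample)
print(record["level"], record["service"], record["attributes"], record["message"])
```

Lines that don’t fit the assumed layout come back flagged as unparsed rather than being dropped, so structure can still be imposed later if the layout turns out to be different.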
Big data is a very large world. A shotgun approach will likely produce little or no value. Likewise, sifting through mountains of data without a specific purpose is like searching for treasure on the beach while the tide comes in and out – you just hope to uncover something interesting or valuable. There is a case to be made for storing lots of data without a specific purpose in order to unlock the value at a later time. However, if you need to unlock the value sooner, some initial steps can help guide the efforts in the right direction. For a new big data initiative or when resurrecting a stalled big data effort, identify the business objectives to be achieved, understand the types and variety of data that will be sourced, and categorize the purpose for which the data will be used. Aligning big data efforts with business objectives and understanding the purpose category are more likely to produce the value big data is purported to provide.