When is big data useful, and when is it irrelevant?
If there’s one buzzword concept that intersects the worlds of business and computer science, it’s “big data”. There’s just as much excitement around machine learning. The hype has gotten so extreme that around June, someone chatting with us at Beagle asked whether we use deep learning as part of the AI we run on contracts. We told him we currently don’t – and our dear inquirer expressed a sudden disinterest in us, with no further questions. That exchange illustrates why I think a lot of this hype is misplaced, and, as with all buzzwords, people should exercise restraint in using them.
Why are big data and machine learning so big right now? I see two reasons that make them highly attractive to managers:
- Big data promises managers the opportunity to conduct informed, data-driven decision making
- Machine learning yields opportunities to deliver unique, tailored customer experiences
While the above is certainly true, small data works too! In fact, pursuing the goal of becoming a big data company carries significant costs, which may never be recouped (HBR has more on this).
Big data is hard and costly to work with. Wikipedia provides a good definition of big data: “… data sets so large or complex that traditional data processing applications are inadequate.” Indeed, to work with what is truly “big data”, an organization needs specialized data processing applications just to handle it. The hardware that big data processing happens on is a complicated configuration: clusters of individual computers that distribute the workload among themselves and process data in parallel. Examples of open-source (free) software that does the actual data processing include Hadoop (and its MapReduce model), Spark, Pig, Hive, and more. Much of this software addresses different pieces of the problem, and orchestrating and managing these hardware clusters takes a complex architecture built from even more software, like Samza, Kafka, Celery, and Redis (one example of NoSQL), to facilitate the whole thing.
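To make that parallelism concrete, here is a minimal sketch of the classic distributed job, a word count, written for Spark (one of the engines named above). This is an illustration under assumptions: the HDFS paths are hypothetical, and it presumes a Spark cluster is already up and running.

```python
# Minimal sketch: a distributed word count with PySpark.
# Assumes an existing Spark cluster; the hdfs:// paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")

counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word across the cluster
)

counts.saveAsTextFile("hdfs:///data/word_counts")
spark.stop()
```

Even this toy job leans on the whole stack underneath it: a cluster manager to schedule the work, a distributed filesystem to hold the input, and Spark itself to shuttle intermediate results between machines.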
But the complexity of such a setup is just the tip of the iceberg – any manager knows that the actual implementation of any IT project is riddled with opportunities for complications, obstacles, budget and schedule overruns, and vulnerabilities or errors due to fatal bugs. In this sense, a big data orientation can introduce real business risk.
So, “big data” is hard, risky, and in some cases infeasible. The good news is that small data can also provide valuable business analytics. The article “You May Not Need Big Data After All”, published in the Harvard Business Review, goes into detail about how small data can be leveraged to enable the data-driven decision making that managers dream about. Data that is available to most businesses (sales, operations or supply chain, internal accounting, HR) can be rich and full of potential insight. And this goes well beyond building sales forecasts: data scientists have the tools to study small data and find a huge variety of meaningful correlations and intelligence.
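For contrast with the cluster setup above, here is the kind of small-data analysis that needs none of that machinery: a sketch using pandas on a single machine. The sales.csv file and its column names (ad_spend, discount_pct, region, units_sold) are my invention for illustration, not anything from the article.

```python
# Sketch of small-data analysis with pandas on one machine.
# The file and column names below are hypothetical.
import pandas as pd

sales = pd.read_csv("sales.csv")

# A plain correlation matrix is often enough to surface
# relationships worth a closer look.
print(sales[["ad_spend", "discount_pct", "units_sold"]].corr())

# Cheap group-level summaries can reveal where performance
# diverges across the business.
print(sales.groupby("region")["units_sold"].describe())
```

No clusters, no orchestration, an afternoon of work: this is the scale at which most businesses’ data actually lives.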
In 2012, one year before the skeptical article above, Harvard Business Review published “Big Data: The Management Revolution”. It leads off with a provocative quote, “You can’t manage what you can’t measure”, and it is not hard to see how big data accumulated so much hype. The article is too general: it speaks in a tone addressed to all types of businesses, and it de-emphasizes the caveats of big data. However, it rightly touches on some important capabilities that are unique to big data. Now that I’ve done my duty to the reasonably skeptical engineers who are reading, I’ll talk about the promise of big data – because there are innumerable benefits for those institutions equipped to leverage it.
Andrew Ng, Coursera co-founder and machine learning researcher previously at Stanford, joined the Chinese search giant Baidu. Yann LeCun, a pioneering deep-learning researcher at New York University, is heading Facebook’s artificial intelligence research labs. Geoffrey Hinton, yet another pioneer of machine learning at the University of Toronto, joined Google to work on machine learning.
Beyond these three prominent examples, big companies (both tech companies and tech-aspiring companies) are picking AI researchers straight out of PhD programs at prominent schools. I am hard-pressed to think of another research area that has seen so much talent leave academia at such a rate. They’re migrating to industry because they want to get their hands on true big data.
What really constitutes true big data? Millions of users commenting, messaging, liking, and clicking every second. Millions of point-of-purchase sales every minute, at thousands of locations around the world. Millions of vendors, suppliers, distributors, and raw material manufacturers in a supply chain that still manages to coordinate just-in-time inventory.
The effective leverage of big data is a key distinction that separates the disruptor from the disrupted, the innovator from the laggard. See: Netflix vs. Blockbuster, Uber vs. taxis. Who knows what will come next? And who remembers those that have already failed?
Meanwhile, the researchers I mentioned earlier have propelled several innovations and even published their research to the public. See, for example, this paper recently authored by Andrew Ng and the Baidu research team on computer speech recognition for English and Mandarin. See also the lists of publications from Facebook’s AI research labs, or Google’s.
These publications are an example of how big data can be useful. To build a product that uses computers to do something complex and challenging (like high-quality recognition of spoken Mandarin), small data simply will not cut it: there just isn’t enough of it to find the relationships that unlock how this complex functionality works.
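One way to see this claim rather than just assert it is to plot a learning curve. The sketch below is entirely illustrative: it uses synthetic data and arbitrary parameters (nothing from the Baidu paper), but it shows the general pattern that, on a hard task, validation accuracy keeps climbing as the training set grows.

```python
# Illustrative learning curve on a synthetic "hard" task:
# accuracy keeps improving as training data grows.
# All parameters here are arbitrary assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# A deliberately complex task: many informative features, noisy labels.
X, y = make_classification(n_samples=20000, n_features=50,
                           n_informative=30, flip_y=0.05, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 5),
    cv=3,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>6} training examples -> validation accuracy {score:.3f}")
```

On a simple task the same curve flattens early, which is why small data is often enough; for something like large-vocabulary speech recognition, it keeps climbing far beyond what small data can supply.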
How can these mega-corporations afford to publish their intellectual property for anyone to take? To be sure, these algorithms are potentially effective for a myriad of AI problems. But the key to actually extracting value from them is being equipped with the big data they run on. The algorithms are like the schematics for a rocket engine: powerful software and hardware form the body of the rocket, and big data is the fuel that powers it.