Today, big data is a hot topic in almost every industry. May saw the biggest-ever European technologists’ conference on big data, Berlin Buzzwords, while the likes of O’Reilly’s Strata conference pull in huge numbers of attendees keen to learn how to adapt to this new world.
Despite all the interest, a great deal of confusion remains around big data. Not only are there never-ending debates about what big data is, there’s also a huge range of possible big data solutions to choose from, only a few of which will be appropriate for any given situation or problem. If you speak to some technologists from big Silicon Valley firms, they’ll say you don’t have big data until you have entire data centers on three continents. Many of the recent big venture capital (VC)-backed big data releases have targeted the “few racks” sized problem space. Hang out at the right tech events, and you’ll see various groups demoing their big data solutions on collections of machines that would fit comfortably in a shoebox.
On the other hand, there’s a small backlash against the big data movement, with some explicitly saying they have a small data problem. Many of these stress the importance of being able to process everything on one machine, and of ensuring that processing is available to all, not just to those with large budgets. That objection can be countered, though, by on-demand cloud systems from the likes of Amazon, which allow anyone with a few dollars to spare on their credit card to spin up their own temporary big data system for an hour to do their processing.
Where does this leave those starting out on their use of big data? When we look at potential solutions, systems and frameworks, how do we know if they are right for us? When the suave salesperson from a big data company phones, how do we know that what they’re pitching is of the right scale for our needs? After all, something that scales to a handful of machines won’t work when holding medical information for a whole country, while something that works best with data centers on three continents will be expensive overkill for those with only the low tens of terabytes of data to process. Both problems are big data, but what’s right for one won’t suit the other.
Considering such a range of situations, we have to ask ourselves: despite the hype, VC funding and marketing buzz, is the use of a single label to cover the whole space becoming a problem? Is the term “big data” still useful as a catchall?
At this point, let us allow ourselves a brief diversion. Where else have we come across multiple different words for “big”? For those living in places well served by a certain international coffee chain, the answer is every morning. For those either off the beaten track, or in a town with a strong independent coffee scene, the answer may be more elusive. Either way, what can we learn from the use of “tall,” “grande” or “venti” as different measures of big?
Well, sticking with coffee, in many places no one likes to order a “small.” Just as few wish to stand up in a senior management meeting and say “actually, we don’t have big data after all”; it is the same reluctance as with ordering a “small” coffee. People still want different sizes, but naming is important.
On the academic side, Google have released a number of seminal papers on big data. Whether we’re considering their paper on MapReduce, which led Doug Cutting and friends to re-architect Nutch along those lines (work that eventually grew into Hadoop), or their more recent Spanner work, which relies on known error bars to allow provably correct distributed handling of “what happened first,” we see great leaps forward. The computer scientist in me is excited by the prospect of what can be done, and the elegance of what’s possible. The pragmatist in me wants to know how we can solve last Tuesday’s customer issue without committing to another rack in the data center. While some solve globally distributed problems, many of us face short-term, multi-machine problems. Many of us foresee larger challenges, but not that many orders of magnitude larger. Find a VC over a beer, or certain researchers, and you’ll hear of the huge big data challenges that exist, and the innovative giant projects that help solve them. Compared to what many of us face, it seems a different world. Yet all of us are counted within the “big data” space. Faced with these divergent needs, can we really say we are all “big data”?
Given this range, how come one term has tended to stick? How much can be explained by the desire not to have a “small” problem, and how much can be pinned on people’s desire to follow the buzz and marketing around “big data”? Another challenge we face is fluidity, as new systems and products are developed. If you look at talks from big data events of two to three years ago, it’s striking how many described bespoke functionality and hard coding that is now available as standard in the latest tools. In some areas, what’s hard or big today won’t be next year, while other challenges remain. A new release might make enforcing security permissions easier, or allow new statistics to be run as standard, but the speed of light remains constant.
Despite this, a problem remains: how can someone new to this field work out which kinds of big data problems they have, and identify the right kinds of solutions? Plenty of companies, large and small, claim they have what you need, but how can you check before handing over large sums or spending lots of time? The boring and un-sexy answer is, in part, requirements: a clear identification of what your problem is and what is needed. If anything, the growth of big data has made the up-front gathering of requirements more important, not less. Users need to think about where their source data is, and what form it’s in. They must consider how spread out it is and how easy it is to run the data processing and analysis near to its location. Users must think about whether they can work directly on the source data, or whether it needs pre-processing. They must think about how fast the data is growing and how quickly they need to include new data in the results. They must work out the complexity of their calculations, where the outputs will go, and to what use they’ll be put. They must decide whether 10 machines for 10 minutes with added complexity are better or worse than two machines and simplicity for an hour. In conclusion, users must identify the problem, then pick the solution, not the other way around.
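The choice between 10 machines for 10 minutes and two machines for an hour can be made a little more concrete with some rough arithmetic. The Python sketch below is only an illustration: the `ClusterOption` class, the option names and the price of 0.50 per machine-hour are all hypothetical, and it assumes cost scales linearly with machine time. It compares total machine-minutes and leaves the harder question, the cost of the extra complexity, for the user to judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterOption:
    """A hypothetical way of running one processing job."""
    name: str
    machines: int
    minutes: float  # wall-clock time until results are ready

    @property
    def machine_minutes(self) -> float:
        # Total compute consumed, regardless of how it is spread out.
        return self.machines * self.minutes

    def cost(self, price_per_machine_hour: float) -> float:
        # Assumes simple linear pricing per machine-hour (an assumption,
        # not any particular provider's billing model).
        return self.machine_minutes / 60.0 * price_per_machine_hour


wide = ClusterOption("ten machines, more complex", machines=10, minutes=10)
narrow = ClusterOption("two machines, simpler", machines=2, minutes=60)

PRICE_PER_MACHINE_HOUR = 0.50  # hypothetical figure, for illustration only

for option in (wide, narrow):
    print(
        f"{option.name}: results in {option.minutes:.0f} min, "
        f"{option.machine_minutes:.0f} machine-minutes, "
        f"cost {option.cost(PRICE_PER_MACHINE_HOUR):.2f}"
    )
```

Under these assumptions the wider option uses 100 machine-minutes against 120 for the narrower one, and returns results six times sooner. What the numbers cannot show is whether the engineering effort of the more complex setup is worth that saving, which is exactly why the requirements need to come first.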
Once the requirements are gathered, users can group themselves into a “kind” of big data, to help search for the right solution.