When you hear the expression Big Data, the next word you will often hear is 'Hadoop'. That's because the underlying technology that has made massive amounts of data accessible is based on the open source Apache Hadoop project.
From the outside looking in, you might assume that Hadoop is Big Data and vice versa; that one cannot exist without the other. But there is a Hadoop competitor that in many ways is more mature and enterprise-ready: High Performance Computing Cluster (HPCC).
HPCC Systems is a spinoff from data services company LexisNexis that has been powering that company's massive $1.5 billion data-as-a-service (DaaS) business since the early 2000s.
Like Hadoop, HPCC is open-sourced under the Apache 2.0 licence and is free to use. Both likewise leverage commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across the architectures. This is where most of the similarities end, says Flavio Villanustre, vice president of information security and the lead for the HPCC Systems initiative within LexisNexis.
HPCC older and wiser than Hadoop?
HPCC has been in production use for more than 12 years, though the HPCC open source version has been available for only a little more than a year. Hadoop, on the other hand, began as part of Nutch, an open source web search project whose developers drew on papers Google had published about its own distributed storage and processing systems, and wasn't even its own Apache project until 2006. Since that time, though, it has become the de facto standard for Big Data projects, far outpacing HPCC's 60 or so enterprise users. Hadoop is also supported by an open source community in the millions and an entire ecosystem of start-ups springing up to take advantage of this leadership position.
That said, HPCC is a more mature, enterprise-ready package that uses a higher-level programming language called Enterprise Control Language (ECL), which is based on C++, as opposed to Hadoop's Java. This, says Villanustre, gives HPCC advantages in terms of ease of use as well as backup and recovery in production. Speed is enhanced in HPCC because C++ runs natively on top of the operating system, while Java requires a Java virtual machine (JVM) to execute.
HPCC also possesses more mission-critical functionality, says Boris Evelson, vice president and principal analyst for Application Development and Delivery at Forrester Research. Because it's been in use for much longer, HPCC has layers (security, recovery, audit and compliance, for example) that Hadoop lacks. Lose data during a search and it's not gone forever, Evelson says. It can be recovered, just as it could be from a traditional data warehouse such as Teradata.
Rags Srinivasan, senior manager for Big Data products at Symantec, wrote about this shortcoming in a May 2012 blog post on issues with enterprise Hadoop: "No reliable backup solution for Hadoop cluster exists. Hadoop's way of storing three copies of data is not the same as backup. It does not provide archiving or point in time recovery."
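Srinivasan's distinction between replication and backup can be illustrated with a toy model. The sketch below is not Hadoop code and uses no Hadoop API; the ReplicatedStore class and its methods are hypothetical names invented for illustration. It assumes a store that applies every write and delete to all three copies, and shows that only a separate point-in-time snapshot lets you get back data that was deleted by mistake.

```python
import copy


class ReplicatedStore:
    """Toy model (not Hadoop): every write and delete hits all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.snapshots = []  # point-in-time copies, kept apart from the replicas

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # Replication faithfully repeats the mistake on every copy.
        for replica in self.replicas:
            replica.pop(key, None)

    def get(self, key):
        return self.replicas[0].get(key)

    def snapshot(self):
        """Record the current state; returns an id usable with restore()."""
        self.snapshots.append(copy.deepcopy(self.replicas[0]))
        return len(self.snapshots) - 1

    def restore(self, snapshot_id):
        """Roll every replica back to the state saved in the snapshot."""
        saved = self.snapshots[snapshot_id]
        self.replicas = [copy.deepcopy(saved) for _ in self.replicas]


if __name__ == "__main__":
    store = ReplicatedStore()
    store.put("report", "v1")
    snapshot_id = store.snapshot()
    store.delete("report")           # gone from all three replicas
    print(store.get("report"))       # nothing left to read
    store.restore(snapshot_id)
    print(store.get("report"))       # recovered from the snapshot
```

Three copies protect against a failed disk or node, but an accidental delete or overwrite is copied three times as well, which is the gap the blog post describes.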
Although Hadoop is less mature in these areas, it's not intended to be used in a production environment, so these distinctions may not be that important at the moment, says Jeff Kelly, Big Data analyst at Wikibon. What it's being used for is analysing massive amounts of data to find correlations between heretofore hard-to-connect data points. Once these points are uncovered, the data is often moved to a more traditional business intelligence solution and data warehouse for further analysis.
"Currently, the most common use case for Hadoop is as a large-scale staging area," Kelly says. "Essentially it is a platform for adding structure to large volumes of multi-unstructured data so that it can then be analysed by relational-style database technology."
ECL - A high-level query language with a drag-and-drop interface
Another key benefit of ECL, Villanustre says, is that it's very much like high-level query languages such as SQL. If you're a Microsoft Excel maven, you should have no trouble picking up ECL.
Developing queries is further simplified by the work HPCC has done with analytics provider Pentaho and its open source Kettle project, which lets users create ECL queries in a drag-and-drop interface. This isn't possible with Hadoop's Pig or Hive query languages yet.
HPCC is also designed to answer real-world questions. Hadoop requires users to put together separate queries for each variable they seek; HPCC does not.
"ECL is a little bit like SQL in that it is declarative, so you tell the computer what you want rather than how to do it," Villanustre says. Pig and Hive, on the other hand, are quite primitive. "They are hard to program, they are hard to maintain and they are hard to extend and reuse the code - which are the key elements for any computer language to be successful."
Hadoop advantages - Scalable, flexible and inexpensive
Charles Zedlewski, vice president of products at Cloudera, disagrees with this perspective. Cloudera, after all, is among the best-known and most successful Hadoop start-ups, providing turnkey Hadoop implementations to companies as diverse as eBay, Chevron and Nokia.
"In fact, today Hadoop probably has the ability to cater to a wider range of end users than the data management systems that have come before, and that has always been the strength of Hadoop," Zedlewski says. "The three things that Hadoop does really well is it's very scalable, it's very flexible and very inexpensive."
Beyond Hadoop's flexibility and robustness, it's this last point, cost, that has so many people interested in it. However, while Hadoop runs on commodity hardware, you either have to hire someone to put everything together or find a third-party provider such as Cloudera to do it for you. With HPCC, much of the functionality you need is available out of the box, and it runs on commodity boxes as well.
In the final analysis, if you're looking for a more robust solution that provides enterprise-grade functionality, then HPCC may be the way to go. If, on the other hand, you just want to get a feel for what Big Data is all about, then Hadoop may be the better alternative, since it has a massive open-source ecosystem of developers working on it daily and a host of third-party companies springing up to take advantage of the opportunity Big Data represents.
"The macro trend that is driving all this is the explosion of data," Zedlewski says. "Data is growing faster than Moore's Law, which is requiring this different architecture and different way of working with data. And the reason it's growing faster than Moore's Law is because more and more things are getting hooked up to computers, whether it be your house, your TV, your cell phone, the flight you took. When that happens, they all wind up generating data at prodigious rates."