Thursday, November 17, 2011

HPCC taking on Hadoop: Yet another Hadoop Killer?

After almost five to six years of dominance in the big data world, Hadoop has finally come under fire from new, up-and-coming architectures. Lately it has been facing plenty of competition from so-called “Hadoop Killers”. Yes, I am a Hadoop lover, but there have always been areas where it lags behind: real-time calculations, processing of small data sets, CPU-intensive problems, continuous computation, incremental updates, and so on.

Recently, LexisNexis open sourced its Hadoop killer, claiming that it outperforms Hadoop in both data volume and response time. It looked to me like a definite killer, so I decided to take a look at it and compare it with Hadoop. Before getting into the head-to-head comparison, here is a brief introduction to HPCC for developers too lazy to visit the HPCC site.

HPCC (High Performance Computing Cluster) is a massively parallel processing computing platform that solves Big Data problems. The platform is now open source!

HPCC Systems takes on Hadoop directly, which is fair enough as far as functionality is concerned, but architecturally the two are not comparable. Hadoop is based on the MapReduce paradigm, whereas HPCC is not MapReduce; it is something else entirely. At least, that is what I have been able to figure out from their public documents.

Take a look at the architectural diagram of HPCC below; I am not going into the details of every component here.


[Image: HPCC architecture diagram]

However, there are a few places where HPCC has an edge over Hadoop:

  1. Developer IDE
  2. Roxie component

Places where both are on the same track:

  1. Thor (the ETL component): compares directly to MapReduce
  2. ECL: a high-level programming language, much like Pig

Now let’s talk actual numbers. I performance-tested both frameworks on a few standard use cases: an N-gram finder and the classic word count example. Interestingly, I found mixed benchmarking numbers across the two tests: HPCC performed better on the N-gram finder, whereas Hadoop did better on the word count use case. Here are the exact numbers from our tests (a minimal sketch of a typical Hadoop word count job follows the round scores):

Round 1: N-Gram Finder

[Image: N-gram finder benchmark results]

Score: HPCC: 1, Hadoop: 0


Round 2: Word Counter

[Image: word count benchmark results]

Score: HPCC: 1, Hadoop: 1
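
For context, the word count job on the Hadoop side is just the standard MapReduce example everyone writes first. Below is a minimal sketch of such a job using the org.apache.hadoop.mapreduce API; it is illustrative only, not the exact code used in this benchmark (class names and input/output paths are placeholders).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle volume
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}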

Yeah, with such mixed test results we cannot conclude anything, so let’s compare them on some crucial non-functional parameters.


Round 3: Licensing

The biggest issue HPCC is going to face is its licensing model. The open source world has never embraced the AGPL very well. If HPCC wants to compete and grow big the way Hadoop has over the last four or five years, they must rethink their licensing model.

Hadoop, on the contrary, with its Apache 2.0 license, has led many enterprises as well as individuals to contribute to the project and help it grow bigger and better.

Score: HPCC: 1, Hadoop: 2


Round 4: Adoption Rate

This is not as big a factor as the others, but given the previous point, HPCC is bound to face adoption issues among developers and companies of all shapes and sizes.

For developers, the hurdle will be adopting the ECL programming language, and developers never feel comfortable having to deal with a new language. Hadoop’s development environment, on the other hand, is highly flexible for all kinds of developers: MapReduce jobs can be written in Java, C++, Python and more; for those from a scripting background, Pig is the easiest way to go; and thanks to Hive, DBAs can use its SQL-ish syntax.

Score: HPCC: 1, Hadoop: 3


Round 5: Ecosystem and related tools

Even if we consider HPCC to be more efficient than Hadoop, I feel the HPCC folks are a bit late entering the open source world. Hadoop has put on a lot of weight! In the last five years it has travelled a long way, from a MapReduce-based parallel distributed programming framework to the de facto standard for big data processing and data warehousing. HPCC, I am afraid, is still more or less just a big data framework.

So HPCC is not competing with just another framework but with the whole ecosystem that has developed around Hadoop in recent years. It is really competing with Hadoop and its army, and that is where the real strength lies.

Take a look at how Hadoop has evolved in recent years to get a clearer picture.

[Image: evolution of the Hadoop ecosystem]

Score: HPCC: 1, Hadoop: 4

Final Score: Clear Hadoop Dominance

Let’s talk practically: Hadoop is not the most efficient big data architecture. In fact, it is said to be anywhere from 10x to 10000x too slow for what is required. There are tools and technologies that perform far better than Hadoop in their own domains: for database joins any SQL engine is better; for OLTP environments there is VoltDB; for real-time analytics Cloudscale performs much better; for supercomputing/modeling/simulation MPI and BSP are the clear winners; Pregel is there for graph computing; Dremel for interactive big data analytics; and Hadoop cannot match Caffeine or Percolator for incremental computation over big data.

Hadoop’s strength lies in its huge ecosystem, and that is its USP. It offers Hive for SQL-ish requirements, HBase as a key-value store, Mahout for machine learning, Oozie for workflow management, Flume for log processing, Blur for search, Sqoop for interaction with RDBMSs, and so on, plus a vast developer network. So, when we look at the totality of building a data warehouse or BI tool, Hadoop remains the obvious choice for now.

3 comments:

  1. This comment has been removed by the author.

  2. Mayur,

    Nice article!

    There are some aspects that I would like to dig into for the fairness of this comparison, though.

    First of all, you are right when you say that HPCC is not based on a strict MapReduce paradigm; as a matter of fact, the design and implementation of the HPCC platform predates the paper on MapReduce published by Google in 2004, by several years.

    HPCC was designed as a general purpose distributed shared nothing Big Data workflow platform and, as such, accommodated the most natural way to refer to data operations, implementing a declarative data oriented language (ECL) which has high level primitives (JOIN, SORT, DISTRIBUTE, TRANSFORM, etc.).

    If you have worked with MapReduce and key/value stores (Hadoop and others) you probably felt the entire MapReduce paradigm unnatural: if I just want to join two datasets, why do I need to give the system details of the underlying implementation (for example, why do I need to perform a sort-merge-join)? And this gets much worse as you try to express more sophisticated data operations.

    In ECL, joining two datasets is just a single operation: JOIN(dataset1,dataset2,condition). ECL also presents other advantages including a powerful optimizer (in the example above, the specific method of join can take advantage of, for example, a dataset that fits completely in RAM in each node, etc.), lazy execution (ECL code with no associated action doesn’t need to be compiled and executed at all), implicit parallelism (the exact same ECL code will run on one or 1000 nodes without changes), etc.

    While PIG could resemble ECL from afar, both obey different programming paradigms (PIG is mostly imperative, while ECL is declarative, for example), have different degrees of maturity (PIG forces the programmer to resort to User Defined Functions frequently, while ECL offers a complete set of high level primitives), and ECL is quite superior when it comes to code and data reuse (code/data encapsulation, abstracting the underlying implementation from the public interface, for example). Hive, the other query language of the Hadoop world, was not designed to handle extensible programming either, so it will do little more than just offer some query capabilities to Hadoop. At this point, I would say that ECL scores significantly higher than the other two options (I vote for +3 :)).

    Regarding the word count code, your results are quite different from mine (which makes sense, since our codebases probably differ significantly!). This is an example of a word count program in ECL, assuming that your input file has one word per line (it can easily be changed to support any number of words per line):

    // Import the standard ECL string functions library
    IMPORT std.str;

    // Define the record layout for my input dataset
    op_record := RECORD
        STRING25 my_word;
    END;

    // Load my input dataset into op_record
    OP := DATASET('~tutorial::fv::lotsofwords', op_record, THOR);

    // Define the record layout for my output dataset
    R := RECORD
        str.ToUpperCase(OP.my_word);
        UNSIGNED C := COUNT(GROUP);
    END;

    // Perform the actual count using the TABLE data structure
    // The OUTPUT action is not required but is included for completeness
    OUTPUT(TABLE(OP, R, str.ToUpperCase(OP.my_word)));

  3. On a more general comment, Roxie is a data delivery platform, designed to handle thousands of simultaneous data queries per second, and usually leverages index based data queries (where distributed simple and compound keys, dynamic indices, fuzzy matching, Boolean access and hash based data distributions can be used as necessary). Thor processing model, on the other hand, is based on parallel sequential access to datasets. Based on this premise, Roxie wouldn’t be the platform of choice to implement a word count algorithm (it would be the platform to use if you wanted to serve the results to thousands of simultaneous clients, though, but you would probably pre-calculate those counts and aggregates in Thor and publish an appropriate query in Roxie).

    Since Hadoop doesn’t have a platform equivalent to Roxie (HBase is probably as close as it gets to a data retrieval system in the Hadoop ecosystem, but its features are not comparable to those of Roxie), I think Roxie should count as a positive score for HPCC :)

    Regarding licensing, while you may be correct about the fact that the AGPL license hasn’t created significant traction, when compared to other, more permissive, licenses such as Apache and BSD, GPL is my personal favorite and I’ll try to articulate why.

    In my opinion, Apache and BSD licenses don’t protect the freedom of the consumer of the code. As a user, I feel entitled to seeing how things are done behind the curtain, and accessing the source code for the application that I’m using; and the GPL family of licenses ensure exactly this.

    While, at first glimpse, BSD and Apache licenses could seem to offer more “freedom”, since they don’t prevent companies from close sourcing the final product including their modifications, they end up restricting the freedom of the end user (I guess it all depends on whose side you are: the end users or the businesses).

    On the topic of adoption rate (you have a typo and title it “adaption rate”) and the support for other programming languages, HPCC allows developers to embed C++ code in any ECL program, and also to pipe in and out of any ECL program. Not that I advocate for this, but programs written in Java, Python, C++, Lisp, etc. (including binary executables) can be leveraged in an HPCC environment, from within ECL.

    On the Ecosystem and related tools, HPCC offers a complete data-intensive computing paradigm, where many of the third party tools required by Hadoop are already part of the functionality the HPCC platform offers out of the box. In addition to this, standards based interfaces (SOAP, RESTful, JSON, etc.) provide for seamless integration with a myriad of other systems and solutions.

    I would be more than glad to help you run a set of benchmarks across both platforms (for example PigMix), and explore some of the other HPCC features that were not mentioned here such as Machine Learning (I mean, really distributing the ML algorithms and matrix operations across the cluster!), data integration with Pentaho/Kettle/Spoon (our integration with Kettle allows users to define a data workflow in HPCC using Spoon, without writing a single line of ECL), management and operations tools, HDFS integration, etc., if you’d like.

    Thanks,

    Flavio
