Experience of a Lazy Coder: April 2011

Tuesday, April 19, 2011

Install Yahoo Oozie 3.0.0 on Apache Hadoop 0.20.2

Oozie is a workflow-based service for managing data processing jobs on Hadoop and on other projects in the Hadoop ecosystem, such as Pig, Hive, and Sqoop.

Oozie workflows are actions arranged in a control-dependency DAG (Directed Acyclic Graph). An Oozie workflow may contain the following types of action nodes: map-reduce, map-reduce streaming, map-reduce pipes, pig, file-system, sub-workflows, etc.
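To make the DAG idea concrete, here is a minimal Python sketch of how a control-dependency graph fixes the order in which actions may run. It is only an illustration: real Oozie workflows are defined in XML, and the action names and the `execution_order` helper below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import-logs": set(),
    "clean-logs": {"import-logs"},
    "pig-aggregate": {"clean-logs"},
    "hive-report": {"clean-logs"},
    "publish": {"pig-aggregate", "hive-report"},
}

def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

Because the graph is acyclic, every action appears after all of the actions it depends on; the two middle actions have no dependency on each other and may come out in either order.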

Typical Oozie Design:

This post is about installing Yahoo Oozie 3.0.0 on Apache Hadoop 0.20.2. A little configuration trick is required to get Oozie working with Apache Hadoop.

Cloudera and Yahoo each maintain their own version of Oozie, so it is only natural that those versions work with their respective Hadoop distributions without any pain.

Before diving into the Oozie installation flow, here are a few pieces of software that are required.

Warm up:

Apache Hadoop 0.20.2 up and running
Download and untar the Oozie 3.0.0 distro
Download the ExtJS-2.2 library to enable the Oozie web console
Create a user and a group named ‘oozie’ on your system

Workflow for setting up Oozie:

Go through this diagram, execute the commands, put the snippets in the corresponding XML files, and you are done.

Is everything Ok?

That’s it, the installation is over. To check whether everything is configured well,

open your browser and hit http://localhost:11000 (by default, the Tomcat bundled with Oozie 3.0.0 listens on port 11000).

Run the examples bundled with Oozie and keep your fingers crossed to get a long job id in response.

shell>hadoop fs -put examples examples
shell>oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run

That's it guys.

Sunday, April 17, 2011

My Love Hate Relationship with Erlang

Yeah, it’s true, lately I had a relationship with Erlang!! But unfortunately it turned out to be just an infatuation. Erlang is like most attractive girls - looks promising from afar but is a real turn-off once you interact with it. No, in fact a lady, not even a girl; it’s been around for two decades :-).

I have been working with Erlang for a year or so. No doubt it’s a powerful language, best suited for mission-critical systems: effortlessly distributed in nature, with scalability that is hard to match, inherent support for fault tolerance, and its real hot feature - hot code swapping. Erlang is a special-purpose language.

With the rise of multi-core culture over the last few years, it is getting lots of hype for the wrong reasons. If you are among those attracted to Erlang by the overrated hype that it’s the next big thing and will replace languages like Java or Ruby in the near future, I beg to differ. It has its own target audience, but it is NOT a general-purpose programming language in the true sense.

In this post I will talk about the “hate” side of my love-hate relationship with Erlang. There are a hell of a lot of reasons to dislike it, but here are a few of them:

Ugly Syntax - No dressing sense

The biggest turn-off for a developer interested in Erlang is its very, very, very ugly syntax. Today most developers are familiar with Java/Ruby/C-like syntax, whereas Erlang’s syntax is inspired by Prolog. It takes ages to get comfortable with, and even after a year or so I still struggle with it.

Bunch of confusing terminators

To end an expression you have as many as four terminators (of those I have explored so far):

Comma(,)
Period(.)
Semicolon(;)
No character()

It looks funny to start with, then a bit confusing, and over time it irritates you.

I will show you with this example:

is_valid_user(true) ->
    take_action_1(),
    take_action_2();

is_valid_user(false) ->
    take_action_3().

And if you want to reorder the clauses, you can’t just cut and paste code from here to there; the terminators have to change too:

is_valid_user(false) ->
    take_action_3();

is_valid_user(true) ->
    take_action_1(),
    take_action_2().

The same goes for swapping the calling order of two actions; here again you have to take care of the positional terminators:

is_valid_user(true) ->
    take_action_2(),
    take_action_1();

is_valid_user(false) ->
    take_action_3().

As you go further with the language, if and case expressions etc. make it even more difficult to get the syntax and terminators right. I can’t imagine how a developer can refactor code without hitting a bunch of stupid syntax errors. In short, you either have to be mental to get it right in one go, or Erlang syntax will make you mental with a series of errors.
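The positional rule behind those examples can be spelled out in a few lines of Python. This is just a sketch to show that the terminators act as separators between items rather than as part of each line; `render_clauses` is a hypothetical helper, not something from the Erlang toolchain.

```python
def render_clauses(clauses):
    """Render Erlang function clauses given as (head, [expressions]) pairs."""
    rendered = []
    for head, body in clauses:
        # Expressions within one clause are separated by commas.
        rendered.append(head + " ->\n    " + ",\n    ".join(body))
    # Clauses are separated by semicolons; a period ends the whole function.
    return ";\n\n".join(rendered) + "."

valid = ("is_valid_user(true)", ["take_action_1()", "take_action_2()"])
invalid = ("is_valid_user(false)", ["take_action_3()"])

print(render_clauses([valid, invalid]))
print()
print(render_clauses([invalid, valid]))
```

Reordering the list of clauses changes which line ends with a semicolon and which with a period, which is exactly why plain cut and paste breaks.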

Erlang syntax is a huge productivity killer!!

Big Bulky Strings

Erlang is a BIG NO-NO as a language choice if your application or business logic involves a fair bit of string handling. It is the most inefficient string implementation I have ever seen.
It’s hard to digest, but in Erlang each character in a string eats 8 BYTES of memory (on a 32-bit VM). Yeah, 8 BYTES/character!! That is because strings are internally implemented as linked lists of characters: 4 bytes hold each character and another 4 bytes hold the pointer to the next character in the string.

Let’s add insult to injury: Erlang is functional in nature, which means all variables are immutable. So if you want to modify a string, you have to create a new string out of the modified elements.
More insult: imagine message passing this string to, say, 10 processes, and remember that copying occurs during message passing in Erlang.
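To put rough numbers on this, here is a small back-of-the-envelope calculation in Python. The 8 bytes per character is the figure quoted above; the 1 byte per character for a packed buffer and the 1,000-character log line are assumptions chosen purely for illustration.

```python
BYTES_PER_CHAR_LIST = 8    # figure quoted in the post: 4-byte character + 4-byte pointer
BYTES_PER_CHAR_PACKED = 1  # assumption: one byte per character in a packed ASCII buffer

def list_string_bytes(n_chars, copies=1):
    """Memory for a list-of-characters string, multiplied by the number of copies."""
    return n_chars * BYTES_PER_CHAR_LIST * copies

def packed_string_bytes(n_chars, copies=1):
    """Memory for the same text stored as a packed byte buffer."""
    return n_chars * BYTES_PER_CHAR_PACKED * copies

line_length = 1000  # hypothetical 1,000-character log line

print(list_string_bytes(line_length))             # 8000
print(list_string_bytes(line_length, copies=10))  # 80000
print(packed_string_bytes(line_length))           # 1000
```

So one such line costs about 8 KB as a list, and sending it to 10 processes creates roughly 80 KB of copies, against 1 KB for the packed text.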

Final thought: don’t even think of Erlang for performance-oriented text processing applications.

Terribly Slow I/O

Yes, it’s a known fact: Erlang’s io:get_line is TERRIBLY slow! The problem is that it reads the file content character by character. Actually, it’s not an inherent problem with the language but rather an implementation shortcoming of the virtual machine.
However, an inefficient line-reading mechanism is inexcusable for any programming language; it’s a big shame! Languages like C, Java, Ruby and others deal with line-based I/O quite efficiently.
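The difference between the two reading styles can be sketched in Python by counting how many read requests each one makes for the same input. This is not how the Erlang VM is implemented; it only illustrates the general pattern described above, and the sample log lines are made up.

```python
import io

def lines_char_by_char(stream):
    """Build lines by requesting one character per read call."""
    lines, current, reads = [], [], 0
    while True:
        ch = stream.read(1)
        reads += 1
        if ch == "":          # empty string signals end of input
            break
        current.append(ch)
        if ch == "\n":
            lines.append("".join(current))
            current = []
    if current:               # last line without a trailing newline
        lines.append("".join(current))
    return lines, reads

def lines_whole(stream):
    """Ask the stream for a whole line per call."""
    lines, reads = [], 0
    while True:
        line = stream.readline()
        reads += 1
        if line == "":
            break
        lines.append(line)
    return lines, reads

sample = "GET /index.html 200\nGET /about.html 200\nGET /missing 404\n"
slow_lines, slow_reads = lines_char_by_char(io.StringIO(sample))
fast_lines, fast_reads = lines_whole(io.StringIO(sample))
print(slow_reads, fast_reads)
```

Both functions return the same lines, but the first makes one request per character plus one to detect the end of input, while the second makes one request per line plus one.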

Lastly, just club the last two points together and you can imagine the sluggish performance of an application that involves text-based processing of log files.

I am closing my post here; I will cover the remaining reasons why I hate Erlang in an upcoming post. Remember, I love this language as much as I hate it, so I will also cover its most incredible features in my next post.

Thanks.

Friday, April 1, 2011

Benchmarking Hive HBase performance

Lately, we have been working a great deal with Hadoop ecosystem products like Hadoop, HBase, Hive, and Oozie. Each is a master in its own way; their ways of working and their target audiences are completely different. Still, you sometimes find yourself confused about which one best suits your problem statement.

In this post in particular I will talk only about HBase, Hive, and the combo of Hive over HBase. Our problem statement deals with a fairly large amount of server-generated log data: hundreds of GiBs/day, with one aggregated log file every 5 minutes. We need analytics reports both in near real time on small data sets and on a schedule for large data sets. We benchmarked the performance of a few combos of HBase and Hive.
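To get a feel for that load, here is a quick calculation in Python. The 5-minute interval comes from the problem statement; the daily volume passed in the example is an assumed figure, since the post only says "hundreds of GiBs/day".

```python
MINUTES_PER_DAY = 24 * 60

def files_per_day(interval_minutes=5):
    """Number of aggregated log files produced per day."""
    return MINUTES_PER_DAY // interval_minutes

def avg_file_size_mib(daily_gib, interval_minutes=5):
    """Average size of one aggregated file in MiB for a given daily volume in GiB."""
    return daily_gib * 1024 / files_per_day(interval_minutes)

print(files_per_day())                    # 288
print(round(avg_file_size_mib(200), 1))   # assumed 200 GiB/day
```

One file every 5 minutes means 288 files a day, so with an assumed 200 GiB/day each file averages roughly 700 MiB.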

HBase:

Near real-time analytics
Fast incremental load
Custom map-reduce

Hive:

Bulk processing / real time (if possible)
SQL-like interface
Built-in optimized map-reduce
Partitioning of large data

Hive HBase Integration:

Best of both worlds
Low-latency incremental data refresh to Hive
SQL query capabilities for HBase

For simplicity, we kept the schema as simple as possible.

HBase Schema:

Hive Schema:

Expected Output:

To achieve this, we coded custom map-reduce classes for the pure HBase scenario and used “select count(*), pagehit from weblogs” for the Hive and HBase-Hive approaches.
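For readers who want to see what that aggregation computes, here is a minimal Python sketch. It assumes the query is meant to count rows per distinct pagehit value (that is, grouped by pagehit); the sample rows and the `count_page_hits` helper are hypothetical and are not the benchmark code.

```python
from collections import Counter

def count_page_hits(rows):
    """Count rows per distinct pagehit value, as a grouped count(*) would."""
    return Counter(row["pagehit"] for row in rows)

# Hypothetical rows standing in for the weblogs table.
weblogs = [
    {"pagehit": "/home"},
    {"pagehit": "/cart"},
    {"pagehit": "/home"},
]

hits = count_page_hits(weblogs)
for page, count in sorted(hits.items()):
    print(count, page)
```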

The table below shows the performance benchmarking results on a 2-machine Hadoop cluster:

It is quite clear that pure Hive performed exceedingly well compared to the other two.

Custom map-reduce on HBase and Hive over HBase performed almost the same.

Hive over HBase is said to be 5 times slower than Hive, but we found it to be far slower than a factor of 5.

With an increase in the number of column families (which will happen in real-world use cases), the performance of Hive over HBase would degrade further.

Hive over HBase is currently in an experimental phase and not production ready. I hope Facebook will fix this in an upcoming release of Hive.

Next I am going to tune Hive with data compression and other stuff. In my next post I will share the performance results.

That’s it for now. Thanks.