Friday, April 1, 2011

Benchmarking Hive Hbase performance

Lately, we have been working with a number of Hadoop ecosystem products such as Hadoop, Hbase, Hive, and Oozie. Each is a master in its own way; how they work and the audiences they target are completely different. Still, you sometimes find yourself unsure which one best suits your problem statement.

In this post I will talk only about Hbase, Hive, and the combination of Hive over Hbase. Our problem statement deals with a fairly large amount of server-generated log data: hundreds of GiBs per day, with one aggregated log file every 5 minutes. We need analytics reports both in near real time on small data sets and on a schedule for large data sets. We benchmarked the performance of a few combinations of Hbase and Hive.

Hbase:

  • Near real time analytics
  • Fast Incremental load
  • Custom map-reduce

Hive:

  • Bulk processing / real time (if possible)
  • SQL like interface
  • Built in optimized map-reduce
  • Partitioning of large data

Hive Hbase Integration:

  • Best of both worlds
  • Low-latency incremental data refresh to Hive
  • SQL query capabilities to Hbase

We kept the schema as simple as possible.

Hbase Schema:

Hive Schema:

Expected Output:

To achieve this, we coded custom map-reduce classes for the pure Hbase scenario and used “select pagehit, count(*) from weblogs group by pagehit” for the Hive and Hive-over-Hbase approaches.
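
To make the job concrete, here is a minimal in-memory Python sketch of the map, shuffle, and reduce steps that such a page-hit count goes through. It is not our actual Hbase map-reduce code and does not use Hadoop at all. The sample rows, the record layout, and every function name are made up for illustration, and it assumes the goal is a count of rows per pagehit value.

```python
from collections import defaultdict

# Hypothetical log rows: (row key, page that was hit). The real schema is
# not reproduced here, so this layout is only an assumption.
LOG_ROWS = [
    ("row1", "/home"),
    ("row2", "/about"),
    ("row3", "/home"),
    ("row4", "/home"),
]


def map_phase(rows):
    """Emit (pagehit, 1) for every log row, as a mapper would."""
    for _row_key, pagehit in rows:
        yield pagehit, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the ones for each page to get its hit count."""
    return {page: sum(ones) for page, ones in grouped.items()}


def count_page_hits(rows):
    """Run the three steps in order and return {pagehit: count}."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_page_hits(LOG_ROWS))
```

In a real job the map and reduce steps run in parallel across the cluster and the shuffle moves data between machines, which is where most of the cost measured in a benchmark like this comes from.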

The table below shows the performance benchmarking results on a 2-machine Hadoop cluster:

  • It is quite clear that pure Hive performed far better than the other two.
  • Custom map-reduce on Hbase and Hive over Hbase performed almost the same.
  • Hive over Hbase is said to be about 5 times slower than plain Hive, but we found it to be slower by far more than a factor of 5.
  • As the number of column families increases (which it will in real-world use cases), the performance of Hive over Hbase would degrade further.
  • Hive over Hbase is currently in an experimental phase and is not production ready. Hopefully Facebook will fix this in an upcoming release of Hive.

Next, I am going to tune Hive with data compression and other settings. I will share those performance results in my next post.

That’s it for now. Thanks.
