In this post I will talk only about HBase, Hive, and the combination of Hive over HBase. Our problem involves a fairly large amount of server-generated log data: hundreds of GiB per day, with one aggregated log file every 5 minutes. We need analytics reports both in near real time on small data sets and on a schedule for large data sets. We benchmarked the performance of a few combinations of HBase and Hive.
HBase:
- Near real time analytics
- Fast Incremental load
- Custom map-reduce
Hive:
- Bulk processing / real time (if possible)
- SQL-like interface
- Built-in optimized map-reduce
- Partitioning of large data
Hive HBase Integration:
- Best of both worlds
- Low-latency incremental data refresh to Hive
- SQL query capabilities on HBase
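To make the "partitioning of large data" point a bit more concrete: Hive stores each partition of a table in its own `key=value` directory, so a log file that arrives every 5 minutes can be routed to a date/hour partition based on its timestamp. Below is a minimal Python sketch of that routing. The partition keys `dt` and `hr`, the warehouse path, and the helper names are assumptions for illustration only, not the layout we actually used.

```python
from datetime import datetime

def five_minute_slot(ts):
    """Round a timestamp down to the start of its 5-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def partition_path(table_root, ts):
    """Build a Hive-style partition directory (key=value segments).

    The keys 'dt' and 'hr' are hypothetical partition columns.
    """
    return f"{table_root}/dt={ts:%Y-%m-%d}/hr={ts:%H}"

# Hypothetical arrival time of one aggregated log file.
arrival = datetime(2011, 6, 1, 13, 47, 12)
slot = five_minute_slot(arrival)
print(slot.isoformat(), "->", partition_path("/warehouse/weblogs", slot))
```

With one file every 5 minutes, that is 24 * 60 / 5 = 288 files per day, or 12 files per hourly partition, so a query restricted to a single hour only needs to read a small slice of the day's data.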
For simplicity we kept schema as simple as possible.
HBase Schema:
Hive Schema:
Expected Output:
To achieve this, we coded custom map-reduce classes for the pure HBase scenario and used “select count(*), pagehit from weblogs group by pagehit” for the Hive and Hive-over-HBase approaches.
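For readers who have not written a map-reduce job before, the per-page count that both approaches compute boils down to a map step that emits a (pagehit, 1) pair per log row and a reduce step that sums those pairs per key. Here is a rough sketch of that logic in plain Python. It is not our actual job and does not use the Hadoop or HBase APIs; the row layout and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_row(row):
    """Map step: emit one (pagehit, 1) pair for a single log row."""
    yield row["pagehit"], 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each pagehit key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def count_page_hits(rows):
    """Run the map step over every row, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_row(r) for r in rows))

# Hypothetical sample rows, not taken from the benchmark data.
sample_rows = [
    {"pagehit": "/home"},
    {"pagehit": "/login"},
    {"pagehit": "/home"},
]
print(count_page_hits(sample_rows))
```

On a real cluster the map calls run in parallel across the input splits and the framework groups the pairs by key before the reduce step, but the result is the same kind of page-to-count table.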
The table below shows the performance benchmarking results on a 2-machine Hadoop cluster:
- It is quite clear that pure Hive performed far better than the other two.
- Custom map-reduce on HBase and Hive over HBase performed almost the same.
- Hive over HBase is reportedly about 5 times slower than plain Hive, but in our tests it was far slower than a factor of 5.
- With an increase in the number of column families (which will happen in real-world use cases), the performance of Hive over HBase would degrade further.
- Hive over HBase is currently in an experimental phase and not production ready. Hope Facebook will fix this in an upcoming release of Hive.
Next I am going to tune Hive with data compression and other settings. In my next post I will share the performance results.
That’s it for now. Thanks.
hi there,
any further results about Hive over HBase?