Wednesday, March 14, 2012

2-Minute Guide to Integrating Apache Solr with the Apache Mahout Classifier


My previous post was about integrating Mahout clustering with Solr. In this post I will guide you through integrating Mahout's second C, Classification. There are generally two major approaches to classifying or categorizing Solr data: before indexing and after indexing. Each approach has its caveats; this post covers the classification-after-indexing approach.

Must haves for this post:
  • Understanding of and hands-on experience with Mahout classification
  • Basic understanding of Solr and Solr configurations

The procedure is to hook into Solr's update mechanism and invoke the Mahout classifier for every document indexed into Solr. The classifier then identifies the category and populates the corresponding field in the Solr document.

Step #1: Custom Code for Document Categorization

The code is fairly self-explanatory for those familiar with Mahout classification and the Apache Solr update mechanism.


package org.apache.solr.update.processor.ext;
// required imports

public class MahoutDocumentClassifier extends UpdateRequestProcessorFactory {

  SolrParams parameters = null;
  ClassifierContext classifierContext = null;

  public void init(NamedList args) {
    parameters = SolrParams.toSolrParams((NamedList) args);

    // Load the trained Bayes model from the configured path and wrap it in a classifier context
    BayesParameters params = new BayesParameters();
    params.setBasePath(parameters.get("model"));
    InMemoryBayesDatastore datastore = new InMemoryBayesDatastore(params);
    Algorithm algorithm = new BayesAlgorithm();
    classifierContext = new ClassifierContext(algorithm, datastore);
    try {
      classifierContext.initialize();
    } catch (Exception e) {
      throw new RuntimeException("Unable to initialize the Mahout classifier", e);
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new DocumentClassifier(next);
  }

  public class DocumentClassifier extends UpdateRequestProcessor {

    public DocumentClassifier(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      try {
        SolrInputDocument document = cmd.getSolrInputDocument();
        String inputField = parameters.get("inputField");
        String outputField = parameters.get("outputField");
        String defaultCategory = parameters.get("defaultCategory");
        String input = (String) document.getFieldValue(inputField);

        // Tokenize the input field the same way it is analyzed at index time
        List<String> tokenList = new ArrayList<String>(256);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream tokenStream = analyzer.tokenStream(inputField, new StringReader(input));
        while (tokenStream.incrementToken()) {
          tokenList.add(tokenStream.getAttribute(TermAttribute.class).term());
        }
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);

        // Ask Mahout for the best matching category and store it in the output field
        ClassifierResult category = classifierContext.classifyDocument(tokens, defaultCategory);
        if (category != null && category.getLabel().length() > 0) {
          document.addField(outputField, category.getLabel());
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
      super.processAdd(cmd);
    }
  }
}



Step #2: Compile the Above Class into a Jar and Copy It into Solr's lib Directory

Step #3: Configure Solr to Hook Custom Classifier

Add the following snippet to solrconfig.xml; it hooks the classifier above into Solr's update procedure.


<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="org.apache.solr.update.processor.ext.MahoutDocumentClassifier">
    <str name="inputField">text</str>
    <str name="outputField">docCategory</str>
    <str name="defaultCategory">default</str>
    <str name="model">/home/user/classifier-model</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">mahoutclassifier</str>
  </lst>
</requestHandler>


Step #4: Start Solr with java -jar start.jar

Solr loads the Mahout classifier model into memory and starts classifying documents on the fly. For large volumes of data this approach may not perform well, because an update operation in Solr is internally one delete followed by one add. So if you have, say, 300 GB of data, you first put 300 GB into Solr and then, to reclassify it, delete and re-add the same 300 GB.
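
One quick way to sanity-check the classifier is to facet on the output field from a small SolrJ client. This is just a minimal sketch, assuming the default local Solr URL and the docCategory field configured above; adjust both for your setup.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CategoryCheck {
    public static void main(String[] args) throws Exception {
        // Default local Solr URL; change it to match your deployment
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("docCategory");   // the outputField configured in solrconfig.xml
        query.setRows(0);                     // we only care about the facet counts

        QueryResponse response = server.query(query);
        FacetField categories = response.getFacetField("docCategory");
        for (FacetField.Count count : categories.getValues()) {
            System.out.println(count.getName() + " -> " + count.getCount());
        }
    }
}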

That’s it in this post. Thanks :)

Tuesday, March 13, 2012

Cluster Apache Solr data using Apache Mahout

Lately I have been working on integrating Apache Mahout algorithms with Apache Solr, and I have been able to integrate Solr with Mahout's classification and clustering algorithms. I will post a series of blogs on this integration. This post guides you through clustering your Solr data using Mahout's K-Means algorithm.

Minimum Requirement:

  • Basic understanding of Apache Solr and Apache Mahout

  • Understanding of K-Means clustering

  • Up and Running Apache Solr and Apache Mahout on your system

Step 1 – Configure Solr & Index Data:

Before indexing sample data into Solr, make sure the relevant fields are configured in schema.xml:

<field name="field_name" type="text" indexed="true" stored="true" termVector="true" />
  • Add termVector="true" for the fields you want to cluster on

  • Index some sample documents into Solr (a minimal SolrJ sketch follows)
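
Any indexing route works (post.jar, curl, SolrJ). For completeness, here is a minimal SolrJ sketch; the Solr URL and the field names "id" and "text" are assumptions, so use whatever your schema actually defines.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SampleIndexer {
    public static void main(String[] args) throws Exception {
        // Default local Solr URL; change it to match your deployment
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        // "text" stands in for the termVector-enabled field declared in schema.xml
        doc.addField("text", "hadoop mahout kmeans clustering example document");

        server.add(doc);
        server.commit();   // commit so the Lucene index on disk is ready for the next step
    }
}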

Step 2 – Convert Lucene Index to Mahout Vectors


mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2


Step 3 – Run K-Means Clustering

mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Here:
  • k: number of clusters/value of K in K-Means clustering

  • x: maximum iterations

  • o: path to output clusters

  • ow: overwrite output directory

  • dm: classname of Distance Measure

Step 4 – Analyze Cluster Output


mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>

Here:
  • s: Directory containing clusters

  • d: Path of the dictionary from step #2

  • dt: Format of dictionary file

  • n: number of top terms

  • output: Path of generated clusters

That was all for clustering; in my next post I'll show how to run Mahout classification on Apache Solr data. Hope it helps; let me know your feedback.

Thursday, November 17, 2011

HPCC taking on Hadoop: Yet another Hadoop Killer?

After almost 5-6 years of dominance in the big data world, Hadoop has finally come under fire from new architectures. Lately Hadoop is facing competition from a crowd of so-called "Hadoop killers". Yes, I am a Hadoop lover, but there have always been areas where it lags behind: real-time calculations, processing of small data sets, CPU-intensive problems, continuous computation, incremental updates, and so on.

Recently, LexisNexis open sourced its Hadoop killer, claiming to outperform Hadoop in both data volume and response time. It looked like a serious contender, so I decided to take a look and compare it with Hadoop. Before the head-to-head comparison, here is a brief introduction to HPCC for developers too lazy to visit the HPCC site.

HPCC (High Performance Computing Cluster) is a massively parallel processing computing platform that solves big data problems. The platform is now open source!

HPCC Systems takes on Hadoop directly, which is fair as far as functionality is concerned, but architecturally they are not comparable: Hadoop is based on the MapReduce paradigm, whereas HPCC is not MapReduce at all; it is something else. This is what I was able to figure out from their public documents.

Take a look at the architectural diagram of HPCC; I am not going into the details of everything here.




However, there are a few places where HPCC has an edge over Hadoop:

  1. Developer IDE
  2. Roxie component

Places where both are on the same track:

  1. Thor ETL component: compares directly to MapReduce
  2. ECL: a high-level programming language, much like Pig

Now let's talk actual numbers. I performance-tested both frameworks on a few standard use cases: an N-gram finder and the classic word count. Interestingly, the benchmarking numbers were mixed: HPCC performed well on the N-gram finder, whereas Hadoop did better on word count. Here are the exact numbers from our tests:

Round 1: N-Gram Finder



Score: HPCC: 1, Hadoop: 0


Round 2: Word Counter



Score: HPCC: 1, Hadoop: 1

With such mixed test results we cannot conclude anything, so let's compare them on some critical and crucial non-functional parameters.


Round 3: Licensing

The biggest issue HPCC is going to face is its licensing model. The open source world has never embraced the AGPL very well. If HPCC wants to compete and grow the way Hadoop has over the last four or five years, it must rethink its licensing model.

Hadoop, on the contrary, with its Apache 2.0 license, has drawn contributions from many enterprises as well as individuals, helping the project grow bigger and better.

Score: HPCC: 1, Hadoop: 2


Round 4: Adoption Rate

This is not as big a factor as the others, but given the last point, HPCC is bound to face adoption issues among developers and companies of all shapes and sizes.

For developers, the hurdle is adopting the ECL programming language, and developers never feel comfortable dealing with a new language. Hadoop's development environment, on the other hand, is highly flexible: MapReduce jobs can be written in Java, C++, Python, and so on; for those from a scripting background Pig is the easiest way to go; and thanks to Hive, DBAs can use its SQL-ish syntax.

Score: HPCC: 1, Hadoop: 3


Round 5: Ecosystem and related tools

Even if we consider HPCC more efficient than Hadoop, I feel the HPCC folks are a bit late to the open source world. Hadoop has put on a lot of weight: in the last five years it has travelled a long way from a MapReduce parallel distributed programming framework to the de facto standard for big data processing and data warehousing. HPCC, I am afraid, is still more or less just a big data framework.

So HPCC is not competing with just another framework, but with the ecosystem that has grown up around Hadoop in recent years. It is actually competing with Hadoop and its army, and that is where the real strength lies.

Take a look at how Hadoop has evolved in recent years for a clearer picture.



Score: HPCC: 1, Hadoop: 4

Final Score: Clear Hadoop Dominance

Let's be practical: Hadoop is not the most efficient big data architecture. In fact, it is said to be anywhere from 10x to 10000x too slow for what is required. There are tools and technologies that perform far better than Hadoop in their own domains: for database joins any SQL engine is better, for OLTP environments there is VoltDB, for real-time analytics Cloudscale performs much better, for supercomputing/modeling/simulation MPI and BSP are the clear winners, Pregel is there for graph computing, Dremel for interactive big data analytics, and Caffeine or Percolator are in a different league for incremental computation over big data.

Hadoop's strength lies in its huge ecosystem, and that is its USP. It offers Hive for SQL-ish requirements, HBase as a key-value store, Mahout for machine learning, Oozie for workflow management, Flume for log collection, Blur for search, Sqoop for interaction with RDBMSs, and so on, plus a vast developer network. So when we look at the total picture of building a data warehouse or BI tool, Hadoop remains the obvious choice for now.

Tuesday, April 19, 2011

Install Yahoo Oozie 3.0.0 on Apache Hadoop 0.20.2

Oozie is a workflow service that manages data processing jobs for Hadoop and related ecosystem projects such as Pig, Hive, and Sqoop.

Oozie workflows are actions arranged in a control-dependency DAG (Directed Acyclic Graph). An Oozie workflow may contain the following types of action nodes: map-reduce, map-reduce streaming, map-reduce pipes, pig, file-system, sub-workflow, etc.

Typical Oozie Design:



This post is about installing Yahoo Oozie 3.0.0 on Apache Hadoop 0.20.2. A little configuration trickery is required to get Oozie working with stock Apache Hadoop.
Cloudera and Yahoo maintain their own versions of Oozie, so it is only natural that those work with their respective Hadoop distributions without any pain.

Before diving into the Oozie installation flow, here is the software you will need.

Warm up:
  • Apache Hadoop 0.20.2 up and running
  • Download and untar the Oozie 3.0.0 distro
  • Download ExtJS-2.2 library for enabling Oozie web console
  • Create ‘oozie’ named user and group on your system


Workflow for setting up Oozie:

Go through the diagram below, execute the commands, put the snippets into the corresponding XML files, and you are done.




Is everything Ok?


That's it; the installation is over. To check that everything is configured well,
open your browser and hit http://localhost:11000 (by default the Tomcat bundled with Oozie 3.0.0 listens on port 11000).

Run the examples bundled with Oozie and keep your fingers crossed for a long job id in response.

shell>hadoop fs -put examples examples
shell>oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
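
If you would rather submit the same example from Java than from the oozie CLI, Oozie ships with a client API. The sketch below is illustrative only: the NameNode/JobTracker addresses and the HDFS application path are placeholders for your cluster, and the property names mirror the bundled job.properties.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitMapReduceExample {
    public static void main(String[] args) throws Exception {
        // Points at the embedded Tomcat that Oozie starts on port 11000
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        Properties conf = client.createConfiguration();
        // Workflow application uploaded to HDFS in the previous step;
        // the addresses and path below are placeholders for your cluster.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/oozie/examples/apps/map-reduce");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:9001");
        conf.setProperty("queueName", "default");
        conf.setProperty("examplesRoot", "examples");

        String jobId = client.run(conf);   // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);

        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Current status: " + job.getStatus());
    }
}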

That's it guys.

Sunday, April 17, 2011

My Love Hate Relationship with Erlang

Yeah, it's true: lately I have been in a relationship with Erlang!! But unfortunately it turned out to be just an infatuation. Erlang is like most attractive girls - promising from afar but a real turn-off once you interact with them. No, in fact a lady, not even a girl; it's been around for two decades :-).

I have been interacting with Erlang for a year or so. No doubt it's a powerful language, best suited for mission-critical systems: effortlessly distributed in nature, unmatched in scalability, with inherent support for fault tolerance and its real hotness - hot code swapping. Erlang is a special-purpose language.

With the rise of multi-core hardware over the last few years, it has been getting a lot of hype for the wrong reasons. If you are among those drawn to Erlang by the overrated hype that it's the next big thing and will replace languages like Java or Ruby in the near future, I beg to differ. It has its own target audience, but it is NOT a general-purpose programming language in any true sense.


In this post I will talk about the "hate" side of my love-hate relationship with Erlang. There are a hell of a lot of reasons to dislike it, but here are a few:

Ugly Syntax - No dressing sense

The biggest turn-off for a developer interested in Erlang is its very, very, very ugly syntax. Most developers today are familiar with Java/Ruby/C-like syntax, whereas Erlang's syntax is inspired by Prolog. It takes ages to get comfortable with, and even after a year or so I still struggle with it.

Bunch of confusing terminators

To end an expression you have as many as four terminators (that I have explored so far):
  1. Comma (,)
  2. Period (.)
  3. Semicolon (;)
  4. No character at all
It looks funny at first, gets a bit confusing, and starts to irritate you over time.

Let me show you with an example:

is_valid_user(true) ->
    take_action_1(),
    take_action_2();

is_valid_user(false) ->
    take_action_3().


And if I want to reorder the clauses, I can't just cut and paste code from one place to another:

is_valid_user(false) ->
    take_action_3();

is_valid_user(true) ->
    take_action_1(),
    take_action_2().


Or when swapping the calling order of two actions - here again you have to take care of the positional terminators:

is_valid_user(true) ->
    take_action_2(),
    take_action_1();

is_valid_user(false) ->
    take_action_3().


As you go further with the language, if and case expressions are even harder to get right, syntax and terminators included. I can't imagine how a developer can refactor code without a bunch of stupid syntax errors. In short, you either have to be mental to get it right in one go, or Erlang's syntax will make you mental with its series of errors.

Erlang syntax is a huge productivity killer!!

Big Bulky Strings

Erlang is a BIG NO-NO as a language choice if your application or business logic involves a fair amount of string handling. It has the most inefficient string implementation I have ever seen.
It's hard to digest, but in Erlang each character in a string eats 8 BYTES of memory (on a 32-bit VM). Yeah, 8 BYTES per character!! That's because strings are internally implemented as linked lists of characters: 4 bytes for each character and another 4 bytes for the pointer to the next element in the list.

To add insult to injury, Erlang is functional, which means all variables are immutable. So if you want to modify a string, you have to build a new string out of the modified elements.
More insult: imagine message-passing that string to, say, 10 processes, and remember that message passing in Erlang copies the data.

Final thought: don't even think about Erlang for performance-oriented text-processing applications.

Terribly Slow I/O

Yes, it's a known fact: Erlang's io:get_line is TERRIBLY slow! The problem is that it reads the file content character by character. It's not an inherent problem with the language; it's more an implementation shortcoming of the virtual machine.
Still, an inefficient line-reading mechanism is no excuse for any programming language - it's a big shame. Languages like C, Java, Ruby, and others handle line-based I/O quite efficiently.

Lastly, just combine the last two points and you can imagine the sluggish performance of any application that does text-based processing of log files.

Closing this post here; I will cover the remaining reasons why I hate Erlang in an upcoming post. Remember, I love this language as much as I hate it, so I will also cover its most incredible features in my next post.

Thanks.

Friday, April 1, 2011

Benchmarking Hive Hbase performance

Lately we have been working with a good number of Hadoop ecosystem products: Hadoop, HBase, Hive, and Oozie. Each is a master of its own domain; their ways of working and their target audiences are completely different. Still, sometimes you find yourself confused about which one best suits your problem statement.

In this post in particular I will talk only about HBase, Hive, and the combination of Hive over HBase. Our problem statement deals with a fairly large amount of server-generated log data: hundreds of GiBs per day, with one aggregated log file every 5 minutes. We need analytics reports both in near real time on small data sets and as scheduled jobs over large data sets. So we performance-benchmarked a few combinations of HBase and Hive.

Hbase:

  • Near real time analytics
  • Fast Incremental load
  • Custom map-reduce

Hive:

  • Bulk processing/ real time(if possible)
  • SQL like interface
  • Built in optimized map-reduce
  • Partitioning of large data

Hive Hbase Integration:

  • Best of both worlds
  • low-latency incremental data refresh to Hive
  • SQL query capabilities to Hbase

For simplicity, we kept the schema as simple as possible.

Hbase Schema:

Hive Schema:

Expected Output:

To achieve this we coded custom map-reduce classes for the pure HBase scenario and used "select count(*), pagehit from weblogs" for the Hive and Hive-over-HBase approaches; a sketch of such a map-reduce job follows below.
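
For reference, the custom map-reduce for the pure-HBase run is essentially a table scan plus a per-page count. Below is a minimal sketch of such a job, not our exact benchmark code; the table name, column family, and qualifier ("weblogs", "log", "pagehit") are illustrative placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitCounter {

    // Scans the weblogs table and emits <pagehit, 1> for every row.
    public static class HitMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private static final byte[] FAMILY = Bytes.toBytes("log");        // placeholder family
        private static final byte[] QUALIFIER = Bytes.toBytes("pagehit"); // placeholder qualifier

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] hit = value.getValue(FAMILY, QUALIFIER);
            if (hit != null) {
                context.write(new Text(Bytes.toString(hit)), ONE);
            }
        }
    }

    // Sums the 1s per page, i.e. a count grouped by pagehit.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase pagehit count");
        job.setJarByClass(PageHitCounter.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in bigger batches during the scan
        scan.setCacheBlocks(false);  // a full scan should not churn the block cache

        TableMapReduceUtil.initTableMapperJob("weblogs", scan, HitMapper.class,
                Text.class, IntWritable.class, job);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}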

The table below shows the performance benchmarking results on a 2-machine Hadoop cluster:

  • Quite clearly, pure Hive performed exceedingly well compared to the other two.
  • Custom map-reduce on HBase and Hive over HBase performed almost the same.
  • Hive over HBase is said to be about 5 times slower than plain Hive, but we found it far slower than a factor of 5.
  • As the number of column families increases (which it will in real-world use cases), the performance of Hive over HBase will degrade further.
  • Hive over HBase is currently experimental, not production ready. Hopefully Facebook will fix this in an upcoming release of Hive.

Next I am going to tune Hive with data compression and other tweaks. In my next post I will share the performance results.

That’s it for now. Thanks.

Wednesday, January 5, 2011

Baby Steps into Datanucleus HBase JPA

I have been playing around with DataNucleus JPA on HBase for the last few days. DataNucleus also supports JDO, but I chose JPA because Sun has standardized on it. There are a couple of good tutorials already available; my personal favorite is Matzew's. Apart from a few jar dependency issues, it is a quite straightforward Maven script that runs a Jetty server and a web app.

But if you are crazy about IDEs like Eclipse, it's a bit tricky to get things in place and working. So here is a tutorial that guides you through running the same application in Eclipse with Tomcat.

Prerequisites:

  • Hadoop and HBase up and running.
  • Basic knowledge of HBase concepts (Schema, Columns, Column Families etc).
  • Datanucleus Eclipse plug-in installed on your Eclipse.
  • High level understanding of Datanucleus and JPA.

For Whom:
If you are eyeing any one of the following issues:

  • Integration of HBase using JPA in your application.
  • Port Matzew’s Maven script based example in Eclipse Tomcat environment.
  • You are suffering from a "javax.persistence.* class not found" exception.
  • The "No persistence providers available for storename" exception keeps frustrating you in spite of the fixes suggested on forums.
  • You are confused about where to put persistence.xml in your web application.
  • You need some way to control column family names in your HBase schema.
  • @Column(name="familyname:columnname") is having no effect on your HBase column families.
  • You can't find the DataNucleus menu item after successfully installing the DataNucleus Eclipse plug-in.


Let’s Start:

Here I am using the same code Matzew committed on GitHub, and I will show you the steps to create a simple web application.

Step 1: Create a simple Servlet.
I have created a simple servlet that persists a Contact entity into the database.

import java.io.IOException;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import net.wessendorf.addressbook.Contact;
import net.wessendorf.addressbook.dao.HBaseJPAImpl;

public class Index extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {

        // "hbase-addressbook" must match the persistence-unit name in persistence.xml
        // (for a real app, create the factory once, e.g. in init(), not per request)
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("hbase-addressbook");
        EntityManager em = emf.createEntityManager();

        try {
            Contact contact = new Contact();
            contact.setId("id");
            contact.setFirstname("name");
            contact.setSecondname("second name");

            // DAO from Matzew's example; save() persists the entity
            HBaseJPAImpl hbase = new HBaseJPAImpl(em);
            hbase.save(contact);

            response.getWriter().println("Contact saved");
        } finally {
            em.close();
            emf.close();
        }
    }
}

Step 2: Web application Structure
Your Eclipse dynamic web project should look like this.

Make sure to add all the required jars (pretty obvious) and remember a few critical points:

  • Make sure you have "persistence-api-XXX.jar" in WEB-INF\lib (because Tomcat doesn't ship with its own copy of the persistence API).
  • Add META-INF\persistence.xml under the source (src) folder.
  • Place orm.xml alongside the entity classes.


Step 3: Important Configuration XMLs

The persistence.xml file

When DataNucleus starts persisting entities to the database, it needs to know how to connect to that database, where the database is, and which classes it is responsible for managing. All of that information goes into the persistence.xml file.

<persistence xmlns="http://java.sun.com/xml/ns/persistence"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd" version="1.0">

  <persistence-unit name="hbase-addressbook" transaction-type="RESOURCE_LOCAL">
    <provider>org.datanucleus.jpa.PersistenceProviderImpl</provider>
    <class>net.wessendorf.addressbook.Contact</class>
    <mapping-file>net/wessendorf/addressbook/orm.xml</mapping-file>
    <properties>
      <property name="datanucleus.ConnectionURL" value="hbase"/>
      <property name="datanucleus.ConnectionUserName" value=""/>
      <property name="datanucleus.ConnectionPassword" value=""/>
      <property name="datanucleus.autoCreateSchema" value="true"/>
      <property name="datanucleus.validateTables" value="false"/>
      <property name="datanucleus.Optimistic" value="false"/>
      <property name="datanucleus.validateConstraints" value="false"/>
    </properties>
  </persistence-unit>
</persistence>

Fine touches with the orm.xml file

If you need fine control over the database schema, such as column family names and constraints, orm.xml is the one for you.

The idea of this XML is to map your entity class to the corresponding table in the database. You can also map member variables of the entity class to fields/columns of the table and declare constraints on them.


<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings xmlns="http://java.sun.com/xml/ns/persistence/orm" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/persistence/orm http://java.sun.com/xml/ns/persistence/orm_1_0.xsd" version="1.0">
  <entity class="fully qualified entity class name here" name="Login">
    <table name="Login" />
    <attributes>
      <id name="userId">
        <column name="Login_data:userId" />
      </id>
      <basic name="pwd">
        <column name="Login_data:pwd" />
      </basic>
    </attributes>
  </entity>….
</entity-mappings>

The structure of this XML is quite self-explanatory; still, a few points to remember:
  • The <id>…</id> tag declares and maps the row key of the HBase table.
  • <basic>…</basic> maps a field to a column of the table.
  • Most important: note that the column name is given in "family-name:column-name" format. If you don't specify the column name in this format, DataNucleus will take the class name as the column family name of the HBase table (an annotation-based sketch of the same mapping follows).
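
For comparison, the annotation-based form of the same mapping would look roughly like the hypothetical entity below (it is not part of Matzew's code). If, as listed under "For Whom", the @Column family prefix is not taking effect for you, the orm.xml mapping above is the reliable route.

import javax.persistence.Basic;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Hypothetical entity mirroring the orm.xml above: the part before the colon
// is read as the HBase column family, the part after it as the qualifier.
@Entity
@Table(name = "Login")
public class Login {

    @Id
    @Column(name = "Login_data:userId")   // row key of the Login table
    private String userId;

    @Basic
    @Column(name = "Login_data:pwd")      // column family "Login_data", qualifier "pwd"
    private String pwd;

    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }

    public String getPwd() { return pwd; }
    public void setPwd(String pwd) { this.pwd = pwd; }
}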


Step 4: Enhance your Entity classes

Make sure to enhance your entity classes with the DataNucleus enhancer. It can be found in the right-click DataNucleus menu. Please note that the DataNucleus menu is visible only in the Java perspective; if you are in the J2EE perspective, make sure to switch.


Step 5: Deploy and Run
Go ahead: deploy your first JPA app on the server and run it.

If everything goes well, you will have a Contact table created in HBase with one row in it.

I have uploaded the zip of this Eclipse Project here.
That’s It. Enjoy!!

Additional Resources: