Wednesday, March 14, 2012

2-Minute Guide: Integrating Apache Solr with the Apache Mahout Classifier


My previous post was about integrating Mahout clustering with Solr. In this post I will show how to integrate Mahout’s second C, Classification. There are two broad approaches to classifying or categorizing Solr data – before indexing and after indexing. Each approach has its caveats; this post covers the classification-after-indexing approach.

Must-haves for this post:
  • Understanding of, and hands-on experience with, Mahout classification
  • Basic understanding of Solr and Solr configuration

The procedure is to hook into Solr’s update mechanism and invoke the Mahout classifier for every document indexed into Solr. The classifier identifies the category and writes it into the corresponding field of the Solr document.

Step #1: Custom Code for Document Categorization

The code is fairly self-explanatory for anyone familiar with Mahout classification and Solr’s update request processor mechanism.


package org.apache.solr.update.processor.ext;

// imports omitted: Solr update-processor, Lucene analysis, and Mahout Bayes classifier classes

public class MahoutDocumentClassifier extends UpdateRequestProcessorFactory {

  SolrParams parameters = null;
  ClassifierContext classifierContext = null;

  public void init(NamedList args) {
    // Read processor configuration (model path, field names) from solrconfig.xml
    parameters = SolrParams.toSolrParams((NamedList) args);
    BayesParameters params = new BayesParameters();
    String modelPath = parameters.get("model");
    params.setBasePath(modelPath);

    // Load the trained Bayes model into memory and set up the classifier context
    InMemoryBayesDatastore datastore = new InMemoryBayesDatastore(params);
    Algorithm algorithm = new BayesAlgorithm();
    classifierContext = new ClassifierContext(algorithm, datastore);
    try {
      classifierContext.initialize();
    } catch (Exception e) {
      throw new RuntimeException("Could not initialize the Mahout classifier", e);
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new DocumentClassifier(next);
  }

  public class DocumentClassifier extends UpdateRequestProcessor {

    public DocumentClassifier(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      try {
        SolrInputDocument document = cmd.getSolrInputDocument();
        String inputField = parameters.get("inputField");
        String outputField = parameters.get("outputField");
        String defaultCategory = parameters.get("defaultCategory", "default");
        String input = (String) document.getFieldValue(inputField);

        // Tokenize the input field before handing it to the classifier
        List<String> tokenList = new ArrayList<String>(256);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream tokenStream = analyzer.tokenStream(inputField, new StringReader(input));
        while (tokenStream.incrementToken()) {
          tokenList.add(tokenStream.getAttribute(TermAttribute.class).term());
        }
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);

        // Classify the document and store the predicted label in the output field
        ClassifierResult category = classifierContext.classifyDocument(tokens, defaultCategory);
        if (category != null && category.getLabel().length() > 0) {
          document.addField(outputField, category.getLabel());
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
      // Pass the (possibly enriched) document down the rest of the update chain
      super.processAdd(cmd);
    }
  }
}



Step #2: Compile the Above Class into a Jar and Copy It into Solr's lib Directory
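
A minimal sketch of this step, assuming the required Solr, Lucene, and Mahout jars sit in a local lib/ directory and that SOLR_HOME points at your Solr home – adjust paths and jar versions to your own setup:

# compile against the Solr, Lucene, and Mahout jars
javac -cp "lib/*" org/apache/solr/update/processor/ext/MahoutDocumentClassifier.java

# package the compiled classes into a jar
jar cf mahout-classifier.jar org/apache/solr/update/processor/ext/*.class

# copy it to where Solr picks up plugin jars
cp mahout-classifier.jar $SOLR_HOME/lib/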

Step #3: Configure Solr to Hook Custom Classifier

Add the following snippet to solrconfig.xml; it hooks the classifier code above into Solr's update procedure.


<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="org.apache.solr.update.processor.ext.MahoutDocumentClassifier">
    <str name="inputField">text</str>
    <str name="outputField">docCategory</str>
    <str name="defaultCategory">default</str>
    <str name="model">/home/user/classifier-model</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">mahoutclassifier</str>
  </lst>
</requestHandler>


Step #4: Start Solr with java -jar start.jar

Solr loads the Mahout classifier model into memory and starts classifying documents on the fly. For large volumes of data this approach may not perform well, because an update operation in Solr is internally equivalent to a delete followed by an add. So if you need to classify, say, 300 GB of already-indexed data, you are effectively deleting and re-adding all 300 GB.
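
To verify the chain end to end, you can index a small document and read the predicted category back. A minimal sketch with curl, assuming the default example core at http://localhost:8983/solr and the field names configured above (id, text, docCategory):

# index a document through the /update handler so it passes through the classifier chain
curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<add><doc>
    <field name="id">1</field>
    <field name="text">Sample text to be categorized by the Bayes model</field>
  </doc></add>'

# query it back and check which category the classifier assigned
curl 'http://localhost:8983/solr/select?q=id:1&fl=id,docCategory'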

That’s it in this post. Thanks :)

2 comments:

  1. Hi Mayur,

    Do you have any links or pointers as to how I can do this given an existing solr index?

    How do I read the vector file and dictionary file generated by lucene.vectors and generate a model from that?

  2. Hi Mayur,

    Need some help on integrating apache mahout classifier with apache solr. I am not able to find the files in my ubuntu system. Can you please help.
