Wednesday, March 14, 2012

2-Minute Guide to Integrating Apache Solr with the Apache Mahout Classifier


My previous post was about integrating Mahout clustering with Solr. In this post I will show you how to integrate Mahout’s second C (Classification). There are two general approaches to classifying or categorizing Solr data: before indexing and after indexing. Each approach has its caveats; this post covers the classification-after-indexing approach.

Must-haves for this post:
  • Understanding of, and hands-on experience with, Mahout classification
  • Basic understanding of Solr and its configuration

The procedure is to hook into Solr’s update mechanism and invoke the Mahout classifier for every document being indexed. The classifier identifies the document’s category and fills in the corresponding field of the Solr document.

Step #1: Custom Code for Document Categorization

The code below is quite self-explanatory for anyone familiar with Mahout classification and Solr’s update mechanism.


package org.apache.solr.update.processor.ext;

// Imports below assume Solr 3.x, Lucene 3.x and the Mahout 0.x Bayes API;
// adjust the package names to the versions you are running.
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MahoutDocumentClassifier extends UpdateRequestProcessorFactory {

  private SolrParams parameters = null;
  private ClassifierContext classifierContext = null;

  @Override
  public void init(NamedList args) {
    parameters = SolrParams.toSolrParams(args);
    // Point the Bayes datastore at the trained model on disk.
    BayesParameters params = new BayesParameters();
    params.setBasePath(parameters.get("model"));
    Algorithm algorithm = new BayesAlgorithm();
    InMemoryBayesDatastore datastore = new InMemoryBayesDatastore(params);
    // Assign the context to the field; leaving it as a local variable
    // (as in an earlier draft) means the classifier is never initialized.
    classifierContext = new ClassifierContext(algorithm, datastore);
    try {
      classifierContext.initialize();
    } catch (Exception e) {
      throw new RuntimeException("Could not initialize Mahout classifier", e);
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new DocumentClassifier(next);
  }

  public class DocumentClassifier extends UpdateRequestProcessor {

    public DocumentClassifier(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      try {
        SolrInputDocument document = cmd.getSolrInputDocument();
        String inputField = parameters.get("inputField");
        String outputField = parameters.get("outputField");
        String defaultCategory = parameters.get("defaultCategory");
        String input = (String) document.getFieldValue(inputField);

        // Tokenize the input field the same way the model was trained.
        ArrayList<String> tokenList = new ArrayList<String>(256);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream tokenStream =
            analyzer.tokenStream(inputField, new StringReader(input));
        TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
          tokenList.add(termAttribute.term());
        }
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);

        // Classify and store the winning label in the output field.
        ClassifierResult category =
            classifierContext.classifyDocument(tokens, defaultCategory);
        if (category != null && category.getLabel().length() > 0) {
          document.addField(outputField, category.getLabel());
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
      super.processAdd(cmd);
    }
  }
}



Step #2: Compile the Above Class into a Jar and Copy It into Solr's lib Directory
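
A minimal build sketch, assuming a Solr 3.x and Mahout 0.x setup; the jar names, version numbers, and paths below are illustrative and should be matched to your installation:

# compile against the Solr, Lucene and Mahout jars (versions are examples)
javac -cp "apache-solr-core-3.5.0.jar:lucene-core-3.5.0.jar:mahout-core-0.5.jar" \
  org/apache/solr/update/processor/ext/MahoutDocumentClassifier.java

# package the classes and copy the jar into Solr's lib directory
jar cf mahout-classifier.jar org/apache/solr/update/processor/ext/*.class
cp mahout-classifier.jar example/solr/lib/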

Step #3: Configure Solr to Hook Custom Classifier

Add the following snippet to solrconfig.xml; it hooks the classifier above into Solr's update procedure.


<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="org.apache.solr.update.processor.ext.MahoutDocumentClassifier">
    <str name="inputField">text</str>
    <str name="outputField">docCategory</str>
    <str name="defaultCategory">default</str>
    <str name="model">/home/user/classifier-model</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">mahoutclassifier</str>
  </lst>
</requestHandler>


Step #4: Start Solr with java -jar start.jar

Solr loads the Mahout classifier model into memory and starts classifying documents on the fly. For large volumes of data this approach may not perform well, because an update operation in Solr is internally equivalent to one delete operation followed by one add operation. So if you already have, say, 300 GB of indexed data to classify, you first put 300 GB into Solr and then effectively delete and re-add the same 300 GB.
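
To sanity-check the chain you can post a document and then query the generated field, provided docCategory is defined in schema.xml. A hypothetical example (the id and text values are made up, and the predicted label depends entirely on your trained model):

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">1</field><field name="text">team wins the championship game</field></doc></add>'

curl "http://localhost:8983/solr/select?q=id:1&fl=id,docCategory"

If everything is wired correctly, the response contains a docCategory value chosen by the classifier.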

That’s it for this post. Thanks :)

Tuesday, March 13, 2012

Cluster Apache Solr data using Apache Mahout

Lately I have been working on integrating Apache Mahout algorithms with Apache Solr, and I have managed to integrate Solr with Mahout's classification and clustering algorithms. I will post a series of blogs on this integration. This post guides you through clustering your Solr data with Mahout's K-Means algorithm.

Minimum Requirement:

  • Basic understanding of Apache Solr and Apache Mahout

  • Understanding of K-Means clustering

  • Up and Running Apache Solr and Apache Mahout on your system

Step 1 – Configure Solr & Index Data:

Before indexing sample data into Solr, make sure the fields are configured in schema.xml:

<field name="field_name" type="text" indexed="true" stored="true" termVector="true" />
  • Add termVector="true" to every field you want to cluster on

  • Index some sample documents into Solr, for example:
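
Using the sample documents that ship with the Solr example (the path is illustrative):

cd apache-solr-3.5.0/example/exampledocs
java -jar post.jar *.xml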

Step 2 – Convert Lucene Index to Mahout Vectors


mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
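
For instance, pointing at the default example index (all paths here are hypothetical and are reused in the following steps):

mahout lucene.vector --dir example/solr/data/index --output /tmp/solr-vectors --field text --idField id --dictOut /tmp/solr-dict.txt --norm 2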


Step 3 – Run K-Means Clustering

mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Here:
  • k: number of clusters (the K in K-Means)

  • x: maximum number of iterations

  • c: path where the initial cluster centroids are written (generated randomly when -k is given)

  • o: path to the output clusters

  • ow: overwrite the output directory

  • dm: class name of the distance measure
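
Continuing with the hypothetical paths from step 2:

mahout kmeans -i /tmp/solr-vectors -c /tmp/solr-centroids -o /tmp/solr-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering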

Step 4 – Analyze Cluster Output


mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>

Here:
  • s: directory containing the clusters

  • d: path of the dictionary from step 2

  • dt: format of the dictionary file (text or sequencefile)

  • n: number of top terms to show per cluster

  • output: path of the generated cluster dump
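
Again with the same hypothetical paths (the clusters-N directory number depends on the iteration at which K-Means converged):

mahout clusterdump -s /tmp/solr-clusters/clusters-10 -d /tmp/solr-dict.txt -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir /tmp/solr-clusters/clusteredPoints --output /tmp/solr-clusterdump.txt

The dump lists the top 20 terms for each cluster, which is usually enough to eyeball what each cluster is about.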

That was all for clustering; in my next post I will show how to run Mahout classification on Apache Solr data. Hope it helps, let me know your feedback.