Tuesday, March 13, 2012

Cluster Apache Solr data using Apache Mahout

Lately, I have been working on integrating Apache Mahout algorithms with Apache Solr, and I have been able to integrate Solr with Mahout's classification and clustering algorithms. I will post a series of blog posts on this integration. This post guides you through clustering your Solr data using Mahout's K-Means clustering algorithm.

Minimum Requirements:

  • Basic understanding of Apache Solr and Apache Mahout

  • Understanding of K-Means clustering

  • Apache Solr and Apache Mahout up and running on your system

Step 1 – Configure Solr & Index Data:

Before indexing sample data into Solr, make sure the fields are configured in schema.xml.

<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
  • Add termVectors="true" to the fields you want to cluster on

  • Index some sample documents into Solr
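
To make the indexing step concrete, here is a minimal sketch in Python that builds a Solr XML `<add>` message and posts it to the update handler. The URL, the sample documents, and the helper names `build_add_payload` and `post_to_solr` are assumptions for illustration; adjust them to your own Solr setup and field names.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; change host, port and core as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_add_payload(docs):
    """Build a Solr XML <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field_el = ET.SubElement(doc_el, "field", name=name)
            field_el.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(payload, url=SOLR_UPDATE_URL):
    """POST the XML payload to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical sample documents; "field_name" is the placeholder field from the schema.
sample_docs = [
    {"id": "1", "field_name": "apache solr is a search server"},
    {"id": "2", "field_name": "apache mahout runs k-means clustering"},
]
payload = build_add_payload(sample_docs)
# post_to_solr(payload)  # uncomment once Solr is running
```

The network call is left commented out so the payload can be inspected before anything is sent to Solr.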

Step 2 – Convert Lucene Index to Mahout Vectors


mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
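
To give a feel for what this step produces, the following Python sketch builds a term dictionary and a unit-length (L2-normalized) sparse vector per document, which is the effect of `--norm 2`. It is only an illustration, not Mahout's code: it uses raw term counts as weights for simplicity, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Assign an integer index to every distinct term (the role of the dictionary file)."""
    terms = sorted({term for text in docs.values() for term in text.split()})
    return {term: index for index, term in enumerate(terms)}


def to_l2_vector(text, dictionary):
    """Turn one document into a sparse {term_index: weight} vector of unit L2 length."""
    counts = Counter(text.split())
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {dictionary[term]: count / length for term, count in counts.items()}


# Hypothetical documents keyed by their id field.
docs = {"1": "solr solr mahout", "2": "mahout kmeans"}
dictionary = build_dictionary(docs)
vectors = {doc_id: to_l2_vector(text, dictionary) for doc_id, text in docs.items()}
```

After normalization every vector has length 1, so long and short documents become comparable under a cosine-based distance measure.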


Step 3 – Run K-Means Clustering

mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

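Here:
  • i: path to the input vectors from step #2

  • c: path to the initial cluster centroids

  • k: number of clusters (the value of K in K-Means clustering)

  • x: maximum number of iterations

  • o: path to output clusters

  • ow: overwrite the output directory

  • dm: class name of the distance measure

  • clustering: assign the input points to clusters after the iterations finish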
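
For readers who want to see how these options interact, here is a small single-machine sketch of K-Means with a cosine distance in Python. The parameters `k` and `max_iterations` play the roles of `-k` and `-x`, and `cosine_distance` stands in for the `-dm` class. It is a simplified illustration, not Mahout's implementation: it stops when assignments no longer change, and all names and sample points are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, max_iterations, seed=0):
    """Cluster points into k groups; stop at convergence or after max_iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical 2-dimensional points forming two obvious groups.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, assignments = kmeans(points, k=2, max_iterations=10)
```

A larger `max_iterations` only matters when the clustering has not converged earlier; on well-separated data the loop usually exits after a few passes.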

Step 4 – Analyze Cluster Output


mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>

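Here:
  • s: directory containing the clusters

  • d: path of the dictionary from step #2

  • dt: format of the dictionary file

  • n: number of top terms to print per cluster

  • dm: class name of the distance measure

  • pointsDir: directory containing the clustered points

  • output: path where the cluster dump is written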
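
The "top terms" idea behind `-n` and `-d` can be sketched in a few lines of Python: take a cluster centroid, map its term indexes back to words through the dictionary, and keep the highest-weighted ones. This is an illustrative sketch only; the function name, the dictionary contents, and the centroid weights are made up.

```python
def top_terms(centroid, dictionary, n):
    """Return the n highest-weighted (term, weight) pairs of a sparse centroid."""
    index_to_term = {index: term for term, index in dictionary.items()}
    ranked = sorted(centroid.items(), key=lambda item: item[1], reverse=True)
    return [(index_to_term[index], weight) for index, weight in ranked[:n]]


# Hypothetical dictionary (term -> index) and centroid (index -> weight).
example_dictionary = {"kmeans": 0, "mahout": 1, "solr": 2}
example_centroid = {0: 0.1, 1: 0.7, 2: 0.4}
print(top_terms(example_centroid, example_dictionary, 2))
# prints [('mahout', 0.7), ('solr', 0.4)]
```

Reading the top terms of each cluster is usually the quickest way to judge whether the clusters are meaningful.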

That is all for clustering. In my next post I'll showcase how to run Mahout classification on Apache Solr data. I hope this helps; let me know your feedback.

Comments:

  1. Hello Mayur, thank you for your useful post.
    I have a question: I'm trying to classify data that is text and mostly numbers and amounts. Is this possible using Mahout?

  2. Sloppy work. Please check your commands before blogging. For example: CosineDist>>n<<anceMeasure. Really?
