Lately, I have been working on integrating Apache Mahout algorithms with Apache Solr, and I have been able to integrate Solr with Mahout's classification and clustering algorithms. I will post a series of blogs on this integration. This post guides you through clustering your Solr data using Mahout's K-Means clustering algorithm.
Minimum Requirement:
- Basic understanding of Apache Solr and Apache Mahout
- Understanding of K-Means clustering
- Apache Solr and Apache Mahout up and running on your system
Step 1 – Index Sample Data into Solr
Before indexing some sample data into Solr, make sure to configure the fields in schema.xml.
<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
- Add termVectors="true" to the fields you want to cluster on
- Index some sample documents into Solr
Step 2 – Convert Lucene Index to Mahout Vectors
mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
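To make the --norm 2 option a little more concrete: as far as I understand it, the value is the power of the norm used to scale each document vector, so 2 would mean the Euclidean (L2) norm. The Python sketch below only illustrates that idea and is not Mahout's own code; the function name l2_normalize and the sample term weights are made up for the example.

```python
import math


def l2_normalize(vector):
    """Scale a sparse term-weight vector (term -> weight) to unit L2 length."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    if length == 0.0:
        # An all-zero vector has no direction, so return it unchanged.
        return dict(vector)
    return {term: w / length for term, w in vector.items()}


# Hypothetical term weights for one document (illustrative values only).
doc = {"solr": 3.0, "mahout": 4.0}
unit = l2_normalize(doc)
print(unit)
```

After this scaling every document vector has length 1, so a long document does not look different from a short one just because its raw term counts are larger.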
Step 3 – Run K-Means Clustering
mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
Here:
- k: number of clusters (the value of K in K-Means clustering)
- x: maximum number of iterations
- o: path to the output clusters
- ow: overwrite the output directory if it already exists
- dm: class name of the distance measure
Step 4 – Analyze Cluster Output
mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>
Here:
- s: directory containing the clusters
- d: path of the dictionary from Step 2
- dt: format of the dictionary file
- n: number of top terms to print per cluster
- output: path where the cluster dump is written
That was all for clustering; in my next post I'll showcase how to run Mahout classification on Apache Solr data. Hope it helps, and let me know your feedback.
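For readers who want to see roughly what the kmeans and clusterdump steps compute, here is a small Python sketch. It is a toy under stated assumptions, not Mahout's implementation: the first k points seed the centroids (chosen here only for repeatability), the vectors are dense lists, and the names cosine_distance, kmeans, top_terms and the four-term dictionary are all invented for illustration. The k and max_iterations arguments play the role of -k and -x, and top_terms mimics the idea behind -n.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch when a vector is all zeros
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, max_iterations):
    """Toy k-means; the first k points are used as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def top_terms(centroid, dictionary, n):
    """Return the n dictionary terms with the largest centroid weights."""
    ranked = sorted(zip(dictionary, centroid), key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical dictionary and document vectors (illustrative values only).
dictionary = ["solr", "index", "mahout", "cluster"]
points = [
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.8],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.9, 1.0],
]
centroids, assignments = kmeans(points, k=2, max_iterations=10)
for i, centroid in enumerate(centroids):
    print(i, top_terms(centroid, dictionary, n=2))
```

On real data the result depends on how the initial centroids are picked, so two runs with different seeds can produce different clusters.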
Hello Mayur, thank you for your useful post.
I have a question: I'm trying to classify data that is text and mostly numbers and amounts. Is this possible using Mahout?
Sloppy work. Please check your commands before blogging. For example: CosineDist>>n<<anceMeasure. Really?