Tuesday, 15 September 2015

Mahout: TanimotoCoefficientSimilarity : Compute item similarity

TanimotoCoefficientSimilarity is based on the Tanimoto coefficient, also known as the extended Jaccard coefficient. Go through the following article to learn about the Tanimoto coefficient.


This similarity is used when users don’t provide preference values.
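
For reference, with yes/no data like this the coefficient between two items is commonly written as the Jaccard ratio of their customer sets. The symbols U_a and U_b (the sets of customers who like items a and b) are notation assumed for this sketch, not names taken from Mahout:

```latex
T(a, b) = \frac{|U_a \cap U_b|}{|U_a \cup U_b|}
        = \frac{|U_a \cap U_b|}{|U_a| + |U_b| - |U_a \cap U_b|}
```

A value of 1 means both items are liked by exactly the same customers; values near 0 mean there is very little overlap.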

Let’s say I have the following input data.


customer.csv
1,1
1,2
1,3
1,7
1,8
2,1
2,2
2,3
2,4
2,5
2,7
3,1
3,2
3,3
3,5
3,6
3,7
4,1
4,3
4,4
4,5
4,7
4,9
4,10
5,1
5,2
5,3
5,4
5,9


1,2 means customer 1 likes item 2.
import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class TanimotoCoefficientSimilarityEx {
    // Path to the input file; change it to match your machine.
    public static String dataFile = "/Users/harikrishna_gurram/customer.csv";

    public static void main(String[] args) throws IOException, TasteException {

        // Each line of the file is "customerId,itemId" with no preference value.
        DataModel model = new FileDataModel(new File(dataFile));

        TanimotoCoefficientSimilarity similarity = new TanimotoCoefficientSimilarity(model);

        long[] itemIds = { 3, 4, 6, 7, 8, 9, 10 };

        // Similarity of item 4 to each item in itemIds (higher means more similar).
        double[] similarities = similarity.itemSimilarities(4, itemIds);

        for (int i = 0; i < itemIds.length; i++) {
            System.out.println("similarity between item 4 and " + itemIds[i]
                    + " is " + similarities[i]);
        }

    }
}


Output
similarity between item 4 and 3 is 0.6
similarity between item 4 and 4 is 1.0
similarity between item 4 and 6 is NaN
similarity between item 4 and 7 is 0.4
similarity between item 4 and 8 is NaN
similarity between item 4 and 9 is 0.6666666666666666
similarity between item 4 and 10 is 0.3333333333333333
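
The numbers in the output can be checked by hand. In customer.csv, item 4 is liked by customers 2, 4 and 5, and item 3 by all five customers, so they share 3 customers out of 5 in total, which gives 0.6. Below is a minimal plain-Java sketch of that calculation. The class TanimotoByHand and the method tanimoto are hypothetical names used only for illustration, and returning NaN for an empty overlap is an assumption chosen to mirror the NaN rows in the output, not a copy of Mahout's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TanimotoByHand {

    // Ratio of shared customers to all customers who like either item.
    // Assumption: NaN when there is no overlap, to mirror the NaN rows in the output.
    static double tanimoto(Set<Long> usersA, Set<Long> usersB) {
        Set<Long> both = new HashSet<Long>(usersA);
        both.retainAll(usersB); // keep only the customers present in both sets
        if (both.isEmpty()) {
            return Double.NaN;
        }
        int union = usersA.size() + usersB.size() - both.size();
        return (double) both.size() / union;
    }

    public static void main(String[] args) {
        // Customer sets read off customer.csv by hand.
        Set<Long> item4 = new HashSet<Long>(Arrays.asList(2L, 4L, 5L));
        Set<Long> item3 = new HashSet<Long>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> item7 = new HashSet<Long>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> item6 = new HashSet<Long>(Arrays.asList(3L));

        System.out.println(tanimoto(item4, item3)); // 3 shared out of 5
        System.out.println(tanimoto(item4, item7)); // 2 shared out of 5
        System.out.println(tanimoto(item4, item6)); // no shared customers
    }
}
```

The same sets explain the remaining rows: item 9 (customers 4 and 5) shares 2 of 3 customers with item 4, and item 10 (customer 4 only) shares 1 of 3.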

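TanimotoCoefficientSimilarity returns NaN if the similarity is unknown. In the output above, items 6 and 8 get NaN; neither of them has a customer in common with item 4.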
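
NaN is not an ordinary number: any comparison with it is false, so it helps to check for it explicitly with Double.isNaN before ranking the results. Here is a minimal sketch, assuming the ids and scores are held in two parallel arrays as in the example. The class NaNAwareRanking and the method mostSimilar are hypothetical helpers written for illustration, not part of Mahout.

```java
public class NaNAwareRanking {

    // Returns the id with the highest known score, or -1 if every score is NaN.
    // Hypothetical helper for illustration; not part of Mahout.
    static long mostSimilar(long[] itemIds, double[] scores) {
        long bestId = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < itemIds.length; i++) {
            if (Double.isNaN(scores[i])) {
                continue; // unknown similarity: skip it instead of comparing
            }
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestId = itemIds[i];
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // Values copied from the output above, leaving out item 4 itself.
        long[] itemIds = { 3, 6, 7, 8, 9, 10 };
        double[] scores = { 0.6, Double.NaN, 0.4, Double.NaN,
                0.6666666666666666, 0.3333333333333333 };

        System.out.println(mostSimilar(itemIds, scores)); // prints 9
    }
}
```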


