Thursday, 3 September 2015

mahout: CityBlockSimilarity : Compute User similarity

CityBlockSimilarity is implementation of City Block distance (Also known as Manhattan distance).

if x=(a,b) and y=(c,d), the Manhatten distance between x and y is |a−c|+|b−d|.

x = (5, 3)
y = (6, 2)

Manhatten distance = |5-6| + |3-2| = 1 + 1 = 2.

x1:  1.0, 3.2, 4.8, 0.1, 3.2, 0.6, 2.2, 1.1
x2:  0.1, 5.2, 1.9, 4.2, 1.9, 0.1, 0.1, 6.0

Manhatten distance(MD) between x1 and x2 is

|1−0.1|+|3.2−5.2|+|4.8−1.9|+|0.1−4.2|+|3.2−1.9|+|0.6−0.1|+|2.2−0.1|+|1.1−6.0|

0.9+2+2.9+4.1+1.3+0.5+2.1+4.9

18.7

Following example explains how to compute UserSimilarity between two users using  CityBlockSimilarity.

Let’s say I had following input data.


customer.csv

1,4,3
1,7,2
1,8,2
1,10,1
2,3,2
2,4,3
2,6,3
2,7,1
2,9,1
3,0,3
3,3,2
3,4,1
3,8,3
3,9,1
4,2,5
4,3,4
4,7,3
4,9,2
5,4,5
5,6,4
5,7,1
5,8,3


1,4,3 means customer 1 like item 4 and rated it 3

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.CityBlockSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class CityBlockSimilarityEx {
 public static String dataFile = "/Users/harikrishna_gurram/customer.csv";

 public static void main(String args[]) throws IOException, TasteException {

  DataModel model = new FileDataModel(new File(dataFile));

  CityBlockSimilarity similarity = new CityBlockSimilarity(model);

  System.out.println("Similarity between user1 and user2 is "
    + similarity.userSimilarity(1, 2));

 }
}


Output

Similarity between user1 and user2 is 0.16666666666666666


When you go through the source code, it is implemented like following.

 public double userSimilarity(long userID1, long userID2) throws TasteException {
    DataModel dataModel = getDataModel();
    FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
    FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
    int prefs1Size = prefs1.size();
    int prefs2Size = prefs2.size();
    int intersectionSize = prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1) : prefs1.intersectionSize(prefs2);
    return doSimilarity(prefs1Size, prefs2Size, intersectionSize);
  }


  private static double doSimilarity(int pref1, int pref2, int intersection) {
    int distance = pref1 + pref2 - 2 * intersection;
    return 1.0 / (1.0 + distance);
  }


It computes like following.

Step 1: finds out all the item ids for given user.
Item ids for user 1 are: 4, 7, 8, 10
Item ids for user 2 are: 3, 4, 6, 7, 9

Total item ids for user 1 prefs1Size  = 4
Total item ids for user 2 prefs2Size  = 5

Common item ids for user1 and user2 are (4, 7)
Total common item ids are intersectionSize =2

Compute similarity distance using following formula.

distance = prefs1Size  + prefs2Size  - 2 * intersectionSize;
              = 4 + 5 – (2 * 2)
              = 9 – 4
              = 5

To map distance between 0 and 1, following logic is used.
Manhattan_distance = 1.0 / (1.0 + distance);
                                = 1.0 / (1.0 + 5)
                                = 1.0 / 6.0
                                = 0.16666666666666666



Prevoius                                                 Next                                                 Home

No comments:

Post a Comment