Tuesday, October 2, 2012

Item based similarity with GraphChi

Item based collaborative filtering is one of the most popular collaborative filtering methods, used in more than 70% of the companies I talk to.  Following the advice of my mega collaborator Justin Yan, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together to compute a similarity metric between each pair of items. This task very quickly becomes computation intensive. For example, the Netflix data has around 17K movies. If we want to compare all pairs of movies to find the most similar ones, we need to examine around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and movie B is identical to the distance between movie B and movie A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work further, we only compare movies which were watched together by at least X users, for example X=5; otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory on a single machine. That is where GraphChi comes in. My preliminary implementation computes the similarity between all pairs of items without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items, comparing the pivots to each of them.
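In rough Python form, the pivot scheme looks like this (a toy sketch of the idea only — the real implementation works over sharded graph files on disk, and `compare` stands for any pairwise similarity function):

```python
def all_pairs_stream(items, chunk_size, compare):
    """Toy sketch of the pivot idea: load a chunk of items ('pivots') into
    memory, then stream over the remaining items and compare each one against
    every pivot. With a symmetric measure, each unordered pair is compared
    exactly once (enforced by the i < j check)."""
    results = {}
    for start in range(0, len(items), chunk_size):
        pivots = items[start:start + chunk_size]      # the in-memory chunk
        for j in range(start + 1, len(items)):        # stream the rest
            for i, pivot in enumerate(pivots, start):
                if i < j:                             # one direction only
                    results[(i, j)] = compare(pivot, items[j])
    return results

# 6 items in chunks of 2 -> all 6*5/2 = 15 unordered pairs are covered
pairs = all_pairs_stream(list(range(6)), 2, lambda a, b: (a, b))
print(len(pairs))   # 15
```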

The simplest similarity metric I have implemented is the Jaccard index. The similarity of items i and j is computed as follows:
         wi = number of users who watched movie i
         wj = number of users who watched movie j
         wij = number of users who watched both movie i and movie j
         Dij =  wij / ( wi + wj - wij )

It is clear that Dij is a number between zero and one.
Additional distance functions are found here.
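As a sanity check, the Jaccard formula above can be sketched in a few lines of Python (illustrative only; the GraphChi implementation streams over shards rather than holding user sets in memory):

```python
def jaccard(users_i, users_j):
    """Jaccard index of two movies, given the sets of users who watched each."""
    wij = len(users_i & users_j)   # users who watched both movies
    return wij / (len(users_i) + len(users_j) - wij)

# toy example: movie i watched by users {1,2,3}, movie j by users {2,3,4}
print(jaccard({1, 2, 3}, {2, 3, 4}))   # 2 / (3 + 3 - 2) = 0.5
```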

As always, I am looking for brave beta testers who want to try it out!
For the full Netflix data, it takes around 1200 seconds to compute distances of around 130M item pairs. (Around 100K item pairs a second, using a 24 core machine, with only 800MB of memory used.)

Ping me if you are interested in other distance metrics so I can add them as well.

How to try it out:

1) Follow GraphChi installation instructions steps 1-4.
2) Download smallnetflix_mm
3) Run itemcf on smallnetflix data using:

bickson@thrust:~/graphchi$ > ./toolkits/collaborative_filtering/itemcf --training=smallnetflix_mm --nshards=1 --clean_cache=1 --K=10 --quiet=1
WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nshards] => [1]
[clean_cache] => [1]
[K] => [10]
[quiet] => [1]

 === REPORT FOR sharder() ===
edata_flush: 0.005163s (count: 13, min: 0.000335s, max: 0.000534, avg: 0.000397154s)
execute_sharding: 0.273818 s
finish_shard.sort: 0.065508 s
preprocessing: 1.05835 s
shard_final: 0.15078 s
app: sharder
Total item pairs compared: 6325948 total written to file: 35551 pairs with zero distance: 124609
Sorting output file smallnetflix_mm.out0
Sorting output file smallnetflix_mm.out1
Sorting output file smallnetflix_mm.out2
Sorting output file smallnetflix_mm.out3
Sorting output file smallnetflix_mm.out4
Sorting output file smallnetflix_mm.out5
Sorting output file smallnetflix_mm.out6
Sorting output file smallnetflix_mm.out7
Merging sorted files:
File written: smallnetflix_mm-topk
Total lines: 35551

Now let's examine the output file smallnetflix_mm-topk:

3561 3076 0.23296110332
3561 1990 0.270156830549
3561 1794 0.218252897263
3561 1788 0.227363973856
3561 1598 0.211793571711
3561 1536 0.233875498176
3561 1507 0.224534928799
3561 557 0.237991556525
3561 411 0.22253626585
3561 338 0.237756416202

For each item we get exactly K=10 similar items. (The items are sorted in reverse order.)
The format is rather simple. Each row contains:
<item A> <item B> <similarity>

Command line arguments
--training - input file name in sparse matrix market format
--min_allowed_intersection=XX - filter out item pairs rated jointly by fewer than XX users
--quiet=1 - run with less verbose traces
--nshards=1 ## mandatory argument
FOR itemcf: --distance=XX, 0 = Jaccard index, 1 = AA, 2 = RA, 3 = Aiolli

Useful GraphChi command line arguments:
execthreads XX - set the number of execution threads
membudget_mb XX - fix the size of used memory to XX mb

Additional metrics:

Adamic/Adar (AA) (itemcf):

Just got an email from Timmy Wilson, our man in Ohio: 
"There are a lot of similarity functions out there -- Adamic / Adar is another good one "

I looked it up and indeed it is a very simple cost function: 
Distance_item_i_j = sum over users which rated both items i and j of ( 1 / log(number of items the user rated) )
see equation (2) in the paper: http://arxiv.org/abs/0907.1728 "Role of Weak Ties in Link Prediction of Complex Networks". The equation gives larger weight to users who rated both items but have a relatively small number of rated items overall.

I have also added the RA cost function, which is the same as AA but without the log.
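Both cost functions can be sketched in a few lines of Python (my illustration of the description above; `items_rated_by[u]`, counting how many items user u rated, is an assumed input, and each user is assumed to have rated more than one item so the log is positive):

```python
import math

def adamic_adar(users_i, users_j, items_rated_by):
    """AA: sum over users who rated both items of 1 / log(N_u),
    where N_u is the number of items that user rated."""
    return sum(1.0 / math.log(items_rated_by[u]) for u in users_i & users_j)

def resource_allocation(users_i, users_j, items_rated_by):
    """RA: the same as AA, but without the log."""
    return sum(1.0 / items_rated_by[u] for u in users_i & users_j)
```

A user who rated only a handful of items contributes a large weight to the pair, while a user who rated thousands of items contributes almost nothing.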

Pearson Correlation (itemcf)

I got a request from another reader, "silly snail", to implement Pearson correlation as well. Let m be a vector of size M (the number of users) which holds the user rating means.
Let a be a sparse vector holding the user ratings for item a, and the same for item b.

Pearson_correlation_i_j = ((a - mean)' * (b-mean)) / (stddev(a) * stddev(b))

And stddev(a) = sum((a-mean).^2) / (n - 1) 
where n = length(a).

To run Pearson correlation, use the program ./toolkits/collaborative_filtering/pearson

A toy example:

Prepare the file "stam" with the content:
%%MatrixMarket matrix coordinate real general
3 4 9
1 1 3
1 2 4
2 3 1
2 4 1
2 1 2
3 1 1
3 2 2
3 3 3
3 4 4

Namely there are 3 users, 4 items, and 9 ratings. 
Now run pearson correlation:
./toolkits/collaborative_filtering/pearson --training=stam execthreads 1 

Let's examine the output file:
head stam.out0
2 1 0.337405

Namely, the Pearson correlation between items 1 and 2 is 0.337405.
Now let's do the same computation in matlab:

>> a=full(mmread('stam')); % load the matrix into memory
a =

     3     4     0     0
     2     0     1     1
     1     2     3     4
>> means=mean(a') % compute the per-user rating means

means =

    1.7500    1.0000    2.5000

>> A=a(:,1)-means'  % compute item 1 vector minus mean

A =

    1.2500
    1.0000
   -1.5000

>> B=a(:,2)-means'  % compute item 2 vector minus mean

B =

    2.2500
   -1.0000
   -0.5000

>> dist = A'*B  % multiply the result

dist =

    2.5625

>> stddeva=sum(A.^2)/2 % compute stddev item 1

stddeva =

    2.4063

>> stddevb=sum(B.^2)/2 % compute stddev item 2

stddevb =

    3.1563

>> dist/(stddeva*stddevb) % compute pearson correlation

ans =

    0.3374

Luckily we got the same result! Namely, the Pearson correlation between the first two items is 0.3374.
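For readers who prefer Python, the same toy check can be done as follows (a sketch that follows the formula exactly as defined above, including the (n - 1) denominators):

```python
# Re-check the toy example "stam" in plain Python.
a = [[3, 4, 0, 0],
     [2, 0, 1, 1],
     [1, 2, 3, 4]]                           # 3 users x 4 items, 0 = not rated
means = [sum(row) / len(row) for row in a]   # per-user rating means
A = [a[u][0] - means[u] for u in range(3)]   # item 1 ratings minus user means
B = [a[u][1] - means[u] for u in range(3)]   # item 2 ratings minus user means
dist = sum(x * y for x, y in zip(A, B))      # (a - mean)' * (b - mean)
stddeva = sum(x * x for x in A) / 2          # sum((a-mean).^2) / (n - 1), n = 3
stddevb = sum(y * y for y in B) / 2
print(round(dist / (stddeva * stddevb), 6))  # 0.337405, matching stam.out0
```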

Cosine Distance (itemcf)
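Cosine distance is one minus the cosine similarity of the two item rating vectors; a minimal sketch (my illustration, not the toolkit code itself):

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two rating vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [1, 0]))   # identical vectors -> 0.0
```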

Manhattan Distance (itemcf2)

See http://en.wikipedia.org/wiki/Taxicab_geometry
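A one-line sketch of the Manhattan distance between two rating vectors (my illustration; the itemcf2 code may differ in how it handles missing ratings):

```python
def manhattan(a, b):
    """Manhattan (taxicab / L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan([1, 2, 3], [2, 4, 6]))   # 1 + 2 + 3 = 6
```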

Log Similarity Distance (itemcf2)

Chebyshev Distance (itemcf2)
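The Chebyshev distance takes the largest single coordinate difference between the two vectors; a one-line sketch (my illustration):

```python
def chebyshev(a, b):
    """Chebyshev (L-infinity) distance: the largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

print(chebyshev([1, 2, 3], [2, 4, 6]))   # max(1, 2, 3) = 3
```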


Tanimoto Distance (itemcf2)

See http://en.wikipedia.org/wiki/Jaccard_index
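For real-valued rating vectors, the extended Tanimoto coefficient generalizes the Jaccard index; a sketch (my illustration, assuming this is the variant used):

```python
def tanimoto(a, b):
    """Extended Tanimoto coefficient; on 0/1 vectors it equals the Jaccard index."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# binary case: items share 1 of 3 total watchers -> Jaccard 1/3
print(tanimoto([1, 1, 0], [0, 1, 1]))
```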

Aiolli's method (itemcf)

See the paper: F. Aiolli, A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge Preference Learning: Problems and Applications in AI (PL-12), ECAI-12 Workshop, Montpellier. pdf

Slope One Recommender (itemcf2)

Is described in A Programmer's Guide to Data Mining, page 18.
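Slope One predicts ratings from average pairwise rating deviations; a minimal weighted Slope One sketch (my own illustration, not the itemcf2 code):

```python
from collections import defaultdict

def slope_one_deviations(ratings):
    """ratings: {user: {item: rating}}. Returns dev[(i, j)], the average of
    (rating_i - rating_j) over users who rated both items, plus pair counts."""
    diffs = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    diffs[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    return {p: diffs[p] / counts[p] for p in diffs}, dict(counts)

def slope_one_predict(user_ratings, item, dev, counts):
    """Weighted Slope One prediction of `item` from a user's other ratings."""
    num = den = 0.0
    for j, rj in user_ratings.items():
        if j != item and (item, j) in dev:
            c = counts[(item, j)]
            num += (rj + dev[(item, j)]) * c
            den += c
    return num / den if den else None
```

For example, if item j tends to be rated 1.25 stars above item i, a new user who gives i three stars is predicted to rate j at 4.25.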

How to compute recommendations out of similarities?

For computing the top K recommendations for each user, you can use the itemsim2rating utility
as explained here. The input file for itemsim2rating is the similarity matrix computed by itemcf.

Additional reading:

Matrix factorization based collaborative filtering with GraphChi.
Case study: million songs dataset



  1. Danny, Do you mean number of users who _watched_ both movie i and j in the third line?

    wi = number of users who watched movie i
    wj = number of users who watched movie j
    wij = number of users who wanted both movie i and movie j
    Dij = wij / ( wi + wj - wij )

    Also at some point you mentioned Python interfaces to GraphChi, if that becomes available I would be happy to jump in and beta test, become early user etc. Do you have any plans for that. I think once you have a system where writing a new algorithm becomes kind of like writing a plugin for firefox or chrome, that is the ideal situation since we often run into situations which need slight modifications, workarounds based on ground realities.

    1. Hi Senthil! Thanks for your note. I have fixed the typo (watched).

      Regarding Python, we do not have a wrapper yet, but there is a GraphChi version in Java. We will add your request to our wish list...

  2. Hi Danny, nice work!
    Can it calculate item similarity with Pearson's correlation?
    I'm new to GraphChi and I'm wondering if it is possible that I could implement different similarity metrics myself.

    1. Hi SillySnail (nice nickname by the way!)
      I can implement Pearson correlation rather quickly for you (maybe even today), if you would like to be my beta tester and try it out.


    2. Hi again,
      Pearson correlation is now implemented in graphchi. Please try it out by checking out from mercurial (hg pull; hg update) and then recompile (make clean; make cf). I have documented the usage here: http://bickson.blogspot.co.il/2012/09/item-based-similarity-with-graphchi.html#metrics

  3. Hi Danny.

    Thanks for the work. I was wondering. Does someone have to go through each install step each time you publish a new algorithm/measure or can one just copy the new file to the toolkit folder?


    1. Hi Paul,
      My advice is to use mercurial; then you do not need to install, just do "hg pull; hg update" and then "make clean; make cf". The pull command gets the latest files from the repository. You can also download them manually and place them in the toolkits/collaborative_filtering folder, but this is less recommended since we sometimes fix things in the GraphChi engine itself or in other parts of the code.


  4. Hi Danny,

    In the last RecSys Conference, Anmol Bhasin from LinkedIn explained that they use LSH (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) for item similarity and clustering in high-dimensional data (makes sense; at LinkedIn they have many features per user). Do you think it is a good idea to implement this in GraphLab?

    On the other side, although you have implemented many recommender algorithms, many researchers require several measures to evaluate them (RMSE, Precision@n, nDCG, recall, MRR, AUC), and MyMediaLite has taken an important step in this area.

    Is there any plan to incorporate these measures in the recommender algorithms of GraphLab and/or GraphChi?


  5. Hi Denis,
    Thanks for your interesting feedback.

    Regarding locality sensitive hashing - I am not sure which variant you mean. In the lecture I found on the web (http://www.slideshare.net/anmolbhasin/beyond-ratings-andfollowers-recsys-2012) they refer to cosine similarity. It is rather straightforward to add it to GraphChi.

    Regarding additional accuracy measures - it is possible to add support to additional measures. If you are interested in helping me beta test them I can add some of the measures you request.



  6. Danny,

    You are right, it looks like the variant that Anmol refers to in his presentation is cosine similarity, but I am not sure how he sets the weights of the vectors or if this is a special implementation. I think it could be a good contribution to add this to GraphChi.

    On the other side, I would be glad to help you test evaluation metrics. We have 4 people working on RecSys at the iSchool at Pitt, and at least 2 of us (Sherry Sahebi and me) could collaborate with you on this task.

    Thanks again.

  7. Danny,

    I do hope you'll implement cosine similarity and the other sim. functions.

    1. Already implemented... Let me know if you want to test it out.



  8. Hi,Danny!
    I am a little confused about your code after doing some tests computing the similarity (cosine similarity and Pearson coefficient) of an item set with "itemcf2".
    It seems that when computing cosine similarity, the output is "item1 item2 Distance", since in distance.cpp the function "calc_cosine_distance" returns "1 - dotprod / denominator". But for the Pearson coefficient, the output is "item1 item2 Similarity". I think maybe these outputs are confusing...

    1. Hi Martin!
      Thanks for your feedback. Based on your comments I have changed the terminology from cosine similarity to cosine distance and gave a different explaining URL. Please take a look and let me know if this is now clearer.



  9. Good one, it works great... Sir!!

  10. Dear Danny,

    I have the below mentioned Matrix Market file; when I run the itemcf command, it throws an exception... any idea what could be the issue?

    %%MatrixMarket matrix coordinate real general
    % Generated 28-Jan-2013
    4 4 7
    1 1 5
    1 2 5
    1 3 3
    1 4 5
    2 1 5
    3 1 2
    4 1 4

    mburhan@mburhan-Vostro-3460:~/graphchi/toolkits/collaborative_filtering$ ./itemcf --training=/home/mburhan/workspace/MarketMatrix/datasource/test_mm
    WARNING: common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
    [training] => [/home/mburhan/workspace/MarketMatrix/datasource/test_mm]
    INFO: chifilenames.hpp(find_shards:251): Detected number of shards: 2
    INFO: chifilenames.hpp(find_shards:252): To specify a different number of shards, use command-line parameter 'nshards'
    INFO: io.hpp(convert_matrixmarket:419): File /home/mburhan/workspace/MarketMatrix/datasource/test_mm was already preprocessed, won't do it again.
    INFO: io.hpp(read_global_mean:112): Opened matrix size: 4 x 4 edges: 7 Global mean is: 4.14286 time bins: 0 Now creating shards.
    FATAL: itemcf.cpp(main:452): This application currently supports only 1 shard
    terminate called after throwing an instance of 'char const*'
    Aborted (core dumped)

  11. Hi Danny,

    I executed the algorithms (Jaccard index, AA, RA, Aiolli) with my dataset and got great results.
    I am currently trying to use cosine distance using the following command:
    ./toolkits/collaborative_filtering/itemcf2 --training=input_notP_5 --distance=4

    This results in the following output on the terminal.

    [training] => [input_notP_5]
    [distance] => [4]
    INFO: chifilenames.hpp(find_shards:258): Detected number of shards: 2
    INFO: chifilenames.hpp(find_shards:259): To specify a different number of shards, use command-line parameter 'nshards'
    INFO: io.hpp(convert_matrixmarket:425): File input_notP_5 was already preprocessed, won't do it again.
    INFO: io.hpp(read_global_mean:112): Opened matrix size: 511288 x 274815 edges: 1044265 Global mean is: 4 time bins: 0 Now creating shards.
    DEBUG: stripedio.hpp(stripedio:201): Start io-manager with 2 threads.
    INFO: graphchi_engine.hpp(graphchi_engine:150): Initializing graphchi_engine. This engine expects 4-byte edge data.
    INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 0 - 600855
    INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 600856 - 786102
    INFO: graphchi_engine.hpp(run:673): GraphChi starting
    INFO: graphchi_engine.hpp(run:674): Licensed under the Apache License 2.0
    INFO: graphchi_engine.hpp(run:675): Copyright Aapo Kyrola et al., Carnegie Mellon University (2012)
    DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088552, input_notP_5.edata.e4B.0_2sizeof(ET): 4
    DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088508, input_notP_5.edata.e4B.1_2sizeof(ET): 4
    INFO: graphchi_engine.hpp(print_config:125): Engine configuration:
    INFO: graphchi_engine.hpp(print_config:126): exec_threads = 1
    INFO: graphchi_engine.hpp(print_config:127): load_threads = 4
    INFO: graphchi_engine.hpp(print_config:128): membudget_mb = 800
    INFO: graphchi_engine.hpp(print_config:129): blocksize = 4194304
    INFO: graphchi_engine.hpp(print_config:130): scheduler = 1
    INFO: graphchi_engine.hpp(run:706): Start iteration: 0
    INFO: graphchi_engine.hpp(run:760): 0.029119s: Starting: 0 -- 600855
    INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 0 - 600855
    DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
    INFO: graphchi_engine.hpp(run:799): Start updates
    INFO: graphchi_engine.hpp(run:809): Finished updates
    INFO: graphchi_engine.hpp(run:827): Commit memshard
    INFO: graphchi_engine.hpp(run:760): 0.207832s: Starting: 600856 -- 786102
    INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 600856 - 786102
    DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485801 number of blocks: 1
    INFO: graphchi_engine.hpp(run:799): Start updates
    INFO: graphchi_engine.hpp(run:809): Finished updates
    INFO: graphchi_engine.hpp(run:827): Commit memshard
    INFO: graphchi_engine.hpp(run:706): Start iteration: 1
    INFO: graphchi_engine.hpp(run:760): 0.322075s: Starting: 0 -- 600855
    INFO: graphchi_engine.hpp(run:773): Iteration 1/5, subinterval: 0 - 600855
    DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
    INFO: graphchi_engine.hpp(run:799): Start updates

    The execution does not proceed further and stays at the "Start updates" step. Could you please let me know what I have missed?


    1. Hi Manu,
      This is strange. If you like, send me a sample dataset where this problem happens and I will take a look at it. Which OS are you running on?

    2. Hi Danny,

      It worked. I am using Ubuntu.. I used the following command:

      ./toolkits/collaborative_filtering/itemcf2 --training=input_notP_7 --distance=6 --K=3 --min_allowed_intersection=3

      The execution at the following took 10 minutes:
      INFO: graphchi_engine.hpp(run:799): Start updates

      And then it proceeded with the comparison of pairs.


    3. I am glad to hear it worked! Comparing all pairs may be a slow task - it depends on how many items you have and also on the sparsity pattern of the user item ratings.

  12. Hi Danny,

    I am a little bit confused about the cosine similarity choice.

    According to the command line argument, "FOR itemcf2: --distance=XX, 3 = PEARSON, 4 = COSINE", itemcf2 with --distance=4 means cosine similarity, but on the reference page, cosine distance is listed under itemcf, not itemcf2.

    I tried the itemcf2 --distance=4 command, which gives me similarity values greater than 1. I assume the cosine similarity value should be between -1 and 1, right?

    Could you help me figure out which --distance index should I choose?


    1. If your edges are not binary (0 or 1) you may have values outside the range.