Monday, May 23, 2011

Koren's SVD++ collaborative filtering algorithm implemented in GraphLab

Due to many user requests, I have implemented Yehuda Koren's SVD++ collaborative filtering algorithm in GraphLab. Thanks to Nicholas Ampazis, and to Yehuda Koren who supplied his C++ version.
The implementation is based on the paper "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model" by Yehuda Koren.
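
For readers unfamiliar with the model, the prediction rule in Koren's paper is r_ui = mu + b_u + b_i + q_i^T (p_u + |N(u)|^(-1/2) * sum of y_j over j in N(u)), where N(u) is the set of items rated by user u. The Python sketch below evaluates that rule for a single user/item pair. It is an illustration only: the function name, the argument names, and the list-based layout are assumptions and do not reflect how pmf stores or exports its factors, and the final clamp to a rating range (1 to 5 for Netflix) is likewise an assumption.

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated, minval=1.0, maxval=5.0):
    """Predict one rating with the SVD++ rule from Koren's paper.

    mu       -- global mean rating
    b_u, b_i -- user and item biases
    p_u, q_i -- user and item factor vectors (same length)
    y_rated  -- implicit-feedback vectors y_j, one per item the user rated
    (All names are hypothetical; this is not the pmf data layout.)
    """
    d = len(q_i)
    # Implicit-feedback term: |N(u)|^(-1/2) * sum of y_j over rated items.
    implicit = [0.0] * d
    if y_rated:
        norm = 1.0 / math.sqrt(len(y_rated))
        for y_j in y_rated:
            for k in range(d):
                implicit[k] += norm * y_j[k]
    # Inner product of the item factors with the augmented user factors.
    dot = sum(q_i[k] * (p_u[k] + implicit[k]) for k in range(d))
    raw = mu + b_u + b_i + dot
    # Assumed: predictions are clamped to the allowed rating range.
    return max(minval, min(maxval, raw))
```

A plain user-by-movie dot product leaves out the global mean, both biases, and the implicit-feedback term, so it will generally not land on the rating scale.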

Note that, unlike the algorithm as presented in the original paper, our implementation is parallel and exploits multiple cores whenever they are available.

Here are some timing results of the multicore implementation:
(I used an 8-core machine and ran 5 SVD++ iterations on a partial Netflix dataset with 3M ratings.)

Here are some accuracy results:
It seems that additional cores improve accuracy.


To run SVD++:
0) Install GraphLab following the instructions at http://graphlab.org/download.html
1) Run pmf with run mode = 5 (SVD++).
Example:
./pmf netflix 5 --ncpus=XX --scheduler="round_robin(max_iterations=10)"

Two other options are --minval=XX and --maxval=XX.
For KDD Cup data, use --minval=0 and --maxval=100
(if the file name is kddcup, these values are set automatically).
For Netflix data, use --minval=1 and --maxval=5.

Example run on the full Netflix dataset (using 8 cores):

<55|0>bickson@biggerbro:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf netflix-r 5 --ncpus=8 --scheduler=round_robin

INFO:     pmf.cpp(main:1081): PMF/ALS Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(main:1083): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
WARNING:  pmf.cpp(main:1086): Code compiled with GL_NO_MCMC flag - this mode does not support MCMC methods.
WARNING:  pmf.cpp(main:1089): Code compiled with GL_SVD_PP flag - this mode only supports SVD++ run.
Setting run mode SVD_PLUS_PLUS
INFO:     pmf.cpp(main:1126): SVD_PLUS_PLUS starting

loading data file netflix-r
Loading netflix-r TRAINING
Matrix size is: USERS 480189 MOVIES 17770 TIME BINS 27
Creating 99072112 edges (observed ratings)...
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................loading data file netflix-re
Loading netflix-re VALIDATION
Matrix size is: USERS 480189 MOVIES 17770 TIME BINS 27
Creating 1408395 edges (observed ratings)...
........loading data file netflix-rt
Loading netflix-rt TEST
skipping file
setting regularization weight to 1
PTF_ALS for matrix (480189, 17770, 27):99072112.  D=20
pU=1, pV=1, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
SVD++ 20 factors (rate=8.00e-03, reg=1.50e-02)
complete. Obj=6.8368e+08, TRAIN RMSE=3.7150 VALIDATION RMSE=3.7946.
max iterations = 0
step = 1
max_iterations = 0
INFO:     asynchronous_engine.hpp(run:94): Worker 0 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 2 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 1 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 3 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 4 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 5 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 6 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 7 started.

Entering last iter with 1
92.7115) Iter SVD 1, TRAIN RMSE=1.0587 VALIDATION RMSE=0.9892.
Entering last iter with 2
174.441) Iter SVD 2, TRAIN RMSE=0.9096 VALIDATION RMSE=0.9536.
Entering last iter with 3
260.442) Iter SVD 3, TRAIN RMSE=0.8678 VALIDATION RMSE=0.9805.
Entering last iter with 4
321.652) Iter SVD 4, TRAIN RMSE=0.8480 VALIDATION RMSE=0.9603.
Entering last iter with 5
388.735) Iter SVD 5, TRAIN RMSE=0.8291 VALIDATION RMSE=0.9312.
Entering last iter with 6
470.291) Iter SVD 6, TRAIN RMSE=0.8106 VALIDATION RMSE=0.9264.
Entering last iter with 7
558.886) Iter SVD 7, TRAIN RMSE=0.8046 VALIDATION RMSE=0.9270.
Entering last iter with 8
628.846) Iter SVD 8, TRAIN RMSE=0.8007 VALIDATION RMSE=0.9242.
Entering last iter with 9
687.212) Iter SVD 9, TRAIN RMSE=0.7969 VALIDATION RMSE=0.9221.
Entering last iter with 10
775.021) Iter SVD 10, TRAIN RMSE=0.7926 VALIDATION RMSE=0.9215.
Entering last iter with 11
836.143) Iter SVD 11, TRAIN RMSE=0.7907 VALIDATION RMSE=0.9203.
Entering last iter with 12
919.416) Iter SVD 12, TRAIN RMSE=0.7874 VALIDATION RMSE=0.9195.
Entering last iter with 13
1000.87) Iter SVD 13, TRAIN RMSE=0.7852 VALIDATION RMSE=0.9191.
Entering last iter with 14
1081.9) Iter SVD 14, TRAIN RMSE=0.7834 VALIDATION RMSE=0.9186.
Entering last iter with 15
1169.46) Iter SVD 15, TRAIN RMSE=0.7817 VALIDATION RMSE=0.9182.
Entering last iter with 16
1236.61) Iter SVD 16, TRAIN RMSE=0.7808 VALIDATION RMSE=0.9179.
Entering last iter with 17
1304.72) Iter SVD 17, TRAIN RMSE=0.7795 VALIDATION RMSE=0.9176.
Entering last iter with 18
1366.15) Iter SVD 18, TRAIN RMSE=0.7783 VALIDATION RMSE=0.9173.
Entering last iter with 19
1453.8) Iter SVD 19, TRAIN RMSE=0.7768 VALIDATION RMSE=0.9172.
Entering last iter with 20
1521.15) Iter SVD 20, TRAIN RMSE=0.7763 VALIDATION RMSE=0.9171.
Entering last iter with 21
1588.85) Iter SVD 21, TRAIN RMSE=0.7754 VALIDATION RMSE=0.9175.
Entering last iter with 22
1654.52) Iter SVD 22, TRAIN RMSE=0.7757 VALIDATION RMSE=0.9170.
Entering last iter with 23
1722.88) Iter SVD 23, TRAIN RMSE=0.7740 VALIDATION RMSE=0.9171.
Entering last iter with 24
1783.94) Iter SVD 24, TRAIN RMSE=0.7739 VALIDATION RMSE=0.9163.
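
Validation RMSE in a log like this one does not fall monotonically (iteration 3 is worse than iteration 2, for example), so it can help to pull the per-iteration numbers out of the output and pick the best one. The Python sketch below does that with a regular expression written against the "Iter SVD" lines shown above. The function names are hypothetical, and the pattern assumes the log format stays exactly as printed here.

```python
import re

# Matches lines such as:
# "92.7115) Iter SVD 1, TRAIN RMSE=1.0587 VALIDATION RMSE=0.9892."
ITER_LINE = re.compile(
    r"^(?P<secs>[\d.]+)\) Iter SVD (?P<it>\d+), "
    r"TRAIN RMSE=(?P<train>[\d.]+) VALIDATION RMSE=(?P<val>[\d.]+)\.$"
)


def parse_iterations(log_text):
    """Return (iteration, seconds, train_rmse, validation_rmse) tuples."""
    rows = []
    for line in log_text.splitlines():
        m = ITER_LINE.match(line.strip())
        if m:
            rows.append((int(m["it"]), float(m["secs"]),
                         float(m["train"]), float(m["val"])))
    return rows


def best_validation(rows):
    """Return the tuple with the lowest validation RMSE."""
    return min(rows, key=lambda r: r[3])
```

Applied to the run above, best_validation picks iteration 24, with a validation RMSE of 0.9163.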

4 comments:

  1. Hi Bickson,

    The original svd++ (Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model) does not take time into consideration. Does your implementation also include this paper: Collaborative Filtering with Temporal Dynamics? I want to use this software for track2, which does not have time information.

    Thanks,
    Carlos

  2. Hi Danny,

    After running for a while and ending with a TRAIN RMSE=0.7235, over 99% of new predictions using the sum(User .* Movie) formula are negative. (minvalue and maxvalue were set to 1 and 5)

    Is this not the way to calculate predictions for SVDpp?

  3. Can you send me your input file and I will look at it?

  4. p.s.
    That is not the way to compute predictions for SVD++... I will send you more detailed instructions on how to do it.
