Tuesday, April 30, 2013

Recsys 2013: Yelp! Business Prediction Contest

I got an interesting email from Prof. Nicholas Ampazis of the University of the Aegean, Greece. Nicholas is trying out GraphChi gensgd for Kaggle's Yelp! business prediction contest, which is part of Recsys 2013.

First he sent me some interesting observations about the dataset:


- There are 2108 training users in the ratings (review) matrix that do not appear in the training users file. The reverse is not true (i.e. all users in the training user file have ratings).
- All business_ids in review appear in the business file
- There are 5315 users for which we wish to make predictions that do not appear in the ratings matrix.
- There are 1205 business_ids for which we wish to make predictions that do not appear in the ratings matrix (those always come in pairs with the unknown users above).
- The union of (distinct) users in the ratings matrix, training user file and test user file is 51082
- The union of (distinct) business_ids in the ratings matrix, training business file and test business file is 12742
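
Counts like these can be reproduced with plain set operations on the ids. Below is a minimal sketch, assuming each file holds one JSON object per line with a user_id field (as the conversion script later in this post also assumes). The helper name ids_in and the tiny inline sample are made up for illustration:

```python
import json

def ids_in(lines, key):
    """Collect the distinct values of `key` from JSON-lines records."""
    return {json.loads(line)[key] for line in lines if line.strip()}

# Toy stand-ins for the review and user files (hypothetical data).
review_lines = [
    '{"user_id": "u1", "business_id": "b1", "stars": 4}',
    '{"user_id": "u2", "business_id": "b1", "stars": 5}',
    '{"user_id": "u3", "business_id": "b2", "stars": 2}',
]
user_lines = [
    '{"user_id": "u1", "review_count": 10}',
    '{"user_id": "u2", "review_count": 3}',
]

review_users = ids_in(review_lines, "user_id")
known_users = ids_in(user_lines, "user_id")

# Users with ratings but no entry in the user file.
missing_profiles = review_users - known_users
# Users in the user file without any rating.
no_ratings = known_users - review_users
# Distinct users across both sources.
all_users = review_users | known_users

print(len(missing_profiles), len(no_ratings), len(all_users))
```

With the real files, pass the open file objects instead of the inline lists; the same pattern works for business_id.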


Nicholas has kindly agreed to share with us some of the scripts he is using to convert the Yelp! data to GraphChi format (written by his colleague Vaggelis Tripolitakis - thanks!!).

Disclaimer: we have not yet fine-tuned the performance of gensgd, so prediction quality is still poor. We plan to refine execution in the next couple of days and report results here.

0) Register to the competition here and download the datasets into your root GraphChi folder.

Method A: use Vaggelis's scripts (Ruby)
1) Download the conversion scripts from GitHub:
https://github.com/vtripolitakis/yelpscripts
2) Make the script executable:
# chmod a+rx script

3) Make sure the json Ruby library is installed:
# sudo gem install json

Note: if you do not have root permission on your machine, install the package using
# gem install json
and add the locally created gem folder to your path, for example:
# export PATH=$PATH:/home/bickson/.gem/ruby/1.8/bin

Method B: use Justin Yan's scripts (Python):

Preliminaries: you will need to install Python pandas.
This script is based on a script by Paul Butler.

Create a file named conv_json2csv.py with the following lines:

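'''
Convert Yelp Academic Dataset from JSON to CSV
(one JSON object per input line, one CSV row per object).
'''

import json
from glob import glob

import pandas as pd


def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    # Iterate over a snapshot, since ob is modified inside the loop.
    for k, v in list(ob.items()):
        if isinstance(v, list):
            # Lists (e.g. categories) become one comma-separated string.
            ob[k] = ','.join(str(item) for item in v)
        elif isinstance(v, dict):
            # Nested dicts (e.g. votes) become columns named parent_child.
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob


for json_filename in glob('*.json'):
    csv_filename = '%s.csv' % json_filename[:-5]
    print('Converting %s to %s' % (json_filename, csv_filename))
    with open(json_filename, encoding='utf-8') as f:
        # Replace escaped newlines in text fields so each record stays on one line.
        rows = [convert(line.strip().replace("\\n", " ").replace("\\r", " "))
                for line in f if line.strip()]
    pd.DataFrame(rows).to_csv(csv_filename, encoding='utf-8', index=False)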

Run
# python conv_json2csv.py

4) Use the following commands to convert the data to GraphChi format (they call the Ruby script from Method A)
(hint: use copy & paste!)

###################### TRAINING SET ##########################

#---REVIEW---
./script yelp_training_set/yelp_training_set_review.json user_id business_id date votes stars > yelp_training_set_review.csv

#----USER---
./script yelp_training_set/yelp_training_set_user.json user_id review_count average_stars name votes > yelp_training_set_user.csv

#----BUSINESS----
./script yelp_training_set/yelp_training_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address stars > yelp_training_set_business.csv

##############################################################


###################### TEST SET ##########################

#---REVIEW---
./script yelp_test_set/yelp_test_set_review.json user_id business_id > yelp_test_set_review.csv

#----USER---
./script yelp_test_set/yelp_test_set_user.json user_id review_count > yelp_test_set_user.csv

#----BUSINESS----
./script yelp_test_set/yelp_test_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address > yelp_test_set_business.csv

##############################################################
######### CONCATENATE USER/BUSINESS FILES FROM TRAIN AND TEST ##########################

cat yelp_training_set_user.csv yelp_test_set_user.csv > user_file.csv

cat yelp_training_set_business.csv yelp_test_set_business.csv > business_file.csv
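
Note that plain cat keeps every line, so if the CSV files start with a header row (pandas' to_csv in Method B writes one by default), the second header ends up in the middle of the combined file. Here is a sketch of a header-aware concatenation; the function name concat_csv and its two switches are assumptions for illustration, and the demo files are hypothetical:

```python
import os
import tempfile

def concat_csv(in_paths, out_path, has_header=True, keep_header=True):
    """Concatenate CSV files line by line, writing at most one header row."""
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for file_index, path in enumerate(in_paths):
            with open(path, encoding="utf-8", newline="") as src:
                for line_index, line in enumerate(src):
                    if has_header and line_index == 0:
                        # Keep only the first file's header, and only if requested.
                        if file_index > 0 or not keep_header:
                            continue
                    # Make sure a file without a final newline does not glue rows together.
                    out.write(line if line.endswith("\n") else line + "\n")

# Small demo on temporary files.
tmp = tempfile.mkdtemp()
train = os.path.join(tmp, "train_user.csv")
test = os.path.join(tmp, "test_user.csv")
merged = os.path.join(tmp, "user_file.csv")
with open(train, "w", encoding="utf-8") as f:
    f.write("user_id,review_count\nu1,10\n")
with open(test, "w", encoding="utf-8") as f:
    f.write("user_id,review_count\nu2,3\n")

concat_csv([train, test], merged, keep_header=False)
with open(merged, encoding="utf-8") as f:
    merged_text = f.read()
print(merged_text)
```

Whether the header should be kept depends on what the reader of the file expects; for files without any header row, pass has_header=False.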

5) Run GraphChi GENSGD
a) Prepare a file named yelp_training_set_review.csv:info (typed in the shell as yelp_training_set_review.csv\:info, where the backslash only escapes the colon) with the following 2 lines:
%%MatrixMarket matrix coordinate real general
51082 12742 229907
And a second file named yelp_test_set_review.csv:info with the following 2 lines:
%%MatrixMarket matrix coordinate real general
51082 12742 22956
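
The two info files can also be generated from a script, which avoids fighting the shell over the colon. This is a sketch, assuming a filesystem that allows ':' in file names (Linux does; Windows does not). write_info is a hypothetical helper, and the numbers are the ones given above:

```python
import os
import tempfile

def write_info(data_path, n_users, n_items, n_ratings):
    """Write the two-line Matrix Market header file that belongs to data_path."""
    info_path = data_path + ":info"
    with open(info_path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (n_users, n_items, n_ratings))
    return info_path

workdir = tempfile.mkdtemp()  # replace with your GraphChi folder
train_info = write_info(
    os.path.join(workdir, "yelp_training_set_review.csv"), 51082, 12742, 229907)
test_info = write_info(
    os.path.join(workdir, "yelp_test_set_review.csv"), 51082, 12742, 22956)

with open(train_info) as f:
    info_lines = f.read().splitlines()
print(info_lines)
```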


b) First run: use reviews only (without user and business information)

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 
WARNING:  common.hpp(print_copyright:180): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
...
   1.22736) Iteration:   0 Training RMSE:    1.20862
   1.35923) Iteration:   1 Training RMSE:    1.19359
    1.4658) Iteration:   2 Training RMSE:    1.18219
   1.60173) Iteration:   3 Training RMSE:     1.1729
   1.70212) Iteration:   4 Training RMSE:    1.16528
   1.82876) Iteration:   5 Training RMSE:    1.15854
   1.94846) Iteration:   6 Training RMSE:    1.15233
    2.0595) Iteration:   7 Training RMSE:    1.14694
   2.19689) Iteration:   8 Training RMSE:    1.14202
    2.3059) Iteration:   9 Training RMSE:     1.1375
   2.41694) Iteration:  10 Training RMSE:    1.13332
   2.53653) Iteration:  11 Training RMSE:    1.12953
   2.65778) Iteration:  12 Training RMSE:    1.12591
   2.77716) Iteration:  13 Training RMSE:    1.12243
    2.9095) Iteration:  14 Training RMSE:    1.11911
   3.04893) Iteration:  15 Training RMSE:    1.11606
   3.18783) Iteration:  16 Training RMSE:     1.1132
   3.29783) Iteration:  17 Training RMSE:    1.11039
   3.43783) Iteration:  18 Training RMSE:    1.10765
   3.54825) Iteration:  19 Training RMSE:    1.10512
Found 16335 new test users with no information about them in training dataset!


c) Second run: add user information

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 --user_file=user_file.csv
WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [yelp_training_set_review.csv]
[test] => [yelp_test_set_review.csv]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [6]
[rehash] => [1]
[gensgd_mult_dec] => [0.999999]
[quiet] => [1]
[file_columns] => [7]
[minval] => [0]
[maxval] => [5]
[clean_cache] => [1]
[gensgd_rate1] => [1e-3]
[gensgd_rate2] => [1e-3]
[gensgd_rate3] => [1e-3]
[gensgd_reg0] => [1e-3]
[gensgd_regw] => [1e-2]
[gensgd_regv] => [1e-1]
[max_iter] => [20]
[nshards] => [1]
[user_file] => [user_file.csv]

 === REPORT FOR sharder() ===
[Timings]
edata_flush: 0.063296s (count: 3, min: 0.01508s, max: 0.024157, avg: 0.0210987s)
execute_sharding: 0.168887 s
preprocessing: 0.530149 s
read_shovel: 0.008681 s
shard_final: 0.130108 s
shovel_flush: 0.008117 s
shovel_read: 0.006699 s
[Other]
app: sharder
Warning: missing: 5101 from node feature file: user_file.csv out of: 48978
   1.53665) Iteration:   0 Training RMSE:    1.35278
   1.77865) Iteration:   1 Training RMSE:    1.16203
   1.94927) Iteration:   2 Training RMSE:    1.11664
   2.11927) Iteration:   3 Training RMSE:    1.09443
   2.29663) Iteration:   4 Training RMSE:       1.08
   2.45677) Iteration:   5 Training RMSE:    1.06953
   2.60268) Iteration:   6 Training RMSE:    1.06094
   2.77053) Iteration:   7 Training RMSE:    1.05397
   2.93878) Iteration:   8 Training RMSE:    1.04816
   3.10066) Iteration:   9 Training RMSE:    1.04273
   3.28596) Iteration:  10 Training RMSE:    1.03828
   3.45064) Iteration:  11 Training RMSE:    1.03407
   3.64199) Iteration:  12 Training RMSE:    1.03018
   3.80989) Iteration:  13 Training RMSE:    1.02669
   3.99993) Iteration:  14 Training RMSE:    1.02326
   4.15758) Iteration:  15 Training RMSE:    1.02018
   4.36239) Iteration:  16 Training RMSE:    1.01738
   4.58039) Iteration:  17 Training RMSE:    1.01461
   4.73803) Iteration:  18 Training RMSE:    1.01198
   4.91018) Iteration:  19 Training RMSE:     1.0095
Found 16335 new test users with no information about them in training dataset!

d) Third run: also add business information:


bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 --user_file=user_file.csv --item_file=business_file.csv  --gensgd_rate4=1e-2 --gensgd_rate5=1e-2

WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
...
Warning: missing: 5101 from node feature file: user_file.csv out of: 48978
Warning: missing: 16865 from node feature file: business_file.csv out of: 28402
   2.42267) Iteration:   0 Training RMSE:    2.18202
    3.0081) Iteration:   1 Training RMSE:    2.03203
   3.47768) Iteration:   2 Training RMSE:    1.90535
   4.07795) Iteration:   3 Training RMSE:    1.77818
   4.65206) Iteration:   4 Training RMSE:    1.67263
   5.23848) Iteration:   5 Training RMSE:    1.56425
   5.78475) Iteration:   6 Training RMSE:    1.47417
   6.38091) Iteration:   7 Training RMSE:    1.40324
   6.98647) Iteration:   8 Training RMSE:    1.34509
   7.57702) Iteration:   9 Training RMSE:    1.29067
   8.19608) Iteration:  10 Training RMSE:    1.24391
   8.80638) Iteration:  11 Training RMSE:    1.20927
   9.40638) Iteration:  12 Training RMSE:    1.18695
   9.94451) Iteration:  13 Training RMSE:    1.15484
   10.5267) Iteration:  14 Training RMSE:    1.13502
   11.1167) Iteration:  15 Training RMSE:    1.11524
   11.6525) Iteration:  16 Training RMSE:    1.10079
   12.2368) Iteration:  17 Training RMSE:    1.07966
   12.8187) Iteration:  18 Training RMSE:    1.06165
   13.4304) Iteration:  19 Training RMSE:    1.05136
Found 16335 new test users with no information about them in training dataset!
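
To compare the runs it helps to pull the per-iteration RMSE out of the log. The sketch below parses lines in the format printed above. The regular expression and the function name parse_rmse are illustrative and not part of GraphChi, and the leading number is simply kept as a float without assuming what it measures:

```python
import re

LINE_RE = re.compile(
    r"^\s*([\d.]+)\)\s+Iteration:\s+(\d+)\s+Training RMSE:\s+([\d.]+)")

def parse_rmse(log_text):
    """Return (leading_number, iteration, rmse) tuples from gensgd output."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((float(m.group(1)), int(m.group(2)), float(m.group(3))))
    return rows

# Two lines copied from the third run, plus a line that should be ignored.
sample = """\
   12.8187) Iteration:  18 Training RMSE:    1.06165
   13.4304) Iteration:  19 Training RMSE:    1.05136
Found 16335 new test users with no information about them in training dataset!
"""
rows = parse_rmse(sample)
print(rows[-1])
```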

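Conclusions:
1) Including user and business properties improves the fit compared with reviews only: training RMSE after 20 iterations is 1.105 with reviews only, 1.010 with user properties, and 1.051 with both user and business properties.
2) When adding additional features, you should be careful not to overfit.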
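
One way to watch for overfitting is to hold out part of the training reviews and compare the RMSE on that part with the training RMSE. Here is a minimal sketch of the split and the metric in plain Python. The 10% fraction, the seed and the function names are arbitrary choices for illustration, not something gensgd prescribes:

```python
import math
import random

def split_holdout(rows, fraction=0.1, seed=0):
    """Shuffle rows and split off `fraction` of them as a validation set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * fraction)
    return shuffled[n_val:], shuffled[:n_val]

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical (user, business, stars) rows standing in for the review file.
reviews = [("u%d" % i, "b%d" % (i % 3), 1 + i % 5) for i in range(20)]
train_rows, val_rows = split_holdout(reviews, fraction=0.1)
print(len(train_rows), len(val_rows))
print(rmse([3.0, 4.0], [3.0, 2.0]))
```

If the validation RMSE stops improving while the training RMSE keeps falling, the extra features are probably being overfit.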


The output of gensgd is the file yelp_test_set_review.csv.predict

   1.4793704
N/A
   3.3002301
   2.8208445
   4.0713396
N/A
   3.1468302
   3.6243955

Here N/A stands in for the prediction for a new user. If you would like to have the global mean prediction instead, you can run with --cold_start=2
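
The N/A entries can also be filled in afterwards. This sketch reads predictions laid out as in the excerpt above (one value or N/A per line; a real file may carry extra header lines, which the sketch does not handle) and substitutes a fallback value. fill_missing is a hypothetical helper and the fallback of 3.75 is only a placeholder:

```python
def fill_missing(lines, fallback):
    """Parse prediction lines, replacing 'N/A' with `fallback`."""
    preds = []
    for line in lines:
        token = line.strip()
        if not token:
            continue  # skip blank lines
        preds.append(fallback if token == "N/A" else float(token))
    return preds

excerpt = ["   1.4793704", "N/A", "   3.3002301", "   2.8208445"]
global_mean = 3.75  # placeholder; compute it from the training stars
filled = fill_missing(excerpt, global_mean)
print(filled)
```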

Next: soon I will post an update about the performance of GraphChi and how to create the submission format from the GraphChi output.

6 comments:

  1. Dear Danny,

    Thanks for the great post!

    I'm working on the same problem.

    It seems like my training csv file has a different column ordering from yours. I changed the from_pos and to_pos accordingly but I'm getting an error after training:

    app: sharder
    0.846656) Iteration: 0 Training RMSE: 1.25905
    0.950504) Iteration: 1 Training RMSE: 1.23172
    1.06232) Iteration: 2 Training RMSE: 1.20799
    1.16633) Iteration: 3 Training RMSE: 1.18764
    1.26865) Iteration: 4 Training RMSE: 1.17327
    1.37181) Iteration: 5 Training RMSE: 1.15787

    FATAL: gensgd.cpp(read_one_token:302): Error reading line 0 [ user_id business_id
    ]
    terminate called after throwing an instance of 'char const*'

    I was looking for options like from_pos, to_pos for the test csv data but there doesn't seem to be any.

    Is the test data supposed to be in a specific csv format?

    Thanks,
    David

    Replies
    1. Hi,
      Please remove the header title from the test data.

      Thanks!

  2. Hi Danny,
    Is it possible to run other methods for training rather than GENSGD on the dataset?

    Thanks,
    Patrick

    Replies
    1. Definitely. There are two benefits to GENSGD:
      1) you can use a richer set of parameters (not just user, business and rating)
      2) no string parsing is required. If you plan to use another method, you will first need to convert the user ids and item ids to consecutive integers, and then run any of the other methods. It is possible to use the GraphChi parser toolkit to generate those consecutive ids.

      Best,

  3. Hi Danny,
    very impressive work!
    But when I try to create the file yelp_training_set_review.csv\:info, neither windows nor linux allow me to create a file with the symbol \ and :.
    What should I do?

    Ben

  4. I recommend trying out GraphLab Create: http://graphlab.com/products/create/overview.html it will be easier to set this contest with Graphlab Create.

    The "\" tells the Linux shell to ignore the special meaning of ":". The filename should be something:info
