First dataset: Airline on time
Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still deal with it without any modification. I hope that results showing how flexible the software is will encourage more data scientists to try it out!
The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a CSV file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airports, etc. The interesting fields are those that carry the varying information about flight duration.
And here are the first few lines:
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
First task: can we predict the total time the flight was in the air?
Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen "UniqueCarrier" and "FlightNum" as the two fields which form the matrix, because together they characterize each flight rather uniquely. Next we need to decide which field we want to predict. I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about each field's meaning is found here).
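The position arguments used in the commands that follow can be derived from the header line. The sketch below is only an illustration: `column_positions` is a hypothetical helper (not part of GraphChi), and it assumes positions are zero-based, which matches the positions gensgd echoes in its log.

```python
# Header line of the airline CSV, copied from the format shown above.
HEADER = (
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay"
)


def column_positions(header):
    """Map each column title to its zero-based position (hypothetical helper)."""
    return {name: pos for pos, name in enumerate(header.split(","))}


positions = column_positions(HEADER)

# Print the flag values for the fields chosen in the text.
for flag, column in [
    ("--from_pos", "UniqueCarrier"),
    ("--to_pos", "FlightNum"),
    ("--val_pos", "ActualElapsedTime"),
]:
    print(f"{flag}={positions[column]}  # {column}")
```

Looking up a different target (for example `positions["TaxiIn"]`) gives the value to pass as --val_pos for another prediction task.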
First let's use traditional matrix factorization.
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 0 :
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
7.58561) Iteration: 0 Training RMSE: 67.1094
11.7177) Iteration: 1 Training RMSE: 64.6665
15.8441) Iteration: 2 Training RMSE: 63.2155
19.9971) Iteration: 3 Training RMSE: 59.0044
24.0989) Iteration: 4 Training RMSE: 53.9083
28.1962) Iteration: 5 Training RMSE: 50.2416
...
77.6041) Iteration: 17 Training RMSE: 35.6409
81.7165) Iteration: 18 Training RMSE: 35.505
85.8197) Iteration: 19 Training RMSE: 35.4046
89.9266) Iteration: 20 Training RMSE: 35.3288
We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad: we are half an hour off track.
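For readers unfamiliar with the metric, here is a minimal sketch of how an RMSE figure like the one above is computed, assuming gensgd's "Training RMSE" is the standard root mean squared error. The `rmse` function and the predictions are illustrative only; the actual values are the ActualElapsedTime entries of the three sample rows.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# ActualElapsedTime (minutes) of the three sample rows shown earlier.
actual = [128, 128, 96]
# Hypothetical predictions, made up for this illustration.
predicted = [120, 130, 100]

print(f"RMSE: {rmse(actual, predicted):.2f} minutes")
```

Because the errors are squared before averaging, a few badly mispredicted flights weigh more heavily than many small misses.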
Next let's throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add --features=1,2,3,4,5,6,7 to the command line, namely the positions of the features in the input file. (Year, at position 0, is left out of the run below.) This is what we call temporal matrix factorization or tensor factorization. To use this information in one of the traditional methods, you would need to merge all of these fields into one integer which encodes the time, which is of course a tedious task.
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=100 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 7 :
INFO: gensgd.cpp(main:1211): Selected feature: 1 : Month
INFO: gensgd.cpp(main:1211): Selected feature: 2 : DayofMonth
INFO: gensgd.cpp(main:1211): Selected feature: 3 : DayOfWeek
INFO: gensgd.cpp(main:1211): Selected feature: 4 : DepTime
INFO: gensgd.cpp(main:1211): Selected feature: 5 : CRSDepTime
INFO: gensgd.cpp(main:1211): Selected feature: 6 : ArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 7 : CRSArrTime
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
21.8356) Iteration: 0 Training RMSE: 50.3144
36.6782) Iteration: 1 Training RMSE: 40.4813
51.425) Iteration: 2 Training RMSE: 36.0579
66.4348) Iteration: 3 Training RMSE: 33.4226
...
272.188) Iteration: 17 Training RMSE: 20.0103
286.887) Iteration: 18 Training RMSE: 19.7198
301.602) Iteration: 19 Training RMSE: 19.4597
316.305) Iteration: 20 Training RMSE: 19.2147
With temporal information we now get an RMSE of 19.2 minutes, which is again not that good.
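To give a feel for the "merge the fields into one integer" step that a traditional temporal method would require, here is one possible encoding. It is a sketch under assumptions: `encode_time` is a hypothetical helper, DepTime is read as a local hhmm clock value (as the sample rows suggest), and the result is minutes since January 1 of the given year. gensgd does not need this step.

```python
from datetime import datetime, timedelta


def encode_time(year, month, day, hhmm):
    """Collapse a date plus an hhmm clock reading into one integer:
    minutes elapsed since January 1 of that year (hypothetical encoding)."""
    hour, minute = divmod(hhmm, 100)
    # timedelta is used so that a clock value of 2400 rolls over to the
    # next day instead of raising an error.
    moment = datetime(year, month, day) + timedelta(hours=hour, minutes=minute)
    return int((moment - datetime(year, 1, 1)).total_seconds() // 60)


# First two sample rows: January 3, 2008, DepTime 2003 and 754.
first = encode_time(2008, 1, 3, 2003)
second = encode_time(2008, 1, 3, 754)
print(first, second)
```

Even this simple version has to make choices (which clock field to use, how to handle midnight), which is part of why passing the raw fields as separate features is more convenient.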
Now let's utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything...
./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 14 :
INFO: gensgd.cpp(main:1211): Selected feature: 1 : Month
INFO: gensgd.cpp(main:1211): Selected feature: 2 : DayofMonth
INFO: gensgd.cpp(main:1211): Selected feature: 3 : DayOfWeek
INFO: gensgd.cpp(main:1211): Selected feature: 4 : DepTime
INFO: gensgd.cpp(main:1211): Selected feature: 5 : CRSDepTime
INFO: gensgd.cpp(main:1211): Selected feature: 6 : ArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 7 : CRSArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 12 : CRSElapsedTime
INFO: gensgd.cpp(main:1211): Selected feature: 13 : AirTime
INFO: gensgd.cpp(main:1211): Selected feature: 14 : ArrDelay
INFO: gensgd.cpp(main:1211): Selected feature: 15 : DepDelay
INFO: gensgd.cpp(main:1211): Selected feature: 16 : Origin
INFO: gensgd.cpp(main:1211): Selected feature: 17 : Dest
INFO: gensgd.cpp(main:1211): Selected feature: 18 : Distance
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
36.2089) Iteration: 0 Training RMSE: 21.1476
61.2802) Iteration: 1 Training RMSE: 10.1963
86.3032) Iteration: 2 Training RMSE: 8.64215
111.236) Iteration: 3 Training RMSE: 7.76054
136.246) Iteration: 4 Training RMSE: 7.14308
161.221) Iteration: 5 Training RMSE: 6.6629
...
461.528) Iteration: 17 Training RMSE: 4.26991
486.61) Iteration: 18 Training RMSE: 4.17239
511.737) Iteration: 19 Training RMSE: 4.08084
536.775) Iteration: 20 Training RMSE: 3.99414
Now we are down to an RMSE of about 4 minutes. We can continue the computation (run more iterations) and get even below 2 minutes of error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a prediction error of 2 minutes is not that bad.
Conclusion: traditional matrix / tensor factorization has some severe limitations when dealing with complex real-world data. Additional techniques are needed to improve accuracy!
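To put the three runs on a common scale, the final RMSE of each can be expressed as a fraction of the 127-minute average flight time quoted above. The sketch below only does that arithmetic; the run labels are mine, and the numbers are the iteration-20 values from the logs.

```python
MEAN_FLIGHT_MINUTES = 127.0  # average 2008 flight time quoted in the text

# Final training RMSE (iteration 20) of the three runs shown above.
runs = {
    "carrier + flight number": 35.3288,
    "plus temporal features": 19.2147,
    "plus the remaining features": 3.99414,
}


def relative_error(rmse_minutes, mean_minutes=MEAN_FLIGHT_MINUTES):
    """RMSE expressed as a percentage of the mean flight time."""
    return 100.0 * rmse_minutes / mean_minutes


for label, value in runs.items():
    print(f"{label}: {relative_error(value):.1f}% of the mean flight time")
```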
Second task: let's predict TaxiIn (time that the plane is on the ground when coming in)
This task is slightly more difficult since, as you may imagine, there is much larger variation in TaxiIn time relative to flight time. But is predicting it more difficult? No: we simply change to --val_pos=19, namely point the target at the TaxiIn field. (The run below also raises the three gensgd_rate values to 1e-3 and adds features 10 and 11.)
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1 --file_columns=28 --gensgd_rate3=1e-3 --gensgd_mult_dec=0.9999 --max_iter=20 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[quiet] => [1]
INFO: gensgd.cpp(main:1155): Total selected features: 16 :
INFO: gensgd.cpp(main:1158): Selected feature: 1
INFO: gensgd.cpp(main:1158): Selected feature: 2
INFO: gensgd.cpp(main:1158): Selected feature: 3
INFO: gensgd.cpp(main:1158): Selected feature: 4
INFO: gensgd.cpp(main:1158): Selected feature: 5
INFO: gensgd.cpp(main:1158): Selected feature: 6
INFO: gensgd.cpp(main:1158): Selected feature: 7
INFO: gensgd.cpp(main:1158): Selected feature: 10
INFO: gensgd.cpp(main:1158): Selected feature: 11
INFO: gensgd.cpp(main:1158): Selected feature: 12
INFO: gensgd.cpp(main:1158): Selected feature: 13
INFO: gensgd.cpp(main:1158): Selected feature: 14
INFO: gensgd.cpp(main:1158): Selected feature: 15
INFO: gensgd.cpp(main:1158): Selected feature: 16
INFO: gensgd.cpp(main:1158): Selected feature: 17
INFO: gensgd.cpp(main:1158): Selected feature: 18
1.56777) Iteration: 0 Training RMSE: 3.89207
3.01777) Iteration: 1 Training RMSE: 3.64978
4.5159) Iteration: 2 Training RMSE: 3.46472
5.8659) Iteration: 3 Training RMSE: 3.30712
7.26778) Iteration: 4 Training RMSE: 3.17225
8.7159) Iteration: 5 Training RMSE: 3.06696
...
23.6072) Iteration: 16 Training RMSE: 2.60147
24.9789) Iteration: 17 Training RMSE: 2.57697
26.3267) Iteration: 18 Training RMSE: 2.55768
27.6967) Iteration: 19 Training RMSE: 2.54186
29.0773) Iteration: 20 Training RMSE: 2.53113
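When comparing several runs it helps to pull the per-iteration numbers out of the log. The following is a small sketch that parses lines in the format printed above. `parse_log` is a hypothetical helper, and reading the leading number as elapsed seconds is my assumption; the source does not say what it is.

```python
import re

# Matches lines such as "1.56777) Iteration: 0 Training RMSE: 3.89207".
# Anything after the RMSE value (e.g. "Train err: 0") is ignored.
LINE = re.compile(
    r"^\s*([\d.]+)\)\s+Iteration:\s+(\d+)\s+Training RMSE:\s+([\d.eE+-]+)"
)


def parse_log(text):
    """Return (iteration, leading_number, training_rmse) tuples.
    The leading number is assumed to be elapsed seconds."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append(
                (int(match.group(2)), float(match.group(1)), float(match.group(3)))
            )
    return rows


# A few lines copied from the TaxiIn run above.
SAMPLE = """\
INFO: gensgd.cpp(main:1158): Selected feature: 18
1.56777) Iteration: 0 Training RMSE: 3.89207
3.01777) Iteration: 1 Training RMSE: 3.64978
29.0773) Iteration: 20 Training RMSE: 2.53113
"""

for iteration, elapsed, rmse_value in parse_log(SAMPLE):
    print(iteration, elapsed, rmse_value)
```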
Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Decompress the bzip2 archive using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file, named 2008.csv:info with the following two lines:
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.
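Step 3 can also be scripted. The sketch below writes the two-line :info file next to the training file. `write_info_file` is a hypothetical helper, not a GraphChi utility; the size line follows the Matrix Market convention of rows, columns, non-zeros, and the values are simply the ones given in step 3. It writes into a temporary directory so it can be run safely anywhere. Note that a colon in a file name works on Linux and macOS but not on Windows.

```python
import tempfile
from pathlib import Path


def write_info_file(training_file, rows, cols, nonzeros):
    """Write '<training_file>:info' with a Matrix Market banner and size line
    (hypothetical helper; values are supplied by the caller)."""
    info_path = Path(str(training_file) + ":info")
    info_path.write_text(
        "%%MatrixMarket matrix coordinate real general\n"
        f"{rows} {cols} {nonzeros}\n"
    )
    return info_path


# Demonstration in a throwaway directory, using the values from step 3.
with tempfile.TemporaryDirectory() as tmp:
    info_path = write_info_file(Path(tmp) / "2008.csv", 20, 7130, 1000000)
    info_name = info_path.name
    info_lines = info_path.read_text().splitlines()

print(info_name)
print(info_lines)
```

The file name has to be exactly the training file name followed by ":info", otherwise gensgd cannot find the matrix dimensions.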
Second dataset: Hearst machine learning challenge
A while ago Hearst provided data about email campaigns, and the task was to predict user reaction to emails (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD
The meaning of the fields and their codes is described in detail here. You will need to register on the website to get access to the data.
And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"
For this demo, I used the file Modeling_1.csv which is the first of 5 files, with 400K entries.
We would like to predict the entry at position 0 (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The remaining columns from 3 up to 40 are features. (While there are more features, the solution is already so accurate that the first 40 are enough.)
After about an hour of playing I got the following formulation:
./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1255): Total selected features: 36 :
INFO: gensgd.cpp(main:1258): Selected feature: 3 : AQI
INFO: gensgd.cpp(main:1258): Selected feature: 4 : ASIAN_CD
INFO: gensgd.cpp(main:1258): Selected feature: 5 : AUTO_IN_MARKET
INFO: gensgd.cpp(main:1258): Selected feature: 6 : BIRD_QTY
INFO: gensgd.cpp(main:1258): Selected feature: 7 : BUYER_DM_BOOKS
INFO: gensgd.cpp(main:1258): Selected feature: 8 : BUYER_DM_COLLECT_SPC_FOOD
INFO: gensgd.cpp(main:1258): Selected feature: 11 : BUYER_DM_GARDEN_FARM
INFO: gensgd.cpp(main:1258): Selected feature: 12 : BUYER_DM_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 13 : BUYER_DM_GIFT_GADGET
INFO: gensgd.cpp(main:1258): Selected feature: 14 : BUYER_DM_MALE_ORIEN
INFO: gensgd.cpp(main:1258): Selected feature: 15 : BUYER_DM_UPSCALE
INFO: gensgd.cpp(main:1258): Selected feature: 16 : BUYER_MAG_CULINARY_INTERS
INFO: gensgd.cpp(main:1258): Selected feature: 17 : BUYER_MAG_FAMILY_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 18 : BUYER_MAG_FEMALE_ORIENTED
INFO: gensgd.cpp(main:1258): Selected feature: 19 : BUYER_MAG_GARDEN_FARMING
INFO: gensgd.cpp(main:1258): Selected feature: 20 : BUYER_MAG_HEALTH_FITNESS
INFO: gensgd.cpp(main:1258): Selected feature: 21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO: gensgd.cpp(main:1258): Selected feature: 22 : BUYER_MAG_RELIGIOUS
INFO: gensgd.cpp(main:1258): Selected feature: 23 : CATS_QTY
INFO: gensgd.cpp(main:1258): Selected feature: 24 : CEN_2000_MATCH_LEVEL
INFO: gensgd.cpp(main:1258): Selected feature: 25 : CLUB_MEMBER_CD
INFO: gensgd.cpp(main:1258): Selected feature: 26 : COUNTRY_OF_ORIGIN
INFO: gensgd.cpp(main:1258): Selected feature: 27 : DECEASED_INDICATOR
INFO: gensgd.cpp(main:1258): Selected feature: 28 : DM_RESPONDER_HH
INFO: gensgd.cpp(main:1258): Selected feature: 29 : DM_RESPONDER_INDIV
INFO: gensgd.cpp(main:1258): Selected feature: 30 : DMR_CONTRIB_CAT_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO: gensgd.cpp(main:1258): Selected feature: 32 : DMR_CONTRIB_CAT_POLITICAL
INFO: gensgd.cpp(main:1258): Selected feature: 33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO: gensgd.cpp(main:1258): Selected feature: 34 : DMR_DO_IT_YOURSELFERS
INFO: gensgd.cpp(main:1258): Selected feature: 35 : DMR_MISCELLANEOUS
INFO: gensgd.cpp(main:1258): Selected feature: 36 : DMR_NEWS_FINANCIAL
INFO: gensgd.cpp(main:1258): Selected feature: 37 : DMR_ODD_ENDS
INFO: gensgd.cpp(main:1258): Selected feature: 38 : DMR_PHOTOGRAPHY
INFO: gensgd.cpp(main:1258): Selected feature: 39 : DMR_SWEEPSTAKES
INFO: gensgd.cpp(main:1258): Selected feature: 40 : DOG_QTY
INFO: gensgd.cpp(main:1259): Target variable 0 : CLICK_FLG
INFO: gensgd.cpp(main:1260): From 9 : BUYER_DM_CRAFTS_HOBBI
INFO: gensgd.cpp(main:1261): To 10 : BUYER_DM_FEMALE_ORIEN
54.8829) Iteration: 0 Training RMSE: 0.00927502 Train err: 8e-05
99.4742) Iteration: 1 Training RMSE: 0.00120904 Train err: 0
143.852) Iteration: 2 Training RMSE: 0.000793143 Train err: 0
188.523) Iteration: 3 Training RMSE: 0.000604034 Train err: 0
233.188) Iteration: 4 Training RMSE: 0.000500067 Train err: 0
We got a very good classifier: starting from the second iteration there are no classification errors on the training data.
Some explanation of the additional run-time flags not used in the previous examples:
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers: 0 and 1. So I use 0.5 as the prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at run time.)
4) --has_header_titles=1 - the first line of the input file includes column titles.
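The interplay of the value hashing and the cutoff can be sketched as follows. This is an illustration only, not gensgd's code: which of Y/N ends up as 0 or 1, and whether the comparison at the cutoff is strict, are assumptions here, and the predictions are made up.

```python
def classify(prediction, cutoff=0.5):
    """Turn a real-valued prediction into one of the two integer bins.
    Using a strict '>' at the cutoff is an assumption."""
    return 1 if prediction > cutoff else 0


def error_rate(targets, predictions, cutoff=0.5):
    """Fraction of examples whose thresholded prediction misses the target."""
    wrong = sum(
        1 for t, p in zip(targets, predictions) if classify(p, cutoff) != t
    )
    return wrong / len(targets)


# Assumed mapping of the Y/N click flag to two integer bins.
to_int = {"N": 0, "Y": 1}

labels = ["N", "N", "Y", "N"]
targets = [to_int[label] for label in labels]
# Hypothetical model outputs; the last one lands on the wrong side of 0.5.
predictions = [0.01, 0.40, 0.93, 0.62]

print(error_rate(targets, predictions))
```

With targets of 0 and 1, a cutoff halfway between them treats both kinds of mistake symmetrically.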
Instructions
1) Register on the Hearst website.
2) Download the first data file, Modeling_1.csv, and put it in the main graphchi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it:
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.
Hi,
I've installed graph-chi on my macbook, and ran a few of the demo scripts without error. However, it appears I cannot load data from a .csv file. When I try to run "traditional matrix factorization" I get the following error: "FATAL: gensgd.cpp(convert_matrixmarket_N:582): Bug: can not add edge from 0 to J 0 since max is: 0x0"
It appears that the conversion from .csv to matrix market is failing. What could be causing this?
Thanks,
Zach
I found the problem: the file should be named "2008.csv:info" not "csv.2008:info"
Thanks for the update! I have fixed the documentation.
Hello,
I've installed graphlab on a VM with Ubuntu and ran the demo scripts of this page and got some errors:
dataset 2008.CSV
- traditional matrix factorization : OK
- temporal matrix factorization :
[Other]
app: sharder
gensgd: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
Aborted (core dumped)
- More features : OK
- TaxiIn : OK
dataset Modeling_1.csv
INFO: gensgd.cpp(convert_matrixmarket_N:559): Starting to read matrix-market input. Matrix dimensions: 11 x 13, non-zeros: 400000
FATAL: gensgd.cpp(read_line:333): Error reading line 0 feature 115 [ N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,h ]
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)
The first error is strange because it works with more features and i couldn't find what's wrong in the second file that cause a reading error (tried to change --file_columns but still doesn't work).
Thanks.
Hi,
Sorry about that. Please retake from mercurial using "hg pull; hg update" and recompile using "make clean; make cf". A contributed Mac OS patch that was supposed to fix the missing getline() function made a mess of the Linux version.
Let me know if it now works.
Thanks Danny.
It works perfectly now.
Hello Danny,
As i said in another post i'm working on a one class and i tried your new soft on my database, i have few questions.
- in your first example you set "--minval=-1 --maxval=1 --calc_error=1" but no cutoff, it automatically set the cutoff value at 0 ?
- in the sparse example you don't set --minval and --maxval but --cutoff=0.5, is there a specific reason you write the command this way in this case ?
- when you set --minval and --maxval what kind of loss function is used ?
- you use the --validation option in the sparse example but when i try to use it with gensgd it doesn't work, is it normal ?
- do you plan to implement the --test option ?
- as i'm dealing with a one class problem i tried the implicite rating option and it worked but i'm curious of what is done when features option is used, what value are put to the features associated to these additionnal ratings ?
Thanks.
Regards.
Hi Alex,
1) You are right. The default cutoff is 0.
2) --minval and --maxval are optional arguments. They slightly improve performance in some cases, but when the result can be any value in the range there is no need to truncate.
3) --minval and --maxval are independent of the loss function used, you can use them with any loss function.
4) Please send our user mailing list (graphlab-kdd) the exact command you used and the error you got using the --validation - it should work. (Even better if you have some small dataset to show the error).
5) The --test option should work - send me a scenario where you get an error and I will debug it.
6) Adding implicit rating does not have feature information and thus I suggest not to apply it here.
Best,