Tuesday, April 30, 2013

Recsys 2013: Yelp! Business Prediction Contest

I got an interesting email from Prof. Nicholas Ampazis from the University of the Aegean, Greece. Nicholas is trying out GraphChi gensgd for Kaggle's Yelp! business prediction contest, which is part of Recsys 2013.

First he sent me some interesting observations about the dataset:


- There are 2108 training users in the ratings (review) matrix that do not appear in the training users file. The reverse is not true (i.e. all users in the training user file have ratings).
- All business_ids in review appear in the business file
- There are 5315 users for which we wish to make predictions that do not appear in the ratings matrix.
- There are 1205 business_ids for which we wish to make predictions that do not appear in the ratings matrix (those always come in pairs with the unknown users above).
- The union of (distinct) users in the ratings matrix, training user file and test user file is 51082
- The union of (distinct) business_ids in the ratings matrix, training business file and test business file is 12742
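
Checks like the ones above boil down to set operations on the id columns. Here is a minimal sketch, assuming the ids have already been read into Python lists. The function name coverage_report and the toy ids are made up for illustration and are not part of Nicholas's scripts.

```python
def coverage_report(review_ids, train_ids, test_ids):
    """Compare ids seen in the ratings matrix against the train/test id files."""
    review, train, test = set(review_ids), set(train_ids), set(test_ids)
    return {
        # ids that have ratings but no entry in the training file
        "rated_not_in_train_file": len(review - train),
        # ids listed in the training file that have no ratings
        "train_file_without_ratings": len(train - review),
        # ids we must predict for that never appear in the ratings matrix
        "test_not_rated": len(test - review),
        # number of distinct ids over all three sources
        "union": len(review | train | test),
    }

# Toy ids only; the real ones come from the converted Yelp! files.
report = coverage_report(
    review_ids=["u1", "u2", "u3", "u3"],
    train_ids=["u1", "u2"],
    test_ids=["u2", "u4"],
)
print(report)
```

The same function can be applied once to user ids and once to business ids.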


Nicholas has kindly agreed to share with us some of the scripts he is using to convert the Yelp! data to GraphChi format (written by his colleague Vaggelis Tripolitakis - thanks!!).

Disclaimer: we have not fine-tuned the performance of gensgd yet, so prediction quality is still poor. We plan to refine execution in the next couple of days and report results here.

0) Register to the competition here and download the datasets into your root GraphChi folder.

Method A: use Vaggelis's scripts (Ruby)
1) Download the conversion scripts from GitHub:
https://github.com/vtripolitakis/yelpscripts
2) Give execute permission to the script:
# chmod a+rx script

3) Make sure the json Ruby library is installed using:
# sudo gem install json

Note: if you do not have root permission on your machine, install the package using
# gem install json
and add the locally created gem folder into your path, for example:
# export PATH=$PATH:/home/bickson/.gem/ruby/1.8/bin

Method B: use Justin Yan's scripts (Python):

Preliminaries: you will have to install the Python pandas library.
This script is based on a script by Paul Butler.

Create a file named conv_json2csv.py with the following lines:

'''
Convert Yelp Academic Dataset from JSON to CSV

'''

import json
import pandas as pd
from glob import glob

def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in list(ob.items()):  # iterate over a copy: ob is modified inside the loop
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

for json_filename in glob('*.json'):
    csv_filename = '%s.csv' % json_filename[:-5]
    print('Converting %s to %s' % (json_filename, csv_filename))
    # one JSON object per line; literal \n and \r escapes are replaced by spaces
    with open(json_filename) as f:
        df = pd.DataFrame([convert(line.strip().replace("\\n"," ").replace("\\r"," ")) for line in f])
    df.to_csv(csv_filename, encoding='utf-8', index=False)

Run
# python conv_json2csv.py

4) Use the following commands to convert the data to GraphChi format (they call the script from Method A)
(hint: use copy & paste!)

###################### TRAINING SET ##########################

#---REVIEW---
./script yelp_training_set/yelp_training_set_review.json user_id business_id date votes stars > yelp_training_set_review.csv

#----USER---
./script yelp_training_set/yelp_training_set_user.json user_id review_count average_stars name votes > yelp_training_set_user.csv

#----BUSINESS----
./script yelp_training_set/yelp_training_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address stars > yelp_training_set_business.csv

##############################################################


###################### TEST SET ##########################

#---REVIEW---
./script yelp_test_set/yelp_test_set_review.json user_id business_id > yelp_test_set_review.csv

#----USER---
./script yelp_test_set/yelp_test_set_user.json user_id review_count > yelp_test_set_user.csv

#----BUSINESS----
./script yelp_test_set/yelp_test_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address > yelp_test_set_business.csv

##############################################################
######### CONCATENATE USER/BUSINESS FILES FROM TRAIN AND TEST ##########################

cat yelp_training_set_user.csv yelp_test_set_user.csv > user_file.csv

cat yelp_training_set_business.csv yelp_test_set_business.csv > business_file.csv

5) Run GraphChi GENSGD
a) Prepare a file named yelp_training_set_review.csv\:info with the following 2 lines:
%%MatrixMarket matrix coordinate real general
51082 12742 229907 
And a second file named yelp_test_set_review.csv\:info with the following 2 lines:

%%MatrixMarket matrix coordinate real general
51082 12742 22956
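
If you prefer to generate the two :info files instead of typing them, here is a minimal sketch. The helper name write_info_file is made up for this example. It assumes a file system that allows ':' in file names (Linux or macOS), and the demo writes into a temporary folder rather than into the GraphChi folder.

```python
import os
import tempfile

def write_info_file(matrix_path, n_rows, n_cols, n_entries):
    """Write the two-line MatrixMarket header as '<matrix_path>:info'."""
    info_path = matrix_path + ":info"
    with open(info_path, "w") as f:
        # plain string, not a format string, so both '%' characters are written
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (n_rows, n_cols, n_entries))
    return info_path

# Demo in a scratch directory so nothing in the GraphChi folder is touched.
workdir = tempfile.mkdtemp()
info_path = write_info_file(
    os.path.join(workdir, "yelp_training_set_review.csv"), 51082, 12742, 229907)
with open(info_path) as f:
    info_lines = f.read().splitlines()
print(info_lines)
```

The three numbers are the number of users, the number of businesses and the number of lines in the review file, as listed above.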


b) First trial: run using reviews only (without user and business information)

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 
WARNING:  common.hpp(print_copyright:180): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
...
   1.22736) Iteration:   0 Training RMSE:    1.20862
   1.35923) Iteration:   1 Training RMSE:    1.19359
    1.4658) Iteration:   2 Training RMSE:    1.18219
   1.60173) Iteration:   3 Training RMSE:     1.1729
   1.70212) Iteration:   4 Training RMSE:    1.16528
   1.82876) Iteration:   5 Training RMSE:    1.15854
   1.94846) Iteration:   6 Training RMSE:    1.15233
    2.0595) Iteration:   7 Training RMSE:    1.14694
   2.19689) Iteration:   8 Training RMSE:    1.14202
    2.3059) Iteration:   9 Training RMSE:     1.1375
   2.41694) Iteration:  10 Training RMSE:    1.13332
   2.53653) Iteration:  11 Training RMSE:    1.12953
   2.65778) Iteration:  12 Training RMSE:    1.12591
   2.77716) Iteration:  13 Training RMSE:    1.12243
    2.9095) Iteration:  14 Training RMSE:    1.11911
   3.04893) Iteration:  15 Training RMSE:    1.11606
   3.18783) Iteration:  16 Training RMSE:     1.1132
   3.29783) Iteration:  17 Training RMSE:    1.11039
   3.43783) Iteration:  18 Training RMSE:    1.10765
   3.54825) Iteration:  19 Training RMSE:    1.10512
Found 16335 new test users with no information about them in training dataset!


c) second run: throw in user information

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 --user_file=user_file.csv
WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [yelp_training_set_review.csv]
[test] => [yelp_test_set_review.csv]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [6]
[rehash] => [1]
[gensgd_mult_dec] => [0.999999]
[quiet] => [1]
[file_columns] => [7]
[minval] => [0]
[maxval] => [5]
[clean_cache] => [1]
[gensgd_rate1] => [1e-3]
[gensgd_rate2] => [1e-3]
[gensgd_rate3] => [1e-3]
[gensgd_reg0] => [1e-3]
[gensgd_regw] => [1e-2]
[gensgd_regv] => [1e-1]
[max_iter] => [20]
[nshards] => [1]
[user_file] => [user_file.csv]

 === REPORT FOR sharder() ===
[Timings]
edata_flush: 0.063296s (count: 3, min: 0.01508s, max: 0.024157, avg: 0.0210987s)
execute_sharding: 0.168887 s
preprocessing: 0.530149 s
read_shovel: 0.008681 s
shard_final: 0.130108 s
shovel_flush: 0.008117 s
shovel_read: 0.006699 s
[Other]
app: sharder
Warning: missing: 5101 from node feature file: user_file.csv out of: 48978
   1.53665) Iteration:   0 Training RMSE:    1.35278
   1.77865) Iteration:   1 Training RMSE:    1.16203
   1.94927) Iteration:   2 Training RMSE:    1.11664
   2.11927) Iteration:   3 Training RMSE:    1.09443
   2.29663) Iteration:   4 Training RMSE:       1.08
   2.45677) Iteration:   5 Training RMSE:    1.06953
   2.60268) Iteration:   6 Training RMSE:    1.06094
   2.77053) Iteration:   7 Training RMSE:    1.05397
   2.93878) Iteration:   8 Training RMSE:    1.04816
   3.10066) Iteration:   9 Training RMSE:    1.04273
   3.28596) Iteration:  10 Training RMSE:    1.03828
   3.45064) Iteration:  11 Training RMSE:    1.03407
   3.64199) Iteration:  12 Training RMSE:    1.03018
   3.80989) Iteration:  13 Training RMSE:    1.02669
   3.99993) Iteration:  14 Training RMSE:    1.02326
   4.15758) Iteration:  15 Training RMSE:    1.02018
   4.36239) Iteration:  16 Training RMSE:    1.01738
   4.58039) Iteration:  17 Training RMSE:    1.01461
   4.73803) Iteration:  18 Training RMSE:    1.01198
   4.91018) Iteration:  19 Training RMSE:     1.0095
Found 16335 new test users with no information about them in training dataset!

d) third run: throw in also business information:


bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=6 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=7 --minval=0 --maxval=5   --clean_cache=1 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --gensgd_rate3=1e-3 --gensgd_reg0=1e-3 --gensgd_regw=1e-2 --gensgd_regv=1e-1 --max_iter=20 --nshards=1 --user_file=user_file.csv --item_file=business_file.csv  --gensgd_rate4=1e-2 --gensgd_rate5=1e-2

WARNING:  common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
...
Warning: missing: 5101 from node feature file: user_file.csv out of: 48978
Warning: missing: 16865 from node feature file: business_file.csv out of: 28402
   2.42267) Iteration:   0 Training RMSE:    2.18202
    3.0081) Iteration:   1 Training RMSE:    2.03203
   3.47768) Iteration:   2 Training RMSE:    1.90535
   4.07795) Iteration:   3 Training RMSE:    1.77818
   4.65206) Iteration:   4 Training RMSE:    1.67263
   5.23848) Iteration:   5 Training RMSE:    1.56425
   5.78475) Iteration:   6 Training RMSE:    1.47417
   6.38091) Iteration:   7 Training RMSE:    1.40324
   6.98647) Iteration:   8 Training RMSE:    1.34509
   7.57702) Iteration:   9 Training RMSE:    1.29067
   8.19608) Iteration:  10 Training RMSE:    1.24391
   8.80638) Iteration:  11 Training RMSE:    1.20927
   9.40638) Iteration:  12 Training RMSE:    1.18695
   9.94451) Iteration:  13 Training RMSE:    1.15484
   10.5267) Iteration:  14 Training RMSE:    1.13502
   11.1167) Iteration:  15 Training RMSE:    1.11524
   11.6525) Iteration:  16 Training RMSE:    1.10079
   12.2368) Iteration:  17 Training RMSE:    1.07966
   12.8187) Iteration:  18 Training RMSE:    1.06165
   13.4304) Iteration:  19 Training RMSE:    1.05136
Found 16335 new test users with no information about them in training dataset!

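Conclusions:
1) Including user properties lowers the training RMSE after 20 iterations from 1.105 (reviews only) to 1.010. With business properties added as well, the training RMSE after 20 iterations is 1.051: still better than reviews only, but the run starts from a much higher error (2.18) and is still dropping when it stops.
2) When adding additional features, you should be careful not to overfit. Note that all the numbers above are training RMSE, not test error.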
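
One common way to watch for overfitting is to hold out part of the training reviews and compare the error on the two parts. The sketch below is not part of gensgd: holdout_split and rmse are hypothetical helpers, and the 10% validation fraction is an arbitrary choice.

```python
import math
import random

def holdout_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle a copy of rows and split it into (train, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Stand-in rows; in practice these would be the lines of the review csv file.
rows = ["row%d" % i for i in range(20)]
train_rows, validation_rows = holdout_split(rows, validation_fraction=0.1)
print(len(train_rows), len(validation_rows))
print(rmse([4.0, 3.0], [5.0, 3.0]))
```

If the training RMSE keeps falling while the RMSE on the held-out rows rises, the extra features are probably being overfit.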


The output of gensgd is the file yelp_test_set_review.csv.predict

   1.4793704
N/A
   3.3002301
   2.8208445
   4.0713396
N/A
   3.1468302
   3.6243955

Here N/A marks a prediction for a new user. If you would like to have the global mean prediction instead, you can run with --cold_start=2
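
If you would rather post-process the prediction file yourself, here is a minimal sketch that parses lines like the ones above and substitutes a fallback value for N/A. The helper name fill_missing is made up, and the fallback of 3.5 is only a placeholder, not the actual mean of the dataset.

```python
def fill_missing(prediction_lines, fallback):
    """Parse prediction lines, replacing 'N/A' entries by fallback."""
    values = []
    for line in prediction_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        values.append(fallback if text == "N/A" else float(text))
    return values

sample = ["   1.4793704", "N/A", "   3.3002301", "   2.8208445"]
filled = fill_missing(sample, fallback=3.5)
print(filled)
```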

Next: I will soon post an update on the performance of GraphChi and on how to create the submission file from the GraphChi output.

Saturday, April 27, 2013

Incremental SVD

Here is an email I got from Prof. Magnasco from Rockefeller University, NY:

Hi Danny, I've seen some of your posts regarding lanczos and thought I'd ask you.

I have a problem where I need to compute the first few hundred eigenvectors/values in the PCA of a large, dense dataset, about ten thousand times five million. (It's an avi of a fluorescence microscopy experiment). For the larger datasets I might not even be able to hold the entire set in memory. The normal approach would be to compute the 10000^2 A*A' matrix and then diagonalize it, the problem being that this matrix entails 12 million dot products of 5M element vectors. So I was hoping to find a method that iteratively computes the eigenvectors without such explicit evaluations. Is Lanczos numerically stable enough for such sizes? 

Thanks, 

Marcelo

__________________________________________________________

Marcelo Magnasco     Box 212, 1230 York Avenue, NY NY10065
Professor and Head,      +1 212 3278542 f +1 212 3277422
Mathematical Physics Lab  http://sur.rockefeller.edu/Plone
The Rockefeller University        magnasco@rockefeller.edu




By a great coincidence, I just heard about a solution to the same problem from Byron Boots, a postdoc at UW:

The incremental SVD papers that I am thinking of have been written by Matthew Brand. He has several papers on this topic, but the one that I have been using the most is this one: http://www.merl.com/papers/docs/TR2006-059.pdf

Unfortunately I am not aware of software package which implements the Brand method. You will probably have to implement it yourself.

An additional resource I found is another paper by Brand:

Brand, M.E., Incremental Singular Value Decomposition of Uncertain Data with Missing Values, European Conference on Computer Vision (ECCV), Vol 2350, pp. 707-720, May 2002 (Lecture Notes in Computer Science)

Which has a better explanation of the setup.
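
To give a feeling for the idea, here is a simplified NumPy sketch of a column-by-column thin SVD update in the spirit of those papers. It is my own illustration, not Brand's full algorithm: it ignores missing values, assumes the first column is nonzero, and the function names and the toy rank-3 matrix are made up. Each step only needs the SVD of a small (k+1) x (k+1) matrix, so the full data matrix never has to be held in memory.

```python
import numpy as np

def append_column(U, s, V, c, rank):
    """Update a thin SVD A ~= U @ diag(s) @ V.T after appending column c to A."""
    p = U.T @ c                      # coordinates of c inside span(U)
    r = c - U @ p                    # part of c orthogonal to span(U)
    rho = np.linalg.norm(r)
    # If c already lies in span(U), there is no new direction to add.
    j = r / rho if rho > 1e-10 * np.linalg.norm(c) else np.zeros_like(r)
    k = s.size
    K = np.zeros((k + 1, k + 1))     # small core matrix [[diag(s), p], [0, rho]]
    K[:k, :k] = np.diag(s)
    K[:k, k] = p
    K[k, k] = rho
    Uk, sk, Vkt = np.linalg.svd(K)   # SVD of a (k+1) x (k+1) matrix only
    U_new = np.hstack([U, j[:, None]]) @ Uk
    V_ext = np.zeros((V.shape[0] + 1, k + 1))
    V_ext[:-1, :k] = V
    V_ext[-1, k] = 1.0
    V_new = V_ext @ Vkt.T
    return U_new[:, :rank], sk[:rank], V_new[:, :rank]  # truncate to rank

def incremental_svd(columns, rank):
    """Process columns one at a time, keeping only a rank-limited factorization."""
    columns = iter(columns)
    first = next(columns)
    norm = np.linalg.norm(first)
    U, s, V = (first / norm)[:, None], np.array([norm]), np.ones((1, 1))
    for c in columns:
        U, s, V = append_column(U, s, V, c, rank)
    return U, s, V

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))  # exact rank 3
U, s, V = incremental_svd((A[:, i] for i in range(A.shape[1])), rank=3)
exact = np.linalg.svd(A, compute_uv=False)[:3]
print(np.allclose(s, exact), np.allclose(U @ np.diag(s) @ V.T, A))
```

On real data that is not exactly low rank, truncating at every step makes the result an approximation, so the retained rank should be chosen with some margin.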

Wednesday, April 24, 2013

ACM KDD CUP 2013

The ACM KDD (Knowledge Discovery and Data Mining) 2013 conference will be held Aug 11-14 in Chicago. The annual KDD CUP competition is organized this year by Microsoft Research.
There are two tracks:
1) Identification of authorship of academic papers - track 1.
2) Author disambiguation - track 2.

Following the last couple of years, we hope to see some activity of users who are utilizing GraphLab for computing part of the solution. I will update more once we get interesting results to report.

Tuesday, April 23, 2013

Presto: distributed R framework from HP Labs

I got from my collaborator Aapo Kyrola the following pointer to Presto.
Presto is an interesting system which allows large scale computation in R by distributing the computational workload in a cluster. Presto implements distributed arrays and thus allows efficient implementation of linear algebra primitives like matrix-vector product.

The following two papers were recently published about Presto:

In those papers, a large number of applications were implemented in Presto, like K-means, ALS, PageRank, vertex centrality, shortest path and others. A large performance gain of 15x to 40x over Hadoop and Spark is demonstrated.

Unfortunately, it is not clear if Presto will be released as an open source project.

Thursday, April 18, 2013

Distributed Dual Decomposition (DDD) in GraphLab

Our collaborator Dhruv Batra from Virginia Tech has kindly contributed DDD code for GraphLab. Here is some explanation about the method and how to deploy it.
The full documentation is found here.

Distributed Dual Decomposition

Dual Decomposition (DD), also called Lagrangian Relaxation, is a powerful technique with a rich history in Operations Research. DD solves a relaxation of difficult optimization problems by decomposing them into simpler subproblems, solving these simpler subproblems independently and then combining these solutions into an approximate global solution.
More details about DD for solving Maximum A Posteriori (MAP) inference problems in Markov Random Fields (MRFs) can be found in the following:
D. Sontag, A. Globerson, T. Jaakkola. 
Introduction to Dual Decomposition for Inference. 
Optimization for Machine Learning, editors S. Sra, S. Nowozin, and S. J. Wright: MIT Press, 2011.

Running DDD

The input MRF graph is assumed to be in the standard UAI file format. For example a 3x3 grid MRF can be found here: grid3x3.uai.
The program can be run like this:
> ./dd --graph grid3x3.uai 
Other arguments are:
  • --help Display the help message describing the list of options.
  • --output The output directory in which to save the final predictions.
  • --dualimprovthres (Optional, default 0.00001) The amount of change in dual objective (in log-space) that will be tolerated at convergence.
  • --pdgapthres (Optional, default 0.1) The tolerance level for zero primal-dual gap.
  • --maxiter (Optional, default 10000) The maximum number of dual update iterations.
  • --engine (Optional, Default: asynchronous) The engine type to use when executing the vertex-programs
    • synchronous: All LoopyBP updates are run at the same time (Synchronous BP). This engine exposes greater parallelism but is less computationally efficient.
    • asynchronous: LoopyBP updates are run asynchronously with priorities (Residual BP). This engine has greater overhead and exposes less parallelism but can substantially improve the rate of convergence.
  • --ncpus (Optional, Default 2) The number of local computation threads to use on each machine. This should typically match the number of physical cores.
  • --scheduler (Optional, Default sweep) The scheduler to use when running with the asynchronous engine. The default is typically sufficient.
  • --engine_opts (Optional, Default empty) Any additional engine options. See --engine_help for a list of options.
  • --graph_opts (Optional, Default empty) Any additional graph options. See --graph_help for a list of options.
  • --scheduler_opts (Optional, Default empty) Any additional scheduler options. See --scheduler_help for a list of options.
Anyone who tries to run it - please let us know!

Wednesday, April 17, 2013

CLiMF Algorithm in GraphChi

I have some good news to report: last week we got a great contribution from Mark Levy (last.fm) for the GraphChi collaborative filtering toolkit. Mark has implemented the CLiMF algorithm, described in the paper: CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. Yue Shi, Martha Larson, Alexandros Karatzoglou, Nuria Oliver, Linas Baltrunas, Alan Hanjalic, Sixth ACM Conference on Recommender Systems, RecSys '12.

 CLiMF is a ranking method which optimizes MRR (mean reciprocal rank) which is an information retrieval measure for top-K recommenders. CLiMF is a variant of latent factor CF which optimises a significantly different objective function to most methods: instead of trying to predict ratings CLiMF aims to maximise MRR of relevant items. The MRR is the reciprocal rank of the first relevant item found when unseen items are sorted by score i.e. the MRR is 1.0 if the item with the highest score is a relevant prediction, 0.5 if the first item is not relevant but the second is, and so on. By optimising MRR rather than RMSE or similar measures CLiMF naturally promotes diversity as well as accuracy in the recommendations generated. CLiMF uses stochastic gradient ascent to maximise a smoothed lower bound for the actual MRR. It assumes binary relevance, as in friendship or follow relationships, but the graphchi implementation lets you specify a relevance threshold for ratings so you can run the algorithm on standard CF datasets and have the ratings automatically interpreted as binary preferences.
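
To make the metric concrete, here is a small sketch of the reciprocal rank computation and of the rating threshold described above. It only illustrates the measure, not the CLiMF optimization itself; the function names and toy scores are made up, and for brevity it does not filter out items the user has already seen.

```python
def reciprocal_rank(scores, relevant):
    """1/rank of the first relevant item when items are sorted by score (0.0 if none)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def relevant_items(ratings, threshold):
    """Binarize ratings: an item is relevant if its rating is at least threshold."""
    return {item for item, rating in ratings.items() if rating >= threshold}

def mean_reciprocal_rank(per_user):
    """Average reciprocal rank over (scores, relevant) pairs, one per user."""
    return sum(reciprocal_rank(s, r) for s, r in per_user) / len(per_user)

liked = relevant_items({"a": 5, "b": 3, "c": 4}, threshold=4)  # items a and c
users = [
    ({"a": 0.9, "b": 0.5, "c": 0.1}, liked),  # top item is relevant: 1.0
    ({"a": 0.2, "b": 0.8, "c": 0.1}, liked),  # second item is relevant: 0.5
]
print(mean_reciprocal_rank(users))
```

The threshold plays the same role as --binary_relevance_thresh: ratings at or above it count as relevant, everything else as not relevant.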

CLiMF-related command-line options:
 --binary_relevance_thresh=xx Consider the item liked/relevant if rating is at least this value [default: 0]
 --halt_on_mrr_decrease Halt if the training set objective (smoothed MRR) decreases [default: false]
 --num_ratings Consider this many top predicted items when computing actual MRR on validation set [default:10000]

Here is an example on running CLiMF on Netflix data:

./toolkits/collaborative_filtering/climf --training=smallnetflix_mm --validation=smallnetflix_mme --binary_relevance_thresh=4 --sgd_gamma=1e-6 --max_iter=6 --quiet=1 --sgd_step_dec=0.9999 --sgd_lambda=1e-6

  Training objective:-9.00068e+07
  Validation MRR:  0.169322
  Training objective:-9.00065e+07
  Validation MRR:  0.171909
  Training objective:-9.00062e+07
  Validation MRR:  0.172372
  Training objective:-9.0006e+07
  Validation MRR:  0.172503
  Training objective:-9.00057e+07
  Validation MRR:  0.172544
  Training objective:-9.00054e+07
  Validation MRR:  0.172549

I am very excited about this development - and I hope many more users will follow with additional contributions to our growing code base! Thanks Mark!!!

DARPA PPAML

I got the following DARPA call from Mike Draugelis, our man in Lockheed Martin:


Machine learning – the ability of computers to understand data, manage results, and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort.  Even a team of specially-trained machine learning experts makes only painfully slow progress due to the lack of tools to build these systems.
The Probabilistic Programming for Advanced Machine Learning (PPAML) program was launched to address this challenge. Probabilistic programming is a new programming paradigm for managing uncertain information. By incorporating it into machine learning, PPAML seeks to greatly increase the number of people who can successfully build machine learning applications and make machine learning experts radically more effective. Moreover, the program seeks to create more economical, robust and powerful applications that need less data to produce more accurate results – features inconceivable with today’s technology.

And here is the call abstract:
The goal of the PPAML program is to advance machine learning by using probabilistic programming to 1) dramatically increase the number of people who can successfully build machine learning applications, 2) make machine learning experts radically more effective, and 3) enable new applications that are impossible to conceive of using today’s technology. In support of this overarching goal, PPAML has a number of sub-goals. Specifically, the sub-goals are 1) to make machine learning model code shorter, 2) to reduce development time, 3) to facilitate the construction of richer models, 4) to require lower levels of expertise in building machine learning applications, and 5) to support the construction of integrated models.

Wednesday, April 10, 2013

The GraphLab Workshop - Why Should You Care?

Everyone knows that one of the hottest topics today is big data analytics. The GraphLab workshop is a "trade show" for all the significant graph analytics and graph database solutions. In one day you could learn more about the following systems (a preliminary list!):


Featured Projects

Google’s Pregel is their Bulk Synchronous graph framework. Prof. Vahab Mirrokni is going to give an oral talk about graph processing @ Google.
Apache Giraph is the open source equivalent system to Google’s Pregel. Dr. Avery Ching, one of Giraph contributors, will give a talk about large scale graph processing @ Facebook.
Dr. Pankaj Gupta, the creator of Cassovary Graph Processing system @ Twitter will give a talk about Who To Follow (WTF) service in Twitter.
Naiad is a parallel data flow framework from Microsoft with the focus of incremental computation. Dr. Derek Murray from Microsoft Research will present Naiad.
Intel GraphBuilder is software for creating graphs out of raw data, utilizing Hadoop for parallel graph creation. Dr. Theodore Willke from Intel Labs will present Intel Labs' work in this domain.
GraphLab is CMU+UW open source graph processing system, which supports both bulk synchronous parallel as well as asynchronous computation. Prof. Carlos Guestrin will present the latest GraphLab project.
Allegro Graph is a high performance graph database with RDF support. Jans Aasman, the CEO of Franz, will give a demo of their newest graph database.
Combinatorial BLAS is a distributed memory parallel graph library from LBNL/UCSB. Dr. Aydin Buluc will present comb-BLAS.
Grappa is a distributed graph processing framework using commodity processors, from The University of Washington. Prof. Mark Oskin will present Grappa.
Presto is a distributed framework from HP Labs for speeding up R computations. Shivaram Venkataraman from Berkeley and Kyungyong Lee will present Presto.
Titan is a distributed graph database. Dr. Matthias Broecheler will present Titan.
Neo4j is an open source distributed graph database in Java. Alex Averbuch from neo4j will present neo4j.
Infinite Graph from Objectivity is a distributed graph database.
DEX is a high performance and scalable graph database system. Norbert Martinez will present DEX.
YarcData, a Cray spinoff is creating customized hardware solutions for ultra fast graph processing.
Systap LLC is a startup working on speeding up graph algorithms using GPUs. Bryan Thompson from Systap will present preliminary results of applying the gather apply scatter model on GPU.
Linked Data Benchmark Council (LDBC) is a new EU FP7 project that aims to establish industry cooperation on graph database benchmarks, benchmark practices and benchmark results. Dr. Alex Averbuch (Neo Technologies), Norbert Martinez (Polytechnic University of Catalonia) and Dr. Andrey Gubichev (Technical University of Munich) will present LDBC.
Other notable talks at the GraphLab workshop:
Trifacta is the hottest Bay Area startup out there, started by Prof. Joe Hellerstein from Berkeley and Prof. Jeffrey Heer from Stanford. Prof. Joe Hellerstein will talk about Productivity for Data Analysts: Visualization, Intelligence and Scale.
Dr. Lei Tang from Walmart Labs will talk about adaptive user segmentation for collaborative filtering.
Alpine Data Labs is a Greenplum spinoff focusing on big data analytics. Steven Hillion will describe a case study of big data analytics on top of Hadoop.

Stay tuned - additional talks and demos are going to be added soon!

Sunday, April 7, 2013

Facebook graph benchmark system

I just heard from Carlos Guestrin about a new graph benchmark system from Facebook called LinkBench. I will be happy to hear when anyone tries it out...

Additionally, I got from Nilesh Jain from Intel Labs a link to LDBC, an EU project for promoting graph benchmarks in industry. I was not aware of this project, but they are organizing the GRADES workshop I previously wrote about. And guess who is giving the keynote talk at the GRADES workshop? You may have guessed right - Carlos Guestrin is going to talk about GraphLab. The GRADES workshop will take place June 23rd in NY.

Insider ML Jobs

In this blog post I will publish some open ML positions relating to big data analytics I got from my contacts. Those positions are not public yet and are published first in this blog.

Postdoctoral fellow in biomedical informatics for 1 year; PhD in CS, bioinformatics, EE, physics or statistics. Emphasis on applications of Big Data to medicine. The fellow will be involved in projects which mine the electronic health records of the Veterans Affairs system.

For details contact Alon Ben-Ari, MD, department of anesthesiology VA Puget Sound Seattle, Washington.





Stay tuned - more jobs to be posted soon..