Large Scale Machine Learning and Other Animals: August 2012

Wednesday, August 29, 2012

Rapid Miner & Myrrix

Another new thing I learned from Brad Cox, Rapid Miner software seemed to be heavily used in the homeland security / defense sectors.

Here is a short introductory video.

It seems it does not scale well to large models but have an excellent UI which helps visualize the data.

As a follow up, I got this from Dave Laxer:

Did you know about RADoop? RADoop = RapidMiner + Hadoop.

http://www.radoop.eu/

It seems it is an effort to scale rapid miner to larger models.

Slightly related, I got a link to Myrrix from Charles Martin. It seems to be a recommendation engine on top of Mahout headed by Sean Owen.

Big Math Library on top of GraphLab/GraphChi

Recently I am toying with a big math library around GraphLab / GraphChi with a "matlab" like interface that can scale to very large problems.

Today I finished implementing SVD (Lanczos) algorithm for both GraphLab/GraphChi.

Why do we need a big math library? If your problem is small and you can compute it quickly in Matlab than you are done. However, if the data is big, and Matlab can not even load the problem we are in trouble. That is why a scalable implementation is useful.

Let's start with a very simple example - Jacobi method for solving a linear system of equations. The code looks like this:

for (int i=0; i < max_iter; i++){

    x = (b - A*x)/A_ii;

}

A second example, is the alternating least squares (ALS) algorithm:

for (int i=0; i < max_iter; i++){

     V = A.backslash(U);

     U = A.backslash(V);

}

Another example, a little more fancy, is Lanczos iteration main loop:

    U[k] = V[k]*A._transpose();

    orthogonalize_vs_all(U, k, alpha(0));

    for (int i=k+1; i<n; i++){

      V[i]=U[i-1]*A;

      orthogonalize_vs_all(V, i, beta(i-k-1));

      U[i] = V[i]*A._transpose();

      orthogonalize_vs_all(U, i, alpha(i-k));

}

    V[n]= U[n-1]*A;

    orthogonalize_vs_all(V, n, beta(n-k-1));

In each iteration of Lanczos, the vector U[i] is left multiplied with the matrix A and then left multiplied with the matrix A'. Namely, U[i+1] = U[i]*A*A'.

What is nice about it, is the underlying implementation.
In GraphChi: the matrix A is stored on disk and read in parts to perform the multiplication. It is never read fully into memory. And thus we can scale to very large models.

In GraphLab: the matrix A is distributed across multiple machines so the product is a distributed operation. Also the vectors V[i] and U[i] are spanned across multiple machines.

The user who writes the high level code never needs to bother where is the data (on disk? in the network? ) and can write high level math primitives like in Matlab.

So where is the catch? I must admit there is a lot of ugly code which is needed to implement the high level functionality. However, the user does not have to actually worry about it and can enjoy the improved functionality.

Tuesday, August 28, 2012

Airlines on time performance dataset

I got the following interesting dataset link from Brad Cox :

The ASA challenge site has two decades of flight data totaling many GB. Data contains year, month, day, origin, carrier, destination, delay, etc. Goal is to determine what factors (day, carrier, destination, etc) best predicts the delay time.

From ASA website:

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.

The aim of the data expo is to provide a graphical summary of important features of the data set. This is intentionally vague in order to allow different entries to focus on different aspects of the data, but here are a few ideas to get you started:

When is the best time of day/day of week/time of year to fly to minimise delays?

Do older planes suffer more delays?

How does the number of people flying between different locations change over time?

How well does weather predict plane delays?

Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?

You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD).

Next: how to use GraphChi for computing predictions on this dataset.

Saturday, August 25, 2012

On convergence of Gaussian BP

I got the following question from Gregor Gorjanc, our man in Lubliyana:

Dear Danny,

We have spend quite some time trying out several things to use GaBP with the models we work with. Unfortunatelly, we can never make it work without regularization. It seems that GaBP is in particular sensitive to models where graphs have V like DAG structures, i.e., some nodes having two parental nodes. For example we work with the model describing genetic inheritance and in essence we model this process between a trio of two parents and a child. In what follows I will describe the model (just a tiny example) and show you some examples that appear to give some insight with which types of model GaBP is having problems.

Say we have two parents (1 and 2) and their children (3 and 4). For simplicity assume we observe phenotypic value (y) on all four individuals. The aim of the model is to infer genetic value (g) for each individual based on phenotypic values and pedigree links. The generic model (I have skipped some parts) I work with is:

y_i = g_i + e_i

g_i = 1/2 * g_f(i) + 1/2 * g_m(i) + w_i

e_i ~ N(0, 4)

w_i ~ N(0, 2)

where:

y_i = phenotypic value on individual i

g_i = genetic    value of individual i

e_i = phenotypic residual

g_f(i) = genetic    value of father of individual i

g_m(i) = genetic    value of mother of individual i

w_i    = genetic residual

The full model graph for this family (see attached files namedfamily_full*) would be:

g_1 = w_1

g_2 = w_2

g_3 = 1/2 * g_1 + 1/2 * g_2 + w_3

g_4 = 1/2 * g_1 + 1/2 * g_2 + w_4

y_1 = g_1 + e_1

y_2 = g_2 + e_2

y_3 = g_3 + e_3

y_4 = g_4 + e_4

With the LHS of a system of equations equal to:

## [1,]  5  2 -2 -2

## [2,]  2  5 -2 -2

## [3,] -2 -2  5  .

## [4,] -2 -2  .  5

isDiagDom(x=LHS)

## $isDiagDomMatrix

## [1] FALSE

spectralRadius(x=LHS)

## $spectralRadius

## [1] 1.02

## $spectralRadiusSmallerThan1

## [1] FALSE

This very simple model does not converge. So I attempted to build smaller models to understand what is going on. The first attempt was to take one parent and one child. This model (see attached files named family_one-one*) is now:

g_1 = w_1

g_3 = 1/2 * g_1 + w_3

y_1 = g_1 + e_1

y_3 = g_3 + e_3

With the LHS of a system of equations equal to:

## [1,]  3.67 -1.33

## [2,] -1.33 3.67

isDiagDom(x=LHS)

## $isDiagDomMatrix

## [1] TRUE

spectralRadius(x=LHS)

## $spectralRadius

## [1] 0.36

## $spectralRadiusSmallerThan1

## [1] TRUE

Since spectral radius is smaller than 1 this model converges to exact marginal means as well as variances! Now I added another child node (note we have now A like structure of a DAG) so the model (see attached files named family_one-two*) is now

g_1 = w_1

g_3 = 1/2 * g_1 + w_3

g_4 = 1/2 * g_1 + w_4

y_1 = g_1 + e_1

y_3 = g_3 + e_3

y_4 = g_4 + e_4

With the LHS of a system of equations equal to:

## [1,]  4.333333 -1.333333 -1.333333

## [2,] -1.333333  3.666667  .       

## [3,] -1.333333  .         3.666667

isDiagDom(x=LHS)

## $isDiagDomMatrix

## [1] TRUE

spectralRadius(x=LHS)

## $spectralRadius

## [1] 0.47

## $spectralRadiusSmallerThan1

## [1] TRUE

Which again converges to exact marginal means as well as variances. Now I created a model with two parents and one child (note we have now V like structure of a DAG) so the model (see attached files named family_two-one*) is: 

g_1 = w_1

g_2 = w_2

g_3 = 1/2 * g_1 + 1/2 * g_2 + w_3

y_1 = g_1 + e_1

y_2 = g_2 + e_2

y_3 = g_3 + e_3

With the LHS of a system of equations equal to:

## [1,]  4  1 -2

## [2,]  1  4 -2

## [3,] -2 -2  5

isDiagDom(x=LHS)

## $isDiagDomMatrix

## [1] TRUE

spectralRadius(x=LHS)

## $spectralRadius

## [1] 0.77

## $spectralRadiusSmallerThan1

## [1] TRUE

This also converges to exact marginal means but not for variances.

From the above tests it seems that GaBP:

- underestimates variances when V like structures appear in the model (example family_two-one) --> in the Markov network this structure is a small loop between the three involved nodes

- fails to converge when there are overlapping loops in Markov network, i.e., due to two V structures sharing parental nodes (example family_full)

Is seems that this shows why GaBP does not work with our applications as above structures are highly abundant in our models. Do you see any value in these observations?

Attached is a set of files for each example and a small shell script that show how GaBP from GraphLab was being used - only basic options were used deliebrately to understand what was going on.

./GaBP.sh family_one-one 0 0 > family_one-one.log 2>family_one-one.screen

./GaBP.sh family_one-two 0 0 > family_one-two.log 2>family_one-two.screen

./GaBP.sh family_two-one 0 0 > family_two-one.log 2>family_two-one.screen

./GaBP.sh family_full    0 0 > family_full.log    2>family_full.screen

Regards, Gregor

My Answer:

Hi Gregor,

It seems you reverse engineered the known convergence conditions of Gaussian BP (and generally BP).

1) For a graph which is a tree, BP converges to the exact result, namely both the means and the variances are exact.

2) A known result by Yair Weiss (Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 2000.) is that for graphs with a single cycle, BP always converges to a unique fixed point. In this case the mean is exact.

3) When the algorithm converges (for a graph with cycles), the mean is exact, but the variance is an underestimation of the true variance. (Weiss, Yair (2000). "Correctness of Local Probability Propagation in Graphical Models with Loops". Neural Computation 12 (1): 1–41.)
See also the great explanation by Jason K. Johnson: Malioutov, Johnson, Willsky. Walk-sums and belief propagation in Gaussian graphical models, Journal of Machine Learning Research, vol. 7, pp. 2031-2064, October 2006.

I believe all your conclusions are correct, maybe except for the two V structures, where the algorithm may converge depends on the matrix values...

Tuesday, August 21, 2012

Drinks & Data Workshop

I stumbled upon this funny note on the web. In seems that this "Drink & Data" workshop is about graph algorithms, and the way they justify their event is by citing the success of the GraphLab workshop:

Anyway, they managed to convince me that graphs are hot :-)

Wednesday, August 15, 2012

Steffen Rendle - libFM

News from the KDD CUP workshop. I was highly impressed by Steffen Rendle, the author of libFM collaborative filtering library. Steffen won the 2nd place in track1 and the 3rd place in track2. Unlike our team who had around 15 people, and the Taiwanese team who had around 20 people Steffen worked alone, and got a great rating in BOTH tracks.

What is nice about Steffen's work, is that he is using only a SINGLE algorithm, and not an ensemble of methods as typically deployed. The trick is the he does very smart feature engineering to create a very good feature matrix. Once he gets the representative feature matrix he uses the libFM algorithm.

I asked Steffen to explain the essense of the method:

A is the input (design) matrix where each row is a case and each column a (real-valued) predictor variable. I.e. the same way of feature engineering as in other standard ML algorithms such as linear/ polynomial regression, SVM, etc. Internally, the FM model works similarly as a polynomial regression, i.e. it contains all pairwise interactions between any two variables in A. The important difference to polynomial regression is that the model parameters for variable interactions are factorized. This is the reason why FMs perform well in problems such as recommender systems.

Some of the recent notable work of Steffen is a caching method for caching ALS computation, that according to Steffen, significantly speeds up ALS computation and makes it a lighter algorithm like SGD. The work is described in his recent SIGIR 2011 paper.

A second interesting work is an online matrix factorization computation described in the paper: Steffen Rendle, Lars Schmidt-Thieme (2008): Online-Updating Regularized Kernel Matrix Factorization Models forLarge-Scale Recommender Systems, in Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys 2008), ACM.
When new users/ items are added into the system, only an incremental update is computed.

Finally, Steffen gave a very detailed tutorial in KDD, about a whole bunch of matrix factorization methods and the relations between them. I find the tutorial a good overview of the connections between algorithms, however it is intended for intermediate user level who actually master some of the algorithms in this domain.

As you may know, we have a very crude and preliminary libFM implementation in GraphLab. libFM algorithm implementation contains a subset of the full libFM functionality with only three predictions: user, item and time. Users are encouraged to check the original libFM library for enhanced implementation. libFM library has a track record performance in KDD CUP and is highly recommended collaborative filtering package.

80-20 rule - or collaborative filtering in practice

Many papers are written about new collaborative filtering methods and approaches. After reading tens of papers and talking to tens of companies I have some insight I would like to share here. I call it the 80-20% collaborative filtering rule. My guess that this rule is a common knowledge for people who are working on collaborative filtering, but I did not see it written or discussed anywhere, so I thought it will be useful to bring it to light.

I will start with an example, but I believe that the rule holds generally across many domains. I just got back from the KDD cup workshop in Beijing. The first presentation by Tianqi Chen from SJTU who won the first place in track 1. Here is a summary of the results obtained by his team:

Note, that the basic method, item based collaborative filtering got a score of 34.0 %
While a combination of an additional 7 methods got a score of 42.7 %. Now divide
34.0 / 42.7 = 0.796. Namely, the basic method has 79.6% accuracy of the most sophisticated method! And of course it is way simpler.

Now let's jump to the results of the second track. A team from the NUTA had the following result:

The simplest model (Naive bayes) got 0.776. Their best predicted model using all the fancy method is: 0.8069. Namely, the simple model gave 96% accuracy!

Anyway, I can continue on and on, checking other contest results like Netflix and KDD CUP 2011. The bottom line is that typically the simpler methods gives an accuracy of 80% or more.
From talking to many companies who are working on collaborative filtering, 95% of them are happy if they have the basic models working right.

Here is some quote from Netlix technical blog (bold marking is mine):

If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.

In other words, some of the fancy methods that justify publishing a research track paper, or winning an epsilon accuracy contest, are simply not worth the effort of practical deployment.

Wednesday, August 8, 2012

Gephi - nice graph visualization software

I got a recommendation for the Gephi software two independent sources: my CMU collaborator Jay Haijie and Marcos Sainz (ModCloth). Gephi is a free Graph visualization software. It has some cool properties as can be seen in the below video. It seems that many of Graph ranking algorithms like Pagerank and HITS are implemented inside and can be calculated on the loaded graphs. I am not sure to which data magnitude this software can scale - from my experience trying to visualize large graphs results in a mess...

Saturday, August 4, 2012

Data Wrangler

One of the most impressive lectures in our GraphLab workshop was given by Jefferey Heer from the HCI dept in Stanford. Data Wrangler is a visual tool for helping out cleaning large datasets - a time demanding task task which is often ignored when talking about machine learning algorithms. Using Data Wrangler it is posible to visually specify how to clean the data on a small sample of it and generate map/reduce or python scripts automatically that will run on the full dataset.

Here is a quick video preview (the full lecture will be online soon):
Wrangler Demo Video from Stanford Visualization Group on Vimeo.

Here is a link to the full paper.
By the way, my second advisor, Prof. Joe Hellerstein from Berkeley is also involved in this nice project.

Friday, August 3, 2012

Misc Updates

Noam Koenigstein and the team at MS have a new paper which describes Microsof's XBOX recommendation system. This is interesting since it gives us a glimpse about which recommendation systems are actually deployed in the industry. The paper will be presented next month in RecSys 2012.

To my lazy readers, the bottom line is that they are using a simple linear factorization, with a cdf link function. They take the Bayesian approach for computing inference. Implicit ratings are handled by adding random negative samples, as proposed by Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 502-511.

Aapo Kyorla reports a great buzz around his GraphChi project - his project was starred by 83 users, +1 by 31 users, 500 downloads of the C++ code base, and 250 downloads of the Java code base. Very impressive for a project launched 3 weeks ago.. It was also covered by MIT Technology Review. Here is an executive summary I wrote a couple of weeks ago.

I learned from Nimrod Priel's Data Science that Amazon EC2 has a new SSD disk instance.

I often get updates from people about some recent lectures. Luckily I filter those out since many of the lectures are both boring and do not say much. One notable lecture I liked last year, is Cloudera Josh Wills' lecture from last year big learning workshop. In a retrospective it should be viewed.. It discusses some common setbacks when you try to deploy machine learning in practice.

Thursday, August 2, 2012

Geo Deepdive

In MMDS I liked the lecture by Christopher Re from WISC about Geo Deepdive.
Geo Deepdive is a data mining application in the field of geology. The application digests a huge amount of technical reports and research papers, mines useful measurement information and
presents it using a cool UI. Here is a note I got from Chris:

We've started to get some videos up at www.youtube.com/HazyResearch. One in particular is this one (creepy Siri voice and all!):
More actual technical information (e.g., tutorials on how to build your ownsystem) should be up over the next 2-3 weeks. We'll also have had a handfulof geoscientists more intensely using the system by then.

Geodeepdive reminds me of the work done in quantiFind I wrote about a while back. It is a much needed application in the field of missing infrastructure and applications. I can imagine a lot of other use case besides of geology.

Wednesday, August 1, 2012

Notable book: Crossing the Chasm

I got from Joey Gonzalenz a warm recommendation about the book: Crossing the Chasm by Joeffrey A Moore. It is a book about how to market technology, considered like the bible by many. I started reading it and it seems to be a great book.

Especially I liked the following description of Innovators, technology enthusiasts (the marking in bold is mine)

They are the ones who first appreciate the architecture of your product and why it therefore has a competitive advantage over the current crop of products established in the marketplace. They are the ones who will spend hours trying to get products to work that, in all conscience, never should have been shipped in the first place. They will forgive ghastly documentation, horrendously slow performance, ludicrous omissions in functionality, and bizarrely obtuse methods of invoking some needed function—all in the name of mov- ing technology forward. They make great critics because they truly care.

It definitely reminds me some of the struggle of our GraphLab users to utilize our code.. ;-)
Luckily our code is FAST so we do not suffer from horrendously slow performance...

Large Scale Machine Learning and Other Animals