Large Scale Machine Learning and Other Animals: October 2015

Wednesday, October 28, 2015

News from Stockholm: HopsWorks

A couple of weeks ago I visited Stockholm, where I was kindly invited to give a keynote at the SICS Data Science Day. Thanks again to Prof. Seif Haridi for his kind invitation!

One of the interesting lectures there was by Prof. Jim Dowling, previously at MySQL. Jim is building an open source system called HopsWorks to improve the Hadoop experience. Jim has kindly answered my questions about the project.

When did the project start?
Hops started in 2011, beginning with HDFS. In 2013, we started on YARN. In 2014, we started HopsWorks. So, it's about 14 months in development now.

What is the project goal?
The project goal is to make Hadoop for humans. We want to make it easier for people to store, process, and share data in Hadoop. That means moving away from the command line to graphical interfaces for non-programming tasks. Everything from managing access to data to sharing datasets should be accessible to people who are not data engineers.

What is the project license?
The project is licensed as a mix of Apache v2 and GPL v2. Our extensions to Hadoop are Apache v2 licensed, but we have connectors to the NewSQL database that we support (NDB - MySQL Cluster), and they have to be licensed under GPL v2. Because of the mixed licensing model, we don't provide a single distribution. However, users can install HopsWorks with 6 mouse clicks using our recommended installation tool, www.karamel.io.

Who is using HopsWorks?
We have had the most interest from companies and organizations with sensitive datasets that need to share those datasets with users in a controlled manner. For example, Ericsson is interested in enabling non-Ericsson employees to do analysis on some of its datasets without requiring the long process of signing NDAs. As HopsWorks has a data scientist role (who cannot import or export data from the system), they could provide access to external data scientists, knowing they have an audit trail for actions by the external users and that the external users cannot download the dataset or derived data from the cluster. In the area of genomics, we have a lot of interest as well.

Can you share performance numbers?
I don't have figures for Sentry's performance. The figures I showed were for state-of-the-art policy enforcement points (XACML). Sentry is trying not to do any enforcement and is basically sending all of its rules to all of the services to be cached there (HDFS, Solr, Impala, etc.). My guess is that Sentry itself can still only handle a few hundred ops/sec. The main problem it has is how to keep the privileges and the data consistent. I don't see how they can do that for all Hadoop services.
Here's Cloudera's own report on the slowdown from turning on Sentry for Solr (it leads to a 20% slowdown for Solr, even with most privileges being stored in Solr):
http://www.slideshare.net/lucidworks/secure-search-using-apache-sentry-to-add-authentication-and-authorization-support-to-solr-presented-by-gregory-chanan-cloudera
They admit that Sentry "doesn't scale to store document-level [privileges]", so they store policies in Solr instead, breaking the assumption that Sentry is the central store for all policies (privileges).

Can you share a video of your talk?
The talk from last week is up:

And here are some screenshots:

Thursday, October 22, 2015

Apache Zeppelin is Picking Up!

A few months ago, I wrote about Apache Zeppelin. Yesterday I visited SICS and met with Jim Dowling. He has an interesting open source project named HopsWorks (I plan to write more about it soon!). Anyway, HopsWorks is using Zeppelin, and I saw a very interesting demo of its functionality. According to Jim, Zeppelin is really picking up. He sent me the following resources, which indicate the rising popularity of Zeppelin:

In Microsoft Azure:
https://channel9.msdn.com/Series/Azure-Data-Lake/Whats-up-with-Spark35-Jupyter-and-Zepplein-Notebooks

In AWS:
https://aws.amazon.com/elasticmapreduce/

In Hortonworks:
http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/

In Cloudera:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

In HopsWorks :)
www.hops.io

Sunday, October 18, 2015

IEEE ParLearning '2016 workshop announced

My colleague Yinglong Xia from IBM has kindly invited me to serve on the program committee of the 2016 IEEE Parallel Learning Workshop. The workshop will take place on May 27, 2016 in Chicago. The submission deadline is January 15, 2016. Submissions of original work in the area of parallel and distributed machine learning systems are encouraged!

Wednesday, October 14, 2015

Cloud analytics front is heating up

Last week I attended Amazon re:Invent, where Amazon announced in the keynote their QuickSight tool for visual summaries of data. Using QuickSight it is possible to get some first-order statistics about the data (e.g., histograms), do some plotting, and look at slices of the data using a visual interface.
See minute 32 of the video below:

Not surprisingly, Google just released a similar product called Cloud Datalab.

Many people ask me how those tools differ from GraphLab Create. While we do have a basic visual interface, we are focused on high-end machine learning models on top of the data and not just on visualizing it.

IBM has also joined the game by providing a version of its oldie SPSS on top of Bluemix.

Saturday, October 10, 2015

A funny coincidence

As you may know, I love Intel. I worked at Intel (my employee number started with 106, so you can guess how old I am), and I think they make the best CPUs.

However, I do have one complaint about the creativity of Intel's data science marketing team. A year ago we released a t-shirt at our annual conference (July 2014) which says the following:

Coincidentally, at Strata NY (Oct 2015), Intel's TAP group released the following t-shirt:

I wonder what is going on here??

Sunday, October 4, 2015

Do you really need big data?

An interesting blog post from my friend and colleague Guy Rapaport summarizes many interactions he has had with customers who are not sure what they actually need but do throw around a lot of buzzwords.
