Monday, November 23, 2015

Veles: deep learning by Samsung

I got this from my friend Assaf Araki from Intel: Veles is a new project by Samsung for distributed deep learning.

Interestingly, their first blog post says: "Vadim Markovtsev (main developer) and Gennady Kuznetsov (project leader) left Samsung to another company about 2 month ago. It is difficult to work with splited command, but we are not stoping Veles developing. We are using Slack to sharing ideas, problems and news. We are slowdown a little, but we will catch up."

Monday, November 9, 2015

TensorFlow: Google releases a new ML library for deep learning

I got this yesterday from both Assaf Spanier and Guy Rapoprt: Google is announcing the release of their TensorFlow library. The main use case is deep learning. It has a Python interface and supports multiple GPUs.

A recent benchmark shows that TensorFlow is rather slow compared to Torch.

Sunday, November 8, 2015

Deep learning for art!

I got this from my colleague Chris Dubois: DeepArt is an application which combines an image with an artist's style to create new artwork. For example:


The application was created by Łukasz Kidziński & Michał Warchoł, based on the research paper 'A neural algorithm of artistic style' by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Here are some images from the paper:

Wednesday, October 28, 2015

News from Stockholm: HopsWorks

A couple of weeks ago I visited Stockholm, where I was kindly invited to give a keynote at the SICS Data Science Day. Thanks again to Prof. Seif Haridi for his kind invitation!

One of the interesting lectures there was by Prof. Jim Dowling, previously from MySQL. Jim is building an open source system called HopsWorks to improve the Hadoop experience. Jim has kindly answered my questions about the project.

When did the project start?
Hops started in 2011, we started with HDFS. In 2013, we started YARN. In 2014, we started HopsWorks. So, it's about 14 months in development now.

What is the project goal?
The project goal is to make Hadoop for humans. We want to make it easier for people to store, process, and share data in Hadoop. That means moving away from the command line to graphical interfaces for non-programming tasks. Everything from managing access to data to sharing datasets should be accessible for people who are not data engineers.

What is the project license?
The project is licensed as a mix of Apache v2 and GPL v2. Our extensions to Hadoop are Apache v2 licensed, but we have connectors to the NewSQL database that we support (NDB - MySQL Cluster), and they have to be licensed in GPL v2. Because of the mixed licensing model, we don't provide a single distribution. However, users can install HopsWorks with 6 mouse clicks using our recommended installation tool,

Who is using HopsWorks?
We have had most interest from companies and organizations with sensitive datasets that require sharing those datasets in a controlled manner with users. So, Ericsson are interested in enabling non-Ericsson employees to do analysis on some of their datasets without requiring the long process of signing NDAs. As HopsWorks has a data scientist role (who cannot import or export data from the system), they could provide access to external data scientists, knowing they have an audit trail for actions by the external users and that the external users cannot download the dataset or derived data from the cluster. In the area of genomics, we have a lot of interest as well.
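
To make the idea of a restricted role plus an audit trail concrete, here is a minimal Python sketch. It is my own illustration and not HopsWorks code: the role names, the permission sets, and the `Project` class are all hypothetical, and a real system would enforce this inside the cluster services rather than in a single Python object.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping from role to allowed actions (an assumption for
# illustration only; this is not the actual HopsWorks permission model).
ROLE_PERMISSIONS = {
    "data_owner": {"read", "analyze", "import", "export"},
    "data_scientist": {"read", "analyze"},  # may not import or export data
}


@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)    # user -> role
    audit_log: list = field(default_factory=list)  # one entry per request

    def add_member(self, user, role):
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.members[user] = role

    def request(self, user, action, dataset):
        """Decide whether `user` may perform `action`, and record the attempt."""
        role = self.members.get(user)
        allowed = role is not None and action in ROLE_PERMISSIONS[role]
        # Denied requests are logged as well, so the owner sees every attempt.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "dataset": dataset,
            "allowed": allowed,
        })
        return allowed


project = Project("shared-analysis")
project.add_member("external_bob", "data_scientist")
can_analyze = project.request("external_bob", "analyze", "dataset_a")
can_export = project.request("external_bob", "export", "dataset_a")
```

In this sketch the external user can analyze the dataset but the export request is refused, and both attempts end up in `project.audit_log`.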

Can you share performance numbers?
I don't have figures for Sentry's performance. The figures I showed were for the state-of-the-art policy enforcement points (XACML). Sentry is trying not to do any enforcement and is basically sending all of its rules to all of the services to be cached there (HDFS, Solr, Impala, etc). My guess is that Sentry itself can still only handle a few hundred ops/sec. The main problem it has is how to keep the privileges and the data consistent. I don't see how they can do that for all Hadoop services.
Here's Cloudera's own report on the slowdown of turning on Sentry for Solr (it leads to a 20% slowdown for Solr - even with most privileges being stored in Solr):
They admit that Sentry "doesn't scale to store document-level [privileges]", so they store policies in Solr instead (breaking the assumption that Sentry is the central store for all policies, i.e. privileges).

Can you share a video of your talk?
The talk from last week is up:

And here are some screenshots:

Thursday, October 22, 2015

Apache Zeppelin is Picking Up!

A few months ago, I wrote about Apache Zeppelin. Yesterday I visited SICS and met with Jim Dowling. He has an interesting open source project named HopsWorks (I plan to write more about it soon!). Anyway, HopsWorks is using Zeppelin and I saw a very interesting demo of its functionality. According to Jim, Zeppelin is really picking up. He sent me the following resources which indicate the rising popularity of Zeppelin:

In Microsoft Azure:


In Hortonworks:

In Cloudera:

In HopsWorks :)

Sunday, October 18, 2015

IEEE ParLearning '2016 workshop announced

My colleague Yinglong Xia from IBM has kindly invited me to serve on the program committee of the 2016 IEEE Parallel Learning Workshop. The workshop will take place May 27, 2016 in Chicago. The submission deadline is January 15, 2016. Submissions of original work in the area of parallel and distributed machine learning systems are encouraged!

Wednesday, October 14, 2015

Cloud analytics front is heating up

Last week I attended Amazon re:Invent, where Amazon announced at the keynote their QuickSight tool for visual summary of data. Using QuickSight it is possible to get some first-order statistics about the data (e.g. histograms), do some plotting, and look at some slices of the data using a visual interface.
See minute 32 of the below video:

Not surprisingly, Google just released a similar product called Cloud Datalab.

Many people ask me how those tools differ from GraphLab Create - while we do have a basic visual interface, we are focused on high-end machine learning models on top of the data and not just on visualizing it.

IBM has also joined the game by providing a version of their oldie SPSS on top of Bluemix.