Friday, August 21, 2015

Deep Learning Workshop - Nov 7 in SF

Just got this from Arshak Navruzyan from startup.ml. They are hosting a deep learning workshop Nov 7 in SF. You are welcome to use discount code Dato when registering.

Tuesday, August 18, 2015

Data Science Summit - Talk Videos Released

As you may know, I have been heavily involved in organizing the Data Science Summit, an event that brought together 1,000 data scientists and machine learning researchers this July in SF. We have recently released the talk videos from the event. In this blog post I will summarize some of the highlights.

The first talk you should watch, in case you missed it, is Prof. Carlos Guestrin's keynote, which summarizes what's new in Dato:

An interesting talk from Prof. Mike Franklin of Berkeley AMPLab about what's new in the lab:
Interesting to learn that Mesos, Tachyon and Spark have graduated into startups. What will be the next startup out of AMPLab? Mike mentions Velox, their predictive serving system, which competes with prediction.io (among others). KeystoneML is a library of machine learning pipelines. MLMatrix is a library for linear algebra operations on matrices. SampleClean is a project for involving humans in the data cleaning process.

A related talk by Prof. Seif Haridi (SICS) about Flink, a system geared towards stream processing:
Unlike Spark, which implements streaming as a series of small batches, Flink is designed to support continuous stream handling. He quoted very good performance numbers for Flink vs. Storm.
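To make the distinction concrete, here is a minimal pure-Python sketch (not actual Spark or Flink code) contrasting micro-batch processing with per-event processing of a stream:

```python
from typing import Callable, Iterable, Iterator, List

def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Spark-Streaming-style: buffer events into small batches first."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def per_event(stream: Iterable[int], handler: Callable[[int], None]) -> None:
    """Flink-style: hand each event to the operator as it arrives."""
    for event in stream:
        handler(event)

events = [1, 2, 3, 4, 5]

# Micro-batching groups events, adding up to one batch interval of latency.
print(list(micro_batches(events, 2)))  # [[1, 2], [3, 4], [5]]

# Continuous processing sees each event individually, with no batching delay.
seen = []
per_event(events, seen.append)
print(seen)  # [1, 2, 3, 4, 5]
```

The trade-off the talk touches on is latency: a micro-batch system cannot react to an event before its batch closes, while a continuous system can.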

Another interesting talk by Prof. Alex Smola attracted a big audience. Alex has recently formed a startup around his parameter server work. Unfortunately, we did not get permission to release his video yet. I am working on that.

Wes McKinney, the creator of Python's pandas, used our conference to announce Cloudera's new Ibis project, a way to parallelize Python code on top of a Hadoop cluster at scale.



A related lecture by Peter Wang, CEO of Continuum, about Dask - a different attempt to parallelize Python code. He also explored their visualization library Bokeh in detail.
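Dask's core idea is to represent a computation as a dictionary-shaped graph of small Python tasks and then execute that graph with a scheduler. A toy evaluator for such a graph can be sketched in plain Python (this illustrates the model; it is not Dask's scheduler):

```python
# A toy task graph in Dask's dictionary style: each key maps either to a
# literal value or to a tuple (function, *dependency_keys).
from operator import add, mul

graph = {
    "x": 1,
    "y": 2,
    "xy": (add, "x", "y"),     # xy = x + y
    "out": (mul, "xy", "xy"),  # out = xy * xy
}

def get(graph, key):
    """Recursively evaluate one key of the task graph."""
    task = graph[key]
    if isinstance(task, tuple):  # a computation: (func, *dependency_keys)
        func, *deps = task
        return func(*(get(graph, d) for d in deps))
    return task                  # a literal input value

print(get(graph, "out"))  # 9
```

A real scheduler would walk the same graph but run independent tasks in parallel across threads, processes, or cluster workers.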


Prof. Chris Re covered his DeepDive framework. He recently founded another exciting new startup around providing ML tools to a larger audience. For example, PaleoDeepDive allows mining complex information out of PDF papers (including NLP, mining tables, geographical coordinates, etc.).

Prof. Jeff Heer from Trifacta and the University of Washington presented his recent research on improving visualization, a joint research project with Tableau. Multiple layouts and options are explored, and a recommendation engine filters the results to present the most attractive and informative ones to the user.

Prof. Dhruv Batra from Virginia Tech described their visual question answering project, a cool project that answers free-text questions about images:



In the startup session, an interesting talk by Stephen Merity from Common Crawl:
Stephen describes the different cool things people do with their collected web data. For example, Stanford's GloVe project, which provides an alternative to word2vec; an analysis of the web for the price of a sandwich; and interesting work from Yelp on collecting US phone numbers from the web.
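Projects like GloVe and word2vec both produce word embeddings: vectors whose geometry captures word similarity, usually compared by cosine similarity. A tiny sketch with made-up 3-dimensional vectors (real GloVe vectors are 50-300 dimensions, trained on corpora such as Common Crawl):

```python
import math

# Toy "word vectors" - these numbers are invented for illustration only.
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product normalized by both vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words should sit closer together than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) >
      cosine(vectors["king"], vectors["pizza"]))  # True
```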

One of the most bizarre applications (in a good sense!) is from compology.us, a US company that monitors trash bins using sensors and uses GraphLab deep learning to detect trash levels and optimize pickup routes.

The last talk I want to highlight is the audience favorite: a talk by Amanda Cassari from Concur showing how to run GraphLab Create on top of Spark:



Saturday, August 15, 2015

GraphLab Create Training Day at Strata New York

We are setting up a full-day GraphLab Create tutorial on Sept 29 at Strata NY. The training topic is building and deploying large-scale machine learning applications. Feel free to use discount code PCDANNYB to get a 20% discount.

Wednesday, July 29, 2015

Scala training in SF

My Israeli colleague Tomer Gabel is giving a two-day Scala training in SF on Aug 11. My blog readers are welcome to use discount code BOLD200 to get $200 off.

A new graph partitioning algorithm at CIKM

We got the following email from Fabio, a graduate student at Sapienza University of Rome:

I'm Fabio Petroni, a Ph.D. student in Engineering in Computer Science at Sapienza University of Rome.

Together with other researchers, we recently developed HDRF, a novel stream-based graph partitioning algorithm that provides important improvements in partitioning quality over all existing solutions we are aware of.
In particular, HDRF provides the smallest average replication factor with close-to-optimal load balance. Together, these two characteristics allow HDRF to significantly reduce the time needed to perform computation on graphs and make it the best choice for partitioning graph data.

A paper describing the HDRF algorithm will be presented at the upcoming CIKM conference (http://www.cikm-2015.org) and is available at the following address (this is the final submitted version): http://www.dis.uniroma1.it/~midlab/articoli/PQDKI15CIKM.pdf
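The replication factor Fabio mentions is easy to compute: in a vertex-cut (edge) partitioning, each edge is assigned to exactly one partition, and a vertex must be replicated to every partition that holds one of its edges. A small pure-Python sketch of the metric itself (not of the HDRF algorithm):

```python
from collections import defaultdict

def replication_factor(edge_assignment):
    """edge_assignment: list of ((u, v), partition_id) pairs.
    Returns the average number of partition replicas per vertex."""
    replicas = defaultdict(set)
    for (u, v), part in edge_assignment:
        replicas[u].add(part)
        replicas[v].add(part)
    return sum(len(parts) for parts in replicas.values()) / len(replicas)

# A 4-vertex path graph split across 2 partitions: vertex 2 is cut
# (it has edges in both partitions), so it is replicated twice.
assignment = [
    ((1, 2), 0),
    ((2, 3), 1),
    ((3, 4), 1),
]
print(replication_factor(assignment))  # (1 + 2 + 1 + 1) / 4 = 1.25
```

A lower replication factor means less network communication during graph computation, which is why partitioners like HDRF optimize it directly.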




We will work with Fabio to include a version of his algorithm in our latest code base, GraphLab Create.

Tuesday, July 28, 2015

Some exciting developments at Dato

You may have missed our latest Dato blog post, so I wanted to shed light on two of the coolest released features:

It's particularly exciting to mention that GraphLab Create's integration with NumPy will effectively scale scikit-learn. Now with GraphLab Create and Dato Predictive Services, you can deploy existing scikit-learn models at scale as a RESTful predictive service by changing only a few lines of code. Very cool.
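At its core, a predictive service wraps a trained model behind a request/response interface: JSON features in, JSON predictions out. Here is a minimal sketch of that idea using only the standard library and a stub standing in for a trained scikit-learn model (the class name and toy rule are mine, not Dato's API):

```python
import json

class StubModel:
    """Stand-in for a trained scikit-learn model (same .predict() shape)."""
    def predict(self, rows):
        # Toy rule: classify by whether the first feature exceeds 0.5.
        return [1 if row[0] > 0.5 else 0 for row in rows]

model = StubModel()

def predict_endpoint(request_body: str) -> str:
    """What a RESTful predictive service does per request:
    parse JSON features, call model.predict, return JSON predictions."""
    features = json.loads(request_body)["features"]
    return json.dumps({"predictions": model.predict(features)})

print(predict_endpoint('{"features": [[0.9, 0.1], [0.2, 0.3]]}'))
# {"predictions": [1, 0]}
```

The "few lines of code" claim is about exactly this: since scikit-learn models already expose `predict`, the service only needs to marshal requests to and from that call.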



Dato Distributed now with distributed machine learning

# job distribution environments
# s = gl.deploy.spark_cluster.load('hdfs://…')
# h = gl.deploy.hadoop_cluster.load('hdfs://…')
e = gl.deploy.ec2_cluster.load('s3://…')

# set the distribution environment to my AWS cluster
gl.set_distributed_execution_environment(e)
Dato Distributed enables GraphLab Create users to execute Python code tasks in parallel on EC2, Spark, or Hadoop clusters. The snippet above shows how GraphLab Create can switch between these environments by changing one line of code. In GraphLab Create 1.5.1, Dato Distributed on Hadoop now seamlessly supports distributed execution of machine learning models including logistic regression, linear regression, the SVM classifier, label propagation, and PageRank. Distributed machine learning on EC2 and Spark is in the works.
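For readers unfamiliar with PageRank, one of the distributed algorithms mentioned above, here is what a single-machine power iteration looks like in plain Python (an illustration of the computation, not Dato's distributed implementation):

```python
def pagerank(edges, n, damping=0.85, iterations=20):
    """Power-iteration PageRank on an edge list of (src, dst) pairs.
    Assumes every node has at least one outgoing edge."""
    out_degree = [0] * n
    for s, _ in edges:
        out_degree[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        contrib = [0.0] * n
        for s, d in edges:
            # Each node splits its rank evenly among its out-edges.
            contrib[d] += rank[s] / out_degree[s]
        rank = [(1 - damping) / n + damping * c for c in contrib]
    return rank

# 3-page cycle: 0 -> 1 -> 2 -> 0. By symmetry, each page gets rank 1/3.
ranks = pagerank([(0, 1), (1, 2), (2, 0)], n=3)
print(all(abs(r - 1 / 3) < 1e-9 for r in ranks))  # True
```

The per-iteration structure (scatter rank along edges, then gather and re-weight) is exactly what makes the algorithm easy to distribute across a cluster.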
