Large Scale Machine Learning and Other Animals: August 2014

Friday, August 29, 2014

Scalable data science training in Seattle

Together with the University of Washington in Seattle, we are setting up a full day of scalable data science training using GraphLab Create, on Wed Sept 17. Anyone who is interested is welcome to register here, and you are welcome to use discount code GLABER.

Thursday, August 21, 2014

Do you like "The Killings"? Dive into Seattle police data!

Here is an interesting blog post analyzing Seattle police data. I got it from Carlos Guestrin, our CEO.

Another interesting dataset is the Allstate insurance claims data from their Kaggle competition.

Wednesday, August 20, 2014

GraphLab Create helps analyze FCC network data

My collaborator Scott Kirkpatrick from the Hebrew University is using GraphLab Create to slice and dice a large corpus of FCC broadband network measurement data. Here are some beautiful resulting plots that illustrate network traffic from different aspects. The data is free, and anyone who wants to look at the code is welcome to email me; I will share the IPython notebook that generates those plots.

GraphLab Create's Boosted decision trees for Kaggle Bike Sharing Competition

My collaborator Jay Gu has just released a blog post which explains how to use Boosted Decision Trees in GraphLab Create to compete in Kaggle's Bike Sharing Competition. With this simple solution we reached 15th place on the leaderboard, out of 569 competitors!

Sunday, August 17, 2014

DataRobot raises 21M$ series A

Just got this from my collaborator Jay Gu: DataRobot, a Boston startup that is trying to automate data science, raises 21M$ in series A. According to this blog post, the investment was led by NEA, who also invested in Databricks (Spark) as well as GraphLab.

A related company is SparkBeyond, an Israeli startup that raised 4M$. They also automate data science, by automatically generating features and evaluating them using multiple algorithms.

Tuesday, August 12, 2014

Interesting paper from Dataiku about WSCD 2014

Dataiku recently won first prize at the Yandex WSCD 2014 competition. Here is a paper describing their methodology. Dataiku was recently present at our GraphLab Conference. They have a visual, Excel-like environment for data manipulation, cleaning and predictions.

Friday, August 8, 2014

Sparse K-means

I got the following recent paper from my collaborator Jay Gu: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data, from ICML 2014. Basically it is K-means with an L1 constraint on the cluster centers. The results are sparse cluster centers, which makes sense, for example, when clustering text documents.

A second relevant paper, which I got from my collaborator Yao Wu, is Web-Scale K-Means Clustering by Sculley from Google Pittsburgh. The paper uses mini-batches to speed up computation and achieves sparsity using projected gradient descent.
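To make the idea concrete, here is a rough sketch of mini-batch K-means with sparse centers. It is my own toy illustration, not code from either paper: the function names, the parameters (batch_size, iterations, l1_radius) and the choice of an exact sort-based projection onto the L1 ball are all assumptions made for the example. Each center is moved toward the points assigned to it with a per-center learning rate of 1/count, and then, optionally, projected back onto an L1 ball so that small coordinates become exactly zero.

```python
import random


def project_l1_ball(v, radius):
    """Euclidean projection of v onto the set {w : sum(|w_i|) <= radius}."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    # Find the soft-threshold theta from the magnitudes sorted in decreasing order.
    mags = sorted((abs(x) for x in v), reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, m in enumerate(mags, start=1):
        cumulative += m
        t = (cumulative - radius) / k
        if m - t > 0:
            theta = t
    # Shrink every coordinate toward zero by theta; small ones become exactly 0.
    return [max(abs(x) - theta, 0.0) * (1 if x > 0 else -1) for x in v]


def nearest(centers, x):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))


def minibatch_kmeans(points, k, batch_size=10, iterations=100,
                     l1_radius=None, seed=0):
    """Toy mini-batch K-means; l1_radius (hypothetical knob) enables sparsity."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Cache the assignments first, then apply the gradient steps.
        assigned = [nearest(centers, x) for x in batch]
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * xi
                          for c, xi in zip(centers[j], x)]
        if l1_radius is not None:
            centers = [project_l1_ball(c, l1_radius) for c in centers]
    return centers
```

Calling minibatch_kmeans(points, k=2, l1_radius=1.0) on a list of equal-length numeric lists returns two centers whose absolute values sum to at most 1.0. A smaller radius gives sparser centers, at the cost of pulling them further from the plain K-means solution.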

Tuesday, August 5, 2014

Misc News

Collaborative filtering tutorial by Netflix

I got this from my collaborator Alice Zheng, a lecture about collaborative filtering by Xavier Amatriain from Netflix at the MLSS summer school organized by Alex Smola at CMU:

Deep learning @ Spotify

I got the following from my colleague Zach Nation: an interesting blog post from Spotify about using convolutional neural networks to learn latent factors for collaborative filtering. And here is the related NIPS paper.

Cloud Service @ Databricks

Databricks recently announced their business model: a cloud service running Apache Spark.

Here is the keynote at the Spark Summit:

MapGraph: First Multi-GPU Graph Analytics System by Systap

I just heard from my colleague Bryan Thompson from Systap that they have recently released MapGraph: the first distributed graph analytics framework which supports GPUs. Here is their blog post giving additional details.
