Friday, May 15, 2015

Open Data Science Conference - May 30 in Boston


Tuesday, April 28, 2015

PowerLyra framework just got best paper award at Eurosys

Just got a note from Rong Chen from SJTU: PowerLyra, a PowerGraph extension that handles partitioning better in some cases, just won the best paper award at EuroSys. Nice work, Rong!

Saturday, April 25, 2015

Mediego - another personalization startup by a famous professor

Following the recent trend of well-known professors launching startups, I found out today about Mediego, a startup by Anne-Marie Kermarrec, a senior researcher at INRIA, France. Her best-known work relates to publish-subscribe and multicast networking.

It seems that Mediego is yet another personalized recommendation startup. Here is a short video:

Wednesday, April 22, 2015

Strangely, I was chosen as the #1 data scientist to follow!

My colleague Alice Zheng brought to my attention a blog post by Andy Green, who believes I am the #1 data scientist to follow. I swear I am not behind this :-)

Monday, April 20, 2015

Time Series Support in Spark

I connected with Sandy Ryza, a Senior Data Scientist at Cloudera who has recently been working on a library for time series analysis with Spark.
Here is a quick Q&A about his project.

1) When did you start the project?

Pretty recently. It looks like my first GitHub push was in the middle of February.
2) What is the goal of your project?

The goal is to provide a set of tools and abstractions that make it easy to deal with large scale time series data.  More concretely, a lot of this means basically mimicking the functionality of tried and true single-node libraries.  Especially in domains like finance, Pandas and Matlab dominate time series analysis.  The project seeks to provide an alternative that's well suited to a distributed setting.
3) Who are the contributors?

So far it's been a one-person project, though I've received interest from a bunch of different people, and I'm hoping to see some external contributions soon.
4) What are the challenges in supporting time series data? Which mechanisms are currently missing in Spark?

I think the main challenges are around finding the right abstractions for storage and analysis.  Single-node libraries can store a whole collection of time series in a single array and slice row or column-wise depending on the operation.  In a distributed setting, we need to think and make choices about how data is partitioned.  Good abstractions guide users towards patterns of access that the library can support in a performant way.
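To make the layout question concrete, here is a minimal plain-Python sketch (not the actual API of Sandy's library; the data and function names are hypothetical) contrasting the two access patterns. A "series-major" layout keeps each whole series together, so per-series operations are cheap and parallelizable, while time-wise cross-sections touch every record:

```python
# Hypothetical data: three daily price series keyed by ticker, laid out
# "series-major" -- one whole series per record. In a distributed setting
# each record could live on a single partition.
series_major = {
    "AAPL": [1.0, 1.1, 1.2, 1.3],
    "GOOG": [2.0, 1.9, 2.1, 2.2],
    "MSFT": [3.0, 3.1, 2.9, 3.2],
}

def per_series_mean(collection):
    # Each series is processed independently -- the access pattern a
    # distributed library can run in parallel without any shuffle.
    return {key: sum(vals) / len(vals) for key, vals in collection.items()}

def cross_section(collection, t):
    # Slicing time-wise touches every series -- in a series-major layout
    # this is the expensive pattern, since it crosses all partitions.
    return {key: vals[t] for key, vals in collection.items()}

print(per_series_mean(series_major))
print(cross_section(series_major, 0))
```

The design choice is exactly the trade-off Sandy describes: a single-node array can slice either way for free, while a distributed layout has to privilege one axis.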
5) Are you supporting a streaming model, a bulk model, or both?

The initial focus is on bulk analysis.  We're targeting uses like backtesting of trading strategies or taking all the data at the end of a day and fitting risk models to it.  I think we'll definitely see uses that demand a streaming approach in the future, so I'm trying to design the interfaces with that in mind.

6) Which aggregation mechanisms are currently supported?

As the library sits on top of Spark, all of Spark's distributed operations are close at hand.  This makes it really easy to do things like train millions of autoregressive models on millions of time series in parallel and then pull those with the highest serial correlation to local memory for further analysis.  Or group and merge time series based on user-defined criteria.
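As a rough illustration of the "compute a statistic per series, then pull the top ones local" pattern, here is a small plain-Python sketch (illustrative only, not the library's API; the series data is made up). In the real setting, the map over series would run in parallel across a Spark cluster:

```python
def lag1_autocorr(xs):
    # Sample lag-1 autocorrelation of a single series: a simple proxy
    # for how strongly serially correlated the series is.
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

# Hypothetical collection of series keyed by id; imagine millions of these
# distributed across a cluster rather than two in a local dict.
series = {
    "trending": [1, 2, 3, 4, 5, 6, 7, 8],
    "noisy":    [1, -1, 1, -1, 1, -1, 1, -1],
}

# Compute the statistic per series, then rank and pull the top keys local.
ranked = sorted(series, key=lambda k: lag1_autocorr(series[k]), reverse=True)
print(ranked)
```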
7) Which algorithms are currently supported?

So far I've been focusing on the standard battery of models and statistical tests that are used to model univariate time series.  They're the kinds of tools that are more likely to show up in a stats or econometrics course than a machine learning course: autoregressive models, GARCH models, and the like.
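For readers less familiar with these models, an AR(1) process models each value as a multiple of the previous one plus noise, and its coefficient can be estimated by least squares. A minimal sketch in plain Python (a textbook mean-zero formulation for brevity, not code from the library):

```python
def fit_ar1(xs):
    # Least-squares estimate of phi in x_t = phi * x_{t-1} + noise
    # (assuming a mean-zero series for brevity).
    num = sum(xs[t] * xs[t - 1] for t in range(1, len(xs)))
    den = sum(x * x for x in xs[:-1])
    return num / den

# A noiseless synthetic AR(1) series with phi = 0.8; the estimate
# recovers phi (up to floating-point error). With noise added, the
# estimate would only be approximately 0.8.
xs = [1.0]
for _ in range(20):
    xs.append(0.8 * xs[-1])
print(fit_ar1(xs))
```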
8) Which algorithms do you plan to add in the near future?

There's quite a ways to go to support all the time series functionality that shows up in single-node libraries: seasonal ARIMA models, Kalman filters, sophisticated imputation strategies. The other interesting bits come on the transformation / ETL side: joining and aligning time series sampled at different frequencies, bi-temporality.
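Aligning series sampled at different frequencies typically means resampling one onto the other's clock, e.g. by forward-filling the most recent observation. A small plain-Python sketch of that idea (illustrative only; the library's approach may differ):

```python
def align_forward_fill(base_times, other):
    # `other` is a sorted list of (time, value) samples at a coarser
    # frequency; for each time in base_times, carry forward the most
    # recent sample at or before that time (None if none exists yet).
    out, i, last = [], 0, None
    for t in base_times:
        while i < len(other) and other[i][0] <= t:
            last = other[i][1]
            i += 1
        out.append(last)
    return out

# Hypothetical example: a coarse series observed at times 0 and 3,
# aligned onto a finer clock ticking at times 0..5.
print(align_forward_fill([0, 1, 2, 3, 4, 5], [(0, 10), (3, 20)]))
```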

9) What is the project license?

Apache Version 2.0, which I think is the case for all Cloudera open source projects.
10) Which programming interfaces are supported? Scala? Java?

The library is written in Scala, but supports Java as well.  A few people have expressed interest in Python bindings, so I'd like to support those when I find the time.

Friday, April 17, 2015

Dato's CEO named a finalist for GeekWire CEO of the Year!

Dato CEO Carlos Guestrin
Carlos Guestrin, CEO of Dato: Who says academics can’t make the transition to entrepreneurship? Guestrin, the Amazon Professor of Machine Learning in Computer Science & Engineering at the University of Washington and a past GeekWire Geek of the Week, is looking to make his mark with Dato (formerly GraphLab). The Seattle startup, which raised $18.5 million earlier this year, is looking to help businesses make better sense of data. “Our company was founded on a mission to create a more intelligent world,” said Guestrin at the time of the financing.
Hailing from Brazil, Guestrin studied computer science at Stanford University before moving on to Carnegie Mellon where the concept for Dato was originally born. In 2012, he joined the University of Washington’s faculty before spinning the open-source project off into its own company.

Help us get Carlos elected by filling out this 5-second survey.

Saturday, April 11, 2015

Is Mahout revived?

Some interesting news from my colleague Alice Zheng: it seems that Mahout is abandoning the MapReduce paradigm and is now being reimplemented on top of Spark / H2O. A couple of years ago Mahout was a popular machine learning library on top of Hadoop, but recently the project has been stumbling as many key people have left.