Monday, March 2, 2015

SkyMind: A new machine learning startup to support deeplearning4j

Yesterday I connected with Adam Gibson and Chris Nicholson from SkyMind, a new startup built around support and maintenance of Deeplearning4j, one of the popular deep learning packages. To all the VCs who are reading this blog, please note that SkyMind is looking for funding.

What is SkyMind?

Skymind is the commercial support arm of Deeplearning4j, the distributed open-source deep-learning framework for the JVM. Adam Gibson created Deeplearning4j, and has spoken at Hadoop Summit, OSCon, Tech Planet and elsewhere. He's the author of the forthcoming O'Reilly book "Deep Learning: A practitioner's guide."

What is Deeplearning4j's license? What is SkyMind's business model?

Deeplearning4j, ND4J (our scientific computing library) and Canova (vectorization) are Apache-2.0 licensed, which gives users IP protection on derivative works they create with our software.

Skymind builds "Google brains" for industry. Our software works cross-platform from server to desktop to mobile, and handles every major data type: image, sound, time series, text and video. What Red Hat is to Linux, we are to Deeplearning4j. 

Which distributed systems does Deeplearning4j support? (Hadoop, Spark, YARN?)

YARN and Spark. We also allow users to create standalone distributed systems using Akka and AWS.

Can GPU mode run distributed? Can you support multiple GPUs? 

No InfiniBand yet, but it can do internode coordination and leverage GPUs via ND4J.

From your experience, what is the typical speedup of GPU vs. CPU?

We're finishing benchmarks now. We just implemented matrix caching and raw CUDA. We will have more numbers soon (we plan to benchmark GPU matrix caching with Spark).

What are the most powerful deep learning methods implemented in Deeplearning4j? What are their typical use cases?

Sentiment analysis for text (which has applications for CRM and reputation management); image and facial recognition, which has wide consumer and security applications; sound/voice analysis, which is useful for speech-to-text and voice search; time series analysis, which is useful for predictive analytics and anomaly detection in finance, manufacturing and hardware.
                  
Who is your target user? Do I have to be a deep learning expert?

The entry-level data scientist who needs to productionize an algorithm focused on unstructured data, where traditional feature engineering methods have fallen over. Familiarity with machine learning ideas will help, but it's not necessary to get started. We introduce most of the crucial ideas on our website.
                
Which programming language interfaces do you support?

Java/Scala right now. We'll have a Bash command-line interface that loads models via JSON.

There are a few other deep learning libraries like Theano and Caffe. Can you outline the benefits of Deeplearning4j (in terms of accuracy, speed or distribution)?

Caffe was created by a PhD candidate at Berkeley. It specializes in machine vision and is C/C++ based. Deeplearning4j is commercially supported, handles all data types (not just images), and is based on the JVM, which means it works easily cross-platform.

Theano/PyLearn2 is written in Python and likewise serves the research community. It is useful for prototyping and widely used, but most people who create a working net in Python need to rewrite it for production. Deeplearning4j is production-grade from the get-go.

Theano allows you to build your own nets, but the generated gradients can be slow. Theano is also harder to get up and running cross-platform. As for Caffe, we integrate better:

Theano and Caffe are released under a BSD license that does not include a patent claim and retaliation clause, which means they do not offer the same protections as Apache 2.0. 

What is the typical dataset size at which you find deep learning to be effective? How many images?

You don't need very much data for deep learning as long as you tune it right (dropout, rectified linear units, ...). It also depends on the problem you're solving. If you're training a joint distribution over images and text, for example, you may want more. For simple classification, you can get away with a more carefully tuned algorithm (that is, one more robust to overfitting).
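As a rough illustration of the two tuning techniques named in the answer, here is a minimal sketch of a rectified linear unit and of dropout in plain Python. It is not Deeplearning4j code (which runs on the JVM); the function names `relu` and `dropout`, the "inverted" rescaling of surviving activations, and the sample values are assumptions made for this example.

```python
import random


def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]


def dropout(values, drop_prob, rng):
    """Inverted dropout (an assumed variant): zero each activation with
    probability drop_prob and rescale the survivors by 1 / keep_prob so
    the expected activation stays the same."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


# Hypothetical activations for one hidden layer.
rng = random.Random(0)
activations = relu([-1.5, 0.2, 3.0, -0.1, 0.7])
dropped = dropout(activations, drop_prob=0.5, rng=rng)
print(activations)
print(dropped)
```

Dropout is applied only during training; at prediction time the layer is used as-is, which is why the survivors are rescaled in this variant.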

How do you deal with classification of imbalanced classes?
    
We sample with replacement, and we use random dropout and DropConnect between layers to learn different features.
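To make "sample with replacement" concrete, here is a minimal sketch of one common way to rebalance a training set: draw the same number of examples from every class, with replacement, so rare classes are repeated. This is an assumption about what the answer refers to, not Skymind's implementation; the function name `oversample_with_replacement` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_with_replacement(examples, labels, rng):
    """Return (example, label) pairs in which every class has as many
    items as the largest class, drawn with replacement per class."""
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)

    # Every class is resampled up to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        draws = rng.choices(items, k=target)  # sampling with replacement
        balanced.extend((example, label) for example in draws)

    rng.shuffle(balanced)  # avoid feeding the net one class at a time
    return balanced


# Toy data: four "common" examples and a single "rare" one.
rng = random.Random(42)
examples = ["a1", "a2", "a3", "a4", "b1"]
labels = ["common", "common", "common", "common", "rare"]
balanced = oversample_with_replacement(examples, labels, rng)
counts = Counter(label for _, label in balanced)
print(counts)
```

Because the rare examples are repeated verbatim, this pairs naturally with regularizers such as dropout, which keep the net from memorizing the duplicates.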

Besides classifying images into labels, can you identify object locations in images? Can you find similar images?

With enough data, yes.

Friday, February 27, 2015

Intel analytics library

My colleague Matt Grover from Walmart sent me the following link. Intel is releasing an analytics library with some ML primitives and Spark RDD support. The library is in beta.

Thursday, February 26, 2015

March 29 deadline for submitting to Papis.io 15

I am happy to announce the 2nd predictive API conference (Papis.io), which will take place on 6-7 August jointly with KDD in Australia. The 1st Papis.io was held in Barcelona jointly with Strata Barcelona this past November and attracted 200 predictive analytics practitioners. Videos of the talks are available online. There is just a month left until the submission deadline.

Wednesday, February 18, 2015

Dato Core Open Source Released!

We are proud to announce the latest GraphLab Create open source release, now renamed Dato Core.
Here is our GitHub repo. We are slowly going to deprecate PowerGraph and older versions of GraphLab.

Thursday, February 12, 2015

Next.ml event in Boston - April 27

I was introduced to next.ml by my colleague Gideon Wulfsohn, who is now working at indico.io. Readers of my blog are welcome to use the 15% discount code "dato" when registering for the upcoming next.ml event on April 27 in Boston. Students are welcome to apply for a free ticket here. Among the interesting speakers are Matei and Paco from Databricks, as well as speakers from Yhat and indico.

Wednesday, February 11, 2015

Kaggle is cutting down

I got this from my friend Micky Fire: Kaggle is cutting a third of its workforce. As a reminder, Kaggle is the popular data science competition website.

DataDog acquires MortarData

I got this from my colleague Johnnie Konstantas: DataDog acquires MortarData. Not long ago I helped David Bayer from AmplifyPartners set up a VIP dinner in Israel. AmplifyPartners invested in DataDog's round B. I wrote about Mortar on my blog a couple of years ago.