Data Science and ML | Databricks Blog

Random Forests and Boosting in MLlib

January 21, 2015 by Joseph Bradley and Manish Amde in Engineering Blog

This is a post written together with Manish Amde from Origami Logic. Apache Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into...

ML Pipelines: A New High-Level API for MLlib

January 7, 2015 by Joseph Bradley, Evan Sparks and Shivaram Venkataraman in Engineering Blog

MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in...

Efficient Similarity Algorithm Now in Apache Spark, Thanks to Twitter

October 20, 2014 by Reza Zadeh in Engineering Blog

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission...

Scalable Decision Trees in MLlib

September 29, 2014 by Manish Amde and Joseph Bradley in Engineering Blog

This is a post written together with one of our friends at Origami Logic. Origami Logic provides a Marketing Intelligence Platform that uses...

Apache Spark 1.1: MLlib Performance Improvements

September 22, 2014 by Burak Yavuz in Engineering Blog

With an ever-growing community, Apache Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports...

Statistics Functionality in Apache Spark 1.1

August 27, 2014 by Doris Xin, Burak Yavuz and Hossein Falaki in Engineering Blog

One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate v

Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao

August 14, 2014 by Andy Huang and Wei Wu in Engineering Blog

This is a guest blog post from our friends at Alibaba Taobao. Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this blog post, we share our

Scalable Collaborative Filtering with Apache Spark MLlib

July 23, 2014 by Burak Yavuz and Reynold Xin in Engineering Blog

Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Pyt

Distributing the Singular Value Decomposition with Apache Spark

July 21, 2014 by Li Pu and Reza Zadeh in Engineering Blog

Guest post by Li Pu from Twitter and Reza Zadeh from Databricks on their recent contribution to Apache Spark's machine learning library. The...

New Features in MLlib in Apache Spark 1.0

July 16, 2014 by Xiangrui Meng in Engineering Blog

MLlib is an Apache Spark component focusing on machine learning. It became a standard component of Spark in version 0.8 (Sep 2013). The...