Engineering blog

Easily Clone your Delta Lake for Testing, Sharing, and ML Reproducibility

September 15, 2020 by Burak Yavuz and Pranav Anand in Engineering Blog
Introducing Clones: an efficient way to make copies of large datasets for testing, sharing, and reproducing ML experiments. We are excited to introduce...
Engineering blog

Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0

Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake. Last week, we had...
Engineering blog

Time Traveling with Delta Lake: A Retrospective of the Last Year

June 18, 2020 by Burak Yavuz and Denny Lee in Engineering Blog
Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake. Try out Delta Lake...
Company blog

Diving Into Delta Lake: Schema Enforcement & Evolution

September 24, 2019 by Burak Yavuz, Brenner Heintz and Denny Lee in Company Blog
Try this notebook series in Databricks. Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the...
Company blog

Diving Into Delta Lake: Unpacking The Transaction Log

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important...
Company blog

Introducing Delta Time Travel for Large Scale Data Lakes

February 4, 2019 by Burak Yavuz and Prakash Chockalingam in Company Blog
Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake. Data versioning for...
Engineering blog

Benchmarking Structured Streaming on Databricks Runtime Against State-of-the-Art Streaming Systems

October 11, 2017 by Burak Yavuz in Engineering Blog
Update Dec 14, 2017: As a result of a fix in the toolkit's data generator, Apache Flink's performance on a cluster of...
Engineering blog

Running Streaming Jobs Once a Day For 10x Cost Savings

This is the sixth post in a multi-part series about how you can perform complex streaming analytics using Apache Spark. Traditionally, when people...
Engineering blog

Working with Complex Data Formats with Structured Streaming in Apache Spark 2.1

In part 1 of this series on Structured Streaming blog posts, we demonstrated how easy it is to write an end-to-end streaming ETL...
Engineering blog

New Features in Machine Learning Pipelines in Apache Spark 1.4

Apache Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark...
Company blog

Using 3rd Party Libraries in Databricks: Apache Spark Packages and Maven Libraries

July 28, 2015 by Burak Yavuz in Company Blog
In an earlier post, we described how you can easily integrate your favorite IDE with Databricks to speed up your application development. In...
Company blog

Making Databricks Better for Developers: IDE Integration

June 5, 2015 by Burak Yavuz in Company Blog
We have been working hard at Databricks to make our product more user-friendly for developers. Recently, we have added two new features that...
Engineering blog

Statistical and Mathematical Functions with DataFrames in Apache Spark

We introduced DataFrames in Apache Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python...
Engineering blog

Apache Spark 1.1: MLlib Performance Improvements

September 22, 2014 by Burak Yavuz in Engineering Blog
With an ever-growing community, Apache Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports...
Engineering blog

Statistics Functionality in Apache Spark 1.1

One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we're observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate...
Engineering blog

Scalable Collaborative Filtering with Apache Spark MLlib

July 23, 2014 by Burak Yavuz and Reynold Xin in Engineering Blog
Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python...