Databricks Sets New World Record for CloudSort Benchmark Using Apache Spark at $1.44 Per Terabyte

In Collaboration with Industry Partners, Databricks Earns Second World Record in Two Years, Reducing Data Processing Costs In the Cloud per Terabyte by 68 Percent

November 15, 2016

SAN FRANCISCO, CA--(Marketwired - Nov 15, 2016) - Databricks®, the company founded by the team that created the popular Apache® Spark™ project, announced today that, in collaboration with industry partners, it has broken the world record in the CloudSort Benchmark, a third-party industry benchmarking competition for processing large datasets.

Using Apache Spark and working in close collaboration with Nanjing University and Alibaba Group as team NADSort, Databricks architected an efficient cloud platform for data processing. In both the Daytona and Indy CloudSort competitions, the platform sorted 100 terabytes (TB) of data using cloud computing resources that cost $144 (USD) in total, or $1.44 per TB. This beat the previous record of $4.51 per TB, held by the University of California, San Diego, a savings of 68 percent.
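
The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this release; the function and variable names are illustrative and are not part of any benchmark tooling.

```python
def cost_per_tb(total_cost_usd: float, terabytes: float) -> float:
    """Return the cost in USD of sorting one terabyte."""
    return total_cost_usd / terabytes


def percent_savings(old_cost: float, new_cost: float) -> float:
    """Return the relative cost reduction, as a percentage of the old cost."""
    return (old_cost - new_cost) / old_cost * 100


# Figures quoted in the release.
new_cost = cost_per_tb(total_cost_usd=144, terabytes=100)
previous_cost = 4.51  # previous record, USD per TB

savings = percent_savings(previous_cost, new_cost)
improvement_factor = previous_cost / new_cost

print(f"Cost per TB: ${new_cost:.2f}")
print(f"Savings vs. previous record: {savings:.0f}%")
print(f"Cost-efficiency improvement: {improvement_factor:.1f}x")
```

The improvement factor of roughly 3.1x is what the release describes as costing about one third of the earlier price.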

The CloudSort Benchmark measures the lowest cost per terabyte at public cloud pricing, with the aim of reducing the total cost of ownership of a cloud architecture (a combination of software stack, hardware stack, and tuning) and encouraging organizations to deploy big data applications on the public cloud. In 2014, Databricks set a world record by sorting 100 TB of data, or 1 trillion records, in 23 minutes, which was 30 times more efficient per node than the previous record held by Apache Hadoop. The sorting program for this entry, based on Databricks' 2014 record and updated for better efficiency in the cloud, ran on 394 ECS.n1.large nodes on the Alibaba Cloud, each equipped with an Intel Haswell E5-2680 v3 processor, 8 GB of memory, and 4x135 GB SSD Cloud Disk.
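
For a sense of scale, the aggregate cluster resources can be derived from the per-node specifications above. This is a rough sketch that assumes decimal units (1 TB = 1,000 GB) and treats the quoted capacities as fully usable; neither assumption is stated in the release, and the variable names are illustrative.

```python
# Per-node specifications quoted in the release.
NODES = 394
MEMORY_GB_PER_NODE = 8
SSD_DISKS_PER_NODE = 4
SSD_GB_PER_DISK = 135

# Assumption: decimal units, so 100 TB = 100,000 GB.
DATASET_GB = 100 * 1000

total_memory_gb = NODES * MEMORY_GB_PER_NODE
total_ssd_gb = NODES * SSD_DISKS_PER_NODE * SSD_GB_PER_DISK

# How the cluster's memory and SSD capacity compare with the dataset size.
memory_fraction = total_memory_gb / DATASET_GB
ssd_ratio = total_ssd_gb / DATASET_GB

print(f"Total memory: {total_memory_gb} GB ({memory_fraction:.1%} of the dataset)")
print(f"Total SSD: {total_ssd_gb} GB ({ssd_ratio:.2f}x the dataset)")
```

Under these assumptions, total memory is only about 3 percent of the dataset while total SSD capacity is roughly twice its size, which suggests the sort had to rely heavily on disk rather than hold the data in memory.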

"Databricks reduced the per terabyte cost from 4.51 dollars, the previous world record held by University of California, San Diego in 2014, to 1.44 dollars, meaning our optimizations and advances in cloud computing have tripled the efficiency of data processing in the cloud," said Databricks Chief Architect and leader of the CloudSort Benchmark project, Reynold Xin. "With these innovations, to process the same amount of data in 2016 in the cloud costs one third of the price in 2014!"

Catalysts for Cost Efficiency Improvements

Three important factors made this CloudSort cost efficiency possible, according to Reynold Xin in his blog:

  1. Cost-effectiveness of cloud computing: Increased competition among major cloud providers has lowered the cost of resources, making deploying applications in the cloud economically feasible and scalable;
  2. Efficiency of software: Continued innovations in Apache Spark, such as Project Tungsten, Catalyst, and whole-stage code generation, have benefited Spark enormously, improving all aspects of the Spark stack;
  3. Optimization of Spark and cloud-native architecture: In-house expertise in Spark, combined with deep experience in operating and tuning cloud-native data architecture across tens of thousands of clusters for customers, has led to incremental efficiency gains and to the most efficient cloud architecture for data processing.

"The achievements of two world records in two years leave us humbled, yet they validate the technology trends we've invested in heavily," said Databricks CEO, Ali Ghodsi. "First, we believe open source software is the future of software evolution, and Apache Spark is the most efficient engine for data processing. And second, cloud computing is becoming the most cost-efficient, effective, and scalable architecture to deploy big data applications."

About Databricks

Databricks' vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team that created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project. The company has also trained over 20,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact [email protected].
