Aaron Davidson is an Apache Spark committer and software engineer at Databricks. His Spark contributions include standalone master fault tolerance, shuffle file consolidation, the Netty-based block transfer service, and the external shuffle service. At Databricks, he leads the Performance and Storage team, working on the Databricks File System (DBFS) and automating the cloud infrastructure.
May 27, 2021 04:25 PM PT
Enter the next phase of democratized analytics and AI to increase scale and agility and reduce time to innovation. With the introduction of NEPHOS, Databricks is improving how organizations engage with data and analytics platforms by placing greater control in the hands of data teams. By helping data teams get to work faster while giving infosec and platform admins confidence that governance and security policies are enforced, organizations can securely accelerate innovation. NEPHOS lets data teams work faster and smarter with instant compute, an optimal price/performance ratio, and built-in security and compliance features. In this session, learn how the newly announced automation makes workspaces ready to run notebooks or execute SQL queries instantly, without the labor-intensive manual work of setting up infrastructure.
April 24, 2019 05:00 PM PT
Last year, Databricks launched MLflow, an open source framework to manage the machine learning lifecycle that works with any ML library to simplify ML engineering. MLflow provides tools for experiment tracking, reproducible runs and model management that make machine learning applications easier to develop and deploy. In the past year, the MLflow community has grown quickly: 80 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow. In this talk, we’ll present our development plans for MLflow 1.0, the next release of MLflow, which will stabilize the MLflow APIs and introduce multiple new features to simplify the ML lifecycle. We’ll also discuss additional MLflow components that Databricks and other companies are working on for the rest of 2019, such as improved tools for model management, multi-step pipelines and online monitoring.
June 30, 2014 05:00 PM PT
This talk will present a technical “deep-dive” into Spark that focuses on its internal architecture. The content will be geared towards those already familiar with the basic Spark API who want to gain a deeper understanding of how it works and become advanced users or Spark developers.
This talk will walk through the major internal components of Spark: the RDD data model, the scheduling subsystem, and Spark’s internal block-store service. For each component we’ll describe its architecture and role in job execution. We’ll also provide examples of how higher-level libraries like Spark SQL and MLlib interact with the core Spark API.
Throughout the talk we’ll cover advanced topics like data serialization, RDD partitioning, and user-defined RDDs, with a focus on actionable advice that users can apply to their own workloads.