Adi Polak is a Sr. Software Engineer and Developer Advocate in the Azure Engineering organization at Microsoft. Her work focuses on distributed systems, big data analysis, and machine learning pipelines. In her advocacy work, she draws on her extensive industry research and engineering experience to educate and help teams design, architect, and build cost-effective software and infrastructure solutions that emphasize scalability, team expertise, and business goals. Adi is a frequent presenter at industry conferences worldwide and an instructor of O’Reilly courses. When she isn’t building machine learning pipelines or thinking up new software architecture, you can find her hiking and camping in nature.
May 27, 2021 03:50 PM PT
Whether your data pipelines handle real-time event-driven streams, near-real-time streams, or batch processing jobs, your system's performance will degrade when you work with massive amounts of data made up of small files, particularly Parquet.
A small file is one that is significantly smaller than the storage block size. Yes, even with object stores such as Amazon S3 and Azure Blob Storage, there is a minimum block size. Storing objects significantly smaller than that can waste space on disk, since the storage is optimized for fast reads and writes at the minimum block size.
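As a rough illustration of that waste, suppose (hypothetically) that storage allocates whole blocks of 4 MiB per object; real object stores differ in the details, but the overhead of many small objects follows the same shape:

```python
def wasted_bytes(file_sizes, block_size=4 * 1024 * 1024):
    """Estimate slack space when each file occupies whole blocks.

    block_size is a hypothetical figure; actual minimum block
    sizes vary by storage system.
    """
    waste = 0
    for size in file_sizes:
        # Number of blocks this file occupies, rounded up.
        blocks = -(-size // block_size)
        waste += blocks * block_size - size
    return waste

# 1,000 Parquet files of 100 KiB each vs. one 100 MiB file:
small = wasted_bytes([100 * 1024] * 1000)   # ~3.8 GiB of slack
large = wasted_bytes([100 * 1024 * 1024])   # 0 bytes of slack
```

The 100 MiB of data is identical in both cases; only the file layout changes, yet the small-file layout leaves gigabytes of unusable block space.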
To understand why this happens, you first need to understand how cloud storage works with the Apache Spark engine. In this session, you will learn about Parquet, the storage API calls, how they work together, why small files are a problem, and how you can leverage Delta Lake for a simpler, cleaner solution.
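The core remedy for small files is compaction: rewriting many small files into fewer files near a target size. The sketch below shows the idea with a simple greedy grouping (the target size and greedy strategy here are illustrative assumptions; Delta Lake's compaction does this transactionally and far more carefully):

```python
def plan_compaction(file_sizes, target_size=128 * 1024 * 1024):
    """Greedily group small files into bins of at most target_size.

    Each bin represents one rewritten output file. This is only a
    sketch of the concept, not Delta Lake's actual algorithm.
    """
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_size:
            bins.append(current)        # close the full bin
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 1,000 files of 1 MiB compact into 8 output files of ~128 MiB:
plan = plan_compaction([1024 * 1024] * 1000)
```

After compaction, a Spark job reads 8 well-sized files instead of issuing 1,000 storage API calls, which is where most of the performance win comes from.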
November 18, 2020 04:00 PM PT
Today, with the rise of cloud-native architectures, the conversation about infrastructure costs has spread from R&D directors to everyone in R&D: "How much does a VM cost?", "Can we use that managed service? How much will it cost with our workload?", "I need a stronger machine with more GPUs; how do we make that happen within the budget?" Sound familiar?
When deciding on a big data/data lake strategy for a product, one of the main chapters is cost management. On top of the budget for hiring technical people, we need a strategy for services and infrastructure costs. That includes the provider we want to work with, the different pricing tiers they offer, the system needs, the R&D needs, and each service's pros and cons.
When Apache Spark is the primary workload in our big data/data lake strategy, we need to decide whether we want to manage it ourselves or work with a managed solution such as Databricks.
Azure Databricks is a fully managed service rooted in the Apache Spark, Delta Lake, and MLflow open source projects. Any discussion of the best ways to work with Apache Spark eventually turns to performance and tuning. Databricks provides a managed, optimized Apache Spark environment that can run up to ~50 times faster than OSS Apache Spark. But we need to consider the costs carefully.
Join this session to learn about resources consumed with Azure Databricks, the various tiers, how to calculate and predict cost, data engineers and data science needs, cost efficiency strategies, and cost management best practices.
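At its simplest, an Azure Databricks cluster bill has two components: the DBUs (Databricks Units) consumed, priced per DBU-hour, and the underlying Azure VMs, priced per node-hour. The sketch below combines them; every rate in it is a hypothetical placeholder, since actual DBU consumption and prices depend on tier, workload type, VM series, and region:

```python
def databricks_cluster_cost(dbu_per_node_hour, dbu_rate, vm_rate, nodes, hours):
    """Back-of-the-envelope Azure Databricks cost estimate.

    All rates are hypothetical placeholders; consult the Azure
    pricing pages for real numbers.
    """
    dbu_cost = dbu_per_node_hour * dbu_rate * nodes * hours  # DBU charge
    vm_cost = vm_rate * nodes * hours                        # VM charge
    return dbu_cost + vm_cost

# e.g. 4 nodes for 10 hours, 1.5 DBU/node-hour at $0.40/DBU,
# VMs at $0.50/node-hour (all illustrative figures):
cost = databricks_cluster_cost(1.5, 0.40, 0.50, 4, 10)  # → 44.0
```

Separating the two components makes the cost levers visible: tier and workload type move the DBU rate, while VM series choice moves the VM rate independently.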
Speaker: Adi Polak