Andy Dang is the co-founder and head of engineering at WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Andy spent half a decade at Amazon, where he built massive data pipelines for the Advertising Platform, deployed some of the company’s first ML applications to production, built the internal Machine Learning Platform, and helped launch the first iterations of SageMaker. Andy holds a Master's in CS from Tokyo Institute of Technology. He is a frequent speaker on topics ranging from MLOps to building responsible AI systems.
May 27, 2021 05:00 PM PT
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. Yet ensuring data quality across the entire ML pipeline is both costly and cumbersome, and data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large-scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.
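To make the idea of lightweight statistical data collection concrete, here is a minimal sketch in plain Python. It is not the whylogs API: the `ColumnProfile` class, its fields, and the `profile_partition` helper are hypothetical names chosen for illustration, and the merge step is an assumption about how partition-level summaries could be combined in a distributed engine such as Spark.

```python
from dataclasses import dataclass
import math


@dataclass
class ColumnProfile:
    """Hypothetical running summary of one numeric column (not the whylogs API)."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        # Update the summary with one value in constant memory.
        if value is None:
            self.nulls += 1
            return
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        # Combine two summaries without revisiting the raw data.
        return ColumnProfile(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def profile_partition(values):
    # Profile one partition (or one batch of a stream) in a single pass.
    profile = ColumnProfile()
    for value in values:
        profile.track(value)
    return profile
```

In this sketch, each partition is summarized independently and the small summaries are merged afterwards, so the raw records never need to be gathered in one place. Merging the profiles of two partitions yields the same counts, bounds, and mean as profiling their concatenation in one pass.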