The data science lifecycle consists of multiple iterative steps, including data collection, data cleaning and exploration, feature engineering, model training, and model deployment and scoring. The process is often tedious and error-prone, and it requires considerable human effort. Beyond these challenges, leveraging ML in enterprise applications, especially in regulated environments, brings a very high level of scrutiny for data handling, model fairness, user privacy, and debuggability. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock uses MLflow for model and experiment tracking, and extends and complements it with automatic logging, deeper integration with relational databases that often store confidential data, model optimizations, and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models, which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Subru is a Principal Architect at Microsoft in the GSL team, currently focusing on data science lifecycle automation platforms. Previously at Microsoft, Subru was a Principal Research Engineer working on different aspects of YARN scheduling, specifically scaling it to 50K+ nodes and providing SLA guarantees. This work is a critical driver for the internal Cosmos BigData clusters, which have scheduled nearly one trillion tasks that manipulated close to a zettabyte of production data. Prior to Microsoft, Subru worked at Yahoo!, where he contributed to Oozie's precursor, near real-time stream processing on Hadoop, and HBase replication. He is also a member of the Apache Hadoop PMC, where he has been actively contributing since 2007 with an emphasis on YARN resource management. Subru's research interests include large-scale distributed systems, Systems-for-ML, and ML-for-Systems.
Avrilia is a senior scientist at Microsoft’s Gray Systems Lab (GSL).
Her research lies broadly in the area of data management, with a recent focus on machine learning model management and large-scale stream processing. Her current work aims to simplify the data science lifecycle by automating some of the tasks that data scientists perform manually today. She also works on systems problems that arise at very large scale, such as improving the performance of complex streaming pipelines and the resource utilization of cloud deployments. She actively contributes to the design of the Dhalion library, which has been used to efficiently tackle some of these problems in production. Avrilia has also made open-source contributions to Apache Heron (as a committer) and to MLflow.
Avrilia received her Ph.D. in Computer Science from the University of Wisconsin-Madison. Before joining Microsoft, she spent three years at the IBM Almaden Research Center working on SQL-on-Hadoop engines and natural language interfaces for databases.