We will share our experiences in building Data Science and Machine Learning (DS/ML) into organizations. As new DS/ML teams are created, many wrestle with questions such as: How can we most efficiently achieve short-term goals while planning for scale and production long-term? How should DS/ML be incorporated into a company?
We will bring unique perspectives: one as a previous Databricks customer leading a DS team, one as the second ML engineer at Databricks, and both as current Solutions Architects guiding customers through their DS/ML journeys. We will cover best practices through the crawl-walk-run journey of DS/ML: how to immediately become more productive with an initial team, how to scale and move towards production when needed, and how to integrate effectively with the broader organization.
This talk is meant for technical leaders who are building new DS/ML teams or helping to spread DS/ML practices across their organizations. Technology discussion will focus on Databricks, but the lessons apply to any tech platforms in this space.
Chris Robinson: Hello everyone. And thanks for attending our session. Today we’re going to talk about building data science into organizations and share some experience that we both have from the field. So, first off, let’s talk about our perspectives. I’m Chris Robinson. I’m a senior solutions architect with Databricks and a former director of data science and omni-channel marketing at Overstock.com. I’m a career data scientist and an avid Apache Spark user. Joseph.
Joseph Bradley: Yeah. Thanks Chris. I’m Joseph Bradley, another solutions architect. In terms of my background, I spent most of my time at Databricks as an early engineer, working on ML and related tech, especially in the Apache Spark open-source community. So I do want to point out these backgrounds in terms of our perspectives about building data science into organizations: Chris, in terms of directing that kind of effort; me, in terms of building the tools for it; and then both of us working as solutions architects at Databricks, where we work with a very wide variety of organizations, large and small, in different industries, but all on building unified platforms around the Lakehouse architecture for data analytics and AI. The other important thing, I think, especially to folks at this conference, is that a lot of our strategy is built around open tools like Apache Spark, Delta Lake and MLflow.
So, tons of organizations are doing data science nowadays. I’m sure you’ve seen a lot of statistics like those on the left about many companies investing in big data and AI initiatives, and many not yet having those in production or widespread. So we hope that our lessons today, from our experiences in helping companies join the smaller percentage with widespread AI in production, will be useful both for those at the beginning and those towards the end of their journey. In terms of the goals of a data science, ML or AI program, we’d like to break these into short-term and long-term ones. In the short term, the first important thing might be to validate that data science is indeed worthwhile and prove it to some business partners. The next, of course, is to use that to get resources: data, data scientists and other talent, and executive sponsorship.
And an important short-term goal is to show vision, to get to those longer-term goals on the right, where long-term you’d want to show business impact across different parts of the business, increase productivity over time and, of course, scale data science from the initial effort to throughout the organization. Now looking at it from the other dimension, what are the challenges in getting there? We break these into organization and tech and platform. In terms of organization, there are obvious ones like team building, building skillsets, hiring versus training, or a mix thereof. Team organization could mean embedding data scientists in each of the organizations within your company, or it could mean a standalone data science org, and here it’s really important, I think, to adapt to your culture and DNA. In terms of business and executive alignment, this is critical both for getting the resources needed, but also for making sure that data science is directed at the right strategic efforts for the company.
And then finally, an important challenge is to make sure that R&D is accepted within the company. Many companies don’t have huge R&D arms, and data science, especially at the beginning, is a bit researchy, so getting the buy-in there is critical. In terms of challenges for tech and platform, I think two of the very common ones we see are poor integration between data science and other data teams, and then planning for scale and production under investment constraints, where you don’t want to build up a lot of technical debt early on, but you also don’t want to build the entire thing at the beginning. So I’ll turn it back to Chris to talk through our philosophy for addressing these challenges.
Chris Robinson: Yeah. Thank you very much, Joseph. So our general philosophy is to progress through three stages of crawl, walk, run as you grow your data science organization and capabilities. Along the way, we’re going to focus on strategy for success, organizational changes as the data science org matures, and technological platforms that focus on scalability. But first, before we dig into the meat of our presentation, I wanted to talk a bit about execution. I think the advice we give in this slide really applies to any of the three stages, and you want to start thinking about this early on. First, we recommend that you use an agile process for data science and learn from the past experience of software development. This is going to allow you to iterate with sprints and stand-ups and fail fast in R&D, but more importantly, it’ll align you to the same timelines as the rest of the technology organization.
Next, transparency is key, and we really can’t stress this enough. You need to be communicating frequently with your business partners and executives, and make sure that you’re bringing your business partners and consumers into the modeling process and making them an integral part of it. Bring them in early and talk to them often. Next, you want to collaborate with the data and platform teams: make your needs known and understood, and make sure that you’re looking out for shortcuts, which will build technical debt that is often very hard to deal with later. And as a personal anecdote of advice, all of this is made easier with a nice box of donuts early in the morning.
So next, let’s double click on our crawl stage. In the initial crawl stage, our strategy will be to identify quick wins and focus on building a plan for the future. You want to identify projects that require a low amount of effort, but have a very high return in terms of value. For the organization, you should focus on building channels of communication and make sure you’re nurturing those channels. And for the platform, meet your data science teams where they are today and use the tools that they’re familiar with. Focus on productivity to quickly build those early wins and gain trust with your business partners. So in this crawl stage, we’re going to set up some assumptions that’ll help us with the rest of our discussion. First, in terms of a team, we’re assuming that there are one to two data scientists likely reporting into a CTO.
You can think about these individuals as the founding members of your data science practice, and more than likely they’re acting as full-stack data scientists, meaning they’re taking their models all the way from early exploratory data science through productionization and post-production monitoring and stability. And because of the requirements of this full-stack data scientist, we notice they typically have a math or computer science background, but this is not a hard-and-fast rule. In terms of the company, there’s going to be a desire to become data-driven, and likely the company is going to be smaller in size, meaning a startup or an existing organization with new data initiatives and a desire to establish a data science practice.
So what does success look like in this section? You’re going to have successful MVPs with a few models manually in production. You’ll take a step back and start to build an AI/ML strategy that you can take forward in the next two stages. And you’re going to be in the discovery phase for new projects and low-hanging fruit. Identifying those initiatives that are going to take a lower amount of effort, but have a high value return. And with that, Joseph I’ll pass it back to you.
Joseph Bradley: Thanks, Chris. Yeah. So Chris has been talking about this crawl stage from the organization-building perspective, and I’d like to share a few thoughts from the platform or tech-selection one. My first piece of advice to enable data scientists is to allow them to keep using tools they’re familiar with. If you’ve been trained in a master’s program or on the job at a company, you’re generally used to interactive notebooks, languages like Python and R, standard ML libraries, Git and so forth. I’m not going to go through this whole list, but the key point is that data scientists come in with an expected set of tools, and making sure that the platform can support them is critical, both for productivity and for hiring.
Also along the lines of standards: I think it’s worth turning to open-source standards, which I think people know and love at this conference, like Apache Spark, Delta Lake, MLflow, and Koalas. This is important partly for hiring, like we were just mentioning, since people are often more familiar with these types of open tools, but also for portability. At the beginning of your data science journey, you may not know your exact needs in five years, and so building around open standards will allow you to move workloads onto and off of whatever platform you have.
The next tip I’d like to give around tech and platform is productivity. And there, I think one of the key challenges I see with a lot of customers is self-service analytics for data scientists. It’s all too common that data science teams need to go to an infra team or a platform team and beg them for the right compute resources or the right libraries and environment to get started. So if a platform can enable them to, say, start up a machine or cluster on demand, start using it, start up a larger one, maybe share it with their team members, this allows a lot more flexibility. But of course, with that freedom comes responsibility, and the platform should also be able to enforce that responsibility through cost controls and governance. In terms of libraries and environment, I think what data scientists need most are, first, plug-and-play environments with popular ML libraries to get started quickly, but also the ability to customize for future projects.
So throughout this talk, we’d like to share this running example of a company which started its data science journey with ML prioritization of sales opportunities: telling the sales team, here are the top 10 opportunities that you should focus on this quarter. Now, in this example, I’m going to talk through first what a data scientist might need. They need to get that ML workspace, clusters and environments, and sync code, maybe from Git or by importing a notebook. They need to get access to data, do some iterative development in, say, notebooks, and then analyze results, where tools like MLflow, which we mentioned, can be really valuable for auto-logging, saving metrics for some post-hoc analysis. So these are the kinds of things they need to start producing insights. And then finally, to share those with the sales team, they need maybe visualizations or dashboarding, so that they can show that they actually proved some value. Now I’d like to turn it back to Chris to talk through how this data science perspective fits into the broader org.
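To make that iterative loop concrete, here is a minimal sketch of the opportunity-scoring idea in scikit-learn. The feature names, the synthetic data, and the top-10 ranking are all hypothetical stand-ins for whatever a real project would use; in the workflow described above, each training run would additionally be captured by MLflow auto-logging.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical historical opportunities: deal size, days in pipeline, touches.
X_train = pd.DataFrame({
    "deal_size": rng.uniform(1e3, 1e6, 500),
    "days_open": rng.integers(1, 365, 500),
    "touches": rng.integers(0, 50, 500),
})
y_train = rng.integers(0, 2, 500)  # 1 = won, 0 = lost

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score this quarter's open opportunities and hand the top 10 to sales.
open_opps = X_train.sample(100, random_state=0)
open_opps = open_opps.assign(win_prob=model.predict_proba(open_opps)[:, 1])
top10 = open_opps.nlargest(10, "win_prob")
print(top10[["deal_size", "win_prob"]])
```

The output of `top10` is exactly the kind of result that would then go into a dashboard or visualization for the sales team.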
Chris Robinson: Yeah. Thank you, Joseph. So first off, as you start to tackle a new problem like prioritization of sales opportunities, you’re going to want to have a discussion with the sales stakeholders to understand the problem, the data, and more importantly, to set expectations. Make it clear to them what you can achieve with ML and what you can’t. Along the way, as you start this project, you’re probably going to want to start hiring and training new data science team members. In this particular domain, it would help the project a lot if you could hire a data scientist who had worked in sales operations in a past role. As you develop your model, you’re going to want to focus on explaining the results and making sure that you and your business partners understand the future potential for the sales org, and hand in hand with that is building executive alignment and getting buy-in for long-term initiatives. The way you do that is by identifying projects like this that take a lower amount of effort but have a high value return.
Next, in terms of technology and platform, you’re going to want to focus on platform enablement and improvement. You’re going to start bringing in new data sets, which is going to require you to partner with teams like infrastructure teams and data engineers. So in this case, you might want to bring in customer history and maybe some sales data or other data domains from your organization. And last but not least, you’re going to want to take a step back in the crawl stage and think about long-term platform and data pipeline planning. So you can prepare to grow and scale in the future stages.
So now let’s double-click on the walk stage. In the walk stage, we’re going to focus on building successful products that incorporate ML. You want to use the communication channels that you have established to improve executive visibility and cross-team integrations. And finally, as you establish your ML platform, focus on scaling and automating workloads to establish a clear path to production. And now, as we take a minute to think about the organization, let’s talk about the dynamics of the team. We’re assuming here, because you’ve had some early successes in the crawl stage, that you’ve been able to open up headcount and hire new members to your data science team or teams, and those data science teams are now supporting multiple business units. You’re going to have strong integrations with software engineering for production, and this is why those early communication channels are so critical. And you’re also going to start thinking about diversifying skillsets amongst your team for domain expertise, whether that happens to be machine learning domain expertise or business domain expertise. At the company level, data initiatives are being discussed with executives, and different business units are pushing for new data projects.
And what you’ll notice is you have different business champions emerging for AI and ML, and they’ll talk to their colleagues in other business units. So what does success look like here? You have successful MVPs and production models across multiple business units, and you’re starting to think about uniform testing standards being established, making sure that you have KPIs that are understandable and easy to communicate both across the org and up the executive [inaudible]. And the name of the game here is really scale, but we’re going to need a bigger boat. So how do we scale? We can scale technology, we can scale our teams, and we can scale and repeat our successes across our organization. With that, I’ll hand it back to Joseph to think a little more deeply about scale.
Joseph Bradley: Yeah. Thanks Chris. Taking a look from the platform perspective, as Chris said, there are many types of scale, but the obvious one to make sure to address is scale in terms of data size. To think about that, I think it’s useful to look at a typical machine learning workflow, from data prep on the left to model consumption on the right, and note that any one of these points could become a bottleneck as your data or problem sizes grow larger. And so a platform needs to offer the ability to scale easily. Say data preparation becomes a bottleneck: maybe the data scientists have been using pandas, and the platform needs to make sure they can smoothly transition to, say, Koalas or Spark DataFrames and UDFs in order to get past that bottleneck. Similarly for the rest of these potential bottlenecks; I’m not going to go through the different tools mentioned here, but it’s important to consider this for future planning as these workloads become more diverse. Going beyond scale, I think at this stage it becomes more important for data scientists to think about production.
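To illustrate how smooth that pandas-to-Koalas transition can be: Koalas (folded into Spark as `pyspark.pandas`) mirrors the pandas API, so scaling out is often close to an import swap. The aggregation below is a hypothetical example, with the scaled-out equivalent shown in comments since it needs a Spark cluster to run:

```python
import pandas as pd

# Small-data version: plain pandas.
df = pd.DataFrame({
    "region": ["west", "east", "west", "east"],
    "revenue": [100.0, 80.0, 120.0, 60.0],
})
summary = df.groupby("region")["revenue"].sum().sort_index()

# Scaled-out version (requires a Spark cluster); the API is nearly identical:
#   import pyspark.pandas as ps   # formerly: import databricks.koalas as ks
#   df = ps.read_parquet("/data/opportunities")   # hypothetical path
#   summary = df.groupby("region")["revenue"].sum()

print(summary)
```

The point is that the data scientist's code stays essentially the same; only the engine underneath changes.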
And the first steps towards that are automation and reproducibility. Job scheduling is a great first step in terms of taking an ad hoc data science workload and adding a bit of automation. Jobs should be thought of essentially as units of work, and here I’m showing sort of the Databricks API, but this is general to job orchestration tools. Jobs should have a piece of logic, the environment that logic needs in order to be reproducible, and other specifications such as the compute resources required, alerts, et cetera. And so this will allow data scientists to become a bit more productive, allowing set-and-forget workflows where possible. The other part of reproducibility, I think, is ML-specific, and there a tool like MLflow becomes really critical. To have reproducibility with ML, code, data, clusters and environment specs all feed into this, so having something like MLflow auto-logging to capture it is really important. Then finally, talking about production and scaling to, say, multiple data science projects, it becomes more important to think about security.
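As a sketch of what such a job-as-a-unit-of-work spec looks like, here is a minimal payload in the spirit of the Databricks Jobs API (2.1). The notebook path, cluster sizing, cron schedule, and email address are all hypothetical placeholders; the same shape (logic, environment, compute, alerts) applies to other orchestration tools:

```python
import json

# A minimal job spec: logic (notebook), environment/compute, schedule, alerts.
job_spec = {
    "name": "nightly-opportunity-scoring",
    "tasks": [
        {
            "task_key": "score",
            "notebook_task": {"notebook_path": "/Repos/ds/score_opportunities"},
            "new_cluster": {
                "spark_version": "13.3.x-cpu-ml-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 2 AM daily
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["ds-oncall@example.com"]},
}

# Submitting would be an authenticated POST of this payload to the
# workspace's /api/2.1/jobs/create endpoint.
print(json.dumps(job_spec, indent=2))
```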
And there, making sure the platform can tie in to, say, cloud credentials, policies around clusters and other resources, and table access controls becomes important. So getting back to our running example: our sales opportunity prioritization project was successful, and the company is now trying to build ML into products in several areas. So the data scientists working on that are going to start to need new things. On the left, they’ll start to need tighter integration with broader data pipelines. Here tools like Delta Lake and Apache Spark can be really valuable for automating ingestion and processing of new data for ML, and also for outputting the results and insights into the business or product for consumption.
Next, they may need to, say, improve their modeling process, either to improve the models or to improve their efficiency, and there tools like hyperparameter tuning, MLflow auto-logging and so forth can be really valuable. As the data grow or they reach larger data problems, data scientists may need to scale up or out. The final thing the platform should allow is, of course, the automation we mentioned: scheduling training jobs, inference jobs, and automation for exporting results to downstream consumers, basically making them both more productive and creating true products. So I’ll hand this off to Chris to talk through this, again plugging into the broader org.
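The hyperparameter tuning just mentioned can be sketched with scikit-learn's grid search; the dataset and parameter grid here are hypothetical, and in a Databricks-style setup each candidate's parameters and metrics would also be recorded automatically by MLflow auto-logging:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the tuning example.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Search over a small grid of regularization strengths with 3-fold CV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For larger searches, distributed tuners (e.g. Hyperopt with Spark trials) follow the same pattern at cluster scale.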
Chris Robinson: Yeah. Thank you, Joseph. So like Joseph said, because we’ve had those early successes in the crawl stage, and now we’re starting to spread out across different business units and really thinking about ML-driven products, we’re going to have alignment between executives and data science teams to drive those data-driven products. The next step is to make sure that we’re formalizing a methodology and communication channels to educate business stakeholders, so they understand the ML models and insights and then act on them correctly. And with that, we want to have knowledge sharing across business units for ML-driven products: not just knowledge sharing amongst the data science and technical folks, but also knowledge sharing amongst the business partners on what has been successful and how to correctly integrate with the data science team. From the platform perspective, because we’ve come up with a repeatable architecture and hardened that architecture, we see platform adoption by multiple business units, and with that come increased governance needs for the platform, covering the needs of more business units and personas as we acquire new data domains.
And of course, the platform is going to play a key role in establishing these best practices and helping us propel into the run stage. At the run stage, production is the name of the game: you want to focus on reproducible wins across multiple verticals. At this point, the organization should have data science in its DNA, and every corner of the company should be data-driven. Finally, you’re going to establish reliable and efficient production processes, focusing on stability and measurable, repeatable results. So let’s take one last opportunity to think about the organization. In terms of team, because we have been able to spread across different business units in the walk stage, we’re now going to have multiple data science teams across verticals, led by a central AI executive. You’re going to have standard development and deployment processes for your models. That early integration with the software engineering teams and those communication channels are finally really starting to bear fruit.
And you’ll establish a center of excellence across verticals so that you can communicate amongst data scientists, talk about what’s working and what’s not, and make sure that you’re not only sharing successes but also learning from each other’s failures. At the company level, data initiatives are being reported to the board, and data-driven decision-making is spreading across the organization. So what does success look like here? You have successful production models in multiple verticals that are adding measurable value. You have uniform testing standards that are established and being rigorously adhered to, and you also will have a program to grow citizen data scientists, whether those be analysts or BI professionals who are looking to expand their skillsets and grow in their careers, or business partners who are just looking to use a data science mindset and be data-driven in their day-to-day decision-making. And with that, I’ll pass it back to Joseph to talk a little more about the technology.
Joseph Bradley: Thanks, Chris. Yeah. From the platform and technology side, of course, at this stage you have a lot of diverse data science workloads, and so there are a lot of topics we could cover, but I mainly want to talk about the production side, the ML Ops side, and examples of patterns which can be repeated regardless of the application. So I’d like to first talk about the model life cycle, and here I’ll speak about it in terms of MLflow, but this could really apply to other ML Ops tools as well. When, say, a new project starts, the first thing the data scientists will need to do is start playing around with whatever ML library they want and producing models. Once that is successful, they’ll want to provide a bit of rigor around this, and so the platform should be able to track things about that model: parameters, metrics, other artifacts, metadata. Then there needs to be a process for them to move this to a serious production stage.
And there something like the MLflow Model Registry becomes important, allowing movement from development to staging to production, with different personas having different responsibilities and, in fact, permissions to make these changes. Then finally, there need to be standards in the platform for deployment, and across a larger organization that might mean supporting quite a few different deployment options. Here a tool like MLflow can be really valuable in providing out-of-the-box support for these different options. So this is a nice example of a model life cycle best-practice workflow which a platform could dictate. I’d like to spend a little more time on the model registry and deployment options. So to give a concrete example of ML Ops with the Model Registry, I’m going to click through this animated example, where I start as a data scientist training and producing a model.
I then can, say, create a version of this in the Model Registry, and that’s going to let me promote that version to staging, where that will trigger a webhook to run a model validation job. If that validation job is successful, it’ll comment back with test results and make an automated request to transition the model to production. Now, in this example, I’m having a person in the loop, and so we’ll say that an ML Ops person receives an email making a request for that transition, and they need to go in and manually approve the new prod model in the Model Registry. That final action will trigger a webhook for putting the new model into production, in this example as a production batch inference job. So I’ll emphasize that this is of course just one example of a potential process, but one which can generalize across multiple use cases and business units.
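The promotion flow just described can be sketched as a toy state machine. The stage names mirror the MLflow Model Registry stages (None, Staging, Production), but the validation check and the approval argument below are hypothetical stand-ins for the webhook-triggered validation job and the human sign-off; a real implementation would go through the MLflow client and registry webhooks:

```python
# Toy model-promotion flow mirroring Model Registry stages.
STAGES = ["None", "Staging", "Production"]

class ModelVersion:
    def __init__(self, name, version):
        self.name, self.version, self.stage = name, version, "None"

def validate(model):
    # Stand-in for a webhook-triggered validation job (schema, metrics, ...).
    return True

def promote(model, target, approved_by=None):
    # Staging requires passing validation; Production requires a human approver.
    if target == "Production" and approved_by is None:
        raise PermissionError("Production transitions require human approval")
    if target == "Staging" and not validate(model):
        raise ValueError("validation failed")
    model.stage = target
    return model

mv = ModelVersion("opportunity-scorer", version=1)
promote(mv, "Staging")                                    # validation gate
promote(mv, "Production", approved_by="mlops@example.com")  # human gate
print(mv.stage)
```

Encoding the gates this way makes the personas-and-permissions idea explicit: automation can move a model to staging, but only an approver can move it to production.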
Going a bit deeper into deployment: we’ve talked about the part on the left here, getting data, say from Delta Lake or a feature store, doing some model training and tracking, and putting the model in the registry. But then we might need the platform to both provide options for modes of deployment and recommendations for which ones to use. Here batch, streaming, REST APIs, embedded systems, and pushing results to BI tools can all be good options depending on the requirements of the application. So making sure that the users of the platform understand the latency and cost trade-offs here is really important.
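As a sketch of the batch mode, here the "training side" serializes a model and the "batch-job side" reloads it to score a batch of records. The in-memory buffer, synthetic data, and model are hypothetical stand-ins; in a Databricks setup the job would typically load the production model from the MLflow registry (e.g. via `mlflow.pyfunc.load_model`) rather than deserializing a pickle:

```python
import io
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Training side: persist the trained model. A BytesIO buffer stands in
# for the model registry / artifact store here. ---
X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
artifact = io.BytesIO()
pickle.dump(LogisticRegression().fit(X, y), artifact)

# --- Batch-job side: a scheduled job reloads the model and scores the
# night's batch of new records, writing probabilities downstream. ---
artifact.seek(0)
scorer = pickle.load(artifact)
batch = np.random.default_rng(1).normal(size=(50, 3))
scores = scorer.predict_proba(batch)[:, 1]
print(scores.shape)  # one score per record in the batch
```

Batch like this is usually the cheapest, highest-latency option; the same loaded model behind a REST endpoint trades cost for low latency.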
So getting back to our running example: now our organization has data science throughout its business, and the platform and the teams supporting it, or the Data Science Center of Excellence, provide this standard life cycle from data ingestion all the way to model monitoring and feedback. The platform needs to not just provide this standard life cycle, of course, but also allow some customization. I, as a new data scientist working on a greenfield project, might only care about exploratory data analysis, but later on I want to be able to adopt this full process. So Chris, I’ll ask you to talk about fitting this into the broader org.
Chris Robinson: Yeah, thank you Joseph. So I think early on in this process, when you’re thinking about data ingestion and preparation, really that first step, business understanding becomes key. And along that journey, you’re also going to want to search for executive sponsorship, whether you have an existing executive that you’ve been working with for quite some time, or you’re a new data science team coming in and you need to find the right executive sponsor in order to be successful and really collaborate through this cycle. The Center of Excellence for Data Science and Machine Learning becomes vital, and in that center of excellence you want to have common metrics, discussions and KPIs; make sure these things are measurable and easy to understand, so you can compare projects vertical to vertical across your business. As you move to the end of this life cycle, you’re going to start to see business value realization.
And along with that, end-user feedback. You want to take those two key pieces of information and keep iterating, keep improving on your data science practice and your machine learning life cycle. From the platform standpoint, ML and data platform and pipeline integration becomes very key to making these processes repeatable, and as you start to scale and harden these processes across your organization, data and resource sharing and governance become key. So in the example that Joseph gave, where, say, we’re a new team that’s coming in and just concerned with exploratory data analysis, but we’ll be moving an eventual model into production, we want to make sure that we have very simple onboarding processes for new teams and new use cases. Make sure that in that center of excellence they have communication channels, and maybe more senior data scientists that they can go to for mentorship and help. And as they move that model eventually into production, we want to make sure that there are standard handoff processes for production jobs, and that we understand where ownership lies at each stage in the life cycle.
And key to all of this is, of course, shareable documentation and usage education: make sure that the outputs of the models are understood by your business partners and they know how to use them for actionable insights. So as we start to wrap up our talk, we wanted to offer you some resources to learn more. We have some links here for related talks and blogs, customer success stories, as well as information about Databricks machine learning related products. So as we take one last look at our philosophy, we hope that you can take this methodology and apply it to your own teams and initiatives. And remember, successful data science requires you to be daring, to fail fast, communicate, and constantly improve. To achieve these goals, you need to collaborate across data and infrastructure teams, understand business problems, and prioritize which problems to tackle first. And after all of this, you do need the technology and platform to scale and repeat these successes with measurable wins and results. So thank you very much for your time today; Joseph and I very much appreciate it.
Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Data...
Chris joined Databricks in July of 2019 as a Solutions Architect. Previously, as Director of Data Science for Digital Marketing and Fraud Prevention at Overstock, he and his team utilized big data and...