Make the Most of Your Talent and Time When Working on AI and ML Projects – Automate the Rest

Tired of spending too much time on manually intensive tasks to onboard and prepare data for your AI and ML projects? A proliferation of loosely integrated point tools and the lack of automation result in a great deal of time spent writing glue code and coordinating tooling, instead of training and operationalizing your ML models. There is a better way, and it starts with automation. In this session we’ll discuss how you can automate the manually intensive and time-consuming work of onboarding and preparing your data, so you can focus your talent and time on making the best use of AI and ML to further your business goals.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– All right, good morning, good afternoon, good evening, depending on where you’re joining us today, and welcome to the session from Infoworks. We’re going to talk about how you can start making the most of your talent and time by using automation to take care of your data grunt work. So I’m Ramesh Menon, Head of Product Management at Infoworks. Over the last 20 plus years, I’ve worked on solving various data challenges, ranging from identity resolution, master data management, and very large graph traversals. And now at Infoworks, I’m focused on automating data operations and orchestration, Kevin.

– Yeah, hi, Kevin Holder here, I run our Field Engineering team here at Infoworks. And my background is, I’ve got about 20 plus years also in the enterprise data warehousing, analytics, big data and NoSQL space. I’ve worked with a lot of large enterprises to help them, you know, develop and implement analytics infrastructure during my career. – Great, so between the two of us, we have about 40 plus years of experience with data. – It’s a long time, Ramesh. (laughing) – So what we’re going to do over the next 30 minutes is have an interactive dialogue about data challenges and how to overcome them. So Kevin, you’ve helped a lot of enterprises with their data strategy and data management. Can you describe the different tasks that data engineers and data scientists typically have to do?

Getting data ready for analytics is time-consuming

– Yeah Ramesh, I’d love to talk about that. And, you know, some of the challenges, really can be summed up with two words, grunt work, right? So, but you know, the diagram here really shows the, hierarchy of, you know, what data engineers, data scientists are generally, you know, working towards, and, you know, really the cool, the fun stuff is at the top level of the pyramid. And that’s really where we wanna be spending our time is on the machine learning, you know, training models, you know, doing the advanced algorithms. But, in reality, we end up spending a lot of time on these lower tiers, right, you know, collecting data, get it ingested, preparing it, aggregating it and doing those types of things, right. And, you know, it’s an environment where more and more use cases, you know, keep coming at us, shorter timeframes to deliver, you know, less folks on the data teams to chip in here, and, so it really becomes, you know, a challenge and, you know, I’d love for you to share, you know, some of your thoughts on, you know, based on your experience, you know, kind of why that is because, you know, we’ve seen the same type of pattern even before big data in the cloud, you know, back in the early days of data warehousing, you know, it was the same thing, people spending, you know, about 80% of their time, doing this backend work, as opposed to the higher value added stuff. And, quite frankly, this is the boring stuff, you know, and we wanna be able to free up our time and do the cool stuff, fun stuff. So, why is this pattern, you know, continuing Ramesh? – Yeah, that’s absolutely right, Kevin. So for decades, ETL has been the way to get data ready for analytics.

Why is getting data ready for analytics so complex?

These ETL tools were designed to replace scripting, typically with a visual programming paradigm. And the flow usually was grab a table, do some transformation on it on the way and land it somewhere, typically in a data warehouse. Now, with big data, cloud and the sheer scale of modern analytics, you know, we need to do a lot more, right? New challenges have appeared, like you can see in this slide, and have massively increased the grunt work needed for getting data ready for analytics. And at the same time, the requirements and the volume of new analytics just keep increasing.

– Yeah, it’s a great point Ramesh and, you know, if you look at the things on the list here, you know, you mentioned ETL tools, ETL tools really don’t address, in an automated way at least, these types of things here that are really relevant for the big data world, the cloud world that we’re in now, right. You know, the tools were great for the data warehousing paradigm, where you could pick up a table and move it from point A to point B, but to do these types of things that you have on the screen here is really challenging, right? Because you end up developing a lot of different maps and scripts and, you know, you’re working at a low level of a paradigm, and I don’t know about you, but I’ve never been able to kind of map and script my way out of (laughing) some of these challenges using tactical tools, so a lot of people really start to hand code these things, moving away from those tools, right. So how do we, you know, I know when you were building the ideas around Infoworks for the product, you really had an opportunity to rethink the space, the space that you and I have worked in for so long. And so, how did you go about rethinking that and what is, you know, what is unique about what we’re doing here to help solve these specific challenges? – Yeah, that’s right, I mean, I’d add that mapping and scripting, rinse and repeat, didn’t work too well, even in the sort of the ETL and the data warehousing days. So, you know, we started with looking afresh at the problems and the solutions, right. And the answer is pretty clear, it’s that whenever there’s complexity, automation solves it, automation makes dealing with the grunt work easier and faster, it helps one to do work faster so that we can focus on those higher level problems that we need to solve, like you pointed out earlier on.

There’s a better way! Automate every step of the data workflow

So, we created Data Foundry to be a deeply automated system and to provide the complete functionality for creating and deploying data workflows. So what’s in the data workflow? You start with onboarding your data, it’s, you know, getting it from the source systems into your data lake, then you prepare your data, get it ready for analytics, and finally operationalize it. So next, we will walk through each of these steps that can be automated across your entire data workflow. – Yeah, and while we’re walking through these, we are going to put this within the context of a demo, and we’ll show a complete end-to-end use case, going all the way from data sources to training a machine learning model in Databricks using a notebook, and really show how, you know, you can simplify that notebook, focusing on the value added, you know, the development that you wanna do there, and leverage Infoworks to help, you know, automate some of this backend work to help you really focus your talent and time on more important things. – Yeah, so let’s get started with the first step, which is onboarding data. There are numerous challenges that need to be solved here. And as you know, any one of us who’ve actually had to deal with getting data over into these data lakes and getting it ready for analytics knows that, you know, there are things starting with discovering the data that’s available in the sources, understanding what the data types are, mapping them and converting them properly so that you don’t lose data precision. It’s about change data capture, it’s about making sure that incremental works really well. It’s about not only data change, but schema change and schema drifts. All of these things need to be taken care of because data onboarding is this really critical and complex first step in your analytics process.
These issues that you can see on the left hand side here are really complex and time consuming and just examples of those kinds of, you know, data grunt work we’re talking about. So, I’ll hand it over to Kevin, so that we can see how automation can take the grunt work out of onboarding data. – Okay, thanks Ramesh. So, when I log into Infoworks and start using the system, one of the first things that happens is Infoworks will reach out and automatically crawl the metadata for the sources that I want to ingest into my Databricks Delta Lake environment. Your data sources, and, you know, we work with a lot of large enterprises who have sources that range, you know, and you can see on the screen here, CSV files, relational databases, data warehouses, Infoworks has a lot of different connectors for most of the major different types of data sources that you may have out there. And so we crawl the metadata for those, and we create a catalog automatically. This is 180 degrees different than what traditional tools have done, and then we drive the entire process off that catalog going forward, and this automatically, you know, enables a governance framework that allows you to make sure that you can govern what you’re doing over time here. I can go into this data catalog, I can search. So if I search for example for consumer, this will return my data set here that I want to work with, and then I can click in here and I can see the details behind this. So in this example, we have three different tables that are part of this consumer loan data set, two of these have been ingested into our Delta Lake environment already, and I can tell that one of these is an incremental ingestion, one of these is a full refresh every time it runs, I can tell how many records have been there, what the status is, et cetera. I can also tell that this one has not been ingested yet, so there’s some additional configuration that needs to be done.
I can also click on and get to table details for any of the objects that are here in this catalog. In this case I have the schema here where I can scroll through, but what I really wanted to highlight, as it relates to automation, is this automated process that we apply, that Data Foundry applies to every single table that’s ingested into your data lake. This process flow here will handle change data capture and automatically create an error handling table, where errors can be routed. There’s a continuous merge that’s automatically there to help keep your current target in sync with your source, and then there’s a history option that’s available, and this can be turned on and off based on the particular table and what your needs are. But this, if you turn this on, this will automatically start to track history at the row level for data that’s being ingested into your Delta Lake environment. With traditional tools, this will require a lot of drag and drop mappings, you know, or the mapping and scripting we were talking about earlier, to accomplish, which is error prone and not consistent, right? This is automated completely for every single table that you ingest into your data lake environment. – Great, so Kevin, now we have all the source systems crawled, ready for bulk ingestion, you know, we’re keeping it synchronized and everything’s in the catalog. So, as you said, there’s no way to, no need to map and script your way out of it, and everything leverages automation and it’s completely configuration driven. – That’s right, and you know, one of the more common questions that I get at this stage of showing the product is, “Hey, this is great, but I’ve worked with higher level tools in the past, and I run into limitations,” right?
One of the really powerful things about Infoworks is that, we’re giving you this level of abstraction to work in and get things done very quickly, but we also allow you to get in and configure and you know, tune the processes at a very low level, but give you the power that you’re looking for, and I can easily do that. If I click on configure, this will take me into the configuration panel for this particular type of source that I’m ingesting here. I can choose options such as full refresh, or let me make it an incremental. I can specify natural keys, so that we can automatically understand how to identify unique records. I can choose my incremental mode, whether it’s append or merge, choose watermark columns, create my schema information for the Delta Lake side of things. I can even get into easily configuring how I want to partition, the data within the data Lake. I also talked earlier about the history, with this checkbox right here. I can click that on, and get a quick enabled history, SCD type 2 for data that’s being ingested into the Data Lake. So, there’s also an advanced configuration tab that allows you to go in and do even deeper configurations. – That’s the onboarding process, so, Ramesh why don’t you take us into, the next phase of our journey here? – Absolutely. So, once we’ve got data onboarded, the next step is really preparing those data sets to be ready for analytics.
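The configuration options Kevin lists here (natural keys, a watermark column, and a merge mode) can be illustrated with a small sketch. This is not Infoworks code or its API; it is a minimal plain-Python illustration of watermark-based incremental merge, assuming an in-memory target and hypothetical column names (`loan_id`, `status`, `updated_at`).

```python
from datetime import datetime


def incremental_merge(target, source_rows, key_cols, watermark_col, last_watermark):
    """Upsert rows newer than last_watermark into target and return the new watermark.

    target is a dict keyed by the tuple of natural-key values (hypothetical layout).
    """
    new_watermark = last_watermark
    for row in source_rows:
        if row[watermark_col] <= last_watermark:
            continue  # row was already picked up by an earlier run
        key = tuple(row[col] for col in key_cols)
        target[key] = row  # merge mode: insert new keys, overwrite existing ones
        new_watermark = max(new_watermark, row[watermark_col])
    return new_watermark


# Illustrative data only; names and values are assumptions, not from the talk.
target = {}
source = [
    {"loan_id": 1, "status": "open", "updated_at": datetime(2020, 1, 1)},
    {"loan_id": 2, "status": "open", "updated_at": datetime(2020, 1, 2)},
]
watermark = incremental_merge(target, source, ["loan_id"], "updated_at", datetime.min)

# A later change to loan 1 arrives; only this row is newer than the watermark.
source.append({"loan_id": 1, "status": "closed", "updated_at": datetime(2020, 1, 3)})
watermark = incremental_merge(target, source, ["loan_id"], "updated_at", watermark)
```

The second run skips the two rows it has already seen and only applies the changed record, which is the behavior an append or merge incremental mode is meant to provide without a full refresh.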

Prepare Data – Grunt Work that can be automated

Now here too, there’s, you know, a bunch of complex things that need to be done. So typically, you know, what I would need to do is combine these data sets, but before that, I need to standardize them, so that they are in a common format or they are joinable. I’ll have to deal with data quality issues and typically have to do some sort of data cleansing. And then I also need to worry about data lineage. So for example, if I derive a new column, I need to keep track of where this column was derived from, what data sets, which sources it came from and also looking downstream, if there are folks who are gonna be using my data model as input into their own data workflows, I need to be able to understand the impact of anything that I might do, when I go to change those data models. My analytics use case might need record versioning. So as Kevin just mentioned SCD 2, you know, I might need to maintain a SCD 2 in my data model targets. Even when I’m done with developing all of this business logic and implementing these data transformation pipelines, I also need to think about performance optimization. Do I need to use broadcast joins? Do I need to do in memory sorts? Do I need to do dependency management? And if I’m running in modern cloud infrastructures, how can I leverage on demand and ephemeral infrastructure? So, these are all additional complexities that make, you know, really a lot of grunt work show up in the prepared data step. So Kevin, I’ll turn it back to you, so that you can show us how automation can help with this stage. – Sure, thanks Ramesh. So, let me go over to data pipeline editor. In the previous demo, we were able to ingest data and keep our data lake synchronized with our source, and so we have a large set of data that maps directly to the source structures and we may have, hundreds and thousands of tables, coming in here, that is very useful for a lot of use cases. 
However, for a lot of use cases, we also need to take data from multiple sources and combine those together and apply business rules and create new data models within the data lake, and that’s what the pipeline designer allows us to do here. When I go into a pipeline, I can see different versions that might be available. So Infoworks has out of the box version control built in, and this is telling me that I have three versions of this pipeline available here, this one is my active version, so there’s an older one out here and then it looks like there’s a newer version out here that someone is working on to promote into production later on. If I double click on the pipeline, this takes me to the editor, and when I’m in the editor, I can get to any of the data sources that are available to me within the domain where this pipeline lives. I can also access data that is being created from other pipelines. So I can chain pipelines together to handle more complex use cases. I also have the palette down here, which has a lot of different transformation options available, everything from filters, to joins, to fuzzy matching, to various different targets for your cloud based data warehouses, where I may wanna push some curated data sets, to statistical analysis functions. So, really the idea here is to drag and drop these things onto the palette, and then use the transformation objects to create your business rules. In this example, I’m taking these two tables and for each I’m going to derive some new columns, do some cleansing on it, join the data together, and then write it out to a target. One specific type of automation that I wanted to highlight here is the ability to do incremental pipelines very easily using a simple checkbox.

You saw on the ingestion part of the demo how we can do an incremental ingestion from a source, with that ingestion we’re managing the watermarks and the dates that these records are flowing in, and that allows Infoworks to know, okay, if I check this box, I can then pick up data and process data incrementally. With traditional tools or platforms, that takes a lot of, you know, the mapping and scripting that we talked about earlier to do, you end up reinventing the wheel quite a bit. We can do that with a simple check box and you’ve got your incremental processing enabled. I also want to highlight something on the target side as well, if I go in to this target, I can specify a lot of details about how I want this data materialized into the data lake. I can pick the type of synchronization I want to do. I can choose if I want to do an SCD type 2, so, if we need to track history on this target as well, I can specify that, I can configure it, and that’s automatically enabled through the simple dropdown here. I can specify my path information and my format for, you know, loading this into Delta Lake in a delta format, specify keys, partitions, all the same type of power that you saw on the ingestion side, we can configure here, and then when we deploy this to production, this job will run, pick up my data sources, apply the business logic and land a new model into the data lake.
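For readers unfamiliar with SCD type 2 (slowly changing dimension, type 2), the idea is to keep every historical version of a record by closing out the old row and appending a new one whenever a tracked attribute changes. The sketch below is a plain-Python illustration of that logic under stated assumptions; the function name, the `valid_from` / `valid_to` / `is_current` columns, and the sample data are hypothetical and are not taken from Infoworks.

```python
from datetime import date


def apply_scd2(history, incoming, key, tracked_cols, as_of):
    """Apply SCD type 2 changes to history (a list of row dicts) in place."""
    current = {row[key]: row for row in history if row["is_current"]}
    for record in incoming:
        existing = current.get(record[key])
        if existing is not None and all(existing[c] == record[c] for c in tracked_cols):
            continue  # nothing changed, so no new version is needed
        if existing is not None:
            # Close out the previous version instead of overwriting it.
            existing["valid_to"] = as_of
            existing["is_current"] = False
        new_row = {key: record[key]}
        new_row.update({c: record[c] for c in tracked_cols})
        new_row.update({"valid_from": as_of, "valid_to": None, "is_current": True})
        history.append(new_row)
        current[record[key]] = new_row
    return history


# Illustrative data only.
history = []
apply_scd2(history, [{"customer_id": 7, "city": "Austin"}], "customer_id", ["city"], date(2020, 1, 1))
apply_scd2(history, [{"customer_id": 7, "city": "Denver"}], "customer_id", ["city"], date(2020, 6, 1))
# Re-sending an unchanged record does not create another version.
apply_scd2(history, [{"customer_id": 7, "city": "Denver"}], "customer_id", ["city"], date(2020, 7, 1))
```

After these three calls the history holds two rows for customer 7: the closed Austin version and the current Denver version. Hand-building this bookkeeping for every table is the kind of repetitive work the speakers describe automating.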

– All right. So, Kevin, a lot of enterprises have a lot of legacy SQL workloads, typically, you know, in data warehouses, that need to be reused or could be reused, but to convert them into Spark, you know, preserve the logic and do performance optimization is a lot of grunt work again. So, you know, perhaps you can show us how that can be automated. – That’s a great question. So, I went into the configuration panel for a pipeline and we have the ability to import SQL workloads, and we support a number of different dialects, Hive, Teradata, MySQL, Oracle, SQL Server, you can choose that, you can choose, you know, how you wanna handle double quotes, back ticks, et cetera, inside of the SQL statement. We will import this and create a graphical representation based on how we parse that SQL. So you’ll get a graphical pipeline, just like the one that you saw, you know, that was built out on the designer there. The cool thing about it is, you know, I may pull in Teradata SQL, generate that pipeline, and then once it’s in Infoworks, we can then turn around and deploy that easily into, you know, Spark as a Spark job running in Databricks, for example, right. So, a really easy way to start to migrate some of your legacy workloads over, and deploy those as jobs that are running natively inside of Databricks. So it’s a great question. One other thing I wanna highlight here is, you know, when we created the targets in the pipeline that you saw, that is automatically integrated into the data catalog as well. So I went back to my catalog view here and I moved from data sources over to data models, and I can see all the different types of models that are now being created as output from the pipelines that are being run in this environment. So, you know, again, automatically integrating this back into the data catalog, so it’s here, it’s searchable and delivering that governance framework that we talked about earlier.

– All right.

Okay, so now we’re at the last stage of the data workflow, which is really the stage about operationalization. There’s a lot of grunt work there too that needs to get automated. Now, this is especially challenging because, a lot of analytics projects get stalled because of difficulties in moving data workflows into production.

Operationalize Data – Grunt Work that can be automated

Typically, this requires a lot of rework or additional work to handle operational requirements. So as an example, you know, if I have to orchestrate, a multitude of lower level tasks, that comprise a use case, so, a few ingestions, a couple of data transformation pipelines ,et cetera, you know, those all need to get stitched together in the right sequence with the opportunities for parallelization and so on. That requires a lot of, you know, thought and effort. The other part of it is that infrastructure will have failures. So, you know, networks may go out for a little period of time. There may be, you know, servers or source systems that may have outages. And so, how can I make my data workflow fault tolerant? So, you know, resume from the last successful point or restart, you know, pause, et cetera. And, also deal with scheduling, you know, how often should this run, you know, every hour and so on. And those kinds of things, are just a lot of additional work that a lot of times people have to do using low-level tools. Even when I have this deployed into production, I typically need to monitor my performance of this data workflow, against an SLA. And once, you know, I’m doing that, I also need to be able to balance cost and performance and get access to all the relevant metrics. So once again, lots of grunt work here, and automation can tremendously help with this and I’ll turn it back to Kevin so that we can see how automation can help you. – Okay, great, thanks Ramesh. So let me move over into the workflow section of Infoworks where we can, you know, design a workflow.
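The sequencing problem Ramesh describes, stitching tasks together in the right order while exploiting parallelism, is essentially a topological ordering of a dependency graph. As a rough sketch (not how Infoworks implements it), Python's standard-library `graphlib` can group tasks into stages where everything in a stage may run in parallel. The task names below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_stages(dependencies):
    """Group tasks into ordered stages; tasks in the same stage have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output deterministic
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "ingest_loans": set(),
    "ingest_customers": set(),
    "prepare_pipeline": {"ingest_loans", "ingest_customers"},
    "train_notebook": {"prepare_pipeline"},
}
stages = parallel_stages(dependencies)
```

Here the two ingestion tasks land in the first stage because neither depends on the other, and the pipeline and notebook tasks follow in sequence. A real orchestrator would dispatch each stage to workers and add scheduling and retries on top.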

What you’re seeing here is a palette, where I can build an end-to-end workflow to drive the use case that we’ve been building here.

The first steps here are to ingest those two data sources that we configured. I can do that in parallel because there are no dependencies there, so those all run simultaneously. When that completes, then I want to execute a data pipeline, so we’ll run the pipeline that we just saw in the last demo, and then assuming the pipeline gets built and all my data is prepared and ready to go, then I want to call a node to execute a notebook over in Databricks, and that notebook will contain the advanced algorithms to go and train the ML model, to complete the use case here. These are very easy to configure. I drag the object onto the palette here, and then I can double click into any of these and quickly configure them using a series of dropdowns. So this is an ingestion task, I can pick my data source, my tables, for my pipeline, I can choose the pipeline that I want and a specific version that I want to deploy with this workflow. And then in this node, we’re actually using some script, with some conditional logic, and here I’m going to call a notebook over in Databricks and orchestrate that. And this will be fully integrated into this workflow, including error handling, so that we can literally deploy and execute this entire end-to-end flow inside of Infoworks. After I build out the workflow, I can then go in and easily put it into production and start to run this on a schedule. It can also be triggered through an API, so if you want to run these on demand, you can, and then we will track a complete history of every time this workflow runs. And so, for example, I can tell that this job failed here, I can go into this and I can quickly tell, okay, these completed successfully, I did something wrong in my code for my notebook that caused it to fail, I need to go fix that, right. After I fix that I can run the workflow again and it will pick up from where it left off.
And now I can tell that this completed successfully, but the key there is, it’s not going to go back and re-execute the ingestions and the pipelines again. It will literally pick up from where it left off and continue processing from there. So that’s a really powerful capability to help you manage your environment. We are also tracking detailed metadata about these jobs and for each individual run. And you could do a lot of reporting on that. This is a simple Gantt chart view that’s showing me, for each of those tasks within this workflow, exactly how much time each of them took, and it’s even showing me the number of Databricks units that were used for this particular task and the number of core hours that were used in the cloud, and this can all be aggregated up and you can report at the job level, at the task level, roll it all the way up to domains, and have some very powerful functionality there. The last thing I’d like to show is the notebook that we’re calling at the end of the process here. This is a Databricks notebook that’s going to train a model. This is a simple notebook here, and this is really where, you know, the data scientists and data engineers really wanna spend their time, doing the advanced work here, right. So, traditionally this notebook was much more complex because it had the logic to ingest the data, do the synchronization, or it was actually multiple notebooks that we were orchestrating in a flow, but it was much more complex, right. We’re able to simplify this down by using Infoworks, which handles and automates the ingestion and the data preparation steps, taking care of all of that, and then executing this at the end of that flow and doing it in a complete end-to-end workflow. The nice thing about it is, you know, now this can be updated, so if a month from now, the business comes back and says, “Hey, we have a new data set, we need to go retrain the models,” we don’t have to go through a lot of manual steps to make that happen.
We can now feed that data back in through this process, train the models, have it deployed, and everything is updated and ready to go. – So, what we, you know, really did is, you know, go through all of the different data grunt work challenges that occur at every stage of the data workflow.
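The "pick up from where it left off" behavior shown in the demo can be sketched as a runner that records which tasks have completed and skips them on a rerun. This is an illustration only, not Infoworks code: the completed-task record here is an in-memory set, whereas a real system would persist it, and all task names are hypothetical.

```python
def run_workflow(tasks, completed):
    """Run (name, action) pairs in order, skipping any task already in completed.

    An action signals failure by raising; the exception propagates so that a
    later call can resume at the failed task.
    """
    for name, action in tasks:
        if name in completed:
            continue  # finished in an earlier run, so do not re-execute it
        action()
        completed.add(name)
    return completed


calls = []  # records every task execution so the resume behavior is visible
notebook_fixed = {"value": False}


def make_task(name):
    def task():
        calls.append(name)
    return task


def train_notebook():
    calls.append("train_notebook")
    if not notebook_fixed["value"]:
        raise RuntimeError("bug in notebook code")


tasks = [
    ("ingest_a", make_task("ingest_a")),
    ("ingest_b", make_task("ingest_b")),
    ("pipeline", make_task("pipeline")),
    ("train_notebook", train_notebook),
]

completed = set()
try:
    run_workflow(tasks, completed)  # first run fails at the notebook step
except RuntimeError:
    pass

notebook_fixed["value"] = True  # simulate fixing the notebook code
run_workflow(tasks, completed)  # rerun resumes at the failed step only
```

On the rerun only `train_notebook` executes again; the ingestions and the pipeline are skipped because they are already recorded as complete.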

There’s a better way! Automate every step of the data workflow

And Kevin, I’m, you know, really glad you showed the example of SCD 2, because it’s something that I personally struggled with, using other tools over the last, you know, decade or so. So, it’s a great example of how automation can really make things simple. And also that, – I thought you have some, some scar tissue from the mapping and scripting. – That’s right, I’m glad there wasn’t any mapping and scripting there. (laughing) So, if we, you know, summarize, right, I mean, onboarding your data is the first step, it’s not only about ingestion, it’s about synchronizing. It’s about keeping all of this, you know, under a data governance framework, preparing your data with data transformations and business logic to create those data models that users will use for analytics, and finally operationalizing it, moving it into production and making sure that your data workflows are running in a fault tolerant way. So, if you’re wondering about how to learn more, Kevin can describe that to you. – Now yeah, absolutely. So, we have a booth at the summit this week, so we’d love for you to drop by the booth, you know, come by the booth, hang out. We can take you deeper into the Infoworks, you know, product, go and do a deep dive in everything that you saw here. We’d also love for you to get your hands on the product and play around with it. So, you know, go to and there’s a lot of resources there and you can go there to get your hands on a free trial also. – All right, so, you know, thank you again for your time and, you know, please provide us your feedback and don’t forget to rate and review the sessions.

About Ramesh Menon


Prior to Infoworks, Ramesh led the team at Yarcdata that built the world’s largest shared-memory appliance for real-time data discovery, and one of the industry’s first Spark-optimized platforms. At Informatica, Ramesh was responsible for the go-to-market strategy for Informatica’s MDM and Identity Resolution products. Ramesh has over 20 years of experience building enterprise analytics and data management products.

About Kevin Holder


Kevin Holder manages the customer solutions architecture and engineering team at Infoworks, specializing in data engineering for big data and cloud data. Prior to Infoworks, Kevin held leadership roles at Couchbase and Informatica, where he was responsible for architecting customer solutions employing NoSQL and data warehouse technology. Working with hundreds of companies over 20 years, Kevin has deep knowledge of how to scope, architect and deliver successful technical projects.