Join this session to hear how Databricks is leveraging automated data pipelines and the lakehouse architecture to drive data insights across marketing. By using Fivetran to pull data from a number of marketing data sources such as Salesforce and Marketo, and ingesting it into the Delta Lake, Databricks’ marketing team greatly simplifies the data engineering process. This allows the team to focus on developing ML models for predictive insights that improve conversion and marketing ROI. The results are easily rendered on Tableau dashboards to empower analysts and other cross-functional teams.
Myles McDonald: Hi, everyone and welcome, thank you for being here. Today, we’re going to discuss, in detail, how Databricks leverages Lakehouse and Fivetran for Marketing Analytics. My name is Myles McDonald, I’m a technology alliances manager for Fivetran and grateful to manage our Databricks partnership today. With me, I have Chris Klaczynski, the marketing analytics manager for Databricks and a formidable Fivetran power user. We will get to know Chris in just a bit. Prior to getting started, should you have any questions for us, please utilize the chat, we have an entire team that can help answer your questions as we go along and thanks so much in advance for your engagement. Before we get started, a quick note on the agenda. First, we have introductions. As mentioned, you have myself and Chris but you’ll also hear from Craig Wright, a Fivetran partner and engineer, a little later. We will then do a quick Fivetran overview and architectural spotlight of where Fivetran fits in the data landscape. We’ll dive right into the Q&A with Databricks, to discuss their use case of Fivetran and then that’s where we’ll hear from Chris and finish with Craig Wright’s demonstration of Fivetran.
First and foremost, a little bit about Fivetran. Fivetran was born out of Y Combinator and is an incredibly fast-growing startup with over 1,800 customers. We help customers across almost all industries and verticals, from SMB to enterprise, seamlessly ingest data through 170 prebuilt, no-code, no-maintenance data pipelines. From marketing and sales, finance and HR, to complex databases, Fivetran has you covered. We are fortunate to share an investor, Andreessen Horowitz, with Databricks, and we’re able to see the tremendous technological value and synergies that we have together. So, let’s chat a little bit about the Fivetran architecture. Fivetran sits in the ingest portion of the traditional data stack architecture. Our core job is to remove the burden of traditional ETL processes and provide customers a simplified way to ingest analytics-ready data from applications, databases, events and more.
Enter Delta Lake. So, data engineers, data analysts and data scientists can spend their engineering time and focus on impacting core business initiatives, on important analytics, BI, AI or ML projects. Fivetran was a part of the initial ingest partner launch with Databricks back in February of last year and has been helping customers accelerate analytical workloads ever since. Whether customers are just getting started in their analytical journey or are knee-deep in complex analytical projects, Fivetran finds a home in all of these scenarios, helping simplify the outset and the acceleration of these initiatives. I did want to highlight that Fivetran is no stranger to the original ETL process. Building and maintaining pipelines versus automating them via Fivetran has been, and traditionally is, viewed as a competition, when in reality, the way we see it, it’s actually a joint success story. It’s not a versus, it’s a with.
As engineering teams begin or continue to run lean, efficiency is critical to success. Our customers, like Databricks, chose to use Fivetran to offload a portion of their ETL so they can focus on more impactful efforts for their business. More on that later. Moral of the story: I hope you don’t see this as a competitor, or as doing one thing versus the other, or as one thing being better than the other, but rather as an opportunity to explore an alternative or an addition that can help your business grow at scale. Now I’m super, super excited to do a quick Q&A with a Fivetran power user at Databricks. With me, I have Chris Klaczynski, as mentioned, who’s a marketing analytics manager at Databricks. Chris, I’d love for you to start by just telling us a little bit about yourself.
Chris Klaczynsk…: Thank you Myles. So, my name is Chris K and I joined Databricks in 2019, as the first marketing analytics employee. Before that, I was working at Amazon on the Kindle and Alexa customer behavior teams.
Myles McDonald: Awesome. Well, great to get to know you a little bit more, Chris. I’d love to understand and have the audience understand, a quick tidbit on what Databricks does and what your role is there.
Chris Klaczynsk…: Right. So, we provide a unified analytics platform that helps customers solve the world’s toughest problems. And what I do is, I help marketing solve their toughest problems, whether that’s evaluating how a campaign is performing, measuring the ROI of a certain channel or understanding the impact of an event on a customer’s usage.
Myles McDonald: Awesome. That sounds like pretty important stuff. Well, I guess, in that vein, what does success look like for your team and how do you typically measure that today?
Chris Klaczynsk…: So, we support marketing’s key objectives and those would be, of course, generating new pipeline opportunities, driving awareness and growing our database, as well as increasing usage at customer accounts. And we do this by providing dashboards, forecasts and various analyses.
Myles McDonald: Awesome. That’s great. So, now we’re going to really get into it. This is what… I think, I’m personally excited to learn and I know that the audience is, as well. How did you guys manage your data ingestion before Fivetran? What were you doing? What were some of the processes that you guys had in place?
Chris Klaczynsk…: Yeah, Myles. We did not have dedicated data engineering resources, so we had a mixture of Alteryx with some legacy pipelines managed by central teams. And the long and the short of it was, it wasn’t working for us well.
Myles McDonald: You must know the next question because my next question to you is, what challenges about that process did your team face, with that existing kind of data ingestion process?
Chris Klaczynsk…: Yeah. So, as we transitioned from a traditional data warehouse to Databricks, we were facing a lot of problems with our Salesforce and Marketo pipelines. A few of those were: we weren’t able to get our data into the Delta format, so we had to use Parquet. Another one was, we weren’t able to natively append data, so we had to come up with some clever hacks to do that. And any schema change would inevitably break our pipeline and we’d have to scramble and figure out how to prevent an outage.
Myles McDonald: Boy, clever hacks, that’s an interesting one for sure. I don’t know if that… that is probably the opposite of efficient, I’m assuming, for your team.
Chris Klaczynsk…: Yeah. Not something we want to do on a daily basis.
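To illustrate the schema-change problem Chris describes, here is a minimal sketch of an append step that tolerates new source columns instead of failing. It is a generic illustration only: it uses Python's built-in sqlite3 module as a stand-in destination, and the `lead` table, the `lead_score` column and the `append_with_schema_drift` helper are all hypothetical. It is not how Fivetran or Delta Lake handle schema changes internally.

```python
import sqlite3


def append_with_schema_drift(conn, table, rows):
    """Append dict rows to `table`, adding any columns the table lacks.

    Hypothetical helper for illustration; identifiers are interpolated
    directly, so it is only suitable for trusted, hard-coded names.
    """
    # PRAGMA table_info returns one row per column; index 1 is the name.
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    for row in rows:
        # Widen the table first if the source has grown a new field.
        for column in row:
            if column not in existing:
                conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')
                existing.add(column)
        columns = list(row)
        column_sql = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
            [row[c] for c in columns],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lead (id INTEGER, email TEXT)")

# First batch matches the original schema.
append_with_schema_drift(conn, "lead", [{"id": 1, "email": "a@example.com"}])
# Second batch carries a column the destination has never seen.
append_with_schema_drift(
    conn, "lead", [{"id": 2, "email": "b@example.com", "lead_score": 42}]
)

rows = conn.execute("SELECT id, email, lead_score FROM lead ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Rows loaded before the change simply carry a null in the new column, so the append continues without an outage.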
Myles McDonald: Absolutely. Well, now that we’ve talked about the challenges and obviously what you were doing before and some of the problems that you were facing, I’d love to understand, as your team started to be interested or looking into other potential solutions, specifically for data pipeline providers, what were you really looking for? What was super important for you and the team?
Chris Klaczynsk…: Yeah. So, we’re not data engineers and we’re not looking to be in the data engineering business. What we were looking for is something that was turnkey, didn’t require any coding and was, of course, reliable and easy to use.
Myles McDonald: That makes sense to me. I guess, it might be important for the group to kind of walk through maybe the current architecture, what does that currently look like for your team?
Chris Klaczynsk…: Yeah. So, we use Fivetran for our core source system data, from three different systems, Salesforce, Marketo and Google Analytics and all of the data from these systems is coming in through Fivetran. But we also utilize a lot of other product data and data coming through other pipelines from different teams. And once we have all of our data in the data lake, we are able to do different transformations and get that data into a meaningful form that we can then run analysis on or provide dashboards.
Myles McDonald: Excellent. Yeah, I love this. For me, I’m a very visual person, so it’s certainly helpful to see a little bit about how the overall flow of the architecture works. Here’s an off the cuff question for you, was this… implementing Fivetran, was this a year long process, was just super crazy for the Databricks team to get stood up? Or tell me a little bit about that quickly.
Chris Klaczynsk…: No, Myles, it was very simple. So, we followed the quick instructions, set up a cluster, whitelisted the IPs and we were pretty much off to the races within a few days.
Myles McDonald: I love that, that’s great. And I figured that’d be a quick answer. Biased, of course, because I knew it was going to be quick. That said, I think what everyone wants to know, including myself is, by implementing Fivetran, what has your team been able to accomplish? What are some of the successes and success stories you can share?
Chris Klaczynsk…: Yeah. So, once we set up these pipelines, we were able to automate a whole series of different Tableau reports and those answer marketing’s most common questions. And that allows us to divert our efforts into some more interesting projects. So, what we’ve been able to do is, provide accurate forecasting to campaign teams, to allow them to understand how they’re pacing and if their programs are performing well. And we’ve also been able to work with other teams at Databricks, on a variety of data science and ML projects.
Myles McDonald: That’s awesome. That sounds really… it sounds like it’s been really impactful for your team. Obviously, we talked about the challenges of spending a lot of time on kind of those clever hacks, in regards to the pipelines and it sounds like it removed that and those efforts, which is really great. I think it might be helpful, as kind of one of our last questions here, to discuss… I know Databricks probably wasn’t the only company in the world struggling with some of these same challenges. So, what advice would you give customers or prospects that are facing similar challenges to what we discussed today?
Chris Klaczynsk…: Yeah. Good question, Myles. If you’re struggling with certain pipelines or you’re simply just not wanting to invest data engineering resources in them, give Fivetran a try. It certainly worked fantastic for us and it’s very easy to use and simply put, it just works.
Myles McDonald: Chris, thanks so much for sitting down with us today, virtually, to discuss Databricks’ use case of Fivetran. I know I’ve learned a lot and I really appreciate the time, going into as much detail as you did about Databricks’ use case. Now, if we can make a smooth transition, I’d love to introduce Craig Wright, who’s a senior manager, developer relations and partner engineering for Fivetran. He’ll be showing us a Fivetran demonstration to help wrap us up here at the Data and AI Summit, which Fivetran has sponsored, and to really showcase what Fivetran is all about. Craig, thanks so much for being here, take it away.
Craig Wright: So, I have a few tabs open here that I’d like to share with you. The first is a Fivetran account that has already been connected to a Databricks warehouse or a Databricks setup on Amazon. So, we’re already connected to Delta Lake here and we’re ready to go, just to set up a connector. My apologies for the old logo, we are actively correcting that as I speak. I also have open my window to that Databricks cluster, so I’m going to be able to show you that the data didn’t exist, that it will exist after we’re done and that we can query it. And finally, I’m going to be connecting the GitHub connector and so, just to make sure that I have all my information available to me, I’ve opened up the GitHub documentation on the fivetran.com/docs site, just in case I had any questions while I was going through this.
Okay. So, I’m going to setup a GitHub connector, which is a fairly fast and easy connector to setup here, it’s great for demoing. I’m going to go ahead and click connector up here and this is essentially the giant list of all of the connectors we support. Rather than try to find GitHub in that list, I’m going to go ahead and just use the type down and find it. So, all I need to do here is give the schema a name, a name that will be permanent, authenticate with GitHub and the data will be ready to start flowing. There’s a few configuration options that you’ll see in a minute but I’m going to go ahead and get this going. Before I actually start this though, I want to go over to the Databricks site and demonstrate that I haven’t loaded a GitHub table yet. So, this is our testing cluster, so there’s a lot of different schemas already created here and there is a GitHub schema that was created by testing but if I search for that GitHub Databricks demo, it does not exist yet.
Okay. Let’s go through the OAuth flow. I’ve already, actually, authenticated with GitHub, so I didn’t even need to do that. And normally, what you would see there is the GitHub authorization screen. And rather than sync all repositories, because I have an absurd number of repositories attached to my account, I am just going to go ahead and sync this one repository, in the interest of time. So, we do a few connection tests for every connector. In the GitHub case, we just verify that the credentials are good, that we’re connecting to the API and that we actually still have access to those repositories. And that is it. I want to emphasize this fact: at this point, I can just go ahead and click start initial sync. And, in fact, in the interest of time, I am going to do that and the data will start syncing. That is all the setup that I needed to do for this connector to get data flowing from GitHub.
If I wanted to, there’s a couple of other places where I can do some setup. In the schema tab, I can choose whether or not to sync entire tables. With a more database like connector, I can also choose whether or not I want to sync columns or if I want to hash a column, so that the data in that column is [inaudible] on the Delta Lake side. And in the setup tab, I can control how frequently this connector syncs. It can sync as infrequently as every 24 hours or as frequently as every five minutes. Fivetran does not need a running cluster to sync data, so we’ll turn the cluster on, we’ll start the sync, we’ll turn it off. It’s possible that the customer… Even if you don’t load a lot of data, that still takes time and money to start the cluster up and shut it down. And so, it’s possible a customer may only have the need to sync every hour, just to kind of spare those costs on turning the cluster on and off.
Alternatively, they may have a ton of data that’s coming in and absolutely needs to be there as fresh as possible. So, five minutes works great for that. Okay. I’m going to pause here, while the connector syncs, we’ll join this again later and take a look at how it landed in Databricks.
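The trade-off Craig describes between data freshness and cluster start/stop cost can be sketched with simple arithmetic. The five-minute and 24-hour bounds come from the session; the two minutes of start/stop overhead per sync is a purely hypothetical figure chosen for illustration, and the function names are not part of any Fivetran or Databricks API.

```python
MINUTES_PER_DAY = 24 * 60

# Bounds stated in the session: as often as every five minutes,
# as rarely as every 24 hours.
MIN_INTERVAL_MINUTES = 5
MAX_INTERVAL_MINUTES = 24 * 60

# Hypothetical: minutes of cluster start-up and shutdown paid per sync.
ASSUMED_OVERHEAD_MINUTES = 2.0


def syncs_per_day(interval_minutes: int) -> int:
    """Number of syncs that fit in one day at the given interval."""
    if not MIN_INTERVAL_MINUTES <= interval_minutes <= MAX_INTERVAL_MINUTES:
        raise ValueError("interval must be between 5 minutes and 24 hours")
    return MINUTES_PER_DAY // interval_minutes


def daily_overhead_minutes(interval_minutes: int, overhead_per_sync: float) -> float:
    """Total start/stop overhead per day, ignoring the time spent loading data."""
    return syncs_per_day(interval_minutes) * overhead_per_sync


for interval in (5, 60, 1440):
    count = syncs_per_day(interval)
    overhead = daily_overhead_minutes(interval, ASSUMED_OVERHEAD_MINUTES)
    print(f"every {interval:>4} min: {count:>3} syncs/day, {overhead:>5.0f} min/day overhead")
```

Under that assumed overhead, moving from a five-minute to an hourly schedule cuts the daily start/stop time by a factor of twelve, which is the saving Craig alludes to for customers who do not need the freshest data.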
Excellent. We can see that the first historical sync has finished. A historical sync is the first time a connector syncs with Fivetran, and it usually takes a little bit of time. In this case, the historical sync took 15 minutes; that’s to download all of the data for one GitHub repository. Okay. So, let’s go see if that data showed up in Databricks. So, in a very trivial way, I can just open this up in the data viewer and see, oh, yes, indeed, I have the GitHub Databricks demo, and that’s actually easier to see when I use the type-down functionality, and the tables are here. Why don’t I go ahead and open up a workspace and see if we can’t query a little data out of that table. So, here I have a SQL workbook and I’m going to use Fivetran’s documentation here for a minute because I know that we have… Fivetran publishes ERDs for all of its schemas. And here it is.
So, I’m going to go ahead and open that up. And I think what I would like to do is… Ooh, that’s too zoomed in. Is, take a look at what commits have been made to this repository. I have a pretty good memory of who committed what to this repo and so, this seems like a good way to test if the data got in there. So, okay. I can see, I probably want to pull the commit sha, the author email, the author date and maybe the message. So, going back to the Databricks workbook, I can go ahead and type that in: select sha, author email, author date and message, from GitHub Databricks demo dot commit. Let’s order by the author date descending. Excellent. Let’s go ahead and give that a run. There’s nothing like a live demo to get the excitement flowing. Oops, I made a small mistake, so let’s go back and fix that. What did I do wrong? Oh, so simple and yet so relevant, so important. So, I just left out the by, so let’s go ahead and do that one more time.
All right. And look at this, this looks not unlike a git log command, where we’re seeing, essentially, the history of all of the commits to this repository but we’re now seeing it with data that has been inserted in the data lake. So, this is a very simple example of how Fivetran works but what I’m hoping to have shown is just how simple it is for data to move from a source and into Databricks Delta Lake.
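The query Craig dictates during the demo can be written out roughly as follows. This is a sketch under stated assumptions, not a capture of the demo: the schema name `github_databricks_demo` and the underscored column names are inferred from the spoken words, the sample commits are invented, and Python's built-in sqlite3 module stands in for the Databricks SQL workbook so the example runs locally. The table name is double-quoted here because `commit` is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attach a second in-memory database so the table can be addressed as
# schema.table, mirroring the "GitHub Databricks demo dot commit" name.
conn.execute("ATTACH DATABASE ':memory:' AS github_databricks_demo")
conn.execute(
    """
    CREATE TABLE github_databricks_demo."commit" (
        sha TEXT,
        author_email TEXT,
        author_date TEXT,
        message TEXT
    )
    """
)

# Invented sample rows; ISO-8601 timestamps sort correctly as text.
conn.executemany(
    'INSERT INTO github_databricks_demo."commit" VALUES (?, ?, ?, ?)',
    [
        ("a1b2c3", "dev@example.com", "2021-03-01T09:00:00Z", "Initial commit"),
        ("d4e5f6", "dev@example.com", "2021-03-05T14:30:00Z", "Add README"),
        ("0718aa", "ops@example.com", "2021-03-03T11:15:00Z", "Fix build"),
    ],
)

# The first attempt in the demo left out BY; ORDER BY is the full clause.
QUERY = """
SELECT sha, author_email, author_date, message
FROM github_databricks_demo."commit"
ORDER BY author_date DESC
"""

commits = conn.execute(QUERY).fetchall()
for sha, author_email, author_date, message in commits:
    print(sha, author_email, author_date, message)
```

Newest commits come first, which is what makes the result resemble the output of a git log command.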
So, the takeaway I wanted to leave this audience with is this simple statement: Fivetran provides Databricks Delta Lake customers with both zero-maintenance data pipelines and the ability to achieve completeness of data, with the automated ingestion of data from modern systems, no matter how the sources’ schemas or APIs change. And we think this represents a sea change in giving pipeline and analytics engineers the ability to focus on what is really important to them, which is generating business value out of this data, not spending a ton of time figuring out how to get this data into Databricks Delta Lake to begin with. We found that it has a ton of value for our customers and we think you’ll find that it has a lot of value for yours as well.
Chris Klaczynski manages data for Databricks' marketing team, and has done so since 2019. He is currently focused on using Databricks to implement predictive analytics across marketing. Prior to Datab...
Myles McDonald, an animated alliance manager of 4 years, manages some of the largest cloud providers in the world for Fivetran. He believes alliances bring hyper-growth companies together to help cust...