
BI on Data Lakes – Making it Real for Retail

Columbia Sportswear is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses.

In this session, we discuss lessons learned with Lara Minor, Senior Enterprise Data Manager at Columbia Sportswear, and how her team achieved a 70% reduction in pipeline creation time. This reduced ETL workload times from four hours on the previous data warehouses to minutes, enabling near real-time analytics. Her team migrated from multiple legacy data warehouses, run by individual lines of business, to a single scalable, reliable, performant data lake.

Lara Minor
Lara Minor is a Senior Enterprise Data Manager for Columbia Sportswear. She is an inspiring IT leader with 10+ years of experience influencing corporate growth and profitability through innovative technology strategies, dynamic leadership, and an ability to shape high-performing, multicultural teams. A hands-on coach and mentor, she most enjoys providing software developers and cross-functional teams clear vision and meaningful feedback, and motivating them to make big things happen. She is particularly strong in environments that require a mix of technical aptitude, business acumen, and communication skills to achieve major milestones. A collaborative problem solver by nature, she is comfortable navigating the most challenging projects and enjoys taking on the most complex or problematic initiatives. She is also skilled at managing within matrixed global organizations and committed to keeping communication and culture integral to the business as she leads.

Video Transcript

The Beans, Pre-Brewing

Brooke Wenig: 00:06
Welcome to Data Brew by Databricks with Denny and Brooke. This series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we’ll interview subject matter experts to dive deeper into these topics. While we’re at it, we’ll be enjoying our morning brew. I’m Brooke Wenig, Machine Learning Practice Lead at Databricks.

Denny Lee: 00:26
And my name is Denny Lee, I’m a Developer Advocate at Databricks. For this episode, we’d like to introduce Lara Minor, a Senior Enterprise Data Manager at Columbia Sportswear to discuss how Columbia transitioned from data warehouses to data lakes, achieving a 70% reduction in pipeline creation time and reduced ETL workload times. Lara, why don’t you tell us a little bit about yourself and how you got into the field of big data?

Lara Minor: 00:52
Hi, I’m Lara Minor and I’ve been working at Columbia for about nine years now. And during that time, we’ve done a lot of the traditional BI reporting. And a few years ago, we decided that we needed to make a change up in our platform. Columbia was moving towards the cloud and was initiating a lot of cloud movement. And at the same time, we decided that we needed to do some work on our enterprise data warehouse. So, that’s what spurred us to look at a cloud solution.

Brooke Wenig: 01:26
Very interesting. So, I think as a follow-up to that, we would love to know: what are the business problems that you’re trying to solve now that you’re migrating all of your workloads into the cloud?

Lara Minor: 01:35
Yeah, so traditionally we’ve done a lot of BI reporting across the company, whether that be looking at sales across the company, forecasting, inventory purchasing, and much of what we’re doing today is the same, just on a bigger level. So, with coming into the cloud, we’ve been able to have different business units be more involved with what kind of data is coming in and what we’re doing with that data to create our data assets. And at the same time, we’ve been able to bring on a couple of analytic teams that are now starting to get into more types of analytics, like churn rates and inventory turnovers and lifetime value of a customer.

Lara Minor: 02:19
So, it’s really exciting to see the movement that we’re just starting to see now towards that, now that some of those basic BI problems are solved and people have access to their data like they’ve never had before. They never had that in our previous platform and they do today. And so, a lot of business involvement like we did not have previously. So, that’s been really exciting for us.

Denny Lee: 02:45
That sounds really interesting. From the sounds of it, you broke down a lot of silos by switching to the cloud. What were the kinds of data platforms that you were using before that basically had created those silos? What motivated you to migrate outside of just breaking those things down?

Lara Minor: 03:05
So, we had a pretty traditional platform: an appliance-model data warehouse in a data center under shared control between that vendor and us. And then, we had a very popular ETL tool. So, it was very much like: we have this group of developers that know how to use this ETL tool, and then we have these people that create data assets, and this group works together to move that data in. And from all the conferences I attended and everything else, six, eight, nine years ago, we had the same problem as everybody else.

Lara Minor: 03:43
The business would come and say, “Here’s my requirements and I want this report.” And then we would spend six to nine months creating that report. And then by the time we got it to them, they were like, “It’s not quite what I wanted.” And you would have to go back to the drawing board and move that through. That process was very slow. And on top of that, we couldn’t give them direct access to their data.

Lara Minor: 04:07
And so, really what needs to happen, in my experience with BI, is the business has to see the data before they can tell you what they want to do with it. They want to be able to look at it and manipulate it and get some ideas of what they can do with it before they create this great report that they’re going to use and broadcast out to a thousand users.

Lara Minor: 04:28
So, what I would say is one of the benefits now is we can provide that access. Our platform is that we do all of our compute on data lakes. In BI reporting, you can create a relational model or a dimensional model, and then you can put reporting on top of that. We create all of those models, which we call assets, out on the data lake, and then they get pushed up to the data warehouse, where the reporting takes over. So, the BI layer is actually on that data warehouse.
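The flow Lara describes, building dimensional assets on the lake and then publishing them to the warehouse for BI, can be sketched in miniature. The following is an illustrative pure-Python sketch, not Columbia’s actual pipeline; all table and field names are invented for the example:

```python
# Illustrative sketch of lake-side modeling: raw rows become dimensional
# assets, which are then "pushed" to a warehouse the BI layer reads from.
# Table and field names are hypothetical, not Columbia's actual schema.

raw_sales = [  # landed on the "lake" from a source system
    {"sku": "JKT-01", "product": "Rain Jacket", "qty": 2, "price": 120.0},
    {"sku": "BTS-07", "product": "Hiking Boots", "qty": 1, "price": 150.0},
    {"sku": "JKT-01", "product": "Rain Jacket", "qty": 1, "price": 120.0},
]

def build_assets(rows):
    """Create a product dimension and a sales fact keyed to it."""
    dim_product = {}
    for r in rows:
        dim_product.setdefault(r["sku"], {"sku": r["sku"], "name": r["product"]})
    fact_sales = [
        {"sku": r["sku"], "qty": r["qty"], "revenue": r["qty"] * r["price"]}
        for r in rows
    ]
    return dim_product, fact_sales

def push_to_warehouse(warehouse, dim, fact):
    """The compute happened on the lake; the warehouse just serves BI."""
    warehouse["dim_product"] = list(dim.values())
    warehouse["fact_sales"] = fact
    return warehouse

dim, fact = build_assets(raw_sales)
warehouse = push_to_warehouse({}, dim, fact)
print(sum(f["revenue"] for f in warehouse["fact_sales"]))  # total revenue
```

The point of the split is that business users can inspect the modeled assets on the lake before anything reaches the warehouse layer.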

Lara Minor: 05:02
But we can provide access for the business onto that data lake area before it ever gets to the warehouse, allow them to take a look at it and see what they want to do with that data, provide some better requirements for how they’re going to slice and dice it before we waste time doing that. So, there’s much more collaboration and all of that is happening on the Databricks platform.

Lara Minor: 05:27
So, we have a resource where they have read access. They can log on, access this data, take a look at it, and work with us. We have a lot of SMEs in our company that know how to do queries and such, so they can get on there and do that. Another benefit of the Databricks platform is that SQL has been a standard for what, 30, 40 years? So you have a lot of people that know how to use it. A business user doesn’t have to know Python or Scala in order to come on and mess around with data on Databricks; they just have to know SQL, and a lot of business SMEs do. We get them on there, they can take a look, and we can move forward with that better collaboration between the business and us. And that’s what I see really growing right now: the number of requests that we have for direct access to that data on Databricks, so that they can get in there before we push it to a dimensional BI layer.

Brooke Wenig: 06:31
So, I love all the enthusiasm about Databricks, especially the unsolicited enthusiasm. How did you convince the business leaders that they needed to look at the data themselves before they could ask the question about what they want to use data to solve?

Lara Minor: 06:45
I don’t think we had to convince them; they have been asking. We just had not been able to provide a platform. So, I think everybody recognized the problem with this process of producing reports. When we’re creating certain sales reports, they have to be dead on, they want to see them within a dollar, those very tight requirements. And when you’re doing that kind of reporting, it makes it difficult to be more flexible and more agile and quick. We all knew that if we could get them that access up front, then by the time we get to that tightened-down layer, we can be more agile, because we can do more experimentation before we ever get there. And it wasn’t a hard push. The business involvement is happening naturally; we’re not soliciting. It’s just starting to be part of the process. And they’re asking, “Can I get on and look at it before we do this?” And we’re like, “Great. Yes, you can.”

Brooke Wenig: 07:58
So what were the biggest issues with the migration to the cloud? Were they security? Were they technical? Were they political? What were some of the factors that people were opposed to when transitioning to a cloud-based solution?

Lara Minor: 08:09
Yeah. So, Columbia did it a little bit different in that we moved everything to the cloud. We didn’t take a piece. We didn’t say, “Hey, we’re going to do these log files and move those to the cloud,” or, “We’re just going to do our data science and move that to the cloud.” We did the whole thing. We decided to get off our on-prem platform, which was facing an expensive hardware upgrade that we didn’t want to do, and we just moved everything.

Lara Minor: 08:41
And so, what that means is all of that data, a lot of data that you use for your BI reporting across the supply chain, all of that is now in the cloud. And when we did that, yeah, we had some surprises. That’s one of the things I talk about: tech is tech, eventually you’re going to figure it out. We’re all in technology, we’ve been doing it for a long time. You’ve got people on your team that are going to do that. Here are the things that caught us. First of all was that security layer.

Lara Minor: 09:15
So, that security layer ties in directly with data governance, and about the same time that we started moving in, we started a data governance team. And really, when we were moving all this data over, we put together communications and such about how this data was going to be open. Because previously, on that other platform, we had a report, and these 10 people had access to that report, or these 50 people, or these thousand people, whatever it was. It was kind of controlled; there was a process to get access.

Lara Minor: 09:49
Now that we have all this data that we’ve provided kind of open report access to, the business came in and there were concerns about that. How do we define something as restricted? How do we make sure the right people have access to consumer data so that we’re in line with the laws that are out there? And then, do we want our sales data to be available to everybody in the company? Or do we want to restrict that?

Lara Minor: 10:16
So, a lot of conversations about restricting data or open data that went all the way to the top at Columbia. And so when you start talking about that, it’s like, “Whoa.” There is a lot of talking and a lot of coming together that had to happen to make sure that we were meeting everybody’s expectations and then putting that in line with your security model.

Lara Minor: 10:40
So, our data lake security, our data warehouse security, and our Power BI security: there’s security in all of these different layers, and all of those layers have to meet the agreements that we’ve made for data security. Whether those come from the security team, because of laws or protections, or from the business team, because, “Hey, I don’t want everybody in the company to have access to gross margin.” So, those two things coming together, a lot of time spent there. And it’s not as clear-cut as the tech. When you get past the tech, you’re going to have your problems and you’re going to struggle through them. But when you start talking about what the CEO versus the COO versus the CIO versus my boss wants to do with data being open, and who it is open to, that’s a much different subject area and a lot harder.

Denny Lee: 11:37
No, that’s very true, actually, exactly to your point. It’s not the tech that usually bogs you down, it’s the process around it that usually bogs you down. And so I’m glad you provided a little bit of light to that. So, switching over to the tech, just because that way it’s a little easier on all of us to discuss right now, how long did it take for you to build this new system? You’ve alluded to the fact that you basically were building it on top of data lakes. So why don’t you describe a little bit about what that system is now and how long did it take to build it?

Lara Minor: 12:12
Yeah, it’s really hard to put a number on how long it took to build it. So one of the previous questions somebody asked is, “How long does it take to get Databricks up and going?” And I was like, “Oh, I don’t know, a couple hours.” So, you can have a workspace up and be processing some data pretty shortly here. But there’s a difference between that and like I said, we moved everything over. So that includes our scheduling and everything else.

Lara Minor: 12:36
And you might say, “Scheduling?” Well, yes, because similar to other companies, all of our source systems are still running on batch. So, it’s great that we could run real time, but with these batch systems, I’ve got to wait for SAP to go through a certain amount of their processes that happen once a day in order to pick up that data when it’s done. So, scheduling matters. And when a job fails, how am I going to know that things are failing and that data is not processing?
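The gating Lara describes, waiting on an upstream batch system and surfacing failures rather than letting them pass silently, can be sketched roughly like this. This is an illustrative pure-Python sketch; the completion-marker mechanism, source names, and messages are all hypothetical:

```python
# Illustrative sketch of gating a pipeline run on an upstream batch system
# (e.g. waiting for a nightly SAP process to finish) and surfacing failures.
# The marker/check mechanics are invented for the example.

def upstream_done(markers, source):
    """In practice this might poll a completion table or a marker file."""
    return markers.get(source, False)

def run_job(markers, source, job, alerts):
    """Run `job` only if the upstream batch finished; record problems."""
    if not upstream_done(markers, source):
        alerts.append(f"{source}: upstream batch not finished, skipping run")
        return None
    try:
        return job()
    except Exception as exc:  # a failed load should page someone, not vanish
        alerts.append(f"{source}: job failed: {exc}")
        return None

alerts = []
markers = {"sap_nightly": True, "pos_feed": False}
result = run_job(markers, "sap_nightly", lambda: "loaded 1,204 rows", alerts)
run_job(markers, "pos_feed", lambda: "never runs", alerts)
print(result, alerts)
```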

Lara Minor: 13:04
So there’s a big difference between starting up a workspace and getting data in, versus processing all of your reporting data according to SLAs that have already been set for reporting. We started in 2017 with a POC, and we started talking to Microsoft about planning this out. We did not get introduced to Databricks until 2018.

Lara Minor: 13:36
So, what I like to say is: from the time that we were introduced to Databricks to the time we went live with our first major piece was about April to September. So, six months or so, yeah. But we had a lot of work going on behind that. And also, at the same time, Microsoft was really moving along with some of its services; Data Factory is a big one. We use Data Factory to get data from all of our source systems, and that was going through major changes at that same time.

Lara Minor: 14:10
And so, we were constantly in touch with the Data Factory team trying to get it to where it needed to be for our usage. So, a lot of that was going on in that 2018 timeframe. But yeah, by 2018, we were pushing major things into production.

Brooke Wenig: 14:30
So, that’s actually a very fast timeline. What has been the feedback from your team so far? Is there anything that they miss about their on-prem world? How especially has COVID changed things now that everybody’s remote? Does it enable easier access now that everything’s on the cloud and in Databricks?

Lara Minor: 14:46
Yeah. That piece is nice, that we don’t have to be on VPN to access most everything. Once we get into the data warehouse, we do. But sure, everything’s super easy, and people can be out and still support something if they need to. The team was really pleased with the platform. I don’t know that we would have continued with it if everybody wasn’t. But when we switched everybody over, there were just a lot of comments about the speed; it’s much faster. If you [inaudible 00:15:21] on the data lake and using Databricks, you can process data quite a bit faster than you can on a traditional BI platform. So, pretty hands down, everybody that came over, even folks that have worked on SAP and BW, talked about the speed as well. We get a lot of our data from SAP and those systems are fast, and the data lake is fast. Everybody’s very pleased. Yeah.

Denny Lee: 15:54
So, you had mentioned before, there was like a 70% reduction in pipeline creation time and reduced ETL workloads. And that probably has a little bit to do with your usage of the data lakes, the Azure Data Factory that you called out, Databricks, and I believe also Delta Lake as well. But in addition to the performance, how is it different running and operating these new systems? Because I’m sure the people who were working for you had a big switch in terms of how they used to do things on those on-prem systems, and then they had to both migrate to the cloud and use these new systems. I’m just curious, how was that shift for them?

Lara Minor: 16:36
Okay. Yeah. So on that 70%, we had a really big jump in how fast we could bring data in, and most of that has to do with the way that we bring data in. When we get it into the Delta Lake from our source systems, we were able to configure that, so that it’s just a matter of configuration. You’re not writing new code every time you’re bringing in source data; you just add it to the configuration and, bam, everything’s getting loaded, which is really nice. Our compliments to our tech lead for creating that.
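The “configuration, not code” ingestion pattern Lara credits to her tech lead is often called metadata-driven loading: each source is a config entry, and one generic loader handles them all. This is a hedged illustration in plain Python; the field names and load modes are invented, and a real version would read from the paths and write Delta tables rather than return strings:

```python
# Illustrative sketch of configuration-driven ingestion: new sources are
# added as config entries, not new code. Field names are hypothetical.

SOURCES = [
    {"name": "orders",    "path": "/lake/raw/orders",    "mode": "incremental", "key": "order_id"},
    {"name": "inventory", "path": "/lake/raw/inventory", "mode": "full",        "key": "sku"},
]

def load_source(cfg):
    """One generic loader; behavior is driven entirely by the config row.
    A real pipeline would read cfg['path'] and merge/overwrite a Delta table."""
    if cfg["mode"] == "incremental":
        action = "merge on " + cfg["key"]
    else:
        action = "overwrite"
    return f"{cfg['name']}: {action}"

def run_ingestion(sources):
    return [load_source(cfg) for cfg in sources]

for line in run_ingestion(SOURCES):
    print(line)
```

Onboarding a new source then means appending one dictionary to `SOURCES`, which is what makes the 70% reduction in pipeline creation time plausible for the ingestion layer specifically.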

Lara Minor: 17:15
We still have work where we have to create the data assets. So, when you’re doing BI reporting, you get all this source data in, but then there’s a lot of data manipulations that have to happen in order to get it into either a relational or a dimensional model, applying quite a bit of business logic.

Lara Minor: 17:31
And that piece is where the developers come into play. And we definitely haven’t seen a 70% decrease in time there, because that’s where the work is happening; somebody has to code that, and code it according to some requirements. But what was nice about the platform is, like I said, everybody knows SQL. So when we started, everybody was doing all of their code in Spark SQL while they absorbed how this new platform worked. And slowly, over 2019 and part of 2020, people were shifting to Python. Now everybody is developing in Python, and they’re sharing back and forth how to better do things in Python and how to get faster performance when you use this versus that. And so we’ve seen a lot of growth in that.
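The SQL-to-Python shift Lara describes usually means expressing the same transformation two ways. As a stand-in (using the standard library’s sqlite3 rather than Spark SQL, with made-up data), here is one group-by aggregation written in both styles:

```python
# Illustrative stand-in for the SQL-to-Python shift: the same aggregation
# written once in SQL (sqlite3 here, in place of Spark SQL) and once in
# plain Python. Data and names are invented for the example.
import sqlite3
from collections import defaultdict

rows = [("JKT-01", 2), ("BTS-07", 1), ("JKT-01", 3)]

# SQL version: group quantities by SKU
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sku TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(con.execute(
    "SELECT sku, SUM(qty) FROM sales GROUP BY sku ORDER BY sku"))

# Python version of the same group-by
totals = defaultdict(int)
for sku, qty in rows:
    totals[sku] += qty
py_result = dict(sorted(totals.items()))

print(sql_result == py_result, sql_result)  # both give {'BTS-07': 1, 'JKT-01': 5}
```

A team fluent in the SQL form can verify its Python rewrite by checking that both produce identical results, which is roughly how a gradual migration stays safe.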

Lara Minor: 18:23
And we even have a little mini project right now where we’re going to go back and look at some of the things that we originally developed and try to get those times down as a cost savings, applying performance improvements that people have learned over the last year. So, you’ve got that, and that’s what makes it easy for people to come onto our platform as well.

Brooke Wenig: 18:48
So, how did everyone switch over to Python? I’m very intrigued by that because often I find that people that don’t have a programming background or a computer science background, they feel very comfortable in SQL and they’re hesitant to switch to any other programming language. What were the resources that you provided them or what was the motivation for them to switch to Python?

Lara Minor: 19:06
Yeah, I’m going to be honest, I didn’t provide them any resources. I would say that we definitely hired a couple of people that weren’t traditional ETL developers. We made a shift in how we hired. We’re not looking for that ETL tool skill anymore; we’re looking for folks that can code. I think it was back in 2017 that I hired the first person that knew C#. I was like, “I hired a coder.” Before that it was, “I can’t hire this person, I need people who know these tools.” And those ETL developers, though, are very sharp people as well. So, they’re teaching each other, they’re taking Python classes on LinkedIn Learning and out on the web, and they’re looking at examples and then going forward.

Lara Minor: 19:56
So, I think everybody in this day and age in tech has to be able to do that. You see a lot of that. It’s not like 10 years ago, when we would send people to a one-week class. You just don’t see that much anymore. There are so many resources available on the web. A lot of the people that came from the original team were ETL developers and used the tools, but they’ve made that shift naturally. So, compliments to them for learning.

Denny Lee: 20:32
I completely understand the progression that you went through. In one of my previous jobs, that’s exactly what happened. My first set of hires were specifically Python developers as well. I had a bunch of ETL and SQL developers, and then, exactly like you called out, they were helping each other out, so that the ETL developers themselves actually started learning Python. And it was great watching that progression. Then, when we started talking about data science at a later junction, they were like, “Oh, cool. We can do it now.” I was like, “Yep. That’s the point.” So, my original question is: what do you think is the hidden value of your migration? I was almost answering it myself, saying, “Oh, it’s probably getting everybody to learn Python.” But I’m sure you have other thoughts on the hidden value of your migration here.

Lara Minor: 21:20
The hidden value of the migration is… I have several things. When I did our first estimates on cost, I estimated what it would cost to move our current stuff over, and then what it would be if we saw growth at such-and-such a percent for seven years; I had to project seven years out. And what would it be if we saw explosive growth? I called it explosive, and I put some big numbers on there, but that’s exactly what we’ve seen: that explosive growth number. I didn’t know that we would see that.

Lara Minor: 21:58
We had business units that came on, we had money thrown at us: “Hire people, and bring all this stuff on.” Things have changed a little bit with COVID, naturally; we’re a retail company, so there’s been a little bit of containment there. But back in 2019 was just a really big year. We brought in all kinds of new data and created all kinds of new data assets. I did not necessarily see that coming, and that was wonderful. It was a scramble for us to try to keep that data warehouse really locked down and make sure that we weren’t creating multiple models where we didn’t need them, and things like that. But it was a scramble in a good way; we were really adding a lot of value with all those additional data sources.

Lara Minor: 22:51
But one of the really great things to watch is that they funded a little DTC analytics team (DTC is Direct to Consumer in retail), and that’s where you’re really getting into those consumer questions. So, we have an analytics team out there that works steadily on bringing in other data sources that aren’t really required by the rest of the enterprise. And they’re really doing a lot of work on consumer analytics.

Lara Minor: 23:22
So, what is the lifetime value of a customer, or what are the marketing plans? They work with the marketing team, and it’s just like a little under-the-covers team that’s working away, doing all these great things with consumers, and I’m really excited about that. I’ve always thought that consumer analytics is where you’re going to get some serious ROI. So, I’m excited to see what that team can produce. Like I said, we all have brands of clothing that we love, where they’re really doing that interaction with us and paying attention. And that’s what this team can do. They can work with these consumers so that Columbia is their brand, and I’m looking forward to seeing the output as they continue down that line. There’s a little hidden value there.

Brooke Wenig: 24:09
So, along with the reduction in pipeline creation time, it sounds like this has enabled you to go after new use cases. What’s one of the new use cases that you can solve using data lakes that you couldn’t solve using your legacy system?

Lara Minor: 24:23
I would say… So, a lot of things that I read talk about using your data lake for big data. It’s always that kind of thing. But what I’ve seen as a benefit of the way that we’ve done it, having all of our compute on the data lake, is that with our BI reporting, most of it is very exacting: you’ve got to be rock solid on the numbers and the things that you’re producing, accuracy and SLAs and everything else. And then you can take some other data sources. We haven’t done it yet, but we could bring in Facebook, or we could bring in [inaudible 00:25:07], we could bring in… weather actually already is coming in, but different data sources.

Lara Minor: 25:12
And you can take those less accurate data sources and combine them with these data assets that have already been created. So, for example, if somebody’s coming on to do data science, they don’t have to take the source data and start from scratch, like, “Oh, but I forgot about returns, or I forgot about this. I have to calculate this and this.” Because that’s already been calculated in the data that’s produced for those reporting assets.

Lara Minor: 25:42
So, they can use those business assets that have been created with all of that due diligence, and they can join them up with any other kind of data source that doesn’t necessarily have that accuracy or that control, and they have that starting point. And I think that that’s a real benefit on the lake.
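The pattern Lara outlines, joining a curated asset to a less-governed feed rather than recomputing business logic from raw data, can be sketched like this. All data, field names, and the weather example are invented for illustration:

```python
# Illustrative sketch: join a curated, already-validated asset (net sales
# with returns etc. applied upstream) to a less governed external feed
# (weather), instead of recomputing business logic from raw source data.
# All data and field names are invented for the example.

curated_sales = [  # business logic already applied by the reporting pipeline
    {"date": "2020-10-01", "store": "PDX", "net_sales": 5400.0},
    {"date": "2020-10-02", "store": "PDX", "net_sales": 6100.0},
]
weather_feed = [  # external source with looser quality guarantees
    {"date": "2020-10-01", "store": "PDX", "rain_mm": 12.0},
    {"date": "2020-10-02", "store": "PDX", "rain_mm": 0.0},
]

def join_on(left, right, keys):
    """Simple inner join on the given key fields."""
    index = {tuple(r[k] for k in keys): r for r in right}
    return [
        {**row, **index[tuple(row[k] for k in keys)]}
        for row in left
        if tuple(row[k] for k in keys) in index
    ]

combined = join_on(curated_sales, weather_feed, ["date", "store"])
print(combined[0]["net_sales"], combined[0]["rain_mm"])
```

The curated side carries the accuracy guarantees; the analyst only has to vouch for the external feed they brought in.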

Brooke Wenig: 26:02
So, is there any advice that you would give people when they’re trying to plan out this migration to the cloud, to data lakes, enabling Python developers? What advice do you have for other folks that are planning the same transition that Columbia went through?

Lara Minor: 26:14
Yeah. Well, as far as the tech goes, I would do a really solid POC. I would make sure that you can get something in there and try it out, and make sure that you can get the value from it that you’re looking for. Like I mentioned before, there are a lot of different ways that you can go at this. You could decide just to do a small piece of something, or you can decide to go all in, and there’s everything in between. We were in that all-in camp.

Lara Minor: 26:41
We did do a POC. There were a couple of things at the end of that POC where we were like, “We can’t really seem to schedule jobs.” And I don’t remember, it was 2017, and there was one other thing. And I was like, “Oh, I’m sure we’ll be able to figure that out.” Figuring that out was really hard. So, I learned a lesson there: make sure your POC has some checkpoints, and don’t sign off on a POC until you’ve hit them. That was our experience with that.

Lara Minor: 27:12
That’s all changed now; the system is completely different. That’s what you should expect with cloud as well. That’s what I recently told my bosses: “We need to allocate some resources to constantly push us forward on the new capabilities of the platform, whether that be Databricks or the data warehouse or whatever resource we’re using.” And then, the other surprise was that this was more of an enterprise effort than I anticipated. I thought we were just going to be moving what we had before from one platform to another. But it definitely blew up into much more of an enterprise project, in a great way. I can’t tell you how positive that’s been. And it did bring about all those questions with governance and security and things like that.

Lara Minor: 28:06
So, I think you’ve got to figure out which direction you want to go. I’ve talked to other companies that just do data science on their Azure platform and have kept their BI where it was. And I’ve talked to companies that want to move the whole thing over. You’ve got to figure out which direction you want to go, and then expect lots of surprises, as with any migration.

Denny Lee: 28:35
Well, thanks for providing this sage advice for migrating enterprises to the cloud, data lakes, and lakehouses. This has been a great session around building BI on data lakes and making it real for retail, with Lara Minor from Columbia. I really appreciate you taking time out of your very busy schedule to join our Data Brew vidcast today. So, thanks very much.

Lara Minor: 28:56
Thanks for having me.