Filtering vs Enriching Data in Apache Spark


Apache Spark provides many options for joining data sets. This talk compares two approaches: enriching the data (left outer join) and filtering the data (inner join). It shows how both approaches end up with the same result and highlights how the enriching approach helped us at Capital One. We at Capital One have been heavy users of Spark since its early days. This talk details how we evolved from filtering to enriching the data for credit card transactions and what benefits we gained by doing so. As a financial institution, we are bound by regulation and need to back trace every credit card transaction processed through our engine; the talk explains how the enriching approach met this requirement. It also gives context on how other financial institutions can use the enriching approach for their Spark workloads to back trace all the data they process. We also cover the filtering approach we ran in production, the issues it had, and why we moved to the enriching approach. Attendees will take away enough detail on both options to decide which fits their use cases. The talk is most relevant for users who want to trace their data set processing at a finer granularity in Apache Spark.

Watch more Spark + AI sessions here

Video Transcript

– Hey, hello everyone! I hope everyone is doing fine and that you are all adjusting to these new norms of working from home and attending conferences virtually. This is my second virtual conference this year, and I hope you are all catching up with it. Thanks for taking the time to go through this session on enriching the data vs filtering the data in Apache Spark. I hope this presentation gives you context on two different patterns you can use in your Spark based applications. Let’s get started.

Before we dive into the technical topic, I thought I would give a brief introduction about myself.

So I am a Master Software Engineer working at Capital One. If you own any of our credit cards, when you go and swipe the card at Starbucks to buy a coffee, how much you are supposed to earn as rewards is computed by my application. I have been developing Spark based applications from its initial days. I’ll just call it out: 1.2, and with Java 2. I have been following big data technology very closely and working with it for quite a while. I also contribute regularly to the Capital One tech blog on Medium, where I have written multiple posts.

You can reach me on Twitter and LinkedIn. I have provided my handles.

Okay, so here is the agenda for today’s discussion. First, we will touch on the rewards based use case we have gone through at Capital One. That will set us up for the discussion on what kind of patterns we have used and which one gives us more leverage. Next, we will cover the filtering approach: how we used it in our use case and what kind of problems it had. Then we will see how we worked around those problems with another approach, called the enriching the data approach. Finally, we will conclude and leave some room for questions.

So before we dive into the use case: by now you may know that Capital One is pretty much in the tech space. We are a proud pioneer in the banking sector when it comes to technology. We develop most of our software open source first, and pretty soon we are going to be operating fully on the cloud and exiting our data centers.

Rewards Use Case at Capital One

And we use Apache Spark extensively for a variety of our use cases: batch, streaming, and machine learning workloads. It’s used extensively across the enterprise.

Okay, so we will get into the use case in detail. This particular use case comes from the rewards part of Capital One, and for easy understanding and brevity, I have abstracted the application into a simpler context. We apply specific account based rules and specific transaction based rules, we compute the earn, and we keep it in the database. So, you swipe your credit card at a Starbucks to buy a coffee, and you are supposed to get a particular earn. That is what this whole application does: it gets the swipe and provides you the rewards. It’s done completely in Apache Spark, and that’s the use case we are going to discuss using two different approaches. We will start off with the filtering based approach, and then we will get into the other approach and why we chose it. First, the filtering based approach.

Filtering the data Approach

So the filtering based approach uses Spark’s inner join. Joins are pretty much standard, and we use them for most of our data set computations. In the filtering based approach, we bring in two different data sets and do an inner join on a particular join criterion at each stage of the computation. To get started, I have provided a diagram of the filtering approach; we’ll get into a data based example later.

Here is how our credit card use case flows through the filtering approach. We bring in ten transactions and apply an account based filter, which filters out five transactions. That happens in memory, using a Spark inner join. Then we apply transaction based business rules. An example of a transaction based rule is: consider only purchases, not payments. Similarly, an account based rule is something like: the account is in good standing, not a defaulted account. Those are the kind of simple cases I mean by account based or transaction based business rules. The transaction based rules then filter out another three, which finally yields two transactions, and those are what we use to compute the earn. So this is what I mean by the filtering approach: we start with ten transactions and end up with two, which is a valid result, but along the way, in this simple example, we have filtered out eight transactions in memory.

So, let’s get into a data based example with the same use case.

Filtering Approach Example


Okay. So in the same example, I am taking just three transactions, and we are applying account based rules to them. Here the transaction data set is the left hand data set and the account data set is the right hand data set. We read the good standing accounts from the account data set, and when we join the two in memory using an inner join, that yields a result.

By now you probably know what the result is. It removes the transaction on the defaulted account and yields two transactions, both on good standing accounts. In the same way, we apply the transaction based inner join: we read the purchase category and inner join it with the same transaction data set, which yields one transaction. This is the valid transaction, where the account is in good standing, it is a purchase, and it is supposed to earn rewards. So this is what I mean by the filtering approach: in this simple example we start off with three transactions and end up with one valid transaction. But this kind of approach has its own problems.
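The three-transaction walkthrough can be sketched in a few lines. This is a plain-Python stand-in for the Spark inner joins, not the speaker's actual code; the IDs, column names, and the `inner_join` helper are hypothetical, and the helper assumes join keys are unique on the right hand side. In PySpark the corresponding call would be along the lines of `transactions.join(good_accounts, on="account_id", how="inner")`.

```python
# Plain-Python stand-in for a Spark inner join (hypothetical helper).
# Assumes each key appears at most once in the right-hand data set.
def inner_join(left, right, key):
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key  # rows without a match are dropped
    ]

# Hypothetical data mirroring the example: one defaulted account, one payment.
transactions = [
    {"txn_id": 1, "account_id": "A1"},
    {"txn_id": 2, "account_id": "A2"},
    {"txn_id": 3, "account_id": "A3"},
]
accounts = [
    {"account_id": "A1", "status": "good"},
    {"account_id": "A2", "status": "defaulted"},
    {"account_id": "A3", "status": "good"},
]
categories = [
    {"txn_id": 1, "category": "purchase"},
    {"txn_id": 2, "category": "purchase"},
    {"txn_id": 3, "category": "payment"},
]

# Stage 1: account based rule (keep only good standing accounts).
good_accounts = [a for a in accounts if a["status"] == "good"]
after_account_rules = inner_join(transactions, good_accounts, "account_id")

# Stage 2: transaction based rule (keep only purchases).
purchases = [c for c in categories if c["category"] == "purchase"]
after_transaction_rules = inner_join(after_account_rules, purchases, "txn_id")

print(len(transactions), len(after_account_rules), len(after_transaction_rules))
```

Note that once the second join has run, nothing in `after_transaction_rules` records that transactions 2 and 3 ever existed, which is the problem discussed next.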

Issues with Filtering Approach

We will get into that now. So, what issues does the filtering approach have?

It’s pretty simple to deal with that kind of small data set, but that’s not what typical real-world use cases look like. This is big data processing, and when you deploy this kind of filtering approach on a large data set, it becomes very hard to debug the application post deployment. Once the application is deployed, because of the sheer volume of processing, it becomes hard for us to see why a particular transaction is not making it into the computation of the earn. If we are dealing with, say, 20 million or 40 million transactions on a given day, how do we identify, at a granular level, what happened to each transaction? That’s what I mean when I say back tracing the data is hard when the computation happens in memory. In the same example, we start off with three transactions and end up with one, and we know the two non-qualifying transactions get filtered out. But how do we ensure that this is what always happens? That’s back tracing, and being from a regulated industry, we need to have the capability to back trace why a particular transaction is not making it through. With this volume of processing, what matters is not only what we computed; we also need to know why we did not compute for the other transactions.

We can argue that count is an operation everyone uses when they are rapidly prototyping a Spark based application. But counts are not an efficient way of doing things when you are dealing with a huge data set. We can argue that we could do counts at each stage in the same example, get the number of records, and reconcile. The issue is that counts only ever tell us how many transactions made it through, while what we need in this case is why a transaction did not make it through. Counts cannot provide that contextual information at each stage; they can only provide how many got processed. And everyone in the Spark world knows that count is a costly operation. It puts pressure on the driver side, and typically we tend to avoid counts, although that cost can be mitigated if we are still sticking to the filtering approach. So how did we overcome this issue with the filtering based approach without using counts? That’s when we adopted the enriching the data approach.

So, the enriching the data approach.
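To make the limitation of counts concrete, here is a small plain-Python sketch (not Spark code; the batches, IDs, and the `run_filtering` helper are hypothetical). Two batches drop different transactions for different reasons, yet the per-stage counts and the final output are identical, so counts alone cannot say why a given transaction was dropped.

```python
# Hypothetical two-stage filtering pipeline that also records per-stage counts.
def run_filtering(transactions, good_account_ids, purchase_txn_ids):
    stage1 = [t for t in transactions if t["account_id"] in good_account_ids]
    stage2 = [t for t in stage1 if t["txn_id"] in purchase_txn_ids]
    counts = [len(transactions), len(stage1), len(stage2)]
    return stage2, counts

transactions = [
    {"txn_id": 1, "account_id": "A1"},
    {"txn_id": 2, "account_id": "A2"},
    {"txn_id": 3, "account_id": "A3"},
]

# Batch A: txn 2 fails the account rule, txn 3 fails the purchase rule.
result_a, counts_a = run_filtering(transactions, {"A1", "A3"}, {1, 2})

# Batch B: txn 3 fails the account rule, txn 2 fails the purchase rule.
result_b, counts_b = run_filtering(transactions, {"A1", "A2"}, {1, 3})

print(counts_a, counts_b)    # same counts for both batches
print(result_a == result_b)  # same surviving transaction in both
```

In both runs the counts go from three to two to one, but the reason transaction 2 was excluded differs between the batches, and neither the counts nor the output capture it.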

Enriching the data approach

We will go through the same example through the lens of the enriching the data approach. Before we start, the key thing in the enriching approach is that we don’t use Spark’s inner joins. Instead, we use a left outer join. Instead of filtering the data, we keep enriching the left hand data set with information from the right hand data set. So in the same example, we bring in ten transactions and do a left outer join with the same account data. Instead of filtering out five transactions, as happened in the previous pattern, we bring only the information we need from the right hand data set into the left hand data set. We are not filtering anything out; we are just enriching the data. So we add what we need from accounts, and then the same ten transactions go through the next stage, where we add transaction based information to the same data set. This keeps the number of transactions intact. On top of that you apply the business logic, which yields the same two transactions that are supposed to earn, which is the desired result. How we arrive at that same result is what is slightly different when we adopt the enriching approach. We will go through a data based example for enriching as well, similar to what we did with the filtering approach.

So we will take the same example, where we have the same three transactions and we are bringing in the same accounts.

Enriching Approach Example

This is the same example we used in the filtering the data approach. When we do a left outer join on the same data sets, the number of transactions remains intact, but we bring what we need from the accounts dataset into the transaction dataset. That yields a result set like this.

As you can see here, we need the status of the account at the point of processing, that is, the stateful information at that point in the computation. The result set is the same three transactions, enhanced with the account status. We do the same for the transaction based stage, where we bring in the categories: we do the left outer join and enhance the transaction dataset with the category, which yields this. Here the categories are purchase, purchase, and payment, just as a simple example.

Now that we have this dataset, we can apply whatever business rules we want to capture on top of it, and for that we can use flags. This is the pattern we have adopted. When we apply the business logic there can be multiple flags; how many you maintain depends on how granular the information you want to keep is. Here I have shown only a simple flag, which says the transaction is eligible for the next stage. It is computed from the dataset: it is set to true where the status is good and the category is purchase, and false for the rest. We can go one level more granular and capture the information for each and every column as a stateful flag, then act on those flags and derive one final flag to use for the next stage of computation. So in this example, the eligible for next stage flag is set based on a status of good and a category of purchase. That yields one transaction, which is the same result we achieved with the filtering approach. But with the enriching approach, we are capturing additional information at each stage of the computation. This is what makes it easy for us to back trace, to go back and see why a particular transaction is not making it through. That kind of granular information helps us post deployment. Now let’s see what the advantages of the enriching approach over the filtering approach are.
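The enrichment steps can be sketched as follows. This is a plain-Python stand-in for the Spark left outer joins, under the assumption that join keys are unique on the right hand side; the `left_outer_join` helper, IDs, and column names are hypothetical. In PySpark the corresponding call would be along the lines of `transactions.join(accounts, on="account_id", how="left")`.

```python
# Plain-Python stand-in for a Spark left outer join (hypothetical helper).
# Every left-hand row is kept; missing matches yield None for the new columns.
def left_outer_join(left, right, key, columns):
    right_by_key = {row[key]: row for row in right}
    enriched = []
    for row in left:
        match = right_by_key.get(row[key])
        extra = {c: (match[c] if match else None) for c in columns}
        enriched.append({**row, **extra})
    return enriched

# Hypothetical data mirroring the example.
transactions = [
    {"txn_id": 1, "account_id": "A1"},
    {"txn_id": 2, "account_id": "A2"},
    {"txn_id": 3, "account_id": "A3"},
]
accounts = [
    {"account_id": "A1", "status": "good"},
    {"account_id": "A2", "status": "defaulted"},
    {"account_id": "A3", "status": "good"},
]
categories = [
    {"txn_id": 1, "category": "purchase"},
    {"txn_id": 2, "category": "purchase"},
    {"txn_id": 3, "category": "payment"},
]

# Stage 1: bring the account status into the transaction data set.
enriched = left_outer_join(transactions, accounts, "account_id", ["status"])
# Stage 2: bring the category into the same data set.
enriched = left_outer_join(enriched, categories, "txn_id", ["category"])

for row in enriched:
    print(row)
```

All three transactions survive both stages; each row now carries the status and category that were in effect when it was processed.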
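A minimal sketch of the flag pattern, assuming the enriched rows already carry `status` and `category` columns. The flag names and the `add_flags` helper are hypothetical, chosen only for illustration; the talk names only an "eligible for next stage" flag and mentions that finer-grained flags are possible.

```python
# Hypothetical enriched rows, as produced by the two left outer joins.
enriched = [
    {"txn_id": 1, "status": "good", "category": "purchase"},
    {"txn_id": 2, "status": "defaulted", "category": "purchase"},
    {"txn_id": 3, "status": "good", "category": "payment"},
]

def add_flags(row):
    # One granular flag per business rule, plus one final flag derived from them.
    in_good_standing = row["status"] == "good"
    is_purchase = row["category"] == "purchase"
    return {
        **row,
        "account_in_good_standing": in_good_standing,
        "is_purchase": is_purchase,
        "eligible_for_next_stage": in_good_standing and is_purchase,
    }

flagged = [add_flags(row) for row in enriched]

# The next stage acts only on eligible rows, but no row has been discarded.
eligible = [row for row in flagged if row["eligible_for_next_stage"]]
print([row["txn_id"] for row in eligible])
```

The earn computation sees the same single transaction as in the filtering approach, while `flagged` still holds all three rows along with the outcome of each rule.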

Advantage of Enriching over filtering

So the data from each stage is enhanced, or enriched, into the original dataset. In this example, we bring in the account information we need in order to act on that dataset, and we enrich our original dataset with it. This captures the state information at that point in the computation, which makes it easier to debug or even analyze the dataset later. We can go back and see why a particular transaction is not making it through. With the flags approach you will be able to see, “Oh, this particular flag is set to false,” which means that particular stage is the reason your transaction is not making it into the computation of the earn. We are also capturing the data columns or flags at each stage, which gives us more granular details for back tracing the transactions. In the example of transaction based or account based business rules, we are checking for good standing accounts or purchase transactions, so we know that a transaction which did not make it through falls into one of those cases. For a huge volume of processing, we know we are arriving at the desired result; the question is how we go back and reconcile the data which is not making it through.

That is essential for this kind of regulated industry, where we need to go back and see each transaction at that granular level. The enriching the data approach gives us that kind of flexibility and granular detail.

And you don’t need to do a costly count operation at each stage of the computation.

Instead, you have the details we captured as flags, or as the additional columns which got enriched at the time of processing. That’s what we can use for analysis and debugging at a later stage. So there is no need for counts at any stage.

So this is a huge win compared to filtering. Even if we use counts, they may not provide all the details we need, and they are a costly operation that you can clearly avoid if you use the enriching approach on a huge dataset.
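Back tracing with flags can be sketched like this. It is an illustrative plain-Python example, not the production logic; the flag names and the `failed_rules` helper are hypothetical.

```python
# Hypothetical flagged rows retained after processing.
flagged = [
    {"txn_id": 1, "account_in_good_standing": True, "is_purchase": True},
    {"txn_id": 2, "account_in_good_standing": False, "is_purchase": True},
    {"txn_id": 3, "account_in_good_standing": True, "is_purchase": False},
]

RULE_FLAGS = ["account_in_good_standing", "is_purchase"]

def failed_rules(row):
    # A transaction earns only if every rule flag is True;
    # the flags set to False explain why it did not.
    return [flag for flag in RULE_FLAGS if not row[flag]]

trace = {row["txn_id"]: failed_rules(row) for row in flagged}
print(trace)
```

An empty list means the transaction earned rewards; otherwise the list names the rule or rules that excluded it, which is the per-transaction answer that stage counts cannot give.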

So we had the filtering based version one in production for a few months. After deployment we found that it was becoming cumbersome for us to back trace the transactions and to get into the details of a transaction which was not making it through, mainly because of all the computation we were doing and the bunch of business rules we were applying. It was yielding the result, but we could not go back and see which transactions got filtered out in memory and why. That was the kind of problem we had post deployment, so within a few months we started iterating on our version two, using the enriching the data approach. It was in development for a few months, and we deployed it to production as our version two.

So then we made the switch to the enriching based version two in production. That pattern based Spark application has now been running successfully in our production environment for a few years, and this is the approach we are using to provide millions of miles, cash, and points as rewards to our Capital One customers.

So if you are using our credit card and making a swipe, as I mentioned in my introduction, our application is what computes the earn you are supposed to get for your credit card swipes. If you have used our credit card for any purchase, this is how your transaction flows through production, using the enriching the data approach.

– [Gokul] Thank you everyone for the opportunity.

About Gokul Prabagaren

Capital One

Seasoned Spark professional developing and managing Spark workloads at scale for Capital One.