Unlock Mainframe and IBM i for the Data Lakehouse

You’re moving more data and workloads to Databricks – but what do you do with the application data inside your mainframes or IBM i systems? You cannot ignore this critical data – but making it usable in Databricks is easier said than done. That is why you need an approach to data integration that helps you access this data with no disruption to you or your business.

In this session, learn how Precisely and Databricks have partnered to help you eliminate legacy data silos and make high-value, high-impact data available for all your analytics and machine learning projects.

You will leave this session with:

  • An understanding, through an engaging demo, of how to quickly ingest data from legacy sources to the cloud for use within Databricks
  • Best practices for modernizing ETL processes and reducing development costs
  • How one customer used Precisely and Databricks to empower business users with the most up-to-date information from their mainframe by populating Delta Lake

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– [Instructor] Hello everybody, and welcome to the session where we’ll talk about how to unlock your mainframe and IBM i data for the Data Lakehouse with Precisely and Databricks. My name is Ashwin Ramachandran and I’m a Senior Product Manager at Precisely. Now, let’s start the discussion today by talking about why this data matters. Why does the mainframe matter? Why does a system like an IBM i matter to your business? Now, one of the key pieces I’d like to point out here is that the data that lives inside some of these traditional or legacy systems, be it a mainframe, an IBM i server, a legacy data warehouse, or a relational transactional database, continues to adapt and deliver increasing value with each new wave of technology. So if you’re, say, an insurance company and you’re doing a lot of claims processing, that claims data is a treasure trove of information that you can’t afford not to analyze, that you can’t afford not to include as part of your ML models as you try to be more predictive about your business. Now, look at the systems themselves and some of the key trends in the market. One really astounding fact here is that over 70% of executives continue to see their mainframes and IBM i’s as extremely important, because they have customer-facing applications that depend on those platforms. Now, if you look at the mainframe specifically, many people sometimes think about the mainframe as this long-running platform, not necessarily gaining importance over time, but it’s actually the opposite. In fact, in 2019 alone, the transaction volume on mainframe systems grew 55%.

Again, more data, more potential insights, and it’s vital that you mine those insights. Now, in terms of these legacy systems as a whole, when you step back and take a look at the larger picture, one of the really interesting facts is that, over the past decade, there’s been over a trillion and a half dollars of enterprise IT spend to support analytics workloads and things like that on some of these legacy platforms. So the key point that I’m trying to make here is, while we sometimes term these platforms as legacy, they really are key critical drivers of business success. And moreover, harnessing that data in your data projects is more important than it’s ever been before. Now, why is it more important?

What happens when legacy data is unlocked?

Well, a couple of reasons. Number one, when you’re actually able to unlock this data, that leads to enhanced BI and more complete analytics. So again, let’s say you’re a bank. And as a bank, your job is to build some sort of AML, Anti-Money Laundering, application using some machine learning techniques. Well, the key thing that you’re gonna need to harness is actual transactional data. Sometimes it may require you to pull that data as it’s generated. When you talk about something like an ATM transaction or a credit card swipe, more often than not, that transaction is going through a mainframe system. To actually build more comprehensive BI and more comprehensive analytics, or even machine learning projects for that matter, the key piece here is to make sure you’re unlocking that legacy data so that it can be properly combined with the rest of the data you’re collecting in your organization to get more complete, holistic insights. Now, a natural follow-on is that when you include this legacy data and unlock it, you’re getting improved data delivery. You can discover more about your business, ask better questions about your business, and base the answers to those questions on data. Now again, when you’re actually unlocking this data, one of the key challenges to keep in mind is: as I democratize my data, how do I do that in a way that’s secure, that’s gonna provide the types of governance and lineage information that I need? And this is vital. Again, your decisions and your business insights that are based on data are only as good as the data that’s feeding those insights. And having that clear line of sight is really, really important when unlocking that legacy data.

And then again, once you really harness all of the data at your disposal, you’re actually able to fuel more projects based on it. So with that said, I’d like to talk a little bit about what we at Precisely bring to the table with our solution, Connect. And Connect really is the best solution for accessing and integrating this type of data with cloud frameworks in real time. We can efficiently and quickly connect to a variety of platforms, whether it’s mainframe, Hive, a relational database or other, integrate that data, and do it in a future-proofed way, so that you can design your pipeline once and deploy it in any environment. So as you build and mature your data architecture, your integration architecture is along for the ride and you don’t need to redevelop things. As a result, development times are reduced from weeks to days. And that’s just because, with Connect, users have a graphical interface; they can build their pipelines in the graphical interface and deploy them in whatever environment they need. And the last piece Connect offers goes back to the key challenge we talked about before: security and governance are also (mumbles) with Connect. So users can securely transfer their data and get the appropriate level of data lineage they need, especially as that data moves from system to system. Couple that with our unrivaled scalability and performance and you have an enterprise-grade solution for integrating your most critical data assets with your critical machine learning platforms.

Now together, Precisely and Databricks pose a great and unique offering for our customers.

Precisely + Databricks: unlock legacy data

On the Precisely side, we have over 50 years of experience in unlocking data from some of these legacy data sources. We know it, we understand it, we can deliver that data within the tight SLAs that are required. Users can build these visual data pipelines, and they can modernize their batch ETL processing at scale, leveraging our high-performance engine. And we also have Change Data Capture capabilities, so that users don’t necessarily have to do full data refreshes multiple times a day. Now, Databricks is a fantastic partner because, with their Unified Data Analytics Platform, Databricks offers 10 to 100 times faster core engine processing on petabyte-scale data compared to open-source Spark. They advertise the lowest TCO, to do with Auto Scaling and auto-configuration capabilities, which stands in contrast to some of the other platforms that are on the market today for machine learning and advanced analytics. And Databricks brings to the table this unified, collaborative experience for data engineers and data scientists on a single platform. So when you couple what Precisely brings to the table, with our high-performance connectivity and data transformation capabilities, and what Databricks brings to the table, with this scalable platform that we can deploy our workloads on in a scale-up fashion, we can begin to see why delivering this data inside of these tight SLAs and in a future-proof way becomes so compelling. Now, at a high level, Connect and Databricks work very closely together.
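To make the Change Data Capture idea above concrete, here is a minimal sketch in plain Python of what applying CDC events to a target table means: only insert, update and delete events are merged, rather than reloading the full table. The event format and `apply_cdc` helper are hypothetical illustrations, not Precisely Connect's actual implementation.

```python
# Minimal sketch of Change Data Capture (CDC) apply logic: instead of a
# full refresh, only change events are merged into the keyed target table.
# The event format here is a hypothetical illustration, not Connect's API.

def apply_cdc(target: dict, events: list) -> dict:
    """Apply insert/update/delete events keyed by record id."""
    for event in events:
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            target[key] = row          # upsert the new image of the row
        elif op == "delete":
            target.pop(key, None)      # remove the row if present
    return target

# A day's worth of changes touches 3 rows instead of reloading every row.
table = {1: {"balance": 100}, 2: {"balance": 250}}
changes = [
    {"op": "update", "key": 1, "row": {"balance": 80}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"balance": 500}},
]
apply_cdc(table, changes)
print(table)  # {1: {'balance': 80}, 3: {'balance': 500}}
```

This is the same upsert/delete semantics that a Delta Lake MERGE statement expresses at scale; the point is that the data moved is proportional to the changes, not the table size.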

Connect and Databricks

So on the Connect side, we can pull in data from a variety of different platforms or sources. Like I said, if you had data residing on a mainframe, whether that’s in a flat file, a VSAM file, an IMS database, or a Db2 database, we have capabilities around data extraction but also CDC, or Change Data Capture, on that data, with a minimal footprint and the highest performance engine on the market. We can also collect data from a bunch of different relational databases and enterprise data warehouses. We can ingest data from things such as flat files, XML files, JSON, or even a system like Hadoop HDFS. We can pull streaming data in from Kafka. And again, deploy in any cloud environment you need, or connect to data sitting in any cloud environment such as AWS, Azure or Google Cloud Platform. Now, Connect, like I mentioned before, is natively integrated with Databricks, so we can leverage Databricks to do that data ingest. We can do large-scale data transformation, cleansing and merging, all of that done in the Precisely engine running on Databricks. And finally, because we’re natively integrated with Delta Lake, we can deliver that data into Delta Lake with our native high-performance connectors. The user never has to worry about how the pipeline they built is actually gonna be deployed and run on the Databricks platform. Instead, they focus on the business task at hand and the requirements of their integration. Precisely and Connect take care of the rest for you. Now again, once that data is served in Delta, there are a variety of different architectures customers can deploy, maintaining raw data layers: bronze layers, silver layers, gold enriched layers. But we can always natively load that into Delta, and that can serve the reporting, BI and machine learning applications that can be built on top of it. Now let’s dive one step deeper in terms of our connectivity here.

Get data from legacy sources into the Data Lakehouse! Want your data from legacy, mainframe and IBM i sources loaded in Delta Lake?

Again, really what we’re offering is the ability to get data from these legacy platforms natively loaded into Delta Lake. And Connect does that through our wide variety of connectivity, again accessing things like VSAM data, mainframe flat file data and Db2 data, coupled with our design-once, deploy-anywhere approach that doesn’t require any sort of data staging. That really allows us to ensure the data is delivered within the timelines that are required. From a compliance perspective, we have a lot of customers who are in more regulated industries such as financial services and insurance. And oftentimes, there’s a requirement that the actual data is preserved in the cloud in its original format, in a bit-by-bit format. And what we can do is comply with that and preserve that data even if it’s complex VSAM data or variable-length mainframe data in an EBCDIC encoding. We can actually take that data and store it in the cloud in its original format but, again, make that data available for processing within Databricks. So you can leverage Connect to do that large-scale data transformation into a format the machine learning applications running inside Databricks can understand. And again, from a lineage perspective, we can provide a really granular level of detail as to what data was touched, who touched it, how they transformed it, and how the metadata propagates throughout the integration pipeline. Now, the last piece I’ll mention here is performance. Performance is something we really pride ourselves on: our self-tuning engine automatically ensures the best performance for complex things like joins, data aggregations and data transformations, and coupling our native vertical scalability with the horizontal scalability of the Databricks Unified Data Analytics Platform really ensures that data can not only be delivered but be delivered within tighter and tighter SLAs as the business demands.
Really, users can break down data silos and start operating on data in minutes. So to show you that, what I’d like to do is step through a little bit of a demo showing how Connect works in conjunction with Databricks to move data off the mainframe into the Databricks Unified Data Analytics Platform.

So let’s start off by showing you what our interface looks like. Here, I’m building a data pipeline where I’m merging multiple data sets together, a lot of them originating from the mainframe, but my goal here, and what I’ll show you, is how I can load that data into Delta Lake. So in this job, as we call it, there are multiple different steps. The first steps start out by pulling data off the mainframe, combining it together and then loading it. So it’s a really simple kind of extract, transform, load process. To begin with, for a mainframe fixed-record-length file, I associated a copybook. Connect natively understands COBOL copybooks; you can link to one directly within our editor here. And once you actually link it in, all of the different COBOL fields are automatically translated for you into a format that even someone who doesn’t understand the mainframe can interpret. So here we can deal with things like OCCURS DEPENDING ON, REDEFINES and packed decimal. Now here I define my source file, and in defining my source file, I select the actual location where it lives. I can pull that data directly off the mainframe if I so desire, and then define the record type. So here I’ve chosen a fixed record length type, but you could choose VSAM or something else.
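To give a feel for what a COBOL copybook contributes, here is a toy sketch that reads a simplified copybook and derives field names, lengths and the fixed record length. It handles only bare `PIC X(n)`/`PIC 9(n)` clauses; real copybooks with REDEFINES, OCCURS DEPENDING ON and COMP-3 are far richer, and this is in no way Connect's parser — the copybook layout shown is invented for illustration.

```python
import re

# Toy illustration of how a COBOL copybook describes a fixed-length record.
# Only simple PIC X(n)/9(n) clauses are handled; this hypothetical snippet
# is not Connect's parser, and the layout below is invented for the example.

COPYBOOK = """
       01  CLAIM-RECORD.
           05  CLAIM-ID      PIC X(10).
           05  POLICY-NO     PIC 9(8).
           05  CLAIM-AMOUNT  PIC 9(7).
"""

def parse_copybook(text):
    """Return (field_name, byte_length) pairs for elementary PIC fields."""
    fields = []
    for m in re.finditer(r"05\s+([\w-]+)\s+PIC\s+[X9]\((\d+)\)\.", text):
        fields.append((m.group(1), int(m.group(2))))
    return fields

fields = parse_copybook(COPYBOOK)
record_length = sum(length for _, length in fields)
print(fields)         # [('CLAIM-ID', 10), ('POLICY-NO', 8), ('CLAIM-AMOUNT', 7)]
print(record_length)  # 25
```

The point of the metadata is exactly this: once the copybook is linked in, every byte offset in the fixed-length record is known, so a tool can slice and translate raw records without the user hand-computing positions.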

Now here you can see the data is encoded as EBCDIC. Connect supports all the different ICU code pages. So if you’re using a non-standard EBCDIC code page or an international EBCDIC code page, we understand how to deal with that. What I’m gonna do now is show you a little bit of the data sampling we have, and you can actually see the data. The data you’re looking at on the screen here is raw mainframe data, but Connect is interpreting it for you and showing it to you in a displayable form to make the integration easy. Now it’s just a matter of defining a target. I’m gonna take that file, transform it according to the data definitions in the copybook, and load it into Databricks. So this will actually go into Delta. I’ve provided my JDBC URL and leveraged the authentication that’s there in the Databricks environment. And that’s pretty much it. Now I’ve fully defined a transformation process from mainframe all the way into Delta in a couple of clicks.

And you can always add other database tables if you want to add more Delta tables fairly easily. When you wanna do that, you can just open up that dialog; we’ll list out all the different tables inside the database, you select the table you need, and then we try to auto-map everything for you as best as we can. So here you can see the automatic field mapping to the new target table. You can define how you wanna map, or you can put in individual field-level transformations as well. And additionally, you can optionally filter data. With mainframe data, it’s often the case that data will be taken from a single file and, because it’s stored in a hierarchical format, oftentimes users will split that data out into multiple relational tables. But with that, I can now run this job, and I can actually run it natively inside of Databricks. Again, with a single click of a button, you don’t have to define the low-level details; just select the actual framework and then run the integration process. So here, I’m gonna load that data in. Once this finishes running, we’ll be able to take a look at that data sitting inside the database. Now we’ll quickly show you some of the statistics. These statistics are very useful for users to better understand any potential issues that were discovered in the data, any records that failed to load, data volumes read in and written out; all of that is available in the statistics for you. So you can see I’ve read over a million records and I’ve written over a million records out to Delta. And then if you go into the Databricks environment, you can now see those tables that we loaded into and you can start exploring the data within there. Again, within four minutes, we just took a complex EBCDIC-encoded mainframe file and loaded it into Delta. Now the data is ready and waiting for more data refinement, cleansing, enrichment and, beyond that, some real intense machine learning processes as well.
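The idea of splitting one hierarchical mainframe record into multiple relational tables, mentioned above, can be pictured with a small sketch. The record layout (a policy with nested claims) and field names here are invented for illustration; the pattern is simply parent row plus child rows linked by a key.

```python
# Sketch of splitting a hierarchical mainframe-style record into multiple
# relational tables. The policy/claims layout is a hypothetical example.

def split_record(record):
    """Flatten one policy record with nested claims into a parent row
    and one child row per claim, linked by POLICY_NO."""
    parent = {"POLICY_NO": record["policy_no"], "HOLDER": record["holder"]}
    children = [
        {"POLICY_NO": record["policy_no"], "CLAIM_ID": c["id"], "AMOUNT": c["amount"]}
        for c in record["claims"]
    ]
    return parent, children

rec = {
    "policy_no": "P001",
    "holder": "A. Smith",
    "claims": [{"id": "C1", "amount": 120}, {"id": "C2", "amount": 75}],
}
policy_row, claim_rows = split_record(rec)
print(policy_row)       # {'POLICY_NO': 'P001', 'HOLDER': 'A. Smith'}
print(len(claim_rows))  # 2
```

Once the data is in this shape, each list of rows maps naturally onto its own Delta table, which is why hierarchical source files often fan out into several relational targets.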

So with that, really, we’ve talked about how Connect and Databricks work together. So what I’d like to do next is talk about a customer story that shows business transformation in action.

Now, this is a customer story around a global insurer with the goal of digital modernization.

Global insurer tackles digital modernization

Now, they had a top-down initiative to improve their NPS scores and improve their operational efficiency by digitally transforming their claims processing. Much of this claims processing ran on the mainframe at the time. And they wanted to take that and modernize it with a couple of key outcomes in mind. They wanted to improve the claims experience for end customers, again helping with things such as NPS scores. They wanted to use some machine learning to identify certain patterns in the claims data that would alert the business to unexpectedly severe claims. And then lastly, they wanted to automate the fast-tracking of low-dollar claims without the need for an adjuster to manually get involved. So as you can see, this mainframe data has all these claims, and what they really wanted to do was unlock that data and make it available in Databricks so some of these other use cases could be enabled. Now, with the problem statement defined, that was a great first step.

Creating an enterprise claims hub required quickly integrating new types of data

There were a couple of key challenges when it came to integrating this data. And the main one that really sums it up is that their existing methods of integration were inefficient, or really unable, to successfully connect to, integrate and deliver that mainframe data in the way I just showed you. So they needed a solution that was gonna allow them to successfully and reliably migrate this claims data for tens of thousands of clients, moving all that data into the cloud for this comprehensive ML. Now, another key challenge was, for anyone who’s worked with mainframe data, you’re very familiar with the way metadata works in that environment. You have things like COBOL REDEFINES, repeating nested arrays, COBOL high and low values, and EBCDIC packed data. The key challenge was dealing with those kinds of data types and translating them into something that was gonna work inside of Delta, inside of Databricks. And so what they needed to do was, as best as they could, leverage the existing metadata they were already using on the mainframe. This would allow them to simplify their integration pipeline but also, more importantly, maintain source-to-destination lineage visibility. So by leveraging the existing COBOL copybooks, the existing metadata on their mainframe, they would be able to more easily integrate their data but also provide that clear line of sight for compliance purposes. And the last key challenge here was scaling with growing data volumes; like I mentioned before, we’re talking tens of thousands of clients here. And so the amount of data was a key challenge; they needed a solution that was gonna scale.

So together, Precisely and Databricks were able to help build this high-performance data hub for machine learning and analytics, and they used the ETL capabilities within Connect in conjunction with Databricks to achieve this.

Precisely and Databricks help to create a high performance data hub

Now, the key results here: not only was the data delivered, but there was no downtime or rework for implementing this new approach to legacy integration. They were able to take some of the existing pipelines that had deployed Connect in a non-Databricks environment and just port those over to run natively inside of Databricks, coupling the vertical scalability of Connect with the horizontal scalability of the Databricks platform. Another key piece here was they were able to meet the high volume requirements of their hub.

And then ultimately, going back to that business problem, they were able to deliver on those key goals, making the claims experience faster and improving the overall customer experience. So again, here’s an instance where there was a key business objective and a key business directive that directly relied on being able to access, integrate, transform and deliver legacy data that was still being generated on the mainframe.

So I’d like to close this session by just touching on a couple of takeaways for success. The first thing I would suggest, learning from this customer and from some other customers that we work with, is to ensure that the goals of your modernization efforts are clearly defined. What’s the end-state vision here? Are you trying to save cost? Are you trying to improve the performance of an existing pipeline, or is it something else? Is there the opportunity to deliver data in a more timely fashion to gain some sort of competitive business advantage? The next point I would make is to ensure that you’re choosing an integration solution that’s going to easily allow you to expand to new use cases. This customer’s initial project didn’t need to cover everything, but as business needs change and evolve, if they have new requirements, say around real-time data delivery, or new data that needs to be integrated within Databricks, they’re able to leverage their Connect investment to expand to new use cases. Now, another thing I would call out here, especially as organizations go through this modernization, is that some of the new requirements of the cloud platforms may break the current data integration architecture, and really the key message here is don’t lock yourself in too much. You want a solution that’s going to be flexible enough to solve for today but allow you to embrace tomorrow as it comes.

And then again, the last thing I would highlight here, and this really relates to a lot of the Precisely customers that we work with, is to select a tool that’s gonna allow you to solve your integration problems across the hybrid landscape, from data center to public cloud. The example that I shared with you just now is a great example of an integration architecture the customer selected to solve for these hybrid problems. As it stands, that mainframe system that towers in the data center is not going away anytime soon. It’s gonna continue to power the business for key areas of differentiation and success. And so the challenge for the organization is to ensure that data can freely and securely move across the hybrid landscape. And that’s a key thing to make sure you’re solving for as part of your integration architecture. So with that, I’d like to thank you for joining the session.

About Ashwin Ramachandran


Ashwin Ramachandran is the Senior Product Manager for Precisely's Integrate portfolio. In his 4 years, he has had the opportunity to engage with customers at multiple levels, from support, to training, to leading pre-sales evaluations. Across all of these instances, he has particularly enjoyed the process of creating new ways in which Precisely software can help customers overcome their pressing business challenges.