Skip to main content
Platform blog

SAP's recent announcement of a strategic partnership with Databricks has generated significant excitement among SAP customers. Databricks, the data and AI experts, presents a compelling opportunity for leveraging analytics and ML/AI capabilities by integrating SAP HANA with Databricks. Given this collaboration's immense interest, we are thrilled to embark on a deep dive blog series.

In many customer scenarios, a SAP HANA system serves as the primary entity for data foundation from various source systems, including SAP CRM, SAP ERP/ECC, SAP BW. Now, the exciting possibility arises to seamlessly integrate this robust SAP HANA analytical sidecar system with Databricks, further enhancing the organization's data capabilities. By connecting SAP HANA (with HANA Enterprise Edition license) with Databricks, businesses can leverage the advanced analytics and machine learning capabilities (like MLflow, AutoML, MLOps) of Databricks while harnessing the rich and consolidated data stored within SAP HANA. This integration opens up a world of possibilities for organizations to unlock valuable insights and drive data-driven decision-making across their SAP systems.

Multiple approaches are available to federate SAP HANA tables, SQL views, and calculation views in Databricks. However, the quickest way is to use SparkJDBC. The most significant advantage is SparkJDBC supports parallel JDBC connections from Spark worker nodes to the remote HANA endpoint.

Architecture
Figure 1: Architecture

Let us start with SAP HANA and Databricks integration

First, SAP HANA 2.0 is installed in the Azure cloud and we tested the integration with Databricks.

Installed SAP HANA info in Azure:

version2.00.061.00.1644229038
branchfa/hana2sp06
Operating SystemSUSE Linux Enterprise Server 15 SP1

Here is the high-level workflow depicting the different steps of this integration.

Please see the attached notebook for more detailed instructions for extracting data from SAP HANA's calculation views and tables into Databricks using SparkJDBC.

SAP HANA
Figure 2: Steps to access SAP HANA data into Databricks

Configure the SAP HANA JDBC jar (ngdbc.jar) as shown in the image below

SAP HANA JDBC Jar
Figure 3: SAP HANA JDBC Jar

Once the above steps are performed, perform a spark read using the SAP HANA server and JDBC port.

SAP HANA table
Figure 4: SAP HANA table

Start creating the dataframes using the in shown below with schema, table name.

df_vbap_table = (spark.read
 .format("jdbc")
 .option("driver","com.sap.db.jdbc.Driver")
 .option("url", "jdbc:sap://20.XX.XXX.XXX:39015/?autocommit=false")
 .option("dbtable", "ECC_DATA.VBAP")
 .option("user", "******")  ## should be using databricks secrets instead of putting credentials in code
 .option("password", "******")  ## should be using databricks secrets instead of putting credentials in code
 .load()
)

Also, we can do a filter pushdown by passing SQL statements in the dbtable option.

.option("dbtable", "(select country, sales from ECC_DATA.VBAP where country = 'USA')")

To get data from the Calculation View, we have to do the following:
For example, this XS-classic calculation view is created in the internal schema "_SYS_BIC".

Calculation View
Figure 5: Calculation View

This code snippet creates a PySpark dataframe named "df_sap_ecc_hana_cv_vbap" and populates it from a Calculation View from the SAP HANA system (in this case, CV_VBAP).

calc_view = '"_SYS_BIC"."ecc-data-cv/CV_VBAP"'

df_sap_ecc_hana_cv_vbap = (spark.read
 .format("jdbc")
 .option("driver","com.sap.db.jdbc.Driver")
 .option("url", "jdbc:sap://20.XXX.XXX.XXX:39015/?autocommit=false")
 .option("dbtable", calc_view)
 .option("user", "*******")  ## should be using databricks secrets instead of putting credentials in code
 .option("password", "*******")  ## should be using databricks secrets instead of putting credentials in code
 .load()
)

After generating the PySpark data frame, leverage Databricks' endless capabilities for exploratory data analysis (EDA) and machine learning/artificial intelligence (ML/AI).

Summarizing the above data frames:

Dataframes

Dataframes from SAP HANA
Figure 6: Summarize of Dataframes from SAP HANA

The focus of this blog revolves around SparkJDBC for SAP HANA, but it's worth noting that alternative methods such as FedML, hdbcli, and hana_ml are available for similar purposes.

Download Notebook

Try Databricks for free

Related posts

Platform blog

Developing an Open Data Ecosystem with SAP

April 19, 2023 by Samir Patel, Sam Steiny and Awez Syed in Platform Blog
Across a number of industries, from manufacturing, energy, life sciences to retail, we are seeing enterprises increase their focus on business durability, resilience...
Platform blog

Introducing Databricks AutoML: A Glass Box Approach to Automating Machine Learning Development

May 27, 2021 by Kasey Uhlenhuth in Announcements
Today, we announced Databricks AutoML , a tool that empowers data teams to quickly build and deploy machine learning models by automating the...
Platform blog

Introducing MLflow 2.3: Enhanced with Native LLMOps Support and New Features

With over 11 million monthly downloads, MLflow has established itself as the premier platform for end-to-end MLOps, empowering teams of all sizes to...
See all Platform Blog posts