Integrating Apache Spark and HANA

Published: July 1, 2014

This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not only because of what it means for Databricks as a company but, just as importantly, because of what it means for Apache Spark and the Spark community.

Access to the full corpus of data

Fundamentally, every enterprise's big data vision is to convert data into value; a core ingredient in this quest is the availability of the data that needs to be mined for insights. Although the volume of data sitting in HDFS has grown incredibly and continues to grow exponentially, much of it has been contextual data - e.g., social data, click-stream data, sensor data, logs, third-party data sources - and historical data. Real-time operational data - e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and SCM (Supply Chain and Inventory Management) systems - has historically been maintained separately, and moving data in either direction to allow for analytics across the full data set is cumbersome at best. The union of Spark and HANA is designed to change that.

With over 200,000 customers and among the largest portfolios of enterprise applications, SAP’s software serves as the gateway to one of the most valuable treasure troves of enterprise data globally, and SAP HANA is the cornerstone of SAP’s platform strategy underpinning these enterprise applications moving forward. Now, the community of Spark developers and users will have full access as well, enabling a richness of analytical possibilities that has been hard to achieve otherwise.

By the same token, enterprise applications built on top of HANA will now move closer to achieving the holy grail of a fully closed-loop operational system. Many of these applications are critical decision support systems for enterprises that directly drive day-to-day business. Having access to the full corpus of data would enable more accurate and effective decisions across verticals - sensor and weather data for utilities, social trends for retailers, traffic patterns and political events for commodity players. In practice, however, much of this data is held outside of HANA; with the Spark + HANA integration, enterprises can make mission-critical decisions across ‘100% of the data’.

More than just data stores - two powerful engines

Much of the Big Data paradigm has been built on the notion of bringing ‘compute to the data’, as opposed to simply trying to ETL all the data to a single central repository. Spark and HANA embody this principle by providing powerful engines that operate on data in place. Beyond being a repository for valuable corporate data, HANA also provides a wide variety of advanced analytics packages - including predictive, text/NLP, and geospatial - at blazing fast speeds. Spark also provides advanced analytics capabilities - SQL, streaming data, machine learning, and graph computation - that work natively on HDFS and other data stores such as Cassandra.

Beyond their individual capabilities, the true power of this integration is the ability of Spark and HANA to work closely together. Rather than performing a simple ‘select *’ query to grab a full data set, Spark can push down more advanced queries (e.g., complex joins, aggregates, and classification algorithms) - leveraging HANA’s horsepower and reducing expensive shuffles of data. A similar mechanism works for HANA users, where TGFs (Table Generating Functions) and custom UDFs (User Defined Functions) provide access to the full breadth of Spark’s capabilities through the Smart Data Access functionality.
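To make the difference between a ‘select *’ fetch and a pushed-down query concrete, here is a minimal sketch. It does not use the actual Spark or HANA interfaces: an in-memory SQLite database stands in for the remote store, and the `sales` table, its columns, and both function names are hypothetical, chosen only for illustration. The point it shows is that pushing the aggregate to the store returns the same answer while far fewer rows cross the wire.

```python
import sqlite3

# Assumption: in-memory SQLite is only a stand-in for the remote store
# (HANA in the article); the table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 250), ("south", 75), ("south", 125), ("west", 300)],
)


def totals_without_pushdown(conn):
    """Fetch every row ('select *') and aggregate on the client side."""
    fetched = conn.execute("SELECT * FROM sales").fetchall()
    totals = {}
    for region, amount in fetched:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(fetched)


def totals_with_pushdown(conn):
    """Push the aggregate to the store; only one row per region comes back."""
    fetched = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(fetched), len(fetched)


naive_totals, naive_rows = totals_without_pushdown(conn)
pushed_totals, pushed_rows = totals_with_pushdown(conn)

print("rows transferred without pushdown:", naive_rows)
print("rows transferred with pushdown:", pushed_rows)
print("same result:", naive_totals == pushed_totals)
```

On a five-row table the saving is trivial, but the same shape applies when the source table holds billions of rows and the result is a handful of aggregates.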

Certified Spark Distribution - the value of ecosystem

Candidly, SAP is not known for a long history of open source software use and contributions. That said, they respect its value and what it can deliver - their bet on Spark is a testament to that. More importantly, they understand the value of a vibrant ecosystem, and that a unified community is a key ingredient in enabling this. That’s why they’ve been adamant that any SAP Spark distribution is a Certified Spark Distribution - and hence capable of supporting the rapidly growing set of “Certified on Spark” applications and the development ecosystem. This action by a global powerhouse - along with the other Certified Distributions - is strong evidence that maintaining compatibility with the community-driven Apache Spark distribution can be achieved without sacrificing innovation and growth.

Significant potential for the road ahead

SAP’s distribution of Spark and its current integration with HANA are undoubtedly a terrific start and open up a wealth of new opportunities for the Spark community. As we look forward, we see no shortage of potential opportunities for deeper integration - greater cross-functional performance, incorporating elements of SAP and HANA's security model in Spark, and facilitating the deployment of Spark and HANA together against a multitude of environments - and are excited to see where the journey leads.
