Guest blog: SequoiaDB Connector for Apache Spark

This is a guest blog from Tao Wang at SequoiaDB. He is the co-founder and CTO of SequoiaDB, leading its long-term technology vision, and is responsible for the leadership of advanced technology incubations. SequoiaDB is a JSON document-oriented transactional database.


Why We Chose Apache Spark

SequoiaDB is a NoSQL database that has the capability to replicate data on different physical nodes and allows users to specify which “copy of data” that the application should access. It is capable of running analytical and operational workloads simultaneously on the same cluster with minimal I/O or CPU contention.

The joint solution of Apache Spark and SequoiaDB allows users to build a single platform such that a wide variety of workloads (e.g., interactive SQL and streaming) can run together on the same physical cluster.

Making SequoiaDB Work with Spark: The Spark-SequoiaDB Connector

Spark-SequoiaDB Connector is a Spark data source that allows users to read and write data against SequoiaDB collections with Spark SQL. It is used to integrate SequoiaDB and Spark, combining the advantages of a schema-less storage model with dynamic indexing and the power of Spark clusters.

Screen Shot 2015-07-08 at 11.28.23 AM

Spark and SequoiaDB can be installed in the same physical environment or in different clusters. The Spark-SequoiaDB Connector pushes down the search conditions to SequoiaDB, and only retrieves the records that match the query predicates.  This optimization enables analytics against operational data sources without the need to perform ETL between SequoiaDB and Spark.

Here is a code example of how to use the Spark-SequoiaDB Connector in SparkSQL:

sqlContext.sql("CREATE temporary table org_department ( deptno string, deptname string, mgrno string, admrdept string, location string ) using com.sequoiadb.spark OPTIONS ( host 'host-60-0-16-2:50000', collectionspace 'org', collection 'department', username 'sdbreader', password 'sdb_reader_pwd')")
res2: org.apache.spark.sql.DataFrame = []
sqlContext.sql("CREATE temporary table org_employee ( empno int, firstnme string, midinit string, lastname string, workdept string, phoneno string, hiredate date, job string, edlevel int, sex string, birthdate date, salary int, bonus int, comm int ) using com.sequoiadb.spark OPTIONS ( host 'host-60-0-16-2:50000', collectionspace 'org', collection 'employee', username 'sdb_reader', password 'sdb_reader_pwd')")
res3: org.apache.spark.sql.DataFrame = []
sqlContext.sql("select * from org_department a, org_employee b where a.deptno='D11'").collect().take(3).foreach(println)
[D11,MANUFACTURING SYSTEMS,000060,D01,null,10,CHRISTINE,I,HAAS,A00,3978,null,PRES,18,F,null,152750,1000,4220]
[D11,MANUFACTURING SYSTEMS,000060,D01,null,20,MICHAEL,L,THOMPSON,B01,3476,null,MANAGER,18,M,null,94250,800,3300]
[D11,MANUFACTURING SYSTEMS,000060,D01,null,30,SALLY,A,KWAN,C01,4738,null,MANAGER,20,F,null,98250,800,3060]

Financial Services Industry Use Case: Improved Transaction History Archiving System

The joint solution of Spark and SequoiaDB can help organizations to retain more and get more value out of their data. Here we will showcase a financial services industry example, where a bank implemented an improved transaction history archiving system with Spark and SequoiaDB.

For the past few decades, most banks run their core banking systems on mainframes. The technical limitations of their mainframe systems meant that transaction history older than one year had to be removed from the mainframes and archived on tapes.

However, banking customers today have much higher expectations of customer service than the past, driven by the broad adoption of online and mobile banking. To compete for customers more effectively, one of our customers – who is a large bank – wanted to improve its offering by allowing their customers to search for transactions older than one year.

Screen Shot 2015-07-29 at 11.52.34 AM

Using SequoiaDB, this bank can save 15 years of all customer transaction data in 50 physical nodes (occupying more than 1PB disk space). This new system allows customers to access their full transaction history easily on the mobile device as well as the website.

Screen Shot 2015-07-29 at 11.53.23 AM

Financial Services Industry Use Case: Product Recommendation with Spark and SequoiaDB Integration

Once the full transaction history of all customers is readily available, the bank built a customer profiling system based on the transaction data to find the appropriate investment products for each customer.

Once the customer profiling system calculates product recommendations when processing all of the transaction data and logs, these properties are written back to a collection with a tag array for each customer.

These properties are used by the front desk staff and the recommendation engine to identify the potential interests of each customer. After deploying this system, the success rate of financial derivatives recommendation increased by more than ten times.

Screen Shot 2015-07-08 at 11.38.57 AM

What next with Spark at SequoiaDB

Most of our financial customers are interested in streaming (for anti-money laundering and high-frequency trading use cases) or interactive SQL processing (for government supervision). We intend to put more efforts to improve the features and stability of these Apache Spark components, such as helping SparkSQL to support standard SQL2003.

For more information about Spark-SequoiaDB Connector please visit:

https://github.com/SequoiaDB/spark-sequoiadb

Try Databricks for free Get started

Sign up