This is a guest blog from Darin McBeath, Disruptive Technology Director at Elsevier. To try out Databricks for your next Apache Spark application, sign up for a 14-day free trial today.
Elsevier is a provider of scientific, technical, and medical information products and services. Elsevier Labs is an advanced technology R&D group within Elsevier. Members of Labs research and create new technologies, help implement proofs of concept, educate Elsevier staff and management, and represent the company in technical discussions.
In this blog, we will talk about how we utilize Databricks to build Apache Spark applications, and introduce our first publicly released Spark package - spark-xml-utils.
Elsevier Labs and Apache Spark on Databricks
Within Elsevier Labs, we have been investigating Apache Spark since early 2014, and we have accelerated our pace of adoption since implementing Databricks at the start of 2015.
Some of our use cases with Apache Spark on Databricks:
- Usage analysis for one of our web properties using the Databricks REST API.
- Text mining with popular tools such as GENIA and Stanford Core NLP against a large portion of our content.
- Feature extraction using MLlib.
- Extrapolation of author graphs, affiliation graphs, and cited-by graphs from our content with GraphX.
- Ad hoc analysis using Spark SQL notebooks.
We use Databricks to develop standalone Java applications as well as notebooks in Scala and Python. We also use the Databricks job scheduler to automatically schedule daily and weekly jobs.
Introducing spark-xml-utils
We have a lot of content in fairly complex XML (highly structured, nested, and with numerous namespaces). While the mere mention of XML can cause much angst and gnashing of teeth, the fact of the matter is that a lot of XML still exists in big data sets. Case in point: Elsevier has tens of millions of complex XML files for journal articles, book chapters, and other information. This surfaced the need for easy-to-use, powerful, and flexible tools for processing many XML files in parallel.
The spark-xml-utils library was developed to provide some helpful XML utilities:
- Filter documents based on an XPath expression
- Return specific nodes for an XPath/XQuery expression
- Transform documents using an XSLT stylesheet
By providing some basic wrappers to Saxon, the spark-xml-utils library exposes basic XPath, XQuery, and XSLT functionality that can readily be leveraged by a Spark application.
Using spark-xml-utils
Spark-xml-utils provides access to three common XML tools: XPath, XQuery, and XSLT. This allows a developer to select the most appropriate or most familiar tool for the task at hand.
A recurring pattern in the examples below is the use of mapPartitions. This allows us to initialize the processor for XPath, XQuery, or XSLT once per partition for optimal performance, and then use an iterator to process each record in the partition.
The sequence file used in the examples below (a small, representative sample of our content) is publicly available at s3://spark-xml-utils/xml. In this sequence file, the key is a unique identifier for the record and the value is the XML (as a string). A sample of a complete XML record is available from https://s3.amazonaws.com/spark-xml-utils/xmlfiles/SourceXML/S0001870812002101.norm-xml.
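To follow along, the sample can be loaded into an RDD of key/value pairs. A minimal sketch (the variable name xmlKeyPair is just illustrative, and the URI scheme may need adjusting for your environment):

```scala
// Load the sample sequence file as an RDD of (identifier, XML string) pairs.
// Depending on the cluster configuration, the path may need an s3n:// or s3a:// scheme.
val xmlKeyPair = sc.sequenceFile[String, String]("s3://spark-xml-utils/xml").cache()
```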
XPath
The XPathProcessor class defined in spark-xml-utils provides methods that enable processing of XPath expressions (filter/evaluate) against an XML string.
The result of a filter operation will be a boolean TRUE/FALSE. As an example, assume we want to subset our content to only ‘journal’ articles with a published date between 2012 and 2015. The code below will filter the records, keeping the journal articles (content-type=’JL’) published after 2012 but before 2015.
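A minimal sketch of what this filter might look like; the xocs namespace URI and element names are assumptions based on our sample content, and the method names (getInstance, filterString) should be confirmed against the spark-xml-utils Github site:

```scala
import java.util.HashMap
import scala.collection.JavaConverters._
import com.elsevier.spark_xml_utils.xpath.XPathProcessor

val filtered = xmlKeyPair.mapPartitions(recsIter => {
  // Keep journal articles (content-type='JL') published after 2012 but before 2015.
  val xpath = "/xocs:doc[xocs:meta[xocs:content-type='JL' and " +
              "xocs:cover-date-year > 2012 and xocs:cover-date-year < 2015]]"
  // Map the xocs prefix to its namespace URI (URI assumed for illustration).
  val namespaces = new HashMap[String, String](
    Map("xocs" -> "http://www.elsevier.com/xml/xocs/dtd").asJava)
  // Initialize the processor once per partition, then reuse it for every record.
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.filter(rec => proc.filterString(rec._2))
})
filtered.count()
```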
Like the filter operation, the evaluation operation applies an XPath expression against an XML string, but instead of a boolean, its result is the string value returned by the XPath expression. For example, assume we want to extract the journal name for every record. The following example returns the srctitle (i.e., journal name) for each record.
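A sketch along the same lines (with the same caveats about element names, the namespace URI, and method names such as evaluateString):

```scala
import java.util.HashMap
import scala.collection.JavaConverters._
import com.elsevier.spark_xml_utils.xpath.XPathProcessor

val srctitles = xmlKeyPair.mapPartitions(recsIter => {
  // Return the text of the srctitle (journal name) element for each record.
  val xpath = "/xocs:doc/xocs:meta/xocs:srctitle/text()"
  val namespaces = new HashMap[String, String](
    Map("xocs" -> "http://www.elsevier.com/xml/xocs/dtd").asJava)
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2))
})
srctitles.take(10)
```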
While the examples above all included namespaces (because our XML is fairly rich and complex), the code can be simplified if the content does not have namespaces. Namespaces are also not needed if wildcards are used instead of namespace prefixes in an expression, for example: "/*:doc/*:meta/*:srctitle/text()", as sketched below.
XQuery
The XQueryProcessor class defined in spark-xml-utils provides a method that enables processing of XQuery evaluate expressions against an XML string (similar to the XPath evaluation).
An evaluation operation can do more than just return raw results; we can also return multiple fields and do some basic formatting. Assume we want to return a JSON record for the journal name and the publication year. The following example returns a JSON record containing the journal name and the publication year for each record.
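A sketch of what this might look like, with the namespace declared inside the query itself; the XQuery expression and element names are illustrative, and the method names (getInstance, evaluateString) should be confirmed against the Github site:

```scala
import com.elsevier.spark_xml_utils.xquery.XQueryProcessor

val jsonRecords = filtered.mapPartitions(recsIter => {
  // Emit a small JSON record with the journal name and publication year.
  val xquery =
    """declare namespace xocs="http://www.elsevier.com/xml/xocs/dtd";
      |for $meta in /xocs:doc/xocs:meta
      |return concat('{"srctitle":"', $meta/xocs:srctitle,
      |              '","year":"', $meta/xocs:cover-date-year, '"}')""".stripMargin
  val proc = XQueryProcessor.getInstance(xquery)
  recsIter.map(rec => proc.evaluateString(rec._2))
})
jsonRecords.take(10)
```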
XSLT
The XSLTProcessor class defined in spark-xml-utils provides a method that transforms an XML record by applying an XSLT stylesheet obtained from an S3 bucket (or passed as a string). While some basic transformation could be done with XPath (or XQuery), XSLT is the tool of choice for complex transformations.
Assume we want to return a JSON record for the journal title. The following example returns a JSON record containing the journal name for each record.
A simple stylesheet for this transformation is shown inline in the sketch below.
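A minimal sketch using a simple illustrative stylesheet (passed as a string) that emits a JSON record with the journal name; the element names, namespace URI, and method names (getInstance, transform) are assumptions to verify against the spark-xml-utils Github site:

```scala
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

// An illustrative stylesheet (passed as a string) that emits {"srctitle":"..."}.
val stylesheet =
  """<xsl:stylesheet version="2.0"
    |    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    |    xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd">
    |  <xsl:output method="text"/>
    |  <xsl:template match="/">
    |    <xsl:text>{"srctitle":"</xsl:text>
    |    <xsl:value-of select="/xocs:doc/xocs:meta/xocs:srctitle"/>
    |    <xsl:text>"}</xsl:text>
    |  </xsl:template>
    |</xsl:stylesheet>""".stripMargin

val jsonFromXslt = filtered.mapPartitions(recsIter => {
  // Initialize the XSLT processor once per partition and transform each record.
  val proc = XSLTProcessor.getInstance(stylesheet)
  recsIter.map(rec => proc.transform(rec._2))
})
jsonFromXslt.take(10)
```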
Much more complex scenarios are also possible (e.g., filtering documents where the record is of type 'journal', the stage is 'S300', and the publication year is > 2010). More examples and documentation are available on the spark-xml-utils Github site.
Conclusion
We are just scratching the surface of what we would like to provide with spark-xml-utils and what is possible. Within Labs, we have been using it for over 9 months and have had great success. If you are interested in processing XML, I encourage you to install the spark-xml-utils package, play around with the examples (the data and stylesheet are publicly available), or better yet, use your own data. We are certainly receptive to feedback and additions from the community.
The spark-xml-utils package can be found on the Spark Packages site and Github.