Extending Spark SQL API with Easier to Use Array Types Operations


Big companies typically integrate data from various heterogeneous systems when building a data lake as a single point for accessing data. To achieve this goal, technical teams often deal with data defined by complex schemas and stored in various formats. Spark SQL Datasets currently support data formats such as XML, Avro and Parquet by providing both primitive data types and complex ones such as structs and arrays.

Although the Dataset API offers a rich set of functions, general manipulation of arrays and deeply nested data structures is lacking. We will demonstrate this with examples of data that is currently very hard to process efficiently in Spark. We designed and developed an extension of the Dataset API that allows developers to work with array and complex-type elements in a more straightforward and consistent way. The extension should help users dealing with complex and structured big data to use Apache Spark as a truly generic processing framework.

Session hashtag: #Dev3SAIS



About Marek Novotny

Marek obtained bachelor's and master's degrees in computer science at Charles University in Prague. His master's studies focused mainly on the development of distributed and dependable systems. In 2013, Marek joined ABSA Capital in Prague to develop a scalable data integration platform and a framework for calculating regulatory reports. While working on those two projects, he gained experience with many NoSQL and distributed technologies (e.g. Kafka, ZooKeeper, Spark). He is now a member of the Big Data Engineering team and focuses primarily on the development of the Spline project.

About Oleksandr Vayda

Alex is a senior software engineer in the Big Data Engineering team at Barclays Africa Group Limited. Alex obtained his second degree, a Master's in computer science, at the National University of Radioelectronics in Kharkov, Ukraine. Before that, he received an engineering degree from the National Aerospace University. During his software development career, which started in 2002, Alex has participated in and authored a number of Web startup projects, mostly focusing on Java and later Scala development and system architecture. Currently he is one of the key developers of the Spline project, an open-source Spark lineage tracking tool developed by ABSA Capital.