Alfonso Roa is a Functional Architect at Habla Computing. He has worked mainly with Apache Spark for the last six years, building real-world applications across many sectors, and has extensive experience in consultancy projects focused on big data, developing applications for analysis and ML. He tries to spend some of his free time creating libraries to give back to the open source community. He is also a co-organizer of the Madrid Scala Meetup group.
October 16, 2019 05:00 PM PT
Working with complex types shouldn't be a complex job. DataFrames provide a great SQL-oriented API for data transformation, but it doesn't help much when the time comes to update elements of complex types like structs or arrays. In such cases, your program quickly turns into a humongous pile of struct calls and parentheses as you transform inner elements and then reconstruct the column around them. This is exactly the same problem that we encounter when working with immutable data structures in functional programming, and optics were invented to solve it. Couldn't we use something similar to optics in the DataFrame realm?
In this talk, we will show how to enrich the DataFrame API with the design patterns that lenses, one of the most common types of optic, put forward for manipulating immutable data structures. We will show how these patterns are implemented in the spark-optics library, an analogue of the Scala Monocle library, and illustrate its use with several examples. Last but not least, we will take advantage of the dynamic type system of DataFrames to do more than transform sub-columns, such as pruning and renaming elements.
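To make the lens idea concrete, here is a minimal sketch in plain Scala, with no Spark dependency. It is not the spark-optics or Monocle API: the `Lens` class and the `Address`, `Person`, and `LensSketch` names are hypothetical, chosen only to show how a composable getter/setter pair replaces hand-written nested copies of an immutable structure.

```scala
// Hypothetical domain types, used only for illustration.
final case class Address(street: String, city: String)
final case class Person(name: String, address: Address)

// A minimal lens: a getter plus a setter that returns an updated copy.
// This is an illustrative definition, not the Monocle or spark-optics one.
final case class Lens[S, A](get: S => A, set: (S, A) => S) {
  // Apply a function to the focused value and rebuild the outer structure.
  def modify(s: S)(f: A => A): S = set(s, f(get(s)))

  // Compose with a lens that focuses deeper into the structure.
  def andThen[B](inner: Lens[A, B]): Lens[S, B] =
    Lens[S, B](
      s => inner.get(get(s)),
      (s, b) => set(s, inner.set(get(s), b))
    )
}

object LensSketch {
  val address: Lens[Person, Address] =
    Lens[Person, Address](_.address, (p, a) => p.copy(address = a))
  val city: Lens[Address, String] =
    Lens[Address, String](_.city, (a, c) => a.copy(city = c))

  // Composed lens: focuses on the city nested inside a person.
  val personCity: Lens[Person, String] = address.andThen(city)

  val alice: Person = Person("Alice", Address("Gran Via 1", "madrid"))

  // Without a lens: every level of nesting is rebuilt by hand.
  val byHand: Person =
    alice.copy(address = alice.address.copy(city = alice.address.city.toUpperCase))

  // With a lens: the path is stated once and then reused.
  val viaLens: Person = personCity.modify(alice)(_.toUpperCase)
}
```

Both `byHand` and `viaLens` produce the same updated value, and the original `alice` is left untouched. The DataFrame case is analogous in spirit: the hand-written version becomes a chain of struct reconstructions around the one inner column being changed, which is the boilerplate a lens-style abstraction aims to hide.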