Spark Schema For Free

Download Slides

DataFrames are essential for high-performance code, but sadly lag behind in development experience in Scala. When we started migrating our existing Spark application from RDDs to DataFrames at Whitepages, we had to scratch our heads real hard to come up with a good solution. DataFrames come at a loss of compile-time type safety and there is limited support for encoding JVM types.

We wanted more descriptive types without the overhead of Dataset operations. The data binding API should be extendable. Schema for input files should be generated from classes when we don’t want inference. UDFs should be more type-safe. Spark does not provide these natively, but with the help of shapeless and type-level programming we found a solution to nearly all of our wishes. We migrated the RDD code without any of the following: changing our domain entities, writing schema description or breaking binary compatibility with our existing formats. Instead we derived schema, data binding and UDFs, and tried to sacrifice the least amount of type safety while still enjoying the performance of DataFrames.

Session hashtag: #SAISEnt5

« back
About Dávid Szakállas

Dávid acquired an M.Sc. in Software Engineering from Budapest University of Technology and Economics, where he specialized in distributed systems. He got acquinted with Scala as an intern 4 years ago. Since then he worked with various programming languages, but Scala remained one of his favorites. Dávid's past projects include a graph database search engine, and distributed error tracing for microservices. He is currently working as a Software Engineer at the Budapest office of Whitepages, where he and his collegues develop the ingestion pipeline of the data lake that powers the core search engine, which answer to over 2 million queries every day. He holds a personal programming blog, and a love for electronic music.