Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark

Many queries in Spark workloads execute over unstructured or text-based data formats, such as JSON or CSV files. Unfortunately, parsing these formats into queryable DataFrames or Datasets is often the slowest stage of these workloads, especially for interactive, ad-hoc analytics. In many instances, this bottleneck can be eliminated by taking the filters expressed in the high-level query (e.g., a SQL query in Spark SQL) and pushing them into the parsing stage, thus reducing the total number of records that need to be parsed.

In this talk, we present Sparser, a new parsing library in Spark for JSON, CSV, and Avro files. By aggressively filtering records before parsing them, Sparser achieves up to a 9x end-to-end runtime improvement on several real-world Spark SQL workloads. Using Spark’s Data Source API, Sparser extracts the filtering expressions specified by a Spark SQL query; these expressions are then compiled into fast, SIMD-accelerated “pre-filters” which can discard data an order of magnitude faster than the JSON and CSV parsers currently available in Spark.
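To make the pre-filter idea concrete, here is a minimal sketch in plain Python. It is not Sparser's implementation: it uses an ordinary substring search instead of SIMD, it does not touch the Data Source API, and the names `raw_prefilter` and `parse_with_prefilter` are hypothetical. It also assumes the filter value is a plain ASCII string that appears unescaped in the raw JSON text.

```python
import json


def raw_prefilter(line: str, needle: str) -> bool:
    """Cheap check on the raw text: can this record possibly match?"""
    return needle in line


def parse_with_prefilter(lines, field, value):
    """Evaluate `field == value`, parsing only records that pass the pre-filter.

    Returns the matching records and the number of records actually parsed.
    """
    # Search for the quoted value as it would appear in the raw JSON text
    # (assumption: no escape sequences or non-ASCII characters in the value).
    needle = json.dumps(value)
    matches = []
    parsed_count = 0
    for line in lines:
        if not raw_prefilter(line, needle):
            continue  # discarded without paying the parsing cost
        parsed_count += 1
        record = json.loads(line)
        # The pre-filter can produce false positives, so the exact
        # predicate is still checked on the parsed record.
        if record.get(field) == value:
            matches.append(record)
    return matches, parsed_count


SAMPLE_LINES = [
    '{"level": "error", "msg": "disk full"}',
    '{"level": "info", "msg": "started"}',
    '{"level": "info", "msg": "error"}',  # passes the pre-filter, fails the real filter
    '{"level": "debug", "msg": "heartbeat"}',
]

matches, parsed_count = parse_with_prefilter(SAMPLE_LINES, "level", "error")
```

In this sample, only two of the four records reach `json.loads`, and one of those two is a false positive that the exact predicate then rejects.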

These pre-filters are approximate and may produce false positives; thus, Sparser intelligently selects the best set of pre-filters that minimizes the overall parsing runtime for any given query. We show that, for Spark SQL queries with low selectivity (i.e., very selective filters), Sparser routinely outperforms the standard parsers in Spark by at least 3x. Sparser can be used as a drop-in replacement for any Spark SQL query; our code is open-source, and our Spark package will be made public soon.
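The selection step can be illustrated with a toy cost model. The sketch below is an assumption-laden simplification, not Sparser's optimizer: it picks a single pre-filter rather than a set, it only applies to conjunctive (AND) predicates, and the `filter_cost` and `parse_cost` values are made-up relative units. The function names are hypothetical.

```python
def estimate_cost(candidate, sample, filter_cost, parse_cost):
    """Expected per-record cost if `candidate` is used as the pre-filter.

    Every record pays `filter_cost`; only records that pass the pre-filter
    (true matches plus false positives) also pay `parse_cost`.
    """
    passed = sum(1 for line in sample if candidate in line)
    pass_rate = passed / len(sample)
    return filter_cost + pass_rate * parse_cost


def choose_prefilter(candidates, sample, filter_cost=1.0, parse_cost=20.0):
    """Pick the candidate substring with the lowest estimated cost on a sample."""
    return min(
        candidates,
        key=lambda c: estimate_cost(c, sample, filter_cost, parse_cost),
    )


# Hypothetical query: WHERE level = 'error' AND host = 'db1'.
# Either term may serve as a pre-filter because a match must contain both.
SAMPLE = [
    '{"level": "info", "host": "db1"}',
    '{"level": "info", "host": "db1"}',
    '{"level": "error", "host": "db1"}',
    '{"level": "info", "host": "web1"}',
]
CANDIDATES = ['"error"', '"db1"']

best = choose_prefilter(CANDIDATES, SAMPLE)
```

Here `"db1"` lets three of the four sampled records through, while `"error"` lets only one through, so under these assumed costs `"error"` is the cheaper pre-filter.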

Session hashtag: #Res4SAIS

About Firas Abuzaid

Firas Abuzaid is a 3rd-year PhD student in the Stanford InfoLab, advised by Profs. Peter Bailis and Matei Zaharia. Firas works on problems at the intersection of machine learning and systems; he enjoys building new systems and abstractions that make machine learning faster, more scalable, and easier to use. His work has applications across a broad variety of domains, such as video classification, recommendation serving, and data analytics; Firas has presented his research at multiple venues, including NIPS, VLDB, and HPTS. In his spare time, when he’s not running experiments, you can find Firas running the Dish behind Stanford, or some other scenic spot around Palo Alto or Menlo Park.

About Shoumik Palkar

Shoumik Palkar is a 3rd-year Ph.D. student at Stanford University, working with Prof. Matei Zaharia on high-performance data analytics and computer networking. Before joining Stanford, he graduated with a B.S. in Electrical Engineering and Computer Science from UC Berkeley in 2015.