Big data never stops and neither should your Spark jobs. They should not stop when they see invalid input data. They should not stop when there are bugs in your code. They should not stop because of I/O-related problems. They should not stop because the data is too big. Bulletproof jobs not only keep working but they make it easy to identify and address the common problems encountered in large-scale production Spark processing: from data quality to code quality to operational issues to rising data volumes over time.
In this session you will learn three key principles for bulletproofing your Spark jobs, together with the architecture and system patterns that enable them. The first principle is idempotence. Exemplified by Spark 2.0 Idempotent Append operations, it enables 10x easier failure management. The second principle is row-level structured logging. Exemplified by Spark Records, it enables 100x (yes, one hundred times) faster root cause analysis. The third principle is invariant query structure. It is exemplified by Resilient Partitioned Tables, which allow for flexible management of large scale data over long periods of time, including late arrival handling, reprocessing of existing data to deal with bugs or data quality issues, repartitioning already written data, etc.
These patterns have been successfully used in production at Swoop in the demanding world of petabyte-scale online advertising.
Sim Simeonov is an entrepreneur, investor and startup mentor. He is the founding CTO of Swoop and IPM.ai, startups that use privacy-preserving AI to improve patient outcomes and marketing effectiveness in life sciences and healthcare. Previously, Sim was the founding CTO of Evidon (CrownPeak) & Thing Labs (AOL) and a founding investor in Veracode (Broadcom). In his VC days, Sim was an EIR at General Catalyst Partners and technology partner at Polaris Partners where he helped start five companies the firms invested in, three of which have already been acquired. Before his days as an investor, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe). Earlier, he was a founding member and chief architect at Allaire, one of the first Internet platform companies whose flagship product, ColdFusion, ran thousands of sites such as Priceline and MySpace.