Robbie has been involved in the big data community for the last seven years, and he was an early Spark adopter back in 2014. He has contributed to a number of projects, including Apache Cassandra and the Cassandra Spark connector, and is the author of Cassandra High Availability. At IBM, Robbie leads a group that includes the Spark Technology Center, as well as BigInsights and other data processing technologies that power the Watson Data Platform.
February 16, 2016 04:00 PM PT
We hear a lot about lambda architectures and how Spark can help us crunch our data both in batch and in real time. After a year and a half in the trenches, I'll share how we at The Weather Company built a general-purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain of learning how to get it right, this talk is for you.
June 7, 2016 05:00 PM PT
Weather is both massive and intensely local. Building a weather service involves capturing lots of data for many, many places over long timeframes, applying intensive analysis to those datasets to predict what will happen next, and then delivering those insights to the millions of mobile users and other endpoints that request them, all within a fraction of a second. Come see the full end-to-end view of how The Weather Company uses Spark to manage this large-scale challenge.
February 8, 2017 04:00 PM PT
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable Spark SQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O-bound) and data decoding (CPU-bound).