Ted has seen the world of data from many angles, from helping hundreds of different companies as a Principal Solutions Architect at Cloudera to spending multiple years at the leading game company Blizzard building out data pipelines and managing data engineering efforts. Ted now serves as a Director of Enterprise Architecture at Capital One, solving data problems at every level of the company.
April 23, 2019 05:00 PM PT
Ever since data became cool in the early 2000s, the dream has been to have a company that collects, learns from, and reacts to data in near real time (NRT), with the flexibility to grow. We are now at that time, and Spark can be a major part of that solution. In this session we will walk through the common data architecture patterns and show how you may be only steps away from a world-class implementation of streams and Spark for data understanding, learning, and response. We will dig deep into Spark patterns for training, model execution, and streaming feature state management.
June 7, 2016 05:00 PM PT
In the world of distributed computing, Spark has simplified development and opened the doors for many to start writing distributed programs. Folks with little to no distributed coding experience can now write just a couple of lines of code that will immediately get 100s or 1000s of machines working on creating business value. However, even though Spark code is easy to write and read, that doesn't mean that users don't run into long-running, slow-performing jobs or out-of-memory errors. Thankfully, most of the issues with using Spark have nothing to do with Spark itself but with the approach we take when using it. This session will go over the top 5 things that we've seen in the field that prevent people from getting the most out of their Spark clusters. When some of these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, just a different approach.
October 25, 2016 05:00 PM PT
Traveling to different companies and building out a number of Spark solutions, I have found that there is a lack of knowledge around how to unit test Spark applications. In this talk we will address that by walking through examples for unit testing Spark Core, Spark MLlib, Spark GraphX, Spark SQL, and Spark Streaming. We will build and run the unit tests in real time and also show how to debug Spark as easily as any other Java process. The end goal is to encourage more developers to build unit tests alongside their Spark applications to increase development velocity, stability, and production quality.
February 7, 2017 04:00 PM PT
So you know you want to write a streaming app, but any non-trivial streaming app developer would have to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shut down my streaming job?
How do I monitor and manage my streaming job (e.g. retry logic)?
How can I better manage the DAG in my streaming job?
When do I use checkpointing, and for what? When should I not use checkpointing?
Do I need a WAL when using a streaming data source? Why? When don't I need one?
In this talk, we'll share practices that no one talks about when you start writing your streaming app, but you'll inevitably need to learn along the way.
February 7, 2017 04:00 PM PT
If you have a parent-child relationship or a many-to-many relationship in your data model, you will want to learn about nested dataset functionality in Spark. Ted Malaska (co-author of Hadoop Application Architectures) will walk through why nested types may change your life in solving common problems like large joins and even cartesian joins. This talk will include a full code example of creating nested tables with Spark SQL, populating those tables, and finally accessing them in a number of ways.
June 5, 2017 05:00 PM PT
It is one thing to write an Apache Spark application that gets you to an answer. It's another thing to know you used all the tricks in the book to make it run as fast as possible. This session will focus on those tricks.
Discover patterns and approaches that may not be apparent at first glance, but that can be game-changing when applied to your use cases. You'll learn about nested types, multithreading, skew, reducing, cartesian joins, and fun stuff like that.
Session hashtag: #SFdev13
June 5, 2017 05:00 PM PT
So you know you want to write a streaming app, but any non-trivial streaming app developer would have to think about these questions:
- How do I manage offsets?
- How do I manage state?
- How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
- How do I gracefully shut down my streaming job?
- How do I monitor and manage my streaming job (e.g. retry logic)?
- How can I better manage the DAG in my streaming job?
- When do I use checkpointing, and for what? When should I not use checkpointing?
- Do I need a WAL when using a streaming data source? Why? When don't I need one?
This session will share practices that no one talks about when you start writing your streaming app, but you'll inevitably need to learn along the way.
Session hashtag: #SFdev5