At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure.
In mid-2013, we began to examine stream-based approaches to accessing the click-stream data from our pipeline. We asked ourselves: rather than “warm up our cold path” by running those larger batches more frequently, could we give our developers a programming model and framework optimized for incremental, small-batch processing, yet continue to rely on the Cloudera platform? Ideally, our engineering team would focus on the data itself rather than on details like consistency of state across the pipeline or fault recovery.
Apache Spark is a fast, general framework for large-scale data processing, with a programming model that supports building applications that would be more complex or less feasible with conventional MapReduce. (Spark ships inside Cloudera Enterprise 5 and is already supported for use with CDH 4.4 and later.) With an in-memory persistent storage abstraction, Spark offers complete MapReduce functionality without the long execution times imposed by data replication, disk I/O, and serialization.
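The core of that abstraction is the resilient distributed dataset (RDD). The following is a minimal Scala sketch of the idea (the HDFS paths, field positions, and tab-delimited record layout are hypothetical, not something prescribed by Spark or by our pipeline): a parsed click-stream RDD is cached in memory once and then reused by two different aggregations without re-reading or re-parsing the raw files.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClickCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("click-counts"))

    // Parse tab-delimited click records once and keep the result in memory,
    // so later aggregations don't re-read or re-parse the files from HDFS.
    val clicks = sc.textFile("hdfs:///data/clicks/2014-03-01/*") // hypothetical path
      .map(_.split("\t"))
      .cache()

    // Two different aggregations reuse the same cached dataset.
    val byCreative  = clicks.map(fields => (fields(1), 1L)).reduceByKey(_ + _)
    val byPlacement = clicks.map(fields => (fields(2), 1L)).reduceByKey(_ + _)

    byCreative.saveAsTextFile("hdfs:///output/clicks-by-creative")
    byPlacement.saveAsTextFile("hdfs:///output/clicks-by-placement")

    sc.stop()
  }
}
```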
Because Spark Streaming shares the same API as Spark’s batch and interactive modes, we now use Spark Streaming to aggregate business-critical data in real time. A consistent API means that we can develop and test locally in the less complex batch mode and have that job work seamlessly in production streaming. For example, we can now optimize bidding in real time, using the entire dataset for that campaign without waiting for our less frequently run ETL flows to complete. We are also able to perform real-time experiments and measure results as they come in.
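As a rough illustration of that API parity (the campaign-id field position, sample paths, and socket source below are placeholders, not our production job), the same aggregation function can be written once against RDDs, exercised as a plain batch job, and then applied unchanged to each micro-batch of a DStream via transform:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ImpressionCounts {
  // One transformation, written once against RDDs.
  // Field 0 is assumed (hypothetically) to hold a campaign id.
  def byCampaign(lines: RDD[String]): RDD[(String, Long)] =
    lines.map(_.split("\t"))
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("impression-counts"))

    // Batch mode: develop and test the logic against a sample file.
    byCampaign(sc.textFile("hdfs:///data/impressions/sample"))
      .collect()
      .foreach(println)

    // Streaming mode: the same function runs unchanged on each micro-batch.
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999)
      .transform(rdd => byCampaign(rdd))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because DStream operations such as map and reduceByKey mirror their RDD counterparts, the same logic could equally have been expressed directly against the DStream.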
Our batch-processing system looks like this:
Sharethrough’s former batch-processing dataflow
For our particular use case, this batch-processing workflow couldn’t surface performance data while the results of those calculations were still actionable. For example, learning an hour after a client’s daily budget has been spent that their optimized content performance was 4.2 percent means our advertisers aren’t getting their money’s worth and our publishers aren’t seeing the fill they need. And even when the batch jobs take only minutes, a spike in traffic can slow a given batch job enough that it “bumps into” newly launched jobs.
For these use cases, a streaming dataflow is the viable solution:
Sharethrough’s new Spark Streaming-based dataflow
In this new model, our latency is only Spark processing time and the time it takes Flume to transmit files to HDFS; in practice, this works out to be about five seconds.
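In outline, and assuming a hypothetical HDFS directory that Flume delivers into, the streaming side looks something like the following: a StreamingContext with a five-second batch interval watches that directory via textFileStream and aggregates whatever files land in each interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingIngest {
  def main(args: Array[String]): Unit = {
    // A five-second batch interval, matching the end-to-end latency described above.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("streaming-ingest"), Seconds(5))

    // Watch the (hypothetical) HDFS directory that Flume writes into; each
    // batch processes only the files that appeared since the previous one.
    val events = ssc.textFileStream("hdfs:///flume/events")

    // Example per-batch aggregation: count events per placement, assuming a
    // tab-delimited record with the placement key in field 2.
    events.map(_.split("\t"))
      .map(fields => (fields(2), 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```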
When we began using Spark Streaming, we were able to ship quickly and with minimal fuss. To get the most out of our new streaming jobs, though, we did have to adjust to the Spark programming model.
Here are some things we discovered along the way:
Sharethrough Engineering intends to do a lot more with Spark Streaming. Our engineers can interactively craft an application, test it in batch mode, move it into streaming, and it just works. We’d encourage others interested in unlocking real-time processing to take a look at Spark Streaming. Because the Spark API is so concise, engineers comfortable with MapReduce can build streaming applications today without having to learn a completely new programming model.
Spark Streaming equips your organization with the kind of insights only available from up-to-the-minute data, whether that means feeding machine-learning algorithms or powering real-time dashboards: it’s up to you!