Building a Streaming Microservice Architecture with Apache Spark Structured Streaming and Friends


As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads – from real-time processing and aggregation of user/behavioral data, to rule-based/conditional distribution of event and metric streams, to almost any data pipeline/lineage problem. These workloads are typical in most modern data platforms and are critical to all operational analytics systems, data storage systems, ML/DL and beyond. One of the common problems I’ve seen across a lot of companies can be reduced to general data reliability problems, mainly due to scaling and migrating processing components as a company expands and teams grow. What was a few systems can quickly fan out into a slew of independent components and serving layers, all of which need to be scaled up, down or out with zero downtime to meet the demands of a world hungry for data. During this technical deep dive, a new mental model will be built up which aims to reinvent how one should build massive, interconnected services using Kafka, Google Protocol Buffers/gRPC, and Parquet/Delta Lake/Spark Structured Streaming. The material presented during the deep dive is based on lessons learned the hard way while building up a massive real-time insights platform at Twilio, where data integrity and stream fault-tolerance are as critical as the services our company provides.


Video Transcript

– Hey, thanks for coming to my Spark Summit session on building a streaming microservice architecture with Spark Structured Streaming and friends. My name’s Scott Haines, and I’m a senior principal engineer at a company called Twilio. I’m going to talk a little bit about myself and my background, and then we’ll jump into things. All right, so, me: as I said before, I work at a company called Twilio. If you’re not familiar with Twilio, we’re a large communications company that started off about 11, 12 years ago, just doing SMS and telephony, and we’ve moved on since then. I’ve been there for about four years, and in general I’ve been working in streaming architectures for about 10 years, starting at Yahoo. Some other things I’ve done that I think are interesting: at Twilio, I brought a streaming-first architecture to the voice and video group through a project called Voice Insights, about four years ago; I also lead Spark office hours at Twilio; and I’ve really just enjoyed distributed systems for my entire career. All right, so the agenda for today. We’re going to take a look at the big picture: what does a streaming architecture look like, and how does that differ from the API-driven architectures people have seen before? Then we’ll look at the actual technology that drives it: what are protocol buffers, why do I like them, and why do I think you should like them as well? What is gRPC, what is RPC, and what’s a protocol stream? And then we’re going to figure out how this all fits into the Spark ecosystem. So that’s the agenda for today, and we’re going to pop right into it.

So let’s first take a look at the big picture. Zoom all the way out and look at what a streaming-first architecture – the streaming microservice architecture – looks like. We’re going to go from the left all the way to the right, and consider this basically a span.

Streaming Microservice Architecture

So if you consider a span as a single event stream, from a client to a server and then through the entire streaming backend, this is what every single service within this gigantic streaming architecture would look like. But we’re going to focus on a single client-to-server relationship and how data traverses a streaming system. In the first box, we take a look at a gRPC client. What is the gRPC client, what is gRPC, what is HTTP/2, and what is protobuf? As we distill the individual components and the technology that drives this architecture, we’ll be able to traverse this entire presentation and really understand how it actually works. But this is a very top-level, high-level overview. We go from a gRPC client, which could be a client running in an iOS SDK, or in JavaScript – it doesn’t really matter, it could be a client running in a Java server. That client connects and does a remote procedure call to a gRPC server, which communicates with the gRPC client over something called protobuf. Protobuf messages are passed from the client to the server in the same way you’d expect to see a normal method call on a client SDK. So it encapsulates this client-to-server relationship without you having to really think about where the server lives, which is really, really cool – plus it’s lightning fast. From there, once everything hits the server, things are published to Kafka. We’ll talk a little bit more about Kafka, but I think everybody at Spark Summit already knows about Kafka and its importance. Once the protobuf is in Kafka on an individual topic, it can get picked up by Spark. Spark can natively interact with protobuf through something called ScalaPB.
ScalaPB allows you to convert your protobuf messages directly into a DataFrame or a Dataset within your Spark application. So you have this entire end-to-end system getting everything into Spark. Once things are in your Spark application, it’s really easy to interact with anything either within the Spark ecosystem or outside of it, just by converting the individual DataFrames to parquet and dropping them into something like HDFS or S3, or wrapping it all with Delta. All right, so let’s look a little further at how this actually expands. You start off with a single span-level architecture, and that span-level architecture gives you the client-to-server relationship, with the server passing an individual message to a topic in Kafka. Once something’s in Kafka, things get really interesting. If you look at this architecture, it’s basically just a reconfiguration of exactly what we were looking at before. Once things are in HDFS – the bottom row – or in Kafka – the top row – there’s the ability to pass these messages from the gRPC client to the gRPC server, to Kafka, to your Spark application, but you can also pass them back and forth. You can go back to Hadoop, or you can take your parquet data and read it back from Hadoop again into another application. What we’ve done at Twilio is use this architecture to reliably pass messages between the individual streaming Spark applications, without the need to schedule a bunch of different batch jobs like you’d expect with something like Airflow, or even just a crontab, for the different Spark applications.
The cool thing too is that at the very end, once you’re ready to close the books on whatever the message is, it can go back to gRPC, because gRPC enables bidirectional communication. So you have the ability to peer in on the end result of a Spark job, and then you can pass it back down to a client once a specific job has run from end to end. We’re going to look at what that actually looks like as we drive deeper into the individual components of this architecture, but at a high level, it’s literally a message from a client to the server, encapsulated through a remote procedure call, which basically gets rid of that client/server edge. From there, when things go into Kafka, it’s really up to you what you want to do with them. But the end result is that you can reliably pass data back and forth with compile-time guarantees, which is really nice, especially for a system that’s online and running 24/7, 365.

Alright, so we’re going to talk now about protocol buffers, aka protobuf. Why use them? If you’re not familiar with protocol buffers, maybe you’re only familiar with something like JSON or CSV. JSON and CSV are structured data, but they’re not strict: the types can change and mutate over time, and that can break your downstream systems with one bad upstream commit. The cool thing with protobuf is that it’s compiled down and it’s language agnostic. Think about a common message format that can be used – in this case to encapsulate what a user is; a user can be anything in the system, it’s really however you want to define a user. The big win here is that you have a language-agnostic message format that compiles down to the language of your choice. So if your server-side language of choice is, say, C++ or Java, but your client libraries are all written in Python or Node.js, this allows you to use the same messages back and forth. You don’t have to worry about a bunch of different APIs trying to absorb an API written in, say, JSON, which might be changing – where any change downstream makes everything else problematic, and a bad commit or a bad push that got rolled back breaks your downstream systems. Protocol buffers were an idea that came out of Google, and the nice thing about them is that you can inject them into any of your libraries as a versioned, compiled asset or resource for that library. That allows you to do anything you want without having to worry about how the data changes, as long as you’re abiding by some of the guidelines. So I’d recommend taking a closer look at the protocol buffers documentation; it gives you a really good overview of how to actually use them. But really, the big win and the big takeaway is that you have the ability to compile down and have versioned data that can flow all the way through your system, across language boundaries – from Java to Node.js and back again, et cetera. So they’re really cool, and we’re going to take a bit more of a look at them.
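To make the idea concrete, here is a minimal sketch of what a `.proto` message for the user example might look like. The field names and numbers are my own illustration, not the actual slide code:

```protobuf
// user.proto — illustrative only; field names and numbers are assumptions.
syntax = "proto3";

// A language-agnostic message definition: protoc compiles this same file
// down to Java, Scala, Python, Node.js, C++, etc.
message User {
  string user_id    = 1;
  string name       = 2;
  int64  created_at = 3;  // epoch millis
}
```

Because every language binding is generated from this one file, a non-breaking change (adding field 4, say) can roll out to producers and consumers independently.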

So, as I said before, you have a language-agnostic message type that can be compiled down to your language of choice. The really cool thing with that is that you can auto-generate builder methods – a kind of scaffolding class. Think of, say, a Java builder, like we have on the right: this is a Java builder being used from Scala, with our val over there, val user. One way of looking at this is basically saying: I want to create a user. So now I have a new builder, and all the methods that create an immutable structure are already written for you. There’s nothing you actually have to do. I like to talk about this as: be lazy, it’s all right – just write your messages, compile them down, and everything else has been done for you. You save time, you’re accurate, and things can be versioned, which is really, really good. Aside from that, protocol buffers come with their own serialization and deserialization library. So you don’t have to worry about how this works in your language, because the libraries that interoperate with protobuf across Node.js, Python, Java, Scala, et cetera, all have their own way of serializing and marshalling those objects. So you basically get an entire serialization system for free, which is lightning fast and super optimized, which is wonderful.
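As a hedged sketch of the builder pattern described above – assuming a `User` message compiled with the standard Java protobuf plugin (the class and setter names are illustrative, not the actual slide code):

```scala
// Generated builder: all setters and the immutable build() come for free.
val user = User.newBuilder()
  .setUserId("u-123")
  .setName("Scott")
  .setCreatedAt(System.currentTimeMillis())
  .build()

// Serialization ships with the library — no hand-written marshalling code.
val bytes: Array[Byte] = user.toByteArray

// Deserialization is symmetric and works across language boundaries:
// these bytes could be parsed by a Python or Node.js consumer just as well.
val roundTripped: User = User.parseFrom(bytes)
```

The same `.proto` file compiled with ScalaPB would instead produce a case class (`User(userId = "u-123", ...)`), which is what the Spark portion of the talk relies on later.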

All right, so now we’re going to take a look at gRPC and what a protocol stream is. This piggybacks on everything we’ve seen so far: it uses protocol buffers, and this is basically the first look at the client-to-server gRPC relationship.

So gRPC, what is it? gRPC piggybacks on protobuf, and it’s a way of defining, in protobuf, a server contract. If you think about RPC, the idea of RPC is to have a remote procedure call.


What does that mean? Well, if you’re a client and you have an SDK, and the SDK has methods that do something where there’s actually IO involved, there’s an almost invisible boundary around how the client sends a message to the server. We’re going to look at an example for ad tracking, which is something I used to do back in the day at Yahoo – and back then we used JSON for everything – and we’re going to look at how to do it with gRPC. For example, if you have a gRPC client running in a JavaScript SDK, and you’re tracking how a customer interacts with an ad for, say, growth marketing, then the gRPC client doesn’t have to worry about how to compose an object for the server, other than just creating a protobuf and sending it to the server through the gRPC client. So instead of worrying about what type of HTTP library you’re using, whether this will work across different versions of JavaScript, whether people are using the right library version, and whether this will work indefinitely – that’s all encapsulated in the gRPC process. gRPC stands for generic remote procedure call; I like to think of it as Google remote procedure call, because it was created at Google as well. And it’s used to power a lot of the services people use nowadays, from TensorFlow to CockroachDB, and Envoy as well, which uses HTTP/2 and protobuf as a way of building a reliable proxy. It’s really, really cool because when you compile it down, you don’t have to write scaffolding for a lot of your server-side or client-side libraries, given that you can compile it down – like you would your protocol buffers – to create your scaffold: your interfaces in Java, and vice versa.
The other really cool thing, as I talked a little bit about before, is that you have the ability to do bidirectional streaming. This doesn’t necessarily come out of the box by default, but there’s pluggable support for it. Think about your server pushing down to your client, your client responding and pushing back to the server, and having this entire two-way communication – that’s one of the things gRPC actually powers. It takes a lot of the effort out of channel-based communication. For example, say you’re running an application where a customer has logged in and wants stock data; now your server can just push that down to the people who are listening to a specific stock – say some NASDAQ stock. They’d get those quotes and see them updating in real time, just by subscribing to a channel for the tickers they’re tracking, which is kind of nice. The server side does one thing and broadcasts it to all of the individual peers within that channel. So that’s the whole bidirectional part. And given that streaming, you can connect it to your backend, which is also streaming, running Spark jobs that are streaming – you have this end-to-end stream that really opens up the boundaries to a lot of things that were very complicated in the past, and just adds a framework on top of it.

So, as I said before, we’re going to take a look at an actual gRPC example. The example I wanted to show today is basically just ad tracking. We’ll look at what the message looks like that’s going to be the interface between our client and our server. Then we’ll look at the actual server code that accepts these messages, emits them to Kafka, and sends back a response to the gRPC client. This is an example just to show how little effort it really takes to make something robust and powerful within your ecosystem. So consider it a very smart way of doing data engineering: ensuring that everything you expect to get into your system will get into the system, in a very defined, versioned data package.

Alright, so defining gRPC messages. As we saw before in the example with the user, everything is just known as a message in the proto file. So: message AdImpression. We’d take an adId, which could be a string. Then maybe we have a contextId – what page, or what category of pages, the ad is actually being displayed in – and potentially a userId or a sessionId, so you can track this back to an individual. GDPR first: make sure that the userId itself is not something like a social security number, but something that makes sense for targeting in your system. It could be an IDFA or anything else from the ad world that identifies a type of user without the user actually being known. You’re tying all of this information, from the notion of some kind of user, across the context, back to an individual ad at a specific time. That’s basically an ad impression – there’s only four fields. Then you just have a normal kind of response: a status code and a message. The status code can be a typical HTTP status code, or it can be a status code you create for your own server, to let you identify different types of problems or issues that come up. So you have this relay back and forth, plus a message: if you want a human-readable message to respond to your client, telling them why something didn’t work – or that something did work – all of that’s possible as well.

Alright, so we had our AdImpression, and I still have that up on the right-hand side of the screen. When I said before that you have protobuf, you also have the ability to define your services in protobuf as well – this is where that all comes into play. We talk about this as a server definition, or service definition. If we have a click-track service that takes an AdImpression, then really all we do – look at the top right – is: service ClickTrackService. So now we have a named service. Our rpc method, AdTrack, takes an AdImpression and returns a response. If you think about everything in this definition, there’s really nothing that can be misconstrued, because the client has to create an AdImpression, and it has to add an adId, contextId, userId, and timestamp – or the server can validate and say, this is not correct because you’ve added incorrect data, maybe an empty string or something like that. And your response is always going to be exactly the same response for the compiled version of your client code, your server code, and the protobuf at a specific version. So if you consider versioning like people do across the board – say Maven versioning, or any kind of versioning in Artifactory or something else – then if you’re making a non-breaking change, great, everything will always work. If you’re making a breaking change, coordinate that with your downstreams so that when an upstream changes, nothing breaks downstream. But if you’re following really strict API-style best practices, then no matter what you do, everything should be backwards compatible anyway – protobuf is backwards compatible.
So potentially, if you’re trying something new and you’re canary testing, say, a new field in your AdImpression, other services downstream that have an old version of the protocol buffer for your AdImpression wouldn’t even know that a new field had been added. Anybody else interoperating with that AdImpression would see something called an unknown field and wouldn’t have to do anything with it, which is really great. It allows you to basically opt in, if you want, to specific fields that are being added, and it’s a really nice way to avoid breaking a service that’s in production just because you want to try something new – say you’re working on a rollout that’s opt-in with feature flags. All of this stuff basically comes out of the box if you’re following protobuf and gRPC standards.
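The service contract described above can be sketched in the same `.proto` file as the messages. Again, the names here are my reconstruction, not the actual slide:

```protobuf
// Service definition: the compile-time contract between client and server.
// Compiling this scaffolds the client stub and the server interface in
// whatever language each side uses.
service ClickTrackService {
  // Remote procedure call: the client must supply a well-formed
  // AdImpression, and always receives a TrackResponse back.
  rpc AdTrack (AdImpression) returns (TrackResponse);
}
```

Because both the request and response types are pinned at a specific protobuf version, nothing about the call can be misconstrued between the compiled client and server.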

Cool, so now that we have this interface for what a message and a response look like for AdImpression, and we have a service – our AdTrack service – we can compile the gRPC definitions down; here I’m compiling with the Akka gRPC library for Scala. All this does is take the individual definition from the protocol buffer – the .proto file we looked at before – and once it’s compiled down, it scaffolds the interface for my click-track service. Look at this class: we have the click-track service implementation, which takes an Akka Materializer – I’m not going to go into that, because this isn’t a talk about Akka – and it extends ClickTrackService, which is what we defined in our RPC contract. When we extend ClickTrackService, the method is scaffolded for us, and all we have to do at this point is fill in the blanks for what AdTrack does. Look at override def AdTrack: what we’re taking in as our gRPC data is our AdImpression. We know that this is exactly what we should be receiving for AdTrack, because we defined that ourselves in our rpc definition. And then the response is wrapped in a Future. So there is a promise that at some point in the future, unless everything fails, you will get a response to your client – it’s basically an async contract. What we’re looking at right now, if we go through the code, is basically just a promise of a response. Before, we took a look at the high-level picture of the architecture, from client to server and server to Kafka – this is all that’s showing, and it’s a dummy Kafka client, just to show you how it would work.
So we have KafkaService.publish, which takes a KafkaRecord and returns a Try of a Boolean – here it always just returns true. This would be connected to a real Kafka client; you’d be doing much more on your end, but that wouldn’t fit in the example. With Kafka.publish we’re sending a KafkaRecord, which has a topic and a payload – so we’re sending basically just binary data to Kafka: a binary topic, and a binary data payload. Everything’s binary, so it’s lightweight. The nice thing is that, as we saw before, because protobuf comes with its own serialization library, it’s very quick to serialize to a byte array. So we get binary data in – we have the AdTrack data – and we can do what we want prior to actually sending it to Kafka. Say, for example, you want to add a server-side-only field to the ad impression: you could do that, say stamping a timestamp of the time you received the message. With all said and done, once you take that AdImpression record and call toByteArray on it, you have a byte array – binary data that is versioned and compiled to a specific version of your protobuf, on a specific topic. If this is a successful response and we’re able to publish to Kafka, a 200 OK comes back – a normal kind of HTTP code for your client – encapsulating the fact that you’ve published the AdImpression you wanted to record. That’s fairly lightweight: all in all, this is maybe 80 lines of Scala. It could of course be a lot larger in a normal production use case, with validations across the board and everything else, but hopefully this gives a quick tidbit of what it would actually take to implement the service and go run it.
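A minimal sketch of the service implementation just walked through. `ClickTrackService` stands in for the trait Akka gRPC would generate, and `AdImpression`/`TrackResponse` for the ScalaPB case classes; the Kafka publisher is the same dummy as on the slide. All names and signatures here are assumptions, not the actual slide code:

```scala
import scala.concurrent.Future
import scala.util.{Success, Try}

// Dummy stand-ins for the generated gRPC/ScalaPB types.
final case class AdImpression(adId: String, contextId: String,
                              userId: String, timestamp: Long) {
  def toByteArray: Array[Byte] = toString.getBytes("UTF-8") // ScalaPB provides the real one
}
final case class TrackResponse(status: Int, message: String)
trait ClickTrackService { def adTrack(in: AdImpression): Future[TrackResponse] }

// Dummy Kafka publisher: a real one would wrap a KafkaProducer.
final case class KafkaRecord(topic: String, payload: Array[Byte])
object KafkaService {
  def publish(record: KafkaRecord): Try[Boolean] = Success(true)
}

class ClickTrackServiceImpl extends ClickTrackService {
  override def adTrack(in: AdImpression): Future[TrackResponse] = {
    // protobuf's built-in serializer: one call yields wire-format bytes.
    val payload = in.toByteArray
    KafkaService.publish(KafkaRecord("ad-impressions", payload)) match {
      case Success(true) => Future.successful(TrackResponse(200, "ok"))
      case _             => Future.successful(TrackResponse(500, "publish failed"))
    }
  }
}
```

The async contract is visible in the return type: the client always gets a `Future[TrackResponse]`, whether the Kafka publish succeeded or not.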

On the right-hand side you’ll see literally the same screen from before, but now we’ll go a little deeper into what the protocol stream would be. If you think about the client-to-server relationship as basically a pipe of data for your AdImpressions, this data flows directly from your client to your server, and from the server everything flows into Kafka. At that point everything is entering the stream, and that’s basically the ingress point into this streaming data-lineage pipeline, or whatever else you have at your company. The nice thing is that there’s no random raw garbage data going into Kafka – and a lot of people have probably struggled in the past with garbage-in, garbage-out systems. The really nice thing about putting gRPC in front of your data-lineage pipelines – or really any pipeline for reliable data at your company – is that it gives you actual contracts for data, literally from the client to the server, and from the server to Kafka. And given that the protobuf can be validated on the server side, nothing that is garbage gets into the system. If nothing bad gets into the system, you don’t have to be as defensive downstream, as long as you’re abiding by really good best practices. This takes a lot of the defensive programming out of the path, which makes everything a lot simpler downstream, because you don’t have to first process the data and remove the garbage – you can do that before anything hits Kafka, which speeds up your entire system downstream. The other nice thing is that you can use this for anything you can think of. In the ad use case, it can all feed into real-time personalization and predictive analytics for which ads to show next, based on how people have actually interacted with them, in real time. Creating really whatever you want – it’s all at your fingertips.

So this is the component we’ve looked at so far within the gRPC architecture. We talked about the service, and we talked about AdTrack with our tracked ad. All it’s doing is running a remote procedure call against the gRPC server, which has the ClickTrackService implementation with the AdTrack method for that tracked ad. That’s a binary message being read as binary data on the server side, which has a small footprint – it’s not a huge, bloated JSON payload – and it can be validated very quickly, because you know exactly what fields to expect. Then it can go into our stream. It’s still just protobuf, still binary, so it’s really no different from what we were looking at between the gRPC client and the gRPC server. We don’t really change anything – and if there are no mutations, things can run super fast. So from the gRPC server to Kafka, nothing changes again. It’s just binary, validated, structured data, which is a really good start to your whole data-lineage pipeline. And especially in Spark, it’s nice not to have to worry about being overly defensive with your streaming data, because streaming systems can go down at any time.

Structuring Protocol Streams

So now we’re going to talk about how we wrap a nice shell around this whole idea. Given that it’s Spark Summit, and given that people really like streaming and Spark did a really good job with Structured Streaming, we’re talking about Structured Streaming and protobuf, because it really is the icing on the cake of this whole architecture. As I said before, once everything gets into Kafka, we have topics that are each bound to an individual protocol buffer type – a topic of protocol buffers is a protocol stream. Hence: structured protocol streams.

With Structured Streaming and protobuf, the nice thing is that you can use what Spark already has baked in internally through its expression encoders library to work with protocol buffers – which are compiled, in this case, through ScalaPB, since this is a Scala example. ScalaPB is a compiler library that takes your protocol buffer definitions and generates case classes.

Structured Streaming with Protobuf

Given that a case class extends something called Product in Scala, and Spark works with the product encoder, you can either implicitly bring in the product encoder for any type by importing from the Spark session’s implicits, or you can explicitly generate the implicit yourself – so you don’t have to worry about it happening at runtime, and you know exactly what you’re expecting. I have over here: implicit val adImpressionEncoder, an ExpressionEncoder that takes an Encoder of AdImpression, and all it’s doing is wrapping Encoders.product, which ships for free with Spark. So what does this get you? Native interoperability with protobuf in Apache Spark through ScalaPB, and the ability to work directly with the data that’s coming in from Kafka – from that client, to the server, to Kafka, and now into your actual Spark application. From here you can do aggregation for an individual user, prepare features for machine learning and send them through your machine learning pipelines, and then everything can roll right back into your ad server, which could also be a gRPC server. So it really connects the dots for this whole streaming system in a really nice way, where there’s an expectation of how things will actually work.
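The implicit encoder described above can be sketched in a couple of lines. This assumes `AdImpression` is the ScalaPB-generated case class (which extends `Product`, so Spark's built-in product encoder applies):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Explicitly declare the encoder instead of relying on spark.implicits._,
// so the mapping from protobuf case class to Spark rows is pinned down
// at compile time rather than resolved at runtime.
implicit val adImpressionEncoder: Encoder[AdImpression] =
  Encoders.product[AdImpression]
```

With this in scope, `dataset.map(...)` and `dataframe.as[AdImpression]` work directly on the protobuf-derived type with no intermediate conversion layer.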

So, as I said before, native is better. With strict native Kafka-to-DataFrame conversions, there’s no need to transform intermediate types. If you’ve ever worked with a type that isn’t supported by Spark out of the box, a lot of the time you’ll write a transform method that takes binary data from Kafka, bring your own serializer or deserializer into the mix, and then either restrict yourself to the RDD API or write your own complicated expression encoder. Back in 2018, I gave a Spark Summit talk about streaming trend discovery, and in that code base there’s an example of how to bring your own native protobuf encoder to Spark – but in this case we’re just looking at ScalaPB, because it’s a lot easier. On the right-hand side is an example of what a Structured Streaming job would look like that brings in this AdTrack information and runs it through an actual machine learning model. We have our query – our streaming query – which is basically our input stream of Kafka data. We’ve loaded our data, and now we’re transforming it into AdImpressions, then joining on adId against something called contentWithCategories. contentWithCategories would come from an ETL process, keyed on adId, to get more information about what’s actually encapsulated within the ad. From there, we have all the information we’d need to transform this data through our serialized pipeline. There have probably been a lot of other talks over the past few days of Spark Summit about how to create a pipeline, and how to load and save pipeline models.
And then how do you also save, say, a logistic regression model, a linear regression model, et cetera. This use case is showing what it looks like from the data engineering or machine learning engineering point of view, once those models and pipelines have been serialized off and you're bringing them back in. It's all in-stream: join on adId, transform to run our pipeline transformation, and then predict. In the case of, say, ad serving, we want to predict the next best ad for this individual user. From there, we're just going to write this out as parquet and dump it into HDFS or Delta, and allow another system to pick that up, which it can also do in real time. At this point — once everything hits Kafka, once it's been processed, and once a predicted label has been applied to that data and it can get picked back up — you have this full end-to-end cycle that is serving your ads. The other nice thing is that, given that there's really nothing that should change as long as best practices are enforced, you can rest at night knowing that the pipelines are actually safe. I think that's one of the biggest things people worry about, especially from a data engineering point of view: that a small mutation creates a butterfly effect that can take down an entire ad server. And to speak on ads — ads are money, so that's something you really don't want to do. If you put the right safety precautions on the whole system, you don't have to worry so much about that, and it's really good for everybody in the downstream path as well.
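As a rough sketch, the in-stream join-and-predict job described above might look something like the following. The topic name, paths, model location, and the parseImpression helper are all hypothetical stand-ins (ScalaPB's generated parser is stubbed out), so treat this as a shape, not the actual Twilio code:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Stand-in for the ScalaPB-generated message
case class AdImpression(adId: String, userId: String, ts: Long)

object AdTrackJob {
  // Placeholder for ScalaPB's generated AdImpression.parseFrom
  def parseImpression(bytes: Array[Byte]): AdImpression = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ad-track").getOrCreate()
    import spark.implicits._

    // Static enrichment table produced by a separate ETL process
    val contentWithCategories =
      spark.read.parquet("/warehouse/content_with_categories")

    // Previously trained and serialized pipeline (feature transforms + model)
    val model = PipelineModel.load("/models/ad-ranker/latest")

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "adtech.adtrack")
      .load()
      .select(col("value").as[Array[Byte]])
      .map(bytes => parseImpression(bytes))      // binary protobuf -> case class
      .join(contentWithCategories, Seq("adId"))  // enrich on adId
      .transform(df => model.transform(df))      // predict the next best ad
      .writeStream
      .format("parquet")
      .option("path", "/lake/ad_predictions")
      .option("checkpointLocation", "/checkpoints/ad_predictions")
      .start()

    query.awaitTermination()
  }
}
```

The checkpoint location is what gives the query its exactly-once restart semantics on the file sink.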

The other nice thing about Structured Streaming with protobuf is that you can also use your protobuf definition for validation. Taking a step back: if you have a protobuf that is, say, an input format, and an expected output format for your protobuf, you can use both protobuf objects as a way of validating a strict output type. Delta has done this with versioning of the underlying struct types being stored in the Delta Lake. If you don't use Delta Lake yet and you're using something like HDFS, then this is a lot more important, because it's not as easy to change the format of your parquet data over time. One thing that we found on my team at Twilio is that if we have a definition that can be written out as, say, a DDL — which is part of a DataFrame schema — then it's really easy to read that data back in and validate it. So on the top right I have this method, def validateOutput, which takes a DataFrame and returns a boolean. The DataFrame schema, converted to DDL, is compared for equality against our report struct type DDL. That struct type DDL is versioned for every single release of this piece of Spark code, and if anything ever mutates, it's invalid. Once it's invalid, we're either logging it as invalid or, more likely, pushing that data to Datadog or other monitoring services to let us know that things have broken in the pipeline. In fact, it's even better to do this at the gRPC level, so that nothing invalid ever enters the pipeline at all — otherwise you have to be defensive all the way across the board. But once you have a Spark application that's already running and exchanging only parquet data, the parquet data itself, which is converted to your DataFrame, can still be validated with either protobuf or a struct type.
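A hedged sketch of what that validateOutput guard might look like; the column names and DDL string here are illustrative assumptions, not the actual versioned schema from the talk:

```scala
import org.apache.spark.sql.DataFrame

object ReportSchema {
  // Versioned, expected output schema expressed as a DDL string.
  // In practice this would be bumped with each release of the Spark job.
  val reportStructTypeDDL: String =
    "adId STRING, userId STRING, category STRING, score DOUBLE"
}

def validateOutput(df: DataFrame): Boolean = {
  // StructType.toDDL renders the live schema in the same DDL notation,
  // so a strict string comparison catches any mutation.
  val dfSchemaToDDL = df.schema.toDDL
  val valid = dfSchemaToDDL == ReportSchema.reportStructTypeDDL
  if (!valid) {
    // In production this would emit a metric or alert (e.g. to Datadog)
    // rather than just logging.
    println(s"Schema drift detected: $dfSchemaToDDL")
  }
  valid
}
```

Because the expected DDL ships with the job, a schema mutation anywhere upstream fails fast instead of silently corrupting the output.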
So I think the big key takeaway is that as long as you know exactly what the input and the output to your systems are, it's really easy at that point to create something that seems complex from the outside, but is very solid from an operational point of view once it's up and running — as long as you're adhering to strict best practices. More often than not, it's easier to write those guarantees into a wrapping library that enforces strict data policies, which is really important. All right, so now we're going to take a look at a real-world use case: a specific example from one of the Close of Books jobs that runs within my team at Twilio, the Voice Insights team. What we do is run a job at the end of every single day, re-validating that all the data that came in through our streaming pipeline was actually accurate. The nice thing about this is that we have this known notion of a Close of Books, and all of this data can be written out into a final output stream, which can then be reused to run machine learning jobs and everything else against. If we take a look at the method on the top right, we have a run method, which creates a streaming query over our call summary data. I work on the Voice Insights team; we have CDRs, which are call summaries, and that call summary data is loaded from our parquet stream. The parquet stream is then compiled down — deduplicated — and we push that back out as a final stream into HDFS. So it's an example of how to do that, and it gives you an idea of how we've wrapped this into our data engineering pipelines.
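A simplified sketch of what a Close of Books job along these lines might look like; the schema, paths, and the deduplication key (callSid) are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object CloseOfBooks {
  // Illustrative schema; a real CDR / call-summary schema is much wider
  val callSummarySchema: StructType = new StructType()
    .add("callSid", StringType)
    .add("accountSid", StringType)
    .add("endTime", LongType)

  def run(spark: SparkSession): Unit = {
    val query = spark.readStream
      .schema(callSummarySchema)          // file sources need an explicit schema
      .parquet("/lake/call_summaries/raw")
      .dropDuplicates("callSid")          // re-validate: exactly one record per call
      .writeStream
      .format("parquet")
      .option("path", "/lake/call_summaries/final")
      .option("checkpointLocation", "/checkpoints/close_of_books")
      .start()

    query.awaitTermination()
  }
}
```

The deduplicated "final" stream is then the trusted input for downstream machine learning and reporting jobs.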
It seemed interesting to show this after the AdTrack and the other example use cases, to answer: what does a real-world use case look like? And it's actually fairly trivial and fairly simple. So, what we've looked at today is the end-to-end pipeline — gRPC to Kafka to Spark, with final output to parquet and other Hadoop-compatible storage — and I really believe this pipeline works well, especially as things get more and more complicated.

Streaming Microservice Architecture

So a lot of people will start off with a single service that has a single server, maybe two or three Kafka topics, and maybe one Spark application processing all that data. As you add more and more into this same sort of system, you can follow the same idea of looking at everything as a span across time: from your client method request to the server, from the server to the Kafka topic, from the Kafka topic to a Spark application — or many Spark applications, like we looked at before — and then finally into a source-of-truth store, like Hadoop, or a Delta Lake running on Hadoop in parquet format. This gives you a reliable conduit to keep expanding into a more complicated data lineage pipeline, or an entire data lineage service at your company, all while working with something as simple as what we're looking at right now. So we're going to take a look now at a simple recap.

So, our recap. We took a look at protobuf; just to recap, protobuf is a language-agnostic way of structuring data by creating common message formats. It has compile-time guarantees, which means that once it's been compiled, it can't be mutated for that compiled version. It also brings lightning-fast serialization and deserialization to the table, which is super useful for sending data across the wire from your gRPC client to your gRPC server, and then from the gRPC server into Kafka as a kind of binary message proxy. gRPC, as I said before, is also language agnostic, so it follows the same kind of semantic properties that protobuf does: you create a common format and compile it down to your language of choice. It's also super low latency, given that it's running binary data over HTTP/2, and it also has compile-time guarantees, so your API won't change for the version you've compiled — which falls in line with what protobuf offers. So: compile-time guarantees on your message format, compile-time guarantees on your server interop layer, plus it's a solid framework with a lot of momentum. We also looked at Kafka. I didn't go too deep into Kafka, because I guarantee everybody at Spark Summit knows what Kafka is, but as a super quick recap: it's highly available, and it has a native Spark connector, so there's no effort needed to create a connector for either producing to or consuming from Kafka. And you can use it to pass records to one or more downstream services, which is really great if you want to test different things, such as A/B testing downstream consumers.
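To make the protobuf and gRPC recap concrete, here is an illustrative message-and-service definition of the kind being described; all names here are hypothetical:

```protobuf
// Illustrative only: a common message format plus a gRPC service,
// compiled down to each team's language of choice.
syntax = "proto3";

package adtech;

message AdImpression {
  string ad_id   = 1;
  string user_id = 2;
  int64  ts      = 3;
}

message TrackAck {
  bool ok = 1;
}

service AdTracker {
  // Unary RPC: the client sends one impression, the server acks it
  // and proxies the binary message on to Kafka.
  rpc Track (AdImpression) returns (TrackAck);
}
```

Because both the message and the service are compiled artifacts, the client, the server, and the Spark consumer all share one versioned contract.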
Last, we took a look at Structured Streaming. It's been out for a long time now, and it's a nice way of reliably handling data en masse. The nice thing is that you can do protobuf-to-Dataset and DataFrame conversions natively by using ScalaPB, a library of your choice, or one you write yourself. It's also really nice because everything can go end-to-end protobuf, and from there, protobuf to DataFrame to parquet is also native in Spark. So it allows you to take everything you've worked hard on and reliably store it, so that you don't have a corrupt data lake or data warehouse for future processing of your valuable data assets.

So that is it. Thank you so much for coming out to Spark Summit this year. I'm really excited that you got to learn a little more today about gRPC, protobuf, Kafka, and Structured Streaming, and to find out a little about the things my team has worked on over the past four years building a pipeline similar to the one I just showed today.

About Scott Haines


Scott Haines is a full stack engineer with a current focus on real-time analytics and intelligence systems. He works at Twilio as a Senior Principal Software Engineer on the Voice Insights team, where he helped drive Spark adoption and streaming pipeline architectures, and helped architect and build out a massive stream and batch processing platform.

In addition to his role on the Voice Insights team, he is also one of the Software Architects for the Machine Learning platform at Twilio where he is helping to shape the future of experimentation, model training and secure integration.

Scott currently runs the company-wide Spark Office Hours, where he provides guidance, tutorials, workshops, and hands-on training to engineers and teams across Twilio. Prior to Twilio, he wrote the backend Java APIs for Yahoo Games, as well as the real-time game ranking/ratings engine (built on Storm) that provided personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo working for Flurry Analytics, where he wrote the alerts/notifications system for mobile.