Overview of end-to-end lifecycle to productize and commercialize alternative datasets at S&P Global Market Intelligence
Jay Bhankharia: Hi everyone. My name is Jay Bhankharia, product manager lead for our marketplace platforms at S&P Global Market Intelligence, alongside my colleague Srinivasa Podugu, who leads our marketplace technology teams. Today, we're going to be presenting to you on how S&P Global Market Intelligence commercializes alternative data. The topics we'll cover today are really around how S&P Market Intelligence thinks about, creates, and brings alternative data to life. We're going to walk through the strategy of alternative data and how we productize it. And then lastly, how we think about go-to-market and commercializing that data.
So before we begin, what is alternative data? Let's just level set here on this term that's frequently used in the market and is very loosely defined. First and foremost, when S&P Global thinks about alternative data, we really think about it from the perspective of data that may be underused or repurposed for different things, and that has a lot of adjacent uses. The list here goes on and on of the different types of alternative data that are out there, and we see this list growing weekly and monthly on a regular basis. One interesting thing to note about alternative data, similar to the saying "one man's trash is another man's treasure," is that we definitely see real ways of repurposing this data and making it useful for a lot of different use cases.
And as we think about alternative data, there are a lot of big trends as well as challenges that have grown this space so tremendously over the last few years. Probably the biggest driver of adoption within alternative data is the underlying technology and capabilities that make this data more accessible and actionable than ever before. Ten years ago, what may have been on someone's wishlist of data to analyze has now become readily available, given the amount of technology, compute, and storage that makes it more accessible than ever before. When we think about data storage, the capacity required for some of these larger datasets has scaled, and the cost-effectiveness has improved, so much that the data is far more accessible and palatable. Hand in glove with that is the processing power. To be able to analyze these terabytes of data and find meaningful insight in them, you really need processing power that can handle that type of workload, but that is also cost-effective in doing so.
And the third area from a trend perspective is just the amount of data availability that we now have. Businesses are putting out data exhaust, companies are finding new derived analytics, and we're seeing much more meaningful and insightful data come online due to the highly connected world we live in. But with these three trends and advances in alternative data, there are plenty of challenges that go along with them. First and foremost, the major thing we always hear from our clients is just how messy this data can be. Clients have a lot of trouble just cleaning, standardizing, and structuring this data to make it even meaningful. The second part is really just understanding if there's any value in these new, unique datasets. Given they are so novel and as yet unproven, clients are a little hesitant, unsure whether it's just noise or whether there's actually a signal they can find within this information. And the third thing, which goes hand in glove with data availability, is just a lack of clarity about what exists, given so many new datasets are popping up left and right.
Understanding what's out there, what's meaningful, and having a clear view of how that can relate to your business is definitely a challenge we hear a lot from our clients. Just to hone in on some really key insights we've heard through our conversations around alternative data: clients spend up to 10X the purchase price of a dataset on costs incurred after the purchase. And this really ties back to challenge one. Once a client gets a dataset, their data scientists and data analysts are spending a lot of resources just scrubbing, standardizing, linking, and storing that data. Not only is it a resource constraint, it's a time sink as well. And this is one of the biggest pain points we hear, which we really look to solve and make key to our underlying value propositions here. So when we think about our alternative data strategy with that in mind, there are really three key pillars to how we think about the datasets we look to bring to market.
First and foremost, we need to make sure the data has proven value and that there is some meaningful insight in it. At S&P Global Market Intelligence, we have a dedicated team of quantitative analysts, data scientists, and researchers spending a lot of time looking at use cases, back-testing the data, and ensuring that insight can be derived from these types of data. We really want to take a stance of quality over quantity when bringing datasets to market, testing them thoroughly and robustly to ensure there is signal and not just noise. The second point, which really speaks to that 10X slide, is: once we do find a dataset that has some meaningful insights, the real goal is, how can we take the 80/20 rule and flip it on its head? Meaning, if our clients are spending 80% of their time structuring, linking, and scrubbing the data and 20% analyzing it, we want to switch that so they spend only 20% of their time doing some of the heavy lifting and 80% analyzing.
And with that in mind, we really like to take these complex datasets, make them point-in-time, link them to all of our identifiers, and structure and clean them, so it's much more systematic and easy for our clients to ingest them and then begin their analysis. We really want to be a force multiplier here as it relates to our ability to make the data as usable as possible. And the third thing is, we understand the sophistication level of clients spans a spectrum. With that in mind, we want to make sure we build datasets that meet our clients at their level of sophistication and their desire. That might be more sophisticated clients who just want raw data with a little bit of structuring and linking, or, at the other end of the spectrum, clients who just want a derived analytic or a signal they can put straight into their model, or use as is, without having to do anything on their side.
And so we're mindful that when we build our alternative data products, we really try to address both ends of the spectrum, so that we have a holistic data product that any of our clients can use and find insights from. A great example of our alternative data strategy at work is our textual data suite. Our textual data suite comes from a variety of unstructured textual offerings, which we look to clean, structure, and link, to provide those derived analytics that make the data more usable. If we think about the world today, 80% of today's data is unstructured, and much of that is actually in text form. So a lot of our clients are really looking to layer on artificial intelligence and various machine learning techniques, along with NLP, to find value and derive insight from these types of datasets. We felt we could provide significant value in helping our clients address the pain points of linking, structuring, and finding analytics within these datasets through our suite of offerings here.
And here we just list a few of those, which we'll dive into in a little more detail on the next few slides. So the first example of a really powerful alternative dataset that we've recently brought to market is our machine-readable filings. Essentially, our machine-readable filings take annual filings across a wide variety of countries and companies and make them machine readable. What we do here is take away the menial task, or the heavy lifting, of having to source all of these documents, reformat them, parse them, and structure them to make them more usable for our clients. In addition, not only do we do this for current filings, but we've done this historically across time as well. So now our clients have the back-testing data that our investment management clients so readily love to use. Making this data systematically ingestible makes it a lot easier for our clients to do the various types of analytics they want to, such as any type of NLP, or aggregating this data across various regions, sectors, or industries to find meaningful insights.
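To make the "parse and structure" step concrete, here is a minimal, hypothetical sketch of turning one plain-text filing into machine-readable sections. The "Item N." pattern and the sample filing text are illustrative assumptions, not S&P's actual pipeline or schema.

```python
import re

# Hypothetical sketch: split a raw 10-K-style filing into its "Item" sections
# so each section can be analyzed (e.g. with NLP) independently.
ITEM_RE = re.compile(r"^(Item\s+\d+[A-Z]?\.)\s*(.*)$", re.MULTILINE)

def parse_filing(text: str) -> dict:
    """Return {section header: section body} for a plain-text filing."""
    matches = list(ITEM_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        header = f"{m.group(1)} {m.group(2)}".strip()
        sections[header] = text[start:end].strip()
    return sections

raw = """Item 1. Business
We design and sell electric vehicles.
Item 1A. Risk Factors
Supply chain disruption in China could affect deliveries.
"""
sections = parse_filing(raw)
```

Once every filing is reduced to a consistent `{section: text}` shape and linked to a company identifier, the downstream aggregation across regions, sectors, and years becomes a straightforward query.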
And so if we think about machine-readable filings as providing that structured dataset in its raw form with some linking, the other end of the spectrum here, moving up the value curve, is our textual data analytics suite, which provides sentiment scores and behavioral metrics derived from earnings call transcripts. With this data product, we take earnings call transcripts and actually create derived analytics on top of them across a variety of different dimensions, including positive and negative sentiment, analyst and caller engagement, language complexity, and the number of words in a sentence. So with this product, we've taken the raw earnings calls and structured them, but we've also layered analytics on top that come prepackaged with our data feeds. So now, if clients aren't interested in doing some of that NLP analysis on their side, we've actually done it for them.
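A toy sketch of this kind of sentence-level sentiment scoring follows. The tiny word lists and the sample sentences are made up for illustration; a production pipeline would use a finance-tuned NLP model or lexicon, not this placeholder.

```python
# Illustrative only: score each transcript sentence, then track the
# cumulative sentiment through the call to spot a negative turn.
POSITIVE = {"great", "strong", "growth", "record", "exciting"}
NEGATIVE = {"boring", "frustrated", "decline", "boneheaded", "risk"}

def sentence_sentiment(sentence: str) -> int:
    """+1 per positive word, -1 per negative word in one sentence."""
    words = sentence.lower().strip(".!?").split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def running_sentiment(sentences) -> list:
    """Cumulative sentiment after each sentence of the call."""
    total, path = 0, []
    for s in sentences:
        total += sentence_sentiment(s)
        path.append(total)
    return path

call = [
    "We delivered record growth this quarter.",
    "Next question please, boring questions are not cool.",
    "These boneheaded questions add risk to the call.",
]
path = running_sentiment(call)  # downward path = call turning negative
```

The feed-delivered analytics play the role of `path` here: a quantified signal the client can drop into a model without writing any NLP code themselves.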
And a great example of how this can be used, and how clients have seen value, is Tesla's earnings call from 2018, where they did a public Q&A. When they started this on YouTube, we saw that some of the questions didn't elicit the best responses from Elon Musk, and he got a little frustrated with them. And we can see that the sentiment of the call started going negative pretty quickly. The insight we could derive from this was that Tesla stock fell in after-hours trading by almost 5%. So what we've been able to do here, for clients that may not want to build this on their own, is provide these derived analytics where they can now quantify sentiment, which can then help power some of their internal models or other types of benchmarks and aggregates. And so now that we understand a little bit of the data strategy, let me turn it over to Srini to talk about how we productize these alternative datasets and bring them to life.
Srinivasa Podugu: Thanks, Jay. My name is Srinivasa Podugu. I head marketplace technology at S&P Global. Prior to that, I worked for JP Morgan, where I led credit risk technology. Today, I'm going to talk about how we productize alternative data, and I'll cover that in three parts: first the principles, then how we do it, and then the benefits we have gained by implementing this kind of design pattern. First, the principles. First and foremost, reduce time to market. As Jay has explained in the data trends and challenges, alternative data is very messy. So it requires a lot of analysis upfront to implement robust pipelines, derive the value out of the data, and then produce it in the various products we sell to our customers. That [inaudible] time, end to end, we want to reduce. And not only that, we want to increase the throughput.
There are so many data vendors out there, so many datasets out there, that we want to quickly identify the value in each of those datasets, and then, wherever we find value, we want to deliver those things faster. Next, we already provide a lot of data products to our customers. When we bring in alternative data, we want to provide a consistent user experience, the same experience customers have with our other data products. And as alternative data comes into our side, we don't just take the data as is; we look to enhance its value, and we also want to create derivative products out of this data as well. These are the fundamental principles we go by while productizing these datasets. So let me walk you through how we productize this alternative data. We bring in the data from the alternative data vendors. The data comes in various different formats, depending on how the vendors collect and structure it, and the frequency at which they send the data to us varies.
And the sizes, the volumes of the data, also vary. So we bring in all the data, and we needed to find a suitable technology to land this data. We are bringing the data into an analytics platform, primarily the Databricks environment. We land the data into [inaudible] and quickly make it available to our product specialists and research specialists, where they can analyze this data further, find the value, identify nuances, and probably come up with some kind of business rules, and then hand off to the product teams and the developers to proceed with the next steps of the data enhancement. That's where the developers come into the picture: they implement the pipeline and bring the data into a transformed state, where it is much more structured and much better formatted. That data lands in the curated zone, and from the curated zone we create aggregations and do any kind of change capture.
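A minimal sketch of such change capture might look like the following. The record keys and field names are hypothetical; the point is only the pattern of maintaining one snapshot and computing deltas against it, however the vendor delivers.

```python
# Hedged sketch: whatever delivery pattern a vendor uses (full file every
# time, or incrementals), maintain one golden snapshot and compute a
# consistent (snapshot, delta) pair for downstream consumers.

def apply_delivery(snapshot: dict, delivery: dict, is_full: bool):
    """Merge a vendor delivery into the snapshot and compute the delta.

    snapshot / delivery map a record key -> record payload.
    Returns (new_snapshot, delta); delta holds only new or changed records.
    """
    delta = {k: v for k, v in delivery.items() if snapshot.get(k) != v}
    if is_full:
        # Full refresh: the new file replaces the snapshot wholesale.
        return dict(delivery), delta
    # Incremental: vendor sends only changes; overlay them on the snapshot.
    return {**snapshot, **delta}, delta

golden = {"AAPL": {"score": 0.4}, "TSLA": {"score": 0.1}}
# A vendor that always ships full files: only TSLA actually changed.
golden, delta = apply_delivery(
    golden, {"AAPL": {"score": 0.4}, "TSLA": {"score": -0.2}}, is_full=True
)
```

Because every vendor pattern is normalized into the same snapshot-plus-delta shape, consumers downstream can be offered one full file and incrementals regardless of the source.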
Some vendors give a full dataset every time; some vendors give a full dataset once and then send incrementals; some vendors give the full history at one time, then the current year's worth of data, and then maybe incrementals from [inaudible] point of time. All these nuances we handle internally, and customer-experience-wise, we deliver one full file plus incrementals. So that experience is going to be consistent irrespective of how vendors give the data to us. That is how we land the data into the consumer zone. And once the data is available on this platform, it becomes the golden copy of the data. On top of this data, our data scientists can come in, implement any kind of analytical models, and create derived products. One example that Jay has already talked about, textual data analytics, is built on top of the transcripts. And then we get into our product space. Our historical flagship product is the feed product.
The feed product is our primary channel: we deliver the alternative data through feeds, we also deliver the data through APIs, and we also provide an analytical platform for our customers, Workbench. Jay's going to talk about this one a little more in the next segment, along with the S&P Global Marketplace and how clients use these products. Clients come in and discover the datasets on the S&P Global Marketplace. Then they want to get into research and evaluation, and that's where Workbench is going to be useful. And once they're happy with the data they're looking at, they can actually start taking the data into their environment through the feeds or the APIs. And as we built this platform, we integrated it well with our internal tools, for example, single sign-on. Single sign-on is a tool to which the whole platform is integrated. Clients who log into the S&P Global Marketplace can, with single sign-on, also log into Workbench. And the Workbench itself is private to each client.
So how do we do that? The moment we onboard clients to our internal tools, we spin up a separate Workbench for each client. So these are all well connected in that respect. And with this design pattern, what have we achieved? Even though the data is messy and all the quantitative research has to be done end to end, we were able to deliver 10 alternative datasets in four months. And we built robust data pipelines; we were able to solve any kind of nuance that comes up in the alternative data. We provided one golden copy, and that one golden copy feeds all our distribution channels, so we are able to provide a consistent user experience: whether clients get the feeds, the APIs, or the workflow in Workbench, the data is going to be consistent.
And then we accelerated dataset evaluation through our Workbench. Workbench is our unique offering; as I said, Jay will talk more about it. The data becomes available in Workbench right after it lands in our golden copy. And finally, with this unified data analytics platform powered by Databricks, we are able to achieve all of this using this design pattern, and we gain the advantage of bringing S&P's capabilities together with Databricks' capabilities; we benefit from both ends, the advancements happening at Databricks, and the capabilities and analytical data science algorithms we are building on our side. So it's of great value, and we are able to deliver that unified analytics data platform here. Thank you, Databricks. And I would like to pass it back to Jay.
Jay Bhankharia: Great. Thank you, Srini. So after we productize the data, the next step in the process is: how do we commercialize all of this alternative data? At Market Intelligence, with our Marketplace platform, our go-to-market strategy is really focused on reducing time to value. What does that mean? We know our clients are spread thin, and we need to ensure we can quickly and easily help them discover, understand, and demonstrate the value of our alternative datasets. And we take a three-pronged approach to this. One, how do we make exploration easier? Two, how do we provide the thought leadership to really guide our clients and help them understand the value of a dataset? And three, how do we provide them the tools and the platform so they can explore this data in as frictionless a way as possible? So why don't we dive into these three in a little more detail?
So first and foremost, around easy exploration and thought leadership. As discussed before, the sheer amount of data availability is one of the biggest upstream challenges for our clients in terms of just being able to sift through what's out there and what's potentially meaningful. To help solve this problem, what we've done at Market Intelligence is create a storefront catalog called Marketplace, really designed to help with this challenge. The design principles behind Marketplace are to be transparent, easy, and intuitive to use. The goal is to provide clients a robust catalog of all the datasets we offer in a really simple and interactive way, where we not only provide a brief description of the data but go one layer deeper, with a lot of the metadata statistics around the history of the dataset, number of data items, coverage, et cetera, while also providing quick snippets of sample data and the data dictionary upfront to really help guide their process.
The website is very search-driven and intuitive. As clients search for various themes, whether that be ESG, supply chain, or text, they can quickly and easily find all of the datasets relevant to those themes on our storefront. In addition to the storefront, another key way to help reduce that time to value is the robust thought leadership S&P produces around these various topics. We do that in a number of ways. First and foremost, we have a large group of internal researchers, quantitative analysts, and data scientists who produce a lot of white papers, proof statements, and case studies across the different datasets we have. A great example is how we leveraged the geospatial data on our platform to track foot traffic over the last year throughout the pandemic. Providing these statements really helps our clients, one, understand the dataset and how it could potentially be used, and two, spark thoughts about other ways that type of analysis could apply to their specific workflows.
In addition to the white papers, we take that a step further and host a multitude of different webinars and thought leadership forums. We try to make these as timely as possible; the one you see here is around COVID-19 and the pandemic, and some of the ways alternative data could be used to look at insights that have helped drive economic factors during this time period. But we put on a lot of research, everything from ESG to supply chain, and try to bring industry participants together to, once again, show the value of the data, but also provide a thought leadership overlay on top of it, to make sure it makes sense to the client and they can understand the different use cases and workflows for that data. Last, and certainly not least, the newest offering, as Srini mentioned, is a new product called Marketplace Workbench.
This platform is essentially a sandbox environment hosted on top of a configured Databricks environment, preloaded with S&P data along with a library of pre-built notebooks, to help our clients gain an understanding of the data and demonstrate its value in a very easy way. One of the pain points we've heard from clients is the steep learning curve around alternative data. So how are we able to flatten that learning curve? By providing them this library of different notebooks that demonstrate the value of the data, along with the code that underpins it. The second big pain point we see is that clients sometimes deal with the challenges of just bringing data in-house, whether that's the red tape of InfoSec reviews, legal reviews, or administrative burdens. By providing them access to a single sign-on, web-based platform, we make that process much more frictionless as well. And we've seen a lot of traction and excitement over this platform.
Once again, when clients can start looking at these pre-built notebooks, it really helps them understand the value of the dataset right away. And now they're able to take those notebooks, clone them or create their own, and begin analyzing the data on their own very quickly and seamlessly. Another really great benefit Workbench provides our clients is the collaboration tools it offers. A lot of our clients don't like working in silos anymore, and data teams in general don't like working in silos. So the ability to have real-time co-authoring, co-editing, and commenting functionality makes it easier for various stakeholders across an organization to review the datasets and come to a decision much more quickly and efficiently. It's also populated with a majority of our S&P Market Intelligence data. So clients don't need to ingest multiple feeds or APIs; they can come to one holistic interface where all of our data is preloaded for them to access together.
And then lastly, part of the Databricks platform is the multi-language support it offers. So for data teams across the spectrum, whether you're a data analyst fluent in just SQL or a data scientist that uses Python, they can all work in these notebooks efficiently and seamlessly. As we think about reducing time to value, this Workbench platform really does bring the data, the code, and the tools all together for a really seamless experience for our clients. Looking at the notebooks here, a great example of how we've been able to bring some of the data to life, with material client value, is that we took our machine-readable filings and worked with our data science teams internally to build out a topic-modeling machine learning model. And we were able to see the changes over time in the topics discussed in automotive-sector filings. It was interesting to see how few and far between the mentions of China were, call it 10 or 15 years ago, and how that has changed over time.
And similarly, even more relevant, is how little talk there was in the filings of electric vehicles and how much that's changed over time. We're able to visualize that within the platform and provide the underlying code, which really demonstrates the value of the dataset. A client can now take that, change the sector, the topic mapping, and other factors, and really play with the data, demonstrate its value, and understand it more easily. So in summary, we think the alternative data space is growing by leaps and bounds. There still are a lot of challenges, but there are a lot of unique insights in this data as well. We're really excited for the opportunity, within the S&P Global Marketplace and with Workbench as well, to really bring this data to life, demonstrate its value, and hopefully have more of our clients use it and gain insights from it. And with that, I'll turn it over for some Q&A.
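The filings trend analysis described above, rising mentions of themes like China or electric vehicles over time, can be reduced to a very small sketch. The mini-corpus and topic terms below are made up for illustration; the actual notebooks use a full topic model over the machine-readable filings, not simple word counts.

```python
from collections import Counter

# Illustrative only: count topic-term mentions per filing year to surface
# themes rising over time in a (tiny, invented) automotive filing corpus.
filings_by_year = {
    2008: "Our sedans sell well in North America and Europe.",
    2015: "We opened a plant in China and announced an electric concept car.",
    2021: "Electric vehicle demand in China drives our electric strategy.",
}

def topic_trend(filings: dict, topic_terms: set) -> dict:
    """Total mentions of any topic term per year (lowercased word match)."""
    trend = {}
    for year, text in sorted(filings.items()):
        words = Counter(text.lower().strip(".").split())
        trend[year] = sum(words[t] for t in topic_terms)
    return trend

ev_trend = topic_trend(filings_by_year, {"electric"})
```

Swapping in a different term set, or a different sector's filings, is exactly the kind of one-line change that lets a client replay the analysis for their own themes.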
Jay Bhankharia is the Head of Marketplace Platforms at S&P Global. Prior to this role, Jay has had roles across S&P Global in corporate strategy and development, business development, and GTM strategy...
Srinivasa Podugu is the Head of Marketplace Platforms Technology at S&P Global. Prior to this role, Srini led technology for credit analytics and ratings products at S&P Global. Srini led Credit Risk ...