RDF, knowledge graphs, and ontologies enable companies to produce and consume graph data that is interoperable, sharable, and self-describing. GSK has set out to build the world’s largest medical knowledge graph to give our scientists access to the world’s medical knowledge and to enable machine learning to infer links between facts.
These inferred links are at the heart of gene-to-disease mapping and are the future of discovering new treatments and vaccines. To power RDF sub-graphing, GSK has developed a set of open-source libraries codenamed “Project Bellman” that enable Sparql queries over partitioned RDF data in Apache Spark.
These tools scale Sparql querying up to trillions of RDF triples, support point-in-time queries, and deliver incremental data updates to downstream consumer applications. They are used both by GSK’s AI/ML team to discover gene-to-disease mappings and by GSK’s scientists to query the world’s medical knowledge.
John Hunter: I’m John Hunter and I’m the product director of AI operations at GSK. My product is called Knowledge Graph, and it’s a highly connected RDF graph database that houses all of the world’s scientific and medical knowledge extracted from documentation within most of the world’s medical and scientific knowledge bases. What I’d like to introduce is our approach to Knowledge Graph at GSK. We’ll start with a bit of an intro of drug R and D. From there, we’ll go to why a knowledge graph, and we’ll also review the way we are storing knowledge graph data at GSK, how we’re querying our data and an introduction to an open source library that we’ve built in partnership with 47Degrees called Bellman, which is a Sparql library, and I’ll walk through a quick demo of how that works and a little bit about the future, where we’re headed within GSK with regards to Knowledge Graph.
It’s good to have a little bit of a background of drug and vaccine R and D. It takes about 12 to 15 years to bring a drug or a vaccine to market, and there are different phases in this drug pipeline. We’re going to focus on the research stage of this talk because that’s where I work. If you can imagine a timeline moving from left to right, all of these different phases from research, preclinical, clinical trials are different phases in the research and development of medicines and vaccines. The focus of my work is on the research part, which is generally about a two to three year process of scientists finding new medicines and new approaches that can possibly be a target for new drugs and new treatments. If you could imagine that if we could shave off a year or two off of each of these phases, we can bring new lifesaving drugs to market much quicker.
Mind you, this is under normal circumstances. Also of note, it’s interesting, 5% of drugs that actually make it into human trials, which is this third piece here, about 5% of them make it through to the actual government agency review. It’s a very high risk, very costly process. Any time savings we can have along the way will greatly benefit both the company and humanity.
There are typical R and D use cases for data, and the way we can accelerate each one of these stages in the drug pipeline is to work with data at scale. There are different use cases that we have. The first is discover, which is mostly individual scientists querying for data and looking for new insights and new information on diseases and treatments so that they can come up with a new drug target or a novel approach to treating a disease. The encode and predict use cases are more along the lines of the AI/ML use case. In the encode use case, machines create representations that they can understand at scale. We’re talking billions of data points that are encoded for machine consumption. The predict use case is about inferring links in the data that don’t exist yet, but may or should exist. It’s more of a machine learning or inferencing type of use case.
There’s lots of different data sources that scientists work with. Most of them are structured data sources and unstructured data sources. The structured data sources are generally sources like Wikidata, some other NIH funded programs that store different types of data from genes to diseases to proteins, and then unstructured data, and those are just white papers in repositories, such as Elsevier and PubMed. Data scientists and traditional scientists are in this dilemma of how do they unify and manage all of the silos of data. What we’ve found is that a knowledge graph really fits this use case well, where we can connect data and federate it and integrate it in such a way that is both accessible by humans and by machines alike.
We went through the problem statement and a little bit of why a knowledge graph, but let’s go into some more detail around why exactly the knowledge graph is a good fit, and how it connects disparate data sources together. The atomic unit in a knowledge graph is the RDF triple, which is two pieces of data connected by a predicate. Where the knowledge piece comes in is that the connection itself has encoded meaning. This meaning can be used by humans and machines alike, not only to connect facts and entities in the system, but also to ascribe meaning to those connections. In a traditional database these connections have an implicit meaning, but in a knowledge graph all connections are explicit. Not only are they explicit, they’re required and expected. Joins in a relational database are expensive. Federating data is expensive in a traditional database. In a knowledge graph, it’s the norm. We’re expected to bring disparate data sources together.
As you can see in this simple diagram, we can bring in sources of data from disease databases, gene knowledge graphs, compounds, and pharmaceutical classes, and connect them all together into one cohesive unit that can be queried and joined pretty easily. A natural question is why a Knowledge Graph built on Spark? The answer is that GSK has done a lot of research on the different graph data stores that are out there, and they’ve all failed our requirements one way or another. Either data ingress or data egress failed, or some of the data stores fell short on the availability side of the CAP theorem, where the availability just wasn’t there for us. We wanted to build a highly available system, and there’s also the sheer size of our Knowledge Graph. We currently have about 500 billion connections and they’re growing daily as we get incremental data from some of these data extractions, specifically from the unstructured data, where we’re using NLP to extract facts out of scientific literature. This is a big contributor to our data growth. High read/write throughput we covered, because that’s the ML and AI use case, where we’re reading and writing lots and lots of data either to train a model or to write encoded, machine-readable data back into the Knowledge Graph. Of course, Spark is a mature tech and has a fantastic OSS community.
It might be worth talking a little bit about Knowledge Graph data. RDF is the foundation of most Knowledge Graphs and it’s the foundation of the Knowledge Graph that GSK has built. It stands for Resource Description Framework. It’s a standard for describing linked data on the web, and it uses URIs as identifiers, which is a really effective way to uniquely identify data within our system. Many times the data is linked to an actual resource on the web, so if you paste a URI from a knowledge graph into a web browser, you will often get information back. It’s not required, but a lot of knowledge graph data vendors do that, and some of the open source data sets that are out there link to an actual webpage. Wikidata is a famous example of that.
RDF comes in different serialization formats, and we support the N-Triples format, which is a really simple flat structure that works really well in Apache Spark. It’s a splittable format. Other formats, such as RDF/XML and Turtle, are non-splittable. We chose N-Triples because the files are splittable and can easily be ingested into Spark and processed in a highly parallelized way. N-Triples fits really nicely into tables in Spark. We’ve got our subject, predicate, object triples represented by three simple columns, and also included is an optional graph position, which indicates, at least in our knowledge graph at GSK, where that data came from. Data lineage is encoded right into each and every triple in the graph.
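As a rough illustration of why a line-oriented format is easy to parallelize, here is a minimal Python sketch that turns each line into one row of subject, predicate, object, and optional graph (a line with the fourth term is what the W3C calls N-Quads). The regular expression is deliberately simplified and the example.org URIs are made up for illustration. It is not a full parser and not how GSK ingests data.

```python
import re

# One statement per line: subject, predicate, object, optional graph, then " ."
# Simplified term pattern: IRIs in <...>, blank nodes, and quoted literals
# with an optional datatype or language tag.
TERM = r'(<[^>]*>|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z-]+)?|_:\S+)'
LINE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.\s*$')

def parse_line(line):
    """Return an (s, p, o, g) tuple; g is None when no graph is given."""
    found = LINE.match(line)
    if found is None:
        raise ValueError(f"not a valid statement: {line!r}")
    return found.groups()

# Hypothetical data. Each line is parsed independently of every other line,
# which is the property that lets a file be split across workers.
lines = [
    '<http://example.org/gene/G1> <http://example.org/associatedWith> '
    '<http://example.org/disease/D1> <http://example.org/graph/source-a> .',
    '<http://example.org/disease/D1> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "pancreatic cancer"@en .',
]
rows = [parse_line(line) for line in lines]
```

Because no line depends on its neighbors, the list comprehension at the end could be replaced by any parallel map without changing the result.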
What’s great about that is that we can partition on those graphs so that we can prune the amount of data that we’re querying. If an end user is interested in only a part of the data, we can target that specific graph and load only that data, speeding up queries. Then as a secondary partition, we can partition on the P column, which is a lower-cardinality column, and that really speeds up queries, especially those that specify the P position. There are lots of different ways to represent the data in the knowledge graph on disk. The data structure we chose is just one gigantic SPO table. It’s very easy to manage. With Spark’s partitioning, we get the best of both worlds. We get a single table, which is easy to deal with, and we also get data partitioning in the form of graph and predicate partitioning.
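The pruning idea can be sketched in plain Python, with a dictionary keyed by (graph, predicate) standing in for partition directories on disk. The function names and the prefixed identifiers are hypothetical. In Spark the same effect would come from partitioned storage and filter pushdown, not from code like this.

```python
from collections import defaultdict

def partition(rows):
    """Group (s, p, o, g) rows by graph first, then by predicate."""
    parts = defaultdict(list)
    for s, p, o, g in rows:
        parts[(g, p)].append((s, p, o, g))
    return parts

def scan(parts, graph=None, predicate=None):
    """Read only the partitions whose key matches; return (rows, partitions_read)."""
    selected = [
        key for key in parts
        if (graph is None or key[0] == graph)
        and (predicate is None or key[1] == predicate)
    ]
    matched = [row for key in selected for row in parts[key]]
    return matched, len(selected)

# Hypothetical rows spread over two source graphs.
rows = [
    ("gene:G1", "ex:associatedWith", "disease:D1", "graph:a"),
    ("gene:G2", "ex:associatedWith", "disease:D1", "graph:b"),
    ("gene:G1", "rdfs:label", "G1", "graph:a"),
    ("drug:X", "ex:treats", "disease:D1", "graph:b"),
]
parts = partition(rows)
hits, partitions_read = scan(parts, graph="graph:a", predicate="ex:associatedWith")
```

A query that names both the graph and the predicate touches one partition out of four here, and a query that names only the graph touches two.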
There are also some other requirements that we were looking at and would like to have. One is ACID transactions: atomic writes, so if a write fails in the middle, we want to be able to roll back and start the write again without having to deal with a data spill, and isolated reads, so if a user is reading from the data store in the middle of a write, they’ll get consistent results. Another is dedup. Knowledge graphs tend to have lots of duplication of triples. It’s expected not only between data sets, but also within data sets. It’s just the nature of the way RDF is queried and written. Incremental queries are an additional requirement. There are data abstractions that can handle this for us, such as Apache Hudi and Delta Lake. We’ve looked into both. We’re going with Apache Hudi because of compatibility with our on-prem systems.
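To make the dedup, point-in-time, and incremental requirements concrete, here is a toy Python model in which the store is a list of timestamped commits. This only illustrates the semantics being asked for. It says nothing about how Apache Hudi or Delta Lake implement them, and the function names and data are assumptions.

```python
def as_of(commits, instant):
    """Point-in-time view: distinct triples from every commit at or before `instant`."""
    seen = set()  # a set gives deduplication for free
    for commit_time, triples in commits:
        if commit_time <= instant:
            seen.update(triples)
    return seen

def incremental(commits, since):
    """Triples that first appear after `since`, which is what a consumer has not yet seen."""
    return as_of(commits, float("inf")) - as_of(commits, since)

# Hypothetical commit log with duplicates inside and across commits.
commits = [
    (1, [("gene:G1", "ex:associatedWith", "disease:D1"),
         ("gene:G1", "ex:associatedWith", "disease:D1")]),
    (2, [("gene:G1", "ex:associatedWith", "disease:D1"),
         ("drug:X", "ex:treats", "disease:D1")]),
]
snapshot = as_of(commits, 1)
delta = incremental(commits, 1)
```

A downstream consumer that last synchronized at commit 1 would only need `delta`, not the whole table.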
Querying the knowledge graph. Sparql is the de facto standard, and it’s a very simple language to learn, which is why we have chosen it as the interface to our knowledge graph. It’s a very SQL-like language. Many of our data scientists and biologists know SQL, so learning Sparql is very easy. It also provides some really nice language features for creating sub-graphs, which is a large part of what we do with our knowledge graph. If you imagine a knowledge graph of 500 billion triples, it’s not convenient to query 500 billion triples when you’re just looking for one or two results. What we generally do is create a sub-graph and load that into a lower latency system, like Neo4j or Blazegraph, in order to do faster, more transactional type queries.
One of the pieces that we were missing was how to write Sparql queries and have them execute on Apache Spark. There were no commercial offerings that allowed us to do this. There are some open source offerings, but most don’t run on Spark. We were looking at packages like Apache Jena and Blazegraph, which don’t run on Spark but do provide Sparql querying features, and Sansa Stack, which is a library that allows users to execute Sparql queries on RDF data on Apache Spark. This is the work of Jens Lehmann and his team at the University of Bonn. A lot of our work was inspired by his team’s work and the Sansa Stack framework, so we’ve adopted a lot of the pieces that they’ve built and incorporated them into our own, and we also took care to make sure that our features work with Sansa Stack so that we can interoperate with some of their excellent libraries.
Due to this, because we have this requirement to run Sparql queries on Apache Spark, we decided to roll our own Sparql engine. That Sparql engine is called Bellman. It’s an open source project that’s sponsored by GSK and is developed in partnership with 47Degrees. Here’s the URL to the Bellman repository. Please have a look at it. Please contribute, star, comment, open up bug reports. We’re looking for people to help us grow and expand and improve the library.
The architecture for the Bellman Sparql engine has a bunch of different stages. The first stage is to take in a Sparql query and parse it, turning it into an abstract syntax tree. From there we create an algebraic data type, which is a recursive structure that gets passed over to a compiler, which takes that algebraic data type and turns it into Spark DataFrame and Spark Dataset operations. Made a little bit of a mistake here. These are out of order. Actually, they’re in the right order. Static analysis is where we do some last minute checking to make sure that there are no logical errors in the query that was passed in before it’s optimized. We have some optimizations that we do on Sparql queries, where we compress certain statements down into one statement to reduce the number of scans over the data, because those can be very expensive. Once all of that static analysis and optimization is done, we pass it over to the engine, which executes the code on Apache Spark, and then to a formatter, which formats the output. That gives us control over the type of data that is produced from the engine, so all the different RDF formats like N-Triples or XML.
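The stages described above can be mimicked in a few lines of Python. This is a toy sketch with hypothetical names, not Bellman’s actual code, which is written in Scala and targets Spark. The parser is left out, so the query starts life as a small algebraic data type. The optimizer collapses joined triple patterns into one basic graph pattern, the engine evaluates it over an in-memory list, and a formatter renders the rows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:            # one triple pattern; terms starting with "?" are variables
    s: str
    p: str
    o: str

@dataclass(frozen=True)
class Join:              # join of two sub-expressions
    left: object
    right: object

@dataclass(frozen=True)
class BGP:               # basic graph pattern: patterns evaluated together
    patterns: tuple

def optimize(node):
    """Collapse joins of triple patterns into a single BGP."""
    if isinstance(node, Triple):
        return BGP((node,))
    if isinstance(node, Join):
        left, right = optimize(node.left), optimize(node.right)
        if isinstance(left, BGP) and isinstance(right, BGP):
            return BGP(left.patterns + right.patterns)
        return Join(left, right)
    return node

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip((pattern.s, pattern.p, pattern.o), triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended

def execute(plan, triples):
    """Evaluate a BGP against in-memory triples and return variable bindings."""
    bindings = [{}]
    for pattern in plan.patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

def format_rows(bindings, variables):
    """Formatter stage: one tab-separated line per result row."""
    return ["\t".join(b[v] for v in variables) for b in bindings]

# Hypothetical query: variants that are prognostic for D1, plus their gene.
query = Join(Triple("?variant", "ex:positivePrognosticFor", "disease:D1"),
             Triple("?variant", "ex:variantOf", "?gene"))
triples = [
    ("variant:V1", "ex:positivePrognosticFor", "disease:D1"),
    ("variant:V1", "ex:variantOf", "gene:G1"),
    ("variant:V2", "ex:variantOf", "gene:G2"),
]
plan = optimize(query)
results = execute(plan, triples)
lines = format_rows(results, ["?variant", "?gene"])
```

The sketch still loops over the data once per pattern. The point of the rewrite is only that the two patterns end up in one node that a real engine could plan together.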
We’ve got a bit of a demo set up and we’d like to walk you through it and we will step over there now. We’ve got a Databricks notebook set up here. We’ve got a cluster fired up. We’ve got the Bellman Sparql libraries loaded up as well into the cluster, and we are ready to go. This first section here is just some imports, and we’re initializing the Jena library, which we’re using to do a lot of the heavy work and lifting behind our engine. A lot of the lower level RDF formatting and checking to make sure that RDF is valid. We do a lot of that using the Jena library. It’s a lot of work with heuristics and trial and error when it comes to making sure your RDF data is properly formatted. Jena has done a lot of that for us already, so why not use it.
Command two here is where we’re loading our knowledge graph. What I’ve loaded here is Wikidata. It is the latest [inaudible] Wikidata variant, which is a version of Wikidata. It’s all RDF, but it’s a version of Wikidata where each of the facts within the knowledge graph is above a certain threshold of quality. We want to make sure that the facts within our knowledge graph are at least verified by experts, and that is the latest [inaudible] dataset. We’ve loaded that. I’ve pre-run a lot of these commands because some of them take a little bit of time. This command takes three minutes, so I’ve pre-run it just to show how many triples are in the graph. There are over five billion, which is why it took about two minutes to count them. If we print the schema, we can see the S, P, and O columns. It’s a pretty simple data structure.
Just to show, this is just a normal data frame that we’ve loaded data into. Let’s just see the top 10. We’ve got that here. Now, what we’ve done is we’ve imported the Bellman libraries. What that gives us is this nice Sparql syntax on our data frames, so that we can inline a Sparql query and run it. We can do that. We can see that that ran. We get identical results to the select star limit here. If we go into the actual job itself just to look at the SQL that was generated, Spark read the first 10,000 triples and just returned 10. Just to show you how we’re doing it in the Bellman engine, it’s very similar. We’re reading 10,000 triples. We have an extra project statement here, and then we’re collecting the output. It’s a very similar approach and works very similarly to the Spark SQL counterpart.
Now we can start exploring our knowledge graph, now that we’ve established that it’s been loaded and all is working well. With this first query, we’re just going to find all gene variants within Wikidata that are positive prognostic indicators, and I chose pancreatic cancer. A positive prognostic indicator means a positive outcome for a patient when a certain gene variant is present. This query here just shows how we can query the knowledge graph for that information. We’re querying which variant is a positive prognostic indicator for pancreatic cancer, but we’re also saying don’t only give us the variant. Give us the gene as well and return that in a table. That’s what we see here. We can see the variant. What’s really cool about this is that you can select the actual URI, paste it, and get more information about the gene. It’s a really, really nice feature of RDF and knowledge graphs that plays nicely with scientists and with literature. It’s really a nice way to query and to learn more about what it is you’re querying within your knowledge graph.
That gave us one result out of all of the five billion triples. Let’s just check against Wikidata and see if we get the same result there. Wikidata provides this nice Sparql query service, so let’s run that query. As you can see, one result. Wikidata is formatting their data a little bit differently than the way we’re formatting it, but the actual IDs should match up: 213, 961, 922 and 213, 961, 922. We’re at parity with Wikidata. That proves that our Sparql engine is working correctly. Yeah.
Now that we’ve been able to query one gene variant, next query just gives us all of the gene variants. Let’s see if we could find all of the gene variants that are positive prognostic indicators of a particular disease. When we run that query, we get many more results, getting 96 rows here. From there, we can just start querying other aspects of the knowledge graph. The goal here, what we’re doing is we’re exploring the knowledge graph and figuring out how we could… the goal is to create, what I would like to create, let’s just say search on diseases. Let’s say we want to have the ability to input a disease’s name and get back all kinds of semantic information about that disease name. It’s kind of what I’m driving at here. Doing a little bit of exploration here to see what’s in the graph.
From there, now that we’ve got our positive prognostic indicators, let’s see what gene variants have a positive therapeutic indicator, meaning which presence of which gene variants has a positive effect on a disease when a certain therapy is applied. We can query the graph for that data and get that back. We get quite a bit of data there. It seems to be a lot of drug data within Wikidata. That’s kind of cool. From there, if we want to be able to create a function that takes a disease as an input and provides all kinds of semantic information about that disease as an output, we wanted to make sure that we could, in fact, query all the diseases within the Wikidata knowledge graph and get all the drugs that are used to treat that disease. That’s what this query is doing, giving us back all of those rows.
Now we have a nice set of queries to start putting it all together into this function to allow us to do a disease search. This is how we’ll put it all together. Sparql has a really nice syntax called Construct. What that does is it creates a new knowledge graph from the data that you’re querying in your where clause. What we’re doing here is we are putting all the queries that we had put together above all together, and then taking all of those results and constructing a new graph that is optimized for disease search. If you see here, I’m creating new triples with disease as the subject position, which gives us the ability to query very quickly on diseases. As you can see, this will give us a nice little dataset that we can query diseases on.
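The effect of Construct can be sketched in plain Python: take the variable bindings that a where clause produced and instantiate a template of new triples once per binding. The predicate names and identifiers below are invented for illustration and are not the ones used in the demo.

```python
def construct(template, bindings):
    """Build a new graph by filling the template once per binding.

    A template triple that still contains an unbound variable is skipped,
    and the set removes duplicate triples.
    """
    graph = set()
    for binding in bindings:
        for pattern in template:
            triple = tuple(binding.get(term, term) for term in pattern)
            if not any(term.startswith("?") for term in triple):
                graph.add(triple)
    return graph

# Hypothetical bindings, as if returned by the where clause.
bindings = [
    {"?disease": "disease:D1", "?variant": "variant:V1", "?gene": "gene:G1"},
    {"?disease": "disease:D1", "?variant": "variant:V2", "?gene": "gene:G1"},
]
# Every constructed triple puts the disease in the subject position.
template = [
    ("?disease", "ex:associatedVariant", "?variant"),
    ("?disease", "ex:associatedGene", "?gene"),
]
disease_graph = construct(template, bindings)
```

Two bindings and two template patterns give three triples here, because both rows produce the same disease-to-gene triple.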
One really nice aspect of RDF and Knowledge Graphs is the use of what are called ontologies. Ontologies are specifications for naming things. They’re generally agreed upon by different industries, so there are disease ontologies and there are gene ontologies. There are many different names for individual genes. If we stick to a specific ontology and others stick to that same ontology, we’re all calling things by the same name. It allows us to query data across knowledge graphs, which is one of the really important things when it comes to a knowledge graph. You want your data to be interoperable to knock down those silos of data. We’re using the Disease Ontology and the Simple Knowledge Organization System (SKOS) namespace to create a nice graph for ourselves to query that is standards-based and will interoperate. We can run that query and get back our graph, which is diseases with all of the associated genes, gene variants, treatments, and some other nice links like the Ensembl ID, which points to another gene database that gives us additional information about certain genes.
From there, we’ve got our knowledge graph, we’ve got it in memory, we’ve created that graph data frame. Now let’s create our literature search query. Now we want to create a query that takes a disease name and returns all the information we have about that disease. That’s what this query does here. We’ll call our function literature search. It takes a query as the first parameter and a data frame as a second parameter and outputs a data frame. That output data frame will be the results of searching on a specific disease. What we’ll do is now that we have our literature search function, we’ll search on pancreatic cancer again. That should give us our results and we can get our disease, our disease name, pancreatic cancer, associated genes, associated gene variants. This is just like a really simple demonstration of how we can create these sub-graphs very, very quickly that allow us to create specialized data structures and specialized graphs for a particular purpose. If we search for breast cancer, we get a lot more results. It’s a very, very well-researched cancer and lots of information there, along with the associated gene variant and associated treatments.
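A stripped-down version of such a search function might look like the following. It assumes the disease-centric graph is a plain list of triples and that disease names are stored under an `rdfs:label` predicate. Both are simplifications, since the function in the demo takes a query and a Spark data frame.

```python
def literature_search(disease_name, triples):
    """Return every triple whose subject is a disease labelled `disease_name`."""
    wanted = disease_name.lower()
    diseases = {s for s, p, o in triples
                if p == "rdfs:label" and o.lower() == wanted}
    return sorted(t for t in triples if t[0] in diseases)

# Hypothetical disease-centric graph.
disease_graph = [
    ("disease:D1", "rdfs:label", "pancreatic cancer"),
    ("disease:D1", "ex:associatedGene", "gene:G1"),
    ("disease:D1", "ex:associatedVariant", "variant:V1"),
    ("disease:D2", "rdfs:label", "breast cancer"),
    ("disease:D2", "ex:treatedBy", "drug:X"),
]
results = literature_search("Pancreatic cancer", disease_graph)
```

Because every triple in the graph has the disease as its subject, the search is a label lookup followed by a filter on the subject column.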
That’s the end of the demo, and I hope that gives you a good idea of how we can run Sparql queries in Apache Spark and how we’re providing this service to scientists within GSK. That brings us to present day. We are working on completing the Sparql 1.1 language specification. We’re optimizing our queries to make them run faster. We are optimizing data on disk, organizing it based on the types of queries that scientists are submitting. We’re also working on point-in-time queries and incremental queries, which libraries like Apache Hudi and Databricks Delta give us. Thank you. Here are the links to the open source library. We hope you’ll join us there and star the repository, submit bug reports and pull requests. Thank you and I will see you on GitHub. Thank you. Bye.
John is an engineering leader at GSK with a focus on functional programming and big data.