Discovering new drugs is a lengthy and expensive process. This means that finding new uses for existing drugs can help create new treatments in less time and at lower cost. The difficulty is in finding these potential new uses.
How do we find these undiscovered uses for existing drugs?
We can unify the available structured and unstructured data sets into a knowledge graph. This is done by fusing the structured data sets, and performing named entity extraction on the unstructured data sets. Once this is done, we can use deep learning techniques to predict latent relationships.
In this talk we will cover:
Vishnu Vettrivel: Hello, everyone. We’re going to be talking about Drug Repurposing using Deep Learning on Knowledge Graphs today, and before we get started I just want to quickly introduce my colleague, Alex Thomas, who is a principal data scientist at Wisecube. Alex and I have known each other for many years. He has been working in natural language processing and machine learning, with Indeed before this and at various other startups. He’s also a published author and a speaker, and he’s spoken at Spark Summit and other conferences many times before.
My name is Vishnu, I’m the CTO and Co-founder of Wisecube AI. We specialize in using graph-based technologies in drug discovery. In this talk we’ll be talking about how we applied some of these graph technologies to drug repurposing, but we’ve been working in graph technologies for quite a while now, so we’re pretty excited to be able to share our knowledge as well. Moving on to the next slide.
So to begin, I wanted to set the stage for those of you who are not very familiar with the drug discovery space. There’s obviously a lot of research funding that goes into drug discovery every year, around $200 billion is spent on just the biomedical research part, and unfortunately, there’s not a lot to show for it. In fact, it’s one of the industries where most of the research spending ends up going to waste. By some accounts, almost 75% of the research is not reproducible. In fact, it’s gotten bad enough that many research articles suggest things are not trending in the right direction. There’s actually a law called Eroom’s law, you might’ve heard of it, which is the opposite of Moore’s law: the number of new drugs approved per billion dollars spent on R&D has steadily gone down. In fact, it has halved roughly every nine years since 1950.
And so the graph on the right shows you the trend that the pharma industry is actually sort of going through. And this is typically what you don’t want to see. Like you want to see the opposite, right? And so clearly there’s a problem in terms of biomedical research and the amount of output that they’re actually able to produce. Moving on to the next slide.
So there are obviously various solutions that the pharma industry itself is looking at, and AI is a big part of that. There are a few different ways you can apply AI. The one we’re going to be talking about today is this: typically when you think about drug discovery, you’re thinking about finding a new compound for a new therapy, but oftentimes there are actually a lot of existing drugs that have already been approved by the FDA. And the FDA approval process is a very long and cumbersome process, which can take somewhere from 10 to 15 years.
And like we just talked about, it could be millions of dollars worth of investment. So what if we could actually leverage the drugs that are already on the market for what they call an off-label use? If we can repurpose some of these drugs that have already been approved, already proven to be safe, for a different use, then that would be short-cutting the whole process, right? And so that’s what drug repurposing is all about. One of the primary techniques that we’ll be talking about today is how you leverage scientific literature and knowledge graphs, combining the two together, to be able to search for new therapeutic uses for existing drugs that have already been approved by the FDA.
And at a very high level, it involves, as you would imagine, taking the scientific literature and not only indexing it, but also extracting assertions and triples from it, then combining it with existing knowledge graphs, and then being able to run a search on it using some advanced AI, looking for candidate drugs. With that said, I would like to also give an overview of how we actually helped do this at Wisecube. We are currently working with… and we did work with one of the research institutions based out of Southern California called St. John’s. They’re actually part of Providence health care. Using our AI platform, which combines the core AI processing and also the knowledge graph element, we were able to run a POC with them to produce some candidate results for specific neurodegenerative diseases. And Alex will go into much more detail about the POC itself and what we did there.
With that, I would like to turn it over to Alex who will then walk us through the pipeline itself and go into the details of how we did that, Alex.
Alex Thomas: Thanks Vishnu. Yeah. So there are multiple steps required to extract data like this. First, let’s consider that we have two kinds of data sources. One is unstructured data, so that would be the biomedical text that we want to do natural language processing on. And then we have the structured data, which comes in the form of curated graphs or databases that we want to unify, first with each other and then also with what we extract from the text.
So this is a diagram, at a very high level, of what we’re doing for the Orpheus project. Really what we’re going to be focusing on here is the unification, so that’s in the ETL category, and also the link prediction, which is where you see the TransE L2 embeddings. We’ll get more into that in a moment. We’ve broken this into four steps: the datasets that we actually use for doing this, ingesting them, building the graph, which is going to have these different sources for the edges, and then ultimately the link prediction.
So the three main datasets are the Drug Repurposing Knowledge Graph, or DRKG, which is actually an amalgam itself of other datasets, with some processing already done by the really great team at the Deep Graph Library project. And then there’s ChEMBL, which is an openly available dataset that has all the chemistry information, and PubChem, which is another, similar dataset. ChEMBL, of course, is from the UK, PubChem from the United States. As for DRKG, it’s a combination of these six different datasets, so I won’t go into all the details shown here. Mostly it’s actually subsets from these, not really the full dataset for any of them. But let’s talk a little bit more now about how we ingest this data.
So in ingesting the data, there’s the unification process. This is making sure that two entities that may have different IDs in different datasets ultimately end up with the same ID inside Orpheus. That requires a couple of different steps, depending on which data we’re unifying. And then once we’ve done that, we need to load it into a database if we want to serve it. But if we’re doing link prediction, we actually don’t need to do that at first; we only need to load it to serve it to users. And ingesting the predictions would actually occur after the link prediction, which we’ll get to later. So what do we do with the outputs that we’re going to get from this process?
So when we ingest data and link these IDs, there’s some difficulty, because different datasets may have different ideas of what the primary entity is. For example, in one dataset the IDs may represent both a drug and a dosage, and you’re going to need to match that up to another one where it’s just the drug. Now, in this case, it’s thankfully pretty straightforward, because PubChem actually has an API that lets you retrieve a PubChem compound ID from a DrugBank ID, and the DrugBank ID is the primary identifier used for compounds in DRKG. So we actually retrieve a lot of information pretty straightforwardly using a REST API, and that lets us pull information from ChEMBL as well. Now, the PUG View REST API is slower, but returns a larger document for each compound, so it depends on what sort of information you want to include. We were primarily interested, at least initially, in including the SMILES string, which is pretty straightforward to get. But then we also wanted to add in some categorizations, which required a little broader look at the data, so we ended up using the PUG View REST API.
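As a sketch of that ID unification step: the helpers below build a PubChem PUG REST lookup URL for a DrugBank ID and parse the compound IDs out of the JSON response. The exact endpoint path and response wrapper are assumptions based on PubChem's documented URL patterns, so verify them against the current API documentation before relying on this.

```python
import json
import urllib.request

# Assumed PUG REST base URL and substance-sourceid path; verify before use.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_lookup_url(drugbank_id: str) -> str:
    """Build the PUG REST URL that resolves a DrugBank ID to PubChem CIDs."""
    return f"{BASE}/substance/sourceid/DrugBank/{drugbank_id}/cids/JSON"

def parse_cids(response_text: str) -> list:
    """Pull CIDs out of a PUG REST JSON response. Two wrapper shapes are
    handled, since the envelope differs between endpoints."""
    body = json.loads(response_text)
    if "IdentifierList" in body:
        return body["IdentifierList"].get("CID", [])
    info = body.get("InformationList", {}).get("Information", [])
    return [cid for entry in info for cid in entry.get("CID", [])]

def fetch_cids(drugbank_id: str) -> list:
    """Network call; only runs when explicitly invoked."""
    with urllib.request.urlopen(cid_lookup_url(drugbank_id)) as resp:
        return parse_cids(resp.read().decode())
```

Keeping URL construction and response parsing separate makes the mapping testable without network access, which matters when you are unifying tens of thousands of compound IDs.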
So once you have data in a common format and you want to ingest it into a database, the question of course comes: do you want an RDF-style triple store database, or do you want a modern graph DB that uses the property graph model? The link prediction is really done on data that’s outside of a graph database, so it won’t affect that, but it will affect how your users can query the data. The two languages used for querying these are very different. SPARQL is older and has a lot more heritage behind it, but it can be a little bit more clumsy to use, because it was built at a time when there weren’t as many resources for larger datasets, and so it carries with it some of the constraints from that time. The property graphs, especially the ones that have support for the Gremlin language, benefit from Gremlin being a very expressive language, so you can do a lot of pretty complex queries easily. We’re leaning towards doing a Gremlin-based one, so that’s what we’re focusing on at the moment, but we’re also doing some support for a SPARQL-based one.
So we’ll get to how we actually do the link prediction, but let’s say we’ve produced some predictions for edges. What do we do with them? You probably don’t want to automatically ingest everything you predict, because you don’t know how good it’s going to be. Generally, I’d say the recommended pipeline would be to save all of your predictions, have an expert review them, and then ingest only the ones the expert has approved. If you get to the point where you’re very confident in your link prediction model, you can consider ingesting them directly. But I think it should be a mandatory step that you add a property to those edges that tells the user that these were predicted and not actually created by a human.
So the next step would be the graph building. Now that we’ve gone through the datasets, unified them, and ingested them, we actually want to create a single graph for us to do our link prediction on. There are three kinds of relationships that we’ll have in our graph. One is explicit relationships, so these would be the ones that we actually get from the data we ingest. The next type is literature-based ones, so these are ones that we extract directly from the text. And then finally there are the ones that are formed from link prediction. Let’s look at some more details on these different kinds of relationships. Explicit relationships come in generally three flavors. Triples data, so this would be stuff that’s in RDF format, is explicitly relationships: when you have your subject, your predicate, your object, that’s explicitly a relationship.
So essentially all you need to do is turn that into a format that your chosen graph database can ingest. Then there’s tabular data, and this is the common way to dump data from a property graph, as CSVs. But if you’re ingesting another CSV, and you’re not sure whether it represents edges or nodes, one good sign that it represents edges is that it has two entities, two identifiers. For example, if you have a CSV with a protein ID and a compound ID, it’s almost certainly representing a relationship, an edge. The next complicated part is trying to understand, of the other properties in that tabular data, which ones are for the relationship and which ones are for the individual entities. Hopefully a well-made CSV representing edges should only have properties that are relevant to the edge, but it’s very common to see CSVs that have a lot of duplicated data for each one of the nodes.
Finally, there’s RDBMS data, so stuff stored in maybe MySQL or PostgreSQL. You may want to ingest that into your graph database, but how do you actually map the schema design of an RDBMS to a graph database? Well, one hint for finding relationships is to look for tables that have foreign keys. That may be a sign that there’s an edge relationship in there; although that table may primarily represent a singular entity, it could potentially have edges as well. And probably the most obvious kind of relationship is those represented by join tables. A join table actually goes by many different names, but essentially it’s a table that has IDs from two other tables, and those are essentially explicit relationships. Now, one other caveat for pulling over RDBMS data is that it’s often worthwhile to look at documentation, or maybe even talk to a person involved with it, to understand the meaning of these different tables.
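The join-table pattern above can be sketched with an in-memory SQLite example. The table and column names (`compound_targets_protein`, and so on) are hypothetical, but the mapping, one join-table row to one typed edge, is the point:

```python
import sqlite3

# A toy relational schema: two entity tables and one join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE protein  (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE compound_targets_protein (   -- the join table
    compound_id TEXT REFERENCES compound(id),
    protein_id  TEXT REFERENCES protein(id)
);
INSERT INTO compound VALUES ('C1', 'aspirin');
INSERT INTO protein  VALUES ('P1', 'COX-1');
INSERT INTO compound_targets_protein VALUES ('C1', 'P1');
""")

# Each row of the join table becomes one typed edge (a triple).
edges = [
    (c, "targets", p)
    for c, p in conn.execute(
        "SELECT compound_id, protein_id FROM compound_targets_protein")
]
print(edges)  # [('C1', 'targets', 'P1')]
```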
Now, the edges extracted from literature come in two flavors: those that are heuristically extracted, and those extracted by a model. And you can also break it down along another dimension, which is: are you actually extracting a typed relationship, or are you just looking at a co-occurrence relationship? The benefit of heuristics, to go back to that first division, is that labeled datasets are kind of rare for this kind of data, though they’re becoming more common, and heuristics are pretty easy to implement.
The difficulty is that if you’re doing something heuristically, you’re going to run into a lot of complicated language, so you may have a lot of false positives among the edges extracted. If you’re trying to use a heuristic to extract the actual labeled edges, you run into the issue of language that is too complicated and difficult for heuristics. The heuristics for extracting the untyped edges, basically just co-occurrence, are pretty straightforward, but it’s important to make sure the user understands that this only represents these two entities being talked about in the same context; it doesn’t necessarily have an explicit meaning.
So when we are extracting the ones based on co-occurrence, we’re essentially using TF.IDF, which is how search engines work, to extract them. And then we use those TF.IDF values to create a weight on the relationship: how relevant are they to each other? The idea is that, given the two terms, you first calculate a summed TF.IDF for these two entities across all the documents, which gives you a baseline value for the pair.
And then you want to identify the documents where u and v share a context. The context could be the same sentence, it could be within a certain window of tokens, or if you’re looking for very broad, long-distance relationships, it could even be the same document. Once you identify those documents, you sum up the TF.IDF for u and v in those. Then all you do is take the ratio of the summed TF.IDF for the shared contexts divided by the summed TF.IDF for the two terms separately. This effectively calculates what’s called a weighted Jaccard for these two entities. I have another slide that presents this in a more pseudo-code, more mathematical way. So essentially, once we’ve calculated that weight, we can show how relevant these entities are to each other in the literature.
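The weighting just described can be sketched as follows, assuming we already have per-context TF.IDF scores. The term names and score values below are made up for illustration:

```python
# tfidf maps context id -> {term: TF.IDF score}. A "context" could be a
# sentence, a token window, or a whole document, as described above.
def cooccurrence_weight(tfidf, u, v):
    """Ratio of the summed TF.IDF of u and v in shared contexts to their
    summed TF.IDF over all contexts."""
    shared = [d for d, scores in tfidf.items() if u in scores and v in scores]
    numerator = sum(tfidf[d][u] + tfidf[d][v] for d in shared)
    denominator = sum(s.get(u, 0.0) + s.get(v, 0.0) for s in tfidf.values())
    return numerator / denominator if denominator else 0.0

tfidf = {
    "doc1": {"donepezil": 0.8, "alzheimers": 0.6},  # shared context
    "doc2": {"donepezil": 0.4},
    "doc3": {"alzheimers": 0.2},
}
# Shared mass 1.4 over total mass 2.0:
print(cooccurrence_weight(tfidf, "donepezil", "alzheimers"))  # 0.7
```

A pair that only ever appears together gets weight 1.0, and a pair that never shares a context gets 0.0, which matches the weighted-Jaccard intuition.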
So now that we’ve built this graph, and maybe we’ve even run some sort of relationship extraction, and we have the co-occurrence extraction I just went over, let’s figure out what we can learn from the graph itself in terms of new relationships. That’s where we get to the heart of drug repurposing, because we have these existing drugs, and we have these diseases with some literature behind them, and really what we want to do is find relationships that are implied by the data, by the graph that we’ve encoded; hopefully they’re implied by it, but not explicitly described in it.
So generally there are two types of ways of doing this. There are untyped models that are essentially looking purely at the structure of the graph. They don’t really look at the meaning of the graph; they’re essentially saying, “Oh, these two things aren’t connected, but they are connected to other nodes in a way that suggests maybe they should be.” That has the benefit of generally being straightforward to implement, and as far as the machine learning literature goes, there’s a broader base of work on that kind of approach. And then there are the typed ones, where you’re trying to predict a specific kind of relationship. Of course, the output from that is going to be much more valuable, because you’re telling the user that the model is predicting this particular kind of relationship, but it can be much more difficult to train such a model. And as I mentioned earlier, we’re using the Deep Graph Library for doing this modeling work.
So the first one is essentially a pure heuristic, not even really a model, which is looking at the Jaccard coefficient. The intuition behind this is that you can know a node by its neighbors; you know a node by the company it keeps. The way you predict is you look at the pairs that have many common neighbors, and then you say, “Oh, you have many common neighbors, you should probably be connected as well.” So it’s pretty straightforward to do, no model to train, but the intuition honestly is not realistic. That being said, you can actually get, not great results, but some good results out of this. And I would suggest, if you’re doing this on a new dataset where maybe you don’t have a lot of other sources of information to validate your predicted edges, this is a pretty good baseline to compare against.
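A minimal sketch of this Jaccard baseline, on a toy graph with hypothetical node names:

```python
# adj maps each node to the set of its neighbors.
def jaccard(adj, u, v):
    """|neighbors(u) & neighbors(v)| / |neighbors(u) | neighbors(v)|"""
    nu, nv = adj[u], adj[v]
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

adj = {
    "drugA":    {"geneX", "geneY"},
    "drugB":    {"geneX", "geneY", "geneZ"},
    "diseaseD": {"geneY", "geneZ"},
}
# Score candidate (currently unconnected) pairs; high neighborhood
# overlap suggests a missing edge.
pairs = [("drugA", "drugB"), ("drugA", "diseaseD")]
ranked = sorted(pairs, key=lambda p: jaccard(adj, *p), reverse=True)
print(ranked[0])  # ('drugA', 'drugB')
```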
The next one is model-based, but still untyped prediction, using DeepWalk. Essentially the idea of DeepWalk is that you produce random walks through the graph, which gives you sequences of nodes, and then you can use a technique well trodden in NLP to build embeddings that represent the nodes. These embeddings essentially compress the information of the nearby structure of the graph for each node. And once you have these embeddings, you can build a binary classifier that takes in the embeddings for two nodes and just predicts whether they’re connected. The benefit is that this is a pretty straightforward way to do this, and you can build a pretty sophisticated model: the larger your graph, the more data you have for bigger embeddings.
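The first stage of DeepWalk, generating the random walks, can be sketched like this. The skip-gram training that turns the walks into node embeddings (for example with word2vec) is omitted; the graph below is a toy example:

```python
import random

def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Uniform random walks over an adjacency dict (node -> neighbor set).
    Each walk is a sequence of nodes, analogous to a sentence in NLP."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: stop the walk early
                walk.append(rng.choice(sorted(neighbors)))
            walks.append(walk)
    return walks

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
walks = random_walks(adj, walk_length=5, walks_per_node=2)
# Every consecutive pair in a walk is an actual edge in the graph.
assert all(w[i + 1] in adj[w[i]] for w in walks for i in range(len(w) - 1))
```

These walks would then be fed to a skip-gram model exactly as sentences are, so nodes that appear in similar neighborhoods end up with similar embeddings.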
Of course, the con is that you don’t actually get the types of the edges; the model is essentially saying the structure of the graph suggests there should be an edge. There are ways you can kind of get typed edges, like training on only certain types of edges in the training set, but of course that’s going to introduce a bias, because the embeddings themselves weren’t built with that in mind. Still, you can kind of get types if you restrict the training set. The method we actually used for the POC is TransE L2. Essentially the idea there is that you train embeddings that encode the relationship types themselves: you train a neural network that tries to predict that a particular vector relationship holds between two nodes and a relationship type.
So that relationship I showed down below is that the u embedding plus the relationship embedding minus the v embedding, v being the target entity, should be zero; they should be very close. And the idea, if we consider the w, u and v in the picture: if you look at w minus v, that’s like the edge between them, and if you then add the relationship vector, it sort of makes a parallelogram defined by these two vectors. So the idea is that the head entity embedding plus the relationship-type embedding should get you to the target vector, and you train a neural network that does that. Then you take the embeddings, the higher-level representations, from that network. And once you do that, you have embeddings; you don’t need to run a model at evaluation time. You have embeddings that encode the link prediction.
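The TransE scoring idea, that h + r − t should be near zero for a plausible triple, can be sketched with toy vectors; these are illustrative values, not trained embeddings:

```python
import numpy as np

def transe_score(h, r, t):
    """L2 distance of h + r from t; lower means a more plausible triple.
    This is the TransE L2 scoring function, not the training loop."""
    return float(np.linalg.norm(h + r - t, ord=2))

h = np.array([1.0, 0.0])   # head entity, e.g. a drug
r = np.array([0.0, 1.0])   # relation, e.g. "treats"
t_good = h + r             # a target consistent with the translation
t_bad = np.array([5.0, 5.0])

print(transe_score(h, r, t_good))  # 0.0
assert transe_score(h, r, t_good) < transe_score(h, r, t_bad)
```

Training would adjust the entity and relation embeddings (typically with a margin-ranking loss against corrupted triples) so that true triples score low; at evaluation time you just compute this distance, with no model to run.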
But of course, the difficulty with this is that you need a pretty built-out graph in order to learn these embeddings, and you’re probably going to need a higher dimension. But we are quite happy with the results that we saw in our internal review. So now that we’ve produced these edges, the next step, as I was talking about earlier on, is to submit them to an expert. What we did is we filtered our predictions to look only at drugs that are already FDA approved. And we also did another thing using the platform we have here at Wisecube, which is we predicted blood-brain barrier permeability, because that’s of course going to be very relevant to the research going on, which is related to Alzheimer’s disease: any of the drugs that we recommend would have to be able to pass through the blood-brain barrier. So right now we’ve submitted our first round of candidates and are waiting to receive the results. Waiting very eagerly, I might add. And so now I’d like to hand it back to my colleague Vishnu for the sum-up.
Vishnu Vettrivel: Thank you, Alex. So as we saw, drug discovery and drug repurposing obviously have a lot of applications and a lot of opportunity. But knowledge graphs and link prediction also make up a very powerful technique that can be used in various different ways. This is the reason why we’re focusing a lot of our energy on knowledge graphs and have specialized in them. We believe that this way we can combine the knowledge that’s out there and represent it in a way that is not only easily understood by humans, but can actually unlock signals and information that is otherwise hidden deep in the data, through the higher-level representations that Alex was talking about. So with that, hopefully we gave you some insight into how drug discovery works, how drug repurposing works, and also how we can use knowledge graphs to do that. Thank you everyone for joining our session, and we’ll open it up for questions. Thank you.
Alex Thomas is a principal data scientist at Wisecube. He's used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical...
Vishnu Vettrivel is the Founder and CTO of Wisecube, a startup focused on accelerating biomedical research using AI. He has decades of experience building Data platforms and teams in healthcare, fi...