In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques. We will also cover how to use rule-based entity extraction to bootstrap a named entity recognition model. The other important aspect of this project we will cover is how to infer relationships between entities, and combine them with explicit relationships found in the source data sets. Although this talk is focused on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text.
What you will learn
– How to extract named entities without a model
– How to bootstrap an NLP model from rule-based techniques
– How to identify relationships between entities in text
Speakers: Alexander Thomas and Vishnu Vettrivel
– Welcome, everyone, to “Using Spark NLP to Build a Biomedical Knowledge Graph,” otherwise known as how to build a space telescope and not get lost in the darkness. To start with, we’ll quickly introduce ourselves. My name is Vishnu. I’m the CTO and founder of Wisecube AI. I have a lot of experience building data science teams and platforms, and also a lot of experience with graph databases themselves. I’ll let my colleague Alex introduce himself.
– Hi, I’m Alex Thomas. I’m the principal data scientist at Wisecube. I have a lot of experience with NLP. And I’ll probably be talking a lot more about NLP in the rest of the talk.
– Thanks, Alex. Okay, a little background about what we do at Wisecube. We’re a biomedical AI company. We have a core platform called Neffos that allows you to manage data sets, models, and workflows, with visual workflows that allow you to build and publish models. We combine that with a knowledge graph. We build this knowledge graph out of publicly available data, and we also merge private datasets into it: specifically biomedical literature, scientific literature, and scientific and biomedical data sets. We’ll talk about what we did around that. The idea is that we then use this knowledge graph to build specific end-user applications that can power scientists in the drug discovery space: things that can help predict properties or activity of chemicals, and other applications that can predict adverse drug events, and so on.

Having said that, some background on the space itself. The biomedical data space is exploding in some ways. They call it the Big Bang of biomedical data: in recent years a whole series of events has caused this field to explode. First, there’s so much more data available now compared to even 10 years ago. There are many reasons for that. One of the main reasons is, of course, the genome project, and there’s a lot more genome sequencing happening, so there’s a lot of sequencing data available. The second factor is that there is simply more literature available, unstructured scientific literature that’s being written. Take COVID as a case in point: there is so much publishing happening around this space that it’s almost impossible for human beings to keep up with.
There’s also a lot more computing available, cloud-based systems that make it easier to process some of these things. But the downside is that a lot of the insights that can be derived from all of this data are still very much hidden: the data is either unstructured or siloed. That’s why we call it dark data: it’s outside the reach of the scientists who need the insights in a timely fashion to be able to make use of all of this data. The other downside, of course, is that most of this data, as you would imagine, and not just in the biomedical space but in almost any vertical, is unlabeled: 90-plus percent of it. So you’re mostly dealing with unstructured or unlabeled data. These are the problems that are inherent in the space itself.

So we’ve talked about the problem space. Now let’s talk about the solution space: what would a great solution look like? We’re going to use the space metaphor heavily. If the problem is dark data that you can’t really get access to or get insights from, and 95% of the data available in the space is dark, then the solution should look something like the Hubble Space Telescope. By combining NLP and knowledge graphs, we believe it should allow scientists to quickly point at a given domain, visualize what it looks like to get an accurate picture of that space, and gather insights quickly. It should also allow them to explore all the different connections that are inherent in their area.
It should also let them learn representations, and not only learn representations but also make predictions: What are some of the adverse events that are happening? What are the targets that we should be going after? What are some of the promising solutions that we should be looking at in that space? That’s what we’re going to be talking about in the next 10 or 20 minutes: how we actually constructed this telescope, if you will. Alex is going to go into much more depth about the different parts of the pipeline and how we built Orpheus, which is our solution to this problem. With that, I am going to hand it off to Alex to talk more about Orpheus and its construction. Alex?
– Thanks, Vishnu. I’m going to be talking about how we actually built the Orpheus pipeline, as well as giving a demonstration of the Orpheus product. Here we see an abstract representation of the pipeline, where we have raw data coming from PubMed and ChEMBL. This represents structured data from ChEMBL and some other places, as well as unstructured data from PubMed. The first step is to do some text processing on the unstructured data. That involves topic modeling, in order for us to summarize it. Then we do named entity extraction, in order to find the entities that are going to compose our knowledge graph. Once we’ve extracted these entities, we want to do a couple of different things. One, if we don’t have a good named entity recognition model, we can use a heuristic or dictionary-based approach to bootstrap a model. But also, once we get the entities we’re interested in, we want to join them with structured data sets. That gives us our knowledge graph. Then we can update it with other data sets and with user annotations. So we’re going to be going over the datasets we’re using, the text processing steps, the topic modeling, entity extraction, and graph building.

The dataset that we’re using for the unstructured data is the CORD-19 dataset. It’s a collection of COVID-19 related papers curated by the Semantic Scholar team at the Allen Institute for AI. It contains a metadata file as well as the actual articles. The metadata file references a large number of articles, many of which are not actually present as text in the dataset. You could retrieve them via URL, but some are behind publisher paywalls. We only do the processing on the data within the actual dataset. That’s about 110,000 articles now. I think when the slides were originally made it was about 90,000, so some of the numbers may be a little bit off.
It has documents in two formats: PDFs that have been parsed into JSON files, and PubMed Central articles, which live in XML on PubMed and have also been parsed into JSON files, and many overlap. The disease dataset that we have comes from a website that lists many infectious diseases and has names and synonyms for these diseases. I manually curated these in order to create a dictionary of the names for us to do some dictionary-based named entity extraction. The compounds, the chemicals, are from ChEMBL; we have 13,000 compounds here. We have their IUPAC names, as well as any synonyms available, and their SMILES. For proteins, we use UniProt. There’s a subset that we have downloaded; about 40,000 or so regularly occur within CORD-19. Most of the 1.4 million we never actually find, and many of them also don’t really have commonly used aliases, just a very laborious technical name.

Let’s take a deeper look at the CORD-19 dataset. We now have 110,000 documents; when these plots were made, it was about 94,000. The documents had an average length of about 20,000 to 26,000 characters, and they cover about 20,000 journals. We see the log distribution of the journals on the left-hand side: the majority of the journals only occur a few times in the dataset. We see the more common ones in the pie chart in the middle: Medline, World Health Organization, PMC, and Elsevier are the top four. On the right-hand side we see the number of documents with a particular license, so the different licenses in the dataset. Most of them don’t have a Creative Commons one. Oh, it looks like we have Elsevier’s special COVID license in the third spot. But of course, if you really want to understand a text dataset, you have to look at the text. So next we’re going to move on to the text processing.
We use Spark NLP. It just so happens that I recently wrote a book on natural language processing with Spark NLP. It’s really useful in a situation like this because the dataset is too large to hold in memory: even though it’s just 100,000 documents, these are rather large pieces of text, and once you start processing you take up a lot more memory. The Spark NLP library is built on top of Spark MLlib, and it has many pre-built models and pre-built pipelines. In this situation I used a custom pipeline because I had specific processing I wanted.

Let’s take a look at the processing. We have a corpus come in, and I do sentence tokenization, which is a normal start: you want to break up your sentences, because if you’re going to be looking at sequences of text you generally don’t want to look across sentence boundaries. Then I have two tokenization steps. In the longer pipeline, the top one that goes to the right, I’m doing more aggressive processing in order to do topic modeling, so I’m trying to capture n-grams there. On the one that goes downwards, I’m doing less complex processing because I’m going to be doing dictionary-based named entity recognition and potentially model-based entity recognition. Normalizing simplifies the vocabulary space by lowercasing and removing non-alphabetic characters. Stop word cleaning is something I only want to do on the branch where I’m doing more aggressive filtering, because stop words can actually be part of a chemical name or a disease name. Similarly, lemmatization can be aggressive depending on what tool you’re using for it, and the same goes for stemming. So I only do that on one side.
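To make the two branches concrete, here is a minimal plain-Python sketch of the flow just described. It deliberately does not use the Spark NLP API. The function names, the tiny stop word list, and the lemma lookup table are all hypothetical stand-ins chosen for illustration; a real pipeline would use proper annotators for each stage.

```python
import re

# Hypothetical, tiny resources for illustration only.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "are", "was", "were", "to"}
LEMMAS = {"viruses": "virus", "patients": "patient", "spreading": "spread"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split on whitespace and strip punctuation from the ends of each token."""
    tokens = (tok.strip(".,;:!?()[]\"'") for tok in sentence.split())
    return [tok for tok in tokens if tok]


def normalize(tokens):
    """Lowercase, remove non-alphabetic characters, and drop tokens left empty."""
    cleaned = (re.sub(r"[^a-z]", "", tok.lower()) for tok in tokens)
    return [tok for tok in cleaned if tok]


def simple_branch(sentence):
    """Entity-extraction branch: tokenize and normalize only."""
    return normalize(tokenize(sentence))


def aggressive_branch(sentence):
    """Topic-modeling branch: also remove stop words and map tokens to lemmas."""
    tokens = [t for t in simple_branch(sentence) if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


text = "The viruses were spreading in patients. Influenza A is common."
sentences = split_sentences(text)
for sentence in sentences:
    print(simple_branch(sentence), aggressive_branch(sentence))
```

In the second sentence the aggressive branch drops the "a" of "Influenza A", which is the kind of loss that motivates keeping stop word cleaning off the entity-extraction branch.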
The outputs are the normalized tokens from the simple branch of the pipeline and the lemmas from the more complex branch, which we’re going to use for topic modeling.

Let’s look some more at the text processing. Here we take the corpus and run it through the basic text processing we talked about on the previous slide, where we get the lemmas and the normalized tokens out. But we also want to look for n-grams for that more complex pipeline. We can find those using a manual process: look at the frequencies of n-grams and try to identify either valuable phrases or stop phrases, phrases that we want to filter out. We identify these and add them to the tokenizer, and the tokenizer will preserve those n-grams as phrases. If we look at the distribution of the TF.IDF of these n-grams, we see that the unigrams have a much broader distribution, so there are lots of unigrams that occur with some frequency. Then we see it decrease as we go to bigrams, trigrams and 4-grams. That means that as you’re looking for n-grams to add, bigrams will probably contribute more than trigrams, and trigrams more than 4-grams. You could look at n-grams beyond that, at 5-grams or 6-grams, but you’re probably not going to find much there.

Topic modeling is a clustering technique done on text, where we try to identify similar vocabulary across the set of documents. It’s based on the idea from J.R. Firth: “You shall know a word by the company it keeps.” In this situation we’re using Latent Dirichlet Allocation, or LDA as it’s usually called, which is a way of modeling text as the product of a generative process from multiple distributions over the vocabulary.
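The "generative process" behind LDA can be illustrated with a short simulation. This is only a sketch of the story LDA assumes, not a topic model fitter and not the code used for Orpheus: each document draws its own topic proportions from a Dirichlet distribution, and each word is produced by first picking a topic and then picking a word from that topic’s distribution over the vocabulary. The two topics, their word probabilities, and the `alpha` values are made up for illustration, and the topic-word distributions are fixed here rather than drawn from their own Dirichlet prior as in full LDA.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution using normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics = list(topic_word)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(topics, weights=theta, k=1)[0]
        vocab = list(topic_word[topic])
        probs = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Hypothetical topics: each is a probability distribution over a small vocabulary.
TOPIC_WORD = {
    "clinical": {"mortality": 0.5, "patient": 0.3, "enrollment": 0.2},
    "molecular": {"protein": 0.5, "receptor": 0.3, "binding": 0.2},
}

rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document proportions that most plausibly produced them.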
Here we have an example of pyLDAvis, which we use to visualize some of these topic models. On the left-hand side we see the distribution of the topics in a dimensionally reduced space. I’ll talk a little bit more about this when we look at the visualization in Orpheus.

Now let’s talk a little bit about the entity extraction. There are two main approaches. We can do dictionary-based extraction using the Aho-Corasick algorithm. That requires a dictionary or word list, as well as a mapping to the entities we want to extract. And then there’s model-based extraction. These are probably the more widely reported on because they get better results, but they require a lot of labeled data, and if it’s deep learning it requires a very large amount of data. When you’re dealing with a domain-specific corpus, finding appropriate models can be difficult.

Let’s look at the Aho-Corasick algorithm. It’s a well-known, well-tested algorithm. It lets you scan through the tokens looking for sequences that represent the entities, without having to backtrack through the text, which is a common problem if you’re simply trying to match a set of phrases. It’s really a prefix trie, where you have the words stored in a tree data structure, and it searches through them. But the Aho-Corasick algorithm adds backlinks, so that if you come across something that means it wasn’t the phrase, you don’t actually have to rewind your scan through the text, so it ends up being much more efficient. The con, though, is that it doesn’t use context. For example, if APRIL is the name of a protein, as it is in UniProt, then when someone mentions a pandemic that was spreading around the world in April, it will find that as a protein.
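Here is a compact, token-level sketch of Aho-Corasick to show the trie and the backlinks (usually called failure links) at work. It is a teaching implementation written for this transcript, not the matcher used in Orpheus, and the small phrase dictionary and its labels are invented for the example.

```python
from collections import deque


class AhoCorasick:
    """Token-level Aho-Corasick matcher over a dictionary of phrases."""

    def __init__(self, phrases):
        # phrases maps a tuple of tokens to an entity label.
        self.goto = [{}]   # goto[state][token] -> next state (the prefix trie)
        self.fail = [0]    # fail[state] -> state to fall back to on a mismatch
        self.out = [[]]    # out[state] -> (phrase length, label) pairs ending here
        for tokens, label in phrases.items():
            self._add(tokens, label)
        self._build_failure_links()

    def _add(self, tokens, label):
        state = 0
        for tok in tokens:
            if tok not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][tok] = len(self.goto) - 1
            state = self.goto[state][tok]
        self.out[state].append((len(tokens), label))

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 states fall back to the root
        while queue:
            state = queue.popleft()
            for tok, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(tok, 0)
                # Inherit matches that end at the fallback state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, tokens):
        """Return (start, end, label) spans; each token is visited exactly once."""
        state, matches = 0, []
        for i, tok in enumerate(tokens):
            while state and tok not in self.goto[state]:
                state = self.fail[state]  # follow backlinks instead of rewinding
            state = self.goto[state].get(tok, 0)
            for length, label in self.out[state]:
                matches.append((i - length + 1, i + 1, label))
        return matches


# Hypothetical dictionary, keyed by lowercased tokens.
matcher = AhoCorasick({
    ("influenza",): "DISEASE",
    ("influenza", "virus"): "DISEASE",
    ("virus", "infection"): "DISEASE",
    ("april",): "PROTEIN",
})

overlapping = matcher.find("influenza virus infection spread".split())
false_positive = matcher.find("a pandemic was spreading in april".split())
print(overlapping)
print(false_positive)
```

In the first sentence, "virus infection" is found even though the scan was in the middle of the "influenza virus" branch of the trie; the failure link moves it to the right place without re-reading any token. The second sentence shows the context problem: the month is tagged as the protein APRIL.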
The model-based approach uses context, so it’s much better, and it can be tuned to different datasets, so if you have different styles of communication it can be tuned that way. It requires labeled data, as in the example shown here: in “influenza virus,” “influenza” is marked as the beginning of a disease phrase, “virus” is marked as inside a disease phrase, and everything else is marked with O, outside of a phrase. The con, of course, is that it requires labeled data, and if it’s deep learning it requires a lot of labeled data.

So you can do some bootstrapping for named entity recognition. You can run a dictionary-based approach and then train a model on its output. You could use a conditional random field, which is a more classical model for named entity recognition, or you could use a deep learning one as well. But once you do this you’ll probably get a mediocre-quality model, and you might want to consider iterating on it with some human labels.

Now that we’ve done all this cool processing, let’s talk about building our graph. There are two ways we can approach relationship extraction, that is, finding the actual relationships: a model, or heuristics with labels. With heuristics, basically, if two entities are found within a document and there are certain terms between them, you label the edge in a particular way. The difficulty with the model-based approach is that it requires labels done by experts, so it’s not something you can easily farm out to Mechanical Turk, and that can be expensive. There are also not a lot of freely available models for this out there. The heuristic approach, similar to the dictionary approach, doesn’t need labeled data, but it’s going to be prone to erroneous labeling. In building our graph, we’re not going to be labeling the edges; we’re just going to be looking at co-occurrence within the documents. The first step is to identify the context.
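Stepping back to the bootstrapping idea for a moment: the link between the dictionary and the model is just a conversion of dictionary matches into begin/inside/outside tags, which can then serve as (noisy) training labels for a sequence tagger. The sketch below shows that conversion only. It uses a simple greedy longest-match lookup rather than Aho-Corasick to stay short, and the function name, the `max_len` parameter, and the two-entry dictionary are assumptions made for the example.

```python
def bio_tags(tokens, dictionary, max_len=4):
    """Tag tokens with B-/I-/O labels using greedy longest dictionary match.

    dictionary maps a tuple of lowercased tokens to an entity label.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match_len, label = 0, None
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(lowered[i:i + n])
            if candidate in dictionary:
                match_len, label = n, dictionary[candidate]
                break
        if match_len:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + match_len):
                tags[j] = f"I-{label}"
            i += match_len
        else:
            i += 1
    return tags


# Hypothetical dictionary entries.
DISEASES = {("influenza",): "DISEASE", ("influenza", "virus"): "DISEASE"}

tokens = ["The", "influenza", "virus", "spreads", "quickly"]
tags = bio_tags(tokens, DISEASES)
for token, tag in zip(tokens, tags):
    print(token, tag)
# tags == ["O", "B-DISEASE", "I-DISEASE", "O", "O"]
```

Because the labels inherit every dictionary error, including context mistakes like the APRIL case, a model trained on them is a starting point for human correction rather than a finished tagger.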
In our case, the context is the sentence. Then you want to decide what weight you’re going to use. You can use a binary weight, that is, do they co-occur or not. You can use a co-occurrence count, but that can end up being dominated by very popular terms. We ended up using TF.IDF. Here’s some basic pseudocode for building the TF.IDF. What we’re doing is calculating TF.IDF for the individual entities, and that gives us a weight. We can calculate a Jaccard similarity between two terms by looking at just the documents they occur in; that would be the binary approach. However, we can also weight it by their TF.IDF, which gives a more informed weight to the edge. That gives us a weighted Jaccard similarity between the two entities.

So that’s the pipeline. We put it into an app because we want to be able to show this to potential users. This is an example of the full pipeline: we do all the extraction, we load the entities and the topics into the search index, and we also load the entities and the edges into a graph database like OrientDB or Amazon Neptune.

Let’s go take a look at the demo. This is the Orpheus app. On this first page we see summaries of the CORD-19 corpus. In the top part we see a visualization of the topic model. In the bottom part we see some other, more generic summaries. On the left-hand side we see recent papers; these are just the most recently published papers in this particular download of the CORD-19 dataset. On the right we see the most prominent terms, just by occurrence, from the named entities found within the text. The top one here is pelvic inflammatory disease. This is likely a result of false positives in the named entity recognition. We see some other ones that make a lot more sense, like SARS, COVID-19, and influenza.
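Returning briefly to the edge weights described before the demo: the pseudocode slide is not part of this transcript, so what follows is only one plausible reading of it. It computes a TF.IDF weight for each entity in each context (a sentence), then compares the binary Jaccard similarity with a weighted Jaccard defined as the sum of per-context minimum weights over the sum of per-context maximum weights. That weighted formula, the smoothed IDF, and the toy contexts are assumptions for illustration, not necessarily what Orpheus uses.

```python
import math
from collections import Counter


def tfidf_weights(contexts):
    """contexts: one list of entity mentions per sentence.

    Returns {entity: {context index: TF.IDF weight}}.
    """
    n = len(contexts)
    df = Counter()
    for mentions in contexts:
        df.update(set(mentions))  # document frequency counts each context once
    weights = {}
    for idx, mentions in enumerate(contexts):
        for entity, tf in Counter(mentions).items():
            idf = math.log((1 + n) / (1 + df[entity])) + 1  # smoothed IDF (assumed)
            weights.setdefault(entity, {})[idx] = tf * idf
    return weights


def binary_jaccard(weights, a, b):
    """Shared contexts divided by all contexts containing either entity."""
    ctx_a, ctx_b = set(weights.get(a, {})), set(weights.get(b, {}))
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


def weighted_jaccard(weights, a, b):
    """Sum of per-context minimum weights over sum of per-context maximum weights."""
    wa, wb = weights.get(a, {}), weights.get(b, {})
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


# Toy contexts: the entities found in each of four sentences.
contexts = [
    ["covid-19", "ace2"],
    ["covid-19", "ace2", "remdesivir"],
    ["covid-19"],
    ["influenza"],
]
weights = tfidf_weights(contexts)
print(binary_jaccard(weights, "covid-19", "ace2"))
print(weighted_jaccard(weights, "covid-19", "ace2"))
```

Entities that never share a sentence get a weight of zero under both measures, so no edge would be drawn between them.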
Let’s go back up to the topic model visualization. On the left-hand side is a display of the topics. The larger the circle, the more prevalent that topic is in the corpus. Their locations on this two-dimensional grid represent a vague similarity between the topics. This is the result of taking a higher-dimensional space and doing dimensionality reduction to get it to two dimensions. On the right-hand side we see the most prominent terms within the corpus; this is just done by occurrence. This is the generic sorting done within pyLDAvis, so they’re not too relevant right now, but it can certainly be tuned to the researcher’s use case. The next thing we can do is click on a topic and see the distribution of terms within it compared to the general distribution. Let’s look, for example, at topic one. We see that mortality is more prominent than enrollment within topic one. But if you look at the blue bars, enrollment is more prominent within the general dataset. That tells you that mortality is something that will inform us as to what this topic represents. Another thing we can do is hover over a term and see its distribution across topics. It looks like mortality is not really relevant to most of the topics, but for topics one and three it’s pretty relevant.

Now that we’ve done this, we can click through to do a search on mortality and get documents from a general search index. We see the titles and the abstracts here, but we also see the named entities extracted from the text. So we can go through and look for documents that might be interesting, or entities that might be interesting. Let’s say we want to click through and see what is available for COVID-19. When we do this, it takes us to the graph view.
Here we have the nodes and edges of the graph. These edges are informed by the COVID-19 literature, but in Orpheus we can integrate other structured datasets that can also define edges. Let’s look at the evidence for an edge. We click on the edge, and we see all the documents that provide the evidence for it. The next thing we can do is expand a node, and we can also see some summary information here for ACE2. When we expand it, we get new edges that are connected to ACE2. For example, we have remdesivir connected to ACE2, and we can see the evidence for the edges connected here. We also have a couple of other filters that we can apply. We can, for example, filter out all the compounds, and now it’s just the diseases and proteins. You can also adjust the minimum confidence and the maximum number of neighbors. So that was the demo, and with that I’ll hand it back to Vishnu to wrap up.
– All right, thanks a lot, Alex. I just want to wrap up with some key takeaways. As I stated in the beginning, the biomedical space has a lot of data, but most of it is beyond the reach of scientists to take advantage of; that is what we call dark data. We believe that by using NLP and knowledge graphs we can shine a light on it, build the telescope, if you will, to focus on it, get a much more insightful picture out of it, and then use it for downstream use cases. We didn’t really talk much about the downstream applications. Hopefully in a future talk we’ll be able to do that a lot more. Thanks a lot for listening, and we hope you have a great conference.
Alex Thomas is a principal data scientist at Wisecube. He's used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP.
Vishnu Vettrivel is the Founder and CTO of Wisecube, a startup focused on accelerating biomedical research using AI. He has decades of experience building Data platforms and teams in healthcare, financial services and digital marketing.