In this session, we show how to leverage the CORD dataset, which contains more than 400,000 scientific papers on COVID and related topics, together with recent advances in natural language processing and other AI techniques, to generate new insights in support of the ongoing fight against this infectious disease.
The idea explored in our talk is to apply modern NLP methods, such as named entity recognition (NER) and relation extraction, to article abstracts (and, possibly, full text) to extract meaningful insights from the text and to enable semantically rich search over the paper corpus. We first investigate how to train a NER model using the Medical NER dataset from Kaggle and a specialized version of BERT (PubMedBERT) as a feature extractor, to allow automatic extraction of entities such as medical condition names, medicine names and pathogens. Entity extraction alone can provide us with some interesting findings, such as how approaches to COVID treatment evolved over time in terms of the medicines mentioned. We demonstrate how to use Azure Machine Learning for training the model.
To take this investigation one step further, we also investigate the usage of pre-trained medical models, available as the Text Analytics for Health service on the Microsoft Azure cloud. In addition to many entity types, it can also extract relations (such as the dosage of a medicine provisioned), entity negation, and entity mapping to some well-known medical ontologies. We investigate the best way to use Azure ML at scale to score a large paper collection and to store the results.
Francesca Lazze…: Hi everyone. My name is Francesca Lazzeri, and today I'm here with Dmitry to present our session on how machine learning and AI can support the fight against COVID-19. Dmitry is my co-speaker. He joined Microsoft 15 years ago as a technical evangelist, and he then worked for some years as an AI and machine learning software engineer, doing pilot projects with large companies in Europe. Now he combines his work as a cloud developer advocate at Microsoft with being an associate professor at the Moscow Institute of Physics and Technology. He lives in Moscow, Russia. And in addition to technology, he loves his daughter, Vicky, and he does a lot of great experiments in using AI for science art.
Dmitry Soshniko…: Thank you, Francesca. Francesca is originally from Italy. She lives in Boston, Massachusetts. She loves data and long walks with her six-month-old daughter. She's an adjunct professor of AI and machine learning at Columbia University. And at Microsoft, she is a developer advocate manager, coordinating our team of advocates working on AI and ML. Before joining Microsoft, she was a research fellow at Harvard, in the department of technology and operations.
Francesca Lazze…: Thank you so much, Dmitry. So, this is actually a very, very important project for us, because we noticed that one of the major challenges of the COVID-19 pandemic has been finding effective treatments for the virus. This was the problem that we noticed last year, and this year as well: the rapid spread of coronavirus generated an urgent need for comprehensive therapies in the scientific community. As a consequence, the medical research community experienced an acceleration in new coronavirus literature, making it very difficult for scientists to keep up. Just to give you an idea, around 30,000 scientific papers related to COVID are published every month. So you can understand that this is a very, very big amount of scientific papers, and of literature in general, that scientists need to read.
So in this specific session, we will show you how you can leverage the CORD dataset, a dataset that contains more than 400,000 scientific papers on COVID, and also how you can leverage different advances in natural language processing (NLP) and other artificial intelligence techniques in order to generate new insights in support of the ongoing fight against COVID. This COVID-19 dataset is a very good dataset to start with, because it represents the most extensive machine-readable coronavirus literature collection. In the same dataset there are about 200,000 articles with the full text. We also put the links in this slide so that you can access the dataset yourself. You can understand that this gives the worldwide AI community, and I would say also the machine learning community and the scientific community in general, the opportunity to apply text and data mining approaches so as to connect different insights across this very good content.
So what we decided to do is this: we started thinking about the different natural language processing techniques that we can use in order to analyze and extract knowledge from these datasets. NLP is focused on the interactions between computers and human language. Most importantly, it tries to understand how to program computers to process and analyze large amounts of natural language data. NLP is behind the following applications, and we just put some of them here in this slide: intent classification, named entity recognition, keyword extraction, text summarization, question answering, and many others. There is also the other part, which is about the statistical models that you can leverage in order to do similar types of research. These include probabilistic models that can predict the next word in a sequence, given the words that precede it. A number of statistical language models are in use, such as recurrent neural networks, Transformers, GPT, and also Microsoft Turing-NLG.
And if you want to learn more about NLP with Python, you can also check out our GitHub repo, Introduction to NLP with PyTorch. There is also a Microsoft Learn learning path on NLP with PyTorch. So our first idea was to use BERT. You are probably already familiar with BERT. BERT makes use of the Transformer, an attention mechanism that learns the contextual relations between words in a text. In its very [inaudible] form, I would say simplified form, a Transformer includes two separate mechanisms: an encoder that reads the text input, and a decoder that produces a prediction for the task itself. Most of the time, as data scientists, we like to say that these types of models are opposed to directional models, which read the text input sequentially, left to right or right to left. The Transformer encoder actually reads the entire sequence of words at once.
And this characteristic allows it to learn the context of a word based on all of its surroundings, so to the left and to the right of the word. Before feeding word sequences into BERT, part of the words in each sequence are replaced with a mask token, and the model then attempts to predict the original value of the masked words based on the context provided by the other, non-masked words in that same sequence. What is nice about BERT is that it contains 345,000,000 parameters, but at the same time this also represents a challenge, because it is very difficult to train from scratch. In most cases, as we wrote here, it makes sense to use a pre-trained language model.
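The masking step just described can be sketched in a few lines of Python. This is only an illustration of the idea, not BERT's actual preprocessing: the `mask_tokens` helper, the whitespace tokenization, the sample sentence and the masking probability are all assumptions made for this example (real BERT operates on WordPiece tokens and has a more elaborate replacement policy).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with the mask token.

    Returns the masked sequence and a dict {position: original token};
    the dict holds the values a masked language model is trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # remember what was hidden at this position
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical sentence, split on whitespace for simplicity
tokens = "patients were treated with hydroxychloroquine for five days".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model sees `masked` as input and is scored only on the positions stored in `targets`, which is why it has to use context on both sides of each hidden word.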
So this is the main idea behind this project. We wanted to extract as much semi-structured information from text as possible, and then store it in a NoSQL database in order to process it further at a later time. Storing the information in a database would also allow us to make some very specific queries to answer some of the questions, as well as to provide a visual exploration tool for medical experts, for structured search, and to generate additional insights. As you can see from this slide, this is the overall architecture of the proposed system. We will use different Azure technologies to gain insights into this paper corpus, and we will see some of them later, such as Text Analytics for Health, Cosmos DB and also Power BI.
First of all, let me give you a general overview of the approach that we used. We started with the idea of extracting entities and relations. In terms of datasets, we had the Kaggle dataset, which is again a Medical NER dataset, and we could also leverage what we call the generic BC5CDR dataset. In terms of the base language model, we have a few options here. We can use a generic BERT model. We can also pre-train BERT on medical text. And finally, we can use the PubMedBERT pre-trained model by Microsoft Research. I also want to show you a snapshot of the data that we used for this project. This is a very nice snapshot that gives you an overview of the type of data that we used. In general, the dataset is stored in several different parts.
So there is a metadata file that contains the most important information for all publications in one place. This information includes, for example, the title of the publication, the journal, the authors, the abstract, and also the date of publication. Then there is an additional part of the dataset that holds the full-text papers. These are in a directory called document_parses, which contains structured text in JSON format. And then we also have pre-built document embeddings that map cord_uids to float vectors that reflect some overall semantics of the paper itself. With this type of data, what we did next is use what we call NER. NER stands for named entity recognition, and we can use it to extract named entities from text and also to determine their type. In this specific case, the [inaudible] classification model supports NER, named entity recognition, and also other token-level classification tasks in general.
In general, this is the task of detecting and classifying key information, also called entities, in text. In other words, our model takes a piece of text as input, and for each word in the text, the model identifies the category that the word belongs to. Also, as I was mentioning before, in order to draw some insights from the text, we decided not to use only NER, even if it was probably the best technique available for us to use. If we can understand the specific [inaudible] that are present in the text, we can also perform semantically rich search in the text itself that answers specific questions. And we can also try to obtain data on the co-occurrence of different entities.
So in order to train the NER model, we need a much larger dataset. This is a very common problem for data scientists, and as you can understand, most of the time finding those datasets is a very, very challenging task. The good news, as you can see from this slide and from the architecture, is that transformer language models can be trained in a semi-supervised manner using what we call transfer learning. First, the base language model, in this example BERT, is trained on a large corpus of text.
After that, we can specialize it to a specific task such as, for example, classification or NER, [inaudible]. Again, this is called the transfer process, and we can actually extend this process by one additional step: further training of the generic pre-trained model on a domain-specific dataset. For example, in this project we decided to work in the area of medical science. Microsoft Research has a pre-trained model called PubMedBERT, which was trained using text from the PubMed repository. So we decided to extend the process by this further step, and to use PubMedBERT as well. This model can be further adapted to different specific tasks, again depending on the specialized datasets that are available. As you can see, this is how we used PubMedBERT from Microsoft Research. We also decided to run all of this on Azure Machine Learning. Azure Machine Learning is the technology that we used in order to run all these models, also in parallel.
So for this exercise, we used Azure Machine Learning. I don't know if you're already familiar with it, but in general Azure Machine Learning is a cloud-based environment that you can use to train, deploy, automate, manage, and also track all your machine learning models. Azure Machine Learning can be used for, I would say, any kind of machine learning application, from classical machine learning to deep learning, supervised and unsupervised learning. You can write in Python, which is what we used for this project, but also in R. Most of the time you use what we call the Azure ML SDK for Python, and you can build your model from scratch using just Python. Azure Machine Learning provides all the tools that developers and data scientists need for their machine learning workflows, including the Azure Machine Learning designer.
This is just [inaudible] modules, but there are also Jupyter Notebooks, which are probably the most used notebooks for data science and machine learning. And again, you can just leverage the Azure Machine Learning SDK for Python in order to get started with Azure Machine Learning and with machine learning on the cloud. Finally, there is also something that might be really great for data scientists: the machine learning CLI, an Azure CLI extension that provides commands for managing all of the different resources in Azure Machine Learning Studio. Azure Machine Learning Studio, and the workspace that you can create in the studio, is also a very nice environment that you can leverage as a data scientist in order to have all the resources you need to build your end-to-end machine learning solution in one place, and then also to share your solutions with other data scientists and other coworkers. So I will let Dmitry show you a little bit more about how we actually got started with this project and what we did in more detail.
Dmitry Soshniko…: Okay, let me show you how Azure Machine Learning looks for a developer. The main thing in Azure Machine Learning is the workspace, which contains everything, including your data, your computing resources, notebooks, and so on. Once you create the Azure Machine Learning workspace in Azure Portal, you end up in this Azure Machine Learning Studio, where you have all those resources available to you on the left. For example, you can see the computing resources I have, such as GPU and CPU computes that I can use to run notebooks. I have also defined compute clusters, which I can use for larger-scale model training. To begin training the NER model, I will use Jupyter Notebooks. So I will go to this notebook section, and here I have the notebook for PubMedBERT training.
So I will just briefly show you the mechanics of how it works. We start with the existing PubMedBERT model, which we can load directly using the transformers library. Luckily, this PubMedBERT model is published on the official Transformers site, so we don't need to look for a place to download it from. We just need to know the name, and we can automatically instantiate the model and the tokenizer, which is used to break up the text into separate tokens. Then we load the dataset. We use the BC5CDR dataset, which contains a large number of entities from PubMed. Unfortunately it's not recent enough, but it was our first starting point. The dataset looks like this, and we then need to do some data processing to end up with a token representation.
So we can see that there are two token types in this dataset, chemicals and diseases. We need to convert the dataset to the so-called BIO encoding, which maps every token to its entity class and also specifies whether it's the beginning of an entity, inside an entity (an inner part of it), or some other token; O stands for other token. So there would be essentially five token classes, and our model would be a token classification model with five classes. We then create this dataset in PyTorch and do the training. We instantiate the BertForTokenClassification model and specify the number of classes. Then, essentially, we need to train the classifier on top of that model. And you can see that we can run this in the notebook.
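As a rough illustration of the BIO encoding described here, the sketch below tags a tokenized sentence given entity spans. The `bio_encode` helper, the label spelling (`B-Chemical`, `I-Disease`, and so on) and the sample sentence are assumptions made for this example; they are not taken from the notebook shown in the talk.

```python
def bio_encode(tokens, spans):
    """Convert entity spans to BIO tags.

    tokens: list of token strings
    spans:  list of (start, end, label) over token indices, end exclusive
    """
    tags = ["O"] * len(tokens)          # default: token is outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags

# Two entity types give five token classes in total
LABELS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]
label2id = {label: i for i, label in enumerate(LABELS)}

# Hypothetical example sentence, already split into tokens
tokens = ["Patients", "with", "viral", "pneumonia", "received",
          "hydroxychloroquine", "."]
spans = [(2, 4, "Disease"), (5, 6, "Chemical")]

tags = bio_encode(tokens, spans)
ids = [label2id[t] for t in tags]       # integer targets for a classifier
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The integer `ids` are the kind of per-token targets a five-class token classification head would be trained against.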
I will not show it, just for the sake of time. We will end up with a model that can take an arbitrary piece of text and produce the list of token IDs. So it will extract entities like hydroxychloroquine, or HCQ, which is the same hydroxychloroquine in a short form, and so on. You can see that the model somehow works, but it cannot extract some recent diseases, like, for example, COVID-19. Training this model in the notebook is not the best idea, because it takes quite some time to train. The better way to use Azure Machine Learning is to schedule this training as an experiment. So let me switch to the slides and show you how this works. To schedule this training to run as an Azure Machine Learning experiment, we need to define several entities.
First of all, we need to define the dataset. To do that, we can use a YAML file, this BC5CDR YAML, and then we can use CLI commands to upload it into the Azure Machine Learning workspace. So we just say "az ml data create" and specify this YAML file. What this does is upload the data to the Azure Machine Learning workspace storage and define the dataset on top of it, which we can then use for our experiments. Second, we need to define the environment in which the experiment will run. Defining the environment normally means building a container around it, with all the required libraries. Azure Machine Learning can do it for us; we just need to define what we need in our environment. In this case we can start with the standard Docker image for Microsoft Azure ML, which supports GPU.
And then we specify the conda YAML file, which lists all the libraries and all the versions that we need for our script to run. Once we have defined those two files, we can say "az ml environment create", and this will create the environment for us. When we want to train later on the cluster, it will take this container, apply the necessary changes and run this environment on the cluster. Finally, to do the actual training, we need to describe the experiment. An experiment essentially means that we want to run a certain script on the cluster. So here we will specify the input dataset, and this is the dataset that we created previously, BC5CDR. We will specify the compute target where this experiment will run. In this case, I will use one of the clusters that I showed you at the beginning, the one that supports GPU, because that's essential for Transformers. And I will specify the code to run and the command to run: python train.py.
And, as the data parameter, I will pass inputs.corpus, which will automatically convert the path to the dataset into a local path for Python. So in Python, when I want to do training, I will just access this data as a normal directory. Once I have done that, if needed, I can also create the compute cluster from the command line; in my case, I did it previously through Azure Portal. And then you just submit the job. Once you submit the job, it starts running, and you can see it on the portal. Let me just show you the dataset that I have created. It is available here in the portal as well, BC5CDR, and I can actually click and explore the dataset from here. I can get the code that I can use from Python to get access to this dataset.
I can also have a look at the dataset here. If I want, I can define the dataset by uploading the file directly here on the portal, and not from the command line. And then, the experiment that I have submitted was called nertrain. I can see it here in the experiments section, and here are all the different runs of the experiment. You can see I have a lot of failed runs from when I was setting up this demo. I had, of course, some misconfigurations and some problems, but then the final experiments ran for two hours and for three hours, and that is how long the NER training took on the GPU cluster. So I can go into the experiment, and I have data associated with that experiment. In outputs plus logs, for example, I can see all the textual output that my experiment produced.
So whenever I print something to the console in my training script, it will be captured here. I can see how my training took place, everything that was printed, and the fact that it finished successfully. All the other text files show how the preparation took place, how the environment was created and so on. So if I run into some problems, I can get a very detailed explanation here. And what's interesting is that the experiment captures the output directory from my compute. If I place something in the outputs directory, it will be captured. In my case, after training the model, I save the result into outputs. And here it is, PubMedNER; this is the final trained model. I have also captured all the checkpoints here, but that is something I probably would not need. I need the final model to use it.
So once I have done this training, I probably want to get this model out of Azure ML, and this is how I am able to do it. I can get the data associated with any experiment by providing its ID in the az ml job download command, and this will download the output directory to my local machine, so I can easily fetch the trained model and use it anywhere else. Here is an example of how this model performed on one of the abstracts from our CORD dataset. You can see that it correctly recognizes some diseases and some chemicals, but, for example, COVID-19 is not recognized, because the dataset that we used is rather old. So, unfortunately, that's the best we can do with existing datasets. Also, to do better entity extraction we need some other entity types, like, for example, different biological fluids or different ways of treatment. It would be nice to be able to extract those, and also to extract some common entities such as quantities, temperatures, and so on, which can be helpful in analyzing papers.
Francesca Lazze…: Okay. Thank you for these wonderful demos. So Dmitry mentioned that we are also going to use some additional tools in order to make sure that our project is going to be successful for the scientific community. One of the additional tools that we were thinking about using, and that we actually ended up using, is Text Analytics for Health. Text Analytics for Health is a Cognitive Service that exposes a pre-trained PubMedBERT model with some additional capabilities. It is important to mention here that Text Analytics for Health is a preview capability that is provided by Microsoft as is, so with all faults. As a consequence, you can use Text Analytics for Health for your solutions and for your projects, but it should not be implemented or deployed in any production environment or for any production use.
Text Analytics for Health is also very interesting because it is a preview feature of the Text Analytics API. So it's actually a feature of the Text Analytics API service that is able to extract and label relevant medical information from unstructured text, such as, for example, doctors' notes, discharge summaries, or different types of clinical documents. It can be used through a web API or a container, and it supports named entity recognition, as you can see, and also relation extraction. So these are the different techniques that it supports, along with entity linking, which is more like ontology mapping, and negation detection.
What is nice, and I also mention this in this slide, is that it supports what we call named entity extraction. As you can see from this slide, this means that it detects words and phrases mentioned in unstructured text that can be associated with one or more semantic types, such as a diagnosis, a medication name, symptoms and signs, or age. The other important aspect is relation extraction. Relation extraction is very important because it identifies meaningful connections between the concepts that are mentioned in the text itself. For example, a time of condition relation is found by associating a condition name with a time, and there are also relations between an abbreviation and its full description.
So yes, you can understand that this is another capability that is very, very helpful for doctors and for the scientific community in general. Both named entity extraction and relation extraction are capabilities that are supported.
Dmitry Soshniko…: Actually, it is very good news that Text Analytics for Health is available, because when you're using Text Analytics for Health, you don't need to concentrate on how you deploy your model. It is deployed automatically, and it supports scaling, so it is very convenient to use this model for our tasks. To use it, we can either query it using the REST API, or we can use the Text Analytics SDK, which is the Python SDK that supports working with Text Analytics. To install it we can just use pip install, and we need to specify a very specific version of this SDK, the beta version, which supports Text Analytics for Health. And then calling it is actually very simple: you instantiate the client object, where you specify the API version, which also needs to be a preview version that supports Text Analytics for Health.
And then, to call it, you just pass in the document and do one function call. You can pass either one document or a collection of up to 10 documents to be processed in one query. So, for example, if we take this phrase, "I have not been administered any aspirin, just 300 milligrams of Favipiravir daily," a phrase that contains a lot of medical terms, we will get something like this in return. It will be a list of healthcare entities, where for each entity we get the text itself, the category (what kind of entity it is), the position inside the text, the confidence, all the related entities, and links to ontologies.
Linking to ontologies is very convenient, because the same concept can appear under different names, like, for example, COVID-19, SARS-CoV-2, or coronavirus. Well, coronavirus is not exactly the same entity, but SARS-CoV-2 and COVID-19 are two different names for the same entity, and the service is able to map them to the same ontology ID automatically. Also, in terms of relations, you can see that here in this text, 300 milligrams is the dosage for Favipiravir, and that is extracted by the service as well.
So you can imagine that if we have scientific papers mentioning different treatments for COVID-19, we will be able to automatically query what dosages of medications are mentioned in the text, and that is in fact our goal. So now, what do we need to do if we want to analyze all abstracts from the CORD dataset using Text Analytics for Health? As we mentioned earlier, there are 400,000 papers. What we did is split them into chunks of 500 papers each, so that we don't have to wait for all the processing to end and can start analyzing and playing with the data early. Those chunks of 500 papers we stored as JSON files, containing title, abstract and so on.
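The chunking described here, together with the limit of 10 documents per query mentioned earlier, comes down to simple batching. The sketch below uses invented placeholder records and a hypothetical `chunked` helper; it makes no calls to the service and only shows how the papers could be grouped.

```python
import json
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Placeholder records standing in for the real paper metadata
papers = [{"id": i, "title": f"Paper {i}", "abstract": f"Abstract {i}"}
          for i in range(1234)]

# 500 papers per JSON chunk, as in the talk
file_chunks = list(chunked(papers, 500))
print([len(c) for c in file_chunks])   # → [500, 500, 234]

# Inside one chunk, abstracts are sent at most 10 per request
request_batches = list(chunked(file_chunks[0], 10))
print(len(request_batches))            # → 50

# Each chunk serializes to a single JSON document
payload = json.dumps(file_chunks[0])
```

Working in chunks means a partially processed corpus is already usable: each finished JSON file can be inspected or loaded while the rest are still being scored.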
And then we essentially need to take each JSON file and enrich it with the entities and relations. We don't want to wait a long time for this processing to happen, because if we just ran it on one machine, it would take something like 11 or 12 hours. So what we wanted is to be able to do this in parallel, using a cluster of computers. For that, we can also use Azure Machine Learning. Azure Machine Learning originally supports clusters for distributed training or hyperparameter optimization, where you run different experiments with different parameters on different machines in the cluster, but we can also use it for jobs like this, for parallel sweep jobs. In fact, there is a specific type of job in Azure ML called the sweep job.
The idea is that we will have several machines in the cluster running in parallel. They will fetch data from the same dataset and then store the results in a common place for output. To do this, we define this cognitive sweep job, a parallel sweep job; the type is sweep job. The idea of a sweep job is that we can specify hyperparameters, which are sampled from different distributions. In this case, we have one hyperparameter called number, and it is sampled from the values 0 and 1. So in this simple case, I will just run two nodes in the cluster. Essentially, I start two instances of the script in parallel, and each instance runs the file called process.py, giving it the number of nodes, two, and the number, which is the search space parameter and is either zero or one for the different tasks.
And I must also provide the input dataset path, so I define the input dataset here as well. The processing script itself is pretty simple. I just need to read the original dataset as the [inaudible] file, for example using Pandas. Then I go through each row, and if the remainder of the row number when divided by the number of nodes corresponds to the number of this node, I process this record. So that's pretty simple code. And let me show you how those experiments look in Azure Machine Learning.
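The row-selection rule described above (process a row only if its index modulo the number of nodes equals this node's number) can be sketched as follows. The argument names `--number` and `--nodes` and the helper `rows_for_node` are assumptions for illustration; the actual process.py from the talk is not shown in the transcript.

```python
import argparse

def rows_for_node(rows, number, nodes):
    """Return the rows this node is responsible for.

    A row with index i belongs to the node whose number equals i % nodes,
    so the nodes split the data without overlap and without coordination.
    """
    return [row for i, row in enumerate(rows) if i % nodes == number]

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--number", type=int, required=True,
                        help="index of this node (the sweep parameter)")
    parser.add_argument("--nodes", type=int, required=True,
                        help="total number of parallel nodes")
    return parser.parse_args(argv)

# Simulate the instance that the sweep job starts with number=1 out of 2 nodes
args = parse_args(["--number", "1", "--nodes", "2"])
rows = [f"paper-{i}" for i in range(7)]   # placeholder records
mine = rows_for_node(rows, args.number, args.nodes)
print(mine)   # → ['paper-1', 'paper-3', 'paper-5']
```

Because each node derives its share purely from its own number, adding more nodes only requires widening the set of values the sweep parameter can take.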
So if I go again to experiments here, here is my cognitive sweep experiment. I have several runs again, and each run here is a sweep job run. It contains two child runs, which are listed here. You can see that they have different IDs and a different combination of parameters: one goes with number zero and the other with number one. If I go to child runs, I can actually view all those runs independently. I can select, for example, this one and see what the output was and what the parameters were. So here is my script that did the data processing. As a result of this parallel processing, we end up with a bunch of JSON documents which look like this. For each paper, we have the ID, title, authors, and then the collection of entities and the collection of relations. You can see that this data is clearly semi-structured, which makes it a very good fit for NoSQL databases.
That is because it is inherently hierarchical. We have a collection of papers, each paper has a collection of entities, and entities are related to each other. So, to store those in Azure, the best solution would be to use a NoSQL database called Cosmos DB.
Francesca Lazze…: So let me introduce Cosmos DB a little bit, and why we decided to use it. Azure Cosmos DB is a globally distributed, multi-model database that supports document, graph and key-value data models. Azure Cosmos DB does not require any schema, which is great, or secondary indexes in order to support querying over documents in a collection. This was another feature, another capability, that was very, very helpful for us. By default, documents are automatically indexed in a consistent manner, which makes a document queryable as soon as it is created, and as you can understand this is another great capability to have. In general, the process is the following: documents are stored within collections, a document can contain one or more attachments, and finally users can access these documents, and this access can be managed via different permissions. So it's also great from a collaboration point of view.
Dmitry Soshniko…: Okay. Yeah. So the whole idea of using Cosmos DB for storing this data is to be able to query it using SQL queries. Of course, once we end up with JSON files, we could write Python scripts to transform the data in any way we like, but that is more complex than just writing a simple query. For example, consider the idea above: we want to see which papers mention which medications, and in which dosages. To do that, we can just write one query like this. We select the title of the paper, joining relations, such that the relation is dosage of medication and the target of this relation, which is the medicine, would be hydroxychloroquine, for example; it contains the word "hydro". This gives us a result that is very useful for a researcher, because they would then be able to go to those papers and study them in more detail.
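The query on the slide is not reproduced in this transcript, so the following is only a sketch of what such a query could look like. The property names (`relations`, `relationType`, `source.text`, `target.text`) and the relation label `DosageOfMedication` are assumptions. The SQL is kept as a string, and a small pure-Python function applies the same join-and-filter logic to in-memory documents, so the idea can be tried without a database.

```python
# Sketch of a Cosmos DB SQL query in the spirit of the one described above.
# Property names and the relation label are assumptions for illustration.
QUERY = """
SELECT p.title, r.source.text AS dosage, r.target.text AS medication
FROM papers p
JOIN r IN p.relations
WHERE r.relationType = 'DosageOfMedication'
  AND CONTAINS(r.target.text, 'hydro')
"""


def dosage_mentions(papers, needle="hydro"):
    """In-memory equivalent of QUERY: one row per matching relation."""
    rows = []
    for p in papers:
        for r in p.get("relations", []):  # JOIN r IN p.relations
            if (
                r["relationType"] == "DosageOfMedication"
                and needle in r["target"]["text"]  # CONTAINS(...)
            ):
                rows.append(
                    {
                        "title": p["title"],
                        "dosage": r["source"]["text"],
                        "medication": r["target"]["text"],
                    }
                )
    return rows


# Made-up sample documents, used only to exercise the function.
sample_papers = [
    {
        "title": "Paper A",
        "relations": [
            {
                "relationType": "DosageOfMedication",
                "source": {"text": "400 mg"},
                "target": {"text": "hydroxychloroquine"},
            },
            {
                "relationType": "DosageOfMedication",
                "source": {"text": "200 mg"},
                "target": {"text": "remdesivir"},
            },
        ],
    },
    {"title": "Paper B", "relations": []},
]

rows = dosage_mentions(sample_papers)
print(rows)
```

The `JOIN ... IN` clause is what flattens the nested relations array, so each output row corresponds to one relation rather than one paper.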
Let me show you how this query can be executed in the Azure Portal. To show you the Cosmos DB database, I need to switch from Azure Machine Learning Studio to the Azure Portal. Here is my Cosmos DB database in the Azure Portal. It contains a built-in Data Explorer, which is a very useful way to look at the data and do all kinds of data manipulation right here in the portal. You can see that I have the CORD database, and it has a couple of collections; one of them is called papers. In papers you can select items, and here are all the processed papers, almost 400,000 of them, and each paper is a JSON document. We can select one and have a look at the document: it contains the abstract, title, publishing time, and then the list of entities and relations. To query those, we can select a new SQL query, and we can say, for example, SELECT * FROM papers p, which would give us essentially the list of all papers, which is probably not the most interesting thing.
But it's nice to make sure that it works. Yeah, we have the JSON result. Now, to come back to our example with medications and dosages, here is the query that I showed you previously on the slide, selecting the title of the paper and the dosages of related medications. We can execute it, and in a second we see the result in JSON format: here is the list of all papers and the medications with the corresponding dosages. That looks pretty good, but normally we don't want just this result; we want to analyze the data somehow. And to analyze the data we typically use Python, because Python contains a lot of rich functions for manipulating data.
Luckily, we can use Python here in Cosmos DB as well, because Cosmos DB contains built-in notebooks. Those notebooks are almost traditional Jupyter Notebooks, but they contain some additional extensions. For example, here I can run the SQL query right inside the cell. This SQL query can output the result as a table in the document, but it can also output to a Pandas DataFrame. So what I do in this query is say that I want to query my database and output everything into a Pandas DataFrame called meds. What I select here is the paper itself, title, publishing time and so on, and also the medication and the ontology ID related to that medication, because I want to study unique medications, regardless of the term which was used to mention them.
So after running this query, what I get is this table. I have medications on the right, their UMLS ID, and also all the papers and publishing times. Why is this good? Well, I can produce some interesting insights from this data. Here is a very simple insight, for example: I can see how many entities were mentioned in different time periods. You can see that publication activity started in March last year and has been increasing since then. And this is the time when we were analyzing the data, which is why it drops: we only analyzed half a month. Then I can come up with the table of the most frequently mentioned medications, and also join them by using ontology IDs. You can see that the most frequently mentioned medication is hydroxychloroquine, which is like the oldest approach to COVID treatment, then chloroquine and so on.
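As a rough illustration of joining mentions by ontology ID, as described above, here is a small standard-library sketch. The ontology IDs (`ID-A`, `ID-B`, `ID-C`) are made-up placeholders rather than real UMLS identifiers, and the function name `top_medications` is hypothetical.

```python
from collections import Counter, defaultdict

# Each mention is (surface text, ontology ID). The IDs are made-up
# placeholders, not real UMLS identifiers.
mentions = [
    ("hydroxychloroquine", "ID-A"),
    ("HCQ", "ID-A"),
    ("Hydroxychloroquine", "ID-A"),
    ("chloroquine", "ID-B"),
    ("chloroquine", "ID-B"),
    ("remdesivir", "ID-C"),
]


def top_medications(mentions, n=10):
    """Count mentions per ontology ID and label each ID with its most
    frequent (lower-cased) surface form."""
    counts = Counter()
    names = defaultdict(Counter)
    for text, ontology_id in mentions:
        counts[ontology_id] += 1
        names[ontology_id][text.lower()] += 1
    return [
        (oid, names[oid].most_common(1)[0][0], count)
        for oid, count in counts.most_common(n)
    ]


ranking = top_medications(mentions)
print(ranking)
```

Counting by ID rather than by text means that an abbreviation and the full name contribute to the same total.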
And we can also count the number of negative mentions, to see how negatively this medicine is mentioned. That is probably not a very good estimate of how good or bad the medicine is, but it can give some indication of the context. Finally, what I do here is take the top 15, I think, most mentioned medications, and for them we do some further processing. We group them by month and calculate the number of positive and negative mentions each month. In the end this gives us a nice graph like this, which reflects different approaches to COVID treatment. For example, you can see hydroxychloroquine, which is the popular medicine. It was mentioned and discussed a lot in the beginning, and then this graph drops, even though the number of publications increases. Still, this is not the percentage; this is the absolute number.
So if we also divided by the number of publications, we would probably see an even steeper drop in hydroxychloroquine, while other treatments, like remdesivir and favipiravir, for example, become more and more discussed. So they become kind of more popular. Those graphs allow us to see how the treatment of the disease changes over time, which is, I think, a very useful insight. So to summarize, we have used Jupyter Notebooks and Cosmos DB, and they were really useful because we could run SQL queries and transform the result into a Pandas DataFrame. This is a good strategy once we end up with semi-structured data: we can use a SQL query to create a structured table out of it, and then process the structured table using Pandas. And a lot of the processing takes place on the database side, so we don't have to worry about that.
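The monthly grouping and the normalization by number of publications mentioned above could be sketched in Pandas roughly as follows. The column names (`title`, `publish_time`, `medication`, `is_negated`) and the sample rows are assumptions; the actual columns of the `meds` DataFrame are not shown in the transcript.

```python
import pandas as pd

# Made-up sample of the structured table; column names are assumptions.
meds = pd.DataFrame(
    {
        "title": ["P1", "P2", "P3", "P3", "P4", "P5"],
        "publish_time": [
            "2020-03-10", "2020-03-20", "2020-04-05",
            "2020-04-05", "2020-04-15", "2020-04-25",
        ],
        "medication": [
            "hydroxychloroquine", "hydroxychloroquine", "hydroxychloroquine",
            "remdesivir", "remdesivir", "remdesivir",
        ],
        "is_negated": [False, True, False, False, False, False],
    }
)

# Bucket every mention by calendar month, e.g. "2020-03".
meds["month"] = pd.to_datetime(meds["publish_time"]).dt.strftime("%Y-%m")

# Total and negative mentions per (month, medication).
monthly = (
    meds.groupby(["month", "medication"])["is_negated"]
    .agg(total="count", negative="sum")
    .reset_index()
)
monthly["positive"] = monthly["total"] - monthly["negative"]

# Distinct papers per month, used to turn absolute counts into a share.
papers_per_month = (
    meds.groupby("month")["title"].nunique().rename("papers").reset_index()
)
monthly = monthly.merge(papers_per_month, on="month")
monthly["share"] = monthly["total"] / monthly["papers"]

print(monthly)
```

In this toy data the absolute count for hydroxychloroquine falls from two mentions to one, while its share falls from 1.0 to about 0.33, which is the kind of steeper drop that normalization can reveal.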
So to show you some final results of what we managed to achieve, this is the graph that I have already demonstrated, which shows the changes in COVID treatment over time. Then we can also plot things like relations between different terms. For example, here we take diseases on the right and different treatment strategies on the left. So there is prevention, using drugs, using oxygen, using placebo, for example, or mechanical ventilation, and we can see how often different means of treatment are discussed. You can see that a lot of papers are indeed focused on preventing the disease: prevention and masks are among the most popular topics. Then there is also mechanical ventilation and supplemental oxygen. And we can also see that this is the same for other terms, like infection: for infection, the most frequently discussed thing is also prevention.
It would be interesting, of course, to also see how this changes over time, and that can be easily demonstrated as well; here we just show the overall number of mentions. We can also calculate the number of mentions of diseases and of different medicines. Here COVID-19 is of course the most frequently discussed disease, because the CORD dataset is focused on COVID-19. And we can see that hydroxychloroquine is the most popular medicine, then lopinavir and so on. Quarantine, for some reason, is also categorized as a medicine.
Finally, it would also be interesting to find co-occurrences of terms. For example, if we take medications, we can see which medications often occur together. This is the diagram on the left. You can see that, for example, hydroxychloroquine and azithromycin, which is a popular, well-known combination, is indeed visible in this diagram. On the right, we can see which treatments go together: oxygen and ventilation, quite naturally. So I think those insights are quite interesting, but if we do some realistic medical studies, we need a deeper understanding of the texts. And we ideally want to be able to communicate this semantic information to medical researchers as well, not only to data scientists like us who can produce those graphs and SQL queries. In order to do that, we can use an interactive visualization tool called Power BI.
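The co-occurrence counting described above can be sketched with the Python standard library. This is only one simple way to do it, assuming we already have, for each paper, the list of medication names mentioned in it; the sample data and the function name `cooccurrence` are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(papers_entities):
    """Count how many papers mention each unordered pair of terms."""
    pairs = Counter()
    for names in papers_entities:
        # Deduplicate within a paper and sort, so each pair has one
        # canonical ordering and is counted at most once per paper.
        unique = sorted({name.lower() for name in names})
        pairs.update(combinations(unique, 2))
    return pairs


# Made-up example: medications mentioned in three papers.
papers_entities = [
    ["hydroxychloroquine", "azithromycin"],
    ["Hydroxychloroquine", "azithromycin", "remdesivir", "azithromycin"],
    ["remdesivir"],
]

pairs = cooccurrence(papers_entities)
print(pairs.most_common(3))
```

The resulting pair counts are the kind of input a chord or network diagram needs, with the count used as the edge weight.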
Francesca Lazze…: I totally agree. Data visualization is probably one of the most important techniques that data scientists, and also developers, have in order to not only understand the results of their models and algorithms, but also to extract additional insights from the data sources they are using and from the results they are producing. That is why we decided to use Power BI as the final piece of this end-to-end solution. Power BI is great for business experts as well as for data scientists and developers who want to create nice visualizations around the data and the different data sources they are using for their projects. In simple terms, Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into interactive and meaningful insights.
There are many different ways to leverage Power BI's capabilities, and also many different data sources that you can use in order to visualize your data. Your data may be in an Excel spreadsheet or in a collection of cloud-based data warehouses, for example. Power BI can access all these different data sources and you can start visualizing them. Power BI is also based on different components, as I was mentioning earlier, which all work together in a very nice way, and there are three basic ones. The first is a Windows desktop application called Power BI Desktop. Then there is an online SaaS, software as a service, called the Power BI service. And then we also have the Power BI mobile apps for different devices. These three elements, Power BI Desktop, the service, and the mobile apps, are designed to let you create, share, and consume different scientific and business insights, and to communicate and share those with others; in this case, with the scientific community.
Dmitry Soshniko…: So let me show you the dashboard that we have created in Power BI Desktop to access our collection of data. As Francesca mentioned, Power BI Desktop can be used to connect to different types of data. Here we can connect from Power BI Desktop directly to Cosmos DB by selecting it here. So what we have done is connect to Cosmos DB and type in the SQL queries above to get the list of entities and relations. Here on the first tab we can see the list of entities: on the left we can see entity types and on the right the actual entities. For example, if I go with medication name, you can see the number of times each entity appears in the text. When I select this medication name, the table on the right is automatically filtered by this type, and I can see all medications.
So here hydroxychloroquine and HCQ are the top two most popular ones. And those are the same concept, because you can see they have the same ontology ID. This is an easier way for a medical professional to explore this database, because ideally you can also build the list of papers where this is mentioned on the right, and those kinds of things. If I want to do the query with dosages of medication that we considered before, I can switch to the relations tab. Relations work pretty much the same way. I can select dosage of medication here, and I get the table on the right with all the different dosages and medication names. Then I can filter by a certain medication name. That is a nice way for me to explore the dataset. This concludes our presentation. I think the main takeaway that we want you to have from this talk is that text mining for medical texts is a really valuable resource for gaining insights from a large corpus of text, and it can speed up medical research a lot.
We have shown a conceptual demo of how this can be done. But to apply it to different areas of science you would need to train your own entity recognizer, because for medicine we have used a cognitive service, but for other areas there might not be existing trained models. The process that we showed in the beginning, how to train the model on Azure ML, can be really useful there. We have also used a lot of technologies in Microsoft Azure during this demonstration. We have shown you how to use Azure ML to train the models individually, and how to use a parallel sweep job to run a series of tasks on the cluster. We have discussed and shown you Text Analytics for Health to do named-entity recognition and ontology mapping.
We have used Cosmos DB to store and query semi-structured data, then Power BI to explore this data interactively, and finally Jupyter Notebooks to do more detailed data analysis. These technologies can be used very well together to produce the final result of being able to analyze medical texts and gain insights. You can find more information about the different technologies that we have used in the resources shown on this slide. Most of the process that we have discussed is also covered in the blog post above, which also shows some code samples. Then you can see links to Text Analytics for Health, to Cosmos DB, and to all the individual components of Microsoft Azure that we have discussed. You can also learn in more detail about almost any Microsoft technology at the resource called Microsoft Learn, which contains a lot of nice courses. Thank you for your attention. If you have any further questions, ideas, or feedback, we will be really glad to hear from you. Thank you.
Francesca Lazze…: Thank you very much.
Dmitry is a Microsoft veteran, working for more than 13 years. He started as a Technical Evangelist, and in this role presented on numerous conferences, including twice being on stage with Steve Ballm...
Francesca Lazzeri, PhD is an experienced scientist and machine learning practitioner with over 12 years of both academic and industry experience. She is author of the book “Machine Learning for Time...