Automated and Explainable Deep Learning for Clinical Language Understanding at Roche


Unstructured free-text medical notes are the only source for many critical facts in healthcare. As a result, accurate natural language processing is a critical component of many healthcare AI applications, like clinical decision support, clinical pathway recommendation, cohort selection, and patient risk or abnormality detection. Recent advances in deep learning for NLP have enabled a new level of accuracy and scalability for clinical language understanding, making a broad set of applications possible for the first time.

The first part of this talk will cover the deep learning techniques, explainability features, and NLP pipeline architecture that have been applied. We’ll provide a short overview of the key underlying technologies: Spark NLP for Healthcare, BERT embeddings, and healthcare-specific embeddings. Then, we’ll describe how these were applied to tackle the challenges of a healthcare setting: understanding clinical terminology, extracting specialty-specific facts of interest, and using transfer learning to minimize the required amount of task-specific annotation. The use of MLflow and its integration with Spark NLP to track experiments and reproduce results will also be covered.

The second part of the talk will cover automated deep learning: the system’s ability to train, tune and measure models once clinical annotators add or correct labeled data. We will cover the annotation process and guidelines; why automation was required to handle the variety in clinical language across providers, document types, and geographies; and how this works in practice. Providing explainable results – including highlighting evidence in the text for extracted semantic facts – is another critical business requirement that we’ll show how we’ve addressed. This talk is intended for data scientists, software engineers, architects and leaders who must design real-world clinical AI applications and are interested in lessons learned applying the latest advances in NLP and deep learning in this space.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hello, everyone, hope you’re having a good time here at Spark+AI Summit 2020, thank you for being here. My name is Vishakha Sharma, I’m a principal data scientist at Roche, and I will be co-presenting our talk with my colleagues Yogesh Pandit, who’s a staff software engineer at Roche, and John Snow Labs’ CTO David Talby.

Automated & Explainable Deep Learning for Clinical Language Understanding at Roche

And the title of our talk is Automated and Explainable Deep Learning for Clinical Language Understanding at Roche. In the first part of the talk, we will cover why patients and doctors need accurate and automated natural language understanding at scale. In the second part of the talk, we will cover how we built deep learning NLP and OCR models and pipelines that address the challenges. And last, we will cover achieving state-of-the-art accuracy on real healthcare data in production. Full disclosure: Roche is a happy customer of John Snow Labs, and we are co-presenting this talk to give a high-level overview of Roche’s use of Spark NLP, a product from John Snow Labs. Nothing contained or stated herein during the presentation constitutes a Roche endorsement of John Snow Labs’ products. John Snow Labs is fully responsible for the accuracy and completeness of any statements related to John Snow Labs products, including the products’ performance.

To tell you a little bit about Roche: it is a 120-year-old company with headquarters in Basel, Switzerland. It has two main business divisions, diagnostics and pharmaceuticals. It is number one in in-vitro diagnostics and the leading provider of cancer treatments worldwide. Within diagnostics, we are a team called Diagnostic Information Solutions.

Our primary focus is the NAVIFY decision support portfolio, where we are working mainly in oncology.

NAVIFY Tumor Board (NAVIFY TB) is a cloud-based workflow product that securely integrates and displays relevant aggregated data in a single holistic patient dashboard for oncology care teams to review, align, and decide on the optimal treatment for the patient. The clinical decision support apps ecosystem is secure and fully integrated with NAVIFY Tumor Board. We have three clinical decision support apps: NAVIFY Guidelines, NAVIFY Clinical Trial Match, and NAVIFY Publications Search.

Unstructured healthcare data challenges for NAVIFY portfolio

For a cancer patient, a large number of data points get generated along their journey: for example, genomics, pathology, radiology, and clinical data. The goal here is to navigate the complexity of the patient’s journey and generate a longitudinal view by unlocking these data sources. For a more comprehensive view, unlocking unstructured data is very important, because a lot of the time this is where diagnostic and treatment information lives. These data allow us to do clinical decision support and population analytics. For this talk, we are going to focus on unstructured text data in the pathology domain. Here is an example of a pathology report.

Sample Pathology Report

Pathology reports are very diverse. They have jargon, tables, key-value pairs, and handwritten notes, as you can see in the sample pathology report on the right side.

Manually Curated Report

In a lot of cases, when a sample report gets reviewed by a pathologist, it looks like this. How many of you have seen something like this? This is handwritten text. And if you read closely, you can see, they are talking about tumor site, tumor staging, ICD codes, and a bunch of other things. All of these annotations make this report extremely valuable, but the challenge is, how do we extract all this information?

The NAVIFY team identified two significant needs

It is quite clear after looking at these reports that along with NLP, we will need OCR to efficiently extract information. What is NLP? Natural Language Processing is a field of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, and makes it possible to fill the gap between human communication and computer understanding. What is OCR? Optical Character Recognition is the recognition of printed or written text characters by a computer. So we need natural language processing with high accuracy, specialized for medical data, with minimized time to train models, and that can be extended to new content types. We need OCR with high accuracy and the ability to retain document structure like tables, lists, and backgrounds. We had a number of requirements for the tools and services that would help us meet these needs: scalability, compliance, and the low-cost ability to run on-prem or in the cloud. The success of NLP approaches heavily depends on being able to understand the domain, and as a first step, we want to identify named entities from the domain-specific documents. These entities are highly specific to the use cases.

45+ Oncology Entities to Extract

Healthcare data is extremely heterogeneous and complex and requires high-quality labeled data and domain expertise, which can be very expensive and time-consuming for any organization. At Roche, we have extracted more than 45 oncology entities from pathology reports. Here are a few examples of surgical pathology reports from patients diagnosed with lung, breast, and colon cancer. The highlighted text shows the entities of interest and their associated labels. The first example shows the diagnostic information of a lung patient: lung, right upper lobe lesion, wedge biopsy, adenocarcinoma, moderately differentiated, where lung is a Location, right is a Laterality, wedge biopsy is a Procedure, and 2.5 centimeters is the Size of the tumor, and it says the surgical margins are not involved.

The second example shows the microscopic description of a breast patient. Histologic type is invasive ductal carcinoma with metastatic features, areas of sarcomatoid carcinoma, histologic grade is not mentioned, and overall grade is three. In this sentence, we have labeled invasive ductal carcinoma as cancer Type and sarcomatoid carcinoma also as cancer Type, and three is labeled as Grade.

The last example shows the clinical data and pathologic diagnosis of a colon patient. The cancer location is ascending colon mass, and the final diagnosis is right colon with appendix, hemicolectomy, two adenocarcinomas, proximal ascending and distal ascending colon. So in this example, you see multiple mentions of type, localization, and procedures. Our approach has been, as a first step, to categorize the content as broadly as possible, which helps us achieve higher recall with entity extraction; in the next step, we drill down into achieving higher precision by mapping entities to standard concepts.

Named Entity Recognition (NER)

NER, in simple words, is entity extraction: a sub-task of information extraction that seeks to locate and classify named entity mentions in unstructured text into predefined categories, such as tumor site, type of tumor, et cetera. Spark NLP provides both CNN+Bi-LSTM and Bio-BERT implementations, and we have trained models to extract more than 45 labels from pathology reports. CNN+Bi-LSTM is a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN (Convolutional Neural Network), eliminating the need for most feature engineering. BERT stands for Bidirectional Encoder Representations from Transformers. Unlike earlier language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Bio-BERT is the first domain-specific BERT-based model that is pre-trained on biomedical corpora like PubMed abstracts and PMC full-text articles. So let me hand it off to Yogesh, who will tell you more about our workflow.

– Thank you so much for the background, Vishakha. Let me briefly talk about how we landed on this approach, starting with a little bit of background. When we started, we started off with a more traditional entity recognition approach like CRF, and it worked really well for us when we had a small, well-labeled dataset. But as we started to expand beyond pathology to domains like radiology or genomics, and beyond individual cancer types, this approach started to crack. Training was taking much, much longer, we couldn’t efficiently leverage clinical word embeddings, and transfer learning across domains and cancer types was not efficient. So that is when we started to experiment with the CNN-based approach, and it has been working really well for us. We use a Spark NLP-based implementation of it, and we’ve been able to achieve good results with it. It is also a state-of-the-art implementation based on certain publicly available datasets.
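To make the NER output concrete, here is a minimal, illustrative sketch (our own code, not Spark NLP’s internals) of how token-level BIO predictions are grouped into labeled entity chunks. The tokens, tags, and function name below are toy examples for illustration only:

```python
# Illustrative only: grouping token-level BIO tags into entity chunks.
# Labels echo the pathology examples above (Procedure, Laterality).

def bio_to_chunks(tokens, tags):
    """Group tokens tagged with a BIO scheme into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)           # the entity continues
        else:                               # "O" or an inconsistent tag ends it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["wedge", "biopsy", ",", "right", "upper", "lobe"]
tags   = ["B-Procedure", "I-Procedure", "O", "B-Laterality", "O", "O"]
print(bio_to_chunks(tokens, tags))
# → [('wedge biopsy', 'Procedure'), ('right', 'Laterality')]
```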

Optical Character Recognition (OCR)

So the next component in our workflow is optical character recognition. Vishakha briefly introduced this, but to quickly reiterate: optical character recognition is being able to convert the text within PDFs and images into machine-readable text. The goal here, as you can imagine, is to consistently convert PDFs into machine-readable text in the domains of pathology, genomics, and radiology. To evaluate an OCR system, we used a combination of metrics, starting with the character error rate, which is the minimum number of edit operations required to transform your reference document into your output text. We also evaluated based on the word error rate, which is how many words are substituted, deleted, or inserted between the reference document and the output text. As you can imagine, evaluating based on word error rate has its own challenges, like dealing with spacing or with the length of word sequences. Because of that, we also looked into a slightly more advanced metric based on bag of words: instead of a word error rate, we evaluated a bag-of-words error rate, where a bag, as you can imagine, is simply a multiset of words. We measured each of these metrics against a number of OCR system parameters, like the engine mode, which tells the engine whether the document is plain text or has an image layer on it; the page segmentation mode, which tells it whether the document contains just a character, a blob of text, tables, and so on; and other factors like the scaling of the layer on the PDF, erosion, and so forth.
So overall, based on these metrics, and against the ground truth that we generated ourselves, we landed on a set of parameter values that has been performing really well for the data that we have. This is how we experimented with and optimized our OCR pipeline.
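The three evaluation metrics described above can be sketched in a few lines; this is our own minimal implementation for illustration, not Roche’s evaluation code, and the sample strings are made up:

```python
# Minimal sketches of character error rate, word error rate, and a
# bag-of-words error rate for comparing OCR output against a reference.
from collections import Counter

def edit_distance(ref, hyp):
    """Minimum number of insert/delete/substitute operations (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def char_error_rate(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

def word_error_rate(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def bag_of_words_error_rate(ref, hyp):
    """Order-insensitive variant: compares word multisets, so spacing and
    word-sequence-length issues matter less than with plain WER."""
    ref_bag, hyp_bag = Counter(ref.split()), Counter(hyp.split())
    diff = sum((ref_bag - hyp_bag).values()) + sum((hyp_bag - ref_bag).values())
    return diff / sum(ref_bag.values())

reference  = "tumor size 2.5 cm"
ocr_output = "tumor size 2,5 cm"               # OCR misread '.' as ','
print(char_error_rate(reference, ocr_output))  # 1 of 17 characters wrong
print(word_error_rate(reference, ocr_output))  # 1 of 4 words wrong
```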

Entity Resolution (ER)

So the next component within the pipeline is entity resolution. We spoke about named entity recognition and we spoke about OCR, but with just those two approaches, you cannot get structure out of unstructured text. So what is entity resolution? Entity resolution is essentially deduplicating and normalizing data by identifying which records correspond to the same real-world entities within your dataset. What you see on the screen is an example of concepts and the codes for those concepts. The idea we have been exploring is to use clinical word embeddings: given the chunks that come from entity recognition, we try to resolve each chunk to one of the concepts within a standard terminology. And just as you would have multiple models for your entity recognizers, you need multiple models for your entity resolvers, depending on the terminology being used.
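The resolve-by-embedding idea above can be sketched with toy data; the concept embeddings below are made up for illustration (real systems use clinical word embeddings), and only the two ICD-10 codes themselves are real:

```python
# Toy sketch: resolve an extracted chunk to the nearest terminology concept
# by cosine similarity of embeddings. Vectors are invented for illustration.
import math

terminology = {
    "C34.1 malignant neoplasm, upper lobe of lung": [0.9, 0.1, 0.2],
    "C50.9 malignant neoplasm of breast":           [0.1, 0.8, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def resolve(chunk_embedding):
    """Return the terminology concept whose embedding is closest to the chunk."""
    return max(terminology, key=lambda c: cosine(chunk_embedding, terminology[c]))

# An NER chunk like "right upper lobe lesion", embedded near the lung concept:
print(resolve([0.85, 0.15, 0.25]))
```

In practice there would be one such resolver per target terminology, mirroring the point above about needing multiple resolver models.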

So now that we’ve discussed the techniques that are at play here, let me talk about the process.


What you see on this screen is, on the left side, labeling data and building the models, and on the right side, deploying and serving the models. There are many elements here. For building the model in an automated manner, we stuck with old-fashioned Jenkins, which we use for orchestration; the primary reason was that it is readily available in our infrastructure. Another choice we made was to stick with Jupyter, of course for our exploratory analysis, but we also stuck with it to run our pipelines. We use an open source tool called Papermill, which parameterizes our Jupyter notebooks so that we can run them as command-line scripts in our orchestrator.
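As an illustration of the Papermill step, a parameterized notebook run from an orchestrator might look like this on the command line; the notebook paths and parameter names here are hypothetical, not Roche’s actual pipeline:

```shell
# Hypothetical notebook and parameter names, for illustration only.
# Papermill injects the -p values into the notebook's "parameters" cell,
# executes it, and writes the executed copy to the output path.
papermill train_ner.ipynb runs/train_ner_output.ipynb \
    -p embeddings biobert \
    -p max_epochs 30
```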

This has made our workflow fairly straightforward: we did not need to maintain a notebook and then separately maintain scripts that run in our pipelines. After that, as you can see in the diagram, we use MLflow for tracking our parameters and performance metrics and for logging our artifacts. This gives us the ability to compare, pull, and deploy any artifact from any of our historical runs. We then deploy the model in a container, which is served through a model server and exposed as APIs. This is more for our sandbox kind of environment; when we move towards productionizing, the process is a lot more manual because we are in a regulated environment.

So just to continue on the workflow, this is a zoomed-in view of what the consumer of all of these NLP techniques put together will get. This slide shows that if you have a PDF document, you need to run it through OCR, which gives you back the text.

MF Workflow

And if it is not a PDF document, of course, we don’t need to use OCR. This text then runs through named entity recognition models. You could have one, or you could have many, and that fetches you all of the entities that the model predicts.

And after you get these entities, you hit the terminology-mapping APIs, and you have them resolved to a standard terminology.

So, there is an edge Lambda that we have in place that basically orchestrates all of this. So the end user only needs to deal with input as a document and output as structured data.
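The document-in, structured-data-out flow just described can be sketched end to end; the stage functions below are stubs standing in for the real OCR, NER, and terminology-mapping services, and every name here is ours, not Roche’s:

```python
# Minimal sketch of the orchestration: OCR (if needed) -> NER -> terminology
# mapping. Each stage is a stub; a real system calls out to services.

def run_ocr(document):
    """Stub: a real implementation calls the OCR service on a PDF or image."""
    return "text extracted by OCR"

def run_ner(text):
    """Stub: a real implementation calls one or more NER models."""
    return [{"chunk": "wedge biopsy", "label": "Procedure"}]

def map_terminology(entity):
    """Stub: a real implementation hits the terminology-mapping API."""
    return {**entity, "code": "terminology-code-here"}  # placeholder code

def process(document):
    # PDFs go through OCR first; plain text skips straight to NER.
    text = document["text"] if document["type"] == "text" else run_ocr(document)
    entities = run_ner(text)
    return [map_terminology(e) for e in entities]

print(process({"type": "text", "text": "wedge biopsy performed"}))
```

The end user sees only the first and last step: a document goes in, structured data comes out.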

Training NER model with BERT

So you might wonder, looking at the whole pipeline, how much code one would need to write to get a model running? Fortunately, it is not a lot.

This is a snippet of code from Spark NLP. You can see that there’s a model being built using BERT embeddings and a deep-learning-based entity recognizer. Our pipeline looks pretty similar to this. Of course, there is some boilerplate around it to get things going, but the training of our model is as simple as this. This has really helped us keep things simple, and we’ve been able to iterate on changes or experiments much, much faster.

So having said that, I just wanted to conclude on this section by saying that this whole NLP process, it has kind of been a journey for Roche. We started from scratch and we’ve been working towards expanding to more domains and trying to automate as much information extraction as possible.

The use of NLP will be a journey

Like Vishakha introduced, we started with pathology, and we’re working towards a lot more domains like radiology and genomics. We are hoping to leverage what we’ve learned so far to scale up to this new challenge. So now let me hand it off to David, who will tell you more about Spark NLP, which is the tool that we’ve been using for all of our work. – Thank you Yogesh, and hello everyone. What I would like to talk about is Spark NLP, the library which, as Yogesh and Vishakha explained, is one of the enabling libraries for these projects and these kinds of use cases.

What is Spark NLP?

We hope this will enable you to understand how you can best use it for your own projects, even beyond NLP in healthcare, since many of you are working on similarly domain-specific language understanding problems. On one hand, you want to start with state-of-the-art algorithms, models, and implementations; on the other hand, you know you’ll need to train your own models to answer the specific questions that arise in your context. So, Spark NLP is an open source library.

Its goal is to provide the industry with production-grade implementations of state-of-the-art NLP research.

So what the team does is read the latest papers and research and try to reproduce them. Whatever reproduces and generalizes becomes part of the production-grade product, delivered as an open source library with Python, Scala, and Java APIs.

It also has an ecosystem and comes with more than 100 pre-trained models and pre-trained pipelines, which you can activate with about three lines of code nowadays. And it’s a very active community: 26 new releases in 2018, 13 new releases in 2019, and the same pace is continuing in 2020. In early 2019, Spark NLP became the most widely used NLP library in the enterprise, and just a couple of weeks ago, O’Reilly published its 2020 AI Adoption in the Enterprise survey, where once again Spark NLP was, by far, the most widely used NLP library in production in the enterprise.

On top of Spark NLP, the product that was used to build this project is Spark NLP for Healthcare, an extension which is required because, as I think you’ve seen here, medical NLP is a different problem than general-language NLP, all the way down to having different corpora, research papers, conferences, and benchmarks to deal with. So there’s a different code base and a different set of models that deliver state-of-the-art clinical and medical NLP solutions. As you can see here, it includes the same entire base we talked about, which you can use for anything else, but it is healthcare-specific: first of all, in things like tokenization, part-of-speech tagging, spellchecking, and even sentence segmentation; and then, on top of that, in recognizing clinical entities, like the oncological entities we saw here, and in clinical entity linking, that is, entity resolution: being able to map an entity to a specific code in a medical terminology. There is also assertion status detection. It’s nice that we can extract the term diabetes, but for most use cases, if we cannot tell between a patient who has diabetes, one who has no diabetes, one with suspected diabetes, and one with a family history of diabetes, it’s almost useless to have just the term, whether you want to assess patient risk, match a patient to a clinical trial, or find the best next action for clinical decision support. Other very important features include de-identification of both structured and unstructured data, as well as the OCR capability we have seen here.
On top of that, Spark NLP for Healthcare comes with more than 50 pre-trained models: some of them embeddings, some of them NER models, assertion status models, and entity linking models. But very important, as you’ve just seen, is the ability to train your own, because most often, if you’re in a healthcare setting and working in a specific specialty or on a specific use case, you will want to tune your models to focus on extracting those specific entities. In this use case, we do not care about generic symptoms, problems, and drugs; we care about the specific size, laterality, and histology of the tumor. And then the question is: how fast can you get to a point where you have very high accuracy on, in this case, more than 40 specific entities?
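The four-way assertion distinction described above (present, absent, suspected, family history) can be illustrated with a toy sketch. To be clear, Spark NLP for Healthcare uses a trained assertion model; this keyword lookup is only our own illustration of the input/output shape, with made-up cue phrases:

```python
# Toy illustration of assertion status detection: given a sentence and an
# extracted entity, classify the mention. Cue phrases are invented examples.

CUES = [
    ("family history of", "family_history"),
    ("no evidence of",    "absent"),
    ("suspected",         "suspected"),
]

def assertion_status(sentence, entity):
    """Very rough: look for cue phrases before the entity mention."""
    prefix = sentence.lower().split(entity.lower())[0]
    for cue, status in CUES:
        if cue in prefix:
            return status
    return "present"

print(assertion_status("Patient has diabetes.", "diabetes"))         # present
print(assertion_status("No evidence of diabetes.", "diabetes"))      # absent
print(assertion_status("Suspected diabetes mellitus.", "diabetes"))  # suspected
print(assertion_status("Family history of diabetes.", "diabetes"))   # family_history
```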

Accuracy Benchmarks

So in general, in the design of Spark NLP, Spark NLP for Healthcare, and Spark OCR, there are three main design goals: accuracy, scalability, and speed.

With accuracy, when we say state-of-the-art, it’s not a marketing term, it’s an academic term, which means that an algorithm at the state-of-the-art provides the best accuracy achieved on a public academic benchmark in a peer-reviewed paper. So basically, this is the best the research community has been able to produce in a verifiable manner.

And as you can see, measured in this formal way, Spark NLP does definitely much better than other libraries, largely by adopting some of the latest advances in deep learning and transfer learning. On top of that, there are some really new things from the past two or three months. In the 2.4 release, the NER deep-learning algorithm was reworked, and Spark OCR now has 20 different annotators for preprocessing an image: beyond the erosion and scaling mentioned earlier, you can now also do noise reduction, automated scaling, skew correction, and other algorithms that improve the quality of the image before you try to extract text. Clinical entity resolution has also been reworked with newer, more accurate algorithms, and new models, such as mappings to larger terminologies such as [Unclear], are also available out of the box. The 2.5 release, really just a month ago, integrated support for ALBERT and XLNet embeddings, plus two new tasks for which Spark NLP produces state-of-the-art results as of 2020: spellchecking and sentiment analysis, whose trainability also enables emotion detection.

Scaling Benchmarks

In terms of scalability, Spark NLP is based on Apache Spark, and it’s actually the only distributed open source NLP library: you can run it on any Spark cluster. One important thing, especially in a healthcare setting when you’re dealing with patient data: it’s just a library, not an as-a-service offering, so you do not need to send your data to a third party, which would come with its own compliance, sharing, and privacy issues. You can run it on your own, either on a local machine or in a container, and when you need to scale, you can scale out on a cluster. The benchmark you see on the right is just AWS EMR with zero code changes; the whole story behind this benchmark, if you’d like to reproduce it, is available online. Apache Spark really benefits us here, because it deals with all the really nasty issues that come with distributed computing, like minimizing shuffling, optimizing caching, minimizing the amount of bandwidth used, and doing execution planning of the whole pipeline before actually running it. A lot of work was done with the Apache Spark community and the Databricks team to make use of those features and make sure Spark can squeeze the most out of the algorithms we give it. And of course, distributed computing is not magic, and the speedup you’ll see depends heavily on the use case. If you’re doing inference, for example, you’ll see nearly linear speedups, but if you’re training, say, an RNN, which is by nature more iterative, you’re going to see sublinear speedups.

Speed Benchmarks

In terms of speed, one of the other things Spark NLP focuses on is making sure it’s optimized for the latest and greatest hardware platforms, specifically the ones from Intel and Nvidia. Nvidia obviously has GPUs, and it has several generations of them, each adding different types of memory architectures and even instructions. Intel, in the past three years, started producing chips that have deep-learning-specific instructions, and one advantage a CPU has is that it can use more memory than just the memory on the GPU, which definitely helps in some use cases. Spark NLP has optimized builds for both Intel and Nvidia. This specific benchmark compares two generations of Intel chips and a Tesla P100 from Nvidia; the specific use case was training an NER model for the French language.

In this case, the Intel chip actually turned out to be about 19% faster and almost half the price on AWS compared to the GPU.

Clinical Entity Recognition: Accuracy

So in this specific use case, we don’t just use the open source Spark NLP, we use Spark NLP for Healthcare. So what we also care about is: how accurate is it for these use cases? The two most important tasks are recognizing clinical entities (can we correctly extract the entities from oncology and radiology reports?) and resolving them (can we correctly map them to medical terminologies?). For clinical entity recognition, by the way, both of these algorithms have a different implementation within the healthcare code base.

Because that is what gets you state-of-the-art within this domain.

And as you can see, keeping track of the state-of-the-art has become really nice nowadays: there’s the NLP-progress site online, which lists the papers and the benchmarks with their scores and tells you what the state-of-the-art is as of today, and some really nice websites like it enable you to keep track of the space, which is moving amazingly quickly these past couple of years. Nowadays, if you publish an academic paper and claim a state-of-the-art result, you’re likely to stay at the top for maybe eight to ten weeks. That means our job is to keep catching up with the state-of-the-art, which really means we just have to keep running and keep adding new things to take advantage of all this new innovation, which is fantastic. Here you can see some public benchmarks on entity recognition, which clearly show that the next-to-latest version of Spark NLP, used out of the box on public benchmarks, achieves the highest accuracy on the standard common datasets. What was important for the Roche use case was not only that, but also the fact that, in contrast to only using pre-trained models, Spark NLP is trainable. So what we were able to do here is generate training data, using clinicians, for specific oncology data, and then fine-tune the model. We’re still using the same architecture and the same Bio-BERT embeddings, and we can produce a highly accurate model for this use case, which is probably unique.

Clinical Entity Resolution

The second important task is entity resolution. That’s another thing you need to do at the NLP stage, because you want the system to know, for example, that renal failure and decreased renal function are really the same thing.

In healthcare, there are many, many ways to write what is essentially the same thing. If you don’t normalize and just use the entities as they are, instead of mapping them to standard terminologies, the problem is not just integration with other systems that use those codes; the point is that you end up with a much larger feature space. Because really, the way you want to use this is: if I know a patient has renal insufficiency, that increases patient risk and may affect how we treat them, and I don’t want to deal with three different ways of writing this as three completely separate features. That’s something you want to handle at the text level, the NLP level. This also comes out of the box, as we said before, but is also trainable: for example, if you’re looking for specific tumor characteristics, as in this case, we can train a model to resolve just those. And there are public benchmarks for this too, on academic datasets such as NCBI, with the current out-of-the-box numbers.

Learn more: Spark NLP

So if you want to learn more, or just try running this yourself, one nice thing that’s available now for Spark NLP is the Colab notebooks. If you look at the links here, underneath you have public notebooks, and when you open them, there’s a button that says Run in Colab; you can then run it within your own Google account, so there’s really nothing to install or set up. It will show you both how to use pre-trained models and how to train your own models for different types of cases, which is a great way to get started. Other than that, if you’re working in healthcare, definitely consider trying Spark NLP for Healthcare. If you’re working in another domain that requires you to train your own models, say legal, finance, or insurance, the best thing is probably to start with the open source library and see what the smallest number of documents and examples is that you need in order to train a tuned model for your domain.

And I think what you’ll find is that the number has gone down significantly in the past two years with advancements in transformers and [Unclear]. If you have any other questions, we’d be happy to answer them. Please get in touch with us.

We’re always interested to know what people are doing, and with this very active community, we’ll most likely be able to answer your question, whether it’s a simple out-of-memory issue or a bigger question on how to approach a use case. So with that, thank you very much.

Thank you for your time!


About David Talby

John Snow Labs

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

About Vishakha Sharma


Vishakha Sharma is a principal data scientist for diagnostic information solutions at Roche, where she leads advanced analytics initiatives such as natural language processing (NLP) and machine learning (ML) to discover key insights improving NAVIFY product portfolio, leading to better and more efficient patient care. Vishakha has authored 40+ peer-reviewed publications and proceedings and has given 15+ invited talks. She serves on the program committee of the ACM-W, NeurIPS, AMIA, and ACM-BCB. Her research work has been funded by the NIH Big Data to Knowledge (BD2K) initiative to build an NLP precision medicine software. She holds a PhD in computer science.

About Yogesh Pandit


Yogesh Pandit is a Staff Software Engineer in the Analytics Group within Diagnostics Information Solutions at Roche. Currently, he’s leading the NLP efforts to support the company’s NAVIFY platform, which aims to support oncology care teams to review, discuss, and align on treatment decisions for the patient. Yogesh is a bioinformatician turned machine learning enthusiast with experience in biomedical NLP. For the past few years, he’s been working on building applications leveraging data in the life sciences and healthcare space.