All In – Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) – A Real World Case Study

Download Slides

Molecular profiling provides precise and individualized cancer treatment options and decisions points. By assessing DNA, RNA, proteins, etc. clinical teams are able to understand the biology of the disease and provide specific treatment plans for oncology patients. An integrated database with demographic, clinical and molecular data was created to summarize individualized genomic reports. Oncologist are able to review the reports and receive assistance interpreting results and potential treatments plans. The architecture to support the current environment includes Wasbi storage, bash/corn/PowerShell, Hive and Office 365 (SharePoint). Via an automated process personalized genomics data is delivered to physicians. As we supported this environment we noted unique challenges and brainstormed a plan for the next generation of the critical business pipeline line.

After researching different platforms we felt that Databricks would allow us to cut cost, standardize our workflow and easily scale for a large organization. This presentation will detail some of the challenges with the previous environment, why we chose Apache Spark and Databricks, migration plans and lessons learned, new technology used after the migration (Data Factory/Databricks, PowerApp/Power Automate/Logic App, Power BI), and how the business has been impacted post migration. Migration to Databricks was critical for our organization due to the time sensitivity of the data and our organizational commitment to personalized treatment for oncology patients.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hello, my name is Victoria Morris, and I’m an independent consultant working for Atrium Health. Atrium Health is the second largest, not-for-profit health system in the US. The VA is the first. Today, we’re gonna be walking through a real world case study in migrating our genomic pipelines from a bash Hive on HDI configuration, to Spark in Azure Databricks on Microsoft. First, we’ll do a quick overview of our data lake and some background on why we decided to move to Databricks.


Then, we’re going to be talking about migrating our two genomic pipelines. Talk about some things we learned, and then with the impact it had on the business. We value your feedback. Please don’t forget to rate and review the sessions. So let’s begin with the link.

Original Problem Statement(s)

Back in 2016, we were asked to solve this problem. Genomic reports are hidden deep in the EHR and scanned reports. We receive them from multiple vendors. They look different and are very long. It takes a lot of manual effort to summarize them. These reports make generic clinical trial recommendations, which may or may not be available at our oncology center called the Levine Cancer Institute or LCI. And these recommendations need to be translated into our specific available trials. Bottom line, it’s complicated, time consuming and manual.

The objective at the LCI integrated knowledge base or link is to combine data across various data silos in the organization. A common use case for data lakes. Originally, we had four people, and there’s only Lance and I left. Lance’s primary focus in this project is on infrastructure and negotiating with security to get data into the data lake and making sure we keep within budget. I focus on the algorithm that runs the back end, as well as developing the front end user interfaces and any new functionality requests. Additionally, I keep the pipeline running smoothly day-to-day, troubleshooting any issues as they arise. As you can see, we have various data sources converging into our data lake in Azure. All of this data is transformed by algorithms into our two pipelines. Our first pipeline clinical trials matching actually begins an EA pathways. Our fire, smart, embedded app in the EHR used to ensure consistent care across patients regardless of treatment location, while using approved pathways for treatment. Providers determine the best possible treatment pathway which may include genomic testing. The clinical trials matching pipeline then takes the results from these two external tests as input, merges it with the other relevant data, knowing about the patient across the data lake to generate clinical trials matching and genomic report summary. The second pipeline is for the pharmacogenomics. It starts from internal genomics testing results where the raw one and two Alleo calls from the internal lab are transformed to the genotype and phenotype. They derive clinical recommendations and then generate a summary report. Here’s a look at the original tools and the data flow that supported the 2016 process. We started diagnostically with bash script that used Cron to control and run jobs. Scoop to gather data, Hive to do the algorithm. We use Ms Web Apps to pull data on an in the cloud SQL server back end that was auto-created by Microsoft, with the front end user interface that was also auto-generated, but was customizable. We used Emma SharePoint designer and PowerShell to move information between SharePoint lists and databases, et cetera. We’re receiving a lot of different data formats. XML, comma-separated value files, PDFs, banned fast keys. We pull full table copies from internal databases and from the Teza, our enterprise data warehouse. We pull queries across multiple tables or full table reads of data in the Azure storage using Scoop. Our pipeline went live in August of 2016, starting as a proof of concept. Over the years, there have been tweaks.

Once Microsoft stopped supporting Web Apps, we were asked to use the new Emma’s Power Apps. So Power Apps replaced Web Apps and Flow which was the original lean for Power Automate replaced SharePoint designer and PowerShell. We added the additional labs inbound and more data formats. JSON and RSID files. We created Azure SQL server databases to replace the auto-generated Web App ones. And all that time, we continue to maintain a daily production run, multiple reports per day to support clinical users, and the weekly genomics tumor board. The high back end grew to over 8,000 lines by the time we migrated to Databricks.

But like any great success, we had issues. With over 180 oncologists plus pharmacists, nurses, residents, lab, folks who use this data in real time to make clinical decisions on patient treatment options for real life people, when it stops working for any reason, it’s a big deal. We were early adopters of HDI. Microsoft had not envisioned people just turning on and leaving these clusters up and running for years. In the cattle versus pets analogy, we were a bad cow named Fred.

Safe mode kept killing us. Seemingly, randomly, we get up in the morning and realize that everything had failed overnight with no error report, and have to rerun hours of work after getting out of safe mode. We were crunching through a lot of data. 6 million patients, 12 and a half million encounters per year, times five years to find the right patient’s record for the provider who ordered that test. Just to pull in all the background information on the patient from several different systems before we even started looking at the results. I broke Visual Studio a couple of times before we were told to abandon Visual Studio and move to Visual Code. It was an expensive operation as I was constantly being reminded, and we tried lots of things to pair down expenses like scaling, which again, we managed to hit a hidden limit on. There was a max number of logs hidden in the design that no one expected you to reach. But after so many years of operation, like yes.

The scaling also introduced a lot of instability. We always had IP address issues or rather lack of IP address issues. Scaling had a negative impact on that as well. We’d periodically lose IP addresses that were never released by the automated scaling down process. Eventually, we would run out of IP addresses and everything would fail.

Next moves.

We knew roughly what we wanted. We wanted cheap, excuse me, value. We wanted to stay in the cloud. We wanted flexibility in programming language choice, and we were finally ready to commit to Microsoft. We needed a HIPAA compliant infrastructure that was stable, self-sufficient and did not require a lot of tinkering. I wanted something that we get me out of the Visual Studio, Visual Code world into something that was easier to use like Notebooks. We decided to move to Databricks to manipulate clean and gather the data together. We introduced data factories to move data around instead of using the command line. For myself, the tipping point was not having to hands-on manage the infrastructure anymore. I am not a hardware person, I’m a software person.

Speaking of migration.

Migration – starting small

Start small. There is an enormously steep learning curve for Databricks and not just because this pipeline requires me to be fluent in a lot of other technologies besides Databricks like Power Apps, Data Factory, Power Automate, Power BI and SharePoint. So we looked at starting with the second pipeline which was much smaller, and it was much easier to start from scratch.

Pharmacogenomics In progress

Pharmacogenomics is that personalized medicine you keep hearing about. Where depending on your genes, your body may react differently to medications. You might need a higher dose or should not be prescribed some medications at all. Driven by a power app, this lets our internal lab generate lab results for pharmacogenomics to be translated into a report. At the 200,000-foot level, and if you are intimately familiar with genomics, please forgive my nickel dime tour explanation. Genomics testing slices of the double helix of your DNA, RNA and numbers each piece. Then lines the two double helix’s backup, and you get a letter representing one of the four possible acids. A, T, C or G that is in each position called the Leo called one and a Leo called two. Combine a bunch of these snip IDs, say one, two and five, and you make a gene. And depending on what letters are in cord one and cord two, you have a genotype or variant, while types are normal, and or maybe you’re a star six, star six. These gene variants distinguish how your body will metabolize medications, and or the efficacy of prescribing a medication to you.


This is the RSID file we received from the lab. We get the sample ID, the NCBI snip reference number, the ilial one and two calls those CTG, and for some reason, a dash. I don’t know, maybe alien DNA. The truth is out there.

We actually started this app with the labs sending us this spreadsheet, asking if we could create an app to read these files, the files from the previous screen, and translate these phenotypes from the lab results into a report using these Excel spreadsheets. This is the resulting Power App screen from that spreadsheet. The screen does the phenotyping based on the gene selected and the possible variant combinations. A metabolizer is assigned. This screen dynamically redraws depending on the number of variants. The lab defines all this themselves through other screens in the app. The lab users review the patient’s results and ensure they match what they saw on the machine. They also add pertinent info that was not available in the raw data. This shot shows the gene, CYP2C19 which is made up from three RSID shown here. And these are the ilial one and two calls for that specific patient. Once the lab is satisfied, they sign the results and they fall off their queue and that jumps to the clinical review team’s queue. Here, the clinical interpretation work two is presenting a patient for review. Staff can add a personalized consult note, edit text, preview the PDF report, sign off the results is complete, and then it drops off their queue. There’s only one notebook for pharmacogenomics. This started from one of my first notebooks. I didn’t know I like Scala. So it’s a Python notebook! No worries, I just change itself as I go. One of the interesting technical challenges to make the proposed workflow for this application to work is that we needed to be able to run this notebook on demand from a button push inside the power app which is accomplished by a power automate called the Data Factory to run the notebook.

Okay, back to our 8,000 lines of Hive code in pipeline.

So we have four external labs sending us results. Each one in its own format. This report is 15 pages long. I’ve edited it down, a bit of bridge if you will, and these potential clinical trial matches are what may or may not exist at clinic at LCI. And keeping in mind, I’m not sure of anyone not intimately familiar with genomics and all the clinical trials under ASCO could even begin to utilize this page. Or you can have this simple summary paragraph, and eligible clinical trials paragraph too. And yes, this is the actual result from that 15-page report. Clinical trials matching is incredibly difficult and requires an intensive manual effort. The physician going through the patient’s record and looking up what is required for a variety of possible clinical trials that they may or may not be familiar with is prohibitively difficult. By automating this process, providing clinical decision support, we increase the ability of providers to match patients to clinical trials. Why is increasing clinical trial participation important? Well, from it being part of Atrium Health 2020 goals, advances in care. Clinical trials are how we test and make changes to treatments that improve care and outcomes. Patients get care to reduce costs. Atrium Health gets revenue for participating in a clinical trial, and physicians get prestige and recognition for getting their names associated with clinical trials. So it’s a win, win, win. Being able to match clinical trials against these 10 criteria flexibly with no hard coding is a very non-trivial accomplishment. We’ve been looking at commercial matching projects, products which have extremely limited numbers of genes and criteria, all hard coded with about a two week development cycle for each individual clinical trial. Here at Atrium, we have a couple of 100 clinical trials open at any given point in time, with lots of movement, opening and closing of trials daily at Atrium.

Opening Screen

We use Power Apps for staff to manage and maintain how they want things to be matched, and what information should appear in the summary and how it should look. We try not to hard-code anything into the algorithm, wherever possible. It’s all run from the features in these tables. They, that they can manipulate in near real time. Any changes made today are part of tomorrow’s run. No IAS intervention required. And what is hopefully a friendly, easy to use interface. These are complex relationships that we’re trying to describe. And the large part of the redesign has been making it simpler, and easier to enter the clinical trial data. A clinical trial has a parent record, a name and other relevant data like age, range, status, gender. Clinical trial criteria allow us to branch for multiple genes. A clinical trial may have one clinical trial criteria or many. Asker taper has over a 100, for example, but there are lots with just one or two.

Clinical trial criteria, the child record has a gene to match the results against, normalize machine results then for each feature like tumor site, we have two possible states. I want most things with only a few exceptions include any accepts like the orange tumor button, or I want a couple of things and I want to exclude most which is what’s happening in the green diagnosis button. Only include this short list. Button text changes as well as the collections that are displayed below each button, which summarizes the choices made on other screens. Pushing a button, navigate you to a new screen to define inclusion and exclusion criteria for that feature. For lab results, you need an operator like greater than, or less than or equal to and a value. And since we receive results from multiple labs, there may be differences in units of measure. Most are cosmetic like this example, but you can be completely different scales of measurement like decagrams per liter versus micromoles per decaliter. So hemoglobin could be measured where a normal is 120 or using the other scale 12. Overall, you can use the red slider define if you want to count a match. If you meet any of those lab results, or you can flip the switch and acquire all the labs to be considered. The large, gray button changes the overall logic which is currently only include these, that shortlist of include, but you could say include any, except what I define here.

There is this, this is the summary results template page. One of the most important aspects once we have the correct data was how it was displayed to the provider. Dr. Farhangfar is extremely detailed oriented and that’s what makes her great at her job. She can self determine whether or not a result displays for each technology, right down to only for a specific gene on a specific technology for a specific result. She defined exactly the title, and what variable she wants displayed in the details line. Again, no IIS intervention, this is also service. Any changes she makes are seen in tomorrow’s run. In this example, any mutated pathogenic gene from a clean T7 technology should display, and Databricks will substitute the number inside the angle brackets for actual values for the biomarker and NGS protein change when it calculates the summary. This one is for this specific gene, TMB, and only for this result low, only in this technology. Frame shifts are not important enough to display in the summary. So those results are removed from you. Display on the summary is independent of clinical trials matching. Once we finish clinical trials matching, we add to the bottom of the summary, any results that would not normally be displayed in the summary. So it makes sense to the provider reading why there was a match. A good example as wild types. We do not display wild types in the summary unless they’re used to match a clinical trial.

The reports review allows for your curation of the results of the algorithm. Here, you can open the original full PDF using the purple button, you can see the raw machine results in the table below that derived to summary. The orange buttons let you quickly move to another part of the app if you need to tweak something, like how it should be displaying or matching criteria. Using the two edit controls, you can manually override the results from the algorithm for this patient today. We periodically review the manual updates, you can see algorithm output, to consistently improve the algorithm.

So all of that data, XML, CSV, JSON, data table read, SQL polls feed into the data lake in Azure, where Databricks runs it to Notebooks. Reading in the features from the Azure SQL database tables that were edited by the user in the power app, combining it with the data lake data, it generates the results. Uploading them to the data lake or the Azure SQL server database, where a curator can review and release using the power app. The Power Automate flow moves the documents around and updates the SharePoint lists. Power BI reads the data from the SQL server database and updates its views. Emails are sent to indicate there’s working cues or new reports for providers are available. All of this is the power of an interconnected ecosystem. Databricks on Azure with Office 365.

The great migration.

For Databricks, we’ve broken the work down into several independent notebooks. For each, first each lab has a separate notebook that reads their specific input data, XML, CSV, JSON, and transforms it to a standard format for matching and summary creation. This allows us to run labs notebook independently since the cadence and time of day, we get data is different from each lab. Then there’s a second notebook that does the matching for clinical trials and a third one that does the summary. This split is because there’s an optimal number of sales per notebook. When you go over a 100, you get that poor performance warning. It just seems like a good logical break to just put things here to minimize any rework or reloading of table data. Modular nature also makes troubleshooting so much easier than the 8,000 lines of Hive code. You’d have to let the whole script run through for everything trying to debug was torture. Now I can just rerun a local group, a logical group, say one lab and the main match check on a change or just the summary to check on an issue with a display template.

I created the algorithm in Hive, originally. So it was actually pretty easy to translate since we treated the Hive tables like data frames. But there were of course, some differences. In Hive, I have to create a table definition, which gets stored in the database, and in Databricks, I need to have a schema to read the data. There’s no header in my comma-separated value data, and over time, data has changed. We’re getting more fields in 2020 than we did in 2016. So I can’t rely on any of the infer options. There’s also a couple of extra steps to get things in SQL kind of mood. In Databricks, I have a separate line of Python to read the data into a data frame using the predefined schema, and then from Python, I save the data frame to a table.

Creating a clean view of the data

Then things started to look very similar. Often, we were doing the transformations of things could just be copy and pasted, but it was also a chance to rethink some original choices and improve the algorithm by redoing some of the logic in the newer version.

Databricks notebooks manage the clusters that have been in that have been attached to them. It has been a lifesaver. Clusters can scale worker nodes up and down based on workload, you predefine libraries to be attached to clusters that are used for your notebooks. I prefer to write in Scala, but I can switch languages in each cell. And although I try and stick to matrix manipulation wherever possible, as Lance says, sometimes you just need a loop.

Databricks by the numbers. We use a premium workspace. We have about 110 terabytes of data. 2.3 million gene results times roughly 240 clinical trial criteria, times 10 features per clinical trial each day. ‘Cause we recalculate today’s clinical trial matches for all patients every day since trials open and close, and a patient’s eligibility changes. And yes, even during COVID-19, we’re still seeing roughly one new report a day.

Things we learned.

Azure Key Vaults and Back-up

Azure Key Vault. Totally worth the time to set up, especially as you expand access into Databricks notebooks to multiple folks. You do not want all those credentials, they’re everywhere in plain text. It’s important to save a copy of those final working detailed instructions though, ’cause you only have to redo this when you create a whole net new workspace, which is not often, and you will forget. Speaking of creating a whole net new workspace, if you do this unexpectedly, you lose absolutely everything. I know, so old school and old-fashioned, but save complete copies of what you’re doing periodically offline. And yes, we had to generate new workspaces twice during this project.

Working with complex nested Json and XML sucks

Working with complex, deep, nested JSON and XML sucks.

Working with complex nested Json and XML sucks It sounds so simple in the examples and works great in the simple 1le examples – real world when something is nested and duplicated or missing entirely from that record several levels deep and usually in

I mean, it looks great in all those cute little examples they give you, but most real world, self-describing data is messy where you may or may not have a node, which may or may not repeat, and you need to shove it all into that structured table to do anything with it. Cardinality to the rescue. I use that function all the time to see if there was any data inside those complex arrays, which may or may not contain data. Another handy function was concat with string. It lets you concatenate strings or arrays, or a combination of the two into a string. So if you’re not sure if you have one knoll or multi-year results inside your data, because it’s complex and messy like life, you can convert the unknown to a final string to have everything fit neatly into your structured table. I generally use a pipe separator so I can dig through through many entries after the fact.

Tips and tricks?

This was a hard fought lesson. Databricks only read block blobs. Any other blood types, it doesn’t even see the directory. Data Factory is a great tool. We just didn’t use it as much as we thought we would. I’m no expert, but I use these all the time. If you create a data frame in Python and suddenly need to use it in Scala, you can’t see it unless you convert it. I use SQL to convert between Python and Scala, and I do a lot of work inside Spark SQL directly.

Code Snips I used all the time

We stopped scooping where the data sources were relatively small, and did direct connections to tables as long as I was not impeding production. This worked well and it’s the middle scale of code snippet. But if you stop or run all or cancer come in in the middle of writing out to a table using that, you hose it completely. You only the bottom code snippet to fix it. This cleans up the file system to allow you to attempt to write the table again. Because of the modular nature of our design, I saved the end results so I can transfer the transformations and calculations from one notebook into another. And then I use the refresh table feature to pull the latest, true value version of the table, regardless of what is cashed in the cluster. I love Parquet. We get so many speed gains from it. But if you’re working with the real world data, you may end up with a no column definition which Parquet hates. Early on, after the initial read of incoming unstructured data, I usually cast any numeric column to its true type and float double as part of a transformation to avoid getting all the way through the script and hitting a knoll-typed column error from hurricane when I’m saving the final results to a table.

Business Impact

Business impact. More stable, less money. If you have breakdowns, it’s much easier to troubleshoot. Increases in volume or handle easily. We went from a 600-gene panel to a whole transcriptome reporting from one menu without even blinking, and independence for the end users.

I’d like to thank Dr. Ragavan, the founder of our feast. As the head of LCI, he provides funding, guidance and regulatory approval for all the data collection as primary investigator. Our customers, Dr. Farhangfar, Nury and Jai, on the IAS side, this would not be possible without Chris Danzi, Lance, Scott. Andy, Andrea Bouronich, Stephanie, Melanie, Stacy, Kelly Jones and his team and all the data systems owners who let us access their data and all the Microsoft support people who helped us push the edges. Databricks, and of course thank you for your time and attention today.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Victoria Morris

Atrium Health

Victoria Morris has a BA in Theater Acting so naturally she went on to get her MSc in Information Systems. She has spent longer than she cares to remember working with data in various organizations. Before entering the healthcare industry, she was responsible for coding lottery and gaming tickets including writing an algorithm for unique crosswords extended play games. Challenging herself further, she worked to define dynamic logistic routes for shipping goods and pricing commodities across northern Canada and the U.S. Victoria has worked in healthcare for the last decade, focusing on integration for both Oncology and Emergency Medicine. Having worked in the U.S., Canada, and Australia she brings unique experience into the inner workings of various healthcare systems and several major EHR's. Victoria is passionate about making lives easier for healthcare providers and their patients by using automation to remove barriers to care.