This talk focuses on the importance of data access, and how crucial granular-level data availability in the open-source space is for researchers and data teams to fuel their work.
We present the research conducted by the DS4C (Data Science for COVID-19) team, who made a huge and detailed body of South Korea COVID-19 data available to the wider community. The DS4C dataset was one of the most impactful datasets on Kaggle, with over fifty thousand cumulative downloads and 300 unique contributors. What makes the DS4C dataset so potent is the sheer amount of data collected for each patient. The Korean government has been collecting and releasing patient information with unprecedented levels of detail. The data released includes infected people’s travel routes, the public transport they took, and the medical institutions treating them. This extremely fine-grained detail is what makes the DS4C dataset valuable: it makes it easier for researchers and data scientists to identify trends, gather evidence to support hypotheses, track down causes, and gain additional insights. We will cover the data challenges and the impact that making this data available on a public forum had on the community, and conclude with an insightful visual representation.
Vini Jaiswal: At this year’s Data and AI Summit, we thought of highlighting the value of data during unprecedented times. Hi, I’m Vini Jaiswal, Data and AI Evangelist at Databricks, where I consult data practitioners on Apache Spark and ML/AI use cases. Along with me today, we have Isaac representing the Data Science for COVID-19 organization. Today we will go through the importance of data accessibility, an overview of the research conducted by the Data Science for COVID-19 organization, and then we will cover the value of the open source community. This is how COVID evolved. The virus was first identified in December 2019 from a pneumonia outbreak; near the end of December 2019 came early signs of potential human-to-human transmission, then a public emergency declaration around January 23rd, and right around March it started quickly surpassing millions of cases while spreading geographically all over the world, and it was declared a pandemic during that time.
Little did we initially know about the spread of the virus or how lethal it was. Several theories were circulating in the media, but there was no concrete data we could use to drive toward hypotheses or concrete conclusions. This is where the importance of data comes into play.
There was a massive surge in COVID cases right around March 2020, as we saw in the previous bar chart. It became very crucial to know the origin, the causes of the spikes, the affected regions, and the reasons for the spread. To know that, we wanted to see some kind of data. However, sources of data were very limited, and data quality was a challenge. It became much more urgent to make this data available, so that people all over the world could be well informed and take preventive measures. As a data practitioner, when I started to research the trends, as many of you may have, I came across some datasets on Kaggle, Google, and open source platforms. I was looking for much more granular detail than just the number of cases, and I came across this one particular dataset which had so many details around the cause, the spread, patient information, and route data, and I picked it for our research efforts and analytics. So let me hand it over to Isaac so he can give us insights into the dataset itself, and he will also talk about the data engineering that went on behind it.
Isaac Lee: Hi, I’m Isaac Lee, and I had the fortune of taking the lead role in engineering the South Korea COVID-19 datasets. If you’ve ever looked into COVID-19 datasets on Kaggle, then you have likely encountered one of ours, as we were ranked third on Kaggle, right next to the Johns Hopkins dataset. Our dataset, unlike the other two datasets, is composed of multiple different tables, and we tried to depict a holistic picture where each table represents a particular scene of COVID-19 within Korea. The abundance of detail, I believe, was the reason we were able to garner so much attention, even though our scope was very narrow: just South Korea.
So our dataset has a very narrow scope compared to the other popular COVID-19 datasets, and I’m first going to go over a few of the tables, the scenes that I believe have the most potential for research. First, our dataset is composed of multiple relational tables, and each table contains different information about the pandemic. For instance, the patient info table contains general information about each patient, the patient route table contains data on the movement patterns of these infected patients, and the case table contains information about how each patient contracted the disease. With each table offering a different perspective on the pandemic, you can really capture a comprehensive picture of the pandemic in Korea without losing any critical details. We’re going to take a look at a couple of the key tables.
First, the patient info slide. The patient info table contains general information about each patient. The very left column is the patient ID, a unique ID assigned to each patient, and this is what brings all of the different tables together into a single coherent piece. Every table uses the same ID to refer to each specific patient. This means that after taking a general view of a patient in the patient info table, you’re able to look into the details of that patient in the other tables as well. Below you can see a few rows from the Johns Hopkins dataset, and as you know, the Johns Hopkins dataset is used as the standard dataset for research and for reports. The Johns Hopkins dataset contains very general information such as sex, age, and the city where the patient was infected.
You can see that all of that information is also present in our dataset, in the columns highlighted blue, and what makes our patient information so potent is the columns highlighted in red. Let’s zoom into the four columns colored red. The first column is the infection route, which describes how each patient contracted the disease. For example, the infection case of the first-row patient is Seongdong-gu apartments. Seongdong-gu apartments is a single mass outbreak that took place in an apartment complex in a district called Seongdong-gu, so we know that this patient became infected from this specific outbreak. Beyond this specific case, there are numerous outbreaks and many different ways a person can become infected, so we created an appropriate label for each infection route so analyses can take into consideration how each specific individual became infected.
Let’s look at just one more: the infection case of the last-row patient says “contact with patient”. This one is more special, because for each patient with this label you’re able to identify which specific individual the patient contracted the disease from. In this case, we see that the patient ID of the second-row patient matches the infector ID, which means the last-row patient contracted the disease from the second-row patient. Besides the infection route, we know the total number of people who came into contact with each patient before they were tested and isolated, which tells you how much each patient contributed to the spread of the disease. Lastly, we have the symptom onset date, which tells the date the patient first displayed symptoms of COVID-19, and this is also one of the more informative columns. In fact, we conducted research with Harvard and published a paper in the International Journal of Infectious Diseases, which we’ll take a look at later. In that research, the symptom onset date plays a key role in explaining Korea’s success in suppressing COVID-19.
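The patient ID threading described above makes the tables straightforward to join. Below is a minimal sketch of resolving a “contact with patient” label to the infector’s own record via a self-join; the column names follow the Kaggle release of the dataset, but the rows are entirely made up for illustration:

```python
import pandas as pd

# Hypothetical rows mirroring the patient info table described above;
# the values are invented, only the column names follow the dataset.
patient_info = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "infection_case": ["overseas inflow", "Seongdong-gu APT", "contact with patient"],
    "infected_by": [None, None, "P002"],
    "contact_number": [75, 31, 2],
    "symptom_onset_date": ["2020-01-22", "2020-02-27", "2020-03-01"],
})

# Self-join on patient_id to attach each infector's own record,
# turning "contact with patient" labels into concrete source patients.
linked = patient_info.merge(
    patient_info,
    left_on="infected_by",
    right_on="patient_id",
    how="left",
    suffixes=("", "_infector"),
)

print(linked.loc[2, "infection_case_infector"])  # Seongdong-gu APT
```

The same pattern extends to joining the patient route or case tables, since every table keys on the same patient ID.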
There are other tables I also want to look at, such as the mass outbreak cases data or the policy data, but I want to spend a bit more time talking about the patient route data, because in my opinion that table is the most astounding of all that we have. First, let’s take a look at how it’s made. The patient route data is hand engineered by Korean government officials, who analyze each individual patient’s credit card history, cell phone GPS locations, and even public surveillance camera footage, so the data is extremely precise and extremely accurate. For each patient, and for every place the patient visited after becoming infected, we know the person’s GPS location, the type of facility they were using at that location, the number of people who came into contact with the patient at that particular place, whether the contacted people became infected or not, the time of the visit, whether the infector was wearing a mask, and the type of transportation they took.
Imagine what you could build with this data. For instance, if you combine the patient route data and the patient info data, you can create a causal graph where each node is a patient and each edge represents transmission from one patient to another, and for each patient we know where they got infected, how many other people were at the site, the date and time of the visit, whether the infector was wearing a mask, and all the details. Now Vini will introduce some of the use cases of the data.
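As a sketch of that graph idea, here is how one might build the transmission graph from `(patient, infected_by)` pairs using only the standard library; the IDs and links are invented for illustration:

```python
from collections import defaultdict

# Hypothetical (patient_id, infected_by) pairs in the shape of the
# patient info table; None means the source is unknown or overseas.
records = [
    ("P001", None),
    ("P002", "P001"),
    ("P003", "P001"),
    ("P004", "P002"),
]

# Build a directed transmission graph: edge infector -> infectee.
children = defaultdict(list)
for patient, infector in records:
    if infector is not None:
        children[infector].append(patient)

# Collect all direct and indirect infections downstream of one patient.
def downstream(patient):
    stack, seen = [patient], set()
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(downstream("P001"))  # the set {'P002', 'P003', 'P004'}
```

Joined with the route table, each edge could additionally carry the place, time, and mask status of the contact.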
Vini Jaiswal: Thanks, Isaac. Isaac showed us what the dataset is, gave information about each table and the causal inference, and showed how the tables can be joined together. Now let’s walk through one of the use cases. South Korea’s number of infections was once the second highest after Wuhan, but it has since fallen below the top 30.
As a research analyst, a few questions come to mind. How did the virus travel from the main hub to other countries? We immediately think about borders and routes, so having that data allows for understanding the inbound flow. Because of the latitude and longitude information that Isaac showed, we were able to identify the impact across South Korea’s provinces, like which region was affected the most, and it looks like Seoul was hit the hardest. As the virus was evolving and a lot of information was unknown to us at home or around the world, all we could do was rely on the media, the news, and internet searches. So let’s look at some of the internet searches that were trending: cold, flu, pneumonia, coronavirus. These keywords remained at the top of internet searches around the beginning of January. That was when we started learning about the existence of COVID-19 from Wuhan, and right around mid-March the searches spiked again; this was when the virus started spreading in Italy, Europe, and the U.S., among other countries.
Now we needed to understand how the virus traveled from Wuhan to other countries and started causing infections, so we wanted to understand the causes of the spread. This is where having the route data is critical. From the route data in this dataset we could determine that the first COVID case in South Korea likely originated from the airport or public transportation, right around January 20th. This table shows the first few cases that originated in South Korea. Isaac also covered some causal inferences on how some of the hubs became major hubs at that time, and this word cloud shows the most impacted regions, those with the highest number of cases at the time. Also, President Moon had declared a couple of them special disaster zones.
While the route data helps us understand the flow of cases, we now need to understand how to prevent the virus from spreading. As a researcher or drug developer, it’s extremely important to have access to patient information so that we can understand transmission, symptoms, contraction of the disease, and the life cycle of the virus. This is what makes DS4C unique: hospitals in South Korea were mandated to capture this information, and the South Korean government made it publicly available. It was helpful for understanding the reasons for the spread of the disease. Infection reasons were available in the data, giving us a good understanding of the top spread reasons. Contact with patients and overseas inflow pop out from these two graphs, and that sounds about right; it resonates with the general understanding that infectious diseases spread through patient contact and arrive with travelers.
To prevent the disease, we also needed to understand what the symptoms are and how it can transmit to other people. Mitigating its impact was really important because the virus was spreading widely, so this understanding could support drug discovery, testing, patient care, resource allocation, and much more, because there was an influx of cases and we were exhausting hospital capacity. This bar chart shows clusters starting from the end of February, resulting in increased alerts about the disease. There was the specific case of the 31st patient, known to be a super-spreader, causing cases to surge in Daegu.
Isaac also covered some of the cases that were the causes of outbreaks. The mass infection of Shincheonji church followers was all over the news, increasing cases uncontrollably, and it became the center of intense scrutiny. Because of these massive outbreaks the South Korean government started enforcing policies and administrative orders: special immigration procedures were launched, borders were locked down, there was a special policy on household lockdowns, diagnostic kits were launched, patients started social distancing, people wore masks, and the South Korean government also launched wristbands given to people to notify them of any potential exposure in their area. This first wave slowly subsided through April, as COVID-19 started coming under control with all the isolation and governmental measures. On May 11th, there were 79 cases linked to one of the clubs in Itaewon, leading to the policy of closing clubs and karaoke bars.
Now, if we understand all the route data and the patient data, we can do further analysis on patient demographics: who is most likely to get the disease? This is critical, and it was one of the things scientists wanted to understand, like who is likely to be impacted more or less, and whether age or gender plays any role. In other use cases, this data can further help in conducting mathematical and numerical analysis for COVID-19. A sample time-dependent SIR model, which the team put together, adds overseas-infected cases as a new variable, alongside the transmission rate estimated from previous patients. Could such new factors help us understand the mutation? Many organizations have partnered with us and with DS4C to publish research, which Isaac will cover, and he is also going to cover the data engineering behind it, because it takes a lot of effort to put this data together. So let me hand it back to Isaac.
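To make the modeling idea concrete, here is a minimal discrete-time SIR sketch with a time-dependent transmission rate; the population size, rates, and intervention day are illustrative assumptions, not values fitted to the DS4C data:

```python
# A minimal discrete-time SIR model with a time-dependent transmission
# rate beta(t); all parameter values below are illustrative only.
def simulate_sir(S, I, R, beta_of_t, gamma, days):
    history = [(S, I, R)]
    N = S + I + R  # total population is conserved
    for t in range(days):
        new_infections = beta_of_t(t) * S * I / N
        new_recoveries = gamma * I
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
        history.append((S, I, R))
    return history

# Transmission drops after a hypothetical intervention on day 30
# (e.g. social distancing and mask mandates).
beta_of_t = lambda t: 0.3 if t < 30 else 0.1

history = simulate_sir(S=51_000_000, I=100, R=0,
                       beta_of_t=beta_of_t, gamma=1 / 14, days=120)
peak_infected = max(i for _, i, _ in history)
```

A route-data-informed variant could additionally feed an imported-cases term into `I` each day, which is the kind of extension the overseas-inflow variable suggests.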
Isaac Lee: All of the data our dataset contains was actually published by the Korean government, but surprisingly, not even the Korean government has a curated dataset with all the information in a single, centralized, organized place. That may be surprising, but if you look into how the data was published, collected, and engineered, you’ll understand why. First, the data about patients was distributed over 100 different municipal districts, and each one only collected, or had access to, data about its own city or district’s people, which means there wasn’t a single centralized source of publication. Instead, each of the 100 districts posted their patient information on their own websites, and obviously they’re not going to have a single unified format, because why would they?
So we weren’t able to use a single automated method to collect the data; it required us to go through every one of those websites. And because this data was so sensitive, the districts weren’t planning on leaving it up for long: they sporadically deleted the data anytime between three and seven days after publication. That means we had to be alert at all times in order to collect the data before it got deleted, and we had to check over 100 different websites every day.
And lastly, all the data we collected was not in a coherent, clean format. Instead, as you can see on the slide, it came in these very long, non-uniform sentences of natural language, so even after data collection, we had to parse the important fields out of these sentences. Given the sheer amount of manual labor that goes into this, it makes sense, and is even understandable, that the Korean government doesn’t have this dataset. The reason we were able to build it is that over 20 data engineers, all with full-time jobs, contributed without being paid and were willing to sacrifice hours of their time per week to do this manual labor.
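To give a flavor of that parsing step: the real postings were free-form Korean sentences, but a toy English stand-in shows the kind of pattern extraction involved. The sentence format and field names here are invented for illustration, not taken from the actual postings:

```python
import re

# A made-up English stand-in for the kind of free-form sentence the
# municipal districts posted about each patient.
raw = "Patient #23 (male, 50s) visited Mall A in Seongdong-gu on 2020-03-02 by subway."

pattern = re.compile(
    r"Patient #(?P<id>\d+) \((?P<sex>\w+), (?P<age>\d+)s\) visited "
    r"(?P<place>.+?) in (?P<district>[\w-]+) on (?P<date>\d{4}-\d{2}-\d{2}) "
    r"by (?P<transport>\w+)\."
)

m = pattern.match(raw)
record = m.groupdict() if m else None
print(record["place"], record["date"])  # Mall A 2020-03-02
```

In practice every district worded its postings differently, so no single pattern sufficed, which is why so much of the extraction ended up being manual.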
Now I’ll go over a couple of papers that DS4C has published. The first, mentioned before, was research with Harvard. We analyzed the infection time and when each patient first showed symptoms in order to identify why Korea was so successful in containing the disease, and it turns out that, on average, because test kits were so widely available, each infected person was able to get tested a single day after first showing symptoms. That means they had a window of only about a single day before they were isolated, and that was one of the key points we were able to identify by analyzing our datasets.
Another publication was at NeurIPS 2020. We published at NeurIPS a very small snippet of the dataset we had already published on Kaggle, and the reason we were only able to publish a small subset was that we wanted to make sure we completely anonymized the data before publishing it as an academic paper. Throughout the entire process, Vini and Denny from Databricks provided significant mentorship, so I just want to call them out and say thanks again to both of them.
As I’ve mentioned, all of the research and engineering was done by a nonprofit organization, completely by researchers with full-time jobs who weren’t paid at all. This was only possible because individuals reached out to us and gave up their personal time to contribute to this project. We have one last task in publishing the dataset, and that is data anonymization. The reason we emphasize the value of the open source community is that at every step, at every hurdle we faced, we were barely able to get over it with the help of people who independently reached out to us.
And we have another hurdle in front of us, which is anonymization. Because the data we created and published was derived from raw data that is so precise and so sensitive, we want to make sure that our dataset is anonymized perfectly, and one of the route tables that I talked about and emphasized is actually not available to the public just yet, because it is the most sensitive data.
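As a hedged illustration of what anonymization can look like for data this precise (this is a generic sketch, not the actual DS4C pipeline): one common approach is to generalize values so each record identifies a coarse group rather than an individual, for example truncating GPS coordinates to a roughly 1 km grid and bucketing ages into decades:

```python
# Generic generalization helpers for location and age privacy; a
# simplified sketch only, not the actual DS4C anonymization pipeline.
def coarsen(lat, lon, decimals=2):
    # Two decimal places is roughly 1.1 km of latitude resolution,
    # so a point identifies a grid cell rather than an exact address.
    return round(lat, decimals), round(lon, decimals)

def generalize_age(age, width=10):
    # Report "50s" instead of an exact age (k-anonymity-style bucketing).
    return f"{(age // width) * width}s"

print(coarsen(37.561634, 126.994607))  # (37.56, 126.99)
print(generalize_age(53))              # 50s
```

Real route data is harder than this sketch suggests, since a sequence of visits can still re-identify someone even after each point is coarsened, which is why the route table remains unreleased.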
So if you would like to contribute to the anonymization and engineering of the datasets as a member of our team, please feel free to reach out to us at any time. This is my personal email, and I would be really happy to hear from you. Thanks so much for joining, and I’ll pass it back to Vini.
Vini Jaiswal: Thanks, Isaac. So you see the value of the open source community and of people devoting their personal time. It took us a fair amount of courage to speak about this topic today. While we can help support the issue through multiple channels, it is also important to think about how we can resolve this for the long term. These problems can’t be effectively solved without data, nor can they be solved by a single organization. As data people, it’s our responsibility to ensure that the right data is broadly available and actionable, so that data teams and researchers around the globe can do their best work. We are in this together. If you have a research or data effort that you would like to collaborate with us on, please send it our way. Here is our contact information. If you have any feedback for this session, please leave it in the session feedback. Thank you all.
Vini Jaiswal is a Senior Developer Advocate at Databricks, where she helps data practitioners to be successful in building on Databricks and open source technologies like Apache Spark, Delta, and MLfl...
Isaac Lee is a software development team lead at Mindslab and the chief director for DS4C (Data Science for Covid). He is also pursuing a BS in computer science at Carnegie Mellon University. Isaac...