CORD-19: The Data Science Response to COVID-19

An early infection map from 

An early infection map from JHU

The world currently stands united in solidarity against a common opponent: COVID-19, a virus also referred to simply as coronavirus. In just a matter of weeks, the virus has spread across the world at an exponential rate (duh, that's how it works). Because many members of the data science community are now home-based for most of their time, wondering how their skills can be used to help eradicate the virus. There are many ways, but there is an especially meaningful way that has come straight from the White House itself: CORD-19.


Before we get to that, let's take a step back and ensure we are all familiarized with how viruses spread, time and time again.

Kevin Simler of MeltingAsphalt has built a great series of interactive figures, which show how any virus spreads (code here.

John's Hopkins University (JHU) has stepped up to help big-time, providing ongoing tracking of a number of analyses across the world. Click through their plots below, showing global cases, deaths, and recoveries over time.

They've also created a stunning dashboard, which includes an ArcGIS map of confirmed cases. The tables to either side of the map tabulate cases and deaths when broken out by country. For readers in the US, take note of the density of the cases in Europe. Experts are saying that the US has not done enough to prevent this density from occurring here in 1-2 weeks. Poke around the figures below, it's all interactive within this page:

Ok, back to viruses. We're all refreshed on how they spread via exponential growth (to be pedantic, it's actually logistic growth). For a visual explanation of this process playing out with COVID-19, look no further than Grant Sanderson's amazing 3Blue1Brown channel:

Grant Sanderon's 3Blue1Brown discussion of the COVID-19 breakout

CORD-19 Overview

I imagine you are sufficiently motivated by now. Those who are not are likely still playing with the maps above. Regardless, I am glad you are motivated because we now have work to do.

The White House has issued a call to action to the technology community to derive insights from a vast body of coronavirus literature. The data (CORD-19) is mostly text articles about various coronaviruses. Therefore, it calls most strongly to the Natural Language Processing (NLP) community for insights. Everyone, prepare yourselves for the coming influx of Sesame-Street-themed model names coming from this: CORDBERT, BERTID19, CoronaBERT, etc.

At this point, many data science veterans will scoff and gripe about how most of their time would be spent preparing the data as opposed to using it. I have good news: CORD-19 is machine-readable right out of the box.

[Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University's Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the ]{style="font-size: inherit;"}COVID-19 Open Research Dataset (CORD-19)[ of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.]{style="font-size: inherit;"}

Requested by The White House Office of Science and Technology Policy, the dataset represents the most extensive machine-readable Coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text.

Join the Fight

COVID-19 Molecule

COVID-19 Molecule source

Kaggle, a data science community now under Google's umbrella, has formed official challenges from the call to action. Here's your step-by-step process to get up and running:

  1. Read the main page for the CORD-19 data.
  2. Review the tasks for the main Kaggle challenge.
  3. Review existing analyses made by other users.
  4. Discuss findings with other users.
  5. Participate in the Week 1 forecasting challenge. Consider using a DeepNote notebook for analysis or AutoML to automatically train a model.