COVID-19: Who’s infected and how reliable is the data?

July Thomas
3 min read · Mar 21, 2020

As the virus spreads, data about the most heavily impacted areas is a crucial tool for preparedness. However, collecting it understandably takes lower priority than immediate medical action. Here, I try to parse the information hidden in the limited data available by investigating national and regional trends.

How infected is the US?

Every time I see raw case numbers, I have two thoughts. The first is, “How much of the population of that state is infected?” A number like “1000 new cases” means something different for New York than it would for a smaller state. A good place to begin analyzing this data is to compare it to state populations. By calculating the number of cases per million people, we get a better idea of how infected a state is.
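The normalization is a single division. Here is a minimal sketch in Python; the function name `cases_per_million` and the two example populations are illustrative assumptions, not figures from the data set:

```python
def cases_per_million(cases: int, population: int) -> float:
    """Return confirmed cases per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1_000_000 / population


# Hypothetical round numbers, chosen only to show the effect of normalizing:
# the same 1,000 cases in a state of 20 million and in a state of 1 million.
large_state = cases_per_million(1_000, 20_000_000)
small_state = cases_per_million(1_000, 1_000_000)
print(f"{large_state:.0f} vs {small_state:.0f} cases per million")
```

The same raw count is twenty times more concentrated in the smaller state, which is the difference the per-million figure is meant to expose.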

Compare with the best-contained country, South Korea (169 cases per million), and the most infected country during the pandemic stage, Italy (777 cases per million).

What immediately jumps out about the heavily infected states is that each contains a large metropolitan area. To test the idea that the spread of COVID-19 can be explained simply by how densely packed an area is, I compared case density (cases per million people) with population density across states and found little correlation (R² = 2.92%).
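For readers who want to reproduce this kind of check: for a simple linear fit, R² is the square of the Pearson correlation coefficient. The sketch below assumes that is how the figure above was obtained (the article does not say). The function name `r_squared` and the toy inputs are illustrative, not the state data.

```python
from math import fsum


def r_squared(xs: list[float], ys: list[float]) -> float:
    """Squared Pearson correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = fsum(xs) / len(xs)
    mean_y = fsum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = fsum((x - mean_x) ** 2 for x in xs)
    var_y = fsum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sample has no defined correlation")
    return cov * cov / (var_x * var_y)


# Toy inputs: one perfectly linear pair and one pair with no linear trend.
linear = r_squared([1, 2, 3, 4], [10, 20, 30, 40])
no_trend = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
print(f"linear: {linear:.2%}, no trend: {no_trend:.2%}")
```

A value near 0%, like the one reported here, means a straight line through the points explains almost none of the state-to-state variation.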

Who’s testing?

The other thought I have every time I see new data is, “Are we even testing enough for these numbers to mean anything?” If we aren’t testing all of the people showing symptoms, we must be missing some infected people in the data. To get an idea of testing coverage, I first looked at how many tests each state has administered per million people (except for the seven states that have provided no testing data).

Compare with South Korea (6,178 tests per million) and Italy (3,420 tests per million).

One big indication of under-testing would be if the number of positive tests rose with the number of tests administered. In other words, if the states that test more find more cases, then most of the country must be under-testing. It turns out there is some correlation (R² = 23.78%), indicating systemic under-testing, but this isn’t news.

Neither is it news that testing in the US is trailing that in other countries. The state furthest along in its response to COVID-19, Washington, is still behind even Italy in terms of testing coverage. New York, which is rapidly approaching the case density of Italy, is even further behind. Of course, no state in the US comes close to South Korea’s sterling example of pandemic response.

Now, testing isn’t necessary if no one is infected. If that were the case, we would expect only a small percentage of the tests administered to come back positive for COVID-19. Unfortunately, the data doesn’t back this up.

Compare with South Korea (2.73%) and Italy (22.73%).
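The positive test rate is the number of positive results divided by the number of tests administered. Because both national figures quoted earlier are per million people, the population cancels and the ratio can be taken directly. Here is a quick check in Python using the per-million numbers quoted in this article; the function name is hypothetical, and small differences from the percentages above presumably come from rounding in the inputs:

```python
def positive_rate_percent(cases_per_million: float, tests_per_million: float) -> float:
    """Percentage of administered tests that came back positive."""
    if tests_per_million <= 0:
        raise ValueError("tests_per_million must be positive")
    return 100 * cases_per_million / tests_per_million


# Per-million figures quoted earlier in the article.
south_korea = positive_rate_percent(169, 6_178)
italy = positive_rate_percent(777, 3_420)
print(f"South Korea: {south_korea:.1f}%  Italy: {italy:.1f}%")
```

The same ratio can be computed for any state that reports both case and test counts.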

States with low testing coverage, like New Jersey, Tennessee and Arizona, have high positive test rates compared to their neighbors. This suggests they are testing only the highest-probability cases and missing the less severe cases responsible for silent community spread.

In New Jersey specifically, where nearly 3 in 4 tests have come back positive, the scope of the pandemic is almost certainly underrepresented in the data. Even states like New York, where a more modest 1 in 4 tests return positive, are still trending above Italy, signalling an alarming testing deficiency.

As always, the wealthy places in the US will weather this storm. In six months, we will look back to find surprising devastation in places not currently making headlines. States that silently suffer, like those in the South which typically rank worst in healthcare, are particularly vulnerable to the virus and its comorbid economic contraction.

All data current as of 10pm GMT, Friday March 20, 2020.
