Nowcasting the current size of the COVID-19 outbreak in the United States

Nowcast overview

At any given time, most COVID-19 cases are circulating in the community and not known to us. This means it is difficult to estimate the total number of infected individuals.

We would like to know the following things for any given point in time up to the present:

the number of unnotified symptomatic SARS-CoV2 infections at any given time;
the number of unnotified latent SARS-CoV2 infections at any given time;
the total number of unnotified SARS-CoV2 infections at any given time;
the cumulative number of SARS-CoV2 infections since the beginning of the epidemic.

These pieces of information constitute our Nowcast.

Here we detail a method for accomplishing this from either of two sources of data:

new case reports (the number of newly confirmed cases on a given day), or
new fatality reports (the number of fatalities reported on a given day)

Estimating from case reports, informed by an estimate from fatalities

Case reports do not include infected persons who are symptomatic but not yet diagnosed, nor individuals with latent infections (those who carry the virus but are not yet symptomatic). If testing were perfect and universal, we would detect 100% of cases. In reality, we miss many because of imperfect surveillance. The ratio of detected cases to the true number of cases is the ascertainment rate, which can vary over time.

If we know the ascertainment rate, as well as the distribution of times from infection to symptom onset and from symptom onset to case reporting, we can use case reports to estimate infection times for those cases. That gives us information about the number of latent and symptomatic infections present at some point in the past. We can forecast those curves from that point in the past up to the present to derive an estimate of the number of infections present now. This is the “Nowcast.”

To estimate the size of the outbreak at any given time from case reports, we first adjust the number of new cases for ascertainment rate, then estimate the number of symptomatic and latent infections in the community from the adjusted new case count. The estimates of symptomatic and latent infections are only valid some time before the present, so we forecast from that point to the present. At any time, the sum of symptomatic and latent infections is our estimate of the total number of unnotified individuals currently infected with SARS-CoV2.

To find the ascertainment rate, we reasoned that fatal cases are much less likely to be overlooked than non-fatal cases (particularly asymptomatic cases and cases with mild symptoms). The ascertainment rate for fatalities can be assumed to be very high, since individuals who die with COVID-19-like symptoms will inevitably be tested, either during hospitalization or post-mortem. For simplicity, we assume the ascertainment rate for fatalities is 100%. We construct a second nowcast starting from reports of fatalities instead of cases, using the distribution of time from symptom onset to death instead of symptom onset to case reporting. Nowcasting from fatalities does not require a correction for ascertainment, but does require knowledge of the infection fatality rate (IFR), estimated to be around 0.9%. By comparing the total number of infections in the U.S. estimated from case reports (without correction) to the total number of infections in the U.S. estimated from fatalities, we can determine the ascertainment rate for case reports for the U.S. We then go back and correct our nowcast from case reports by the ascertainment rate to produce a more accurate nowcast. We can apply the same correction to the Nowcast for individual states, even if a state has few or no fatalities.

Why not simply Nowcast from fatalities?

Although we consider reports of fatalities to be more accurate than case reports, we consider case reports to be more sensitive to changes in infection rate that may arise due to social distancing and other interventions. This is in part because the average time delay from infection to death is much greater than the delay from infection to case reporting, so changes in the trends show up in case reports first. By correcting case reports by ascertainment rate informed by fatalities, we hope to combine the advantages of both sources of data.

Forecasting method

To forecast the number of latent and symptomatic infections from a point in the past forward to the present, we fit the curve using a time-varying autoregressive model, and then forecast the fitted curve using an exponential smoothing (ETS) model.
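As a rough illustration of the forecasting step, the sketch below implements Holt's additive-trend exponential smoothing, one simple member of the ETS family. It is not the model configuration used for the Nowcast: the autoregressive fitting step is omitted, and the function name `holt_forecast` and the smoothing weights `alpha` and `beta` are illustrative assumptions.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Forecast `horizon` steps ahead with Holt's additive-trend smoothing.

    alpha and beta are illustrative smoothing weights (assumptions),
    not values taken from the Nowcast itself.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialize the level with the first value and the trend with the first difference.
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead prediction.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Extrapolate the final level along the final trend.
    return [level + (h + 1) * trend for h in range(horizon)]


# Hypothetical curve of symptomatic infections, truncated some days before the present.
truncated_curve = [120, 150, 185, 230, 280]
forecast_to_present = holt_forecast(truncated_curve, horizon=3)
```

In practice the smoothing weights would be estimated from the data rather than fixed by hand.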

Algorithm details

From case notification reports (the number of newly confirmed cases on each day), we use Monte Carlo simulation and estimated probability distributions to construct a synthetic line list of symptom onset times for all cases that are eventually notified. The average delay from symptom onset to case notification for the U.S. is 5.95 days¹. Simulated symptom onset times are distributed according to a gamma distribution with a mean of 5.95 days before case notification (shape parameter: 2.2788163).

A second simulation generates synthetic exposure dates for each infected individual. The average incubation period for the U.S. is 5.89 days². Individual simulated exposure times are distributed according to a gamma distribution with a mean of 5.89 days before symptom onset (shape parameter: 2.4265511).

Now we have a synthetic line list that generates the observed number of case notifications and includes exposure date, symptom onset date, and notification date.
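The two simulation steps can be sketched as follows. This is a minimal illustration using the means and shape parameters quoted above. It assumes the two delays are drawn independently and that each notification time is the index of the reporting day; the function and field names are hypothetical.

```python
import random

# Means (days) and gamma shape parameters quoted in the text.
ONSET_TO_NOTIFICATION_MEAN = 5.95
ONSET_TO_NOTIFICATION_SHAPE = 2.2788163
INCUBATION_MEAN = 5.89
INCUBATION_SHAPE = 2.4265511


def gamma_delay(rng, mean, shape):
    # random.gammavariate takes (shape, scale); for a gamma, scale = mean / shape.
    return rng.gammavariate(shape, mean / shape)


def build_line_list(new_cases_by_day, seed=0):
    """Return one record per notified case with simulated onset and exposure times."""
    rng = random.Random(seed)
    line_list = []
    for day, count in enumerate(new_cases_by_day):
        for _ in range(count):
            notified = float(day)
            # Step 1: symptom onset precedes notification by a gamma-distributed delay.
            onset = notified - gamma_delay(
                rng, ONSET_TO_NOTIFICATION_MEAN, ONSET_TO_NOTIFICATION_SHAPE)
            # Step 2: exposure precedes onset by a gamma-distributed incubation period.
            exposure = onset - gamma_delay(rng, INCUBATION_MEAN, INCUBATION_SHAPE)
            line_list.append(
                {"exposure": exposure, "onset": onset, "notified": notified})
    return line_list


# Hypothetical daily case notifications for three consecutive days.
line_list = build_line_list([3, 5, 8])
```

Times are measured in days relative to the first reporting day, so simulated onset and exposure times are often negative.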

From the line list, we tally the number of individuals with latent infections on each date to produce a curve of the number of latent cases in the community. We do the same to find the number of individuals with symptomatic infections on each date.
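The tally can be sketched as below. The convention that a case is latent from exposure up to symptom onset, and symptomatic from onset up to notification, is an assumption made for this illustration, as are the function and field names.

```python
def tally_states(line_list, first_day, last_day):
    """Count latent and symptomatic (unnotified) cases on each integer day."""
    latent, symptomatic = [], []
    for day in range(first_day, last_day + 1):
        # Latent: exposed on or before this day, symptoms not yet begun.
        latent.append(
            sum(1 for c in line_list if c["exposure"] <= day < c["onset"]))
        # Symptomatic: symptoms begun, case not yet notified.
        symptomatic.append(
            sum(1 for c in line_list if c["onset"] <= day < c["notified"]))
    return latent, symptomatic


# Two hypothetical line-list records (times in days).
example_line_list = [
    {"exposure": 0.5, "onset": 3.2, "notified": 6.0},
    {"exposure": 2.1, "onset": 4.9, "notified": 9.0},
]
latent_curve, symptomatic_curve = tally_states(example_line_list, 0, 9)
# latent_curve      -> [0, 1, 1, 2, 1, 0, 0, 0, 0, 0]
# symptomatic_curve -> [0, 0, 0, 0, 1, 2, 1, 1, 1, 0]
```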

Since the symptom onsets occur on average 5.95 days before reporting, we are only sure to have an accurate estimate for the number of symptomatic individuals at some time prior to the present. We therefore discard estimates for the last several days (5.95 days plus a margin of 2 standard deviations), and use statistical forecasting to extrapolate from the observed trajectory of symptomatic cases to the present time. This is the “Nowcast” of symptomatic individuals.
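To give a sense of the size of this window, the sketch below computes the mean plus two standard deviations. It assumes that the standard deviation in question is that of the gamma delay distribution, for which the variance equals the squared mean divided by the shape parameter. The function names are illustrative.

```python
import math


def gamma_sd(mean, shape):
    # For a gamma distribution, variance = mean**2 / shape.
    return mean / math.sqrt(shape)


def discard_window(mean, shape, n_sd=2):
    """Days to discard: the mean delay plus n_sd standard deviations."""
    return mean + n_sd * gamma_sd(mean, shape)


# Onset-to-notification delay (mean 5.95 days, shape 2.2788163).
symptomatic_window = discard_window(5.95, 2.2788163)
# Incubation period (mean 5.89 days, shape 2.4265511), used for latent cases.
latent_window = discard_window(5.89, 2.4265511)
```

Under these assumptions the windows come to roughly 13.8 days for symptomatic cases and 13.5 days for latent cases.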

We repeat the process for latent cases, discarding values for times within 5.89 days (plus two standard deviations) of the last good estimate of symptomatic individuals, and forecasting to the present. This is the “Nowcast” of latent individuals.

The sum of the latent and symptomatic cases at any time provides an estimate of the total number of cases (including asymptomatic and unnotified cases).

From fatality reports (the number of newly reported deaths on each day), we use Monte Carlo simulation and estimated probability distributions to construct a synthetic line list of symptom onset times for all cases that eventually result in a reported death. The average delay from symptom onset to death, estimated from data in the China outbreak outside Hubei, is 16 days³. Individual simulated symptom onset times are distributed according to a gamma distribution with a mean of 16 days before death (shape parameter: 4).

Based on data from China, and correcting for demographics in Great Britain, Ferguson et al. estimated an Infection Fatality Rate (IFR) of 0.9% (95% CrI: 0.4%–1.4%)⁴. To account for IFR, we multiply the fatalities by \(\frac{1}{0.9\%}\) before simulating the line list of symptom onset times.
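The IFR scaling amounts to a single division, sketched below. Re-running it with the bounds of the credible interval is an illustrative extension for showing sensitivity, not a step described in the method. The function name is hypothetical.

```python
IFR_MEAN = 0.009                  # 0.9%, Ferguson et al.
IFR_LOW, IFR_HIGH = 0.004, 0.014  # 95% CrI bounds quoted in the text


def implied_infections(daily_deaths, ifr=IFR_MEAN):
    """Scale reported deaths by 1 / IFR to get the implied number of infections."""
    return [deaths / ifr for deaths in daily_deaths]


# Hypothetical daily fatality reports.
daily_deaths = [9, 18, 27]
central = implied_infections(daily_deaths)         # about 1000, 2000, 3000
upper = implied_infections(daily_deaths, IFR_LOW)  # a lower IFR implies more infections
lower = implied_infections(daily_deaths, IFR_HIGH)
```

With an IFR of 0.9%, each reported death stands for roughly 111 infections.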

The resulting estimates of latent and symptomatic individuals are intended to represent all unnotified cases.

To find the time-varying ascertainment rate for cases, we calculate the Nowcast (sum of latent and symptomatic infections over time) from case notifications, and the Nowcast from fatalities (corrected for IFR). The mean Nowcast from uncorrected case reports divided by the mean Nowcast from corrected fatalities gives the mean ascertainment rate. We also calculate upper and lower bounds for the ascertainment rate by comparing the upper and lower bounds of the two Nowcasts. However, we only use the mean ascertainment rate to correct the final Nowcast from case reports.
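A minimal sketch of this ratio and of the correction follows. It reads "mean Nowcast" as the central estimate on each day and takes the ratio day by day, which is one interpretation of the description above. The function names and the numbers are hypothetical.

```python
def ascertainment_rate(case_nowcast, fatality_nowcast):
    """Daily ratio of the uncorrected case nowcast to the IFR-corrected fatality nowcast."""
    return [
        cases / fatal if fatal > 0 else None  # undefined when there are no implied infections
        for cases, fatal in zip(case_nowcast, fatality_nowcast)
    ]


def correct_nowcast(case_nowcast, rates):
    """Divide an uncorrected case nowcast by the ascertainment rate."""
    return [
        cases / rate if rate else None
        for cases, rate in zip(case_nowcast, rates)
    ]


# Hypothetical national nowcasts (total infections on two days).
national_from_cases = [200, 400]
national_from_fatalities = [1000, 1600]
rates = ascertainment_rate(national_from_cases, national_from_fatalities)

# The national rate applied to a hypothetical state nowcast from case reports.
state_corrected = correct_nowcast([10, 20], rates)
```

In this example the rates are about 0.2 and 0.25, so the state nowcast of 10 and 20 infections is corrected to about 50 and 80.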

Effective reproductive number \(R\)

The effective reproductive number \(R\) is the number of new secondary infections per infectious individual. If we consider only unnotified, symptomatic cases to be infectious, then at each time \(t\),

\[ R_t = \frac{\frac{\Delta E_+}{\Delta t}}{I_t} \frac{1}{\gamma_t} \] where \(\frac{\Delta E_+}{\Delta t}\) is the number of individuals newly exposed over one day, \({I_t}\) is the total number of infectious individuals on day \(t\), and \(\frac{1}{\gamma_t}\) is the mean delay from symptom onset to case notification, which may be time-varying.

We calculate both \(\frac{\Delta E_+}{\Delta t}\) and \({I_t}\) directly from the synthetic line list. \(\frac{1}{\gamma_t}\) is estimated from U.S. case data⁵.
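The formula translates directly into code, as in the sketch below. It assumes a constant mean delay of 5.95 days for \(\frac{1}{\gamma_t}\), although the text allows this to vary over time. The function name and the input numbers are hypothetical.

```python
def effective_r(new_exposed_per_day, infectious, mean_delay_days):
    """R_t = (new exposures per day / infectious individuals) * mean infectious period."""
    return [
        (exposed / infected) * mean_delay_days if infected > 0 else None
        for exposed, infected in zip(new_exposed_per_day, infectious)
    ]


# Hypothetical tallies taken from a synthetic line list.
new_exposed = [20, 30]   # individuals newly exposed on each day
infectious = [100, 200]  # unnotified symptomatic individuals on each day
r_t = effective_r(new_exposed, infectious, mean_delay_days=5.95)
```

Here the first day gives 20 / 100 × 5.95 = 1.19 and the second gives 30 / 200 × 5.95 ≈ 0.89, so the epidemic is growing on the first day and shrinking on the second.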

As with the number of latent cases, we only have a good estimate of \(\frac{\Delta E_+}{\Delta t}\) up to a point 5.89 days (plus two standard deviations) before our last good estimate of the number of symptomatic individuals. We therefore forecast from that point forward to obtain an estimate of \(R\) at times closer to the present.

Data sources

U.S. case and fatality reports by state

On December 17, 2020, we switched our data source for the U.S. nowcast to the COVID-19 Data Repository maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University⁶.

We use the following time series of confirmed cases:

time_series_covid19_confirmed_US.csv

We use the following time series of reported deaths:

time_series_covid19_deaths_US.csv
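These files hold cumulative counts, with one row per county-level location and one column per date, whereas the Nowcast starts from daily new reports. The sketch below shows one way to aggregate rows by state and difference the totals. It assumes the `Province_State` column and `m/d/yy` date headers used in the CSSE files; the function names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def is_date_column(name):
    """True for headers such as '1/22/20'; False for metadata columns."""
    try:
        datetime.strptime(name, "%m/%d/%y")
        return True
    except ValueError:
        return False


def daily_new_by_state(handle):
    """Sum cumulative counts by state, then difference them into daily new counts."""
    reader = csv.DictReader(handle)
    date_cols = [c for c in reader.fieldnames if is_date_column(c)]
    cumulative = defaultdict(lambda: [0] * len(date_cols))
    for row in reader:
        totals = cumulative[row["Province_State"]]
        for k, col in enumerate(date_cols):
            totals[k] += int(float(row[col]))
    daily = {}
    for state, totals in cumulative.items():
        # The first day keeps its cumulative value; later days are differences.
        daily[state] = [totals[0]] + [b - a for a, b in zip(totals, totals[1:])]
    return date_cols, daily


# Made-up sample in the same layout (most metadata columns omitted).
SAMPLE = """Admin2,Province_State,1/22/20,1/23/20,1/24/20
Clarke,Georgia,0,2,5
Fulton,Georgia,1,3,3
"""
dates, new_cases = daily_new_by_state(io.StringIO(SAMPLE))
# new_cases["Georgia"] -> [1, 4, 3]
```

Differences can be negative when a cumulative count is revised downward, so real data may need additional cleaning before use.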

Prior to December 17, 2020, we used daily case and fatality reports for U.S. states and territories compiled from https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_pandemic_data/United_States_medical_cases, which is updated by anonymous contributors. Data from the Wikipedia tables are downloaded and cleaned daily and made publicly available in CSV format at https://github.com/CEIDatUGA/COVID-19-DATA. We continue to update daily.

¹ Analysis by the Center for the Ecology of Infectious Diseases of U.S. data. Parameters updated periodically. See supplemental information.
² Analysis by the Center for the Ecology of Infectious Diseases of U.S. data. Parameters updated periodically. See supplemental information.
³ Kenji Mizumoto, Gerardo Chowell. “Estimating the risk of 2019 Novel Coronavirus death during the course of the outbreak in China, 2020” medRxiv preprint. https://doi.org/10.1101/2020.02.19.20025163
⁴ Neil M Ferguson, et al. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand. March 16, 2020, Imperial College COVID-19 Response Team. https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf
⁵ Analysis by the Center for the Ecology of Infectious Diseases of U.S. data. Parameters updated periodically. See supplemental information.
⁶ Dong, E. et al. An interactive web-based dashboard to track COVID-19 in real time. Correction to Lancet Infect Dis. February 19, 2020. https://doi.org/10.1016/S1473-3099(20)30120-1