Reproducible Materials

All code and data needed to reproduce our results can be found in our public GitHub repository: https://github.com/CEIDatUGA/COVID-stochastic-fitting.


Introduction

Here we develop a model for the transmission of SARS-CoV-2 from March 2020 through the present in all states of the United States of America. The model builds on previous work, where we developed the model for Georgia, USA only. This model is being used for inference, forecasting, and scenario analysis.

Key features of this model include:

  1. Stochastic transmission process. Stochastic models are models with a realistic level of random variation. Stochastic models are essential for proper modeling of systems that start with a small number of infections.
  2. Realistic interval distributions for presymptomatic and symptomatic periods.
  3. Transmission is allowed at different rates for asymptomatic, presymptomatic, and symptomatic individuals.
  4. Time varying rates of case detection, isolation, and case notification.
  5. Time varying mortality rate.
  6. Realistic intervention scenarios.
  7. Affect of human mobility on transmission (i.e., social distancing).
  8. A latent process that allows transmission to vary over time due to environmental factors and other behavioral measures that can reduce transmission but are difficult to include with data (e.g., wearing face masks).

This model contains both fixed and fitted parameter values. Fixed parameters were defined using clinical outcome reports. Fitted parameter values were found by calibration to incident case and death reports, as described in more detail below. The pandemic of SARS-CoV-2 is still evolving and information that was used in the construction of this model may be incomplete or contain errors. Accordingly, these results are preliminary, provisional, and subject to change. These results have not been peer-reviewed, but have been prepared to a professional standard with the intention of providing useful interpretation of a rapidly developing event.

Methods

Data

We fit the model to incident case and death reports for each state in the USA, as collated by The COVID Tracking Project. We show a subset of the state-level data here for Georgia, New York, Washington, and Wyoming. We did not smooth the reported data but we did exclude all negative case and death reports.

In addition, we include a covariate that describes human mobility. These data come from Unacast. We smooth the raw data from Unacast using a spline fit, resulting in the trajectories of human movement shown below. This covariate is used to reduce baseline transmission. Here we show the mobility trends from all 50 states.

The Model

The model comprises susceptible, pre-symptomatic, asymptomatic, symptomatic, diagnosed, hospitalized, deceased, and recovered persons. The following compartments are included:

  • \(\boldsymbol{S}\) - Uninfected and susceptible individuals. Susceptible individuals can become infected by individuals in the \(L\), \(I_a\), \(I_{su}\), \(I_{sd}\), \(C\), and \(H\) stages. Rates of transmission from these stages can be adjusted individually.
  • \(\boldsymbol{L}\) - Individuals with latent infections who do not yet show symptoms. Those individuals can be infectious. At the end of the \(L\) stage, a fraction moves into the \(I_a\) stage, another fraction moves into the \(I_{su}\) stage, and the remainder into the \(I_{sd}\) stage.
  • \(\boldsymbol{I_a}\) - Individuals who are infected and asymptomatic. Those individuals are likely infectious, but the model allows to adjust this.
  • \(\boldsymbol{I_{su}}\) - Individuals who are infected and symptomatic, but are undetected. Those individuals are likely infectious. Individuals in this compartment never get diagnosed, and are assumed to recover.
  • \(\boldsymbol{I_{sd}}\) - Individuals who are infected and symptomatic, and are detected. Those individuals are likely infectious. Individuals in this compartment will be diagnosed and move to \(C\).
  • \(\boldsymbol{C}\) - Individuals who have been diagnosed as cases. Those individuals are likely isolated and not infectious, but the model allows to adjust this. A fraction of individuals in the \(C\) stage will naturally recover, without the need for hospitalization. The remainder moves into the \(H\) stage.
  • \(\boldsymbol{H}\) - Individuals who have been hospitalized. Those individuals are likely isolated and not infectious, but the model allows to adjust this. A fraction of individuals in the \(H\) stage will recover, the remainder will die.
  • \(\boldsymbol{R}\) - Recovered/removed individuals. Those individuals have recovered and are immune.
  • \(\boldsymbol{D}\) - Individuals who died from the infection.

To allow more realistic distributions of movement through compartments, several of these compartments are internally split into multiple stages using the linear chain trick.6

  • \(\boldsymbol{L}\) - 4 compartments
  • \(\boldsymbol{I_a}\) - 4 compartments
  • \(\boldsymbol{I_{su}}\) - 4 compartments
  • \(\boldsymbol{I_{sd}}\) - 4 compartments
  • \(\boldsymbol{C}\) - 4 compartments
  • \(\boldsymbol{H}\) - 4 compartments

The flow diagram for this model shown below.

We implemented the above model using the R package pomp version 2.7.1.07. All transitions among the compartments in our model were modeled using the Euler multinomial approximation (reulermultinom in pomp syntax) given the size of the “donating” compartment and the specified or fitted rate of transition.

Note that transmission rate at time \(t\) is \(\omega(t) = \frac{\beta}{N} \psi(t) \phi(t) \Gamma(t)\), resulting the force of infection \(f(t) = \omega(t) \left(I(t) \right)\), where, for simplicity, \(I\) stands for all infectious individuals at time \(t\). \(\phi(t)\) is the human mobiltiy metric shown in the figure above. \(\psi(t)\) is the result of a latent trend process that is modeled using a fitted spline:

\[ \text{logit}\left(\psi(t)\right) = \sum_{i=1}^K g_i \xi_{i_t}, \]

where \(K\) is the number of basis functions, \(\mathbf{g}\) is a vector of spline coefficients (to be fitted), and \(\mathbf{\xi}\) is a matrix basis functions. We define the number of basis functions (\(K\)) as the number of days in the data set divided by 10 (so, one function for every 10 days). Note the logit transformation to go from the linear scale to the 0 - 1 scale. We used cubic basis functions from the pomp function pomp::bspline_basis().

We also included process noise (\(\Gamma(t)\)) on the rate of tranmission (\(\omega(t)\)) at each time step. \(\Gamma(t)\) is modeled as gamma-distributed white noise (temporally uncorrelated noise) with mean 1 and variance \(\sigma^2\)8, which we fit.

Interventions enter our model in three ways. The first is through the augmentation of \(\beta\) by the mobility covariate (\(\phi\)) and the latent process (\(\psi\)). The second is through an increase in the fraction of exposed individuals that are detected and flow into the \(I_{sd}\) compartments. This fraction (\(q\)) starts at a low level of 0.1 and increases sigmoidally towards a maximum of 0.4, reaching the half way point of 0.25 on the 30th day since the first case notification.

Third, the time to diagnosis (\(1/s\)) decreases sigmoidally over time from 3 days on day 0 towards a minimum of 1.5 days, reaching the half way point of 2.25 days on the 30th day since the first case notification.

Both \(q\), \(s\) follow a Hill function with Hill coefficient 1.1 and time to half maximum of 30 days.

We assume that new, daily case reports arise from a negative binomial distribution whose central tendency is captured by the flow of individuals from \(I_{sd4}\) to \(C_1\) (see flow diagram above). Let this quantity be \(C_{\text{new}}\), which accumulates over the course of one day in the simulation model and resets to zero at the end of each day (the model is simulated at a time step of 1/20 days). Similarly, we assume that new, daily death resports arise from a negative binomial distribution whose central tendency is captured by the flow of individuals from \(C_4\) to \(D\). Let this quantity be \(D_{\text{new}}\), which accumulates over the course of one day in the simulation model and resets to zero at the end of each day. Then, for both new cases and deaths, we model the observation process as:

\[ \text{cases}(t) \sim \text{NB}\left(C_{\text{new}}(t), \theta_C \right), \text{and} \\ \text{deaths}(t) \sim \text{NB}\left(D_{\text{new}}(t), \theta_D \right), \]

where \(\theta_C\) and \(\theta_D\) are the negative binomial dispersion parameters for cases and deaths, respectively. Note that \(\text{cases}(t)\) and \(\text{deaths}(t)\) are the observed number of cases or deaths reported on day t, as shown in the data figure at the top of this document.

Model Fitting

We fit the model using Maximization by Iterated particle Filtering (MIF). Observations (daily case and death reports) were modeled as arising from a negative binomial reporting process (see above). We estimated six parameters for each state: baseline transmission rate (\(\beta\)), the time varying fraction of hospitalized cases that result in death (\(m(t)\)), a parameter accounting for extra-demographic process noise (\(\sigma\)), two negative binomial dispersion parameters (\(\theta_c\) and \(\theta_d\)), and the initial size of the latent infectious class (\(L_{1}(t=1)\). For the fraction of hospitalized cases that result in death \(m(t)\) we estimate the minimum and maximum values of a sigmoid function and the time to half of maximum. The start date for model simulation and estimation was defined as the data of the first case report in each state. In addition, we estimated the basis function parameters (\(g_i\)), whose number depend on the length of the time series at the time of fitting.

All other parameters were fixed at the following values:

Parameter Description value
beta_s Transmission rate 1.04e-06
frac_trans_e Relative transmissibility of latent infections 0.119
frac_trans_a Relative transmissibility of asymptomatic individuals 0.5
frac_trans_c Relative transmissibility of detected symptomatic individuals post-reporting 0.269
frac_trans_h Relative transmissibility of hospitalized individuals 4.54e-05
time_e Time spent in latent compartments (days) 4
time_a Time spent in asymptomatic compartments (days) 3.5
time_su Time spent in symptomatic, undetected compartments (days) 6
time_sd Time spent in symptomatic, detected compartments (days) 3
time_c Time spent in diagnosed cases compartments (days) 3
time_h Time spent in hospitalized compartments (days) 6
max_diag_factor Maximum for factor by which movement through Isd happens faster (quicker diagnosis) 2
diag_rampup Rate at which faster diagnosis ramps up to max 1.1
t_half_diag Time at which diagnosis is at 50% of max (in days since t = 1) 30
max_detect_frac Maximum fraction of cases that are detected 0.4
detect_rampup Speed at which fraction detected ramps up 1.1
t_half_detect Time at which infection detection fraction is at 50% of max (in days since t = 1) 30
base_detect_frac Minimum fraction detected at t = 1 0.1
frac_asym Fraction of latent infections that move to asymptomatic 0.182
frac_hosp Fraction of detected cases that are hospitalized 0.119
C1_0 Initial number of diagnosed cases on t = 1 1
H1_0 Initial number of hospitalized cases on t = 1 0
R_0 Initial number of recovered individuals on t = 1 0
D_0 Initial number of deaths after hospitalization on t = 1 0

Initial size of the susceptible compartment was set as each state’s population size minus the number of individuals in other the other compartments that we fix. It is true that the initial size of the susceptible pool will also decrease based on the number of inidividuals estimated to be in the latent infections compartment at \(t = 1\). However, given the small size of \(L_1(t = 1)\) relative to total population size and the fact that total population size is a point estimate with error, we assume that our simple approach of setting \(S(t = 1)\) to each state’s population size is valid. For each state, we consider \(t=1\) as the date on which the first case is reported in that state.

MIF relies on particle filtering, which estimates the likelihood of fixed parameters by integrating state variables of a stochastic system. To identify the maximum likelihood estimates, MIF lets parameters take a random walk during the filtering process and selectively propagates forward parameter sets (i.e., particles) with the highest likelihood. The variance of the random walk decreases at each iteration of MIF, where a MIF iteration means one filtering pass through the time series. In theory, this procedure converges toward the maximimum likelihood estimates (MLEs).

We used the IF2 algorithm9 implemented in the R10 package pomp version 2.7.1.011 to conduct the MIF procedure. To initialize MIF, we generated 32 parameter sets with values normally distributed around our prior expectations based on a literature review and line list data. We then performed two rounds of MIF, each for 150 iterations with 2,000 particles and geometric cooling. For some states, we increase the number of iterations in the first round of MIF to 350. For the first round of MIF we set cooling.factor = 0.9. For the second round of MIF, which simply continues from where the first round stopped, we set cooling.factor = 0.7. We then computed the log likelihood of the 32 final MIF parameter sets (i.e., parameter sets collected after 300 MIF iterations) as the log of the mean likelihoods of 10 replicate particle filters with 5,000 particles each. At this stage, we assume the parameter set with highest log likelihood is the MLE. We use all parameter sets with log likelihoods within 2 of the maximum when forecasting or projecting scenarios.

On subsequent days, we use values perturbed around the maximum likelihood estimates from the previous day’s fit to initialize MIF.

Model Assessment

We assess our model in two ways. First, we calculate the mean absolute error (MAE) of the model by comparing in-sample simulation trajectories to the observed data. To do so, we simulate 100 trajectories from the model from all parameter sets within 2 points of the maximum log likelihood. We then calculate the MAE for each data point across the replicate simulations and then take the average MAE over the time series. We do this independently for case and death data. We smoothed the observations using a 7-day moving average before calculating MAE.

Projection Scenarios

We use the fitted model to make projections under three scenarios. Under all scenarios, we assume that the latent trend is equal to its average value over the last thirty days. This means we are assuming that preventative measures unrelated to human mobility remain constant at the levels observed over the last 30 days.

  1. Status quo: We assume that social distancing remains at the level of the last observation at \(t = T\), meaning that all future \(\phi\) equal \(\phi(T)\).
  2. Increased social distancing: We assume that social distancing increases such that \(\phi \rightarrow 0.3\) linearly over the course of seven days and then remains at 0.3.
  3. Return to normal: We assume that social distancing decreases such that \(\phi \rightarrow 1\) linearly over the course of seven days and then remains at 1.

Each scenario is simulated 100 times for each parameter set that is within 2 points of the maximum log likelihood parameter set. Initial conditions are determined by simulating 100 trajectories over the observation period and then extracting unobservable state values from the trajectory that most closely aligns with the final observations of case and death reports. Thus, each replicate within an MLE set starts from the same initial conditions, but different MLE sets start from unique initial conditions. We make projections six weeks into the future.

Results

Latent trend and transmission over time

The estimated latent trend in transmission reduction is combined with mobility data to augment baseline tranmission over time. The estimated values of \(\beta\) for each state are informative, but to understand the tranmission rate as implemented in the model we need to look at \(\omega(t)\), which is plotted below.

Here we plot the average \(\omega\) over time.

Across all states, \(\omega\) decreased through the month of March, reaching a low point by the beginning of April. Transmission is stayed relatively constant through April and May, with a slight increasing trend ocurring at the end of May into June. See plot below that shows \(\omega(t)\) for each state (grey lines) and the overall trend (blue line).

Initial number of infectious individuals

Our model is initialized with 1 individual in the \(C_1\) compartment, the initial conditions of all other latent states were estimated. Below we plot the MLE of these latent states for each U.S. state and the relationship.

location L1_0 Ia1_0 Isd1_0 Isu1_0
Alabama 175 0 1 0
Alaska 11 0 1 16
Arizona 0 0 0 1
Arkansas 428 1 3 8
California 1 0 0 0
Colorado 2 1 2 198
Connecticut 0 2 105 0
Delaware 1 0 1 25
Florida 97 1 2 2
Georgia 29 68 0 25
Hawaii 2 8 1 3
Idaho 71 0 5 1
Illinois 0 4 0 0
Indiana 1 1 25 2
Iowa 2 1 4 24
Kansas 6 1 2 22
Kentucky 0 3 20 3
Louisiana 1 49 698 1
Maine 108 5 1 243
Maryland 47 0 2 1
Massachusetts 0 0 0 0
Michigan 3 1 656 1
Minnesota 29 1 2 2
Mississippi 150 17 1 0
Missouri 20 1 0 4
Montana 26 1 2 18
Nebraska 3 14 1 2
Nevada 2 3 2 35
New Hampshire 2 1 1 2
New Jersey 12 1 0 241
New Mexico 0 2 6 15
New York 25 2 245 1
North Carolina 3 0 2 13
North Dakota 26 2 0 1
Ohio 160 0 1 45
Oklahoma 1 5 2 22
Oregon 1 4 1 8
Pennsylvania 0 2 4 208
Rhode Island 35 1 1 0
South Carolina 51 1 0 2
South Dakota 7 10 1 7
Tennessee 21 1 1 53
Texas 106 1 1 24
Utah 1 0 2 34
Vermont 8 24 0 15
Virginia 90 1 1 0
Washington 1 1 2850 1
West Virginia 136 1 2 3
Wisconsin 0 2 192 2
Wyoming 34 0 1 1

Log likelihoods

Here we compare the log likelihood of each state’s model at the MLEs to a random walk model. In all cases, the log likelihood of the SEIR model is higher (less negative, indicating a better model) than the random walk model. All log likelihoods are compared in the plot below. Note that less negative – that is, a larger log likelihood – implies a better model.

Mean absolute errors (MAE)

Here we show the MAEs for each state for daily case reports and daily death reports.

Mean absolute scaled errors (MASE), seasonal naive model

Here we show the Mean absolute scaled error (MASE) for each state for daily case reports and daily death reports. The MASE is a ratio of MAE in the model to MAE of an in-sample naïve model (a random walk forecast from the last observation, adjusted for weekly seasonality).

Mean absolute scaled errors (MASE), non-seasonal naive model

Here we show the Mean absolute scaled error (MASE) for each state for daily case reports and daily death reports. The MASE is a ratio of MAE in the model to MAE of an in-sample naïve model (a random walk forecast from the last observation).


  1. Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia

  2. Department of Epidemiology and Biostatistics and Center for the Ecology of Infectious Diseases, University of Georgia

  3. Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia

  4. Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia

  5. Western EcoSystems Technology, Inc.

  6. Hurtado and Kirosingh 2019 (https://link.springer.com/article/10.1007/s00285-019-01412-w)

  7. King et al. 2016 (https://www.jstatsoft.org/article/view/v069i12) and King et al. 2020 (https://kingaa.github.io/pomp/)

  8. Bretó and Ionides 2011 (https://www.sciencedirect.com/science/article/pii/S0304414911001761)

  9. Ionides et al. 2015 (https://www.pnas.org/content/112/3/719)

  10. R Core Team 2019 (https://www.r-project.org/)

  11. King et al. 2016 (https://www.jstatsoft.org/article/view/v069i12) and King et al. 2020 (https://kingaa.github.io/pomp/)