All code and data needed to reproduce our results can be found in our public GitHub repository: https://github.com/CEIDatUGA/COVID-stochastic-fitting.
Here we develop a model for the transmission of SARS-CoV-2 from March 2020 through the present in all states of the United States of America. The model builds on previous work, where we developed the model for Georgia, USA only. This model is being used for inference, forecasting, and scenario analysis.
Key features of this model include:
This model contains both fixed and fitted parameter values. Fixed parameters were defined using clinical outcome reports. Fitted parameter values were found by calibration to incident case and death reports, as described in more detail below. The pandemic of SARS-CoV-2 is still evolving and information that was used in the construction of this model may be incomplete or contain errors. Accordingly, these results are preliminary, provisional, and subject to change. These results have not been peer-reviewed, but have been prepared to a professional standard with the intention of providing useful interpretation of a rapidly developing event.
We fit the model to incident case and death reports for each state in the USA, as collated by The COVID Tracking Project. We show a subset of the state-level data here for Georgia, New York, Washington, and Wyoming. We did not smooth the reported data but we did exclude all negative case and death reports.
In addition, we include a covariate that describes human mobility. These data come from Unacast. We smooth the raw data from Unacast using a spline fit, resulting in the trajectories of human movement shown below. This covariate is used to reduce baseline transmission. Here we show the mobility trends from all 50 states.
The model comprises susceptible, pre-symptomatic, asymptomatic, symptomatic, diagnosed, hospitalized, deceased, and recovered persons. The following compartments are included:
To allow more realistic distributions of movement through compartments, several of these compartments are internally split into multiple stages using the linear chain trick.6
The flow diagram for this model shown below.
We implemented the above model using the R package pomp version 2.7.1.07. All transitions among the compartments in our model were modeled using the Euler multinomial approximation (reulermultinom
in pomp syntax) given the size of the “donating” compartment and the specified or fitted rate of transition.
Note that transmission rate at time \(t\) is \(\omega(t) = \frac{\beta}{N} \psi(t) \phi(t) \Gamma(t)\), resulting the force of infection \(f(t) = \omega(t) \left(I(t) \right)\), where, for simplicity, \(I\) stands for all infectious individuals at time \(t\). \(\phi(t)\) is the human mobiltiy metric shown in the figure above. \(\psi(t)\) is the result of a latent trend process that is modeled using a fitted spline:
\[ \text{logit}\left(\psi(t)\right) = \sum_{i=1}^K g_i \xi_{i_t}, \]
where \(K\) is the number of basis functions, \(\mathbf{g}\) is a vector of spline coefficients (to be fitted), and \(\mathbf{\xi}\) is a matrix basis functions. We define the number of basis functions (\(K\)) as the number of days in the data set divided by 10 (so, one function for every 10 days). Note the logit transformation to go from the linear scale to the 0 - 1 scale. We used cubic basis functions from the pomp function pomp::bspline_basis()
.
We also included process noise (\(\Gamma(t)\)) on the rate of tranmission (\(\omega(t)\)) at each time step. \(\Gamma(t)\) is modeled as gamma-distributed white noise (temporally uncorrelated noise) with mean 1 and variance \(\sigma^2\)8, which we fit.
Interventions enter our model in three ways. The first is through the augmentation of \(\beta\) by the mobility covariate (\(\phi\)) and the latent process (\(\psi\)). The second is through an increase in the fraction of exposed individuals that are detected and flow into the \(I_{sd}\) compartments. This fraction (\(q\)) starts at a low level of 0.1 and increases sigmoidally towards a maximum of 0.4, reaching the half way point of 0.25 on the 30th day since the first case notification.
Third, the time to diagnosis (\(1/s\)) decreases sigmoidally over time from 3 days on day 0 towards a minimum of 1.5 days, reaching the half way point of 2.25 days on the 30th day since the first case notification.
Both \(q\), \(s\) follow a Hill function with Hill coefficient 1.1 and time to half maximum of 30 days.
We assume that new, daily case reports arise from a negative binomial distribution whose central tendency is captured by the flow of individuals from \(I_{sd4}\) to \(C_1\) (see flow diagram above). Let this quantity be \(C_{\text{new}}\), which accumulates over the course of one day in the simulation model and resets to zero at the end of each day (the model is simulated at a time step of 1/20 days). Similarly, we assume that new, daily death resports arise from a negative binomial distribution whose central tendency is captured by the flow of individuals from \(C_4\) to \(D\). Let this quantity be \(D_{\text{new}}\), which accumulates over the course of one day in the simulation model and resets to zero at the end of each day. Then, for both new cases and deaths, we model the observation process as:
\[ \text{cases}(t) \sim \text{NB}\left(C_{\text{new}}(t), \theta_C \right), \text{and} \\ \text{deaths}(t) \sim \text{NB}\left(D_{\text{new}}(t), \theta_D \right), \]
where \(\theta_C\) and \(\theta_D\) are the negative binomial dispersion parameters for cases and deaths, respectively. Note that \(\text{cases}(t)\) and \(\text{deaths}(t)\) are the observed number of cases or deaths reported on day t, as shown in the data figure at the top of this document.
We fit the model using Maximization by Iterated particle Filtering (MIF). Observations (daily case and death reports) were modeled as arising from a negative binomial reporting process (see above). We estimated six parameters for each state: baseline transmission rate (\(\beta\)), the time varying fraction of hospitalized cases that result in death (\(m(t)\)), a parameter accounting for extra-demographic process noise (\(\sigma\)), two negative binomial dispersion parameters (\(\theta_c\) and \(\theta_d\)), and the initial size of the latent infectious class (\(L_{1}(t=1)\). For the fraction of hospitalized cases that result in death \(m(t)\) we estimate the minimum and maximum values of a sigmoid function and the time to half of maximum. The start date for model simulation and estimation was defined as the data of the first case report in each state. In addition, we estimated the basis function parameters (\(g_i\)), whose number depend on the length of the time series at the time of fitting.
All other parameters were fixed at the following values:
Parameter | Description | value |
---|---|---|
beta_s | Transmission rate | 1.04e-06 |
frac_trans_e | Relative transmissibility of latent infections | 0.119 |
frac_trans_a | Relative transmissibility of asymptomatic individuals | 0.5 |
frac_trans_c | Relative transmissibility of detected symptomatic individuals post-reporting | 0.269 |
frac_trans_h | Relative transmissibility of hospitalized individuals | 4.54e-05 |
time_e | Time spent in latent compartments (days) | 4 |
time_a | Time spent in asymptomatic compartments (days) | 3.5 |
time_su | Time spent in symptomatic, undetected compartments (days) | 6 |
time_sd | Time spent in symptomatic, detected compartments (days) | 3 |
time_c | Time spent in diagnosed cases compartments (days) | 3 |
time_h | Time spent in hospitalized compartments (days) | 6 |
max_diag_factor | Maximum for factor by which movement through Isd happens faster (quicker diagnosis) | 2 |
diag_rampup | Rate at which faster diagnosis ramps up to max | 1.1 |
t_half_diag | Time at which diagnosis is at 50% of max (in days since t = 1) | 30 |
max_detect_frac | Maximum fraction of cases that are detected | 0.4 |
detect_rampup | Speed at which fraction detected ramps up | 1.1 |
t_half_detect | Time at which infection detection fraction is at 50% of max (in days since t = 1) | 30 |
base_detect_frac | Minimum fraction detected at t = 1 | 0.1 |
frac_asym | Fraction of latent infections that move to asymptomatic | 0.182 |
frac_hosp | Fraction of detected cases that are hospitalized | 0.119 |
C1_0 | Initial number of diagnosed cases on t = 1 | 1 |
H1_0 | Initial number of hospitalized cases on t = 1 | 0 |
R_0 | Initial number of recovered individuals on t = 1 | 0 |
D_0 | Initial number of deaths after hospitalization on t = 1 | 0 |
Initial size of the susceptible compartment was set as each state’s population size minus the number of individuals in other the other compartments that we fix. It is true that the initial size of the susceptible pool will also decrease based on the number of inidividuals estimated to be in the latent infections compartment at \(t = 1\). However, given the small size of \(L_1(t = 1)\) relative to total population size and the fact that total population size is a point estimate with error, we assume that our simple approach of setting \(S(t = 1)\) to each state’s population size is valid. For each state, we consider \(t=1\) as the date on which the first case is reported in that state.
MIF relies on particle filtering, which estimates the likelihood of fixed parameters by integrating state variables of a stochastic system. To identify the maximum likelihood estimates, MIF lets parameters take a random walk during the filtering process and selectively propagates forward parameter sets (i.e., particles) with the highest likelihood. The variance of the random walk decreases at each iteration of MIF, where a MIF iteration means one filtering pass through the time series. In theory, this procedure converges toward the maximimum likelihood estimates (MLEs).
We used the IF2 algorithm9 implemented in the R10 package pomp version 2.7.1.011 to conduct the MIF procedure. To initialize MIF, we generated 32 parameter sets with values normally distributed around our prior expectations based on a literature review and line list data. We then performed two rounds of MIF, each for 150 iterations with 2,000 particles and geometric cooling. For some states, we increase the number of iterations in the first round of MIF to 350. For the first round of MIF we set cooling.factor = 0.9
. For the second round of MIF, which simply continues from where the first round stopped, we set cooling.factor = 0.7
. We then computed the log likelihood of the 32 final MIF parameter sets (i.e., parameter sets collected after 300 MIF iterations) as the log of the mean likelihoods of 10 replicate particle filters with 5,000 particles each. At this stage, we assume the parameter set with highest log likelihood is the MLE. We use all parameter sets with log likelihoods within 2 of the maximum when forecasting or projecting scenarios.
On subsequent days, we use values perturbed around the maximum likelihood estimates from the previous day’s fit to initialize MIF.
We assess our model in two ways. First, we calculate the mean absolute error (MAE) of the model by comparing in-sample simulation trajectories to the observed data. To do so, we simulate 100 trajectories from the model from all parameter sets within 2 points of the maximum log likelihood. We then calculate the MAE for each data point across the replicate simulations and then take the average MAE over the time series. We do this independently for case and death data. We smoothed the observations using a 7-day moving average before calculating MAE.
We use the fitted model to make projections under three scenarios. Under all scenarios, we assume that the latent trend is equal to its average value over the last thirty days. This means we are assuming that preventative measures unrelated to human mobility remain constant at the levels observed over the last 30 days.
Each scenario is simulated 100 times for each parameter set that is within 2 points of the maximum log likelihood parameter set. Initial conditions are determined by simulating 100 trajectories over the observation period and then extracting unobservable state values from the trajectory that most closely aligns with the final observations of case and death reports. Thus, each replicate within an MLE set starts from the same initial conditions, but different MLE sets start from unique initial conditions. We make projections six weeks into the future.
The estimated latent trend in transmission reduction is combined with mobility data to augment baseline tranmission over time. The estimated values of \(\beta\) for each state are informative, but to understand the tranmission rate as implemented in the model we need to look at \(\omega(t)\), which is plotted below.
Here we plot the average \(\omega\) over time.
Across all states, \(\omega\) decreased through the month of March, reaching a low point by the beginning of April. Transmission is stayed relatively constant through April and May, with a slight increasing trend ocurring at the end of May into June. See plot below that shows \(\omega(t)\) for each state (grey lines) and the overall trend (blue line).
Our model is initialized with 1 individual in the \(C_1\) compartment, the initial conditions of all other latent states were estimated. Below we plot the MLE of these latent states for each U.S. state and the relationship.
location | L1_0 | Ia1_0 | Isd1_0 | Isu1_0 |
---|---|---|---|---|
Alabama | 175 | 0 | 1 | 0 |
Alaska | 11 | 0 | 1 | 16 |
Arizona | 0 | 0 | 0 | 1 |
Arkansas | 428 | 1 | 3 | 8 |
California | 1 | 0 | 0 | 0 |
Colorado | 2 | 1 | 2 | 198 |
Connecticut | 0 | 2 | 105 | 0 |
Delaware | 1 | 0 | 1 | 25 |
Florida | 97 | 1 | 2 | 2 |
Georgia | 29 | 68 | 0 | 25 |
Hawaii | 2 | 8 | 1 | 3 |
Idaho | 71 | 0 | 5 | 1 |
Illinois | 0 | 4 | 0 | 0 |
Indiana | 1 | 1 | 25 | 2 |
Iowa | 2 | 1 | 4 | 24 |
Kansas | 6 | 1 | 2 | 22 |
Kentucky | 0 | 3 | 20 | 3 |
Louisiana | 1 | 49 | 698 | 1 |
Maine | 108 | 5 | 1 | 243 |
Maryland | 47 | 0 | 2 | 1 |
Massachusetts | 0 | 0 | 0 | 0 |
Michigan | 3 | 1 | 656 | 1 |
Minnesota | 29 | 1 | 2 | 2 |
Mississippi | 150 | 17 | 1 | 0 |
Missouri | 20 | 1 | 0 | 4 |
Montana | 26 | 1 | 2 | 18 |
Nebraska | 3 | 14 | 1 | 2 |
Nevada | 2 | 3 | 2 | 35 |
New Hampshire | 2 | 1 | 1 | 2 |
New Jersey | 12 | 1 | 0 | 241 |
New Mexico | 0 | 2 | 6 | 15 |
New York | 25 | 2 | 245 | 1 |
North Carolina | 3 | 0 | 2 | 13 |
North Dakota | 26 | 2 | 0 | 1 |
Ohio | 160 | 0 | 1 | 45 |
Oklahoma | 1 | 5 | 2 | 22 |
Oregon | 1 | 4 | 1 | 8 |
Pennsylvania | 0 | 2 | 4 | 208 |
Rhode Island | 35 | 1 | 1 | 0 |
South Carolina | 51 | 1 | 0 | 2 |
South Dakota | 7 | 10 | 1 | 7 |
Tennessee | 21 | 1 | 1 | 53 |
Texas | 106 | 1 | 1 | 24 |
Utah | 1 | 0 | 2 | 34 |
Vermont | 8 | 24 | 0 | 15 |
Virginia | 90 | 1 | 1 | 0 |
Washington | 1 | 1 | 2850 | 1 |
West Virginia | 136 | 1 | 2 | 3 |
Wisconsin | 0 | 2 | 192 | 2 |
Wyoming | 34 | 0 | 1 | 1 |
Here we compare the log likelihood of each state’s model at the MLEs to a random walk model. In all cases, the log likelihood of the SEIR model is higher (less negative, indicating a better model) than the random walk model. All log likelihoods are compared in the plot below. Note that less negative – that is, a larger log likelihood – implies a better model.
Here we show the MAEs for each state for daily case reports and daily death reports.
Here we show the Mean absolute scaled error (MASE) for each state for daily case reports and daily death reports. The MASE is a ratio of MAE in the model to MAE of an in-sample naïve model (a random walk forecast from the last observation, adjusted for weekly seasonality).
Here we show the Mean absolute scaled error (MASE) for each state for daily case reports and daily death reports. The MASE is a ratio of MAE in the model to MAE of an in-sample naïve model (a random walk forecast from the last observation).
Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia↩
Department of Epidemiology and Biostatistics and Center for the Ecology of Infectious Diseases, University of Georgia↩
Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia↩
Odum School of Ecology and Center for the Ecology of Infectious Diseases, University of Georgia↩
Western EcoSystems Technology, Inc.↩
Hurtado and Kirosingh 2019 (https://link.springer.com/article/10.1007/s00285-019-01412-w)↩
King et al. 2016 (https://www.jstatsoft.org/article/view/v069i12) and King et al. 2020 (https://kingaa.github.io/pomp/)↩
Bretó and Ionides 2011 (https://www.sciencedirect.com/science/article/pii/S0304414911001761)↩
Ionides et al. 2015 (https://www.pnas.org/content/112/3/719)↩
R Core Team 2019 (https://www.r-project.org/)↩
King et al. 2016 (https://www.jstatsoft.org/article/view/v069i12) and King et al. 2020 (https://kingaa.github.io/pomp/)↩